From patchwork Wed Dec 21 22:24:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35531 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4944872dye; Wed, 21 Dec 2022 14:25:15 -0800 (PST) X-Google-Smtp-Source: AMrXdXsfJ6jRItYoToar1zeHY+rDnL5iqocSAMb294kIOH6syFCBKeGB93dq/Y1V64M8Wpt/DDPu X-Received: by 2002:a05:6402:3983:b0:475:c640:ddd2 with SMTP id fk3-20020a056402398300b00475c640ddd2mr3304387edb.26.1671661515330; Wed, 21 Dec 2022 14:25:15 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661515; cv=none; d=google.com; s=arc-20160816; b=lasC0Z6rtSaO7wl6VrHYwcEDLzXKpOm1X4DyoZgnQXSyJQFrW+TykCplhJ/6pdtpDy 2OggpzyLuW002OZqwNONl4fNV7ks8jSJLPsrrv/RWyz6Z0+o7ZtUEGVIegcVg8SkZFW0 h2aW5lDNQFZJl2F6GyLSLLS+OD8QSM5hlQvzO1ELS6fRqo9xpUiOQZY+EZgjsrzYDlGb b5+0ZybcvnugcTsC/7H/awE0IQm6S5Lkm55MkwlKV+dqKpARiKEflWtkT/L/Zw1Ljtsg 3/FSLQHYb5V/1Y5DPesDPzc7eV9Kv0fHK7rQgWSpq7JG+S/dpljNYVNwMRmfOdb+b54G Bu5g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=aKlgYMYIrlGLbMYP963Rt32Noqo2pCtoxK8D34uyWJ4=; b=E2DKCIDo6ob2ngXw66NLo3kdvGRBgbFYZenNIhMaFIUydpJ9IL969JFNiMn15UtXlz FD6FyHEOxk2gMeByPf2QcIB8VBRdmHFMowI3kapT5csh6Mu+kaifA7+YgURU3Gu146NV Y1TQYSZ271Jyi81+S72Tl7N/i3t8qPm2v7yWw0EFapraKni+BaU/6o3D3jYdQ55h8hr1 fPaHTn7tzCgN3VYmbbaSdjIhgMwpcMffDqe4T8rjyKd23QGry1NeTYOpyDpm7KMCtqSu JuWzKvR4+3Ysf+kL3LIn5mgX5Ol+aOIyNA7DTKllOB6nZ6A72M9rFciSeTiuzTmxwfdQ 2cVg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=PWqhNCgk; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ta9-20020a1709078c0900b007ac60b83407si13794046ejc.725.2022.12.21.14.24.49; Wed, 21 Dec 2022 14:25:15 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=PWqhNCgk; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234945AbiLUWYa (ORCPT + 99 others); Wed, 21 Dec 2022 17:24:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36132 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234897AbiLUWYY (ORCPT ); Wed, 21 Dec 2022 17:24:24 -0500 Received: from mail-pf1-x44a.google.com (mail-pf1-x44a.google.com [IPv6:2607:f8b0:4864:20::44a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1961A27164 for ; Wed, 21 Dec 2022 14:24:24 -0800 (PST) Received: by mail-pf1-x44a.google.com with SMTP id k22-20020aa79736000000b0057f3577fdbaso475pfg.8 for ; Wed, 21 Dec 2022 14:24:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=aKlgYMYIrlGLbMYP963Rt32Noqo2pCtoxK8D34uyWJ4=; b=PWqhNCgkLNsL4LTq6n3P2N/wL2IC5fNtKlHcljZPjv1KMJIXNnu+jM/wiOaUfvSQzQ sqxmogZSTOlLja/osRpS7Ow8EfzSUYnreoaLWkkFdbixfiBmjccx0+KswydFi9OPd7G+ ufafOm2aNUYdbkAWCv2FGT5m/chD2Utam3lq9+XJO5/Q1b+hNSmrMeQga388z7ynT7v8 NX7DKNfpSeQMnVeRYugzoXNHWHSvxsw+I6ma6X+E6cBtDFf2MAyzy1YNcTGk8tmCzLNd ney/iJw+QhWYPcoct/qe9KDj8ngJbHLXsqU23YGjejcPJ/Eqxv/IM84tu17ag52e0e8x 5slQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=aKlgYMYIrlGLbMYP963Rt32Noqo2pCtoxK8D34uyWJ4=; b=jeRoZiXDXqNratWFstsbBBv3NvCxI8PDSbm6Xyy9Vl/rIiLhWN2Hea9a1+bjXaHLBv xygDM2wUjMhdPc1zaOxaGOl9UZIUoEZsS6sHMlByDPc7esREnc6o7dRRrYldTfT3vBcy ydq58iYtAMt+7qsR9UQHhJhBhOyA8FKmuWFI967Qiai49ze3L4ktbOO2OfL6Jq7i1F60 pQpHml0qOOBjRsrsyEq/TP+dwbRRMW/RnKPSc5228l0A5DFbVQH2F5Q4mI4FcAqtlBoc nHPq7uGL/Zo/A7M3czS8zrIyR2YtkPfDEWIHiBkNKO2WJ51/Jtbf4jFZLgzqPelkdO4+ FdqQ== X-Gm-Message-State: AFqh2kpLpnKK8+mRXk7LERwvjers7Fk+O+kT06HujJhyWtZDiKxXDTbO xVXRCbRef36i3eWZ4Xg2SmZcWp/x00R8hpRe8ZteC7D8JIzGgfXBNPZKdIkTx/gp+dior+qgKFn 8jsiZaRCBcZCGtyfrAbaU60dYjbPDGQtumNxZot+lg6iUnIUKvJ0yeW4D3/422x64iis3gr3L X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a17:90a:6845:b0:219:2bc2:f71e with SMTP id e5-20020a17090a684500b002192bc2f71emr274766pjm.142.1671661463529; Wed, 21 Dec 2022 14:24:23 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:05 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-2-bgardon@google.com> Subject: [RFC 01/14] KVM: x86/MMU: Add shadow_mmu.(c|h) From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864145471442438?= X-GMAIL-MSGID: =?utf-8?q?1752864145471442438?= As a first step to splitting the Shadow MMU out of KVM MMU common code, add separate files for it with some of the boilerplate and includes the Shadow MMU will need. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/mmu/mmu.c | 1 + arch/x86/kvm/mmu/shadow_mmu.c | 21 +++++++++++++++++++++ arch/x86/kvm/mmu/shadow_mmu.h | 8 ++++++++ 4 files changed, 31 insertions(+), 1 deletion(-) create mode 100644 arch/x86/kvm/mmu/shadow_mmu.c create mode 100644 arch/x86/kvm/mmu/shadow_mmu.h diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index 80e3fe184d17..d6e94660b006 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -12,7 +12,7 @@ include $(srctree)/virt/kvm/Makefile.kvm kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \ i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \ hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \ - mmu/spte.o + mmu/spte.o mmu/shadow_mmu.o ifdef CONFIG_HYPERV kvm-y += kvm_onhyperv.o diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4736d7849c60..07b99a7ce830 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -20,6 +20,7 @@ #include "mmu.h" #include "mmu_internal.h" #include "tdp_mmu.h" +#include "shadow_mmu.h" #include "x86.h" #include "kvm_cache_regs.h" #include "smm.h" diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c new file mode 100644 index 000000000000..7bce5ec52b2e --- /dev/null +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -0,0 +1,21 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * KVM Shadow MMU + * + * This file implements the Shadow MMU: the KVM MMU implementation which has + * developed organically from hardware which did not have second level paging, + * and so used "shadow paging" to virtualize guest memory. The Shadow MMU is + * an alternative to the TDP MMU which only supports hardware with Two + * Dimentional Paging. (e.g. EPT on Intel or NPT on AMD CPUs.) Note that the + * Shadow MMU also supports TDP, it's just less scalable. The Shadow and TDP + * MMUs can cooperate to support nested virtualization on hardware with TDP. + */ +#include "mmu.h" +#include "mmu_internal.h" +#include "mmutrace.h" +#include "shadow_mmu.h" +#include "spte.h" + +#include +#include +#include diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h new file mode 100644 index 000000000000..719b10f6c403 --- /dev/null +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __KVM_X86_MMU_SHADOW_MMU_H +#define __KVM_X86_MMU_SHADOW_MMU_H + +#include + +#endif /* __KVM_X86_MMU_SHADOW_MMU_H */ From patchwork Wed Dec 21 22:24:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35532 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4944995dye; Wed, 21 Dec 2022 14:25:35 -0800 (PST) X-Google-Smtp-Source: AMrXdXv1QtOdpEryuk4iYQ7ItRI5ZWMez48RYQpa8AngzJZ6AM41O9MlzvlF9+/3A3pM7sUt15L1 X-Received: by 2002:a05:6402:e09:b0:46c:b919:997f with SMTP id h9-20020a0564020e0900b0046cb919997fmr2973804edh.17.1671661535655; Wed, 21 Dec 2022 14:25:35 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661535; cv=none; d=google.com; s=arc-20160816; b=Y5t/u29SnM4a1Z1Toe49MOd0HCCH4aosXMNY+pz+zVP8tMLk0Y7ixxpOtq+x6ecmMv ERYa0O48/pSMAHyj22dguzaLdxIA/h57aq5p/E0WUMvoajrFC1hcwhz+pHNXzKGE/oLU BF3OHi3yecBGfZbBbW1XkY/ewbDQAtxbQCFAk2bby9Dn53KDetN26ddA+9LPxJAnjPbX iT0GpeiHRPbbh+rSiuMKm4lKl7LGsHydsqjjak1fSfbMSYoir0d56pOBubiyI3FJNHhs 4hW96fvEZwjr2LH2/zAbooRwatzJ0jR1AgXN/jwsIFcV5Nmd2Ra2ltyv/t0i9T36qQYN +fJg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=mr/upkq9nUstxPAXFOYz+TIWTjnNhJ7646Ga8zDx7MA=; b=bmoC3Qv9D0qoLnj/KS4ySPZnPGVfCubbjz5n0Ja11yyEkvaVx9uZj+ZX3VLQMD8S+R nzjqnAyJeyg3u0ZhD0Uz+iUyu4koWhfh77xQZ/xzkWePHnzwNjCB1MhWsmxfAUFHPmqW XNThBv9pzQSG2+U1XTJQrv2HoZXrx0We7jGiIwkWe2mYGyxgzlY5aBiIWI6MBGONgmeC iKXh4Mx+4jgMwae2ov0WrSCxbZ3pFPCyXyh+wMaHIvP8q/4IDwxh98aw+M+OGuwkmTdX hAcY0WWmR3+BsWQHYoecdlfnE5LsSPYLIQjlp0OZ6ec2GwA5rcNEL5BZBUpm6/UIaeBt WZKQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=jcxCfXrS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id n8-20020aa7db48000000b0047d8042b2desi2826534edt.633.2022.12.21.14.25.11; Wed, 21 Dec 2022 14:25:35 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=jcxCfXrS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229601AbiLUWYf (ORCPT + 99 others); Wed, 21 Dec 2022 17:24:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36150 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234933AbiLUWY1 (ORCPT ); Wed, 21 Dec 2022 17:24:27 -0500 Received: from mail-pg1-x549.google.com (mail-pg1-x549.google.com [IPv6:2607:f8b0:4864:20::549]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 039662715F for ; Wed, 21 Dec 2022 14:24:26 -0800 (PST) Received: by mail-pg1-x549.google.com with SMTP id a33-20020a630b61000000b00429d91cc649so107377pgl.8 for ; Wed, 21 Dec 2022 14:24:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=mr/upkq9nUstxPAXFOYz+TIWTjnNhJ7646Ga8zDx7MA=; b=jcxCfXrS0rhqjdrrTg9RFmHommcZLL9JJIOJmSEjlqGB6hsqaBvlotnqSYyzuM2o1r oyxgX7LusyGPkW4t09A2Ps3Tbd/T0zkw/iHV6uV3gwWeBDbz3xZpksS8uIEZQReN8k4D gTZ/kgRejuP4J7gYltEmSRz4GYSehjtuE/FLqamWE/P7u15/c9ZToC0iogQQIP8Pejds ii5k/zaJDE/8yX30EqZr7Fdi6ezF9uKNfAO4Wkpzv0osK8ojhKjyjCCtLfLzO/HKX78u wOc1P5laTqWlvK39uvp0tnnasQLlOXiwegfhLRQh58XXgN51EcR94i5Ib2o2lx+A3Bi6 sIuw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=mr/upkq9nUstxPAXFOYz+TIWTjnNhJ7646Ga8zDx7MA=; b=bLN0w8LmEg+rNDStTyGtSBR/BVpNUpI9EdWGAy7q6bBMZQiISzcWAo/gLqUlKfwnis BBqwVctXU6hyXBXEi3n7K1ekNp26CuIQfsmwh8bIJ2G8RGY/MZpiRiD15e+1LEfbXOvZ 5bC7bqPfVzZ2Xtugh2vQXYatAdX8hmoWyil0kZTS0sNDTqg83o+dg8yaey3DmCcZei3D ot++SpKFu7oyS86FAfNNkHS6/CfZ+viqcCPHq95F8LuzOcwtP19i5rT07TAh7RhU7Z5t lLAt708X3KUUAGEN4j2kZYBpYys8ZfkDomfLzmb1GHcbxwbvceVxEmkP4o+mrbmFBEyI Y7IQ== X-Gm-Message-State: AFqh2kow6gvb8a5BQPOaHAzQbq+Kz1yRBbn+mhppR7gIheQUppkFtCiK fKzzii1N9bRGj7RR8KjFZgm/DJmSTDM8ZGG7oEg5EJyDauIS6n4bcMDaqR03ZI0gt2U0Yee0xHq RZKmZ17O5KZgZ8JVYIXM+Z/uHjtKfAptQ4qLHIdWm253WkiXXNGucWnq6XhLoYkotjE2IG094 X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a62:2781:0:b0:57a:6e2a:c236 with SMTP id n123-20020a622781000000b0057a6e2ac236mr218098pfn.82.1671661465306; Wed, 21 Dec 2022 14:24:25 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:06 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-3-bgardon@google.com> Subject: [RFC 02/14] KVM: x86/MMU: Expose functions for the Shadow MMU From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864166450940373?= X-GMAIL-MSGID: =?utf-8?q?1752864166450940373?= Expose various common MMU functions which the Shadow MMU will need via mmu_internal.h. This just slightly reduces the work needed to move the shadow MMU code out of mmu.c, which will already be a massive change. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++------------------- arch/x86/kvm/mmu/mmu_internal.h | 24 +++++++++++++++++++ 2 files changed, 41 insertions(+), 24 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 07b99a7ce830..729a2799d4d7 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -156,9 +156,9 @@ struct kvm_shadow_walk_iterator { ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \ __shadow_walk_next(&(_walker), spte)) -static struct kmem_cache *pte_list_desc_cache; +struct kmem_cache *pte_list_desc_cache; struct kmem_cache *mmu_page_header_cache; -static struct percpu_counter kvm_total_used_mmu_pages; +struct percpu_counter kvm_total_used_mmu_pages; static void mmu_spte_set(u64 *sptep, u64 spte); @@ -234,11 +234,6 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu) return regs; } -static inline bool kvm_available_flush_tlb_with_range(void) -{ - return kvm_x86_ops.tlb_remote_flush_with_range; -} - static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm, struct kvm_tlb_range *range) { @@ -262,8 +257,8 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, kvm_flush_remote_tlbs_with_range(kvm, &range); } -static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, - unsigned int access) +void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, + unsigned int access) { u64 spte = make_mmio_spte(vcpu, gfn, access); @@ -610,7 +605,7 @@ static bool mmu_spte_age(u64 *sptep) return true; } -static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu) +void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu) { if (is_tdp_mmu(vcpu->arch.mmu)) { kvm_tdp_mmu_walk_lockless_begin(); @@ -629,7 +624,7 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu) } } -static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) +void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) { if (is_tdp_mmu(vcpu->arch.mmu)) { kvm_tdp_mmu_walk_lockless_end(); @@ -822,8 +817,8 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) &kvm->arch.possible_nx_huge_pages); } -static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp, - bool nx_huge_page_possible) +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp, + bool nx_huge_page_possible) { sp->nx_huge_page_disallowed = true; @@ -857,16 +852,15 @@ void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) list_del_init(&sp->possible_nx_huge_page_link); } -static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) { sp->nx_huge_page_disallowed = false; untrack_possible_nx_huge_page(kvm, sp); } -static struct kvm_memory_slot * -gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn, - bool no_dirty_log) +struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, + gfn_t gfn, bool no_dirty_log) { struct kvm_memory_slot *slot; @@ -1403,7 +1397,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm, return write_protected; } -static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn) +bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn) { struct kvm_memory_slot *slot; @@ -1902,9 +1896,8 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, return ret; } -static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, - struct list_head *invalid_list, - bool remote_flush) +bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list, + bool remote_flush) { if (!remote_flush && list_empty(invalid_list)) return false; @@ -1916,7 +1909,7 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, return true; } -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) +bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) { if (sp->role.invalid) return true; @@ -6148,7 +6141,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min) return kvm_mmu_memory_cache_nr_free_objects(cache) < min; } -static bool need_topup_split_caches_or_resched(struct kvm *kvm) +bool need_topup_split_caches_or_resched(struct kvm *kvm) { if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) return true; @@ -6163,7 +6156,7 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm) need_topup(&kvm->arch.split_shadow_page_cache, 1); } -static int topup_split_caches(struct kvm *kvm) +int topup_split_caches(struct kvm *kvm) { /* * Allocating rmap list entries when splitting huge pages for nested diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index dbaf6755c5a7..856e2e0a8420 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -131,7 +131,9 @@ struct kvm_mmu_page { #endif }; +extern struct kmem_cache *pte_list_desc_cache; extern struct kmem_cache *mmu_page_header_cache; +extern struct percpu_counter kvm_total_used_mmu_pages; static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role) { @@ -317,6 +319,28 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp, + bool nx_huge_page_possible); void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp); +static inline bool kvm_available_flush_tlb_with_range(void) +{ + return kvm_x86_ops.tlb_remote_flush_with_range; +} + +void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, + unsigned int access); +struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, + gfn_t gfn, bool no_dirty_log); +bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn); +bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list, + bool remote_flush); +bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp); + +void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu); +void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu); + +bool need_topup_split_caches_or_resched(struct kvm *kvm); +int topup_split_caches(struct kvm *kvm); #endif /* __KVM_X86_MMU_INTERNAL_H */ From patchwork Wed Dec 21 22:24:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35533 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945019dye; Wed, 21 Dec 2022 14:25:40 -0800 (PST) X-Google-Smtp-Source: AMrXdXvSNpcjdRhowhS6eH1W4s2SnN+S0W5zHBOyIv+06pL2ZYMTWYO+fiN+//9MepygZksLBHr7 X-Received: by 2002:a17:907:7ea1:b0:7c4:f752:e95c with SMTP id qb33-20020a1709077ea100b007c4f752e95cmr3442975ejc.1.1671661540472; Wed, 21 Dec 2022 14:25:40 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661540; cv=none; d=google.com; s=arc-20160816; b=Z0I0Vqwbfj6Hiky4fcJNuaEJwmz9v9zCxOoSvYT/lg+nXHIAJbToKHDXu3X3adEgyW U3NfCetAezGUQN7V0fUhECIRGmDTuiDktoh3eWc8pDRBI7l0TUCGpqqOPsiWTqkfV+07 J0Vf3Rmej994wK34omfa5g3I2izQzG0rw87wJ7QRfeK8h1jI4F5ZxeixJzYZHQAXBe1W qkxu7xaFgw8nBU9ZR07qQrl2MC1EsnRybjrmTIC6XJ7grySxogf5MtxYdZxd/TuUelwE 4GFedQz1oaDhoOiApPJvAlqLDbLTunN/++weqJVJZzTDJv+LeeGPjNCOJ1aeT1xaf3pg 1T5g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=jqx+xXNPXyPMfyXZCg/z/s6ey1PJTXuwDa7ZZcqNa58=; b=i09x6x+kxI9uxXc1L77KPm00DMqrN086lTUjbro0UCZzQNO0cZp935aGd0aPPFpFeF zy7Cz3VVI9hs0zKU4o3EsrdD6HlDcYxoqPDyEBU39svDjhpZmbDVUW3CQ7llxkrdbG8h Rmz5yaZVwnu9274cQYw8GBFg9IzCXuTaOz4J0CWloeQAnWkkFkdYnVRcBKodNAa8QEtn NXt26lvCE/sK3j6j+AahSvWfKuILEuz61fXQZ5KtFfaAFm4WGld36BGmDVkN6H0X2sPb /3YESpu4ODMpcix7EgwAUN4t72u8KWFjT2ARmA3FcM1TVeqqnu07XyUHhpkYR/U9waxA 5h/w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=FN9cvFaF; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id xb3-20020a170907070300b007c17efb64fcsi14007850ejb.639.2022.12.21.14.25.16; Wed, 21 Dec 2022 14:25:40 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=FN9cvFaF; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234956AbiLUWYj (ORCPT + 99 others); Wed, 21 Dec 2022 17:24:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36212 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234946AbiLUWYb (ORCPT ); Wed, 21 Dec 2022 17:24:31 -0500 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DC4A52715F for ; Wed, 21 Dec 2022 14:24:27 -0800 (PST) Received: by mail-pj1-x1049.google.com with SMTP id pm5-20020a17090b3c4500b00219864a46f0so7490807pjb.7 for ; Wed, 21 Dec 2022 14:24:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=jqx+xXNPXyPMfyXZCg/z/s6ey1PJTXuwDa7ZZcqNa58=; b=FN9cvFaFQ0lDrnMfZlJ4Nw5QOVw0uGnDqGaQknumVRuppyrWoDHSBVuz6NL4ToY0RS 9vOuxkJ3/AofaX1iRjfpMBpMgUldPva9VTuxI+GzgvG3hUFpmFSghYUy2sJIfS9ttwfI XGhvxESKye6WvJLOGBgDYOxoD6CLBGJe+6xT4rDQG52AH4WnFl9PGc3AFw28vZayBBBn j9feajSJ2epeUE/GkU2rquziJiIxC8vdf+fWfieOmBNDpJ3ddN06FuqeOipY9Sg3aDoz XwJks6viAmjyebvL5gq7BHgmgh1daETNoQf/LVxOK/ho3WPukaLII5DxvdsU1/MnZ2Ma QC/w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=jqx+xXNPXyPMfyXZCg/z/s6ey1PJTXuwDa7ZZcqNa58=; b=Eg8qZe86sz82txRsI9fFukjQmWUn+Ebs5pt4cy6fSXju7FBeGRMJB919xa7m/dOy/M kN6aDfaNjxrcALC4fXwDGq2RUiTBl97mgtBHvTe3SVeZsA8oy6qOc1AS7Uj7siG0oX0R m5v/HXG9j8rwghbw6Btsbowxxghtu6dF/LG2RinX42B5c8gdN/GkeD+FIyV6qT6w/khN udsyWxxqMd4TWB3y4iVGASqcAN+uooGwQ24hKDRE7iPs3S1Kgie6E8aVU8dKID+7D3P2 9+8INMZof+lV57NIDSZtoPzahU9F5IVPbtN27jYo73YvHjeVIqaNXML0Ttyf2h5V3kKE VZjg== X-Gm-Message-State: AFqh2kr82DhwZbCsYZkcGJHdprSiQiaHBrNQfM5X+U5zQmWGsdjzB4hD hWoyTTQLiXJa6AHTX2HzhS0JEelcKvtUNGKRfWq/b0sxB1EUUQozo9pic0kmdu0+wl/Yw+3LWBl pLGMM6Z7E2D20dPyf0tx1bJnlFBI6ox4i0CoW6GzyPXIgPHxTl0zpmBfSuO4w/ilniASghnrW X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a17:90a:4b03:b0:224:c5cb:ce1b with SMTP id g3-20020a17090a4b0300b00224c5cbce1bmr308212pjh.209.1671661467110; Wed, 21 Dec 2022 14:24:27 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:07 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-4-bgardon@google.com> Subject: [RFC 03/14] KVM: x86/MMU: Move the Shadow MMU implementation to shadow_mmu.c From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864171713612617?= X-GMAIL-MSGID: =?utf-8?q?1752864171713612617?= Cut and paste the implementation of the Shadow MMU to shadow_mmu.(c|h). This is a monsterously large commit, moving ~3500 lines. With such a large move, there's no way to make it easy. Do the move in one massive step to simplify dealing with merge conflicts and to make the git history a little easier to dig through. Several cleanup commits follow this one rather than preceed it so that their git history will remain easy to see. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/debugfs.c | 1 + arch/x86/kvm/mmu/mmu.c | 4526 ++++--------------------------- arch/x86/kvm/mmu/mmu_internal.h | 4 +- arch/x86/kvm/mmu/shadow_mmu.c | 3408 +++++++++++++++++++++++ arch/x86/kvm/mmu/shadow_mmu.h | 145 + 5 files changed, 4086 insertions(+), 3998 deletions(-) diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c index c1390357126a..e304243d2041 100644 --- a/arch/x86/kvm/debugfs.c +++ b/arch/x86/kvm/debugfs.c @@ -9,6 +9,7 @@ #include "lapic.h" #include "mmu.h" #include "mmu/mmu_internal.h" +#include "mmu/shadow_mmu.h" static int vcpu_get_timer_advance_ns(void *data, u64 *val) { diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 729a2799d4d7..bf14e181eb12 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -109,59 +109,12 @@ bool dbg = 0; module_param(dbg, bool, 0644); #endif -#define PTE_PREFETCH_NUM 8 - #include -/* make pte_list_desc fit well in cache lines */ -#define PTE_LIST_EXT 14 - -/* - * Slight optimization of cacheline layout, by putting `more' and `spte_count' - * at the start; then accessing it will only use one single cacheline for - * either full (entries==PTE_LIST_EXT) case or entries<=6. - */ -struct pte_list_desc { - struct pte_list_desc *more; - /* - * Stores number of entries stored in the pte_list_desc. No need to be - * u64 but just for easier alignment. When PTE_LIST_EXT, means full. - */ - u64 spte_count; - u64 *sptes[PTE_LIST_EXT]; -}; - -struct kvm_shadow_walk_iterator { - u64 addr; - hpa_t shadow_addr; - u64 *sptep; - int level; - unsigned index; -}; - -#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \ - for (shadow_walk_init_using_root(&(_walker), (_vcpu), \ - (_root), (_addr)); \ - shadow_walk_okay(&(_walker)); \ - shadow_walk_next(&(_walker))) - -#define for_each_shadow_entry(_vcpu, _addr, _walker) \ - for (shadow_walk_init(&(_walker), _vcpu, _addr); \ - shadow_walk_okay(&(_walker)); \ - shadow_walk_next(&(_walker))) - -#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte) \ - for (shadow_walk_init(&(_walker), _vcpu, _addr); \ - shadow_walk_okay(&(_walker)) && \ - ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \ - __shadow_walk_next(&(_walker), spte)) - struct kmem_cache *pte_list_desc_cache; struct kmem_cache *mmu_page_header_cache; struct percpu_counter kvm_total_used_mmu_pages; -static void mmu_spte_set(u64 *sptep, u64 spte); - struct kvm_mmu_role_regs { const unsigned long cr0; const unsigned long cr4; @@ -257,15 +210,6 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, kvm_flush_remote_tlbs_with_range(kvm, &range); } -void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, - unsigned int access) -{ - u64 spte = make_mmio_spte(vcpu, gfn, access); - - trace_mark_mmio_spte(sptep, gfn, spte); - mmu_spte_set(sptep, spte); -} - static gfn_t get_mmio_spte_gfn(u64 spte) { u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask; @@ -301,310 +245,6 @@ static int is_cpuid_PSE36(void) return 1; } -#ifdef CONFIG_X86_64 -static void __set_spte(u64 *sptep, u64 spte) -{ - WRITE_ONCE(*sptep, spte); -} - -static void __update_clear_spte_fast(u64 *sptep, u64 spte) -{ - WRITE_ONCE(*sptep, spte); -} - -static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) -{ - return xchg(sptep, spte); -} - -static u64 __get_spte_lockless(u64 *sptep) -{ - return READ_ONCE(*sptep); -} -#else -union split_spte { - struct { - u32 spte_low; - u32 spte_high; - }; - u64 spte; -}; - -static void count_spte_clear(u64 *sptep, u64 spte) -{ - struct kvm_mmu_page *sp = sptep_to_sp(sptep); - - if (is_shadow_present_pte(spte)) - return; - - /* Ensure the spte is completely set before we increase the count */ - smp_wmb(); - sp->clear_spte_count++; -} - -static void __set_spte(u64 *sptep, u64 spte) -{ - union split_spte *ssptep, sspte; - - ssptep = (union split_spte *)sptep; - sspte = (union split_spte)spte; - - ssptep->spte_high = sspte.spte_high; - - /* - * If we map the spte from nonpresent to present, We should store - * the high bits firstly, then set present bit, so cpu can not - * fetch this spte while we are setting the spte. - */ - smp_wmb(); - - WRITE_ONCE(ssptep->spte_low, sspte.spte_low); -} - -static void __update_clear_spte_fast(u64 *sptep, u64 spte) -{ - union split_spte *ssptep, sspte; - - ssptep = (union split_spte *)sptep; - sspte = (union split_spte)spte; - - WRITE_ONCE(ssptep->spte_low, sspte.spte_low); - - /* - * If we map the spte from present to nonpresent, we should clear - * present bit firstly to avoid vcpu fetch the old high bits. - */ - smp_wmb(); - - ssptep->spte_high = sspte.spte_high; - count_spte_clear(sptep, spte); -} - -static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) -{ - union split_spte *ssptep, sspte, orig; - - ssptep = (union split_spte *)sptep; - sspte = (union split_spte)spte; - - /* xchg acts as a barrier before the setting of the high bits */ - orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low); - orig.spte_high = ssptep->spte_high; - ssptep->spte_high = sspte.spte_high; - count_spte_clear(sptep, spte); - - return orig.spte; -} - -/* - * The idea using the light way get the spte on x86_32 guest is from - * gup_get_pte (mm/gup.c). - * - * An spte tlb flush may be pending, because kvm_set_pte_rmap - * coalesces them and we are running out of the MMU lock. Therefore - * we need to protect against in-progress updates of the spte. - * - * Reading the spte while an update is in progress may get the old value - * for the high part of the spte. The race is fine for a present->non-present - * change (because the high part of the spte is ignored for non-present spte), - * but for a present->present change we must reread the spte. - * - * All such changes are done in two steps (present->non-present and - * non-present->present), hence it is enough to count the number of - * present->non-present updates: if it changed while reading the spte, - * we might have hit the race. This is done using clear_spte_count. - */ -static u64 __get_spte_lockless(u64 *sptep) -{ - struct kvm_mmu_page *sp = sptep_to_sp(sptep); - union split_spte spte, *orig = (union split_spte *)sptep; - int count; - -retry: - count = sp->clear_spte_count; - smp_rmb(); - - spte.spte_low = orig->spte_low; - smp_rmb(); - - spte.spte_high = orig->spte_high; - smp_rmb(); - - if (unlikely(spte.spte_low != orig->spte_low || - count != sp->clear_spte_count)) - goto retry; - - return spte.spte; -} -#endif - -/* Rules for using mmu_spte_set: - * Set the sptep from nonpresent to present. - * Note: the sptep being assigned *must* be either not present - * or in a state where the hardware will not attempt to update - * the spte. - */ -static void mmu_spte_set(u64 *sptep, u64 new_spte) -{ - WARN_ON(is_shadow_present_pte(*sptep)); - __set_spte(sptep, new_spte); -} - -/* - * Update the SPTE (excluding the PFN), but do not track changes in its - * accessed/dirty status. - */ -static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte) -{ - u64 old_spte = *sptep; - - WARN_ON(!is_shadow_present_pte(new_spte)); - check_spte_writable_invariants(new_spte); - - if (!is_shadow_present_pte(old_spte)) { - mmu_spte_set(sptep, new_spte); - return old_spte; - } - - if (!spte_has_volatile_bits(old_spte)) - __update_clear_spte_fast(sptep, new_spte); - else - old_spte = __update_clear_spte_slow(sptep, new_spte); - - WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte)); - - return old_spte; -} - -/* Rules for using mmu_spte_update: - * Update the state bits, it means the mapped pfn is not changed. - * - * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote - * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only - * spte, even though the writable spte might be cached on a CPU's TLB. - * - * Returns true if the TLB needs to be flushed - */ -static bool mmu_spte_update(u64 *sptep, u64 new_spte) -{ - bool flush = false; - u64 old_spte = mmu_spte_update_no_track(sptep, new_spte); - - if (!is_shadow_present_pte(old_spte)) - return false; - - /* - * For the spte updated out of mmu-lock is safe, since - * we always atomically update it, see the comments in - * spte_has_volatile_bits(). - */ - if (is_mmu_writable_spte(old_spte) && - !is_writable_pte(new_spte)) - flush = true; - - /* - * Flush TLB when accessed/dirty states are changed in the page tables, - * to guarantee consistency between TLB and page tables. - */ - - if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) { - flush = true; - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); - } - - if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { - flush = true; - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); - } - - return flush; -} - -/* - * Rules for using mmu_spte_clear_track_bits: - * It sets the sptep from present to nonpresent, and track the - * state bits, it is used to clear the last level sptep. - * Returns the old PTE. - */ -static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) -{ - kvm_pfn_t pfn; - u64 old_spte = *sptep; - int level = sptep_to_sp(sptep)->role.level; - struct page *page; - - if (!is_shadow_present_pte(old_spte) || - !spte_has_volatile_bits(old_spte)) - __update_clear_spte_fast(sptep, 0ull); - else - old_spte = __update_clear_spte_slow(sptep, 0ull); - - if (!is_shadow_present_pte(old_spte)) - return old_spte; - - kvm_update_page_stats(kvm, level, -1); - - pfn = spte_to_pfn(old_spte); - - /* - * KVM doesn't hold a reference to any pages mapped into the guest, and - * instead uses the mmu_notifier to ensure that KVM unmaps any pages - * before they are reclaimed. Sanity check that, if the pfn is backed - * by a refcounted page, the refcount is elevated. - */ - page = kvm_pfn_to_refcounted_page(pfn); - WARN_ON(page && !page_count(page)); - - if (is_accessed_spte(old_spte)) - kvm_set_pfn_accessed(pfn); - - if (is_dirty_spte(old_spte)) - kvm_set_pfn_dirty(pfn); - - return old_spte; -} - -/* - * Rules for using mmu_spte_clear_no_track: - * Directly clear spte without caring the state bits of sptep, - * it is used to set the upper level spte. - */ -static void mmu_spte_clear_no_track(u64 *sptep) -{ - __update_clear_spte_fast(sptep, 0ull); -} - -static u64 mmu_spte_get_lockless(u64 *sptep) -{ - return __get_spte_lockless(sptep); -} - -/* Returns the Accessed status of the PTE and resets it at the same time. */ -static bool mmu_spte_age(u64 *sptep) -{ - u64 spte = mmu_spte_get_lockless(sptep); - - if (!is_accessed_spte(spte)) - return false; - - if (spte_ad_enabled(spte)) { - clear_bit((ffs(shadow_accessed_mask) - 1), - (unsigned long *)sptep); - } else { - /* - * Capture the dirty status of the page, so that it doesn't get - * lost when the SPTE is marked for access tracking. - */ - if (is_writable_pte(spte)) - kvm_set_pfn_dirty(spte_to_pfn(spte)); - - spte = mark_spte_for_access_track(spte); - mmu_spte_update_no_track(sptep, spte); - } - - return true; -} - void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu) { if (is_tdp_mmu(vcpu->arch.mmu)) { @@ -670,77 +310,6 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu) kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache); } -static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc) -{ - kmem_cache_free(pte_list_desc_cache, pte_list_desc); -} - -static bool sp_has_gptes(struct kvm_mmu_page *sp); - -static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index) -{ - if (sp->role.passthrough) - return sp->gfn; - - if (!sp->role.direct) - return sp->shadowed_translation[index] >> PAGE_SHIFT; - - return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS)); -} - -/* - * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note - * that the SPTE itself may have a more constrained access permissions that - * what the guest enforces. For example, a guest may create an executable - * huge PTE but KVM may disallow execution to mitigate iTLB multihit. - */ -static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index) -{ - if (sp_has_gptes(sp)) - return sp->shadowed_translation[index] & ACC_ALL; - - /* - * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs, - * KVM is not shadowing any guest page tables, so the "guest access - * permissions" are just ACC_ALL. - * - * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM - * is shadowing a guest huge page with small pages, the guest access - * permissions being shadowed are the access permissions of the huge - * page. - * - * In both cases, sp->role.access contains the correct access bits. - */ - return sp->role.access; -} - -static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, - gfn_t gfn, unsigned int access) -{ - if (sp_has_gptes(sp)) { - sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access; - return; - } - - WARN_ONCE(access != kvm_mmu_page_get_access(sp, index), - "access mismatch under %s page %llx (expected %u, got %u)\n", - sp->role.passthrough ? "passthrough" : "direct", - sp->gfn, kvm_mmu_page_get_access(sp, index), access); - - WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index), - "gfn mismatch under %s page %llx (expected %llx, got %llx)\n", - sp->role.passthrough ? "passthrough" : "direct", - sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn); -} - -static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, - unsigned int access) -{ - gfn_t gfn = kvm_mmu_page_get_gfn(sp, index); - - kvm_mmu_page_set_translation(sp, index, gfn, access); -} - /* * Return the pointer to the large page information for a given gfn, * handling slots that are not large page aligned. @@ -777,28 +346,6 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn) update_gfn_disallow_lpage_count(slot, gfn, -1); } -static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - struct kvm_memslots *slots; - struct kvm_memory_slot *slot; - gfn_t gfn; - - kvm->arch.indirect_shadow_pages++; - gfn = sp->gfn; - slots = kvm_memslots_for_spte_role(kvm, sp->role); - slot = __gfn_to_memslot(slots, gfn); - - /* the non-leaf shadow pages are keeping readonly. */ - if (sp->role.level > PG_LEVEL_4K) - return kvm_slot_page_track_add_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); - - kvm_mmu_gfn_disallow_lpage(slot, gfn); - - if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K)) - kvm_flush_remote_tlbs_with_address(kvm, gfn, 1); -} - void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) { /* @@ -826,23 +373,6 @@ void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp, track_possible_nx_huge_page(kvm, sp); } -static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - struct kvm_memslots *slots; - struct kvm_memory_slot *slot; - gfn_t gfn; - - kvm->arch.indirect_shadow_pages--; - gfn = sp->gfn; - slots = kvm_memslots_for_spte_role(kvm, sp->role); - slot = __gfn_to_memslot(slots, gfn); - if (sp->role.level > PG_LEVEL_4K) - return kvm_slot_page_track_remove_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); - - kvm_mmu_gfn_allow_lpage(slot, gfn); -} - void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp) { if (list_empty(&sp->possible_nx_huge_page_link)) @@ -873,437 +403,51 @@ struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, return slot; } -/* - * About rmap_head encoding: +/** + * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages + * @kvm: kvm instance + * @slot: slot to protect + * @gfn_offset: start of the BITS_PER_LONG pages we care about + * @mask: indicates which pages we should protect * - * If the bit zero of rmap_head->val is clear, then it points to the only spte - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct - * pte_list_desc containing more mappings. - */ - -/* - * Returns the number of pointers in the rmap chain, not counting the new one. + * Used when we do not need to care about huge page mappings. */ -static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, - struct kvm_rmap_head *rmap_head) -{ - struct pte_list_desc *desc; - int count = 0; - - if (!rmap_head->val) { - rmap_printk("%p %llx 0->1\n", spte, *spte); - rmap_head->val = (unsigned long)spte; - } else if (!(rmap_head->val & 1)) { - rmap_printk("%p %llx 1->many\n", spte, *spte); - desc = kvm_mmu_memory_cache_alloc(cache); - desc->sptes[0] = (u64 *)rmap_head->val; - desc->sptes[1] = spte; - desc->spte_count = 2; - rmap_head->val = (unsigned long)desc | 1; - ++count; - } else { - rmap_printk("%p %llx many->many\n", spte, *spte); - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); - while (desc->spte_count == PTE_LIST_EXT) { - count += PTE_LIST_EXT; - if (!desc->more) { - desc->more = kvm_mmu_memory_cache_alloc(cache); - desc = desc->more; - desc->spte_count = 0; - break; - } - desc = desc->more; - } - count += desc->spte_count; - desc->sptes[desc->spte_count++] = spte; - } - return count; -} - -static void -pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head, - struct pte_list_desc *desc, int i, - struct pte_list_desc *prev_desc) +static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask) { - int j = desc->spte_count - 1; + struct kvm_rmap_head *rmap_head; + + if (is_tdp_mmu_enabled(kvm)) + kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot, + slot->base_gfn + gfn_offset, mask, true); - desc->sptes[i] = desc->sptes[j]; - desc->sptes[j] = NULL; - desc->spte_count--; - if (desc->spte_count) + if (!kvm_memslots_have_rmaps(kvm)) return; - if (!prev_desc && !desc->more) - rmap_head->val = 0; - else - if (prev_desc) - prev_desc->more = desc->more; - else - rmap_head->val = (unsigned long)desc->more | 1; - mmu_free_pte_list_desc(desc); -} -static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head) -{ - struct pte_list_desc *desc; - struct pte_list_desc *prev_desc; - int i; + while (mask) { + rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), + PG_LEVEL_4K, slot); + rmap_write_protect(rmap_head, false); - if (!rmap_head->val) { - pr_err("%s: %p 0->BUG\n", __func__, spte); - BUG(); - } else if (!(rmap_head->val & 1)) { - rmap_printk("%p 1->0\n", spte); - if ((u64 *)rmap_head->val != spte) { - pr_err("%s: %p 1->BUG\n", __func__, spte); - BUG(); - } - rmap_head->val = 0; - } else { - rmap_printk("%p many->many\n", spte); - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); - prev_desc = NULL; - while (desc) { - for (i = 0; i < desc->spte_count; ++i) { - if (desc->sptes[i] == spte) { - pte_list_desc_remove_entry(rmap_head, - desc, i, prev_desc); - return; - } - } - prev_desc = desc; - desc = desc->more; - } - pr_err("%s: %p many->many\n", __func__, spte); - BUG(); + /* clear the first set bit */ + mask &= mask - 1; } } -static void kvm_zap_one_rmap_spte(struct kvm *kvm, - struct kvm_rmap_head *rmap_head, u64 *sptep) -{ - mmu_spte_clear_track_bits(kvm, sptep); - pte_list_remove(sptep, rmap_head); -} - -/* Return true if at least one SPTE was zapped, false otherwise */ -static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, - struct kvm_rmap_head *rmap_head) -{ - struct pte_list_desc *desc, *next; - int i; - - if (!rmap_head->val) - return false; - - if (!(rmap_head->val & 1)) { - mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val); - goto out; - } - - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); - - for (; desc; desc = next) { - for (i = 0; i < desc->spte_count; i++) - mmu_spte_clear_track_bits(kvm, desc->sptes[i]); - next = desc->more; - mmu_free_pte_list_desc(desc); - } -out: - /* rmap_head is meaningless now, remember to reset it */ - rmap_head->val = 0; - return true; -} - -unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) -{ - struct pte_list_desc *desc; - unsigned int count = 0; - - if (!rmap_head->val) - return 0; - else if (!(rmap_head->val & 1)) - return 1; - - desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); - - while (desc) { - count += desc->spte_count; - desc = desc->more; - } - - return count; -} - -static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, - const struct kvm_memory_slot *slot) -{ - unsigned long idx; - - idx = gfn_to_index(gfn, slot->base_gfn, level); - return &slot->arch.rmap[level - PG_LEVEL_4K][idx]; -} - -static bool rmap_can_add(struct kvm_vcpu *vcpu) -{ - struct kvm_mmu_memory_cache *mc; - - mc = &vcpu->arch.mmu_pte_list_desc_cache; - return kvm_mmu_memory_cache_nr_free_objects(mc); -} - -static void rmap_remove(struct kvm *kvm, u64 *spte) -{ - struct kvm_memslots *slots; - struct kvm_memory_slot *slot; - struct kvm_mmu_page *sp; - gfn_t gfn; - struct kvm_rmap_head *rmap_head; - - sp = sptep_to_sp(spte); - gfn = kvm_mmu_page_get_gfn(sp, spte_index(spte)); - - /* - * Unlike rmap_add, rmap_remove does not run in the context of a vCPU - * so we have to determine which memslots to use based on context - * information in sp->role. - */ - slots = kvm_memslots_for_spte_role(kvm, sp->role); - - slot = __gfn_to_memslot(slots, gfn); - rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); - - pte_list_remove(spte, rmap_head); -} - -/* - * Used by the following functions to iterate through the sptes linked by a - * rmap. All fields are private and not assumed to be used outside. - */ -struct rmap_iterator { - /* private fields */ - struct pte_list_desc *desc; /* holds the sptep if not NULL */ - int pos; /* index of the sptep */ -}; - -/* - * Iteration must be started by this function. This should also be used after - * removing/dropping sptes from the rmap link because in such cases the - * information in the iterator may not be valid. - * - * Returns sptep if found, NULL otherwise. - */ -static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head, - struct rmap_iterator *iter) -{ - u64 *sptep; - - if (!rmap_head->val) - return NULL; - - if (!(rmap_head->val & 1)) { - iter->desc = NULL; - sptep = (u64 *)rmap_head->val; - goto out; - } - - iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); - iter->pos = 0; - sptep = iter->desc->sptes[iter->pos]; -out: - BUG_ON(!is_shadow_present_pte(*sptep)); - return sptep; -} - -/* - * Must be used with a valid iterator: e.g. after rmap_get_first(). - * - * Returns sptep if found, NULL otherwise. - */ -static u64 *rmap_get_next(struct rmap_iterator *iter) -{ - u64 *sptep; - - if (iter->desc) { - if (iter->pos < PTE_LIST_EXT - 1) { - ++iter->pos; - sptep = iter->desc->sptes[iter->pos]; - if (sptep) - goto out; - } - - iter->desc = iter->desc->more; - - if (iter->desc) { - iter->pos = 0; - /* desc->sptes[0] cannot be NULL */ - sptep = iter->desc->sptes[iter->pos]; - goto out; - } - } - - return NULL; -out: - BUG_ON(!is_shadow_present_pte(*sptep)); - return sptep; -} - -#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_) \ - for (_spte_ = rmap_get_first(_rmap_head_, _iter_); \ - _spte_; _spte_ = rmap_get_next(_iter_)) - -static void drop_spte(struct kvm *kvm, u64 *sptep) -{ - u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep); - - if (is_shadow_present_pte(old_spte)) - rmap_remove(kvm, sptep); -} - -static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush) -{ - struct kvm_mmu_page *sp; - - sp = sptep_to_sp(sptep); - WARN_ON(sp->role.level == PG_LEVEL_4K); - - drop_spte(kvm, sptep); - - if (flush) - kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, - KVM_PAGES_PER_HPAGE(sp->role.level)); -} - -/* - * Write-protect on the specified @sptep, @pt_protect indicates whether - * spte write-protection is caused by protecting shadow page table. - * - * Note: write protection is difference between dirty logging and spte - * protection: - * - for dirty logging, the spte can be set to writable at anytime if - * its dirty bitmap is properly set. - * - for spte protection, the spte can be writable only after unsync-ing - * shadow page. - * - * Return true if tlb need be flushed. - */ -static bool spte_write_protect(u64 *sptep, bool pt_protect) -{ - u64 spte = *sptep; - - if (!is_writable_pte(spte) && - !(pt_protect && is_mmu_writable_spte(spte))) - return false; - - rmap_printk("spte %p %llx\n", sptep, *sptep); - - if (pt_protect) - spte &= ~shadow_mmu_writable_mask; - spte = spte & ~PT_WRITABLE_MASK; - - return mmu_spte_update(sptep, spte); -} - -static bool rmap_write_protect(struct kvm_rmap_head *rmap_head, - bool pt_protect) -{ - u64 *sptep; - struct rmap_iterator iter; - bool flush = false; - - for_each_rmap_spte(rmap_head, &iter, sptep) - flush |= spte_write_protect(sptep, pt_protect); - - return flush; -} - -static bool spte_clear_dirty(u64 *sptep) -{ - u64 spte = *sptep; - - rmap_printk("spte %p %llx\n", sptep, *sptep); - - MMU_WARN_ON(!spte_ad_enabled(spte)); - spte &= ~shadow_dirty_mask; - return mmu_spte_update(sptep, spte); -} - -static bool spte_wrprot_for_clear_dirty(u64 *sptep) -{ - bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT, - (unsigned long *)sptep); - if (was_writable && !spte_ad_enabled(*sptep)) - kvm_set_pfn_dirty(spte_to_pfn(*sptep)); - - return was_writable; -} - -/* - * Gets the GFN ready for another round of dirty logging by clearing the - * - D bit on ad-enabled SPTEs, and - * - W bit on ad-disabled SPTEs. - * Returns true iff any D or W bits were cleared. - */ -static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot) -{ - u64 *sptep; - struct rmap_iterator iter; - bool flush = false; - - for_each_rmap_spte(rmap_head, &iter, sptep) - if (spte_ad_need_write_protect(*sptep)) - flush |= spte_wrprot_for_clear_dirty(sptep); - else - flush |= spte_clear_dirty(sptep); - - return flush; -} - -/** - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages - * @kvm: kvm instance - * @slot: slot to protect - * @gfn_offset: start of the BITS_PER_LONG pages we care about - * @mask: indicates which pages we should protect - * - * Used when we do not need to care about huge page mappings. - */ -static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, - struct kvm_memory_slot *slot, - gfn_t gfn_offset, unsigned long mask) -{ - struct kvm_rmap_head *rmap_head; - - if (is_tdp_mmu_enabled(kvm)) - kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot, - slot->base_gfn + gfn_offset, mask, true); - - if (!kvm_memslots_have_rmaps(kvm)) - return; - - while (mask) { - rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), - PG_LEVEL_4K, slot); - rmap_write_protect(rmap_head, false); - - /* clear the first set bit */ - mask &= mask - 1; - } -} - -/** - * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages, or write - * protect the page if the D-bit isn't supported. - * @kvm: kvm instance - * @slot: slot to clear D-bit - * @gfn_offset: start of the BITS_PER_LONG pages we care about - * @mask: indicates which pages we should clear D-bit - * - * Used for PML to re-log the dirty GPAs after userspace querying dirty_bitmap. - */ -static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm, - struct kvm_memory_slot *slot, - gfn_t gfn_offset, unsigned long mask) +/** + * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages, or write + * protect the page if the D-bit isn't supported. + * @kvm: kvm instance + * @slot: slot to clear D-bit + * @gfn_offset: start of the BITS_PER_LONG pages we care about + * @mask: indicates which pages we should clear D-bit + * + * Used for PML to re-log the dirty GPAs after userspace querying dirty_bitmap. + */ +static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask) { struct kvm_rmap_head *rmap_head; @@ -1405,147 +549,6 @@ bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn) return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, PG_LEVEL_4K); } -static bool __kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot) -{ - return kvm_zap_all_rmap_sptes(kvm, rmap_head); -} - -static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t unused) -{ - return __kvm_zap_rmap(kvm, rmap_head, slot); -} - -static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t pte) -{ - u64 *sptep; - struct rmap_iterator iter; - bool need_flush = false; - u64 new_spte; - kvm_pfn_t new_pfn; - - WARN_ON(pte_huge(pte)); - new_pfn = pte_pfn(pte); - -restart: - for_each_rmap_spte(rmap_head, &iter, sptep) { - rmap_printk("spte %p %llx gfn %llx (%d)\n", - sptep, *sptep, gfn, level); - - need_flush = true; - - if (pte_write(pte)) { - kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); - goto restart; - } else { - new_spte = kvm_mmu_changed_pte_notifier_make_spte( - *sptep, new_pfn); - - mmu_spte_clear_track_bits(kvm, sptep); - mmu_spte_set(sptep, new_spte); - } - } - - if (need_flush && kvm_available_flush_tlb_with_range()) { - kvm_flush_remote_tlbs_with_address(kvm, gfn, 1); - return false; - } - - return need_flush; -} - -struct slot_rmap_walk_iterator { - /* input fields. */ - const struct kvm_memory_slot *slot; - gfn_t start_gfn; - gfn_t end_gfn; - int start_level; - int end_level; - - /* output fields. */ - gfn_t gfn; - struct kvm_rmap_head *rmap; - int level; - - /* private field. */ - struct kvm_rmap_head *end_rmap; -}; - -static void -rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level) -{ - iterator->level = level; - iterator->gfn = iterator->start_gfn; - iterator->rmap = gfn_to_rmap(iterator->gfn, level, iterator->slot); - iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot); -} - -static void -slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator, - const struct kvm_memory_slot *slot, int start_level, - int end_level, gfn_t start_gfn, gfn_t end_gfn) -{ - iterator->slot = slot; - iterator->start_level = start_level; - iterator->end_level = end_level; - iterator->start_gfn = start_gfn; - iterator->end_gfn = end_gfn; - - rmap_walk_init_level(iterator, iterator->start_level); -} - -static bool slot_rmap_walk_okay(struct slot_rmap_walk_iterator *iterator) -{ - return !!iterator->rmap; -} - -static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator) -{ - while (++iterator->rmap <= iterator->end_rmap) { - iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level)); - - if (iterator->rmap->val) - return; - } - - if (++iterator->level > iterator->end_level) { - iterator->rmap = NULL; - return; - } - - rmap_walk_init_level(iterator, iterator->level); -} - -#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_, \ - _start_gfn, _end_gfn, _iter_) \ - for (slot_rmap_walk_init(_iter_, _slot_, _start_level_, \ - _end_level_, _start_gfn, _end_gfn); \ - slot_rmap_walk_okay(_iter_); \ - slot_rmap_walk_next(_iter_)) - -typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, - int level, pte_t pte); - -static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm, - struct kvm_gfn_range *range, - rmap_handler_t handler) -{ - struct slot_rmap_walk_iterator iterator; - bool ret = false; - - for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, - range->start, range->end - 1, &iterator) - ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn, - iterator.level, range->pte); - - return ret; -} - bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) { bool flush = false; @@ -1572,2392 +575,596 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) return flush; } -static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t unused) +bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { - u64 *sptep; - struct rmap_iterator iter; - int young = 0; - - for_each_rmap_spte(rmap_head, &iter, sptep) - young |= mmu_spte_age(sptep); + bool young = false; - return young; -} + if (kvm_memslots_have_rmaps(kvm)) + young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); -static bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, - int level, pte_t unused) -{ - u64 *sptep; - struct rmap_iterator iter; + if (is_tdp_mmu_enabled(kvm)) + young |= kvm_tdp_mmu_age_gfn_range(kvm, range); - for_each_rmap_spte(rmap_head, &iter, sptep) - if (is_accessed_spte(*sptep)) - return true; - return false; + return young; } -#define RMAP_RECYCLE_THRESHOLD 1000 - -static void __rmap_add(struct kvm *kvm, - struct kvm_mmu_memory_cache *cache, - const struct kvm_memory_slot *slot, - u64 *spte, gfn_t gfn, unsigned int access) +bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { - struct kvm_mmu_page *sp; - struct kvm_rmap_head *rmap_head; - int rmap_count; + bool young = false; - sp = sptep_to_sp(spte); - kvm_mmu_page_set_translation(sp, spte_index(spte), gfn, access); - kvm_update_page_stats(kvm, sp->role.level, 1); + if (kvm_memslots_have_rmaps(kvm)) + young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap); - rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); - rmap_count = pte_list_add(cache, spte, rmap_head); + if (is_tdp_mmu_enabled(kvm)) + young |= kvm_tdp_mmu_test_age_gfn(kvm, range); - if (rmap_count > kvm->stat.max_mmu_rmap_size) - kvm->stat.max_mmu_rmap_size = rmap_count; - if (rmap_count > RMAP_RECYCLE_THRESHOLD) { - kvm_zap_all_rmap_sptes(kvm, rmap_head); - kvm_flush_remote_tlbs_with_address( - kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); - } + return young; } -static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, - u64 *spte, gfn_t gfn, unsigned int access) +bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list, + bool remote_flush) { - struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache; + if (!remote_flush && list_empty(invalid_list)) + return false; - __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access); + if (!list_empty(invalid_list)) + kvm_mmu_commit_zap_page(kvm, invalid_list); + else + kvm_flush_remote_tlbs(kvm); + return true; } -bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) +bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) { - bool young = false; - - if (kvm_memslots_have_rmaps(kvm)) - young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); - - if (is_tdp_mmu_enabled(kvm)) - young |= kvm_tdp_mmu_age_gfn_range(kvm, range); + if (sp->role.invalid) + return true; - return young; + /* TDP MMU pages do not use the MMU generation. */ + return !sp->tdp_mmu_page && + unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); } -bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) +/* + * Lookup the mapping level for @gfn in the current mm. + * + * WARNING! Use of host_pfn_mapping_level() requires the caller and the end + * consumer to be tied into KVM's handlers for MMU notifier events! + * + * There are several ways to safely use this helper: + * + * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before + * consuming it. In this case, mmu_lock doesn't need to be held during the + * lookup, but it does need to be held while checking the MMU notifier. + * + * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation + * event for the hva. This can be done by explicit checking the MMU notifier + * or by ensuring that KVM already has a valid mapping that covers the hva. + * + * - Do not use the result to install new mappings, e.g. use the host mapping + * level only to decide whether or not to zap an entry. In this case, it's + * not required to hold mmu_lock (though it's highly likely the caller will + * want to hold mmu_lock anyways, e.g. to modify SPTEs). + * + * Note! The lookup can still race with modifications to host page tables, but + * the above "rules" ensure KVM will not _consume_ the result of the walk if a + * race with the primary MMU occurs. + */ +static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, + const struct kvm_memory_slot *slot) { - bool young = false; + int level = PG_LEVEL_4K; + unsigned long hva; + unsigned long flags; + pgd_t pgd; + p4d_t p4d; + pud_t pud; + pmd_t pmd; - if (kvm_memslots_have_rmaps(kvm)) - young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap); + /* + * Note, using the already-retrieved memslot and __gfn_to_hva_memslot() + * is not solely for performance, it's also necessary to avoid the + * "writable" check in __gfn_to_hva_many(), which will always fail on + * read-only memslots due to gfn_to_hva() assuming writes. Earlier + * page fault steps have already verified the guest isn't writing a + * read-only memslot. + */ + hva = __gfn_to_hva_memslot(slot, gfn); - if (is_tdp_mmu_enabled(kvm)) - young |= kvm_tdp_mmu_test_age_gfn(kvm, range); + /* + * Disable IRQs to prevent concurrent tear down of host page tables, + * e.g. if the primary MMU promotes a P*D to a huge page and then frees + * the original page table. + */ + local_irq_save(flags); - return young; -} + /* + * Read each entry once. As above, a non-leaf entry can be promoted to + * a huge page _during_ this walk. Re-reading the entry could send the + * walk into the weeks, e.g. p*d_large() returns false (sees the old + * value) and then p*d_offset() walks into the target huge page instead + * of the old page table (sees the new value). + */ + pgd = READ_ONCE(*pgd_offset(kvm->mm, hva)); + if (pgd_none(pgd)) + goto out; -#ifdef MMU_DEBUG -static int is_empty_shadow_page(u64 *spt) -{ - u64 *pos; - u64 *end; + p4d = READ_ONCE(*p4d_offset(&pgd, hva)); + if (p4d_none(p4d) || !p4d_present(p4d)) + goto out; - for (pos = spt, end = pos + SPTE_ENT_PER_PAGE; pos != end; pos++) - if (is_shadow_present_pte(*pos)) { - printk(KERN_ERR "%s: %p %llx\n", __func__, - pos, *pos); - return 0; - } - return 1; -} -#endif + pud = READ_ONCE(*pud_offset(&p4d, hva)); + if (pud_none(pud) || !pud_present(pud)) + goto out; -/* - * This value is the sum of all of the kvm instances's - * kvm->arch.n_used_mmu_pages values. We need a global, - * aggregate version in order to make the slab shrinker - * faster - */ -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr) -{ - kvm->arch.n_used_mmu_pages += nr; - percpu_counter_add(&kvm_total_used_mmu_pages, nr); -} + if (pud_large(pud)) { + level = PG_LEVEL_1G; + goto out; + } -static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - kvm_mod_used_mmu_pages(kvm, +1); - kvm_account_pgtable_pages((void *)sp->spt, +1); -} + pmd = READ_ONCE(*pmd_offset(&pud, hva)); + if (pmd_none(pmd) || !pmd_present(pmd)) + goto out; -static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - kvm_mod_used_mmu_pages(kvm, -1); - kvm_account_pgtable_pages((void *)sp->spt, -1); -} + if (pmd_large(pmd)) + level = PG_LEVEL_2M; -static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp) -{ - MMU_WARN_ON(!is_empty_shadow_page(sp->spt)); - hlist_del(&sp->hash_link); - list_del(&sp->link); - free_page((unsigned long)sp->spt); - if (!sp->role.direct) - free_page((unsigned long)sp->shadowed_translation); - kmem_cache_free(mmu_page_header_cache, sp); +out: + local_irq_restore(flags); + return level; } -static unsigned kvm_page_table_hashfn(gfn_t gfn) +int kvm_mmu_max_mapping_level(struct kvm *kvm, + const struct kvm_memory_slot *slot, gfn_t gfn, + int max_level) { - return hash_64(gfn, KVM_MMU_HASH_SHIFT); -} + struct kvm_lpage_info *linfo; + int host_level; -static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache, - struct kvm_mmu_page *sp, u64 *parent_pte) -{ - if (!parent_pte) - return; + max_level = min(max_level, max_huge_page_level); + for ( ; max_level > PG_LEVEL_4K; max_level--) { + linfo = lpage_info_slot(gfn, slot, max_level); + if (!linfo->disallow_lpage) + break; + } - pte_list_add(cache, parent_pte, &sp->parent_ptes); -} + if (max_level == PG_LEVEL_4K) + return PG_LEVEL_4K; -static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp, - u64 *parent_pte) -{ - pte_list_remove(parent_pte, &sp->parent_ptes); + host_level = host_pfn_mapping_level(kvm, gfn, slot); + return min(host_level, max_level); } -static void drop_parent_pte(struct kvm_mmu_page *sp, - u64 *parent_pte) +void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { - mmu_page_remove_parent_pte(sp, parent_pte); - mmu_spte_clear_no_track(parent_pte); -} + struct kvm_memory_slot *slot = fault->slot; + kvm_pfn_t mask; -static void mark_unsync(u64 *spte); -static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp) -{ - u64 *sptep; - struct rmap_iterator iter; + fault->huge_page_disallowed = fault->exec && fault->nx_huge_page_workaround_enabled; - for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) { - mark_unsync(sptep); - } -} + if (unlikely(fault->max_level == PG_LEVEL_4K)) + return; -static void mark_unsync(u64 *spte) -{ - struct kvm_mmu_page *sp; + if (is_error_noslot_pfn(fault->pfn)) + return; - sp = sptep_to_sp(spte); - if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap)) + if (kvm_slot_dirty_track_enabled(slot)) return; - if (sp->unsync_children++) + + /* + * Enforce the iTLB multihit workaround after capturing the requested + * level, which will be used to do precise, accurate accounting. + */ + fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, + fault->gfn, fault->max_level); + if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed) return; - kvm_mmu_mark_parents_unsync(sp); -} -static int nonpaging_sync_page(struct kvm_vcpu *vcpu, - struct kvm_mmu_page *sp) -{ - return -1; + /* + * mmu_invalidate_retry() was successful and mmu_lock is held, so + * the pmd can't be split from under us. + */ + fault->goal_level = fault->req_level; + mask = KVM_PAGES_PER_HPAGE(fault->goal_level) - 1; + VM_BUG_ON((fault->gfn & mask) != (fault->pfn & mask)); + fault->pfn &= ~mask; } -#define KVM_PAGE_ARRAY_NR 16 - -struct kvm_mmu_pages { - struct mmu_page_and_offset { - struct kvm_mmu_page *sp; - unsigned int idx; - } page[KVM_PAGE_ARRAY_NR]; - unsigned int nr; -}; - -static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp, - int idx) +void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level) { - int i; - - if (sp->unsync) - for (i=0; i < pvec->nr; i++) - if (pvec->page[i].sp == sp) - return 0; - - pvec->page[pvec->nr].sp = sp; - pvec->page[pvec->nr].idx = idx; - pvec->nr++; - return (pvec->nr == KVM_PAGE_ARRAY_NR); + if (cur_level > PG_LEVEL_4K && + cur_level == fault->goal_level && + is_shadow_present_pte(spte) && + !is_large_pte(spte) && + spte_to_child_sp(spte)->nx_huge_page_disallowed) { + /* + * A small SPTE exists for this pfn, but FNAME(fetch) + * and __direct_map would like to create a large PTE + * instead: just force them to go down another level, + * patching back for them into pfn the next 9 bits of + * the address. + */ + u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) - + KVM_PAGES_PER_HPAGE(cur_level - 1); + fault->pfn |= fault->gfn & page_mask; + fault->goal_level--; + } } -static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx) +static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk) { - --sp->unsync_children; - WARN_ON((int)sp->unsync_children < 0); - __clear_bit(idx, sp->unsync_child_bitmap); + send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk); } -static int __mmu_unsync_walk(struct kvm_mmu_page *sp, - struct kvm_mmu_pages *pvec) +static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) { - int i, ret, nr_unsync_leaf = 0; - - for_each_set_bit(i, sp->unsync_child_bitmap, 512) { - struct kvm_mmu_page *child; - u64 ent = sp->spt[i]; + if (is_sigpending_pfn(pfn)) { + kvm_handle_signal_exit(vcpu); + return -EINTR; + } - if (!is_shadow_present_pte(ent) || is_large_pte(ent)) { - clear_unsync_child_bit(sp, i); - continue; - } + /* + * Do not cache the mmio info caused by writing the readonly gfn + * into the spte otherwise read access on readonly gfn also can + * caused mmio page fault and treat it as mmio access. + */ + if (pfn == KVM_PFN_ERR_RO_FAULT) + return RET_PF_EMULATE; - child = spte_to_child_sp(ent); - - if (child->unsync_children) { - if (mmu_pages_add(pvec, child, i)) - return -ENOSPC; - - ret = __mmu_unsync_walk(child, pvec); - if (!ret) { - clear_unsync_child_bit(sp, i); - continue; - } else if (ret > 0) { - nr_unsync_leaf += ret; - } else - return ret; - } else if (child->unsync) { - nr_unsync_leaf++; - if (mmu_pages_add(pvec, child, i)) - return -ENOSPC; - } else - clear_unsync_child_bit(sp, i); + if (pfn == KVM_PFN_ERR_HWPOISON) { + kvm_send_hwpoison_signal(kvm_vcpu_gfn_to_hva(vcpu, gfn), current); + return RET_PF_RETRY; } - return nr_unsync_leaf; + return -EFAULT; } -#define INVALID_INDEX (-1) - -static int mmu_unsync_walk(struct kvm_mmu_page *sp, - struct kvm_mmu_pages *pvec) +static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + unsigned int access) { - pvec->nr = 0; - if (!sp->unsync_children) - return 0; + /* The pfn is invalid, report the error! */ + if (unlikely(is_error_pfn(fault->pfn))) + return kvm_handle_error_pfn(vcpu, fault->gfn, fault->pfn); - mmu_pages_add(pvec, sp, INVALID_INDEX); - return __mmu_unsync_walk(sp, pvec); -} + if (unlikely(!fault->slot)) { + gva_t gva = fault->is_tdp ? 0 : fault->addr; -static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - WARN_ON(!sp->unsync); - trace_kvm_mmu_sync_page(sp); - sp->unsync = 0; - --kvm->stat.mmu_unsync; -} + vcpu_cache_mmio_info(vcpu, gva, fault->gfn, + access & shadow_mmio_access_mask); + /* + * If MMIO caching is disabled, emulate immediately without + * touching the shadow page tables as attempting to install an + * MMIO SPTE will just be an expensive nop. Do not cache MMIO + * whose gfn is greater than host.MAXPHYADDR, any guest that + * generates such gfns is running nested and is being tricked + * by L0 userspace (you can observe gfn > L1.MAXPHYADDR if + * and only if L1's MAXPHYADDR is inaccurate with respect to + * the hardware's). + */ + if (unlikely(!enable_mmio_caching) || + unlikely(fault->gfn > kvm_mmu_max_gfn())) + return RET_PF_EMULATE; + } -static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, - struct list_head *invalid_list); -static void kvm_mmu_commit_zap_page(struct kvm *kvm, - struct list_head *invalid_list); + return RET_PF_CONTINUE; +} -static bool sp_has_gptes(struct kvm_mmu_page *sp) +static bool page_fault_can_be_fast(struct kvm_page_fault *fault) { - if (sp->role.direct) - return false; - - if (sp->role.passthrough) + /* + * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only + * reach the common page fault handler if the SPTE has an invalid MMIO + * generation number. Refreshing the MMIO generation needs to go down + * the slow path. Note, EPT Misconfigs do NOT set the PRESENT flag! + */ + if (fault->rsvd) return false; - return true; -} - -#define for_each_valid_sp(_kvm, _sp, _list) \ - hlist_for_each_entry(_sp, _list, hash_link) \ - if (is_obsolete_sp((_kvm), (_sp))) { \ - } else - -#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \ - for_each_valid_sp(_kvm, _sp, \ - &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \ - if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else - -static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, - struct list_head *invalid_list) -{ - int ret = vcpu->arch.mmu->sync_page(vcpu, sp); + /* + * #PF can be fast if: + * + * 1. The shadow page table entry is not present and A/D bits are + * disabled _by KVM_, which could mean that the fault is potentially + * caused by access tracking (if enabled). If A/D bits are enabled + * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D + * bits for L2 and employ access tracking, but the fast page fault + * mechanism only supports direct MMUs. + * 2. The shadow page table entry is present, the access is a write, + * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e. + * the fault was caused by a write-protection violation. If the + * SPTE is MMU-writable (determined later), the fault can be fixed + * by setting the Writable bit, which can be done out of mmu_lock. + */ + if (!fault->present) + return !kvm_ad_enabled(); - if (ret < 0) - kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); - return ret; + /* + * Note, instruction fetches and writes are mutually exclusive, ignore + * the "exec" flag. + */ + return fault->write; } -bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list, - bool remote_flush) +/* + * Returns true if the SPTE was fixed successfully. Otherwise, + * someone else modified the SPTE from its original value. + */ +static bool +fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + u64 *sptep, u64 old_spte, u64 new_spte) { - if (!remote_flush && list_empty(invalid_list)) + /* + * Theoretically we could also set dirty bit (and flush TLB) here in + * order to eliminate unnecessary PML logging. See comments in + * set_spte. But fast_page_fault is very unlikely to happen with PML + * enabled, so we do not do this. This might result in the same GPA + * to be logged in PML buffer again when the write really happens, and + * eventually to be called by mark_page_dirty twice. But it's also no + * harm. This also avoids the TLB flush needed after setting dirty bit + * so non-PML cases won't be impacted. + * + * Compare with set_spte where instead shadow_dirty_mask is set. + */ + if (!try_cmpxchg64(sptep, &old_spte, new_spte)) return false; - if (!list_empty(invalid_list)) - kvm_mmu_commit_zap_page(kvm, invalid_list); - else - kvm_flush_remote_tlbs(kvm); - return true; -} - -bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - if (sp->role.invalid) - return true; + if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) + mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn); - /* TDP MMU pages do not use the MMU generation. */ - return !sp->tdp_mmu_page && - unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); + return true; } -struct mmu_page_path { - struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL]; - unsigned int idx[PT64_ROOT_MAX_LEVEL]; -}; - -#define for_each_sp(pvec, sp, parents, i) \ - for (i = mmu_pages_first(&pvec, &parents); \ - i < pvec.nr && ({ sp = pvec.page[i].sp; 1;}); \ - i = mmu_pages_next(&pvec, &parents, i)) - -static int mmu_pages_next(struct kvm_mmu_pages *pvec, - struct mmu_page_path *parents, - int i) +static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte) { - int n; - - for (n = i+1; n < pvec->nr; n++) { - struct kvm_mmu_page *sp = pvec->page[n].sp; - unsigned idx = pvec->page[n].idx; - int level = sp->role.level; - - parents->idx[level-1] = idx; - if (level == PG_LEVEL_4K) - break; + if (fault->exec) + return is_executable_pte(spte); - parents->parent[level-2] = sp; - } + if (fault->write) + return is_writable_pte(spte); - return n; + /* Fault was on Read access */ + return spte & PT_PRESENT_MASK; } -static int mmu_pages_first(struct kvm_mmu_pages *pvec, - struct mmu_page_path *parents) +/* + * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS. + */ +static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_mmu_page *sp; - int level; - - if (pvec->nr == 0) - return 0; - - WARN_ON(pvec->page[0].idx != INVALID_INDEX); - - sp = pvec->page[0].sp; - level = sp->role.level; - WARN_ON(level == PG_LEVEL_4K); - - parents->parent[level-2] = sp; + int ret = RET_PF_INVALID; + u64 spte = 0ull; + u64 *sptep = NULL; + uint retry_count = 0; - /* Also set up a sentinel. Further entries in pvec are all - * children of sp, so this element is never overwritten. - */ - parents->parent[level-1] = NULL; - return mmu_pages_next(pvec, parents, 0); -} + if (!page_fault_can_be_fast(fault)) + return ret; -static void mmu_pages_clear_parents(struct mmu_page_path *parents) -{ - struct kvm_mmu_page *sp; - unsigned int level = 0; + walk_shadow_page_lockless_begin(vcpu); do { - unsigned int idx = parents->idx[level]; - sp = parents->parent[level]; - if (!sp) - return; - - WARN_ON(idx == INVALID_INDEX); - clear_unsync_child_bit(sp, idx); - level++; - } while (!sp->unsync_children); -} - -static int mmu_sync_children(struct kvm_vcpu *vcpu, - struct kvm_mmu_page *parent, bool can_yield) -{ - int i; - struct kvm_mmu_page *sp; - struct mmu_page_path parents; - struct kvm_mmu_pages pages; - LIST_HEAD(invalid_list); - bool flush = false; - - while (mmu_unsync_walk(parent, &pages)) { - bool protected = false; - - for_each_sp(pages, sp, parents, i) - protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn); - - if (protected) { - kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true); - flush = false; - } - - for_each_sp(pages, sp, parents, i) { - kvm_unlink_unsync_page(vcpu->kvm, sp); - flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0; - mmu_pages_clear_parents(&parents); - } - if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) { - kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush); - if (!can_yield) { - kvm_make_request(KVM_REQ_MMU_SYNC, vcpu); - return -EINTR; - } - - cond_resched_rwlock_write(&vcpu->kvm->mmu_lock); - flush = false; - } - } - - kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush); - return 0; -} + u64 new_spte; -static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp) -{ - atomic_set(&sp->write_flooding_count, 0); -} + if (is_tdp_mmu(vcpu->arch.mmu)) + sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte); + else + sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte); -static void clear_sp_write_flooding_count(u64 *spte) -{ - __clear_sp_write_flooding_count(sptep_to_sp(spte)); -} + if (!is_shadow_present_pte(spte)) + break; -/* - * The vCPU is required when finding indirect shadow pages; the shadow - * page may already exist and syncing it needs the vCPU pointer in - * order to read guest page tables. Direct shadow pages are never - * unsync, thus @vcpu can be NULL if @role.direct is true. - */ -static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, - struct kvm_vcpu *vcpu, - gfn_t gfn, - struct hlist_head *sp_list, - union kvm_mmu_page_role role) -{ - struct kvm_mmu_page *sp; - int ret; - int collisions = 0; - LIST_HEAD(invalid_list); + sp = sptep_to_sp(sptep); + if (!is_last_spte(spte, sp->role.level)) + break; - for_each_valid_sp(kvm, sp, sp_list) { - if (sp->gfn != gfn) { - collisions++; - continue; + /* + * Check whether the memory access that caused the fault would + * still cause it if it were to be performed right now. If not, + * then this is a spurious fault caused by TLB lazily flushed, + * or some other CPU has already fixed the PTE after the + * current CPU took the fault. + * + * Need not check the access of upper level table entries since + * they are always ACC_ALL. + */ + if (is_access_allowed(fault, spte)) { + ret = RET_PF_SPURIOUS; + break; } - if (sp->role.word != role.word) { - /* - * If the guest is creating an upper-level page, zap - * unsync pages for the same gfn. While it's possible - * the guest is using recursive page tables, in all - * likelihood the guest has stopped using the unsync - * page and is installing a completely unrelated page. - * Unsync pages must not be left as is, because the new - * upper-level page will be write-protected. - */ - if (role.level > PG_LEVEL_4K && sp->unsync) - kvm_mmu_prepare_zap_page(kvm, sp, - &invalid_list); - continue; - } + new_spte = spte; - /* unsync and write-flooding only apply to indirect SPs. */ - if (sp->role.direct) - goto out; + /* + * KVM only supports fixing page faults outside of MMU lock for + * direct MMUs, nested MMUs are always indirect, and KVM always + * uses A/D bits for non-nested MMUs. Thus, if A/D bits are + * enabled, the SPTE can't be an access-tracked SPTE. + */ + if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte)) + new_spte = restore_acc_track_spte(new_spte); - if (sp->unsync) { - if (KVM_BUG_ON(!vcpu, kvm)) - break; + /* + * To keep things simple, only SPTEs that are MMU-writable can + * be made fully writable outside of mmu_lock, e.g. only SPTEs + * that were write-protected for dirty-logging or access + * tracking are handled here. Don't bother checking if the + * SPTE is writable to prioritize running with A/D bits enabled. + * The is_access_allowed() check above handles the common case + * of the fault being spurious, and the SPTE is known to be + * shadow-present, i.e. except for access tracking restoration + * making the new SPTE writable, the check is wasteful. + */ + if (fault->write && is_mmu_writable_spte(spte)) { + new_spte |= PT_WRITABLE_MASK; /* - * The page is good, but is stale. kvm_sync_page does - * get the latest guest state, but (unlike mmu_unsync_children) - * it doesn't write-protect the page or mark it synchronized! - * This way the validity of the mapping is ensured, but the - * overhead of write protection is not incurred until the - * guest invalidates the TLB mapping. This allows multiple - * SPs for a single gfn to be unsync. + * Do not fix write-permission on the large spte when + * dirty logging is enabled. Since we only dirty the + * first page into the dirty-bitmap in + * fast_pf_fix_direct_spte(), other pages are missed + * if its slot has dirty logging enabled. * - * If the sync fails, the page is zapped. If so, break - * in order to rebuild it. + * Instead, we let the slow page fault path create a + * normal spte to fix the access. */ - ret = kvm_sync_page(vcpu, sp, &invalid_list); - if (ret < 0) + if (sp->role.level > PG_LEVEL_4K && + kvm_slot_dirty_track_enabled(fault->slot)) break; - - WARN_ON(!list_empty(&invalid_list)); - if (ret > 0) - kvm_flush_remote_tlbs(kvm); } - __clear_sp_write_flooding_count(sp); - - goto out; - } - - sp = NULL; - ++kvm->stat.mmu_cache_miss; - -out: - kvm_mmu_commit_zap_page(kvm, &invalid_list); - - if (collisions > kvm->stat.max_mmu_page_hash_collisions) - kvm->stat.max_mmu_page_hash_collisions = collisions; - return sp; -} - -/* Caches used when allocating a new shadow page. */ -struct shadow_page_caches { - struct kvm_mmu_memory_cache *page_header_cache; - struct kvm_mmu_memory_cache *shadow_page_cache; - struct kvm_mmu_memory_cache *shadowed_info_cache; -}; - -static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm, - struct shadow_page_caches *caches, - gfn_t gfn, - struct hlist_head *sp_list, - union kvm_mmu_page_role role) -{ - struct kvm_mmu_page *sp; - - sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache); - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache); - if (!role.direct) - sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache); - - set_page_private(virt_to_page(sp->spt), (unsigned long)sp); - - INIT_LIST_HEAD(&sp->possible_nx_huge_page_link); - - /* - * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages() - * depends on valid pages being added to the head of the list. See - * comments in kvm_zap_obsolete_pages(). - */ - sp->mmu_valid_gen = kvm->arch.mmu_valid_gen; - list_add(&sp->link, &kvm->arch.active_mmu_pages); - kvm_account_mmu_page(kvm, sp); - - sp->gfn = gfn; - sp->role = role; - hlist_add_head(&sp->hash_link, sp_list); - if (sp_has_gptes(sp)) - account_shadowed(kvm, sp); - - return sp; -} - -/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */ -static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm, - struct kvm_vcpu *vcpu, - struct shadow_page_caches *caches, - gfn_t gfn, - union kvm_mmu_page_role role) -{ - struct hlist_head *sp_list; - struct kvm_mmu_page *sp; - bool created = false; - - sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; - - sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role); - if (!sp) { - created = true; - sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role); - } - - trace_kvm_mmu_get_page(sp, created); - return sp; -} - -static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, - gfn_t gfn, - union kvm_mmu_page_role role) -{ - struct shadow_page_caches caches = { - .page_header_cache = &vcpu->arch.mmu_page_header_cache, - .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache, - .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache, - }; - - return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role); -} - -static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, - unsigned int access) -{ - struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep); - union kvm_mmu_page_role role; + /* Verify that the fault can be handled in the fast path */ + if (new_spte == spte || + !is_access_allowed(fault, new_spte)) + break; - role = parent_sp->role; - role.level--; - role.access = access; - role.direct = direct; - role.passthrough = 0; + /* + * Currently, fast page fault only works for direct mapping + * since the gfn is not stable for indirect shadow page. See + * Documentation/virt/kvm/locking.rst to get more detail. + */ + if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte, new_spte)) { + ret = RET_PF_FIXED; + break; + } - /* - * If the guest has 4-byte PTEs then that means it's using 32-bit, - * 2-level, non-PAE paging. KVM shadows such guests with PAE paging - * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must - * shadow each guest page table with multiple shadow page tables, which - * requires extra bookkeeping in the role. - * - * Specifically, to shadow the guest's page directory (which covers a - * 4GiB address space), KVM uses 4 PAE page directories, each mapping - * 1GiB of the address space. @role.quadrant encodes which quarter of - * the address space each maps. - * - * To shadow the guest's page tables (which each map a 4MiB region), KVM - * uses 2 PAE page tables, each mapping a 2MiB region. For these, - * @role.quadrant encodes which half of the region they map. - * - * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE - * consumes bits 29:21. To consume bits 31:30, KVM's uses 4 shadow - * PDPTEs; those 4 PAE page directories are pre-allocated and their - * quadrant is assigned in mmu_alloc_root(). A 4-byte PTE consumes - * bits 21:12, while an 8-byte PTE consumes bits 20:12. To consume - * bit 21 in the PTE (the child here), KVM propagates that bit to the - * quadrant, i.e. sets quadrant to '0' or '1'. The parent 8-byte PDE - * covers bit 21 (see above), thus the quadrant is calculated from the - * _least_ significant bit of the PDE index. - */ - if (role.has_4_byte_gpte) { - WARN_ON_ONCE(role.level != PG_LEVEL_4K); - role.quadrant = spte_index(sptep) & 1; - } - - return role; -} - -static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, - u64 *sptep, gfn_t gfn, - bool direct, unsigned int access) -{ - union kvm_mmu_page_role role; - - if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) - return ERR_PTR(-EEXIST); - - role = kvm_mmu_child_role(sptep, direct, access); - return kvm_mmu_get_shadow_page(vcpu, gfn, role); -} - -static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, - struct kvm_vcpu *vcpu, hpa_t root, - u64 addr) -{ - iterator->addr = addr; - iterator->shadow_addr = root; - iterator->level = vcpu->arch.mmu->root_role.level; - - if (iterator->level >= PT64_ROOT_4LEVEL && - vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL && - !vcpu->arch.mmu->root_role.direct) - iterator->level = PT32E_ROOT_LEVEL; - - if (iterator->level == PT32E_ROOT_LEVEL) { - /* - * prev_root is currently only used for 64-bit hosts. So only - * the active root_hpa is valid here. - */ - BUG_ON(root != vcpu->arch.mmu->root.hpa); - - iterator->shadow_addr - = vcpu->arch.mmu->pae_root[(addr >> 30) & 3]; - iterator->shadow_addr &= SPTE_BASE_ADDR_MASK; - --iterator->level; - if (!iterator->shadow_addr) - iterator->level = 0; - } -} - -static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, - struct kvm_vcpu *vcpu, u64 addr) -{ - shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa, - addr); -} - -static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator) -{ - if (iterator->level < PG_LEVEL_4K) - return false; - - iterator->index = SPTE_INDEX(iterator->addr, iterator->level); - iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index; - return true; -} - -static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator, - u64 spte) -{ - if (!is_shadow_present_pte(spte) || is_last_spte(spte, iterator->level)) { - iterator->level = 0; - return; - } - - iterator->shadow_addr = spte & SPTE_BASE_ADDR_MASK; - --iterator->level; -} - -static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) -{ - __shadow_walk_next(iterator, *iterator->sptep); -} - -static void __link_shadow_page(struct kvm *kvm, - struct kvm_mmu_memory_cache *cache, u64 *sptep, - struct kvm_mmu_page *sp, bool flush) -{ - u64 spte; - - BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK); - - /* - * If an SPTE is present already, it must be a leaf and therefore - * a large one. Drop it, and flush the TLB if needed, before - * installing sp. - */ - if (is_shadow_present_pte(*sptep)) - drop_large_spte(kvm, sptep, flush); - - spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp)); - - mmu_spte_set(sptep, spte); - - mmu_page_add_parent_pte(cache, sp, sptep); - - if (sp->unsync_children || sp->unsync) - mark_unsync(sptep); -} - -static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, - struct kvm_mmu_page *sp) -{ - __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true); -} - -static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, - unsigned direct_access) -{ - if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) { - struct kvm_mmu_page *child; - - /* - * For the direct sp, if the guest pte's dirty bit - * changed form clean to dirty, it will corrupt the - * sp's access: allow writable in the read-only sp, - * so we should update the spte at this point to get - * a new sp with the correct access. - */ - child = spte_to_child_sp(*sptep); - if (child->role.access == direct_access) - return; - - drop_parent_pte(child, sptep); - kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1); - } -} - -/* Returns the number of zapped non-leaf child shadow pages. */ -static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, - u64 *spte, struct list_head *invalid_list) -{ - u64 pte; - struct kvm_mmu_page *child; - - pte = *spte; - if (is_shadow_present_pte(pte)) { - if (is_last_spte(pte, sp->role.level)) { - drop_spte(kvm, spte); - } else { - child = spte_to_child_sp(pte); - drop_parent_pte(child, spte); - - /* - * Recursively zap nested TDP SPs, parentless SPs are - * unlikely to be used again in the near future. This - * avoids retaining a large number of stale nested SPs. - */ - if (tdp_enabled && invalid_list && - child->role.guest_mode && !child->parent_ptes.val) - return kvm_mmu_prepare_zap_page(kvm, child, - invalid_list); - } - } else if (is_mmio_spte(pte)) { - mmu_spte_clear_no_track(spte); - } - return 0; -} - -static int kvm_mmu_page_unlink_children(struct kvm *kvm, - struct kvm_mmu_page *sp, - struct list_head *invalid_list) -{ - int zapped = 0; - unsigned i; - - for (i = 0; i < SPTE_ENT_PER_PAGE; ++i) - zapped += mmu_page_zap_pte(kvm, sp, sp->spt + i, invalid_list); - - return zapped; -} - -static void kvm_mmu_unlink_parents(struct kvm_mmu_page *sp) -{ - u64 *sptep; - struct rmap_iterator iter; - - while ((sptep = rmap_get_first(&sp->parent_ptes, &iter))) - drop_parent_pte(sp, sptep); -} - -static int mmu_zap_unsync_children(struct kvm *kvm, - struct kvm_mmu_page *parent, - struct list_head *invalid_list) -{ - int i, zapped = 0; - struct mmu_page_path parents; - struct kvm_mmu_pages pages; - - if (parent->role.level == PG_LEVEL_4K) - return 0; - - while (mmu_unsync_walk(parent, &pages)) { - struct kvm_mmu_page *sp; - - for_each_sp(pages, sp, parents, i) { - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); - mmu_pages_clear_parents(&parents); - zapped++; - } - } - - return zapped; -} - -static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, - struct kvm_mmu_page *sp, - struct list_head *invalid_list, - int *nr_zapped) -{ - bool list_unstable, zapped_root = false; - - trace_kvm_mmu_prepare_zap_page(sp); - ++kvm->stat.mmu_shadow_zapped; - *nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list); - *nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list); - kvm_mmu_unlink_parents(sp); - - /* Zapping children means active_mmu_pages has become unstable. */ - list_unstable = *nr_zapped; - - if (!sp->role.invalid && sp_has_gptes(sp)) - unaccount_shadowed(kvm, sp); - - if (sp->unsync) - kvm_unlink_unsync_page(kvm, sp); - if (!sp->root_count) { - /* Count self */ - (*nr_zapped)++; - - /* - * Already invalid pages (previously active roots) are not on - * the active page list. See list_del() in the "else" case of - * !sp->root_count. - */ - if (sp->role.invalid) - list_add(&sp->link, invalid_list); - else - list_move(&sp->link, invalid_list); - kvm_unaccount_mmu_page(kvm, sp); - } else { - /* - * Remove the active root from the active page list, the root - * will be explicitly freed when the root_count hits zero. - */ - list_del(&sp->link); - - /* - * Obsolete pages cannot be used on any vCPUs, see the comment - * in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also - * treats invalid shadow pages as being obsolete. - */ - zapped_root = !is_obsolete_sp(kvm, sp); - } - - if (sp->nx_huge_page_disallowed) - unaccount_nx_huge_page(kvm, sp); - - sp->role.invalid = 1; - - /* - * Make the request to free obsolete roots after marking the root - * invalid, otherwise other vCPUs may not see it as invalid. - */ - if (zapped_root) - kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS); - return list_unstable; -} - -static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, - struct list_head *invalid_list) -{ - int nr_zapped; - - __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped); - return nr_zapped; -} - -static void kvm_mmu_commit_zap_page(struct kvm *kvm, - struct list_head *invalid_list) -{ - struct kvm_mmu_page *sp, *nsp; - - if (list_empty(invalid_list)) - return; - - /* - * We need to make sure everyone sees our modifications to - * the page tables and see changes to vcpu->mode here. The barrier - * in the kvm_flush_remote_tlbs() achieves this. This pairs - * with vcpu_enter_guest and walk_shadow_page_lockless_begin/end. - * - * In addition, kvm_flush_remote_tlbs waits for all vcpus to exit - * guest mode and/or lockless shadow page table walks. - */ - kvm_flush_remote_tlbs(kvm); - - list_for_each_entry_safe(sp, nsp, invalid_list, link) { - WARN_ON(!sp->role.invalid || sp->root_count); - kvm_mmu_free_shadow_page(sp); - } -} - -static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm, - unsigned long nr_to_zap) -{ - unsigned long total_zapped = 0; - struct kvm_mmu_page *sp, *tmp; - LIST_HEAD(invalid_list); - bool unstable; - int nr_zapped; - - if (list_empty(&kvm->arch.active_mmu_pages)) - return 0; - -restart: - list_for_each_entry_safe_reverse(sp, tmp, &kvm->arch.active_mmu_pages, link) { - /* - * Don't zap active root pages, the page itself can't be freed - * and zapping it will just force vCPUs to realloc and reload. - */ - if (sp->root_count) - continue; - - unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, - &nr_zapped); - total_zapped += nr_zapped; - if (total_zapped >= nr_to_zap) - break; - - if (unstable) - goto restart; - } - - kvm_mmu_commit_zap_page(kvm, &invalid_list); - - kvm->stat.mmu_recycled += total_zapped; - return total_zapped; -} - -static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm) -{ - if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages) - return kvm->arch.n_max_mmu_pages - - kvm->arch.n_used_mmu_pages; - - return 0; -} - -static int make_mmu_pages_available(struct kvm_vcpu *vcpu) -{ - unsigned long avail = kvm_mmu_available_pages(vcpu->kvm); - - if (likely(avail >= KVM_MIN_FREE_MMU_PAGES)) - return 0; - - kvm_mmu_zap_oldest_mmu_pages(vcpu->kvm, KVM_REFILL_PAGES - avail); - - /* - * Note, this check is intentionally soft, it only guarantees that one - * page is available, while the caller may end up allocating as many as - * four pages, e.g. for PAE roots or for 5-level paging. Temporarily - * exceeding the (arbitrary by default) limit will not harm the host, - * being too aggressive may unnecessarily kill the guest, and getting an - * exact count is far more trouble than it's worth, especially in the - * page fault paths. - */ - if (!kvm_mmu_available_pages(vcpu->kvm)) - return -ENOSPC; - return 0; -} - -/* - * Changing the number of mmu pages allocated to the vm - * Note: if goal_nr_mmu_pages is too small, you will get dead lock - */ -void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages) -{ - write_lock(&kvm->mmu_lock); - - if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) { - kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages - - goal_nr_mmu_pages); - - goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages; - } - - kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages; - - write_unlock(&kvm->mmu_lock); -} - -int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) -{ - struct kvm_mmu_page *sp; - LIST_HEAD(invalid_list); - int r; - - pgprintk("%s: looking for gfn %llx\n", __func__, gfn); - r = 0; - write_lock(&kvm->mmu_lock); - for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) { - pgprintk("%s: gfn %llx role %x\n", __func__, gfn, - sp->role.word); - r = 1; - kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); - } - kvm_mmu_commit_zap_page(kvm, &invalid_list); - write_unlock(&kvm->mmu_lock); - - return r; -} - -static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) -{ - gpa_t gpa; - int r; - - if (vcpu->arch.mmu->root_role.direct) - return 0; - - gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL); - - r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT); - - return r; -} - -static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - trace_kvm_mmu_unsync_page(sp); - ++kvm->stat.mmu_unsync; - sp->unsync = 1; - - kvm_mmu_mark_parents_unsync(sp); -} - -/* - * Attempt to unsync any shadow pages that can be reached by the specified gfn, - * KVM is creating a writable mapping for said gfn. Returns 0 if all pages - * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must - * be write-protected. - */ -int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, - gfn_t gfn, bool can_unsync, bool prefetch) -{ - struct kvm_mmu_page *sp; - bool locked = false; - - /* - * Force write-protection if the page is being tracked. Note, the page - * track machinery is used to write-protect upper-level shadow pages, - * i.e. this guards the role.level == 4K assertion below! - */ - if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE)) - return -EPERM; - - /* - * The page is not write-tracked, mark existing shadow pages unsync - * unless KVM is synchronizing an unsync SP (can_unsync = false). In - * that case, KVM must complete emulation of the guest TLB flush before - * allowing shadow pages to become unsync (writable by the guest). - */ - for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) { - if (!can_unsync) - return -EPERM; - - if (sp->unsync) - continue; - - if (prefetch) - return -EEXIST; - - /* - * TDP MMU page faults require an additional spinlock as they - * run with mmu_lock held for read, not write, and the unsync - * logic is not thread safe. Take the spinklock regardless of - * the MMU type to avoid extra conditionals/parameters, there's - * no meaningful penalty if mmu_lock is held for write. - */ - if (!locked) { - locked = true; - spin_lock(&kvm->arch.mmu_unsync_pages_lock); - - /* - * Recheck after taking the spinlock, a different vCPU - * may have since marked the page unsync. A false - * positive on the unprotected check above is not - * possible as clearing sp->unsync _must_ hold mmu_lock - * for write, i.e. unsync cannot transition from 0->1 - * while this CPU holds mmu_lock for read (or write). - */ - if (READ_ONCE(sp->unsync)) - continue; - } - - WARN_ON(sp->role.level != PG_LEVEL_4K); - kvm_unsync_page(kvm, sp); - } - if (locked) - spin_unlock(&kvm->arch.mmu_unsync_pages_lock); - - /* - * We need to ensure that the marking of unsync pages is visible - * before the SPTE is updated to allow writes because - * kvm_mmu_sync_roots() checks the unsync flags without holding - * the MMU lock and so can race with this. If the SPTE was updated - * before the page had been marked as unsync-ed, something like the - * following could happen: - * - * CPU 1 CPU 2 - * --------------------------------------------------------------------- - * 1.2 Host updates SPTE - * to be writable - * 2.1 Guest writes a GPTE for GVA X. - * (GPTE being in the guest page table shadowed - * by the SP from CPU 1.) - * This reads SPTE during the page table walk. - * Since SPTE.W is read as 1, there is no - * fault. - * - * 2.2 Guest issues TLB flush. - * That causes a VM Exit. - * - * 2.3 Walking of unsync pages sees sp->unsync is - * false and skips the page. - * - * 2.4 Guest accesses GVA X. - * Since the mapping in the SP was not updated, - * so the old mapping for GVA X incorrectly - * gets used. - * 1.1 Host marks SP - * as unsync - * (sp->unsync = true) - * - * The write barrier below ensures that 1.1 happens before 1.2 and thus - * the situation in 2.4 does not arise. It pairs with the read barrier - * in is_unsync_root(), placed between 2.1's load of SPTE.W and 2.3. - */ - smp_wmb(); - - return 0; -} - -static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, - u64 *sptep, unsigned int pte_access, gfn_t gfn, - kvm_pfn_t pfn, struct kvm_page_fault *fault) -{ - struct kvm_mmu_page *sp = sptep_to_sp(sptep); - int level = sp->role.level; - int was_rmapped = 0; - int ret = RET_PF_FIXED; - bool flush = false; - bool wrprot; - u64 spte; - - /* Prefetching always gets a writable pfn. */ - bool host_writable = !fault || fault->map_writable; - bool prefetch = !fault || fault->prefetch; - bool write_fault = fault && fault->write; - - pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__, - *sptep, write_fault, gfn); - - if (unlikely(is_noslot_pfn(pfn))) { - vcpu->stat.pf_mmio_spte_created++; - mark_mmio_spte(vcpu, sptep, gfn, pte_access); - return RET_PF_EMULATE; - } - - if (is_shadow_present_pte(*sptep)) { - /* - * If we overwrite a PTE page pointer with a 2MB PMD, unlink - * the parent of the now unreachable PTE. - */ - if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) { - struct kvm_mmu_page *child; - u64 pte = *sptep; - - child = spte_to_child_sp(pte); - drop_parent_pte(child, sptep); - flush = true; - } else if (pfn != spte_to_pfn(*sptep)) { - pgprintk("hfn old %llx new %llx\n", - spte_to_pfn(*sptep), pfn); - drop_spte(vcpu->kvm, sptep); - flush = true; - } else - was_rmapped = 1; - } - - wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch, - true, host_writable, &spte); - - if (*sptep == spte) { - ret = RET_PF_SPURIOUS; - } else { - flush |= mmu_spte_update(sptep, spte); - trace_kvm_mmu_set_spte(level, gfn, sptep); - } - - if (wrprot) { - if (write_fault) - ret = RET_PF_EMULATE; - } - - if (flush) - kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, - KVM_PAGES_PER_HPAGE(level)); - - pgprintk("%s: setting spte %llx\n", __func__, *sptep); - - if (!was_rmapped) { - WARN_ON_ONCE(ret == RET_PF_SPURIOUS); - rmap_add(vcpu, slot, sptep, gfn, pte_access); - } else { - /* Already rmapped but the pte_access bits may have changed. */ - kvm_mmu_page_set_access(sp, spte_index(sptep), pte_access); - } - - return ret; -} - -static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, - struct kvm_mmu_page *sp, - u64 *start, u64 *end) -{ - struct page *pages[PTE_PREFETCH_NUM]; - struct kvm_memory_slot *slot; - unsigned int access = sp->role.access; - int i, ret; - gfn_t gfn; - - gfn = kvm_mmu_page_get_gfn(sp, spte_index(start)); - slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK); - if (!slot) - return -1; - - ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start); - if (ret <= 0) - return -1; - - for (i = 0; i < ret; i++, gfn++, start++) { - mmu_set_spte(vcpu, slot, start, access, gfn, - page_to_pfn(pages[i]), NULL); - put_page(pages[i]); - } - - return 0; -} - -static void __direct_pte_prefetch(struct kvm_vcpu *vcpu, - struct kvm_mmu_page *sp, u64 *sptep) -{ - u64 *spte, *start = NULL; - int i; - - WARN_ON(!sp->role.direct); - - i = spte_index(sptep) & ~(PTE_PREFETCH_NUM - 1); - spte = sp->spt + i; - - for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) { - if (is_shadow_present_pte(*spte) || spte == sptep) { - if (!start) - continue; - if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0) - return; - start = NULL; - } else if (!start) - start = spte; - } - if (start) - direct_pte_prefetch_many(vcpu, sp, start, spte); -} - -static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) -{ - struct kvm_mmu_page *sp; - - sp = sptep_to_sp(sptep); - - /* - * Without accessed bits, there's no way to distinguish between - * actually accessed translations and prefetched, so disable pte - * prefetch if accessed bits aren't available. - */ - if (sp_ad_disabled(sp)) - return; - - if (sp->role.level > PG_LEVEL_4K) - return; - - /* - * If addresses are being invalidated, skip prefetching to avoid - * accidentally prefetching those addresses. - */ - if (unlikely(vcpu->kvm->mmu_invalidate_in_progress)) - return; - - __direct_pte_prefetch(vcpu, sp, sptep); -} - -/* - * Lookup the mapping level for @gfn in the current mm. - * - * WARNING! Use of host_pfn_mapping_level() requires the caller and the end - * consumer to be tied into KVM's handlers for MMU notifier events! - * - * There are several ways to safely use this helper: - * - * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before - * consuming it. In this case, mmu_lock doesn't need to be held during the - * lookup, but it does need to be held while checking the MMU notifier. - * - * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation - * event for the hva. This can be done by explicit checking the MMU notifier - * or by ensuring that KVM already has a valid mapping that covers the hva. - * - * - Do not use the result to install new mappings, e.g. use the host mapping - * level only to decide whether or not to zap an entry. In this case, it's - * not required to hold mmu_lock (though it's highly likely the caller will - * want to hold mmu_lock anyways, e.g. to modify SPTEs). - * - * Note! The lookup can still race with modifications to host page tables, but - * the above "rules" ensure KVM will not _consume_ the result of the walk if a - * race with the primary MMU occurs. - */ -static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, - const struct kvm_memory_slot *slot) -{ - int level = PG_LEVEL_4K; - unsigned long hva; - unsigned long flags; - pgd_t pgd; - p4d_t p4d; - pud_t pud; - pmd_t pmd; - - /* - * Note, using the already-retrieved memslot and __gfn_to_hva_memslot() - * is not solely for performance, it's also necessary to avoid the - * "writable" check in __gfn_to_hva_many(), which will always fail on - * read-only memslots due to gfn_to_hva() assuming writes. Earlier - * page fault steps have already verified the guest isn't writing a - * read-only memslot. - */ - hva = __gfn_to_hva_memslot(slot, gfn); - - /* - * Disable IRQs to prevent concurrent tear down of host page tables, - * e.g. if the primary MMU promotes a P*D to a huge page and then frees - * the original page table. - */ - local_irq_save(flags); - - /* - * Read each entry once. As above, a non-leaf entry can be promoted to - * a huge page _during_ this walk. Re-reading the entry could send the - * walk into the weeks, e.g. p*d_large() returns false (sees the old - * value) and then p*d_offset() walks into the target huge page instead - * of the old page table (sees the new value). - */ - pgd = READ_ONCE(*pgd_offset(kvm->mm, hva)); - if (pgd_none(pgd)) - goto out; - - p4d = READ_ONCE(*p4d_offset(&pgd, hva)); - if (p4d_none(p4d) || !p4d_present(p4d)) - goto out; - - pud = READ_ONCE(*pud_offset(&p4d, hva)); - if (pud_none(pud) || !pud_present(pud)) - goto out; - - if (pud_large(pud)) { - level = PG_LEVEL_1G; - goto out; - } - - pmd = READ_ONCE(*pmd_offset(&pud, hva)); - if (pmd_none(pmd) || !pmd_present(pmd)) - goto out; - - if (pmd_large(pmd)) - level = PG_LEVEL_2M; - -out: - local_irq_restore(flags); - return level; -} - -int kvm_mmu_max_mapping_level(struct kvm *kvm, - const struct kvm_memory_slot *slot, gfn_t gfn, - int max_level) -{ - struct kvm_lpage_info *linfo; - int host_level; - - max_level = min(max_level, max_huge_page_level); - for ( ; max_level > PG_LEVEL_4K; max_level--) { - linfo = lpage_info_slot(gfn, slot, max_level); - if (!linfo->disallow_lpage) - break; - } - - if (max_level == PG_LEVEL_4K) - return PG_LEVEL_4K; - - host_level = host_pfn_mapping_level(kvm, gfn, slot); - return min(host_level, max_level); -} - -void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) -{ - struct kvm_memory_slot *slot = fault->slot; - kvm_pfn_t mask; - - fault->huge_page_disallowed = fault->exec && fault->nx_huge_page_workaround_enabled; - - if (unlikely(fault->max_level == PG_LEVEL_4K)) - return; - - if (is_error_noslot_pfn(fault->pfn)) - return; - - if (kvm_slot_dirty_track_enabled(slot)) - return; - - /* - * Enforce the iTLB multihit workaround after capturing the requested - * level, which will be used to do precise, accurate accounting. - */ - fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, - fault->gfn, fault->max_level); - if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed) - return; - - /* - * mmu_invalidate_retry() was successful and mmu_lock is held, so - * the pmd can't be split from under us. - */ - fault->goal_level = fault->req_level; - mask = KVM_PAGES_PER_HPAGE(fault->goal_level) - 1; - VM_BUG_ON((fault->gfn & mask) != (fault->pfn & mask)); - fault->pfn &= ~mask; -} - -void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level) -{ - if (cur_level > PG_LEVEL_4K && - cur_level == fault->goal_level && - is_shadow_present_pte(spte) && - !is_large_pte(spte) && - spte_to_child_sp(spte)->nx_huge_page_disallowed) { - /* - * A small SPTE exists for this pfn, but FNAME(fetch) - * and __direct_map would like to create a large PTE - * instead: just force them to go down another level, - * patching back for them into pfn the next 9 bits of - * the address. - */ - u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) - - KVM_PAGES_PER_HPAGE(cur_level - 1); - fault->pfn |= fault->gfn & page_mask; - fault->goal_level--; - } -} - -static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) -{ - struct kvm_shadow_walk_iterator it; - struct kvm_mmu_page *sp; - int ret; - gfn_t base_gfn = fault->gfn; - - kvm_mmu_hugepage_adjust(vcpu, fault); - - trace_kvm_mmu_spte_requested(fault); - for_each_shadow_entry(vcpu, fault->addr, it) { - /* - * We cannot overwrite existing page tables with an NX - * large page, as the leaf could be executable. - */ - if (fault->nx_huge_page_workaround_enabled) - disallowed_hugepage_adjust(fault, *it.sptep, it.level); - - base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1); - if (it.level == fault->goal_level) - break; - - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL); - if (sp == ERR_PTR(-EEXIST)) - continue; - - link_shadow_page(vcpu, it.sptep, sp); - if (fault->huge_page_disallowed) - account_nx_huge_page(vcpu->kvm, sp, - fault->req_level >= it.level); - } - - if (WARN_ON_ONCE(it.level != fault->goal_level)) - return -EFAULT; - - ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL, - base_gfn, fault->pfn, fault); - if (ret == RET_PF_SPURIOUS) - return ret; - - direct_pte_prefetch(vcpu, it.sptep); - return ret; -} - -static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk) -{ - send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT, tsk); -} - -static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) -{ - if (is_sigpending_pfn(pfn)) { - kvm_handle_signal_exit(vcpu); - return -EINTR; - } - - /* - * Do not cache the mmio info caused by writing the readonly gfn - * into the spte otherwise read access on readonly gfn also can - * caused mmio page fault and treat it as mmio access. - */ - if (pfn == KVM_PFN_ERR_RO_FAULT) - return RET_PF_EMULATE; - - if (pfn == KVM_PFN_ERR_HWPOISON) { - kvm_send_hwpoison_signal(kvm_vcpu_gfn_to_hva(vcpu, gfn), current); - return RET_PF_RETRY; - } - - return -EFAULT; -} - -static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, - unsigned int access) -{ - /* The pfn is invalid, report the error! */ - if (unlikely(is_error_pfn(fault->pfn))) - return kvm_handle_error_pfn(vcpu, fault->gfn, fault->pfn); - - if (unlikely(!fault->slot)) { - gva_t gva = fault->is_tdp ? 0 : fault->addr; - - vcpu_cache_mmio_info(vcpu, gva, fault->gfn, - access & shadow_mmio_access_mask); - /* - * If MMIO caching is disabled, emulate immediately without - * touching the shadow page tables as attempting to install an - * MMIO SPTE will just be an expensive nop. Do not cache MMIO - * whose gfn is greater than host.MAXPHYADDR, any guest that - * generates such gfns is running nested and is being tricked - * by L0 userspace (you can observe gfn > L1.MAXPHYADDR if - * and only if L1's MAXPHYADDR is inaccurate with respect to - * the hardware's). - */ - if (unlikely(!enable_mmio_caching) || - unlikely(fault->gfn > kvm_mmu_max_gfn())) - return RET_PF_EMULATE; - } - - return RET_PF_CONTINUE; -} - -static bool page_fault_can_be_fast(struct kvm_page_fault *fault) -{ - /* - * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only - * reach the common page fault handler if the SPTE has an invalid MMIO - * generation number. Refreshing the MMIO generation needs to go down - * the slow path. Note, EPT Misconfigs do NOT set the PRESENT flag! - */ - if (fault->rsvd) - return false; - - /* - * #PF can be fast if: - * - * 1. The shadow page table entry is not present and A/D bits are - * disabled _by KVM_, which could mean that the fault is potentially - * caused by access tracking (if enabled). If A/D bits are enabled - * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D - * bits for L2 and employ access tracking, but the fast page fault - * mechanism only supports direct MMUs. - * 2. The shadow page table entry is present, the access is a write, - * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e. - * the fault was caused by a write-protection violation. If the - * SPTE is MMU-writable (determined later), the fault can be fixed - * by setting the Writable bit, which can be done out of mmu_lock. - */ - if (!fault->present) - return !kvm_ad_enabled(); - - /* - * Note, instruction fetches and writes are mutually exclusive, ignore - * the "exec" flag. - */ - return fault->write; -} - -/* - * Returns true if the SPTE was fixed successfully. Otherwise, - * someone else modified the SPTE from its original value. - */ -static bool -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, - u64 *sptep, u64 old_spte, u64 new_spte) -{ - /* - * Theoretically we could also set dirty bit (and flush TLB) here in - * order to eliminate unnecessary PML logging. See comments in - * set_spte. But fast_page_fault is very unlikely to happen with PML - * enabled, so we do not do this. This might result in the same GPA - * to be logged in PML buffer again when the write really happens, and - * eventually to be called by mark_page_dirty twice. But it's also no - * harm. This also avoids the TLB flush needed after setting dirty bit - * so non-PML cases won't be impacted. - * - * Compare with set_spte where instead shadow_dirty_mask is set. - */ - if (!try_cmpxchg64(sptep, &old_spte, new_spte)) - return false; - - if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) - mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn); - - return true; -} - -static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte) -{ - if (fault->exec) - return is_executable_pte(spte); - - if (fault->write) - return is_writable_pte(spte); - - /* Fault was on Read access */ - return spte & PT_PRESENT_MASK; -} - -/* - * Returns the last level spte pointer of the shadow page walk for the given - * gpa, and sets *spte to the spte value. This spte may be non-preset. If no - * walk could be performed, returns NULL and *spte does not contain valid data. - * - * Contract: - * - Must be called between walk_shadow_page_lockless_{begin,end}. - * - The returned sptep must not be used after walk_shadow_page_lockless_end. - */ -static u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte) -{ - struct kvm_shadow_walk_iterator iterator; - u64 old_spte; - u64 *sptep = NULL; - - for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) { - sptep = iterator.sptep; - *spte = old_spte; - } - - return sptep; -} - -/* - * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS. - */ -static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) -{ - struct kvm_mmu_page *sp; - int ret = RET_PF_INVALID; - u64 spte = 0ull; - u64 *sptep = NULL; - uint retry_count = 0; - - if (!page_fault_can_be_fast(fault)) - return ret; - - walk_shadow_page_lockless_begin(vcpu); - - do { - u64 new_spte; - - if (is_tdp_mmu(vcpu->arch.mmu)) - sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte); - else - sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte); - - if (!is_shadow_present_pte(spte)) - break; - - sp = sptep_to_sp(sptep); - if (!is_last_spte(spte, sp->role.level)) - break; - - /* - * Check whether the memory access that caused the fault would - * still cause it if it were to be performed right now. If not, - * then this is a spurious fault caused by TLB lazily flushed, - * or some other CPU has already fixed the PTE after the - * current CPU took the fault. - * - * Need not check the access of upper level table entries since - * they are always ACC_ALL. - */ - if (is_access_allowed(fault, spte)) { - ret = RET_PF_SPURIOUS; - break; - } - - new_spte = spte; - - /* - * KVM only supports fixing page faults outside of MMU lock for - * direct MMUs, nested MMUs are always indirect, and KVM always - * uses A/D bits for non-nested MMUs. Thus, if A/D bits are - * enabled, the SPTE can't be an access-tracked SPTE. - */ - if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte)) - new_spte = restore_acc_track_spte(new_spte); - - /* - * To keep things simple, only SPTEs that are MMU-writable can - * be made fully writable outside of mmu_lock, e.g. only SPTEs - * that were write-protected for dirty-logging or access - * tracking are handled here. Don't bother checking if the - * SPTE is writable to prioritize running with A/D bits enabled. - * The is_access_allowed() check above handles the common case - * of the fault being spurious, and the SPTE is known to be - * shadow-present, i.e. except for access tracking restoration - * making the new SPTE writable, the check is wasteful. - */ - if (fault->write && is_mmu_writable_spte(spte)) { - new_spte |= PT_WRITABLE_MASK; - - /* - * Do not fix write-permission on the large spte when - * dirty logging is enabled. Since we only dirty the - * first page into the dirty-bitmap in - * fast_pf_fix_direct_spte(), other pages are missed - * if its slot has dirty logging enabled. - * - * Instead, we let the slow page fault path create a - * normal spte to fix the access. - */ - if (sp->role.level > PG_LEVEL_4K && - kvm_slot_dirty_track_enabled(fault->slot)) - break; - } - - /* Verify that the fault can be handled in the fast path */ - if (new_spte == spte || - !is_access_allowed(fault, new_spte)) - break; - - /* - * Currently, fast page fault only works for direct mapping - * since the gfn is not stable for indirect shadow page. See - * Documentation/virt/kvm/locking.rst to get more detail. - */ - if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte, new_spte)) { - ret = RET_PF_FIXED; - break; - } - - if (++retry_count > 4) { - printk_once(KERN_WARNING - "kvm: Fast #PF retrying more than 4 times.\n"); - break; - } - - } while (true); - - trace_fast_page_fault(vcpu, fault, sptep, spte, ret); - walk_shadow_page_lockless_end(vcpu); - - if (ret != RET_PF_INVALID) - vcpu->stat.pf_fast++; - - return ret; -} - -static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, - struct list_head *invalid_list) -{ - struct kvm_mmu_page *sp; - - if (!VALID_PAGE(*root_hpa)) - return; - - /* - * The "root" may be a special root, e.g. a PAE entry, treat it as a - * SPTE to ensure any non-PA bits are dropped. - */ - sp = spte_to_child_sp(*root_hpa); - if (WARN_ON(!sp)) - return; - - if (is_tdp_mmu_page(sp)) - kvm_tdp_mmu_put_root(kvm, sp, false); - else if (!--sp->root_count && sp->role.invalid) - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); - - *root_hpa = INVALID_PAGE; -} - -/* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */ -void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu, - ulong roots_to_free) -{ - int i; - LIST_HEAD(invalid_list); - bool free_active_root; - - BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG); - - /* Before acquiring the MMU lock, see if we need to do any real work. */ - free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) - && VALID_PAGE(mmu->root.hpa); - - if (!free_active_root) { - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) - if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) && - VALID_PAGE(mmu->prev_roots[i].hpa)) - break; - - if (i == KVM_MMU_NUM_PREV_ROOTS) - return; - } - - write_lock(&kvm->mmu_lock); - - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) - if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) - mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa, - &invalid_list); - - if (free_active_root) { - if (to_shadow_page(mmu->root.hpa)) { - mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list); - } else if (mmu->pae_root) { - for (i = 0; i < 4; ++i) { - if (!IS_VALID_PAE_ROOT(mmu->pae_root[i])) - continue; - - mmu_free_root_page(kvm, &mmu->pae_root[i], - &invalid_list); - mmu->pae_root[i] = INVALID_PAE_ROOT; - } - } - mmu->root.hpa = INVALID_PAGE; - mmu->root.pgd = 0; - } - - kvm_mmu_commit_zap_page(kvm, &invalid_list); - write_unlock(&kvm->mmu_lock); -} -EXPORT_SYMBOL_GPL(kvm_mmu_free_roots); - -void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu) -{ - unsigned long roots_to_free = 0; - hpa_t root_hpa; - int i; - - /* - * This should not be called while L2 is active, L2 can't invalidate - * _only_ its own roots, e.g. INVVPID unconditionally exits. - */ - WARN_ON_ONCE(mmu->root_role.guest_mode); - - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { - root_hpa = mmu->prev_roots[i].hpa; - if (!VALID_PAGE(root_hpa)) - continue; - - if (!to_shadow_page(root_hpa) || - to_shadow_page(root_hpa)->role.guest_mode) - roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i); - } - - kvm_mmu_free_roots(kvm, mmu, roots_to_free); -} -EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots); - - -static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn) -{ - int ret = 0; - - if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) { - kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); - ret = 1; - } - - return ret; -} - -static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, - u8 level) -{ - union kvm_mmu_page_role role = vcpu->arch.mmu->root_role; - struct kvm_mmu_page *sp; - - role.level = level; - role.quadrant = quadrant; - - WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte); - WARN_ON_ONCE(role.direct && role.has_4_byte_gpte); - - sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); - ++sp->root_count; - - return __pa(sp->spt); -} - -static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) -{ - struct kvm_mmu *mmu = vcpu->arch.mmu; - u8 shadow_root_level = mmu->root_role.level; - hpa_t root; - unsigned i; - int r; - - write_lock(&vcpu->kvm->mmu_lock); - r = make_mmu_pages_available(vcpu); - if (r < 0) - goto out_unlock; - - if (is_tdp_mmu_enabled(vcpu->kvm)) { - root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); - mmu->root.hpa = root; - } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { - root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level); - mmu->root.hpa = root; - } else if (shadow_root_level == PT32E_ROOT_LEVEL) { - if (WARN_ON_ONCE(!mmu->pae_root)) { - r = -EIO; - goto out_unlock; - } - - for (i = 0; i < 4; ++i) { - WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); - - root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0, - PT32_ROOT_LEVEL); - mmu->pae_root[i] = root | PT_PRESENT_MASK | - shadow_me_value; - } - mmu->root.hpa = __pa(mmu->pae_root); - } else { - WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level); - r = -EIO; - goto out_unlock; - } - - /* root.pgd is ignored for direct MMUs. */ - mmu->root.pgd = 0; -out_unlock: - write_unlock(&vcpu->kvm->mmu_lock); - return r; -} - -static int mmu_first_shadow_root_alloc(struct kvm *kvm) -{ - struct kvm_memslots *slots; - struct kvm_memory_slot *slot; - int r = 0, i, bkt; - - /* - * Check if this is the first shadow root being allocated before - * taking the lock. - */ - if (kvm_shadow_root_allocated(kvm)) - return 0; - - mutex_lock(&kvm->slots_arch_lock); - - /* Recheck, under the lock, whether this is the first shadow root. */ - if (kvm_shadow_root_allocated(kvm)) - goto out_unlock; - - /* - * Check if anything actually needs to be allocated, e.g. all metadata - * will be allocated upfront if TDP is disabled. - */ - if (kvm_memslots_have_rmaps(kvm) && - kvm_page_track_write_tracking_enabled(kvm)) - goto out_success; - - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - slots = __kvm_memslots(kvm, i); - kvm_for_each_memslot(slot, bkt, slots) { - /* - * Both of these functions are no-ops if the target is - * already allocated, so unconditionally calling both - * is safe. Intentionally do NOT free allocations on - * failure to avoid having to track which allocations - * were made now versus when the memslot was created. - * The metadata is guaranteed to be freed when the slot - * is freed, and will be kept/used if userspace retries - * KVM_RUN instead of killing the VM. - */ - r = memslot_rmap_alloc(slot, slot->npages); - if (r) - goto out_unlock; - r = kvm_page_track_write_tracking_alloc(slot); - if (r) - goto out_unlock; - } - } - - /* - * Ensure that shadow_root_allocated becomes true strictly after - * all the related pointers are set. - */ -out_success: - smp_store_release(&kvm->arch.shadow_root_allocated, true); - -out_unlock: - mutex_unlock(&kvm->slots_arch_lock); - return r; -} - -static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) -{ - struct kvm_mmu *mmu = vcpu->arch.mmu; - u64 pdptrs[4], pm_mask; - gfn_t root_gfn, root_pgd; - int quadrant, i, r; - hpa_t root; - - root_pgd = mmu->get_guest_pgd(vcpu); - root_gfn = root_pgd >> PAGE_SHIFT; - - if (mmu_check_root(vcpu, root_gfn)) - return 1; - - /* - * On SVM, reading PDPTRs might access guest memory, which might fault - * and thus might sleep. Grab the PDPTRs before acquiring mmu_lock. - */ - if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) { - for (i = 0; i < 4; ++i) { - pdptrs[i] = mmu->get_pdptr(vcpu, i); - if (!(pdptrs[i] & PT_PRESENT_MASK)) - continue; - - if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT)) - return 1; - } - } - - r = mmu_first_shadow_root_alloc(vcpu->kvm); - if (r) - return r; - - write_lock(&vcpu->kvm->mmu_lock); - r = make_mmu_pages_available(vcpu); - if (r < 0) - goto out_unlock; - - /* - * Do we shadow a long mode page table? If so we need to - * write-protect the guests page table root. - */ - if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) { - root = mmu_alloc_root(vcpu, root_gfn, 0, - mmu->root_role.level); - mmu->root.hpa = root; - goto set_root_pgd; - } - - if (WARN_ON_ONCE(!mmu->pae_root)) { - r = -EIO; - goto out_unlock; - } - - /* - * We shadow a 32 bit page table. This may be a legacy 2-level - * or a PAE 3-level page table. In either case we need to be aware that - * the shadow page table may be a PAE or a long mode page table. - */ - pm_mask = PT_PRESENT_MASK | shadow_me_value; - if (mmu->root_role.level >= PT64_ROOT_4LEVEL) { - pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK; - - if (WARN_ON_ONCE(!mmu->pml4_root)) { - r = -EIO; - goto out_unlock; - } - mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask; - - if (mmu->root_role.level == PT64_ROOT_5LEVEL) { - if (WARN_ON_ONCE(!mmu->pml5_root)) { - r = -EIO; - goto out_unlock; - } - mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask; - } - } - - for (i = 0; i < 4; ++i) { - WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); - - if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) { - if (!(pdptrs[i] & PT_PRESENT_MASK)) { - mmu->pae_root[i] = INVALID_PAE_ROOT; - continue; - } - root_gfn = pdptrs[i] >> PAGE_SHIFT; - } - - /* - * If shadowing 32-bit non-PAE page tables, each PAE page - * directory maps one quarter of the guest's non-PAE page - * directory. Othwerise each PAE page direct shadows one guest - * PAE page directory so that quadrant should be 0. - */ - quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0; - - root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL); - mmu->pae_root[i] = root | pm_mask; - } - - if (mmu->root_role.level == PT64_ROOT_5LEVEL) - mmu->root.hpa = __pa(mmu->pml5_root); - else if (mmu->root_role.level == PT64_ROOT_4LEVEL) - mmu->root.hpa = __pa(mmu->pml4_root); - else - mmu->root.hpa = __pa(mmu->pae_root); - -set_root_pgd: - mmu->root.pgd = root_pgd; -out_unlock: - write_unlock(&vcpu->kvm->mmu_lock); - - return r; -} - -static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu) -{ - struct kvm_mmu *mmu = vcpu->arch.mmu; - bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL; - u64 *pml5_root = NULL; - u64 *pml4_root = NULL; - u64 *pae_root; - - /* - * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP - * tables are allocated and initialized at root creation as there is no - * equivalent level in the guest's NPT to shadow. Allocate the tables - * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare. - */ - if (mmu->root_role.direct || - mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL || - mmu->root_role.level < PT64_ROOT_4LEVEL) - return 0; - - /* - * NPT, the only paging mode that uses this horror, uses a fixed number - * of levels for the shadow page tables, e.g. all MMUs are 4-level or - * all MMus are 5-level. Thus, this can safely require that pml5_root - * is allocated if the other roots are valid and pml5 is needed, as any - * prior MMU would also have required pml5. - */ - if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root)) - return 0; - - /* - * The special roots should always be allocated in concert. Yell and - * bail if KVM ends up in a state where only one of the roots is valid. - */ - if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root || - (need_pml5 && mmu->pml5_root))) - return -EIO; - - /* - * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and - * doesn't need to be decrypted. - */ - pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); - if (!pae_root) - return -ENOMEM; - -#ifdef CONFIG_X86_64 - pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); - if (!pml4_root) - goto err_pml4; + if (++retry_count > 4) { + printk_once(KERN_WARNING + "kvm: Fast #PF retrying more than 4 times.\n"); + break; + } - if (need_pml5) { - pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); - if (!pml5_root) - goto err_pml5; - } -#endif + } while (true); - mmu->pae_root = pae_root; - mmu->pml4_root = pml4_root; - mmu->pml5_root = pml5_root; + trace_fast_page_fault(vcpu, fault, sptep, spte, ret); + walk_shadow_page_lockless_end(vcpu); - return 0; + if (ret != RET_PF_INVALID) + vcpu->stat.pf_fast++; -#ifdef CONFIG_X86_64 -err_pml5: - free_page((unsigned long)pml4_root); -err_pml4: - free_page((unsigned long)pae_root); - return -ENOMEM; -#endif + return ret; } -static bool is_unsync_root(hpa_t root) +static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, + struct list_head *invalid_list) { struct kvm_mmu_page *sp; - if (!VALID_PAGE(root)) - return false; - - /* - * The read barrier orders the CPU's read of SPTE.W during the page table - * walk before the reads of sp->unsync/sp->unsync_children here. - * - * Even if another CPU was marking the SP as unsync-ed simultaneously, - * any guest page table changes are not guaranteed to be visible anyway - * until this VCPU issues a TLB flush strictly after those changes are - * made. We only need to ensure that the other CPU sets these flags - * before any actual changes to the page tables are made. The comments - * in mmu_try_to_unsync_pages() describe what could go wrong if this - * requirement isn't satisfied. - */ - smp_rmb(); - sp = to_shadow_page(root); + if (!VALID_PAGE(*root_hpa)) + return; /* - * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the - * PDPTEs for a given PAE root need to be synchronized individually. + * The "root" may be a special root, e.g. a PAE entry, treat it as a + * SPTE to ensure any non-PA bits are dropped. */ - if (WARN_ON_ONCE(!sp)) - return false; + sp = spte_to_child_sp(*root_hpa); + if (WARN_ON(!sp)) + return; - if (sp->unsync || sp->unsync_children) - return true; + if (is_tdp_mmu_page(sp)) + kvm_tdp_mmu_put_root(kvm, sp, false); + else if (!--sp->root_count && sp->role.invalid) + kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); - return false; + *root_hpa = INVALID_PAGE; } -void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu) +/* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */ +void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu, + ulong roots_to_free) { int i; - struct kvm_mmu_page *sp; - - if (vcpu->arch.mmu->root_role.direct) - return; + LIST_HEAD(invalid_list); + bool free_active_root; - if (!VALID_PAGE(vcpu->arch.mmu->root.hpa)) - return; + BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG); - vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY); + /* Before acquiring the MMU lock, see if we need to do any real work. */ + free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT) + && VALID_PAGE(mmu->root.hpa); - if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) { - hpa_t root = vcpu->arch.mmu->root.hpa; - sp = to_shadow_page(root); + if (!free_active_root) { + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) + if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) && + VALID_PAGE(mmu->prev_roots[i].hpa)) + break; - if (!is_unsync_root(root)) + if (i == KVM_MMU_NUM_PREV_ROOTS) return; - - write_lock(&vcpu->kvm->mmu_lock); - mmu_sync_children(vcpu, sp, true); - write_unlock(&vcpu->kvm->mmu_lock); - return; } - write_lock(&vcpu->kvm->mmu_lock); + write_lock(&kvm->mmu_lock); + + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) + if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) + mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa, + &invalid_list); - for (i = 0; i < 4; ++i) { - hpa_t root = vcpu->arch.mmu->pae_root[i]; + if (free_active_root) { + if (to_shadow_page(mmu->root.hpa)) { + mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list); + } else if (mmu->pae_root) { + for (i = 0; i < 4; ++i) { + if (!IS_VALID_PAE_ROOT(mmu->pae_root[i])) + continue; - if (IS_VALID_PAE_ROOT(root)) { - sp = spte_to_child_sp(root); - mmu_sync_children(vcpu, sp, true); + mmu_free_root_page(kvm, &mmu->pae_root[i], + &invalid_list); + mmu->pae_root[i] = INVALID_PAE_ROOT; + } } + mmu->root.hpa = INVALID_PAGE; + mmu->root.pgd = 0; } - write_unlock(&vcpu->kvm->mmu_lock); + kvm_mmu_commit_zap_page(kvm, &invalid_list); + write_unlock(&kvm->mmu_lock); } +EXPORT_SYMBOL_GPL(kvm_mmu_free_roots); -void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu) +static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) { - unsigned long roots_to_free = 0; - int i; + struct kvm_mmu *mmu = vcpu->arch.mmu; + u8 shadow_root_level = mmu->root_role.level; + hpa_t root; + unsigned i; + int r; - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) - if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa)) - roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i); + write_lock(&vcpu->kvm->mmu_lock); + r = make_mmu_pages_available(vcpu); + if (r < 0) + goto out_unlock; + + if (is_tdp_mmu_enabled(vcpu->kvm)) { + root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); + mmu->root.hpa = root; + } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { + root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level); + mmu->root.hpa = root; + } else if (shadow_root_level == PT32E_ROOT_LEVEL) { + if (WARN_ON_ONCE(!mmu->pae_root)) { + r = -EIO; + goto out_unlock; + } + + for (i = 0; i < 4; ++i) { + WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); - /* sync prev_roots by simply freeing them */ - kvm_mmu_free_roots(vcpu->kvm, vcpu->arch.mmu, roots_to_free); + root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0, + PT32_ROOT_LEVEL); + mmu->pae_root[i] = root | PT_PRESENT_MASK | + shadow_me_value; + } + mmu->root.hpa = __pa(mmu->pae_root); + } else { + WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level); + r = -EIO; + goto out_unlock; + } + + /* root.pgd is ignored for direct MMUs. */ + mmu->root.pgd = 0; +out_unlock: + write_unlock(&vcpu->kvm->mmu_lock); + return r; } static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, @@ -3984,31 +1191,6 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct) return vcpu_match_mmio_gva(vcpu, addr); } -/* - * Return the level of the lowest level SPTE added to sptes. - * That SPTE may be non-present. - * - * Must be called between walk_shadow_page_lockless_{begin,end}. - */ -static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level) -{ - struct kvm_shadow_walk_iterator iterator; - int leaf = -1; - u64 spte; - - for (shadow_walk_init(&iterator, vcpu, addr), - *root_level = iterator.level; - shadow_walk_okay(&iterator); - __shadow_walk_next(&iterator, spte)) { - leaf = iterator.level; - spte = mmu_spte_get_lockless(iterator.sptep); - - sptes[leaf] = spte; - } - - return leaf; -} - /* return true if reserved bit(s) are detected on a valid, non-MMIO SPTE. */ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep) { @@ -4112,17 +1294,6 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu, return false; } -static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr) -{ - struct kvm_shadow_walk_iterator iterator; - u64 spte; - - walk_shadow_page_lockless_begin(vcpu); - for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) - clear_sp_write_flooding_count(iterator.sptep); - walk_shadow_page_lockless_end(vcpu); -} - static u32 alloc_apf_token(struct kvm_vcpu *vcpu) { /* make sure the token value is not 0 */ @@ -5305,264 +2476,65 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.nested_mmu.root_role.word = 0; vcpu->arch.root_mmu.cpu_role.ext.valid = 0; vcpu->arch.guest_mmu.cpu_role.ext.valid = 0; - vcpu->arch.nested_mmu.cpu_role.ext.valid = 0; - kvm_mmu_reset_context(vcpu); - - /* - * Changing guest CPUID after KVM_RUN is forbidden, see the comment in - * kvm_arch_vcpu_ioctl(). - */ - KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm); -} - -void kvm_mmu_reset_context(struct kvm_vcpu *vcpu) -{ - kvm_mmu_unload(vcpu); - kvm_init_mmu(vcpu); -} -EXPORT_SYMBOL_GPL(kvm_mmu_reset_context); - -int kvm_mmu_load(struct kvm_vcpu *vcpu) -{ - int r; - - r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct); - if (r) - goto out; - r = mmu_alloc_special_roots(vcpu); - if (r) - goto out; - if (vcpu->arch.mmu->root_role.direct) - r = mmu_alloc_direct_roots(vcpu); - else - r = mmu_alloc_shadow_roots(vcpu); - if (r) - goto out; - - kvm_mmu_sync_roots(vcpu); - - kvm_mmu_load_pgd(vcpu); - - /* - * Flush any TLB entries for the new root, the provenance of the root - * is unknown. Even if KVM ensures there are no stale TLB entries - * for a freed root, in theory another hypervisor could have left - * stale entries. Flushing on alloc also allows KVM to skip the TLB - * flush when freeing a root (see kvm_tdp_mmu_put_root()). - */ - static_call(kvm_x86_flush_tlb_current)(vcpu); -out: - return r; -} - -void kvm_mmu_unload(struct kvm_vcpu *vcpu) -{ - struct kvm *kvm = vcpu->kvm; - - kvm_mmu_free_roots(kvm, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL); - WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root.hpa)); - kvm_mmu_free_roots(kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL); - WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root.hpa)); - vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY); -} - -static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa) -{ - struct kvm_mmu_page *sp; - - if (!VALID_PAGE(root_hpa)) - return false; - - /* - * When freeing obsolete roots, treat roots as obsolete if they don't - * have an associated shadow page. This does mean KVM will get false - * positives and free roots that don't strictly need to be freed, but - * such false positives are relatively rare: - * - * (a) only PAE paging and nested NPT has roots without shadow pages - * (b) remote reloads due to a memslot update obsoletes _all_ roots - * (c) KVM doesn't track previous roots for PAE paging, and the guest - * is unlikely to zap an in-use PGD. - */ - sp = to_shadow_page(root_hpa); - return !sp || is_obsolete_sp(kvm, sp); -} - -static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu) -{ - unsigned long roots_to_free = 0; - int i; - - if (is_obsolete_root(kvm, mmu->root.hpa)) - roots_to_free |= KVM_MMU_ROOT_CURRENT; - - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { - if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa)) - roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i); - } - - if (roots_to_free) - kvm_mmu_free_roots(kvm, mmu, roots_to_free); -} - -void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu) -{ - __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu); - __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu); -} - -static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa, - int *bytes) -{ - u64 gentry = 0; - int r; - - /* - * Assume that the pte write on a page table of the same type - * as the current vcpu paging mode since we update the sptes only - * when they have the same mode. - */ - if (is_pae(vcpu) && *bytes == 4) { - /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ - *gpa &= ~(gpa_t)7; - *bytes = 8; - } - - if (*bytes == 4 || *bytes == 8) { - r = kvm_vcpu_read_guest_atomic(vcpu, *gpa, &gentry, *bytes); - if (r) - gentry = 0; - } - - return gentry; -} - -/* - * If we're seeing too many writes to a page, it may no longer be a page table, - * or we may be forking, in which case it is better to unmap the page. - */ -static bool detect_write_flooding(struct kvm_mmu_page *sp) -{ - /* - * Skip write-flooding detected for the sp whose level is 1, because - * it can become unsync, then the guest page is not write-protected. - */ - if (sp->role.level == PG_LEVEL_4K) - return false; - - atomic_inc(&sp->write_flooding_count); - return atomic_read(&sp->write_flooding_count) >= 3; -} - -/* - * Misaligned accesses are too much trouble to fix up; also, they usually - * indicate a page is not used as a page table. - */ -static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa, - int bytes) -{ - unsigned offset, pte_size, misaligned; - - pgprintk("misaligned: gpa %llx bytes %d role %x\n", - gpa, bytes, sp->role.word); - - offset = offset_in_page(gpa); - pte_size = sp->role.has_4_byte_gpte ? 4 : 8; + vcpu->arch.nested_mmu.cpu_role.ext.valid = 0; + kvm_mmu_reset_context(vcpu); /* - * Sometimes, the OS only writes the last one bytes to update status - * bits, for example, in linux, andb instruction is used in clear_bit(). + * Changing guest CPUID after KVM_RUN is forbidden, see the comment in + * kvm_arch_vcpu_ioctl(). */ - if (!(offset & (pte_size - 1)) && bytes == 1) - return false; - - misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1); - misaligned |= bytes < 4; - - return misaligned; + KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm); } -static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte) +void kvm_mmu_reset_context(struct kvm_vcpu *vcpu) { - unsigned page_offset, quadrant; - u64 *spte; - int level; - - page_offset = offset_in_page(gpa); - level = sp->role.level; - *nspte = 1; - if (sp->role.has_4_byte_gpte) { - page_offset <<= 1; /* 32->64 */ - /* - * A 32-bit pde maps 4MB while the shadow pdes map - * only 2MB. So we need to double the offset again - * and zap two pdes instead of one. - */ - if (level == PT32_ROOT_LEVEL) { - page_offset &= ~7; /* kill rounding error */ - page_offset <<= 1; - *nspte = 2; - } - quadrant = page_offset >> PAGE_SHIFT; - page_offset &= ~PAGE_MASK; - if (quadrant != sp->role.quadrant) - return NULL; - } - - spte = &sp->spt[page_offset / sizeof(*spte)]; - return spte; + kvm_mmu_unload(vcpu); + kvm_init_mmu(vcpu); } +EXPORT_SYMBOL_GPL(kvm_mmu_reset_context); -static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *new, int bytes, - struct kvm_page_track_notifier_node *node) +int kvm_mmu_load(struct kvm_vcpu *vcpu) { - gfn_t gfn = gpa >> PAGE_SHIFT; - struct kvm_mmu_page *sp; - LIST_HEAD(invalid_list); - u64 entry, gentry, *spte; - int npte; - bool flush = false; - - /* - * If we don't have indirect shadow pages, it means no page is - * write-protected, so we can exit simply. - */ - if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) - return; - - pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); + int r; - write_lock(&vcpu->kvm->mmu_lock); + r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct); + if (r) + goto out; + r = mmu_alloc_special_roots(vcpu); + if (r) + goto out; + if (vcpu->arch.mmu->root_role.direct) + r = mmu_alloc_direct_roots(vcpu); + else + r = mmu_alloc_shadow_roots(vcpu); + if (r) + goto out; - gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes); + kvm_mmu_sync_roots(vcpu); - ++vcpu->kvm->stat.mmu_pte_write; + kvm_mmu_load_pgd(vcpu); - for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) { - if (detect_write_misaligned(sp, gpa, bytes) || - detect_write_flooding(sp)) { - kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list); - ++vcpu->kvm->stat.mmu_flooded; - continue; - } + /* + * Flush any TLB entries for the new root, the provenance of the root + * is unknown. Even if KVM ensures there are no stale TLB entries + * for a freed root, in theory another hypervisor could have left + * stale entries. Flushing on alloc also allows KVM to skip the TLB + * flush when freeing a root (see kvm_tdp_mmu_put_root()). + */ + static_call(kvm_x86_flush_tlb_current)(vcpu); +out: + return r; +} - spte = get_written_sptes(sp, gpa, &npte); - if (!spte) - continue; +void kvm_mmu_unload(struct kvm_vcpu *vcpu) +{ + struct kvm *kvm = vcpu->kvm; - while (npte--) { - entry = *spte; - mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL); - if (gentry && sp->role.level != PG_LEVEL_4K) - ++vcpu->kvm->stat.mmu_pde_zapped; - if (is_shadow_present_pte(entry)) - flush = true; - ++spte; - } - } - kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush); - write_unlock(&vcpu->kvm->mmu_lock); + kvm_mmu_free_roots(kvm, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL); + WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root.hpa)); + kvm_mmu_free_roots(kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL); + WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root.hpa)); + vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY); } int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, @@ -5728,58 +2700,6 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, } EXPORT_SYMBOL_GPL(kvm_configure_mmu); -/* The return value indicates if tlb flush on all vcpus is needed. */ -typedef bool (*slot_level_handler) (struct kvm *kvm, - struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot); - -/* The caller should hold mmu-lock before calling this function. */ -static __always_inline bool -slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot, - slot_level_handler fn, int start_level, int end_level, - gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield, - bool flush) -{ - struct slot_rmap_walk_iterator iterator; - - for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn, - end_gfn, &iterator) { - if (iterator.rmap) - flush |= fn(kvm, iterator.rmap, memslot); - - if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { - if (flush && flush_on_yield) { - kvm_flush_remote_tlbs_with_address(kvm, - start_gfn, - iterator.gfn - start_gfn + 1); - flush = false; - } - cond_resched_rwlock_write(&kvm->mmu_lock); - } - } - - return flush; -} - -static __always_inline bool -slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot, - slot_level_handler fn, int start_level, int end_level, - bool flush_on_yield) -{ - return slot_handle_level_range(kvm, memslot, fn, start_level, - end_level, memslot->base_gfn, - memslot->base_gfn + memslot->npages - 1, - flush_on_yield, false); -} - -static __always_inline bool -slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot, - slot_level_handler fn, bool flush_on_yield) -{ - return slot_handle_level(kvm, memslot, fn, PG_LEVEL_4K, - PG_LEVEL_4K, flush_on_yield); -} - static void free_mmu_pages(struct kvm_mmu *mmu) { if (!tdp_enabled && mmu->pae_root) @@ -5871,63 +2791,6 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu) return ret; } -#define BATCH_ZAP_PAGES 10 -static void kvm_zap_obsolete_pages(struct kvm *kvm) -{ - struct kvm_mmu_page *sp, *node; - int nr_zapped, batch = 0; - bool unstable; - -restart: - list_for_each_entry_safe_reverse(sp, node, - &kvm->arch.active_mmu_pages, link) { - /* - * No obsolete valid page exists before a newly created page - * since active_mmu_pages is a FIFO list. - */ - if (!is_obsolete_sp(kvm, sp)) - break; - - /* - * Invalid pages should never land back on the list of active - * pages. Skip the bogus page, otherwise we'll get stuck in an - * infinite loop if the page gets put back on the list (again). - */ - if (WARN_ON(sp->role.invalid)) - continue; - - /* - * No need to flush the TLB since we're only zapping shadow - * pages with an obsolete generation number and all vCPUS have - * loaded a new root, i.e. the shadow pages being zapped cannot - * be in active use by the guest. - */ - if (batch >= BATCH_ZAP_PAGES && - cond_resched_rwlock_write(&kvm->mmu_lock)) { - batch = 0; - goto restart; - } - - unstable = __kvm_mmu_prepare_zap_page(kvm, sp, - &kvm->arch.zapped_obsolete_pages, &nr_zapped); - batch += nr_zapped; - - if (unstable) - goto restart; - } - - /* - * Kick all vCPUs (via remote TLB flush) before freeing the page tables - * to ensure KVM is not in the middle of a lockless shadow page table - * walk, which may reference the pages. The remote TLB flush itself is - * not required and is simply a convenient way to kick vCPUs as needed. - * KVM performs a local TLB flush when allocating a new root (see - * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are - * running with an obsolete MMU. - */ - kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); -} - /* * Fast invalidate all shadow pages and use lock-break technique * to zap obsolete pages. @@ -5988,11 +2851,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm) kvm_tdp_mmu_zap_invalidated_roots(kvm); } -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm) -{ - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); -} - static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, struct kvm_page_track_notifier_node *node) @@ -6047,37 +2905,6 @@ void kvm_mmu_uninit_vm(struct kvm *kvm) mmu_free_vm_memory_caches(kvm); } -static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) -{ - const struct kvm_memory_slot *memslot; - struct kvm_memslots *slots; - struct kvm_memslot_iter iter; - bool flush = false; - gfn_t start, end; - int i; - - if (!kvm_memslots_have_rmaps(kvm)) - return flush; - - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - slots = __kvm_memslots(kvm, i); - - kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) { - memslot = iter.slot; - start = max(gfn_start, memslot->base_gfn); - end = min(gfn_end, memslot->base_gfn + memslot->npages); - if (WARN_ON_ONCE(start >= end)) - continue; - - flush = slot_handle_level_range(kvm, memslot, __kvm_zap_rmap, - PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, - start, end - 1, true, flush); - } - } - - return flush; -} - /* * Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end * (not including it) @@ -6111,13 +2938,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) write_unlock(&kvm->mmu_lock); } -static bool slot_rmap_write_protect(struct kvm *kvm, - struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot) -{ - return rmap_write_protect(rmap_head, false); -} - void kvm_mmu_slot_remove_write_access(struct kvm *kvm, const struct kvm_memory_slot *memslot, int start_level) @@ -6189,183 +3009,6 @@ int topup_split_caches(struct kvm *kvm) return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1); } -static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep) -{ - struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep); - struct shadow_page_caches caches = {}; - union kvm_mmu_page_role role; - unsigned int access; - gfn_t gfn; - - gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep)); - access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep)); - - /* - * Note, huge page splitting always uses direct shadow pages, regardless - * of whether the huge page itself is mapped by a direct or indirect - * shadow page, since the huge page region itself is being directly - * mapped with smaller pages. - */ - role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access); - - /* Direct SPs do not require a shadowed_info_cache. */ - caches.page_header_cache = &kvm->arch.split_page_header_cache; - caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache; - - /* Safe to pass NULL for vCPU since requesting a direct SP. */ - return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role); -} - -static void shadow_mmu_split_huge_page(struct kvm *kvm, - const struct kvm_memory_slot *slot, - u64 *huge_sptep) - -{ - struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache; - u64 huge_spte = READ_ONCE(*huge_sptep); - struct kvm_mmu_page *sp; - bool flush = false; - u64 *sptep, spte; - gfn_t gfn; - int index; - - sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep); - - for (index = 0; index < SPTE_ENT_PER_PAGE; index++) { - sptep = &sp->spt[index]; - gfn = kvm_mmu_page_get_gfn(sp, index); - - /* - * The SP may already have populated SPTEs, e.g. if this huge - * page is aliased by multiple sptes with the same access - * permissions. These entries are guaranteed to map the same - * gfn-to-pfn translation since the SP is direct, so no need to - * modify them. - * - * However, if a given SPTE points to a lower level page table, - * that lower level page table may only be partially populated. - * Installing such SPTEs would effectively unmap a potion of the - * huge page. Unmapping guest memory always requires a TLB flush - * since a subsequent operation on the unmapped regions would - * fail to detect the need to flush. - */ - if (is_shadow_present_pte(*sptep)) { - flush |= !is_last_spte(*sptep, sp->role.level); - continue; - } - - spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index); - mmu_spte_set(sptep, spte); - __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access); - } - - __link_shadow_page(kvm, cache, huge_sptep, sp, flush); -} - -static int shadow_mmu_try_split_huge_page(struct kvm *kvm, - const struct kvm_memory_slot *slot, - u64 *huge_sptep) -{ - struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep); - int level, r = 0; - gfn_t gfn; - u64 spte; - - /* Grab information for the tracepoint before dropping the MMU lock. */ - gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep)); - level = huge_sp->role.level; - spte = *huge_sptep; - - if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) { - r = -ENOSPC; - goto out; - } - - if (need_topup_split_caches_or_resched(kvm)) { - write_unlock(&kvm->mmu_lock); - cond_resched(); - /* - * If the topup succeeds, return -EAGAIN to indicate that the - * rmap iterator should be restarted because the MMU lock was - * dropped. - */ - r = topup_split_caches(kvm) ?: -EAGAIN; - write_lock(&kvm->mmu_lock); - goto out; - } - - shadow_mmu_split_huge_page(kvm, slot, huge_sptep); - -out: - trace_kvm_mmu_split_huge_page(gfn, spte, level, r); - return r; -} - -static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm, - struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot) -{ - struct rmap_iterator iter; - struct kvm_mmu_page *sp; - u64 *huge_sptep; - int r; - -restart: - for_each_rmap_spte(rmap_head, &iter, huge_sptep) { - sp = sptep_to_sp(huge_sptep); - - /* TDP MMU is enabled, so rmap only contains nested MMU SPs. */ - if (WARN_ON_ONCE(!sp->role.guest_mode)) - continue; - - /* The rmaps should never contain non-leaf SPTEs. */ - if (WARN_ON_ONCE(!is_large_pte(*huge_sptep))) - continue; - - /* SPs with level >PG_LEVEL_4K should never by unsync. */ - if (WARN_ON_ONCE(sp->unsync)) - continue; - - /* Don't bother splitting huge pages on invalid SPs. */ - if (sp->role.invalid) - continue; - - r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep); - - /* - * The split succeeded or needs to be retried because the MMU - * lock was dropped. Either way, restart the iterator to get it - * back into a consistent state. - */ - if (!r || r == -EAGAIN) - goto restart; - - /* The split failed and shouldn't be retried (e.g. -ENOMEM). */ - break; - } - - return false; -} - -static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm, - const struct kvm_memory_slot *slot, - gfn_t start, gfn_t end, - int target_level) -{ - int level; - - /* - * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working - * down to the target level. This ensures pages are recursively split - * all the way to the target level. There's no need to split pages - * already at the target level. - */ - for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) { - slot_handle_level_range(kvm, slot, shadow_mmu_try_split_huge_pages, - level, level, start, end - 1, true, false); - } -} - /* Must be called with the mmu_lock held in write-mode. */ void kvm_mmu_try_split_huge_pages(struct kvm *kvm, const struct kvm_memory_slot *memslot, @@ -6417,56 +3060,6 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm, */ } -static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, - struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot) -{ - u64 *sptep; - struct rmap_iterator iter; - int need_tlb_flush = 0; - struct kvm_mmu_page *sp; - -restart: - for_each_rmap_spte(rmap_head, &iter, sptep) { - sp = sptep_to_sp(sptep); - - /* - * We cannot do huge page mapping for indirect shadow pages, - * which are found on the last rmap (level = 1) when not using - * tdp; such shadow pages are synced with the page table in - * the guest, and the guest page table is using 4K page size - * mapping if the indirect sp has level = 1. - */ - if (sp->role.direct && - sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn, - PG_LEVEL_NUM)) { - kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); - - if (kvm_available_flush_tlb_with_range()) - kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, - KVM_PAGES_PER_HPAGE(sp->role.level)); - else - need_tlb_flush = 1; - - goto restart; - } - } - - return need_tlb_flush; -} - -static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, - const struct kvm_memory_slot *slot) -{ - /* - * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap - * pages that are already mapped at the maximum hugepage level. - */ - if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte, - PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true)) - kvm_arch_flush_remote_tlbs_memslot(kvm, slot); -} - void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *slot) { @@ -6577,67 +3170,8 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) } } -static unsigned long -mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) -{ - struct kvm *kvm; - int nr_to_scan = sc->nr_to_scan; - unsigned long freed = 0; - - mutex_lock(&kvm_lock); - - list_for_each_entry(kvm, &vm_list, vm_list) { - int idx; - LIST_HEAD(invalid_list); - - /* - * Never scan more than sc->nr_to_scan VM instances. - * Will not hit this condition practically since we do not try - * to shrink more than one VM and it is very unlikely to see - * !n_used_mmu_pages so many times. - */ - if (!nr_to_scan--) - break; - /* - * n_used_mmu_pages is accessed without holding kvm->mmu_lock - * here. We may skip a VM instance errorneosly, but we do not - * want to shrink a VM that only started to populate its MMU - * anyway. - */ - if (!kvm->arch.n_used_mmu_pages && - !kvm_has_zapped_obsolete_pages(kvm)) - continue; - - idx = srcu_read_lock(&kvm->srcu); - write_lock(&kvm->mmu_lock); - - if (kvm_has_zapped_obsolete_pages(kvm)) { - kvm_mmu_commit_zap_page(kvm, - &kvm->arch.zapped_obsolete_pages); - goto unlock; - } - - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan); - -unlock: - write_unlock(&kvm->mmu_lock); - srcu_read_unlock(&kvm->srcu, idx); - - /* - * unfair on small ones - * per-vm shrinkers cry out - * sadness comes quickly - */ - list_move_tail(&kvm->vm_list, &vm_list); - break; - } - - mutex_unlock(&kvm_lock); - return freed; -} - -static unsigned long -mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) +static unsigned long mmu_shrink_count(struct shrinker *shrink, + struct shrink_control *sc) { return percpu_counter_read_positive(&kvm_total_used_mmu_pages); } diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 856e2e0a8420..74a99b67f09e 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -44,6 +44,8 @@ extern bool dbg; #define INVALID_PAE_ROOT 0 #define IS_VALID_PAE_ROOT(x) (!!(x)) +#define PTE_PREFETCH_NUM 8 + typedef u64 __rcu *tdp_ptep_t; struct kvm_mmu_page { @@ -168,8 +170,6 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm, int min_level); void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn, u64 pages); -unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); - extern int nx_huge_pages; static inline bool is_nx_huge_page_enabled(struct kvm *kvm) { diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 7bce5ec52b2e..05d8f5be559d 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -19,3 +19,3411 @@ #include #include #include + +#define for_each_shadow_entry(_vcpu, _addr, _walker) \ + for (shadow_walk_init(&(_walker), _vcpu, _addr); \ + shadow_walk_okay(&(_walker)); \ + shadow_walk_next(&(_walker))) + +#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte) \ + for (shadow_walk_init(&(_walker), _vcpu, _addr); \ + shadow_walk_okay(&(_walker)) && \ + ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \ + __shadow_walk_next(&(_walker), spte)) + +static void mmu_spte_set(u64 *sptep, u64 spte); + +void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn, + unsigned int access) +{ + u64 spte = make_mmio_spte(vcpu, gfn, access); + + trace_mark_mmio_spte(sptep, gfn, spte); + mmu_spte_set(sptep, spte); +} + +#ifdef CONFIG_X86_64 +static void __set_spte(u64 *sptep, u64 spte) +{ + WRITE_ONCE(*sptep, spte); +} + +static void __update_clear_spte_fast(u64 *sptep, u64 spte) +{ + WRITE_ONCE(*sptep, spte); +} + +static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) +{ + return xchg(sptep, spte); +} + +static u64 __get_spte_lockless(u64 *sptep) +{ + return READ_ONCE(*sptep); +} +#else +union split_spte { + struct { + u32 spte_low; + u32 spte_high; + }; + u64 spte; +}; + +static void count_spte_clear(u64 *sptep, u64 spte) +{ + struct kvm_mmu_page *sp = sptep_to_sp(sptep); + + if (is_shadow_present_pte(spte)) + return; + + /* Ensure the spte is completely set before we increase the count */ + smp_wmb(); + sp->clear_spte_count++; +} + +static void __set_spte(u64 *sptep, u64 spte) +{ + union split_spte *ssptep, sspte; + + ssptep = (union split_spte *)sptep; + sspte = (union split_spte)spte; + + ssptep->spte_high = sspte.spte_high; + + /* + * If we map the spte from nonpresent to present, We should store + * the high bits firstly, then set present bit, so cpu can not + * fetch this spte while we are setting the spte. + */ + smp_wmb(); + + WRITE_ONCE(ssptep->spte_low, sspte.spte_low); +} + +static void __update_clear_spte_fast(u64 *sptep, u64 spte) +{ + union split_spte *ssptep, sspte; + + ssptep = (union split_spte *)sptep; + sspte = (union split_spte)spte; + + WRITE_ONCE(ssptep->spte_low, sspte.spte_low); + + /* + * If we map the spte from present to nonpresent, we should clear + * present bit firstly to avoid vcpu fetch the old high bits. + */ + smp_wmb(); + + ssptep->spte_high = sspte.spte_high; + count_spte_clear(sptep, spte); +} + +static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) +{ + union split_spte *ssptep, sspte, orig; + + ssptep = (union split_spte *)sptep; + sspte = (union split_spte)spte; + + /* xchg acts as a barrier before the setting of the high bits */ + orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low); + orig.spte_high = ssptep->spte_high; + ssptep->spte_high = sspte.spte_high; + count_spte_clear(sptep, spte); + + return orig.spte; +} + +/* + * The idea using the light way get the spte on x86_32 guest is from + * gup_get_pte (mm/gup.c). + * + * An spte tlb flush may be pending, because kvm_set_pte_rmap + * coalesces them and we are running out of the MMU lock. Therefore + * we need to protect against in-progress updates of the spte. + * + * Reading the spte while an update is in progress may get the old value + * for the high part of the spte. The race is fine for a present->non-present + * change (because the high part of the spte is ignored for non-present spte), + * but for a present->present change we must reread the spte. + * + * All such changes are done in two steps (present->non-present and + * non-present->present), hence it is enough to count the number of + * present->non-present updates: if it changed while reading the spte, + * we might have hit the race. This is done using clear_spte_count. + */ +static u64 __get_spte_lockless(u64 *sptep) +{ + struct kvm_mmu_page *sp = sptep_to_sp(sptep); + union split_spte spte, *orig = (union split_spte *)sptep; + int count; + +retry: + count = sp->clear_spte_count; + smp_rmb(); + + spte.spte_low = orig->spte_low; + smp_rmb(); + + spte.spte_high = orig->spte_high; + smp_rmb(); + + if (unlikely(spte.spte_low != orig->spte_low || + count != sp->clear_spte_count)) + goto retry; + + return spte.spte; +} +#endif + +/* Rules for using mmu_spte_set: + * Set the sptep from nonpresent to present. + * Note: the sptep being assigned *must* be either not present + * or in a state where the hardware will not attempt to update + * the spte. + */ +static void mmu_spte_set(u64 *sptep, u64 new_spte) +{ + WARN_ON(is_shadow_present_pte(*sptep)); + __set_spte(sptep, new_spte); +} + +/* + * Update the SPTE (excluding the PFN), but do not track changes in its + * accessed/dirty status. + */ +static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte) +{ + u64 old_spte = *sptep; + + WARN_ON(!is_shadow_present_pte(new_spte)); + check_spte_writable_invariants(new_spte); + + if (!is_shadow_present_pte(old_spte)) { + mmu_spte_set(sptep, new_spte); + return old_spte; + } + + if (!spte_has_volatile_bits(old_spte)) + __update_clear_spte_fast(sptep, new_spte); + else + old_spte = __update_clear_spte_slow(sptep, new_spte); + + WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte)); + + return old_spte; +} + +/* Rules for using mmu_spte_update: + * Update the state bits, it means the mapped pfn is not changed. + * + * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote + * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only + * spte, even though the writable spte might be cached on a CPU's TLB. + * + * Returns true if the TLB needs to be flushed + */ +bool mmu_spte_update(u64 *sptep, u64 new_spte) +{ + bool flush = false; + u64 old_spte = mmu_spte_update_no_track(sptep, new_spte); + + if (!is_shadow_present_pte(old_spte)) + return false; + + /* + * For the spte updated out of mmu-lock is safe, since + * we always atomically update it, see the comments in + * spte_has_volatile_bits(). + */ + if (is_mmu_writable_spte(old_spte) && + !is_writable_pte(new_spte)) + flush = true; + + /* + * Flush TLB when accessed/dirty states are changed in the page tables, + * to guarantee consistency between TLB and page tables. + */ + + if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) { + flush = true; + kvm_set_pfn_accessed(spte_to_pfn(old_spte)); + } + + if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { + flush = true; + kvm_set_pfn_dirty(spte_to_pfn(old_spte)); + } + + return flush; +} + +/* + * Rules for using mmu_spte_clear_track_bits: + * It sets the sptep from present to nonpresent, and track the + * state bits, it is used to clear the last level sptep. + * Returns the old PTE. + */ +static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) +{ + kvm_pfn_t pfn; + u64 old_spte = *sptep; + int level = sptep_to_sp(sptep)->role.level; + struct page *page; + + if (!is_shadow_present_pte(old_spte) || + !spte_has_volatile_bits(old_spte)) + __update_clear_spte_fast(sptep, 0ull); + else + old_spte = __update_clear_spte_slow(sptep, 0ull); + + if (!is_shadow_present_pte(old_spte)) + return old_spte; + + kvm_update_page_stats(kvm, level, -1); + + pfn = spte_to_pfn(old_spte); + + /* + * KVM doesn't hold a reference to any pages mapped into the guest, and + * instead uses the mmu_notifier to ensure that KVM unmaps any pages + * before they are reclaimed. Sanity check that, if the pfn is backed + * by a refcounted page, the refcount is elevated. + */ + page = kvm_pfn_to_refcounted_page(pfn); + WARN_ON(page && !page_count(page)); + + if (is_accessed_spte(old_spte)) + kvm_set_pfn_accessed(pfn); + + if (is_dirty_spte(old_spte)) + kvm_set_pfn_dirty(pfn); + + return old_spte; +} + +/* + * Rules for using mmu_spte_clear_no_track: + * Directly clear spte without caring the state bits of sptep, + * it is used to set the upper level spte. + */ +void mmu_spte_clear_no_track(u64 *sptep) +{ + __update_clear_spte_fast(sptep, 0ull); +} + +static u64 mmu_spte_get_lockless(u64 *sptep) +{ + return __get_spte_lockless(sptep); +} + +/* Returns the Accessed status of the PTE and resets it at the same time. */ +static bool mmu_spte_age(u64 *sptep) +{ + u64 spte = mmu_spte_get_lockless(sptep); + + if (!is_accessed_spte(spte)) + return false; + + if (spte_ad_enabled(spte)) { + clear_bit((ffs(shadow_accessed_mask) - 1), + (unsigned long *)sptep); + } else { + /* + * Capture the dirty status of the page, so that it doesn't get + * lost when the SPTE is marked for access tracking. + */ + if (is_writable_pte(spte)) + kvm_set_pfn_dirty(spte_to_pfn(spte)); + + spte = mark_spte_for_access_track(spte); + mmu_spte_update_no_track(sptep, spte); + } + + return true; +} + +static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc) +{ + kmem_cache_free(pte_list_desc_cache, pte_list_desc); +} + +static bool sp_has_gptes(struct kvm_mmu_page *sp); + +gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index) +{ + if (sp->role.passthrough) + return sp->gfn; + + if (!sp->role.direct) + return sp->shadowed_translation[index] >> PAGE_SHIFT; + + return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS)); +} + +/* + * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note + * that the SPTE itself may have a more constrained access permissions that + * what the guest enforces. For example, a guest may create an executable + * huge PTE but KVM may disallow execution to mitigate iTLB multihit. + */ +static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index) +{ + if (sp_has_gptes(sp)) + return sp->shadowed_translation[index] & ACC_ALL; + + /* + * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs, + * KVM is not shadowing any guest page tables, so the "guest access + * permissions" are just ACC_ALL. + * + * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM + * is shadowing a guest huge page with small pages, the guest access + * permissions being shadowed are the access permissions of the huge + * page. + * + * In both cases, sp->role.access contains the correct access bits. + */ + return sp->role.access; +} + +static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, + gfn_t gfn, unsigned int access) +{ + if (sp_has_gptes(sp)) { + sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access; + return; + } + + WARN_ONCE(access != kvm_mmu_page_get_access(sp, index), + "access mismatch under %s page %llx (expected %u, got %u)\n", + sp->role.passthrough ? "passthrough" : "direct", + sp->gfn, kvm_mmu_page_get_access(sp, index), access); + + WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index), + "gfn mismatch under %s page %llx (expected %llx, got %llx)\n", + sp->role.passthrough ? "passthrough" : "direct", + sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn); +} + +void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, + unsigned int access) +{ + gfn_t gfn = kvm_mmu_page_get_gfn(sp, index); + + kvm_mmu_page_set_translation(sp, index, gfn, access); +} + +static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + struct kvm_memslots *slots; + struct kvm_memory_slot *slot; + gfn_t gfn; + + kvm->arch.indirect_shadow_pages++; + gfn = sp->gfn; + slots = kvm_memslots_for_spte_role(kvm, sp->role); + slot = __gfn_to_memslot(slots, gfn); + + /* the non-leaf shadow pages are keeping readonly. */ + if (sp->role.level > PG_LEVEL_4K) + return kvm_slot_page_track_add_page(kvm, slot, gfn, + KVM_PAGE_TRACK_WRITE); + + kvm_mmu_gfn_disallow_lpage(slot, gfn); + + if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K)) + kvm_flush_remote_tlbs_with_address(kvm, gfn, 1); +} + +static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + struct kvm_memslots *slots; + struct kvm_memory_slot *slot; + gfn_t gfn; + + kvm->arch.indirect_shadow_pages--; + gfn = sp->gfn; + slots = kvm_memslots_for_spte_role(kvm, sp->role); + slot = __gfn_to_memslot(slots, gfn); + if (sp->role.level > PG_LEVEL_4K) + return kvm_slot_page_track_remove_page(kvm, slot, gfn, + KVM_PAGE_TRACK_WRITE); + + kvm_mmu_gfn_allow_lpage(slot, gfn); +} + +/* + * About rmap_head encoding: + * + * If the bit zero of rmap_head->val is clear, then it points to the only spte + * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct + * pte_list_desc containing more mappings. + */ + +/* + * Returns the number of pointers in the rmap chain, not counting the new one. + */ +static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, + struct kvm_rmap_head *rmap_head) +{ + struct pte_list_desc *desc; + int count = 0; + + if (!rmap_head->val) { + rmap_printk("%p %llx 0->1\n", spte, *spte); + rmap_head->val = (unsigned long)spte; + } else if (!(rmap_head->val & 1)) { + rmap_printk("%p %llx 1->many\n", spte, *spte); + desc = kvm_mmu_memory_cache_alloc(cache); + desc->sptes[0] = (u64 *)rmap_head->val; + desc->sptes[1] = spte; + desc->spte_count = 2; + rmap_head->val = (unsigned long)desc | 1; + ++count; + } else { + rmap_printk("%p %llx many->many\n", spte, *spte); + desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + while (desc->spte_count == PTE_LIST_EXT) { + count += PTE_LIST_EXT; + if (!desc->more) { + desc->more = kvm_mmu_memory_cache_alloc(cache); + desc = desc->more; + desc->spte_count = 0; + break; + } + desc = desc->more; + } + count += desc->spte_count; + desc->sptes[desc->spte_count++] = spte; + } + return count; +} + +static void +pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head, + struct pte_list_desc *desc, int i, + struct pte_list_desc *prev_desc) +{ + int j = desc->spte_count - 1; + + desc->sptes[i] = desc->sptes[j]; + desc->sptes[j] = NULL; + desc->spte_count--; + if (desc->spte_count) + return; + if (!prev_desc && !desc->more) + rmap_head->val = 0; + else + if (prev_desc) + prev_desc->more = desc->more; + else + rmap_head->val = (unsigned long)desc->more | 1; + mmu_free_pte_list_desc(desc); +} + +static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head) +{ + struct pte_list_desc *desc; + struct pte_list_desc *prev_desc; + int i; + + if (!rmap_head->val) { + pr_err("%s: %p 0->BUG\n", __func__, spte); + BUG(); + } else if (!(rmap_head->val & 1)) { + rmap_printk("%p 1->0\n", spte); + if ((u64 *)rmap_head->val != spte) { + pr_err("%s: %p 1->BUG\n", __func__, spte); + BUG(); + } + rmap_head->val = 0; + } else { + rmap_printk("%p many->many\n", spte); + desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + prev_desc = NULL; + while (desc) { + for (i = 0; i < desc->spte_count; ++i) { + if (desc->sptes[i] == spte) { + pte_list_desc_remove_entry(rmap_head, + desc, i, prev_desc); + return; + } + } + prev_desc = desc; + desc = desc->more; + } + pr_err("%s: %p many->many\n", __func__, spte); + BUG(); + } +} + +static void kvm_zap_one_rmap_spte(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, u64 *sptep) +{ + mmu_spte_clear_track_bits(kvm, sptep); + pte_list_remove(sptep, rmap_head); +} + +/* Return true if at least one SPTE was zapped, false otherwise */ +static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, + struct kvm_rmap_head *rmap_head) +{ + struct pte_list_desc *desc, *next; + int i; + + if (!rmap_head->val) + return false; + + if (!(rmap_head->val & 1)) { + mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val); + goto out; + } + + desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + + for (; desc; desc = next) { + for (i = 0; i < desc->spte_count; i++) + mmu_spte_clear_track_bits(kvm, desc->sptes[i]); + next = desc->more; + mmu_free_pte_list_desc(desc); + } +out: + /* rmap_head is meaningless now, remember to reset it */ + rmap_head->val = 0; + return true; +} + +unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) +{ + struct pte_list_desc *desc; + unsigned int count = 0; + + if (!rmap_head->val) + return 0; + else if (!(rmap_head->val & 1)) + return 1; + + desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + + while (desc) { + count += desc->spte_count; + desc = desc->more; + } + + return count; +} + +struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, + const struct kvm_memory_slot *slot) +{ + unsigned long idx; + + idx = gfn_to_index(gfn, slot->base_gfn, level); + return &slot->arch.rmap[level - PG_LEVEL_4K][idx]; +} + +bool rmap_can_add(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu_memory_cache *mc; + + mc = &vcpu->arch.mmu_pte_list_desc_cache; + return kvm_mmu_memory_cache_nr_free_objects(mc); +} + +static void rmap_remove(struct kvm *kvm, u64 *spte) +{ + struct kvm_memslots *slots; + struct kvm_memory_slot *slot; + struct kvm_mmu_page *sp; + gfn_t gfn; + struct kvm_rmap_head *rmap_head; + + sp = sptep_to_sp(spte); + gfn = kvm_mmu_page_get_gfn(sp, spte_index(spte)); + + /* + * Unlike rmap_add, rmap_remove does not run in the context of a vCPU + * so we have to determine which memslots to use based on context + * information in sp->role. + */ + slots = kvm_memslots_for_spte_role(kvm, sp->role); + + slot = __gfn_to_memslot(slots, gfn); + rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); + + pte_list_remove(spte, rmap_head); +} + +/* + * Used by the following functions to iterate through the sptes linked by a + * rmap. All fields are private and not assumed to be used outside. + */ +struct rmap_iterator { + /* private fields */ + struct pte_list_desc *desc; /* holds the sptep if not NULL */ + int pos; /* index of the sptep */ +}; + +/* + * Iteration must be started by this function. This should also be used after + * removing/dropping sptes from the rmap link because in such cases the + * information in the iterator may not be valid. + * + * Returns sptep if found, NULL otherwise. + */ +static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head, + struct rmap_iterator *iter) +{ + u64 *sptep; + + if (!rmap_head->val) + return NULL; + + if (!(rmap_head->val & 1)) { + iter->desc = NULL; + sptep = (u64 *)rmap_head->val; + goto out; + } + + iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); + iter->pos = 0; + sptep = iter->desc->sptes[iter->pos]; +out: + BUG_ON(!is_shadow_present_pte(*sptep)); + return sptep; +} + +/* + * Must be used with a valid iterator: e.g. after rmap_get_first(). + * + * Returns sptep if found, NULL otherwise. + */ +static u64 *rmap_get_next(struct rmap_iterator *iter) +{ + u64 *sptep; + + if (iter->desc) { + if (iter->pos < PTE_LIST_EXT - 1) { + ++iter->pos; + sptep = iter->desc->sptes[iter->pos]; + if (sptep) + goto out; + } + + iter->desc = iter->desc->more; + + if (iter->desc) { + iter->pos = 0; + /* desc->sptes[0] cannot be NULL */ + sptep = iter->desc->sptes[iter->pos]; + goto out; + } + } + + return NULL; +out: + BUG_ON(!is_shadow_present_pte(*sptep)); + return sptep; +} + +#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_) \ + for (_spte_ = rmap_get_first(_rmap_head_, _iter_); \ + _spte_; _spte_ = rmap_get_next(_iter_)) + +void drop_spte(struct kvm *kvm, u64 *sptep) +{ + u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep); + + if (is_shadow_present_pte(old_spte)) + rmap_remove(kvm, sptep); +} + +static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush) +{ + struct kvm_mmu_page *sp; + + sp = sptep_to_sp(sptep); + WARN_ON(sp->role.level == PG_LEVEL_4K); + + drop_spte(kvm, sptep); + + if (flush) + kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, + KVM_PAGES_PER_HPAGE(sp->role.level)); +} + +/* + * Write-protect on the specified @sptep, @pt_protect indicates whether + * spte write-protection is caused by protecting shadow page table. + * + * Note: write protection is difference between dirty logging and spte + * protection: + * - for dirty logging, the spte can be set to writable at anytime if + * its dirty bitmap is properly set. + * - for spte protection, the spte can be writable only after unsync-ing + * shadow page. + * + * Return true if tlb need be flushed. + */ +static bool spte_write_protect(u64 *sptep, bool pt_protect) +{ + u64 spte = *sptep; + + if (!is_writable_pte(spte) && + !(pt_protect && is_mmu_writable_spte(spte))) + return false; + + rmap_printk("spte %p %llx\n", sptep, *sptep); + + if (pt_protect) + spte &= ~shadow_mmu_writable_mask; + spte = spte & ~PT_WRITABLE_MASK; + + return mmu_spte_update(sptep, spte); +} + +bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect) +{ + u64 *sptep; + struct rmap_iterator iter; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) + flush |= spte_write_protect(sptep, pt_protect); + + return flush; +} + +static bool spte_clear_dirty(u64 *sptep) +{ + u64 spte = *sptep; + + rmap_printk("spte %p %llx\n", sptep, *sptep); + + MMU_WARN_ON(!spte_ad_enabled(spte)); + spte &= ~shadow_dirty_mask; + return mmu_spte_update(sptep, spte); +} + +static bool spte_wrprot_for_clear_dirty(u64 *sptep) +{ + bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT, + (unsigned long *)sptep); + if (was_writable && !spte_ad_enabled(*sptep)) + kvm_set_pfn_dirty(spte_to_pfn(*sptep)); + + return was_writable; +} + +/* + * Gets the GFN ready for another round of dirty logging by clearing the + * - D bit on ad-enabled SPTEs, and + * - W bit on ad-disabled SPTEs. + * Returns true iff any D or W bits were cleared. + */ +bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot) +{ + u64 *sptep; + struct rmap_iterator iter; + bool flush = false; + + for_each_rmap_spte(rmap_head, &iter, sptep) + if (spte_ad_need_write_protect(*sptep)) + flush |= spte_wrprot_for_clear_dirty(sptep); + else + flush |= spte_clear_dirty(sptep); + + return flush; +} + +static bool __kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot) +{ + return kvm_zap_all_rmap_sptes(kvm, rmap_head); +} + +bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t unused) +{ + return __kvm_zap_rmap(kvm, rmap_head, slot); +} + +bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t pte) +{ + u64 *sptep; + struct rmap_iterator iter; + bool need_flush = false; + u64 new_spte; + kvm_pfn_t new_pfn; + + WARN_ON(pte_huge(pte)); + new_pfn = pte_pfn(pte); + +restart: + for_each_rmap_spte(rmap_head, &iter, sptep) { + rmap_printk("spte %p %llx gfn %llx (%d)\n", + sptep, *sptep, gfn, level); + + need_flush = true; + + if (pte_write(pte)) { + kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); + goto restart; + } else { + new_spte = kvm_mmu_changed_pte_notifier_make_spte( + *sptep, new_pfn); + + mmu_spte_clear_track_bits(kvm, sptep); + mmu_spte_set(sptep, new_spte); + } + } + + if (need_flush && kvm_available_flush_tlb_with_range()) { + kvm_flush_remote_tlbs_with_address(kvm, gfn, 1); + return false; + } + + return need_flush; +} + +struct slot_rmap_walk_iterator { + /* input fields. */ + const struct kvm_memory_slot *slot; + gfn_t start_gfn; + gfn_t end_gfn; + int start_level; + int end_level; + + /* output fields. */ + gfn_t gfn; + struct kvm_rmap_head *rmap; + int level; + + /* private field. */ + struct kvm_rmap_head *end_rmap; +}; + +static void +rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level) +{ + iterator->level = level; + iterator->gfn = iterator->start_gfn; + iterator->rmap = gfn_to_rmap(iterator->gfn, level, iterator->slot); + iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot); +} + +static void +slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator, + const struct kvm_memory_slot *slot, int start_level, + int end_level, gfn_t start_gfn, gfn_t end_gfn) +{ + iterator->slot = slot; + iterator->start_level = start_level; + iterator->end_level = end_level; + iterator->start_gfn = start_gfn; + iterator->end_gfn = end_gfn; + + rmap_walk_init_level(iterator, iterator->start_level); +} + +static bool slot_rmap_walk_okay(struct slot_rmap_walk_iterator *iterator) +{ + return !!iterator->rmap; +} + +static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator) +{ + while (++iterator->rmap <= iterator->end_rmap) { + iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level)); + + if (iterator->rmap->val) + return; + } + + if (++iterator->level > iterator->end_level) { + iterator->rmap = NULL; + return; + } + + rmap_walk_init_level(iterator, iterator->level); +} + +#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_, \ + _start_gfn, _end_gfn, _iter_) \ + for (slot_rmap_walk_init(_iter_, _slot_, _start_level_, \ + _end_level_, _start_gfn, _end_gfn); \ + slot_rmap_walk_okay(_iter_); \ + slot_rmap_walk_next(_iter_)) + +__always_inline bool kvm_handle_gfn_range(struct kvm *kvm, + struct kvm_gfn_range *range, + rmap_handler_t handler) +{ + struct slot_rmap_walk_iterator iterator; + bool ret = false; + + for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, + range->start, range->end - 1, &iterator) + ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn, + iterator.level, range->pte); + + return ret; +} + +bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t unused) +{ + u64 *sptep; + struct rmap_iterator iter; + int young = 0; + + for_each_rmap_spte(rmap_head, &iter, sptep) + young |= mmu_spte_age(sptep); + + return young; +} + +bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, + int level, pte_t unused) +{ + u64 *sptep; + struct rmap_iterator iter; + + for_each_rmap_spte(rmap_head, &iter, sptep) + if (is_accessed_spte(*sptep)) + return true; + return false; +} + +#define RMAP_RECYCLE_THRESHOLD 1000 + +static void __rmap_add(struct kvm *kvm, + struct kvm_mmu_memory_cache *cache, + const struct kvm_memory_slot *slot, + u64 *spte, gfn_t gfn, unsigned int access) +{ + struct kvm_mmu_page *sp; + struct kvm_rmap_head *rmap_head; + int rmap_count; + + sp = sptep_to_sp(spte); + kvm_mmu_page_set_translation(sp, spte_index(spte), gfn, access); + kvm_update_page_stats(kvm, sp->role.level, 1); + + rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); + rmap_count = pte_list_add(cache, spte, rmap_head); + + if (rmap_count > kvm->stat.max_mmu_rmap_size) + kvm->stat.max_mmu_rmap_size = rmap_count; + if (rmap_count > RMAP_RECYCLE_THRESHOLD) { + kvm_zap_all_rmap_sptes(kvm, rmap_head); + kvm_flush_remote_tlbs_with_address( + kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level)); + } +} + +static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot, + u64 *spte, gfn_t gfn, unsigned int access) +{ + struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache; + + __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access); +} + +#ifdef MMU_DEBUG +static int is_empty_shadow_page(u64 *spt) +{ + u64 *pos; + u64 *end; + + for (pos = spt, end = pos + SPTE_ENT_PER_PAGE; pos != end; pos++) + if (is_shadow_present_pte(*pos)) { + printk(KERN_ERR "%s: %p %llx\n", __func__, + pos, *pos); + return 0; + } + return 1; +} +#endif + +/* + * This value is the sum of all of the kvm instances's + * kvm->arch.n_used_mmu_pages values. We need a global, + * aggregate version in order to make the slab shrinker + * faster + */ +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr) +{ + kvm->arch.n_used_mmu_pages += nr; + percpu_counter_add(&kvm_total_used_mmu_pages, nr); +} + +static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + kvm_mod_used_mmu_pages(kvm, +1); + kvm_account_pgtable_pages((void *)sp->spt, +1); +} + +static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + kvm_mod_used_mmu_pages(kvm, -1); + kvm_account_pgtable_pages((void *)sp->spt, -1); +} + +static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp) +{ + MMU_WARN_ON(!is_empty_shadow_page(sp->spt)); + hlist_del(&sp->hash_link); + list_del(&sp->link); + free_page((unsigned long)sp->spt); + if (!sp->role.direct) + free_page((unsigned long)sp->shadowed_translation); + kmem_cache_free(mmu_page_header_cache, sp); +} + +static unsigned kvm_page_table_hashfn(gfn_t gfn) +{ + return hash_64(gfn, KVM_MMU_HASH_SHIFT); +} + +static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache, + struct kvm_mmu_page *sp, u64 *parent_pte) +{ + if (!parent_pte) + return; + + pte_list_add(cache, parent_pte, &sp->parent_ptes); +} + +static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp, + u64 *parent_pte) +{ + pte_list_remove(parent_pte, &sp->parent_ptes); +} + +void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte) +{ + mmu_page_remove_parent_pte(sp, parent_pte); + mmu_spte_clear_no_track(parent_pte); +} + +static void mark_unsync(u64 *spte); +static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp) +{ + u64 *sptep; + struct rmap_iterator iter; + + for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) { + mark_unsync(sptep); + } +} + +static void mark_unsync(u64 *spte) +{ + struct kvm_mmu_page *sp; + + sp = sptep_to_sp(spte); + if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap)) + return; + if (sp->unsync_children++) + return; + kvm_mmu_mark_parents_unsync(sp); +} + +int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) +{ + return -1; +} + +#define KVM_PAGE_ARRAY_NR 16 + +struct kvm_mmu_pages { + struct mmu_page_and_offset { + struct kvm_mmu_page *sp; + unsigned int idx; + } page[KVM_PAGE_ARRAY_NR]; + unsigned int nr; +}; + +static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp, + int idx) +{ + int i; + + if (sp->unsync) + for (i=0; i < pvec->nr; i++) + if (pvec->page[i].sp == sp) + return 0; + + pvec->page[pvec->nr].sp = sp; + pvec->page[pvec->nr].idx = idx; + pvec->nr++; + return (pvec->nr == KVM_PAGE_ARRAY_NR); +} + +static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx) +{ + --sp->unsync_children; + WARN_ON((int)sp->unsync_children < 0); + __clear_bit(idx, sp->unsync_child_bitmap); +} + +static int __mmu_unsync_walk(struct kvm_mmu_page *sp, + struct kvm_mmu_pages *pvec) +{ + int i, ret, nr_unsync_leaf = 0; + + for_each_set_bit(i, sp->unsync_child_bitmap, 512) { + struct kvm_mmu_page *child; + u64 ent = sp->spt[i]; + + if (!is_shadow_present_pte(ent) || is_large_pte(ent)) { + clear_unsync_child_bit(sp, i); + continue; + } + + child = spte_to_child_sp(ent); + + if (child->unsync_children) { + if (mmu_pages_add(pvec, child, i)) + return -ENOSPC; + + ret = __mmu_unsync_walk(child, pvec); + if (!ret) { + clear_unsync_child_bit(sp, i); + continue; + } else if (ret > 0) { + nr_unsync_leaf += ret; + } else + return ret; + } else if (child->unsync) { + nr_unsync_leaf++; + if (mmu_pages_add(pvec, child, i)) + return -ENOSPC; + } else + clear_unsync_child_bit(sp, i); + } + + return nr_unsync_leaf; +} + +#define INVALID_INDEX (-1) + +static int mmu_unsync_walk(struct kvm_mmu_page *sp, + struct kvm_mmu_pages *pvec) +{ + pvec->nr = 0; + if (!sp->unsync_children) + return 0; + + mmu_pages_add(pvec, sp, INVALID_INDEX); + return __mmu_unsync_walk(sp, pvec); +} + +static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + WARN_ON(!sp->unsync); + trace_kvm_mmu_sync_page(sp); + sp->unsync = 0; + --kvm->stat.mmu_unsync; +} + +static bool sp_has_gptes(struct kvm_mmu_page *sp) +{ + if (sp->role.direct) + return false; + + if (sp->role.passthrough) + return false; + + return true; +} + +#define for_each_valid_sp(_kvm, _sp, _list) \ + hlist_for_each_entry(_sp, _list, hash_link) \ + if (is_obsolete_sp((_kvm), (_sp))) { \ + } else + +#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \ + for_each_valid_sp(_kvm, _sp, \ + &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \ + if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else + +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, + struct list_head *invalid_list) +{ + int ret = vcpu->arch.mmu->sync_page(vcpu, sp); + + if (ret < 0) + kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); + return ret; +} + +struct mmu_page_path { + struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL]; + unsigned int idx[PT64_ROOT_MAX_LEVEL]; +}; + +#define for_each_sp(pvec, sp, parents, i) \ + for (i = mmu_pages_first(&pvec, &parents); \ + i < pvec.nr && ({ sp = pvec.page[i].sp; 1;}); \ + i = mmu_pages_next(&pvec, &parents, i)) + +static int mmu_pages_next(struct kvm_mmu_pages *pvec, + struct mmu_page_path *parents, + int i) +{ + int n; + + for (n = i+1; n < pvec->nr; n++) { + struct kvm_mmu_page *sp = pvec->page[n].sp; + unsigned idx = pvec->page[n].idx; + int level = sp->role.level; + + parents->idx[level-1] = idx; + if (level == PG_LEVEL_4K) + break; + + parents->parent[level-2] = sp; + } + + return n; +} + +static int mmu_pages_first(struct kvm_mmu_pages *pvec, + struct mmu_page_path *parents) +{ + struct kvm_mmu_page *sp; + int level; + + if (pvec->nr == 0) + return 0; + + WARN_ON(pvec->page[0].idx != INVALID_INDEX); + + sp = pvec->page[0].sp; + level = sp->role.level; + WARN_ON(level == PG_LEVEL_4K); + + parents->parent[level-2] = sp; + + /* Also set up a sentinel. Further entries in pvec are all + * children of sp, so this element is never overwritten. + */ + parents->parent[level-1] = NULL; + return mmu_pages_next(pvec, parents, 0); +} + +static void mmu_pages_clear_parents(struct mmu_page_path *parents) +{ + struct kvm_mmu_page *sp; + unsigned int level = 0; + + do { + unsigned int idx = parents->idx[level]; + sp = parents->parent[level]; + if (!sp) + return; + + WARN_ON(idx == INVALID_INDEX); + clear_unsync_child_bit(sp, idx); + level++; + } while (!sp->unsync_children); +} + +int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent, + bool can_yield) +{ + int i; + struct kvm_mmu_page *sp; + struct mmu_page_path parents; + struct kvm_mmu_pages pages; + LIST_HEAD(invalid_list); + bool flush = false; + + while (mmu_unsync_walk(parent, &pages)) { + bool protected = false; + + for_each_sp(pages, sp, parents, i) + protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn); + + if (protected) { + kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true); + flush = false; + } + + for_each_sp(pages, sp, parents, i) { + kvm_unlink_unsync_page(vcpu->kvm, sp); + flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0; + mmu_pages_clear_parents(&parents); + } + if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) { + kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush); + if (!can_yield) { + kvm_make_request(KVM_REQ_MMU_SYNC, vcpu); + return -EINTR; + } + + cond_resched_rwlock_write(&vcpu->kvm->mmu_lock); + flush = false; + } + } + + kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush); + return 0; +} + +void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp) +{ + atomic_set(&sp->write_flooding_count, 0); +} + +void clear_sp_write_flooding_count(u64 *spte) +{ + __clear_sp_write_flooding_count(sptep_to_sp(spte)); +} + +/* + * The vCPU is required when finding indirect shadow pages; the shadow + * page may already exist and syncing it needs the vCPU pointer in + * order to read guest page tables. Direct shadow pages are never + * unsync, thus @vcpu can be NULL if @role.direct is true. + */ +static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, + struct kvm_vcpu *vcpu, + gfn_t gfn, + struct hlist_head *sp_list, + union kvm_mmu_page_role role) +{ + struct kvm_mmu_page *sp; + int ret; + int collisions = 0; + LIST_HEAD(invalid_list); + + for_each_valid_sp(kvm, sp, sp_list) { + if (sp->gfn != gfn) { + collisions++; + continue; + } + + if (sp->role.word != role.word) { + /* + * If the guest is creating an upper-level page, zap + * unsync pages for the same gfn. While it's possible + * the guest is using recursive page tables, in all + * likelihood the guest has stopped using the unsync + * page and is installing a completely unrelated page. + * Unsync pages must not be left as is, because the new + * upper-level page will be write-protected. + */ + if (role.level > PG_LEVEL_4K && sp->unsync) + kvm_mmu_prepare_zap_page(kvm, sp, + &invalid_list); + continue; + } + + /* unsync and write-flooding only apply to indirect SPs. */ + if (sp->role.direct) + goto out; + + if (sp->unsync) { + if (KVM_BUG_ON(!vcpu, kvm)) + break; + + /* + * The page is good, but is stale. kvm_sync_page does + * get the latest guest state, but (unlike mmu_unsync_children) + * it doesn't write-protect the page or mark it synchronized! + * This way the validity of the mapping is ensured, but the + * overhead of write protection is not incurred until the + * guest invalidates the TLB mapping. This allows multiple + * SPs for a single gfn to be unsync. + * + * If the sync fails, the page is zapped. If so, break + * in order to rebuild it. + */ + ret = kvm_sync_page(vcpu, sp, &invalid_list); + if (ret < 0) + break; + + WARN_ON(!list_empty(&invalid_list)); + if (ret > 0) + kvm_flush_remote_tlbs(kvm); + } + + __clear_sp_write_flooding_count(sp); + + goto out; + } + + sp = NULL; + ++kvm->stat.mmu_cache_miss; + +out: + kvm_mmu_commit_zap_page(kvm, &invalid_list); + + if (collisions > kvm->stat.max_mmu_page_hash_collisions) + kvm->stat.max_mmu_page_hash_collisions = collisions; + return sp; +} + +/* Caches used when allocating a new shadow page. */ +struct shadow_page_caches { + struct kvm_mmu_memory_cache *page_header_cache; + struct kvm_mmu_memory_cache *shadow_page_cache; + struct kvm_mmu_memory_cache *shadowed_info_cache; +}; + +static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm, + struct shadow_page_caches *caches, + gfn_t gfn, + struct hlist_head *sp_list, + union kvm_mmu_page_role role) +{ + struct kvm_mmu_page *sp; + + sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache); + sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache); + if (!role.direct) + sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache); + + set_page_private(virt_to_page(sp->spt), (unsigned long)sp); + + INIT_LIST_HEAD(&sp->possible_nx_huge_page_link); + + /* + * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages() + * depends on valid pages being added to the head of the list. See + * comments in kvm_zap_obsolete_pages(). + */ + sp->mmu_valid_gen = kvm->arch.mmu_valid_gen; + list_add(&sp->link, &kvm->arch.active_mmu_pages); + kvm_account_mmu_page(kvm, sp); + + sp->gfn = gfn; + sp->role = role; + hlist_add_head(&sp->hash_link, sp_list); + if (sp_has_gptes(sp)) + account_shadowed(kvm, sp); + + return sp; +} + +/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */ +static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm, + struct kvm_vcpu *vcpu, + struct shadow_page_caches *caches, + gfn_t gfn, + union kvm_mmu_page_role role) +{ + struct hlist_head *sp_list; + struct kvm_mmu_page *sp; + bool created = false; + + sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; + + sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role); + if (!sp) { + created = true; + sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role); + } + + trace_kvm_mmu_get_page(sp, created); + return sp; +} + +static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu, + gfn_t gfn, + union kvm_mmu_page_role role) +{ + struct shadow_page_caches caches = { + .page_header_cache = &vcpu->arch.mmu_page_header_cache, + .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache, + .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache, + }; + + return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role); +} + +static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, + unsigned int access) +{ + struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep); + union kvm_mmu_page_role role; + + role = parent_sp->role; + role.level--; + role.access = access; + role.direct = direct; + role.passthrough = 0; + + /* + * If the guest has 4-byte PTEs then that means it's using 32-bit, + * 2-level, non-PAE paging. KVM shadows such guests with PAE paging + * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must + * shadow each guest page table with multiple shadow page tables, which + * requires extra bookkeeping in the role. + * + * Specifically, to shadow the guest's page directory (which covers a + * 4GiB address space), KVM uses 4 PAE page directories, each mapping + * 1GiB of the address space. @role.quadrant encodes which quarter of + * the address space each maps. + * + * To shadow the guest's page tables (which each map a 4MiB region), KVM + * uses 2 PAE page tables, each mapping a 2MiB region. For these, + * @role.quadrant encodes which half of the region they map. + * + * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE + * consumes bits 29:21. To consume bits 31:30, KVM's uses 4 shadow + * PDPTEs; those 4 PAE page directories are pre-allocated and their + * quadrant is assigned in mmu_alloc_root(). A 4-byte PTE consumes + * bits 21:12, while an 8-byte PTE consumes bits 20:12. To consume + * bit 21 in the PTE (the child here), KVM propagates that bit to the + * quadrant, i.e. sets quadrant to '0' or '1'. The parent 8-byte PDE + * covers bit 21 (see above), thus the quadrant is calculated from the + * _least_ significant bit of the PDE index. + */ + if (role.has_4_byte_gpte) { + WARN_ON_ONCE(role.level != PG_LEVEL_4K); + role.quadrant = spte_index(sptep) & 1; + } + + return role; +} + +struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep, + gfn_t gfn, bool direct, + unsigned int access) +{ + union kvm_mmu_page_role role; + + if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) + return ERR_PTR(-EEXIST); + + role = kvm_mmu_child_role(sptep, direct, access); + return kvm_mmu_get_shadow_page(vcpu, gfn, role); +} + +void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, + struct kvm_vcpu *vcpu, hpa_t root, u64 addr) +{ + iterator->addr = addr; + iterator->shadow_addr = root; + iterator->level = vcpu->arch.mmu->root_role.level; + + if (iterator->level >= PT64_ROOT_4LEVEL && + vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL && + !vcpu->arch.mmu->root_role.direct) + iterator->level = PT32E_ROOT_LEVEL; + + if (iterator->level == PT32E_ROOT_LEVEL) { + /* + * prev_root is currently only used for 64-bit hosts. So only + * the active root_hpa is valid here. + */ + BUG_ON(root != vcpu->arch.mmu->root.hpa); + + iterator->shadow_addr + = vcpu->arch.mmu->pae_root[(addr >> 30) & 3]; + iterator->shadow_addr &= SPTE_BASE_ADDR_MASK; + --iterator->level; + if (!iterator->shadow_addr) + iterator->level = 0; + } +} + +void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, + struct kvm_vcpu *vcpu, u64 addr) +{ + shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa, + addr); +} + +bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator) +{ + if (iterator->level < PG_LEVEL_4K) + return false; + + iterator->index = SPTE_INDEX(iterator->addr, iterator->level); + iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index; + return true; +} + +static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator, + u64 spte) +{ + if (!is_shadow_present_pte(spte) || is_last_spte(spte, iterator->level)) { + iterator->level = 0; + return; + } + + iterator->shadow_addr = spte & SPTE_BASE_ADDR_MASK; + --iterator->level; +} + +void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) +{ + __shadow_walk_next(iterator, *iterator->sptep); +} + +static void __link_shadow_page(struct kvm *kvm, + struct kvm_mmu_memory_cache *cache, u64 *sptep, + struct kvm_mmu_page *sp, bool flush) +{ + u64 spte; + + BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK); + + /* + * If an SPTE is present already, it must be a leaf and therefore + * a large one. Drop it, and flush the TLB if needed, before + * installing sp. + */ + if (is_shadow_present_pte(*sptep)) + drop_large_spte(kvm, sptep, flush); + + spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp)); + + mmu_spte_set(sptep, spte); + + mmu_page_add_parent_pte(cache, sp, sptep); + + if (sp->unsync_children || sp->unsync) + mark_unsync(sptep); +} + +void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp) +{ + __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true); +} + +void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, + unsigned direct_access) +{ + if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) { + struct kvm_mmu_page *child; + + /* + * For the direct sp, if the guest pte's dirty bit + * changed form clean to dirty, it will corrupt the + * sp's access: allow writable in the read-only sp, + * so we should update the spte at this point to get + * a new sp with the correct access. + */ + child = spte_to_child_sp(*sptep); + if (child->role.access == direct_access) + return; + + drop_parent_pte(child, sptep); + kvm_flush_remote_tlbs_with_address(vcpu->kvm, child->gfn, 1); + } +} + +/* Returns the number of zapped non-leaf child shadow pages. */ +int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte, + struct list_head *invalid_list) +{ + u64 pte; + struct kvm_mmu_page *child; + + pte = *spte; + if (is_shadow_present_pte(pte)) { + if (is_last_spte(pte, sp->role.level)) { + drop_spte(kvm, spte); + } else { + child = spte_to_child_sp(pte); + drop_parent_pte(child, spte); + + /* + * Recursively zap nested TDP SPs, parentless SPs are + * unlikely to be used again in the near future. This + * avoids retaining a large number of stale nested SPs. + */ + if (tdp_enabled && invalid_list && + child->role.guest_mode && !child->parent_ptes.val) + return kvm_mmu_prepare_zap_page(kvm, child, + invalid_list); + } + } else if (is_mmio_spte(pte)) { + mmu_spte_clear_no_track(spte); + } + return 0; +} + +static int kvm_mmu_page_unlink_children(struct kvm *kvm, + struct kvm_mmu_page *sp, + struct list_head *invalid_list) +{ + int zapped = 0; + unsigned i; + + for (i = 0; i < SPTE_ENT_PER_PAGE; ++i) + zapped += mmu_page_zap_pte(kvm, sp, sp->spt + i, invalid_list); + + return zapped; +} + +static void kvm_mmu_unlink_parents(struct kvm_mmu_page *sp) +{ + u64 *sptep; + struct rmap_iterator iter; + + while ((sptep = rmap_get_first(&sp->parent_ptes, &iter))) + drop_parent_pte(sp, sptep); +} + +static int mmu_zap_unsync_children(struct kvm *kvm, + struct kvm_mmu_page *parent, + struct list_head *invalid_list) +{ + int i, zapped = 0; + struct mmu_page_path parents; + struct kvm_mmu_pages pages; + + if (parent->role.level == PG_LEVEL_4K) + return 0; + + while (mmu_unsync_walk(parent, &pages)) { + struct kvm_mmu_page *sp; + + for_each_sp(pages, sp, parents, i) { + kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); + mmu_pages_clear_parents(&parents); + zapped++; + } + } + + return zapped; +} + +bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list, + int *nr_zapped) +{ + bool list_unstable, zapped_root = false; + + trace_kvm_mmu_prepare_zap_page(sp); + ++kvm->stat.mmu_shadow_zapped; + *nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list); + *nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list); + kvm_mmu_unlink_parents(sp); + + /* Zapping children means active_mmu_pages has become unstable. */ + list_unstable = *nr_zapped; + + if (!sp->role.invalid && sp_has_gptes(sp)) + unaccount_shadowed(kvm, sp); + + if (sp->unsync) + kvm_unlink_unsync_page(kvm, sp); + if (!sp->root_count) { + /* Count self */ + (*nr_zapped)++; + + /* + * Already invalid pages (previously active roots) are not on + * the active page list. See list_del() in the "else" case of + * !sp->root_count. + */ + if (sp->role.invalid) + list_add(&sp->link, invalid_list); + else + list_move(&sp->link, invalid_list); + kvm_unaccount_mmu_page(kvm, sp); + } else { + /* + * Remove the active root from the active page list, the root + * will be explicitly freed when the root_count hits zero. + */ + list_del(&sp->link); + + /* + * Obsolete pages cannot be used on any vCPUs, see the comment + * in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also + * treats invalid shadow pages as being obsolete. + */ + zapped_root = !is_obsolete_sp(kvm, sp); + } + + if (sp->nx_huge_page_disallowed) + unaccount_nx_huge_page(kvm, sp); + + sp->role.invalid = 1; + + /* + * Make the request to free obsolete roots after marking the root + * invalid, otherwise other vCPUs may not see it as invalid. + */ + if (zapped_root) + kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS); + return list_unstable; +} + +bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list) +{ + int nr_zapped; + + __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped); + return nr_zapped; +} + +void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list) +{ + struct kvm_mmu_page *sp, *nsp; + + if (list_empty(invalid_list)) + return; + + /* + * We need to make sure everyone sees our modifications to + * the page tables and see changes to vcpu->mode here. The barrier + * in the kvm_flush_remote_tlbs() achieves this. This pairs + * with vcpu_enter_guest and walk_shadow_page_lockless_begin/end. + * + * In addition, kvm_flush_remote_tlbs waits for all vcpus to exit + * guest mode and/or lockless shadow page table walks. + */ + kvm_flush_remote_tlbs(kvm); + + list_for_each_entry_safe(sp, nsp, invalid_list, link) { + WARN_ON(!sp->role.invalid || sp->root_count); + kvm_mmu_free_shadow_page(sp); + } +} + +static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm, + unsigned long nr_to_zap) +{ + unsigned long total_zapped = 0; + struct kvm_mmu_page *sp, *tmp; + LIST_HEAD(invalid_list); + bool unstable; + int nr_zapped; + + if (list_empty(&kvm->arch.active_mmu_pages)) + return 0; + +restart: + list_for_each_entry_safe_reverse(sp, tmp, &kvm->arch.active_mmu_pages, link) { + /* + * Don't zap active root pages, the page itself can't be freed + * and zapping it will just force vCPUs to realloc and reload. + */ + if (sp->root_count) + continue; + + unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, + &nr_zapped); + total_zapped += nr_zapped; + if (total_zapped >= nr_to_zap) + break; + + if (unstable) + goto restart; + } + + kvm_mmu_commit_zap_page(kvm, &invalid_list); + + kvm->stat.mmu_recycled += total_zapped; + return total_zapped; +} + +static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm) +{ + if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages) + return kvm->arch.n_max_mmu_pages - + kvm->arch.n_used_mmu_pages; + + return 0; +} + +int make_mmu_pages_available(struct kvm_vcpu *vcpu) +{ + unsigned long avail = kvm_mmu_available_pages(vcpu->kvm); + + if (likely(avail >= KVM_MIN_FREE_MMU_PAGES)) + return 0; + + kvm_mmu_zap_oldest_mmu_pages(vcpu->kvm, KVM_REFILL_PAGES - avail); + + /* + * Note, this check is intentionally soft, it only guarantees that one + * page is available, while the caller may end up allocating as many as + * four pages, e.g. for PAE roots or for 5-level paging. Temporarily + * exceeding the (arbitrary by default) limit will not harm the host, + * being too aggressive may unnecessarily kill the guest, and getting an + * exact count is far more trouble than it's worth, especially in the + * page fault paths. + */ + if (!kvm_mmu_available_pages(vcpu->kvm)) + return -ENOSPC; + return 0; +} + +/* + * Changing the number of mmu pages allocated to the vm + * Note: if goal_nr_mmu_pages is too small, you will get dead lock + */ +void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages) +{ + write_lock(&kvm->mmu_lock); + + if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) { + kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages - + goal_nr_mmu_pages); + + goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages; + } + + kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages; + + write_unlock(&kvm->mmu_lock); +} + +int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) +{ + struct kvm_mmu_page *sp; + LIST_HEAD(invalid_list); + int r; + + pgprintk("%s: looking for gfn %llx\n", __func__, gfn); + r = 0; + write_lock(&kvm->mmu_lock); + for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) { + pgprintk("%s: gfn %llx role %x\n", __func__, gfn, + sp->role.word); + r = 1; + kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); + } + kvm_mmu_commit_zap_page(kvm, &invalid_list); + write_unlock(&kvm->mmu_lock); + + return r; +} + +int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) +{ + gpa_t gpa; + int r; + + if (vcpu->arch.mmu->root_role.direct) + return 0; + + gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL); + + r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT); + + return r; +} + +static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + trace_kvm_mmu_unsync_page(sp); + ++kvm->stat.mmu_unsync; + sp->unsync = 1; + + kvm_mmu_mark_parents_unsync(sp); +} + +/* + * Attempt to unsync any shadow pages that can be reached by the specified gfn, + * KVM is creating a writable mapping for said gfn. Returns 0 if all pages + * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must + * be write-protected. + */ +int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, + gfn_t gfn, bool can_unsync, bool prefetch) +{ + struct kvm_mmu_page *sp; + bool locked = false; + + /* + * Force write-protection if the page is being tracked. Note, the page + * track machinery is used to write-protect upper-level shadow pages, + * i.e. this guards the role.level == 4K assertion below! + */ + if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE)) + return -EPERM; + + /* + * The page is not write-tracked, mark existing shadow pages unsync + * unless KVM is synchronizing an unsync SP (can_unsync = false). In + * that case, KVM must complete emulation of the guest TLB flush before + * allowing shadow pages to become unsync (writable by the guest). + */ + for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) { + if (!can_unsync) + return -EPERM; + + if (sp->unsync) + continue; + + if (prefetch) + return -EEXIST; + + /* + * TDP MMU page faults require an additional spinlock as they + * run with mmu_lock held for read, not write, and the unsync + * logic is not thread safe. Take the spinklock regardless of + * the MMU type to avoid extra conditionals/parameters, there's + * no meaningful penalty if mmu_lock is held for write. + */ + if (!locked) { + locked = true; + spin_lock(&kvm->arch.mmu_unsync_pages_lock); + + /* + * Recheck after taking the spinlock, a different vCPU + * may have since marked the page unsync. A false + * positive on the unprotected check above is not + * possible as clearing sp->unsync _must_ hold mmu_lock + * for write, i.e. unsync cannot transition from 0->1 + * while this CPU holds mmu_lock for read (or write). + */ + if (READ_ONCE(sp->unsync)) + continue; + } + + WARN_ON(sp->role.level != PG_LEVEL_4K); + kvm_unsync_page(kvm, sp); + } + if (locked) + spin_unlock(&kvm->arch.mmu_unsync_pages_lock); + + /* + * We need to ensure that the marking of unsync pages is visible + * before the SPTE is updated to allow writes because + * kvm_mmu_sync_roots() checks the unsync flags without holding + * the MMU lock and so can race with this. If the SPTE was updated + * before the page had been marked as unsync-ed, something like the + * following could happen: + * + * CPU 1 CPU 2 + * --------------------------------------------------------------------- + * 1.2 Host updates SPTE + * to be writable + * 2.1 Guest writes a GPTE for GVA X. + * (GPTE being in the guest page table shadowed + * by the SP from CPU 1.) + * This reads SPTE during the page table walk. + * Since SPTE.W is read as 1, there is no + * fault. + * + * 2.2 Guest issues TLB flush. + * That causes a VM Exit. + * + * 2.3 Walking of unsync pages sees sp->unsync is + * false and skips the page. + * + * 2.4 Guest accesses GVA X. + * Since the mapping in the SP was not updated, + * so the old mapping for GVA X incorrectly + * gets used. + * 1.1 Host marks SP + * as unsync + * (sp->unsync = true) + * + * The write barrier below ensures that 1.1 happens before 1.2 and thus + * the situation in 2.4 does not arise. It pairs with the read barrier + * in is_unsync_root(), placed between 2.1's load of SPTE.W and 2.3. + */ + smp_wmb(); + + return 0; +} + +int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, + u64 *sptep, unsigned int pte_access, gfn_t gfn, + kvm_pfn_t pfn, struct kvm_page_fault *fault) +{ + struct kvm_mmu_page *sp = sptep_to_sp(sptep); + int level = sp->role.level; + int was_rmapped = 0; + int ret = RET_PF_FIXED; + bool flush = false; + bool wrprot; + u64 spte; + + /* Prefetching always gets a writable pfn. */ + bool host_writable = !fault || fault->map_writable; + bool prefetch = !fault || fault->prefetch; + bool write_fault = fault && fault->write; + + pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__, + *sptep, write_fault, gfn); + + if (unlikely(is_noslot_pfn(pfn))) { + vcpu->stat.pf_mmio_spte_created++; + mark_mmio_spte(vcpu, sptep, gfn, pte_access); + return RET_PF_EMULATE; + } + + if (is_shadow_present_pte(*sptep)) { + /* + * If we overwrite a PTE page pointer with a 2MB PMD, unlink + * the parent of the now unreachable PTE. + */ + if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) { + struct kvm_mmu_page *child; + u64 pte = *sptep; + + child = spte_to_child_sp(pte); + drop_parent_pte(child, sptep); + flush = true; + } else if (pfn != spte_to_pfn(*sptep)) { + pgprintk("hfn old %llx new %llx\n", + spte_to_pfn(*sptep), pfn); + drop_spte(vcpu->kvm, sptep); + flush = true; + } else + was_rmapped = 1; + } + + wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch, + true, host_writable, &spte); + + if (*sptep == spte) { + ret = RET_PF_SPURIOUS; + } else { + flush |= mmu_spte_update(sptep, spte); + trace_kvm_mmu_set_spte(level, gfn, sptep); + } + + if (wrprot) { + if (write_fault) + ret = RET_PF_EMULATE; + } + + if (flush) + kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, + KVM_PAGES_PER_HPAGE(level)); + + pgprintk("%s: setting spte %llx\n", __func__, *sptep); + + if (!was_rmapped) { + WARN_ON_ONCE(ret == RET_PF_SPURIOUS); + rmap_add(vcpu, slot, sptep, gfn, pte_access); + } else { + /* Already rmapped but the pte_access bits may have changed. */ + kvm_mmu_page_set_access(sp, spte_index(sptep), pte_access); + } + + return ret; +} + +static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp, + u64 *start, u64 *end) +{ + struct page *pages[PTE_PREFETCH_NUM]; + struct kvm_memory_slot *slot; + unsigned int access = sp->role.access; + int i, ret; + gfn_t gfn; + + gfn = kvm_mmu_page_get_gfn(sp, spte_index(start)); + slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK); + if (!slot) + return -1; + + ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start); + if (ret <= 0) + return -1; + + for (i = 0; i < ret; i++, gfn++, start++) { + mmu_set_spte(vcpu, slot, start, access, gfn, + page_to_pfn(pages[i]), NULL); + put_page(pages[i]); + } + + return 0; +} + +void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, + u64 *sptep) +{ + u64 *spte, *start = NULL; + int i; + + WARN_ON(!sp->role.direct); + + i = spte_index(sptep) & ~(PTE_PREFETCH_NUM - 1); + spte = sp->spt + i; + + for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) { + if (is_shadow_present_pte(*spte) || spte == sptep) { + if (!start) + continue; + if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0) + return; + start = NULL; + } else if (!start) + start = spte; + } + if (start) + direct_pte_prefetch_many(vcpu, sp, start, spte); +} + +static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) +{ + struct kvm_mmu_page *sp; + + sp = sptep_to_sp(sptep); + + /* + * Without accessed bits, there's no way to distinguish between + * actually accessed translations and prefetched, so disable pte + * prefetch if accessed bits aren't available. + */ + if (sp_ad_disabled(sp)) + return; + + if (sp->role.level > PG_LEVEL_4K) + return; + + /* + * If addresses are being invalidated, skip prefetching to avoid + * accidentally prefetching those addresses. + */ + if (unlikely(vcpu->kvm->mmu_invalidate_in_progress)) + return; + + __direct_pte_prefetch(vcpu, sp, sptep); +} + +int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) +{ + struct kvm_shadow_walk_iterator it; + struct kvm_mmu_page *sp; + int ret; + gfn_t base_gfn = fault->gfn; + + kvm_mmu_hugepage_adjust(vcpu, fault); + + trace_kvm_mmu_spte_requested(fault); + for_each_shadow_entry(vcpu, fault->addr, it) { + /* + * We cannot overwrite existing page tables with an NX + * large page, as the leaf could be executable. + */ + if (fault->nx_huge_page_workaround_enabled) + disallowed_hugepage_adjust(fault, *it.sptep, it.level); + + base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1); + if (it.level == fault->goal_level) + break; + + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL); + if (sp == ERR_PTR(-EEXIST)) + continue; + + link_shadow_page(vcpu, it.sptep, sp); + if (fault->huge_page_disallowed) + account_nx_huge_page(vcpu->kvm, sp, + fault->req_level >= it.level); + } + + if (WARN_ON_ONCE(it.level != fault->goal_level)) + return -EFAULT; + + ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL, + base_gfn, fault->pfn, fault); + if (ret == RET_PF_SPURIOUS) + return ret; + + direct_pte_prefetch(vcpu, it.sptep); + return ret; +} + +/* + * Returns the last level spte pointer of the shadow page walk for the given + * gpa, and sets *spte to the spte value. This spte may be non-preset. If no + * walk could be performed, returns NULL and *spte does not contain valid data. + * + * Contract: + * - Must be called between walk_shadow_page_lockless_{begin,end}. + * - The returned sptep must not be used after walk_shadow_page_lockless_end. + */ +u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte) +{ + struct kvm_shadow_walk_iterator iterator; + u64 old_spte; + u64 *sptep = NULL; + + for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) { + sptep = iterator.sptep; + *spte = old_spte; + } + + return sptep; +} + +void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu) +{ + unsigned long roots_to_free = 0; + hpa_t root_hpa; + int i; + + /* + * This should not be called while L2 is active, L2 can't invalidate + * _only_ its own roots, e.g. INVVPID unconditionally exits. + */ + WARN_ON_ONCE(mmu->root_role.guest_mode); + + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { + root_hpa = mmu->prev_roots[i].hpa; + if (!VALID_PAGE(root_hpa)) + continue; + + if (!to_shadow_page(root_hpa) || + to_shadow_page(root_hpa)->role.guest_mode) + roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i); + } + + kvm_mmu_free_roots(kvm, mmu, roots_to_free); +} +EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots); + + +static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn) +{ + int ret = 0; + + if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) { + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); + ret = 1; + } + + return ret; +} + +hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level) +{ + union kvm_mmu_page_role role = vcpu->arch.mmu->root_role; + struct kvm_mmu_page *sp; + + role.level = level; + role.quadrant = quadrant; + + WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte); + WARN_ON_ONCE(role.direct && role.has_4_byte_gpte); + + sp = kvm_mmu_get_shadow_page(vcpu, gfn, role); + ++sp->root_count; + + return __pa(sp->spt); +} + +static int mmu_first_shadow_root_alloc(struct kvm *kvm) +{ + struct kvm_memslots *slots; + struct kvm_memory_slot *slot; + int r = 0, i, bkt; + + /* + * Check if this is the first shadow root being allocated before + * taking the lock. + */ + if (kvm_shadow_root_allocated(kvm)) + return 0; + + mutex_lock(&kvm->slots_arch_lock); + + /* Recheck, under the lock, whether this is the first shadow root. */ + if (kvm_shadow_root_allocated(kvm)) + goto out_unlock; + + /* + * Check if anything actually needs to be allocated, e.g. all metadata + * will be allocated upfront if TDP is disabled. + */ + if (kvm_memslots_have_rmaps(kvm) && + kvm_page_track_write_tracking_enabled(kvm)) + goto out_success; + + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { + slots = __kvm_memslots(kvm, i); + kvm_for_each_memslot(slot, bkt, slots) { + /* + * Both of these functions are no-ops if the target is + * already allocated, so unconditionally calling both + * is safe. Intentionally do NOT free allocations on + * failure to avoid having to track which allocations + * were made now versus when the memslot was created. + * The metadata is guaranteed to be freed when the slot + * is freed, and will be kept/used if userspace retries + * KVM_RUN instead of killing the VM. + */ + r = memslot_rmap_alloc(slot, slot->npages); + if (r) + goto out_unlock; + r = kvm_page_track_write_tracking_alloc(slot); + if (r) + goto out_unlock; + } + } + + /* + * Ensure that shadow_root_allocated becomes true strictly after + * all the related pointers are set. + */ +out_success: + smp_store_release(&kvm->arch.shadow_root_allocated, true); + +out_unlock: + mutex_unlock(&kvm->slots_arch_lock); + return r; +} + +int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu *mmu = vcpu->arch.mmu; + u64 pdptrs[4], pm_mask; + gfn_t root_gfn, root_pgd; + int quadrant, i, r; + hpa_t root; + + root_pgd = mmu->get_guest_pgd(vcpu); + root_gfn = root_pgd >> PAGE_SHIFT; + + if (mmu_check_root(vcpu, root_gfn)) + return 1; + + /* + * On SVM, reading PDPTRs might access guest memory, which might fault + * and thus might sleep. Grab the PDPTRs before acquiring mmu_lock. + */ + if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) { + for (i = 0; i < 4; ++i) { + pdptrs[i] = mmu->get_pdptr(vcpu, i); + if (!(pdptrs[i] & PT_PRESENT_MASK)) + continue; + + if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT)) + return 1; + } + } + + r = mmu_first_shadow_root_alloc(vcpu->kvm); + if (r) + return r; + + write_lock(&vcpu->kvm->mmu_lock); + r = make_mmu_pages_available(vcpu); + if (r < 0) + goto out_unlock; + + /* + * Do we shadow a long mode page table? If so we need to + * write-protect the guests page table root. + */ + if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) { + root = mmu_alloc_root(vcpu, root_gfn, 0, + mmu->root_role.level); + mmu->root.hpa = root; + goto set_root_pgd; + } + + if (WARN_ON_ONCE(!mmu->pae_root)) { + r = -EIO; + goto out_unlock; + } + + /* + * We shadow a 32 bit page table. This may be a legacy 2-level + * or a PAE 3-level page table. In either case we need to be aware that + * the shadow page table may be a PAE or a long mode page table. + */ + pm_mask = PT_PRESENT_MASK | shadow_me_value; + if (mmu->root_role.level >= PT64_ROOT_4LEVEL) { + pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK; + + if (WARN_ON_ONCE(!mmu->pml4_root)) { + r = -EIO; + goto out_unlock; + } + mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask; + + if (mmu->root_role.level == PT64_ROOT_5LEVEL) { + if (WARN_ON_ONCE(!mmu->pml5_root)) { + r = -EIO; + goto out_unlock; + } + mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask; + } + } + + for (i = 0; i < 4; ++i) { + WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); + + if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) { + if (!(pdptrs[i] & PT_PRESENT_MASK)) { + mmu->pae_root[i] = INVALID_PAE_ROOT; + continue; + } + root_gfn = pdptrs[i] >> PAGE_SHIFT; + } + + /* + * If shadowing 32-bit non-PAE page tables, each PAE page + * directory maps one quarter of the guest's non-PAE page + * directory. Othwerise each PAE page direct shadows one guest + * PAE page directory so that quadrant should be 0. + */ + quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0; + + root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL); + mmu->pae_root[i] = root | pm_mask; + } + + if (mmu->root_role.level == PT64_ROOT_5LEVEL) + mmu->root.hpa = __pa(mmu->pml5_root); + else if (mmu->root_role.level == PT64_ROOT_4LEVEL) + mmu->root.hpa = __pa(mmu->pml4_root); + else + mmu->root.hpa = __pa(mmu->pae_root); + +set_root_pgd: + mmu->root.pgd = root_pgd; +out_unlock: + write_unlock(&vcpu->kvm->mmu_lock); + + return r; +} + +int mmu_alloc_special_roots(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu *mmu = vcpu->arch.mmu; + bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL; + u64 *pml5_root = NULL; + u64 *pml4_root = NULL; + u64 *pae_root; + + /* + * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP + * tables are allocated and initialized at root creation as there is no + * equivalent level in the guest's NPT to shadow. Allocate the tables + * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare. + */ + if (mmu->root_role.direct || + mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL || + mmu->root_role.level < PT64_ROOT_4LEVEL) + return 0; + + /* + * NPT, the only paging mode that uses this horror, uses a fixed number + * of levels for the shadow page tables, e.g. all MMUs are 4-level or + * all MMus are 5-level. Thus, this can safely require that pml5_root + * is allocated if the other roots are valid and pml5 is needed, as any + * prior MMU would also have required pml5. + */ + if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root)) + return 0; + + /* + * The special roots should always be allocated in concert. Yell and + * bail if KVM ends up in a state where only one of the roots is valid. + */ + if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root || + (need_pml5 && mmu->pml5_root))) + return -EIO; + + /* + * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and + * doesn't need to be decrypted. + */ + pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); + if (!pae_root) + return -ENOMEM; + +#ifdef CONFIG_X86_64 + pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); + if (!pml4_root) + goto err_pml4; + + if (need_pml5) { + pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); + if (!pml5_root) + goto err_pml5; + } +#endif + + mmu->pae_root = pae_root; + mmu->pml4_root = pml4_root; + mmu->pml5_root = pml5_root; + + return 0; + +#ifdef CONFIG_X86_64 +err_pml5: + free_page((unsigned long)pml4_root); +err_pml4: + free_page((unsigned long)pae_root); + return -ENOMEM; +#endif +} + +static bool is_unsync_root(hpa_t root) +{ + struct kvm_mmu_page *sp; + + if (!VALID_PAGE(root)) + return false; + + /* + * The read barrier orders the CPU's read of SPTE.W during the page table + * walk before the reads of sp->unsync/sp->unsync_children here. + * + * Even if another CPU was marking the SP as unsync-ed simultaneously, + * any guest page table changes are not guaranteed to be visible anyway + * until this VCPU issues a TLB flush strictly after those changes are + * made. We only need to ensure that the other CPU sets these flags + * before any actual changes to the page tables are made. The comments + * in mmu_try_to_unsync_pages() describe what could go wrong if this + * requirement isn't satisfied. + */ + smp_rmb(); + sp = to_shadow_page(root); + + /* + * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the + * PDPTEs for a given PAE root need to be synchronized individually. + */ + if (WARN_ON_ONCE(!sp)) + return false; + + if (sp->unsync || sp->unsync_children) + return true; + + return false; +} + +void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu) +{ + int i; + struct kvm_mmu_page *sp; + + if (vcpu->arch.mmu->root_role.direct) + return; + + if (!VALID_PAGE(vcpu->arch.mmu->root.hpa)) + return; + + vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY); + + if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) { + hpa_t root = vcpu->arch.mmu->root.hpa; + sp = to_shadow_page(root); + + if (!is_unsync_root(root)) + return; + + write_lock(&vcpu->kvm->mmu_lock); + mmu_sync_children(vcpu, sp, true); + write_unlock(&vcpu->kvm->mmu_lock); + return; + } + + write_lock(&vcpu->kvm->mmu_lock); + + for (i = 0; i < 4; ++i) { + hpa_t root = vcpu->arch.mmu->pae_root[i]; + + if (IS_VALID_PAE_ROOT(root)) { + sp = spte_to_child_sp(root); + mmu_sync_children(vcpu, sp, true); + } + } + + write_unlock(&vcpu->kvm->mmu_lock); +} + +void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu) +{ + unsigned long roots_to_free = 0; + int i; + + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) + if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa)) + roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i); + + /* sync prev_roots by simply freeing them */ + kvm_mmu_free_roots(vcpu->kvm, vcpu->arch.mmu, roots_to_free); +} + +/* + * Return the level of the lowest level SPTE added to sptes. + * That SPTE may be non-present. + * + * Must be called between walk_shadow_page_lockless_{begin,end}. + */ +int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level) +{ + struct kvm_shadow_walk_iterator iterator; + int leaf = -1; + u64 spte; + + for (shadow_walk_init(&iterator, vcpu, addr), + *root_level = iterator.level; + shadow_walk_okay(&iterator); + __shadow_walk_next(&iterator, spte)) { + leaf = iterator.level; + spte = mmu_spte_get_lockless(iterator.sptep); + + sptes[leaf] = spte; + } + + return leaf; +} + +void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr) +{ + struct kvm_shadow_walk_iterator iterator; + u64 spte; + + walk_shadow_page_lockless_begin(vcpu); + for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) + clear_sp_write_flooding_count(iterator.sptep); + walk_shadow_page_lockless_end(vcpu); +} + +static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa) +{ + struct kvm_mmu_page *sp; + + if (!VALID_PAGE(root_hpa)) + return false; + + /* + * When freeing obsolete roots, treat roots as obsolete if they don't + * have an associated shadow page. This does mean KVM will get false + * positives and free roots that don't strictly need to be freed, but + * such false positives are relatively rare: + * + * (a) only PAE paging and nested NPT has roots without shadow pages + * (b) remote reloads due to a memslot update obsoletes _all_ roots + * (c) KVM doesn't track previous roots for PAE paging, and the guest + * is unlikely to zap an in-use PGD. + */ + sp = to_shadow_page(root_hpa); + return !sp || is_obsolete_sp(kvm, sp); +} + +static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu) +{ + unsigned long roots_to_free = 0; + int i; + + if (is_obsolete_root(kvm, mmu->root.hpa)) + roots_to_free |= KVM_MMU_ROOT_CURRENT; + + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { + if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa)) + roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i); + } + + if (roots_to_free) + kvm_mmu_free_roots(kvm, mmu, roots_to_free); +} + +void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu) +{ + __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu); + __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu); +} + +static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa, + int *bytes) +{ + u64 gentry = 0; + int r; + + /* + * Assume that the pte write on a page table of the same type + * as the current vcpu paging mode since we update the sptes only + * when they have the same mode. + */ + if (is_pae(vcpu) && *bytes == 4) { + /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ + *gpa &= ~(gpa_t)7; + *bytes = 8; + } + + if (*bytes == 4 || *bytes == 8) { + r = kvm_vcpu_read_guest_atomic(vcpu, *gpa, &gentry, *bytes); + if (r) + gentry = 0; + } + + return gentry; +} + +/* + * If we're seeing too many writes to a page, it may no longer be a page table, + * or we may be forking, in which case it is better to unmap the page. + */ +static bool detect_write_flooding(struct kvm_mmu_page *sp) +{ + /* + * Skip write-flooding detected for the sp whose level is 1, because + * it can become unsync, then the guest page is not write-protected. + */ + if (sp->role.level == PG_LEVEL_4K) + return false; + + atomic_inc(&sp->write_flooding_count); + return atomic_read(&sp->write_flooding_count) >= 3; +} + +/* + * Misaligned accesses are too much trouble to fix up; also, they usually + * indicate a page is not used as a page table. + */ +static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa, + int bytes) +{ + unsigned offset, pte_size, misaligned; + + pgprintk("misaligned: gpa %llx bytes %d role %x\n", + gpa, bytes, sp->role.word); + + offset = offset_in_page(gpa); + pte_size = sp->role.has_4_byte_gpte ? 4 : 8; + + /* + * Sometimes, the OS only writes the last one bytes to update status + * bits, for example, in linux, andb instruction is used in clear_bit(). + */ + if (!(offset & (pte_size - 1)) && bytes == 1) + return false; + + misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1); + misaligned |= bytes < 4; + + return misaligned; +} + +static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte) +{ + unsigned page_offset, quadrant; + u64 *spte; + int level; + + page_offset = offset_in_page(gpa); + level = sp->role.level; + *nspte = 1; + if (sp->role.has_4_byte_gpte) { + page_offset <<= 1; /* 32->64 */ + /* + * A 32-bit pde maps 4MB while the shadow pdes map + * only 2MB. So we need to double the offset again + * and zap two pdes instead of one. + */ + if (level == PT32_ROOT_LEVEL) { + page_offset &= ~7; /* kill rounding error */ + page_offset <<= 1; + *nspte = 2; + } + quadrant = page_offset >> PAGE_SHIFT; + page_offset &= ~PAGE_MASK; + if (quadrant != sp->role.quadrant) + return NULL; + } + + spte = &sp->spt[page_offset / sizeof(*spte)]; + return spte; +} + +void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, + int bytes, struct kvm_page_track_notifier_node *node) +{ + gfn_t gfn = gpa >> PAGE_SHIFT; + struct kvm_mmu_page *sp; + LIST_HEAD(invalid_list); + u64 entry, gentry, *spte; + int npte; + bool flush = false; + + /* + * If we don't have indirect shadow pages, it means no page is + * write-protected, so we can exit simply. + */ + if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) + return; + + pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); + + write_lock(&vcpu->kvm->mmu_lock); + + gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes); + + ++vcpu->kvm->stat.mmu_pte_write; + + for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) { + if (detect_write_misaligned(sp, gpa, bytes) || + detect_write_flooding(sp)) { + kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list); + ++vcpu->kvm->stat.mmu_flooded; + continue; + } + + spte = get_written_sptes(sp, gpa, &npte); + if (!spte) + continue; + + while (npte--) { + entry = *spte; + mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL); + if (gentry && sp->role.level != PG_LEVEL_4K) + ++vcpu->kvm->stat.mmu_pde_zapped; + if (is_shadow_present_pte(entry)) + flush = true; + ++spte; + } + } + kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush); + write_unlock(&vcpu->kvm->mmu_lock); +} + +/* The caller should hold mmu-lock before calling this function. */ +static __always_inline bool +slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot, + slot_level_handler fn, int start_level, int end_level, + gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield, + bool flush) +{ + struct slot_rmap_walk_iterator iterator; + + for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn, + end_gfn, &iterator) { + if (iterator.rmap) + flush |= fn(kvm, iterator.rmap, memslot); + + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { + if (flush && flush_on_yield) { + kvm_flush_remote_tlbs_with_address(kvm, + start_gfn, + iterator.gfn - start_gfn + 1); + flush = false; + } + cond_resched_rwlock_write(&kvm->mmu_lock); + } + } + + return flush; +} + +__always_inline bool slot_handle_level(struct kvm *kvm, + const struct kvm_memory_slot *memslot, + slot_level_handler fn, int start_level, + int end_level, bool flush_on_yield) +{ + return slot_handle_level_range(kvm, memslot, fn, start_level, + end_level, memslot->base_gfn, + memslot->base_gfn + memslot->npages - 1, + flush_on_yield, false); +} + +__always_inline bool slot_handle_level_4k(struct kvm *kvm, + const struct kvm_memory_slot *memslot, + slot_level_handler fn, + bool flush_on_yield) +{ + return slot_handle_level(kvm, memslot, fn, PG_LEVEL_4K, + PG_LEVEL_4K, flush_on_yield); +} + +#define BATCH_ZAP_PAGES 10 +void kvm_zap_obsolete_pages(struct kvm *kvm) +{ + struct kvm_mmu_page *sp, *node; + int nr_zapped, batch = 0; + bool unstable; + +restart: + list_for_each_entry_safe_reverse(sp, node, + &kvm->arch.active_mmu_pages, link) { + /* + * No obsolete valid page exists before a newly created page + * since active_mmu_pages is a FIFO list. + */ + if (!is_obsolete_sp(kvm, sp)) + break; + + /* + * Invalid pages should never land back on the list of active + * pages. Skip the bogus page, otherwise we'll get stuck in an + * infinite loop if the page gets put back on the list (again). + */ + if (WARN_ON(sp->role.invalid)) + continue; + + /* + * No need to flush the TLB since we're only zapping shadow + * pages with an obsolete generation number and all vCPUS have + * loaded a new root, i.e. the shadow pages being zapped cannot + * be in active use by the guest. + */ + if (batch >= BATCH_ZAP_PAGES && + cond_resched_rwlock_write(&kvm->mmu_lock)) { + batch = 0; + goto restart; + } + + unstable = __kvm_mmu_prepare_zap_page(kvm, sp, + &kvm->arch.zapped_obsolete_pages, &nr_zapped); + batch += nr_zapped; + + if (unstable) + goto restart; + } + + /* + * Kick all vCPUs (via remote TLB flush) before freeing the page tables + * to ensure KVM is not in the middle of a lockless shadow page table + * walk, which may reference the pages. The remote TLB flush itself is + * not required and is simply a convenient way to kick vCPUs as needed. + * KVM performs a local TLB flush when allocating a new root (see + * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are + * running with an obsolete MMU. + */ + kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); +} + +static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm) +{ + return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); +} + +bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) +{ + const struct kvm_memory_slot *memslot; + struct kvm_memslots *slots; + struct kvm_memslot_iter iter; + bool flush = false; + gfn_t start, end; + int i; + + if (!kvm_memslots_have_rmaps(kvm)) + return flush; + + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { + slots = __kvm_memslots(kvm, i); + + kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) { + memslot = iter.slot; + start = max(gfn_start, memslot->base_gfn); + end = min(gfn_end, memslot->base_gfn + memslot->npages); + if (WARN_ON_ONCE(start >= end)) + continue; + + flush = slot_handle_level_range(kvm, memslot, __kvm_zap_rmap, + PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, + start, end - 1, true, flush); + } + } + + return flush; +} + +bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot) +{ + return rmap_write_protect(rmap_head, false); +} + +static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep) +{ + struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep); + struct shadow_page_caches caches = {}; + union kvm_mmu_page_role role; + unsigned int access; + gfn_t gfn; + + gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep)); + access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep)); + + /* + * Note, huge page splitting always uses direct shadow pages, regardless + * of whether the huge page itself is mapped by a direct or indirect + * shadow page, since the huge page region itself is being directly + * mapped with smaller pages. + */ + role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access); + + /* Direct SPs do not require a shadowed_info_cache. */ + caches.page_header_cache = &kvm->arch.split_page_header_cache; + caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache; + + /* Safe to pass NULL for vCPU since requesting a direct SP. */ + return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role); +} + +static void shadow_mmu_split_huge_page(struct kvm *kvm, + const struct kvm_memory_slot *slot, + u64 *huge_sptep) + +{ + struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache; + u64 huge_spte = READ_ONCE(*huge_sptep); + struct kvm_mmu_page *sp; + bool flush = false; + u64 *sptep, spte; + gfn_t gfn; + int index; + + sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep); + + for (index = 0; index < SPTE_ENT_PER_PAGE; index++) { + sptep = &sp->spt[index]; + gfn = kvm_mmu_page_get_gfn(sp, index); + + /* + * The SP may already have populated SPTEs, e.g. if this huge + * page is aliased by multiple sptes with the same access + * permissions. These entries are guaranteed to map the same + * gfn-to-pfn translation since the SP is direct, so no need to + * modify them. + * + * However, if a given SPTE points to a lower level page table, + * that lower level page table may only be partially populated. + * Installing such SPTEs would effectively unmap a potion of the + * huge page. Unmapping guest memory always requires a TLB flush + * since a subsequent operation on the unmapped regions would + * fail to detect the need to flush. + */ + if (is_shadow_present_pte(*sptep)) { + flush |= !is_last_spte(*sptep, sp->role.level); + continue; + } + + spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index); + mmu_spte_set(sptep, spte); + __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access); + } + + __link_shadow_page(kvm, cache, huge_sptep, sp, flush); +} + +static int shadow_mmu_try_split_huge_page(struct kvm *kvm, + const struct kvm_memory_slot *slot, + u64 *huge_sptep) +{ + struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep); + int level, r = 0; + gfn_t gfn; + u64 spte; + + /* Grab information for the tracepoint before dropping the MMU lock. */ + gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep)); + level = huge_sp->role.level; + spte = *huge_sptep; + + if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) { + r = -ENOSPC; + goto out; + } + + if (need_topup_split_caches_or_resched(kvm)) { + write_unlock(&kvm->mmu_lock); + cond_resched(); + /* + * If the topup succeeds, return -EAGAIN to indicate that the + * rmap iterator should be restarted because the MMU lock was + * dropped. + */ + r = topup_split_caches(kvm) ?: -EAGAIN; + write_lock(&kvm->mmu_lock); + goto out; + } + + shadow_mmu_split_huge_page(kvm, slot, huge_sptep); + +out: + trace_kvm_mmu_split_huge_page(gfn, spte, level, r); + return r; +} + +static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot) +{ + struct rmap_iterator iter; + struct kvm_mmu_page *sp; + u64 *huge_sptep; + int r; + +restart: + for_each_rmap_spte(rmap_head, &iter, huge_sptep) { + sp = sptep_to_sp(huge_sptep); + + /* TDP MMU is enabled, so rmap only contains nested MMU SPs. */ + if (WARN_ON_ONCE(!sp->role.guest_mode)) + continue; + + /* The rmaps should never contain non-leaf SPTEs. */ + if (WARN_ON_ONCE(!is_large_pte(*huge_sptep))) + continue; + + /* SPs with level >PG_LEVEL_4K should never by unsync. */ + if (WARN_ON_ONCE(sp->unsync)) + continue; + + /* Don't bother splitting huge pages on invalid SPs. */ + if (sp->role.invalid) + continue; + + r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep); + + /* + * The split succeeded or needs to be retried because the MMU + * lock was dropped. Either way, restart the iterator to get it + * back into a consistent state. + */ + if (!r || r == -EAGAIN) + goto restart; + + /* The split failed and shouldn't be retried (e.g. -ENOMEM). */ + break; + } + + return false; +} + +void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm, + const struct kvm_memory_slot *slot, + gfn_t start, gfn_t end, + int target_level) +{ + int level; + + /* + * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working + * down to the target level. This ensures pages are recursively split + * all the way to the target level. There's no need to split pages + * already at the target level. + */ + for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) { + slot_handle_level_range(kvm, slot, shadow_mmu_try_split_huge_pages, + level, level, start, end - 1, true, false); + } +} + +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot) +{ + u64 *sptep; + struct rmap_iterator iter; + int need_tlb_flush = 0; + struct kvm_mmu_page *sp; + +restart: + for_each_rmap_spte(rmap_head, &iter, sptep) { + sp = sptep_to_sp(sptep); + + /* + * We cannot do huge page mapping for indirect shadow pages, + * which are found on the last rmap (level = 1) when not using + * tdp; such shadow pages are synced with the page table in + * the guest, and the guest page table is using 4K page size + * mapping if the indirect sp has level = 1. + */ + if (sp->role.direct && + sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn, + PG_LEVEL_NUM)) { + kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); + + if (kvm_available_flush_tlb_with_range()) + kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, + KVM_PAGES_PER_HPAGE(sp->role.level)); + else + need_tlb_flush = 1; + + goto restart; + } + } + + return need_tlb_flush; +} + +void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, + const struct kvm_memory_slot *slot) +{ + /* + * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap + * pages that are already mapped at the maximum hugepage level. + */ + if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte, + PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true)) + kvm_arch_flush_remote_tlbs_memslot(kvm, slot); +} + +unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) +{ + struct kvm *kvm; + int nr_to_scan = sc->nr_to_scan; + unsigned long freed = 0; + + mutex_lock(&kvm_lock); + + list_for_each_entry(kvm, &vm_list, vm_list) { + int idx; + LIST_HEAD(invalid_list); + + /* + * Never scan more than sc->nr_to_scan VM instances. + * Will not hit this condition practically since we do not try + * to shrink more than one VM and it is very unlikely to see + * !n_used_mmu_pages so many times. + */ + if (!nr_to_scan--) + break; + /* + * n_used_mmu_pages is accessed without holding kvm->mmu_lock + * here. We may skip a VM instance errorneosly, but we do not + * want to shrink a VM that only started to populate its MMU + * anyway. + */ + if (!kvm->arch.n_used_mmu_pages && + !kvm_has_zapped_obsolete_pages(kvm)) + continue; + + idx = srcu_read_lock(&kvm->srcu); + write_lock(&kvm->mmu_lock); + + if (kvm_has_zapped_obsolete_pages(kvm)) { + kvm_mmu_commit_zap_page(kvm, + &kvm->arch.zapped_obsolete_pages); + goto unlock; + } + + freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan); + +unlock: + write_unlock(&kvm->mmu_lock); + srcu_read_unlock(&kvm->srcu, idx); + + /* + * unfair on small ones + * per-vm shrinkers cry out + * sadness comes quickly + */ + list_move_tail(&kvm->vm_list, &vm_list); + break; + } + + mutex_unlock(&kvm_lock); + return freed; +} diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 719b10f6c403..83876047c1f5 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -5,4 +5,149 @@ #include +/* make pte_list_desc fit well in cache lines */ +#define PTE_LIST_EXT 14 + +/* + * Slight optimization of cacheline layout, by putting `more' and `spte_count' + * at the start; then accessing it will only use one single cacheline for + * either full (entries==PTE_LIST_EXT) case or entries<=6. + */ +struct pte_list_desc { + struct pte_list_desc *more; + /* + * Stores number of entries stored in the pte_list_desc. No need to be + * u64 but just for easier alignment. When PTE_LIST_EXT, means full. + */ + u64 spte_count; + u64 *sptes[PTE_LIST_EXT]; +}; + +unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); + +struct kvm_shadow_walk_iterator { + u64 addr; + hpa_t shadow_addr; + u64 *sptep; + int level; + unsigned index; +}; + +#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \ + for (shadow_walk_init_using_root(&(_walker), (_vcpu), \ + (_root), (_addr)); \ + shadow_walk_okay(&(_walker)); \ + shadow_walk_next(&(_walker))) + +bool mmu_spte_update(u64 *sptep, u64 new_spte); +void mmu_spte_clear_no_track(u64 *sptep); +gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index); +void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, + unsigned int access); + +struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, + const struct kvm_memory_slot *slot); +bool rmap_can_add(struct kvm_vcpu *vcpu); +void drop_spte(struct kvm *kvm, u64 *sptep); +bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect); +bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot); +bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t unused); +bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t pte); + +typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, + int level, pte_t pte); +bool kvm_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, + rmap_handler_t handler); + +bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t unused); +bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, + int level, pte_t unused); + +void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte); +int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp); +int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent, + bool can_yield); +void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp); +void clear_sp_write_flooding_count(u64 *spte); + +struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep, + gfn_t gfn, bool direct, + unsigned int access); + +void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, + struct kvm_vcpu *vcpu, hpa_t root, u64 addr); +void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, + struct kvm_vcpu *vcpu, u64 addr); +bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator); +void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator); + +void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp); + +void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, + unsigned direct_access); + +int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte, + struct list_head *invalid_list); +bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list, + int *nr_zapped); +bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list); +void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list); + +int make_mmu_pages_available(struct kvm_vcpu *vcpu); + +int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); + +int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, + u64 *sptep, unsigned int pte_access, gfn_t gfn, + kvm_pfn_t pfn, struct kvm_page_fault *fault); +void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, + u64 *sptep); +int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); +u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte); + +hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level); +int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu); +int mmu_alloc_special_roots(struct kvm_vcpu *vcpu); + +int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level); + +void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr); +void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, + int bytes, struct kvm_page_track_notifier_node *node); + +/* The return value indicates if tlb flush on all vcpus is needed. */ +typedef bool (*slot_level_handler) (struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot); +bool slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot, + slot_level_handler fn, int start_level, int end_level, + bool flush_on_yield); +bool slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot, + slot_level_handler fn, bool flush_on_yield); + +void kvm_zap_obsolete_pages(struct kvm *kvm); +bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); + +bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot); + +void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm, + const struct kvm_memory_slot *slot, + gfn_t start, gfn_t end, + int target_level); +void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, + const struct kvm_memory_slot *slot); + +unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc); #endif /* __KVM_X86_MMU_SHADOW_MMU_H */ From patchwork Wed Dec 21 22:24:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35534 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945077dye; Wed, 21 Dec 2022 14:25:46 -0800 (PST) X-Google-Smtp-Source: AMrXdXtUs3qrko5d+0tA8+Zvk8YZwypDZJZ6+/syBHZPNtgB3fe18OVc48OfZKgLjd4JGn8bQhkg X-Received: by 2002:a17:906:555:b0:7c1:47bd:1814 with SMTP id k21-20020a170906055500b007c147bd1814mr2882247eja.63.1671661545968; Wed, 21 Dec 2022 14:25:45 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661545; cv=none; d=google.com; s=arc-20160816; b=n/iSHyl/bFWRCnCZHH+ShNv1EpngUjXqzJjVO+1onInQrw8tfzf2u1V2R1OcvH9JcY LYjZY+BGH+j56x6jPnm0EVNIKu2iex6YT9iHEPI3rU1p0JlE2VUESowhPuebuUxCOYfV pWBCJ/cKY8OAoIk+G/WIfPNc8Qt1yf/nW/ngGJlUVcSe7iS/Hld5KnkcADlCOjfrItna nefLNQv/MnM0gpAIuHxmkyWq8eBIXPWvy2Iqlv9qrKTg89D9eSgU+YmAengK8B+o6K7v 5quYkvH33dO/ON/8u5Sq+XNf/6wnpANKDUCncRzlLI3th5wajf0BtpxdEnlzMmMDJWRv ZC2g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=yXovaXGmQ2hj+xhfXzlYPf6jy3g3M6Z97i9sJYKtL7o=; b=trzKxWGTN903wVYP9nExslTVC5cKI1FLxoOKdCZ9Fn4sK0dHoyzCgensr/N1n6PiNP TpVdLYwGA+pKdBYUjMUe567bpKqeAy54Xxc7Ruicvpk7WWR3Fj5jk9rRWN41zYjSzSgQ zc1IuLMYiLqAjBXc0g7JOSwsXVdGmbAnN4i7MABuDdinp6cJRIHnrddRZFNSuviPCRQU WOf+HPa1Q5U+CUSM/gccH2OEaU/AQjTV1Vo9TBq3Wg2fRzgSleYs0E8CdtFcY8vGWsIl AhTgZ1FAGykjjE1lcxqww1VQcYCiwSYRny4WvZZyOAAO6mAZZmrhRcpuKSxq2TZeCRe5 2Xjg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=XUqeRrYp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id s26-20020aa7cb1a000000b0046cc0300c0fsi11123748edt.581.2022.12.21.14.25.21; Wed, 21 Dec 2022 14:25:45 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=XUqeRrYp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234986AbiLUWYo (ORCPT + 99 others); Wed, 21 Dec 2022 17:24:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36276 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234953AbiLUWYc (ORCPT ); Wed, 21 Dec 2022 17:24:32 -0500 Received: from mail-pl1-x649.google.com (mail-pl1-x649.google.com [IPv6:2607:f8b0:4864:20::649]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7FC8027170 for ; Wed, 21 Dec 2022 14:24:29 -0800 (PST) Received: by mail-pl1-x649.google.com with SMTP id c12-20020a170902d48c00b00189e5443387so126066plg.15 for ; Wed, 21 Dec 2022 14:24:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=yXovaXGmQ2hj+xhfXzlYPf6jy3g3M6Z97i9sJYKtL7o=; b=XUqeRrYpmjbz24aOUCRdZhU/hasfAHWkBLZ4lTvglulOiCrqV4g30TxCuKN75BwGCU 4LXBLsZDthjsLf+2AAZudjXxcPvh/idm+4kfCCahPLRDtw4awMTHPJRM+3GLuCE2z2xW UXWY9noDLDSwD7MOFEHp549vimmJUdsZlZLlkat6N1yZkyca8ji4la9bIC+7aRA7MNjJ YtbFM6nqgK4y9kxkqqzX2AYDFPzZeTYjsNQEJ2emwoQuCWJEmRyoXNEcGGrMTsjlBGmI JZjMUvnzAPwUHXLd7BkZn8APBej2IkOxylMhK0+Z0QbMqCvFV2hi5iU8e6JCqA3ryxd/ vIwg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=yXovaXGmQ2hj+xhfXzlYPf6jy3g3M6Z97i9sJYKtL7o=; b=0TuAqbB2rmzGeIoqEtRGCLJTNwjCCqIuY4Lu4U5VCC10Z38f9pvtSl8cb9oo3TYnRn 2CGLtAYUViLC2jMKGAeL3aOPEJBbxG59nQt7ve8apuHWvcQpL4wiXXCqtaQ9VUqQEri2 oN3dST/0B5gYZlUyGELemAm9St4G9OQQ77YY1ln59+M9EB6M9fZzftNX9hNjTDNcdCdP njB8X41QrXrCd8S4EET3NBUB0kLN7/OgbKtCMXf5K3JWCoy2xcfcHjzdEo/8I4jix16e U+9E0gm7GbsWydf+EiMC/H88cYzauLapxL3PQWDeH257NgAhETxJmv85P/WmF1Kqfpww H37A== X-Gm-Message-State: AFqh2kpBZpoWykiSmuD8dvv65obE2uJu+PHJVdLELw3UxRKQrt2JPlhO Olusev+qNXu0fnlNUqEiAoW1bxLozpN/gSB1Wz29CDnAluYrEyS9gmI22dJqrn42u/aiismLsR6 u4s95AfrwHRgGxr89FE2KFKQi/IWbyx+Zajd1KNV7mkDE6DOFln//bVohIArHItQ92Wd0FCg0 X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a17:90a:9913:b0:225:aa06:8fcb with SMTP id b19-20020a17090a991300b00225aa068fcbmr167806pjp.78.1671661468931; Wed, 21 Dec 2022 14:24:28 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:08 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-5-bgardon@google.com> Subject: [RFC 04/14] KVM: x86/MMU: Expose functions for paging_tmpl.h From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864176882421661?= X-GMAIL-MSGID: =?utf-8?q?1752864176882421661?= In preparation for moving paging_tmpl.h to shadow_mmu.c, expose various functions it needs through mmu_internal.h. This includes modifying the BUILD_MMU_ROLE_ACCESSOR macro so that it does not automatically include the static label, since some but not all of the accessors are needed by paging_tmpl.h. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++---------------- arch/x86/kvm/mmu/mmu_internal.h | 16 ++++++++++++++++ 2 files changed, 32 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index bf14e181eb12..a17e8a79e4df 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -153,18 +153,18 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA); * and the vCPU may be incorrect/irrelevant. */ #define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name) \ -static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu) \ +inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu) \ { \ return !!(mmu->cpu_role. base_or_ext . reg##_##name); \ } BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp); -BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pse); +static BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pse); BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smep); -BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smap); -BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke); -BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57); +static BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smap); +static BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke); +static BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57); BUILD_MMU_ROLE_ACCESSOR(base, efer, nx); -BUILD_MMU_ROLE_ACCESSOR(ext, efer, lma); +static BUILD_MMU_ROLE_ACCESSOR(ext, efer, lma); static inline bool is_cr0_pg(struct kvm_mmu *mmu) { @@ -210,7 +210,7 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, kvm_flush_remote_tlbs_with_range(kvm, &range); } -static gfn_t get_mmio_spte_gfn(u64 spte) +gfn_t get_mmio_spte_gfn(u64 spte) { u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask; @@ -240,7 +240,7 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte) return likely(kvm_gen == spte_gen); } -static int is_cpuid_PSE36(void) +int is_cpuid_PSE36(void) { return 1; } @@ -279,7 +279,7 @@ void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) } } -static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) +int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) { int r; @@ -818,8 +818,8 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) return -EFAULT; } -static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, - unsigned int access) +int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + unsigned int access) { /* The pfn is invalid, report the error! */ if (unlikely(is_error_pfn(fault->pfn))) @@ -1275,8 +1275,8 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct) return RET_PF_RETRY; } -static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu, - struct kvm_page_fault *fault) +bool page_fault_handle_page_track(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault) { if (unlikely(fault->rsvd)) return false; @@ -1338,7 +1338,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); } -static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) +int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; bool async; @@ -1403,8 +1403,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) * Returns true if the page fault is stale and needs to be retried, i.e. if the * root was invalidated by a memslot update or a relevant mmu_notifier fired. */ -static bool is_page_fault_stale(struct kvm_vcpu *vcpu, - struct kvm_page_fault *fault, int mmu_seq) +bool is_page_fault_stale(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + int mmu_seq) { struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa); diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 74a99b67f09e..957376fcb333 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -341,6 +341,22 @@ bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp); void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu); void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu); +int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect); bool need_topup_split_caches_or_resched(struct kvm *kvm); int topup_split_caches(struct kvm *kvm); + +bool is_page_fault_stale(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + int mmu_seq); +bool page_fault_handle_page_track(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault); +int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); +int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + unsigned int access); + +gfn_t get_mmio_spte_gfn(u64 spte); + +bool is_efer_nx(struct kvm_mmu *mmu); +bool is_cr4_smep(struct kvm_mmu *mmu); +bool is_cr0_wp(struct kvm_mmu *mmu); +int is_cpuid_PSE36(void); #endif /* __KVM_X86_MMU_INTERNAL_H */ From patchwork Wed Dec 21 22:24:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35535 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945218dye; Wed, 21 Dec 2022 14:26:07 -0800 (PST) X-Google-Smtp-Source: AMrXdXvu+6Fv+82gTeHnF/e+f9caBU9YVN6EJ2CzMc5dz0LGtsCic9jqPbuYY1ZkLu8gdSA94MLf X-Received: by 2002:a05:6402:1654:b0:474:a583:2e25 with SMTP id s20-20020a056402165400b00474a5832e25mr2862021edx.5.1671661567617; Wed, 21 Dec 2022 14:26:07 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661567; cv=none; d=google.com; s=arc-20160816; b=ECFEbswZ9XQ0FQKvCVWMX0Qwo+E26W4UYksP5ih5GbNZ68/n81gdQaGaqyaBELco5u 1Bc2So8+5NhDh8wc9tn694dPmaNBDxkwKCrk2j4LhR9X3gZe4JpwaC8JHxOCYmxR/cJe ro1kk0U/chB15WM/stq/Y3Fo8RUmvuTyuqeWpUqTXE00h0GyOrCefnv7ZPSOUcy4vNm4 BwTEvVbk1sFhXSj4PXApt7ZUUuk4HlzkxbeDxjFFiz9xrdvVPSmt7pwzZwnZ1AJcGhdY q/GYyKQ4QUv6mjy/OfJCMN29GWAdcSzXt3iyo2K24jSmjgsLCsdRgKeRqCC/ypivgf4d VhhA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=8kZrCNGBjg1WS7ODt19uiL6/LMaIleV0jy9dXg5qXsU=; b=rDlFMa++aZW4qJ6o2YRXpxv3RZ1uiAgU61fmT9q3Wn1bAUWZPyZ4q1RFW073hwdFx3 IsnaW5ijX0ze+0ZsJ/82rqfNIxFj/wR/V5q9JSymrSWx3mrkqV0f7VSf4EwY1MD4vFzq S/fO7J4j/kATfVxbrLkIKj+3g2/V5WuNq09hB9v+2RQEwv0sSVQRAXgU8OnrsPzErHiv 493GmvFftlCW13LKt9fxbOw8SBFwm3z+Qngvjcsnh5DYqStpAeoSSiuuFlwYg2hhXd3H /zykvGPGzuKN4aV5rWMwwg1vSpkpouZil/opKier6mNzQO02U0RaN7FO0uAog+3Em4gG saog== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=gm4wVDeM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id w16-20020a056402269000b0046afe27742asi16128234edd.396.2022.12.21.14.25.43; Wed, 21 Dec 2022 14:26:07 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=gm4wVDeM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234993AbiLUWYr (ORCPT + 99 others); Wed, 21 Dec 2022 17:24:47 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234947AbiLUWYd (ORCPT ); Wed, 21 Dec 2022 17:24:33 -0500 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 652FC275C5 for ; Wed, 21 Dec 2022 14:24:31 -0800 (PST) Received: by mail-pf1-x449.google.com with SMTP id k21-20020aa78215000000b00575ab46ca2cso9221276pfi.20 for ; Wed, 21 Dec 2022 14:24:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=8kZrCNGBjg1WS7ODt19uiL6/LMaIleV0jy9dXg5qXsU=; b=gm4wVDeMjNDC25vIMLZJDuIuxtNrgMkW/U8DzY9/JjzdiZzvFqg5Z9nrrcuVrdI+Cr mlp9FAy3A+XDVeEZyHG8Xc52Dea5StINuVBgplWUrUUWZ69CNhfzLT4HzTN4LN1TEk0D IpYExoj/5fAFNK6+d1bgopQ28Zgn7SVezYfuMlI3pItzNuDAxxrhAWNTk67M+IKLP4c2 n+qybJS7g4bTiF/kDELV1VGdQmRTOxAQSwONIrFtaMW76yMhZ3d0Kp39BOtqavNeur3r Auhjsa2ghpUls+QQDvUrWngcD+RIh8YLSET0PWut8cdrrGwgVsH0f/PqQl4TB9PEBsIV uuCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=8kZrCNGBjg1WS7ODt19uiL6/LMaIleV0jy9dXg5qXsU=; b=3qYlzkh2EqzqZinkf69cEGb7NAwB/0/CJ/Gh9AMvVa4nj3tSpPniny5BDB84ZWMqPD szl0FhG4q5BNei3w4gsjobsItWLlaIf5LFFbrZ4FbSMeSH6jkKZcIoy1Adjh+Zkc6cf2 IPY3lnzcU+7aCTUNO7L7dSKXtMymGa5Mb+/yz9oMSXn7dqK+imbtdCeFo4SYo2xfNIED 2i3eCvH1WwQlVNHc9e/jERyrrfzs4wS8Nk/Z5EVJY/kMG6VjzbJSA+TasfAR09Hj/Cxe vAlTXFCSUJ0H/6ne9GHqYlQowE+w0RVDcliJIlxm2pNwj99qLPkfEcChv3XO4emH06gZ EZCA== X-Gm-Message-State: AFqh2kpKFf4FKPv7N7I3cQ8CeeYKww81LxEuZzz5FLjzoMfdPT2NEKoJ kVx7QQMLXGyM8lUf3BVLI8JwLhLrTvIkHvcd/VIuxwMeE9tnEFlLkL7lOFJT6lys5ZHHVBxXqaM zWaOEfKgBPtTlQIeJsmRiENRsDYRgfJfMFyfJe8uiy+p9HkcUjdmYBhLqz0t5/MPJgAgVUtHU X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a17:902:d510:b0:189:ced9:c9e7 with SMTP id b16-20020a170902d51000b00189ced9c9e7mr251984plg.108.1671661470555; Wed, 21 Dec 2022 14:24:30 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:09 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-6-bgardon@google.com> Subject: [RFC 05/14] KVM: x86/MMU: Move paging_tmpl.h includes to shadow_mmu.c From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864200282471866?= X-GMAIL-MSGID: =?utf-8?q?1752864200282471866?= Move the integration point for paging_tmpl.h to shadow_mmu.c since paging_tmpl.h is ostensibly part of the Shadow MMU. This requires modifying some of the definitions to be non-static and then exporting the pre-processed function names through shadow_mmu.h since they are needed for mmu context callbacks in mmu.c. This will facilitate cleanups in following commits because many of the functions being exposed by shadow_mmu.h are only needed by paging_tmpl.h. Those functions will no longer need to be exported. sync_mmio_spte() is only used by paging_tmpl.h, so move it along with the includes. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 29 ----------------------------- arch/x86/kvm/mmu/paging_tmpl.h | 11 +++++------ arch/x86/kvm/mmu/shadow_mmu.c | 30 ++++++++++++++++++++++++++++++ arch/x86/kvm/mmu/shadow_mmu.h | 25 ++++++++++++++++++++++++- 4 files changed, 59 insertions(+), 36 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a17e8a79e4df..dd97e346c786 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1699,35 +1699,6 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu) return kvm_read_cr3(vcpu); } -static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn, - unsigned int access) -{ - if (unlikely(is_mmio_spte(*sptep))) { - if (gfn != get_mmio_spte_gfn(*sptep)) { - mmu_spte_clear_no_track(sptep); - return true; - } - - mark_mmio_spte(vcpu, sptep, gfn, access); - return true; - } - - return false; -} - -#define PTTYPE_EPT 18 /* arbitrary */ -#define PTTYPE PTTYPE_EPT -#include "paging_tmpl.h" -#undef PTTYPE - -#define PTTYPE 64 -#include "paging_tmpl.h" -#undef PTTYPE - -#define PTTYPE 32 -#include "paging_tmpl.h" -#undef PTTYPE - static void __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check, u64 pa_bits_rsvd, int level, bool nx, bool gbpages, diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 0f6455072055..2e3b2aca64ad 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -787,7 +787,7 @@ FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu, * Returns: 1 if we need to emulate the instruction, 0 otherwise, or * a negative value on error. */ -static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) +int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct guest_walker walker; int r; @@ -897,7 +897,7 @@ static gpa_t FNAME(get_level1_sp_gpa)(struct kvm_mmu_page *sp) return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t); } -static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa) +void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa) { struct kvm_shadow_walk_iterator iterator; struct kvm_mmu_page *sp; @@ -957,9 +957,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa) } /* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */ -static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gpa_t addr, u64 access, - struct x86_exception *exception) +gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t addr, + u64 access, struct x86_exception *exception) { struct guest_walker walker; gpa_t gpa = INVALID_GPA; @@ -992,7 +991,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, * 0: the sp is synced and no tlb flushing is required * > 0: the sp is synced and tlb flushing is required */ -static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) +int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) { union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role; int i; diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 05d8f5be559d..86b5fb75d50a 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -10,6 +10,7 @@ * Shadow MMU also supports TDP, it's just less scalable. The Shadow and TDP * MMUs can cooperate to support nested virtualization on hardware with TDP. */ +#include "ioapic.h" #include "mmu.h" #include "mmu_internal.h" #include "mmutrace.h" @@ -2798,6 +2799,35 @@ void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr) walk_shadow_page_lockless_end(vcpu); } +static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn, + unsigned int access) +{ + if (unlikely(is_mmio_spte(*sptep))) { + if (gfn != get_mmio_spte_gfn(*sptep)) { + mmu_spte_clear_no_track(sptep); + return true; + } + + mark_mmio_spte(vcpu, sptep, gfn, access); + return true; + } + + return false; +} + +#define PTTYPE_EPT 18 /* arbitrary */ +#define PTTYPE PTTYPE_EPT +#include "paging_tmpl.h" +#undef PTTYPE + +#define PTTYPE 64 +#include "paging_tmpl.h" +#undef PTTYPE + +#define PTTYPE 32 +#include "paging_tmpl.h" +#undef PTTYPE + static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa) { struct kvm_mmu_page *sp; diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 83876047c1f5..00d2f9abecf0 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -73,7 +73,6 @@ bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, int level, pte_t unused); void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte); -int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp); int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent, bool can_yield); void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp); @@ -150,4 +149,28 @@ void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *slot); unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc); + +/* Exports from paging_tmpl.h */ +gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, + gpa_t vaddr, u64 access, + struct x86_exception *exception); +gpa_t paging64_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, + gpa_t vaddr, u64 access, + struct x86_exception *exception); +gpa_t ept_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t vaddr, + u64 access, struct x86_exception *exception); + +int paging32_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); +int paging64_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); +int ept_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); + +int paging32_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp); +int paging64_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp); +int ept_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp); +/* Defined in shadow_mmu.c. */ +int nonpaging_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp); + +void paging32_invlpg(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root); +void paging64_invlpg(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root); +void ept_invlpg(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root); #endif /* __KVM_X86_MMU_SHADOW_MMU_H */ From patchwork Wed Dec 21 22:24:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35536 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945252dye; Wed, 21 Dec 2022 14:26:10 -0800 (PST) X-Google-Smtp-Source: AMrXdXtn6RdCp8ydM1IyQwNIOH0nWbbKfA8lfWx9eolUUu82RcXk6/gkl5dJsU96OUpTZEOUflZ4 X-Received: by 2002:aa7:da49:0:b0:470:489c:5369 with SMTP id w9-20020aa7da49000000b00470489c5369mr2841098eds.6.1671661570743; Wed, 21 Dec 2022 14:26:10 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661570; cv=none; d=google.com; s=arc-20160816; b=FXe5u+JGypm7R/hAjr4V72qON/6Cjjp1EPlWOtNYMvlBREuQzRx6syN5M/hRRIskAZ q72gcNFw8VOSGpvQNzZJcLUZlXc/R1GFh/591LJ5cVA+dhWSn017kyOa6Pp6yD2E8yif lhrPPjws4iTg1ksIhxBscKbV+IIdpzQ1ji4IRO1AzWyrraxhl6A7FNg2kUkIVGzaBQcW jDwjjO2J6PXcN9+Bz1nEsBohTaK5yIFw8k1WTdLHsTEfQNnVE3N4nuayyNr3uvHwACK4 JXIdcOz/mUx4yCEMoLl92nsgyu+Aafpqtg0WVNS6JEF3jZAF7sGurNRoVFY5ESsTV8kr BZ+g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=PpjDc5kQAW+k4ZQYHkXtWYHcVTdhQtxJVF/UKPc9J1o=; b=qBfzSIZV20f7u4OURungAFgiGE6HdxxK2XcU20EKc96j64SwFsdySAVDk3mOFF+2Za 9HMYUw7YtZxJslNmzBlr8XbQEvSiPY1+aOZzwQnXZDsCKNSRS6nCBOzZePwGq818FATv SLihxQwwa4hKzkwc6Z19jvy3DlAIzZHatSK05jBZhO8RO58nBO4L2nduEH+dfo5G/Ixe E0ul4CtvpwzSva8yrqjaryMKnOawKuaQqnaJI//5nUjy+T20xze1BwrbJkSs09EbcH4g QV1NjvjrhRAmDQeNYc4NrY1YvjEI6VH3a5rKrebHm9od7qsW7M6odMNdy3jT83LBrpgP HnVA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=iOK91BqR; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b11-20020a056402084b00b004739d471e12si16556728edz.280.2022.12.21.14.25.47; Wed, 21 Dec 2022 14:26:10 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=iOK91BqR; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235028AbiLUWZK (ORCPT + 99 others); Wed, 21 Dec 2022 17:25:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36280 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234954AbiLUWYi (ORCPT ); Wed, 21 Dec 2022 17:24:38 -0500 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4FAFF275DD for ; Wed, 21 Dec 2022 14:24:33 -0800 (PST) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-43ea89bc5d8so3642417b3.4 for ; Wed, 21 Dec 2022 14:24:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=PpjDc5kQAW+k4ZQYHkXtWYHcVTdhQtxJVF/UKPc9J1o=; b=iOK91BqR2al5dx6yxJYNdJ0KjbgTezozD9d7Mfm19BVc0Kn0bIzeV8O1a2Amyvc21A k4zIUHIPW0rEnb7NHeK7QnH/T9jDadvayvwB+uwAIHMBE+jIW0LKKiX43+hFzNddGEDL PsoSH7FSpEWOZODYGU8x75XtalOTNUa5zPk+Dw5g9eAW8EGY1XV1zBxEfiZC/iqpNyik T1BVGUnWV6RWCRPu8ua0k42ZlNt7NDeOb9uT5nVz4LTvl55RESt8bk00TJGWB9wl+4Xk 8Q+0BypI3uvPrwdnt6Xpgenc3i+yQX0gL4bvsoyke4sHKc9qjqTPrazgLRNuAaA9Jrhb se0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=PpjDc5kQAW+k4ZQYHkXtWYHcVTdhQtxJVF/UKPc9J1o=; b=UxyNhsisehhbdqPsuksS5CAJxdUcgLkT5+9Ou65WvAlomQ8BQjtGTbPz3jm026XkX/ Igyr0iim7tGkg/Shj4x1tpM7PJO0hkPHWACN1UYvB2QpnNKBC2OBr2uMN2/EXqha4Mgt u/cuyocZQeGtq+xFuHeQKjknl96qsEbVU9XJjigoSVRyem+nxGQVtQL3vbq+JtRYTeK7 RNxEGhkXGlD95oHMfVzOFhBkoJyq48722IYguQzNDwiaXsspxOioJDwUbs2u/3h4aBn0 Zh//vtk0VsleMRtc98cI9/xBqHF3LuRSc4y8QCH4tcWjHlAvjtY6R9z6G8M6tGjkdOK8 EAjg== X-Gm-Message-State: AFqh2kr9mA+Gvg8AQKiey2WyWgbFjXZBmtE5yhOsDCTk8vqpI4/8Z62Z ByFdqlyMi3Wp71PTvJG/cFK2I8FlPcxzCiKCnx9nMS8S8N17NAeiZx8T+ifA2ZfxZXQHWgaYl6A 3D142HwWx1gABIg/ZyUDO/qqf39bqtZ4Rvn3QxKg/sHSJYAsM4iJEzmWNIt+LNesiOTZkpX+c X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a05:6902:513:b0:6fa:5a47:c208 with SMTP id x19-20020a056902051300b006fa5a47c208mr376658ybs.472.1671661472442; Wed, 21 Dec 2022 14:24:32 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:10 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-7-bgardon@google.com> Subject: [RFC 06/14] KVM: x86/MMU: Clean up Shadow MMU exports From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864203130124496?= X-GMAIL-MSGID: =?utf-8?q?1752864203130124496?= Now that paging_tmpl.h is included from shadow_mmu.c, there's no need to export many of the functions currrently in shadow_mmu.h, so remove those exports and mark the functions static. This cleans up the interface of the Shadow MMU, and will allow the implementation to keep the details of rmap_heads internal. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/shadow_mmu.c | 78 +++++++++++++++++++++-------------- arch/x86/kvm/mmu/shadow_mmu.h | 51 +---------------------- 2 files changed, 48 insertions(+), 81 deletions(-) diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 86b5fb75d50a..090b4788f7de 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -21,6 +21,20 @@ #include #include +struct kvm_shadow_walk_iterator { + u64 addr; + hpa_t shadow_addr; + u64 *sptep; + int level; + unsigned index; +}; + +#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \ + for (shadow_walk_init_using_root(&(_walker), (_vcpu), \ + (_root), (_addr)); \ + shadow_walk_okay(&(_walker)); \ + shadow_walk_next(&(_walker))) + #define for_each_shadow_entry(_vcpu, _addr, _walker) \ for (shadow_walk_init(&(_walker), _vcpu, _addr); \ shadow_walk_okay(&(_walker)); \ @@ -227,7 +241,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte) * * Returns true if the TLB needs to be flushed */ -bool mmu_spte_update(u64 *sptep, u64 new_spte) +static bool mmu_spte_update(u64 *sptep, u64 new_spte) { bool flush = false; u64 old_spte = mmu_spte_update_no_track(sptep, new_spte); @@ -311,7 +325,7 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) * Directly clear spte without caring the state bits of sptep, * it is used to set the upper level spte. */ -void mmu_spte_clear_no_track(u64 *sptep) +static void mmu_spte_clear_no_track(u64 *sptep) { __update_clear_spte_fast(sptep, 0ull); } @@ -354,7 +368,7 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc) static bool sp_has_gptes(struct kvm_mmu_page *sp); -gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index) +static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index) { if (sp->role.passthrough) return sp->gfn; @@ -410,8 +424,8 @@ static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index, sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn); } -void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, - unsigned int access) +static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, + unsigned int access) { gfn_t gfn = kvm_mmu_page_get_gfn(sp, index); @@ -627,7 +641,7 @@ struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, return &slot->arch.rmap[level - PG_LEVEL_4K][idx]; } -bool rmap_can_add(struct kvm_vcpu *vcpu) +static bool rmap_can_add(struct kvm_vcpu *vcpu) { struct kvm_mmu_memory_cache *mc; @@ -735,7 +749,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter) for (_spte_ = rmap_get_first(_rmap_head_, _iter_); \ _spte_; _spte_ = rmap_get_next(_iter_)) -void drop_spte(struct kvm *kvm, u64 *sptep) +static void drop_spte(struct kvm *kvm, u64 *sptep) { u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep); @@ -1112,7 +1126,7 @@ static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp, pte_list_remove(parent_pte, &sp->parent_ptes); } -void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte) +static void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte) { mmu_page_remove_parent_pte(sp, parent_pte); mmu_spte_clear_no_track(parent_pte); @@ -1342,8 +1356,8 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents) } while (!sp->unsync_children); } -int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent, - bool can_yield) +static int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent, + bool can_yield) { int i; struct kvm_mmu_page *sp; @@ -1389,7 +1403,7 @@ void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp) atomic_set(&sp->write_flooding_count, 0); } -void clear_sp_write_flooding_count(u64 *spte) +static void clear_sp_write_flooding_count(u64 *spte) { __clear_sp_write_flooding_count(sptep_to_sp(spte)); } @@ -1602,9 +1616,9 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, return role; } -struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep, - gfn_t gfn, bool direct, - unsigned int access) +static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, + u64 *sptep, gfn_t gfn, + bool direct, unsigned int access) { union kvm_mmu_page_role role; @@ -1615,8 +1629,9 @@ struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep, return kvm_mmu_get_shadow_page(vcpu, gfn, role); } -void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, - struct kvm_vcpu *vcpu, hpa_t root, u64 addr) +static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, + struct kvm_vcpu *vcpu, hpa_t root, + u64 addr) { iterator->addr = addr; iterator->shadow_addr = root; @@ -1643,14 +1658,14 @@ void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, } } -void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, - struct kvm_vcpu *vcpu, u64 addr) +static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, + struct kvm_vcpu *vcpu, u64 addr) { shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa, addr); } -bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator) +static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator) { if (iterator->level < PG_LEVEL_4K) return false; @@ -1672,7 +1687,7 @@ static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator, --iterator->level; } -void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) +static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) { __shadow_walk_next(iterator, *iterator->sptep); } @@ -1703,13 +1718,14 @@ static void __link_shadow_page(struct kvm *kvm, mark_unsync(sptep); } -void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp) +static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, + struct kvm_mmu_page *sp) { __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true); } -void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, - unsigned direct_access) +static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, + unsigned direct_access) { if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) { struct kvm_mmu_page *child; @@ -1731,8 +1747,8 @@ void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, } /* Returns the number of zapped non-leaf child shadow pages. */ -int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte, - struct list_head *invalid_list) +static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte, + struct list_head *invalid_list) { u64 pte; struct kvm_mmu_page *child; @@ -2144,9 +2160,9 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, return 0; } -int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, - u64 *sptep, unsigned int pte_access, gfn_t gfn, - kvm_pfn_t pfn, struct kvm_page_fault *fault) +static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, + u64 *sptep, unsigned int pte_access, gfn_t gfn, + kvm_pfn_t pfn, struct kvm_page_fault *fault) { struct kvm_mmu_page *sp = sptep_to_sp(sptep); int level = sp->role.level; @@ -2251,8 +2267,8 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, return 0; } -void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, - u64 *sptep) +static void __direct_pte_prefetch(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp, u64 *sptep) { u64 *spte, *start = NULL; int i; @@ -2788,7 +2804,7 @@ int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level) return leaf; } -void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr) +static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr) { struct kvm_shadow_walk_iterator iterator; u64 spte; diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 00d2f9abecf0..20c65a0ea52c 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -23,32 +23,11 @@ struct pte_list_desc { u64 *sptes[PTE_LIST_EXT]; }; +/* Only exported for debugfs.c. */ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); -struct kvm_shadow_walk_iterator { - u64 addr; - hpa_t shadow_addr; - u64 *sptep; - int level; - unsigned index; -}; - -#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \ - for (shadow_walk_init_using_root(&(_walker), (_vcpu), \ - (_root), (_addr)); \ - shadow_walk_okay(&(_walker)); \ - shadow_walk_next(&(_walker))) - -bool mmu_spte_update(u64 *sptep, u64 new_spte); -void mmu_spte_clear_no_track(u64 *sptep); -gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index); -void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index, - unsigned int access); - struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, const struct kvm_memory_slot *slot); -bool rmap_can_add(struct kvm_vcpu *vcpu); -void drop_spte(struct kvm *kvm, u64 *sptep); bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect); bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, const struct kvm_memory_slot *slot); @@ -72,30 +51,8 @@ bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, struct kvm_memory_slot *slot, gfn_t gfn, int level, pte_t unused); -void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte); -int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent, - bool can_yield); void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp); -void clear_sp_write_flooding_count(u64 *spte); - -struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu, u64 *sptep, - gfn_t gfn, bool direct, - unsigned int access); - -void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator, - struct kvm_vcpu *vcpu, hpa_t root, u64 addr); -void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, - struct kvm_vcpu *vcpu, u64 addr); -bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator); -void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator); - -void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp); - -void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, - unsigned direct_access); -int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte, - struct list_head *invalid_list); bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, struct list_head *invalid_list, int *nr_zapped); @@ -107,11 +64,6 @@ int make_mmu_pages_available(struct kvm_vcpu *vcpu); int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); -int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot, - u64 *sptep, unsigned int pte_access, gfn_t gfn, - kvm_pfn_t pfn, struct kvm_page_fault *fault); -void __direct_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, - u64 *sptep); int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte); @@ -121,7 +73,6 @@ int mmu_alloc_special_roots(struct kvm_vcpu *vcpu); int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level); -void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes, struct kvm_page_track_notifier_node *node); From patchwork Wed Dec 21 22:24:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35537 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945405dye; Wed, 21 Dec 2022 14:26:29 -0800 (PST) X-Google-Smtp-Source: AMrXdXv60bl5CU4GudkJkTBEI8nlzTMKdNQlCRHgvoH1ugRtnZLFTlFfGvEOmWoA42LMiGZoEE8N X-Received: by 2002:a17:906:a18c:b0:7c1:5467:39b1 with SMTP id s12-20020a170906a18c00b007c1546739b1mr3034113ejy.72.1671661588891; Wed, 21 Dec 2022 14:26:28 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661588; cv=none; d=google.com; s=arc-20160816; b=cXWGGtBU0Tncp+IU8z0MGIXpXd04W/emom34PnDz6N14p3Id+GL/Oa5xA1zHiyzJtF Dsh8vKRaGhDU2oEWIx1OfIqZWD8jFR4gjjQY76SPCSEaRROoFtWNAv19H8zUV0udRMfC n3O8VbklJMU4pKwaValn3ap4zGdi3B7+eTA7uMas+BVexAS93wXc5u1QPt3DIG1pfHY/ DEfGXlvxZRklvQikiAEYN7asUGz1hN48qQ06DUScrSLgITP9jLuBVbF9m+2mqIJjQmJa S39hwgRNdrqRNwhqTnUyAa5/GITJVX2MkLIHKXPkuL45mFMHuxn2/K8hZBjR/3Wbe/vM 9zqQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=6Y1uFW2wjlbXMFly7tpXboV7VZ3E/dcx8Q2Gfsts8Tk=; b=0oCsHCVl7bSVeXtJU0NRqeEqDsbtZekyy9FjONpqMjwQh1shB9Pb866r5iPhbFRpNH bCZgq034C3XlOIzPX+l83V3RMMvIrQwdX9BttrqHQGkYiZcIeykZCTSVZG8JSBDHu34c 60H0IrkWfps6Kg3j0T+VmYBxsWB+58Em82vv2RKAnLeM4pltDH74UMIgtfhFpVuf6Rm3 mQO943qp8jpnSbk4ARtCxkYWRWJyKWu3K8LHwsoSCOsMqgAEaJfW/tSCCeuGl5cqD9YO SLSNCK680Eiwt/+UoztzqqtD2OkZqDGpT38GKBtxVggnQowe8cDHZ/Pzzq38Kyhv/PoF 4n/Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=j6BY0U68; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id nd19-20020a170907629300b007c10a0c6f44si16516262ejc.623.2022.12.21.14.26.04; Wed, 21 Dec 2022 14:26:28 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=j6BY0U68; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235053AbiLUWZP (ORCPT + 99 others); Wed, 21 Dec 2022 17:25:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234897AbiLUWYn (ORCPT ); Wed, 21 Dec 2022 17:24:43 -0500 Received: from mail-pg1-x549.google.com (mail-pg1-x549.google.com [IPv6:2607:f8b0:4864:20::549]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A5655275E2 for ; Wed, 21 Dec 2022 14:24:34 -0800 (PST) Received: by mail-pg1-x549.google.com with SMTP id r126-20020a632b84000000b004393806c06eso115391pgr.4 for ; Wed, 21 Dec 2022 14:24:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6Y1uFW2wjlbXMFly7tpXboV7VZ3E/dcx8Q2Gfsts8Tk=; b=j6BY0U68byYLgySr44DngXyEm134/1hFKLFShEVTVJ/IuhmuHUhPj2IFs5+alWpjes FnmgU/a+mWQ9lDW3CaNzgmNQp5HsZoHpHcyV/b1il72r6Xzaj9qXk5NXsYY0j+JXwNjk FZejHNVrbE5LJsqGqrP4ZPA/zp5q0qfmZqF/ezW2x9qxFl5GNHdMhzaOZVSwWHoYKYqH 08xYqohGPLH5zNwDVBBmvSSBKQHJMlkqT5i+Wtxx9V3Dvv+kdODLKmik7SKFpsmBhFQW cfjubSaMmD3KdtMWQs+eq26B7yPez7mGSsQDiVaiRL+/oXIuPIBq7yXEg4mMa1UUUuq9 /2Vg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6Y1uFW2wjlbXMFly7tpXboV7VZ3E/dcx8Q2Gfsts8Tk=; b=1riDbpmR3PBsfWaXdbhmHltNixPuiKvXqKUUA9YckAmohbhrrU6LWWM1MF3prMLQbh TOjWput9CrUesudEYch2F7krfS6G05Wuz6jjbnJ+YTD3RBkYpqvKP42uPDkxJefhkF0J BlwdC4Q4Yx759DBUrNf4bTNWjfgD3t80iCP64iRLS0nBMUXdos2XINyWUsNUOR+ZUXXY Ck7fMLsxnjDg+cr5diUMaWUjLFeizpKi5uwxI9Psrbgh1NZS47GoBXalcyluTJfNClwF xSV6ObAsJeE7cSoegtPT+5qs0euZi/Ax6e747jTjBMNZlYpECcBqBLI3l/j9NcCkPAR0 CvUg== X-Gm-Message-State: AFqh2kp8+5Q2MY4oziACu6vj5Bo3yy2aq8XMzYjx/n0QaDLZ1mERZGI4 c6HNIAGNejb1+iUBVuA58vttaKASGdSiA9le4RzDaU/TWs1ioCUK6iId/Km6CBT5c76Zfx+9iUu PYxqoG4T3oNZu0wX0H+8Po1gkZn1asiQhhqZfwbZ4AFR46IyMOcGYljn/NYwKXyrBgVmKLWkZ X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a17:90a:8183:b0:219:9874:c7d3 with SMTP id e3-20020a17090a818300b002199874c7d3mr320888pjn.221.1671661474119; Wed, 21 Dec 2022 14:24:34 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:11 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-8-bgardon@google.com> Subject: [RFC 07/14] KVM: x86/MMU: Cleanup shrinker interface with Shadow MMU From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864222163575817?= X-GMAIL-MSGID: =?utf-8?q?1752864222163575817?= The MMU shrinker currently only operates on the Shadow MMU, but having the entire implemenatation in shadow_mmu.c is awkward since much of the function isn't Shadow MMU specific. There has also been talk of changing the target of the shrinker to the MMU caches rather than already allocated page tables. As a result, it makes sense to move some of the implementation back to mmu.c. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 43 ++++++++++++++++++++++++ arch/x86/kvm/mmu/shadow_mmu.c | 62 ++++++++--------------------------- arch/x86/kvm/mmu/shadow_mmu.h | 3 +- 3 files changed, 58 insertions(+), 50 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index dd97e346c786..4c45a5b63356 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3147,6 +3147,49 @@ static unsigned long mmu_shrink_count(struct shrinker *shrink, return percpu_counter_read_positive(&kvm_total_used_mmu_pages); } +unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) +{ + struct kvm *kvm; + int nr_to_scan = sc->nr_to_scan; + unsigned long freed = 0; + + mutex_lock(&kvm_lock); + + list_for_each_entry(kvm, &vm_list, vm_list) { + /* + * Never scan more than sc->nr_to_scan VM instances. + * Will not hit this condition practically since we do not try + * to shrink more than one VM and it is very unlikely to see + * !n_used_mmu_pages so many times. + */ + if (!nr_to_scan--) + break; + + /* + * n_used_mmu_pages is accessed without holding kvm->mmu_lock + * here. We may skip a VM instance errorneosly, but we do not + * want to shrink a VM that only started to populate its MMU + * anyway. + */ + if (!kvm->arch.n_used_mmu_pages && + !kvm_shadow_mmu_has_zapped_obsolete_pages(kvm)) + continue; + + freed = kvm_shadow_mmu_shrink_scan(kvm, sc->nr_to_scan); + + /* + * unfair on small ones + * per-vm shrinkers cry out + * sadness comes quickly + */ + list_move_tail(&kvm->vm_list, &vm_list); + break; + } + + mutex_unlock(&kvm_lock); + return freed; +} + static struct shrinker mmu_shrinker = { .count_objects = mmu_shrink_count, .scan_objects = mmu_shrink_scan, diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 090b4788f7de..1259c4a3b140 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -3147,7 +3147,7 @@ void kvm_zap_obsolete_pages(struct kvm *kvm) kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); } -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm) +bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm) { return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); } @@ -3416,60 +3416,24 @@ void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, kvm_arch_flush_remote_tlbs_memslot(kvm, slot); } -unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) +unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free) { - struct kvm *kvm; - int nr_to_scan = sc->nr_to_scan; unsigned long freed = 0; + int idx; - mutex_lock(&kvm_lock); - - list_for_each_entry(kvm, &vm_list, vm_list) { - int idx; - LIST_HEAD(invalid_list); - - /* - * Never scan more than sc->nr_to_scan VM instances. - * Will not hit this condition practically since we do not try - * to shrink more than one VM and it is very unlikely to see - * !n_used_mmu_pages so many times. - */ - if (!nr_to_scan--) - break; - /* - * n_used_mmu_pages is accessed without holding kvm->mmu_lock - * here. We may skip a VM instance errorneosly, but we do not - * want to shrink a VM that only started to populate its MMU - * anyway. - */ - if (!kvm->arch.n_used_mmu_pages && - !kvm_has_zapped_obsolete_pages(kvm)) - continue; - - idx = srcu_read_lock(&kvm->srcu); - write_lock(&kvm->mmu_lock); - - if (kvm_has_zapped_obsolete_pages(kvm)) { - kvm_mmu_commit_zap_page(kvm, - &kvm->arch.zapped_obsolete_pages); - goto unlock; - } + idx = srcu_read_lock(&kvm->srcu); + write_lock(&kvm->mmu_lock); - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan); + if (kvm_shadow_mmu_has_zapped_obsolete_pages(kvm)) { + kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); + goto out; + } -unlock: - write_unlock(&kvm->mmu_lock); - srcu_read_unlock(&kvm->srcu, idx); + freed = kvm_mmu_zap_oldest_mmu_pages(kvm, pages_to_free); - /* - * unfair on small ones - * per-vm shrinkers cry out - * sadness comes quickly - */ - list_move_tail(&kvm->vm_list, &vm_list); - break; - } +out: + write_unlock(&kvm->mmu_lock); + srcu_read_unlock(&kvm->srcu, idx); - mutex_unlock(&kvm_lock); return freed; } diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 20c65a0ea52c..9952aa1e86cf 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -99,7 +99,8 @@ void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm, void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *slot); -unsigned long mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc); +bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm); +unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free); /* Exports from paging_tmpl.h */ gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, From patchwork Wed Dec 21 22:24:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35538 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945441dye; Wed, 21 Dec 2022 14:26:35 -0800 (PST) X-Google-Smtp-Source: AMrXdXv7GvyXTuhHTEZQNmJgwqnrdb2NJ83DyfFYXrB5nfyh+yWpCl2HfD5dhaXIJjXmhvyz78YR X-Received: by 2002:aa7:cf94:0:b0:46b:6214:44c8 with SMTP id z20-20020aa7cf94000000b0046b621444c8mr3112722edx.39.1671661595245; Wed, 21 Dec 2022 14:26:35 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661595; cv=none; d=google.com; s=arc-20160816; b=rxYhiQvtpBQA4upNGta/XFqjjLFwiL2jkelAKlKwMTL70xpN7yB36n+e3eDxOXJprJ 3tEq7xLk9mJr+5h40FTVqwBU9KUDb4vP12J0Z2FmqUpxgMJzY+IqtsLkFi6hk9n3xDPn 3hRknbsr12PAoQgkcd5mC7716AdEUNzvO+5lTRrgpSytXyBRmd37XrMhlhVl0oYb6Txs rDDUiEvaU1GeEq6v4/PpPkS+V+Ib2lZuZuNGcmZMRhrYDYbpe0qVYtehwMufRy3g673x q4UGazvyuI5m2GMm10gxLwDrl1xVLkOXnQ8XHpp9vxGfPC5KkOp1NlkLQDvWCq5oec4+ q/mQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=BM/X6lf5Q5u/NqA/5lapzPbr/TRyR56BZ+pcmqdxJ08=; b=HvurH+H96gdsZ4QaWX1nBvxkdph4vaxmh6OLc6uYQ3bI8tu6iwHS+6YdRe/MyvPv3m jtbI99Iu5vZfFGu6/tneh3FQhSL9af2g00ziTKjUWCV6tLoT21ahNznXgHkQFlJ9yX6I t99PSmZ9MEaFHSiR8NxF7s1CxRgK2HHG/PBKdv51FjEUVVNBudkjBjuhWrjHt9LS8T6A 68C/sGAfX0sirRKyW61WV8sCzaYlNB88PGYl2cHT6Go5XgZzze++Y1pvjdHFfs5sWVWL 6vOnbBimGGhwJR0/fxRWsy1tiZM/OofDLU3Y99LbMnbWsWJ3E9qmWAgwvBFH8/DBIP8P J8YA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=qb64BZX2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id k26-20020a508ada000000b00469d35e2f13si13766242edk.522.2022.12.21.14.26.11; Wed, 21 Dec 2022 14:26:35 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=qb64BZX2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235066AbiLUWZS (ORCPT + 99 others); Wed, 21 Dec 2022 17:25:18 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36276 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235001AbiLUWZC (ORCPT ); Wed, 21 Dec 2022 17:25:02 -0500 Received: from mail-pl1-x649.google.com (mail-pl1-x649.google.com [IPv6:2607:f8b0:4864:20::649]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 07730275E5 for ; Wed, 21 Dec 2022 14:24:36 -0800 (PST) Received: by mail-pl1-x649.google.com with SMTP id h2-20020a170902f54200b0018e56572a4eso137919plf.9 for ; Wed, 21 Dec 2022 14:24:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=BM/X6lf5Q5u/NqA/5lapzPbr/TRyR56BZ+pcmqdxJ08=; b=qb64BZX2skXQRlPzR/gphfOKSPRkDwvBMkn7oaOZpTZ2gp5EHreY2v4lXCy1a9ufgK xN2ODq1pNGyJ8WN+gbl5Ci9n4tmFG1VHrj2V2fhMwYoUFMMUHFOvA6vV2L6tsnpg8ygl tXUsBzQ9SDD3ITbzPXAV49ShZhPEEvUx778mW6XLnI+FkqcU0BR+o0nw9YXw9yC0wa9h UlbCyVeUoOUdGp0IYDzYl8rnePgr5SU26yYQ1K8wRgygvowtWShEeKO5b1Xbv74eRFD4 DZsd2gZiGHEBZb6KIU0q90BsAN3EJ8DQb4+qAU+ctS2eav0Bkp6We3Q3zzxklyeENq31 Hjxw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=BM/X6lf5Q5u/NqA/5lapzPbr/TRyR56BZ+pcmqdxJ08=; b=S7FrsbNZaFTtdaxJyk74zxStt7Q2Pz4O9pepPkAuuQ/Zm5Bo4C0Jix16eCYFHIi5hq Hyq5+ftQvCW8J26vo7zblkEoiFZUSEehhntC7McwBwS17dZmXUi9fQTsjhDq3/W77GvU 3uIm/j7whtcxTVtCho+k1ZQdlXxfw3ajBV3W3Wf98UiogI4YHK6D9F7B3DOvtSfddjsX pzTFhVwT/2NT/82Nnlew/srFJvaW9rmuU+QRLeyPzwI3wxPHWlleBJ1Y3wPY7sclLwMI Y55PwH+E1cNdhQRS834EDk35M/s/LD3dNBeqLEvkTN1UXFA0lSGf5J4Wa7Dx6ZAs40iM dPcw== X-Gm-Message-State: AFqh2kqsUGOoHagBfkT2YA2N8vSlfc3uHfnHSRf+ovDanlat7J+Xd9c5 frf3UQ9sNK9A7OK3lH3/gVp8fhzhLiFItnlzZjy2hYdKstSHXD/ppD670WEMZcybMp43R9CFQKY tgyoolkjkwTvJzD4uoBH4ufAQ149ecIi6IeOtyMIXc7NXXTqBj6+0meCUo2svVYTgZtJbLzT+ X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a05:6a00:1d8f:b0:578:7184:b495 with SMTP id z15-20020a056a001d8f00b005787184b495mr270797pfw.16.1671661475626; Wed, 21 Dec 2022 14:24:35 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:12 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-9-bgardon@google.com> Subject: [RFC 08/14] KVM: x86/MMU: Clean up naming of exported Shadow MMU functions From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864228859470178?= X-GMAIL-MSGID: =?utf-8?q?1752864228859470178?= Change the naming scheme on several functions exported from the shadow MMU to match the naming scheme used by the TDP MMU: kvm_shadow_mmu_. More cleanups will follow to convert the remaining functions to a similar naming scheme, but for now, start with the trivial renames. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 19 ++++++++++--------- arch/x86/kvm/mmu/paging_tmpl.h | 2 +- arch/x86/kvm/mmu/shadow_mmu.c | 19 ++++++++++--------- arch/x86/kvm/mmu/shadow_mmu.h | 17 +++++++++-------- 4 files changed, 30 insertions(+), 27 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4c45a5b63356..bacb519ba7b4 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1129,7 +1129,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) int r; write_lock(&vcpu->kvm->mmu_lock); - r = make_mmu_pages_available(vcpu); + r = kvm_shadow_mmu_make_pages_available(vcpu); if (r < 0) goto out_unlock; @@ -1204,7 +1204,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep) if (is_tdp_mmu(vcpu->arch.mmu)) leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root); else - leaf = get_walk(vcpu, addr, sptes, &root); + leaf = kvm_shadow_mmu_get_walk(vcpu, addr, sptes, &root); walk_shadow_page_lockless_end(vcpu); @@ -1469,14 +1469,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; - r = make_mmu_pages_available(vcpu); + r = kvm_shadow_mmu_make_pages_available(vcpu); if (r) goto out_unlock; if (is_tdp_mmu_fault) r = kvm_tdp_mmu_map(vcpu, fault); else - r = __direct_map(vcpu, fault); + r = kvm_shadow_mmu_direct_map(vcpu, fault); out_unlock: if (is_tdp_mmu_fault) @@ -1514,7 +1514,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, trace_kvm_page_fault(vcpu, fault_address, error_code); if (kvm_event_needs_reinjection(vcpu)) - kvm_mmu_unprotect_page_virt(vcpu, fault_address); + kvm_shadow_mmu_unprotect_page_virt(vcpu, fault_address); r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn, insn_len); } else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) { @@ -2791,7 +2791,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm) * In order to ensure all vCPUs drop their soon-to-be invalid roots, * invalidating TDP MMU roots must be done while holding mmu_lock for * write and in the same critical section as making the reload request, - * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield. + * e.g. before kvm_shadow_mmu_zap_obsolete_pages() could drop mmu_lock + * and yield. */ if (is_tdp_mmu_enabled(kvm)) kvm_tdp_mmu_invalidate_all_roots(kvm); @@ -2806,7 +2807,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm) */ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS); - kvm_zap_obsolete_pages(kvm); + kvm_shadow_mmu_zap_obsolete_pages(kvm); write_unlock(&kvm->mmu_lock); @@ -2892,7 +2893,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end); - flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end); + flush = kvm_shadow_mmu_zap_gfn_range(kvm, gfn_start, gfn_end); if (is_tdp_mmu_enabled(kvm)) { for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) @@ -3036,7 +3037,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, { if (kvm_memslots_have_rmaps(kvm)) { write_lock(&kvm->mmu_lock); - kvm_rmap_zap_collapsible_sptes(kvm, slot); + kvm_shadow_mmu_zap_collapsible_sptes(kvm, slot); write_unlock(&kvm->mmu_lock); } diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 2e3b2aca64ad..85c0b186cb0a 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -874,7 +874,7 @@ int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) if (is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; - r = make_mmu_pages_available(vcpu); + r = kvm_shadow_mmu_make_pages_available(vcpu); if (r) goto out_unlock; r = FNAME(fetch)(vcpu, fault, &walker); diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 1259c4a3b140..e36b4d9c67f2 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -1965,7 +1965,7 @@ static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm) return 0; } -int make_mmu_pages_available(struct kvm_vcpu *vcpu) +int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu) { unsigned long avail = kvm_mmu_available_pages(vcpu->kvm); @@ -2029,7 +2029,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) return r; } -int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) +int kvm_shadow_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) { gpa_t gpa; int r; @@ -2319,7 +2319,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) __direct_pte_prefetch(vcpu, sp, sptep); } -int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) +int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_shadow_walk_iterator it; struct kvm_mmu_page *sp; @@ -2537,7 +2537,7 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) return r; write_lock(&vcpu->kvm->mmu_lock); - r = make_mmu_pages_available(vcpu); + r = kvm_shadow_mmu_make_pages_available(vcpu); if (r < 0) goto out_unlock; @@ -2785,7 +2785,8 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu) * * Must be called between walk_shadow_page_lockless_{begin,end}. */ -int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level) +int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, + int *root_level) { struct kvm_shadow_walk_iterator iterator; int leaf = -1; @@ -3091,7 +3092,7 @@ __always_inline bool slot_handle_level_4k(struct kvm *kvm, } #define BATCH_ZAP_PAGES 10 -void kvm_zap_obsolete_pages(struct kvm *kvm) +void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm) { struct kvm_mmu_page *sp, *node; int nr_zapped, batch = 0; @@ -3152,7 +3153,7 @@ bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm) return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); } -bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) +bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) { const struct kvm_memory_slot *memslot; struct kvm_memslots *slots; @@ -3404,8 +3405,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, return need_tlb_flush; } -void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, - const struct kvm_memory_slot *slot) +void kvm_shadow_mmu_zap_collapsible_sptes(struct kvm *kvm, + const struct kvm_memory_slot *slot) { /* * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 9952aa1e86cf..148cc3593d2b 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -60,18 +60,19 @@ bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, struct list_head *invalid_list); void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list); -int make_mmu_pages_available(struct kvm_vcpu *vcpu); +int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu); -int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); +int kvm_shadow_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); -int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); +int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte); hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level); int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu); int mmu_alloc_special_roots(struct kvm_vcpu *vcpu); -int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level); +int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, + int *root_level); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes, struct kvm_page_track_notifier_node *node); @@ -86,8 +87,8 @@ bool slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot, bool slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot, slot_level_handler fn, bool flush_on_yield); -void kvm_zap_obsolete_pages(struct kvm *kvm); -bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); +void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm); +bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head, const struct kvm_memory_slot *slot); @@ -96,8 +97,8 @@ void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, gfn_t start, gfn_t end, int target_level); -void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, - const struct kvm_memory_slot *slot); +void kvm_shadow_mmu_zap_collapsible_sptes(struct kvm *kvm, + const struct kvm_memory_slot *slot); bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm); unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free); From patchwork Wed Dec 21 22:24:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35539 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945510dye; Wed, 21 Dec 2022 14:26:46 -0800 (PST) X-Google-Smtp-Source: AMrXdXuzIRiq21eHNEpNLIxKFm+kAeYCn/L5qcCEpcls58G2lW9NSXpjS/kxRTusfqoqU24GMm3P X-Received: by 2002:a17:907:98e1:b0:7c1:932c:96d7 with SMTP id ke1-20020a17090798e100b007c1932c96d7mr2895181ejc.58.1671661606058; Wed, 21 Dec 2022 14:26:46 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661606; cv=none; d=google.com; s=arc-20160816; b=bTJV5oDtyofoG+bVmdLmxvgjW4V0FBUlafIGOFTbvF2gt7vD9+0EzA5x41ndu6t5yJ 2/QQneKNeZ0Y4um1eTUDOlhgZUezzfKdka+TcZXz/5D8FwBH1F/oSOc3WpBRAevEW6j5 RplN5B6G8PKd22Wz43h8TFFUiXqTsk8oBPs2RmCA66zQlhSCFBnPW7+t4FIXOUs3XRTC MgbRFCC0iMrkR/rz+iLz1YM0xW7OgkLpx0dQLDcQCdBPq6XkZkEfTiSoELMWb0DC+Mh3 s8T3rKsG9TtQTWDaKjK2vxBghP6n0ddM8PN+9qHhL1MeAH8rhkGdLRrjLmh28yzeO9eY 5K6A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=BoetL47dMtSYpyhZbJzKl1EqSh6TREkVR0ldKUR5DTE=; b=AQb7EqPQvCH0luKopiODTS6JuUmgCVg9obKpHHfKa1si4A0LMZLbkVakiqNjErPgz5 ZiQcGmEH2zeBAv0jPvgx5LTldjVEpKiVjF3peIPvOdv0Thplrd9BOOT2q7samiaFgxZR 1WQGq88IxfVWa4QdP/wg3hTPrmFLQlu3/yoFR17tvniuoN0owJDw7c2F9NHq1gzFmNMr qIxuJyI2jcrC7z/EcpTZsDxn6cctBJ6RocZuxCBkDjV2VSGpVGo7pWiUqXG6tn4FJ+Ti ycLMGJ/wQSE7ZaEHUfdt1FiNnj/18f+S8oV0umeQctVQDJiE90dJa7o2Bluw8voEdci7 s/WA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=B2Il06BX; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id qf1-20020a1709077f0100b0078d112eaf80si16201050ejc.86.2022.12.21.14.26.22; Wed, 21 Dec 2022 14:26:46 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=B2Il06BX; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235076AbiLUWZY (ORCPT + 99 others); Wed, 21 Dec 2022 17:25:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36282 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235010AbiLUWZH (ORCPT ); Wed, 21 Dec 2022 17:25:07 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 26691275FD for ; Wed, 21 Dec 2022 14:24:37 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id h6-20020a17090aa88600b00223fccff2efso1952006pjq.6 for ; Wed, 21 Dec 2022 14:24:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=BoetL47dMtSYpyhZbJzKl1EqSh6TREkVR0ldKUR5DTE=; b=B2Il06BXZ5r75uO6Eh6VbInYjOxKdKJBrLZcwxld+Hj9NTKOaFJ4tEln9Ebd43HD3V XGsxjryczXGgpPcb9lEHqJHq6iSZOJDDZYCR9sJdtAcozqn8ZhJQBFnBtQAGVb8fs88i tS4KjdO/CHK6ygrF/kxg+BGf1GBeSdrqZz1Bs1DCwuZ4ALwelH4arRjImcrNksCQH0PJ fre2RUTACKiTUUq7uDfxqdnR50CEEj1LXW0xeHx730J4o2IACZ+mv8pTROJWGX0w8D/f digzIkmHy7fIFib/DTm+CSnIbYO09B6k8GNnVK//AhuZ3XugZ2+LvEEfUXboUxjCh8cR TbTg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=BoetL47dMtSYpyhZbJzKl1EqSh6TREkVR0ldKUR5DTE=; b=McvJXaPR7AYhQJfxx6jqlf+n1gVAK8XRb2FSdVeb2C7ypL/UF/V/FvaKCVv9kXWLtW NTu8H2c1yhngcFUlJGsgE57PpLA5VvP8s2bApJhl5TvCriJAUfZh2612M3mxKbjV/nZO vtlkqFRKolPpc+Qp6wswTRj7sF+T/NERmM7GP1Wd42y4bFVn8BA+tR6seut+0wBSBc8t tjYrfJEjTlFOyd4NunUrsd4INKKG3RL4gXy6p+zy85WtieUcEoo7TNi9Pe4nhA7B+Z3t vEZAoyqABAMubRYMzOjFB+g2BXdLH/QUI118DEcufB4ZDuwjZYTCT8MjgSUDu+t0nqqe JirQ== X-Gm-Message-State: AFqh2krLNWRKdx6zs7+re+LRRtTKtZd7bEbUW0+SYTvz9Twf8yR8eYHI P3iAR8uOovs27LBZfe25QexGWRqKPHwQWQuFS1plVyhuKBJK4RyVMla9TAs1lwswN0YKTUGE1uZ SrD7g0qCiOiRqHP7B5wA/tPrnBBRgNNqOXlCWd6xA5nNhO3JTgXlE6p4MEWsHJw3G7u1enLkr X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a63:9359:0:b0:478:bc15:bf28 with SMTP id w25-20020a639359000000b00478bc15bf28mr148390pgm.542.1671661477243; Wed, 21 Dec 2022 14:24:37 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:13 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-10-bgardon@google.com> Subject: [RFC 09/14] KVM: x86/MMU: Only make pages available on Shadow MMU fault From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864240445139509?= X-GMAIL-MSGID: =?utf-8?q?1752864240445139509?= Now that the Shadow MMU has been factored out of mmu.c and the naming sheme has been cleaned up, it's clear that there's an unnecessary operation in direct_page_fault(). Since the MMU page quota is only applied to the Shadow MMU, there's no point to calling kvm_shadow_mmu_make_pages_available on a fault where the TDP MMU is going to handle installing new TDP PTEs. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index bacb519ba7b4..568b36de9eeb 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1469,14 +1469,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; - r = kvm_shadow_mmu_make_pages_available(vcpu); - if (r) - goto out_unlock; - if (is_tdp_mmu_fault) r = kvm_tdp_mmu_map(vcpu, fault); - else + else { + r = kvm_shadow_mmu_make_pages_available(vcpu); + if (r) + goto out_unlock; r = kvm_shadow_mmu_direct_map(vcpu, fault); + } out_unlock: if (is_tdp_mmu_fault) From patchwork Wed Dec 21 22:24:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35545 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4946407dye; Wed, 21 Dec 2022 14:28:41 -0800 (PST) X-Google-Smtp-Source: AMrXdXsKDZ0b3yzul0EqLKDmkjb07hrYigLMp5zBi+LtrHS+5Wf7ApP1ajNBnK3/xjD7bhZQqlhL X-Received: by 2002:aa7:c90b:0:b0:470:362f:6ba9 with SMTP id b11-20020aa7c90b000000b00470362f6ba9mr6839272edt.41.1671661721389; Wed, 21 Dec 2022 14:28:41 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661721; cv=none; d=google.com; s=arc-20160816; b=iLcm/HqY1N8TrAj0vCX9RWXuVrGdu3FuMj8Q/phqxeYYnsJM/rgVR8V+GL+SL9Vvaa ALsmuTrL+kVn+NO+DDpwLG7XNqQwybb1NS0PrvkcqaNKlWW/yuvvSyiC4MpGyvbPwKR7 BFCJo5qL/QfTvu/v25BlDB4vi/VdJCOHmv2rmR0fnmgjWrTc6PQMriYyI4xKn/u1MH1H AMRx6Ok1nwbS0zm1YzFJIz6wcFUHCoOgQ9Ai9K+qyDY0nUM/YNwy/aew31uspCw34s1y KOpdnnVEKxWKigUfn92VLjJv09xsOAISVQgMvO5SxPJZd5OQugr5sCKPZRNB69dQNDo1 neCA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=6Jaams5HOGHGqoL3nWqw9y/u+ytv5a24WDfV/4t3+EM=; b=QY3Hw5bYBs4SmxijdemBDVyKwA+aHlHfS/4sbrreOeD5jlqhBJY2AKWhx32+lR64yf YuXQGxPZAmWgFkOLXO6X64OzIsCMnDW6Xsv74VAljUzmF57ItaOkps43B7pax7QS6mAK GPRzndJi7FY0EUL3o+CK452Z4TEQGTo7Jpni2oEFucSG69GEMKrPosaNurMNUbb9fWkK CAOLNJ0LPJpCH7cxaXKU2o0RI2/iIbySPj8a0CHU5nh4AVHDjZVwuP5iK4lj2X75FvQ7 YJxzVk+rD/ohD73HQHkeU7J5Bb4Q2aiuUJ3zqAuijoVzfQV7ZTM263eDNn8XJhEgYrqG lc0w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=LDwwXjzs; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id cw5-20020a056402228500b0046289aad428si13317112edb.496.2022.12.21.14.28.17; Wed, 21 Dec 2022 14:28:41 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=LDwwXjzs; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234954AbiLUWZn (ORCPT + 99 others); Wed, 21 Dec 2022 17:25:43 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37060 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234989AbiLUWZL (ORCPT ); Wed, 21 Dec 2022 17:25:11 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B8B232790B for ; Wed, 21 Dec 2022 14:24:39 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id b16-20020a17090a551000b00225aa26f1dbso674833pji.8 for ; Wed, 21 Dec 2022 14:24:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=6Jaams5HOGHGqoL3nWqw9y/u+ytv5a24WDfV/4t3+EM=; b=LDwwXjzscFBywLfN9uix9hm9llnv4amTnGjRnjZ8F1YoWJ+Sx8vfVGDC4kpHeC9EgO WEfcsu1baIScabXlw607E3i2788fHMyO1BW42G+5VcwQgeSz+v+pFd1u/ZvGxxcZewMy glCt5NEDvFxgZLf2WpUwBjGnB+61qLYoZe+yZIM+MdVv4lqAAJJSJ8No2FOJDwGC+2eo SShcXyTKuRFrZ6bHGPgKLETpTxyVlmx46vjGkcTyWPMWi+A6svFpJxcraGBV8jVYAheJ I/ZflFDXmfnhxn6uK6MeCWxfG0eYMFUoT0giUJZCSfsJHkNM5Axy8YtX7OUSRSnNuQ77 hUKA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=6Jaams5HOGHGqoL3nWqw9y/u+ytv5a24WDfV/4t3+EM=; b=4dA3GwWquL5++se61kQeCeAIzBvRSO0T1vST+HKOmpm1pwO1uzFiqMRNdzTzsRTzAC whDTr2k741HYdSoEk7q5jNkuPnu1Larx6GFWjFravH0Q++4WOGvQ+77Ut5JrCxdKHrXd xyWoBz/EoGqhDO+tBL0atSRgHu44qOu03sAaJfRqmq6eQ9HCMFMEYVnR8kx2VLm/BX38 RqUoQVG7V0+ptqAE38+RbfZ1L/RMoid3b0YS45Y4Li9d6IM9MsmjGySIM3T5W2QotvNr 96fZfwH9fEcMzNFGBzrCtK3gVvMoiU9WpKeKIGKUVyY4miCkAw6JBexTTpdsRPBE9XOn 5cdA== X-Gm-Message-State: AFqh2kpXuqs31xeM7AvO2mK5tzJ3fXvRKDsLYkOnGtBYLKwS34TJIdhd mNxokdaxzj8tn39sBAPKn03VW8XINXLryJH229ew3EYhezWzXQnGcYdiEqeuOXMYfKyBE/TndtJ 4QVfS5ayrCxLedzI0KoqXKFd16pr9trLMEinVvE5GEECEBXYmUjrEmLx1yqxE19KZ//rj33NU X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a62:2f82:0:b0:576:670e:5de2 with SMTP id v124-20020a622f82000000b00576670e5de2mr274183pfv.70.1671661479091; Wed, 21 Dec 2022 14:24:39 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:14 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-11-bgardon@google.com> Subject: [RFC 10/14] KVM: x86/MMU: Fix naming on prepare / commit zap page functions From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864361256324408?= X-GMAIL-MSGID: =?utf-8?q?1752864361256324408?= Since the various prepare / commit zap page functions are part of the Shadow MMU and used all over both shadow_mmu.c and mmu.c, add _shadow_ to the function names to match the rest of the Shadow MMU interface. Since there are so many uses of these functions, this rename gets its own commit. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 21 +++++++-------- arch/x86/kvm/mmu/shadow_mmu.c | 48 ++++++++++++++++++----------------- arch/x86/kvm/mmu/shadow_mmu.h | 13 +++++----- 3 files changed, 43 insertions(+), 39 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 568b36de9eeb..160dd143a814 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -270,8 +270,9 @@ void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) kvm_tdp_mmu_walk_lockless_end(); } else { /* - * Make sure the write to vcpu->mode is not reordered in front of - * reads to sptes. If it does, kvm_mmu_commit_zap_page() can see us + * Make sure the write to vcpu->mode is not reordered in front + * of reads to sptes. If it does, + * kvm_shadow_mmu_commit_zap_page() can see us * OUTSIDE_GUEST_MODE and proceed to free the shadow page table. */ smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE); @@ -608,7 +609,7 @@ bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, struct list_head *invalid_list return false; if (!list_empty(invalid_list)) - kvm_mmu_commit_zap_page(kvm, invalid_list); + kvm_shadow_mmu_commit_zap_page(kvm, invalid_list); else kvm_flush_remote_tlbs(kvm); return true; @@ -1062,7 +1063,7 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, if (is_tdp_mmu_page(sp)) kvm_tdp_mmu_put_root(kvm, sp, false); else if (!--sp->root_count && sp->role.invalid) - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); + kvm_shadow_mmu_prepare_zap_page(kvm, sp, invalid_list); *root_hpa = INVALID_PAGE; } @@ -1115,7 +1116,7 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu, mmu->root.pgd = 0; } - kvm_mmu_commit_zap_page(kvm, &invalid_list); + kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list); write_unlock(&kvm->mmu_lock); } EXPORT_SYMBOL_GPL(kvm_mmu_free_roots); @@ -1417,8 +1418,8 @@ bool is_page_fault_stale(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, * there is a pending request to free obsolete roots. The request is * only a hint that the current root _may_ be obsolete and needs to be * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a - * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs - * to reload even if no vCPU is actively using the root. + * previous root, then __kvm_shadow_mmu_prepare_zap_page() signals all + * vCPUs to reload even if no vCPU is actively using the root. */ if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu)) return true; @@ -3103,13 +3104,13 @@ void kvm_mmu_zap_all(struct kvm *kvm) list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) { if (WARN_ON(sp->role.invalid)) continue; - if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign)) + if (__kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign)) goto restart; if (cond_resched_rwlock_write(&kvm->mmu_lock)) goto restart; } - kvm_mmu_commit_zap_page(kvm, &invalid_list); + kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list); if (is_tdp_mmu_enabled(kvm)) kvm_tdp_mmu_zap_all(kvm); @@ -3452,7 +3453,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm) else if (is_tdp_mmu_page(sp)) flush |= kvm_tdp_mmu_zap_sp(kvm, sp); else - kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); + kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list); WARN_ON_ONCE(sp->nx_huge_page_disallowed); if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index e36b4d9c67f2..2d1a4026cf00 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -1280,7 +1280,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int ret = vcpu->arch.mmu->sync_page(vcpu, sp); if (ret < 0) - kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); + kvm_shadow_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); return ret; } @@ -1442,8 +1442,8 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, * upper-level page will be write-protected. */ if (role.level > PG_LEVEL_4K && sp->unsync) - kvm_mmu_prepare_zap_page(kvm, sp, - &invalid_list); + kvm_shadow_mmu_prepare_zap_page(kvm, sp, + &invalid_list); continue; } @@ -1485,7 +1485,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, ++kvm->stat.mmu_cache_miss; out: - kvm_mmu_commit_zap_page(kvm, &invalid_list); + kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list); if (collisions > kvm->stat.max_mmu_page_hash_collisions) kvm->stat.max_mmu_page_hash_collisions = collisions; @@ -1768,8 +1768,8 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte, */ if (tdp_enabled && invalid_list && child->role.guest_mode && !child->parent_ptes.val) - return kvm_mmu_prepare_zap_page(kvm, child, - invalid_list); + return kvm_shadow_mmu_prepare_zap_page(kvm, + child, invalid_list); } } else if (is_mmio_spte(pte)) { mmu_spte_clear_no_track(spte); @@ -1814,7 +1814,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm, struct kvm_mmu_page *sp; for_each_sp(pages, sp, parents, i) { - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); + kvm_shadow_mmu_prepare_zap_page(kvm, sp, invalid_list); mmu_pages_clear_parents(&parents); zapped++; } @@ -1823,9 +1823,9 @@ static int mmu_zap_unsync_children(struct kvm *kvm, return zapped; } -bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, - struct list_head *invalid_list, - int *nr_zapped) +bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list, + int *nr_zapped) { bool list_unstable, zapped_root = false; @@ -1886,16 +1886,17 @@ bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, return list_unstable; } -bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, - struct list_head *invalid_list) +bool kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list) { int nr_zapped; - __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped); + __kvm_shadow_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped); return nr_zapped; } -void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list) +void kvm_shadow_mmu_commit_zap_page(struct kvm *kvm, + struct list_head *invalid_list) { struct kvm_mmu_page *sp, *nsp; @@ -1940,8 +1941,8 @@ static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm, if (sp->root_count) continue; - unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, - &nr_zapped); + unstable = __kvm_shadow_mmu_prepare_zap_page(kvm, sp, + &invalid_list, &nr_zapped); total_zapped += nr_zapped; if (total_zapped >= nr_to_zap) break; @@ -1950,7 +1951,7 @@ static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm, goto restart; } - kvm_mmu_commit_zap_page(kvm, &invalid_list); + kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list); kvm->stat.mmu_recycled += total_zapped; return total_zapped; @@ -2021,9 +2022,9 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) pgprintk("%s: gfn %llx role %x\n", __func__, gfn, sp->role.word); r = 1; - kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list); + kvm_shadow_mmu_prepare_zap_page(kvm, sp, &invalid_list); } - kvm_mmu_commit_zap_page(kvm, &invalid_list); + kvm_shadow_mmu_commit_zap_page(kvm, &invalid_list); write_unlock(&kvm->mmu_lock); return r; @@ -3020,7 +3021,8 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) { if (detect_write_misaligned(sp, gpa, bytes) || detect_write_flooding(sp)) { - kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list); + kvm_shadow_mmu_prepare_zap_page(vcpu->kvm, sp, + &invalid_list); ++vcpu->kvm->stat.mmu_flooded; continue; } @@ -3128,7 +3130,7 @@ void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm) goto restart; } - unstable = __kvm_mmu_prepare_zap_page(kvm, sp, + unstable = __kvm_shadow_mmu_prepare_zap_page(kvm, sp, &kvm->arch.zapped_obsolete_pages, &nr_zapped); batch += nr_zapped; @@ -3145,7 +3147,7 @@ void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm) * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are * running with an obsolete MMU. */ - kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); + kvm_shadow_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); } bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm) @@ -3426,7 +3428,7 @@ unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free) write_lock(&kvm->mmu_lock); if (kvm_shadow_mmu_has_zapped_obsolete_pages(kvm)) { - kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); + kvm_shadow_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); goto out; } diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 148cc3593d2b..af201d34d0b2 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -53,12 +53,13 @@ bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp); -bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, - struct list_head *invalid_list, - int *nr_zapped); -bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, - struct list_head *invalid_list); -void kvm_mmu_commit_zap_page(struct kvm *kvm, struct list_head *invalid_list); +bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list, + int *nr_zapped); +bool kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, + struct list_head *invalid_list); +void kvm_shadow_mmu_commit_zap_page(struct kvm *kvm, + struct list_head *invalid_list); int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu); From patchwork Wed Dec 21 22:24:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35541 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4945952dye; Wed, 21 Dec 2022 14:27:44 -0800 (PST) X-Google-Smtp-Source: AMrXdXuzgd2yToz0DJdiWurQamC6eZrdxH30G1xkYUVJUNL/OkxGQCcMSiZiuOOA3Zvxt0jOVZSO X-Received: by 2002:a17:906:22d2:b0:7fc:3787:243d with SMTP id q18-20020a17090622d200b007fc3787243dmr6593883eja.53.1671661664184; Wed, 21 Dec 2022 14:27:44 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661664; cv=none; d=google.com; s=arc-20160816; b=U/qNngRtcdkUZZAIbvGYssBJnA3HtfnwlnVONKlLmkVQkaUvGQNhVGNx/CLwijXZMD HncMbf5wqYend77D8A0MYJdMABd9dYChcfzvLxmpS4IHIFSbqmJSU7wpnTivQgm0LwXq 6JfZCBqDURDt9Nh+sO/78XlyUaZkHRaOc2alvlgjRvCXbuXQ+OPHktXyn7jPuBYFLfsL LW27uM6q0Qm6rKu46CRwIvN3W4rnfw4azqmwWxLJznndboJX00QI72tJaeO5MwN5tfN7 MOoMM5VJ5d2howSIbdJSL2RvKlvtR3CXQNsxODyAYYySDJIH7PO3VbdiSWqCQ2i/f5oe 2Ejw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=G8yCm8/A9muZfDD3Vt2EwogqpDhIyDMvg2eScRp1Hns=; b=VgwzY6WRBTHNP0eAwwn2W0nnaKRwCexeH9NZPbmTfh9E8XutaF4F0S7eOaYA71FOa+ 2mBZXcCM9oI9t3jpJjLqjzXJa+NSbNvRfs4v0EqiOQWZi76D7YS9FMLRFMu0Umsvvcnl nTC+umHcCT0NORLahLMkklpwnUF3d1QMHr4DvoUn6ZSUk+c5CaZw45fWfNPRpR3UsHMO 0uATob3JNLdBH0g6xWDJdTuoAo4OoKrLwh5BtgycOQscwcS2/Y8m1tECo3U0NwOKgaXr lHlkNwuUYmAThtjZYE9KD0BYHuHz+PnwNsLMN20+AXelc0PItHwA/w6Q44bHFKhVkR7S utog== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=RZtEbegJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id hd39-20020a17090796a700b007c170f6b32esi16296078ejc.527.2022.12.21.14.27.20; Wed, 21 Dec 2022 14:27:44 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=RZtEbegJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235124AbiLUWZq (ORCPT + 99 others); Wed, 21 Dec 2022 17:25:46 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37072 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234998AbiLUWZM (ORCPT ); Wed, 21 Dec 2022 17:25:12 -0500 Received: from mail-pg1-x54a.google.com (mail-pg1-x54a.google.com [IPv6:2607:f8b0:4864:20::54a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B78E327915 for ; Wed, 21 Dec 2022 14:24:41 -0800 (PST) Received: by mail-pg1-x54a.google.com with SMTP id x79-20020a633152000000b004785d1cf6bbso111447pgx.6 for ; Wed, 21 Dec 2022 14:24:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=G8yCm8/A9muZfDD3Vt2EwogqpDhIyDMvg2eScRp1Hns=; b=RZtEbegJX7+lnjjBMFgZ9yF+y8dpXdZXUnYPKvKz5AnEC3U28/Sy5KCutLHCHP9qii KFytERkZFCHgqDPy7+veHu2TMRasyzT+t0xfCh1UPiqN16FymNAzzxa9XHMJQqDLLKKp 3/i9EgNeEXg34KUtXzZGsmODzlMYPaFeGTQ552wdnUk2RqqvGPcmMEDbt0xbfEhgtDHY s9ncMZW7Lx8E9FfxdqlSAEhyEFNFZhS+Swfg9V0tQRaF2yuhmHLZFhYEJNCMf3OQH8tp LU9mFt24OB62q0EKzcPAS4NYVShoOWZexvsM428AQaFopIGAMH8lOwOOoXMkVOrXXsV8 9SYw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=G8yCm8/A9muZfDD3Vt2EwogqpDhIyDMvg2eScRp1Hns=; b=4jfb8G5l/JZsJBlCurKCjsl24Mo0FhRqiD5QarR+mK67dGSUFdJstNiy/E0Y5Zd41X cU+F8+NysRdYHkSLS2IMpprkoS8Jg6PtT4IYMV/7b2SiZBiTsM5QL7I9n+a/Ox9/fD/i uUF3qVGW6yRzNdW/GGVD0mza07EP0T3UUkSMqVmxJxo45qwoTabYNXxhHnuYf+L9g5W1 YlQYQBUlzP2SKGmGsL/s0//dL0KBA9ubdERuYdeZ6ZYLgD0Rgj86T/VD3vntJ79BAOz2 ZeDgBEWdejUc3YgVnjPS3SWY7sKX2gJAAubcKPwNviL0pc4YvRKQZwCHLE1pZmT8mryv J1MQ== X-Gm-Message-State: AFqh2kqZbBCHEJRfjBCgKYVnm7AoJpRhIihoDVK/7RXL5WPJnQlGFqHf Wdgis5WYuVUogV+lclRuXZHEAcnKfNgjbZjWU5HzRx4Ykn9bqiiCIiiLxPQtaeHcBp+NrubIZb8 QGv6lkGylHrCY3NnfCivZMp0jh/HczSn+vP/qlC9qCvueBYyXHNqknovAoNNIIaTY44MN51jC X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:aa7:9f4c:0:b0:56d:8d19:f331 with SMTP id h12-20020aa79f4c000000b0056d8d19f331mr259462pfr.7.1671661480589; Wed, 21 Dec 2022 14:24:40 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:15 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-12-bgardon@google.com> Subject: [RFC 11/14] KVM: x86/MMU: Factor Shadow MMU wrprot / clear dirty ops out of mmu.c From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864301134893628?= X-GMAIL-MSGID: =?utf-8?q?1752864301134893628?= There are several functions in mmu.c which bifrucate to the Shadow and/or TDP MMU implementations. In most of these, the Shadow MMU implementation is open-coded. Wrap these instances in a nice function which just needs kvm and slot arguments or similar. This matches the TDP MMU interface and will allow for some nice cleanups in a following commit. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 52 ++++++---------------------- arch/x86/kvm/mmu/shadow_mmu.c | 64 +++++++++++++++++++++++++++++++++++ arch/x86/kvm/mmu/shadow_mmu.h | 15 ++++++++ 3 files changed, 90 insertions(+), 41 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 160dd143a814..ce2a6dd38c67 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -417,23 +417,13 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn_offset, unsigned long mask) { - struct kvm_rmap_head *rmap_head; - if (is_tdp_mmu_enabled(kvm)) kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot, slot->base_gfn + gfn_offset, mask, true); - if (!kvm_memslots_have_rmaps(kvm)) - return; - - while (mask) { - rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), - PG_LEVEL_4K, slot); - rmap_write_protect(rmap_head, false); - - /* clear the first set bit */ - mask &= mask - 1; - } + if (kvm_memslots_have_rmaps(kvm)) + kvm_shadow_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, + mask); } /** @@ -450,23 +440,13 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn_offset, unsigned long mask) { - struct kvm_rmap_head *rmap_head; - if (is_tdp_mmu_enabled(kvm)) kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot, slot->base_gfn + gfn_offset, mask, false); - if (!kvm_memslots_have_rmaps(kvm)) - return; - - while (mask) { - rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), - PG_LEVEL_4K, slot); - __rmap_clear_dirty(kvm, rmap_head, slot); - - /* clear the first set bit */ - mask &= mask - 1; - } + if (kvm_memslots_have_rmaps(kvm)) + kvm_shadow_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, + mask); } /** @@ -524,16 +504,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm, struct kvm_memory_slot *slot, u64 gfn, int min_level) { - struct kvm_rmap_head *rmap_head; - int i; bool write_protected = false; - if (kvm_memslots_have_rmaps(kvm)) { - for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) { - rmap_head = gfn_to_rmap(gfn, i, slot); - write_protected |= rmap_write_protect(rmap_head, true); - } - } + if (kvm_memslots_have_rmaps(kvm)) + write_protected |= + kvm_shadow_mmu_write_protect_gfn(kvm, slot, gfn, min_level); if (is_tdp_mmu_enabled(kvm)) write_protected |= @@ -2917,8 +2892,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, { if (kvm_memslots_have_rmaps(kvm)) { write_lock(&kvm->mmu_lock); - slot_handle_level(kvm, memslot, slot_rmap_write_protect, - start_level, KVM_MAX_HUGEPAGE_LEVEL, false); + kvm_shadow_mmu_wrprot_slot(kvm, memslot, start_level); write_unlock(&kvm->mmu_lock); } @@ -3069,11 +3043,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, { if (kvm_memslots_have_rmaps(kvm)) { write_lock(&kvm->mmu_lock); - /* - * Clear dirty bits only on 4k SPTEs since the legacy MMU only - * support dirty logging at a 4k granularity. - */ - slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false); + kvm_shadow_mmu_clear_dirty_slot(kvm, memslot); write_unlock(&kvm->mmu_lock); } diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 2d1a4026cf00..80b8c78daaeb 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -3440,3 +3440,67 @@ unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free) return freed; } + +void kvm_shadow_mmu_write_protect_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask) +{ + struct kvm_rmap_head *rmap_head; + + while (mask) { + rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), + PG_LEVEL_4K, slot); + rmap_write_protect(rmap_head, false); + + /* clear the first set bit */ + mask &= mask - 1; + } +} + +void kvm_shadow_mmu_clear_dirty_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask) +{ + struct kvm_rmap_head *rmap_head; + + while (mask) { + rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), + PG_LEVEL_4K, slot); + __rmap_clear_dirty(kvm, rmap_head, slot); + + /* clear the first set bit */ + mask &= mask - 1; + } +} + +bool kvm_shadow_mmu_write_protect_gfn(struct kvm *kvm, + struct kvm_memory_slot *slot, + u64 gfn, int min_level) +{ + struct kvm_rmap_head *rmap_head; + int i; + bool write_protected = false; + + if (kvm_memslots_have_rmaps(kvm)) { + for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) { + rmap_head = gfn_to_rmap(gfn, i, slot); + write_protected |= rmap_write_protect(rmap_head, true); + } + } + + return write_protected; +} + +void kvm_shadow_mmu_clear_dirty_slot(struct kvm *kvm, + const struct kvm_memory_slot *memslot) +{ + slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false); +} + +void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm, + const struct kvm_memory_slot *memslot, + int start_level) +{ + slot_handle_level(kvm, memslot, slot_rmap_write_protect, + start_level, KVM_MAX_HUGEPAGE_LEVEL, false); +} diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index af201d34d0b2..c322eeaa0688 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -104,6 +104,21 @@ void kvm_shadow_mmu_zap_collapsible_sptes(struct kvm *kvm, bool kvm_shadow_mmu_has_zapped_obsolete_pages(struct kvm *kvm); unsigned long kvm_shadow_mmu_shrink_scan(struct kvm *kvm, int pages_to_free); +void kvm_shadow_mmu_write_protect_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask); +void kvm_shadow_mmu_clear_dirty_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask); +bool kvm_shadow_mmu_write_protect_gfn(struct kvm *kvm, + struct kvm_memory_slot *slot, + u64 gfn, int min_level); +void kvm_shadow_mmu_clear_dirty_slot(struct kvm *kvm, + const struct kvm_memory_slot *memslot); +void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm, + const struct kvm_memory_slot *memslot, + int start_level); + /* Exports from paging_tmpl.h */ gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t vaddr, u64 access, From patchwork Wed Dec 21 22:24:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35542 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4946033dye; Wed, 21 Dec 2022 14:27:55 -0800 (PST) X-Google-Smtp-Source: AMrXdXsm54bSR0to47z6UQkl8W+W/uBE8BSRq1MObiD2ZmJ3rRO9PqC4wWsvqfOcsV/VvvvFBjM3 X-Received: by 2002:a17:906:265a:b0:7ad:9893:ff40 with SMTP id i26-20020a170906265a00b007ad9893ff40mr7039550ejc.27.1671661675222; Wed, 21 Dec 2022 14:27:55 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661675; cv=none; d=google.com; s=arc-20160816; b=sHi+ZjyRwaLG+CBYTL4tLHG+r3yzLjVLxl5u7d0biSMEX+ZoJNMqaZZRs6naY5qIw9 rYTOxF5iiqaMtdgRlFS9QWKS+SV2bonObFKegQXzHUSv/kQVbz605pNZVBdtMZWfKm93 b2drweF8txWSYy875KSQ9+6xf1ODNGIIgDehY/K06UpgQI7E2eUOF9ExoB/yWNkC3rCA TtYIndHHDnV7XhkorI20aI8rj/NLdhLmiiVvNqu/9o3T6kSdLhoI9yh+2g6Hje/CjAKD qzpzSkpA++OQ/+sZjoXw0DIuHZjnAhaNYyOSU+1+0ZsRV7WbWw1nlOz1vzd6PIFpvVXT 7KNQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=ZvHSZ7wyO6HpWC7GgWotvlWeVl4F+Q0dZdGlQyXVA6g=; b=QM2h3RU6XSM2Y9YtII39/QtvxAjacmHVvFJxQfxBeX4gyRkK+GROFcTc3jTi09tAKg d/i34A+J49dVF9fQgrBTGFTwma9gxwtRSiv7K2krqTFBXvO8BkwvXAWcOeduke+abuO6 prf1ume6kshD+ncqERp98Li1lrJPlz10MRrvSZA1BZintPaDWu28hoLn1GxBAtlrZ+mP CP2NA0MC4SWXpCeizERvS+/ov3/WtGfwkwvVR+zJd2Cl8oRrAaKkxhfjzmrYFdzs+m2n ss0vsD1Zff0+jK8VcqQj+fKwWSg2d2X1R6WpwV6Z09b9f3lzFakNG+VxnUJ0Jx3rIZew hQpg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=SZmxxj3D; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id m7-20020a170906848700b007c0d88f1614si12859959ejx.342.2022.12.21.14.27.31; Wed, 21 Dec 2022 14:27:55 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=SZmxxj3D; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235126AbiLUWZv (ORCPT + 99 others); Wed, 21 Dec 2022 17:25:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36368 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235051AbiLUWZP (ORCPT ); Wed, 21 Dec 2022 17:25:15 -0500 Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0FD3C27920 for ; Wed, 21 Dec 2022 14:24:43 -0800 (PST) Received: by mail-pl1-x64a.google.com with SMTP id m13-20020a170902f64d00b001899a70c8f1so128658plg.14 for ; Wed, 21 Dec 2022 14:24:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=ZvHSZ7wyO6HpWC7GgWotvlWeVl4F+Q0dZdGlQyXVA6g=; b=SZmxxj3D7ZU/s/vea1hbQX3dVt8uH8sCEHNP+G6nLRcEHVdJ8hshrQVvHHLhSFn40n gJwzzUv13Wwj4TPXk4zLMaNE7Il3+japKHq9hAOS3Uzam7+8ANCSXsQO5uZjmDyrib8v n7BsILS36PW4iMO4uWVAOs6+sx5t4+1pg/u9I3KpAsMyLw8mHxIVg/Wt2pfxeyHDqcxc HGT2NGqNDuzehTa0QijHcR97mXOs1J+H+2mJ46JHFhPk6XZfnisD+R37bMDh2M9WiZq3 MSr8s3Qhiy5xIfqLB/smHZVEetk3udAPOdilIivf5+njgF/jOhq3oLoDwhHLecPyYFNr afyg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ZvHSZ7wyO6HpWC7GgWotvlWeVl4F+Q0dZdGlQyXVA6g=; b=ZOKwDuuBXE7uOKeThj2elgSeJp+0hgfTgvpom/cMegW+MjNp308tBKS3gb79xGp/d4 1mB9g2ag7H0z5t2P82b0NZQ35nsFzkl8YofIG6K3gL3gIizZJSdkabr2kyiqiiNw93aA jFFhkp+4Sk1yeszPHGkWBiqONAqpLlFkXpFdzqf27zYkPSDFjgZ6VM1mhYjIjmEsM7aV XdshzSbktx/XfoHiRo0FibPj4RAxjLCALZMC7dAD3HPBfbbSe6mwHLE1gqPyRxw37eyW mqyeHuS7IHqY443hFYWMVLyH2lG1RVlsGGVpUEJmfsQxTwAeaW4zHzcvRdXHC8BurHSz MxPQ== X-Gm-Message-State: AFqh2kpvy0T+StbWHnWi22SGxBtE/GgSzGxkDHZVffKIGw7siLZKHORk cB5vQIJrkEmJhPceMB8GRXRLKmeX8Qkhw59592VxQztzEg8sKkIoMsnpmjPfu5Mj357rsEoybkD RS0wRDxG3Vd2slLdkWVYGcnuRpvmBA2DfHu5d6GIBELwa7EAy246eBwP4NEMa9CMnBAiUW3np X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a05:6a00:3404:b0:577:4103:8da with SMTP id cn4-20020a056a00340400b00577410308damr245364pfb.1.1671661482452; Wed, 21 Dec 2022 14:24:42 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:16 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-13-bgardon@google.com> Subject: [RFC 12/14] KVM: x86/MMU: Remove unneeded exports from shadow_mmu.c From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864312834131844?= X-GMAIL-MSGID: =?utf-8?q?1752864312834131844?= Now that the various dirty logging / wrprot function implementations are in shadow_mmu.c, do another round of cleanups to remove functions which no longer need to be exposed and can be marked static. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/shadow_mmu.c | 30 +++++++++++++++++------------- arch/x86/kvm/mmu/shadow_mmu.h | 18 ------------------ 2 files changed, 17 insertions(+), 31 deletions(-) diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 80b8c78daaeb..77472eb9b06a 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -632,8 +632,8 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) return count; } -struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, - const struct kvm_memory_slot *slot) +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, + const struct kvm_memory_slot *slot) { unsigned long idx; @@ -801,7 +801,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect) return mmu_spte_update(sptep, spte); } -bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect) +static bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect) { u64 *sptep; struct rmap_iterator iter; @@ -840,8 +840,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep) * - W bit on ad-disabled SPTEs. * Returns true iff any D or W bits were cleared. */ -bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot) +static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot) { u64 *sptep; struct rmap_iterator iter; @@ -3045,6 +3045,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, write_unlock(&vcpu->kvm->mmu_lock); } +/* The return value indicates if tlb flush on all vcpus is needed. */ +typedef bool (*slot_level_handler) (struct kvm *kvm, + struct kvm_rmap_head *rmap_head, + const struct kvm_memory_slot *slot); + /* The caller should hold mmu-lock before calling this function. */ static __always_inline bool slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot, @@ -3073,10 +3078,10 @@ slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot, return flush; } -__always_inline bool slot_handle_level(struct kvm *kvm, - const struct kvm_memory_slot *memslot, - slot_level_handler fn, int start_level, - int end_level, bool flush_on_yield) +static __always_inline bool +slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot, + slot_level_handler fn, int start_level, int end_level, + bool flush_on_yield) { return slot_handle_level_range(kvm, memslot, fn, start_level, end_level, memslot->base_gfn, @@ -3084,10 +3089,9 @@ __always_inline bool slot_handle_level(struct kvm *kvm, flush_on_yield, false); } -__always_inline bool slot_handle_level_4k(struct kvm *kvm, - const struct kvm_memory_slot *memslot, - slot_level_handler fn, - bool flush_on_yield) +static __always_inline bool +slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot, + slot_level_handler fn, bool flush_on_yield) { return slot_handle_level(kvm, memslot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield); diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index c322eeaa0688..397fb463ef54 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -26,11 +26,6 @@ struct pte_list_desc { /* Only exported for debugfs.c. */ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); -struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, - const struct kvm_memory_slot *slot); -bool rmap_write_protect(struct kvm_rmap_head *rmap_head, bool pt_protect); -bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot); bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, struct kvm_memory_slot *slot, gfn_t gfn, int level, pte_t unused); @@ -78,22 +73,9 @@ int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes, struct kvm_page_track_notifier_node *node); -/* The return value indicates if tlb flush on all vcpus is needed. */ -typedef bool (*slot_level_handler) (struct kvm *kvm, - struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot); -bool slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot, - slot_level_handler fn, int start_level, int end_level, - bool flush_on_yield); -bool slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot, - slot_level_handler fn, bool flush_on_yield); - void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm); bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); -bool slot_rmap_write_protect(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - const struct kvm_memory_slot *slot); - void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, gfn_t start, gfn_t end, From patchwork Wed Dec 21 22:24:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35544 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4946290dye; Wed, 21 Dec 2022 14:28:25 -0800 (PST) X-Google-Smtp-Source: AMrXdXvu4qPjfOibBwnMTX8xUtqzVH9kZh8yafVytSeOqYafEknJPZAy3G825auT2c3Dvw0tyOHY X-Received: by 2002:a17:907:6746:b0:836:e7de:9792 with SMTP id qm6-20020a170907674600b00836e7de9792mr2939215ejc.32.1671661705774; Wed, 21 Dec 2022 14:28:25 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661705; cv=none; d=google.com; s=arc-20160816; b=yNO8f/DvTVxC4/6kpyfu56HwKYSABdzlhcSah8nuSQGm3uknwdT/PLIzKUXIkOwOud qSwSvUAbHq2sKtbKF+ppXwgx78ppYWPEMpVIPBnXaIOobovxNcfGZoEOKZAFsf4EkdhJ SmChZx2vtutrivNk70QK2+PMkFwbrSVk7a9LHw3MtUbtXHjtXmlQXORhsc4Znr/AwL2r JzCeezFXiykpjw+nRbtsjWu6E/DM5AvO8Y8Xe7cqbZl/TZytX6vZgkDhz84olEO5SvJs q2O5nkmeFQYrYTfiSXEH2UmcjhMPq2Ym0fsnr54/Z/JGroAOmLqckCOh8OYRPDi4OdRF xnFQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=lUvSzm6zhuFwhwDfBHcpPhD05Pwase/3rciSEa3uh/0=; b=DtzhwXg7lI9HLsZyQWk0lwvvIgLQRh/CrYM7iTbd08x/shiDPDTetG7xIuT8vMd7v9 wLqD5OvE6sWe1TJfPH5scjQBhqd4p3G8+0fsb/2Mk58xfE3nkQ4xrKTS1JwVPzldkh9L P1xud3SOjGpgVb6dZT4zVTlvIdMaA5f/m5w+GJ1Lzy1YDDsZ5nJCzTwExdpwoA8cnL8e JZaWPr7i+4tcpBZ49Nz9opbpym3rplCSFQcidRLaSg4Bzx+WbZBrJt5Tpj6VXzkqWjlV 1wR32tHMivpWj9tJghSxJiowKibhgsVocYwXEJ8nU0VloMPc1CAzLcrh5rtRTKo/VGXb s+qw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Vfbr638K; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id mp3-20020a1709071b0300b007815e9c5b80si16872153ejc.617.2022.12.21.14.28.02; Wed, 21 Dec 2022 14:28:25 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Vfbr638K; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235136AbiLUW0D (ORCPT + 99 others); Wed, 21 Dec 2022 17:26:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36302 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235010AbiLUWZ1 (ORCPT ); Wed, 21 Dec 2022 17:25:27 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1ACA727939 for ; Wed, 21 Dec 2022 14:24:44 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id v17-20020a17090abb9100b002239a73bc6eso1967512pjr.1 for ; Wed, 21 Dec 2022 14:24:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=lUvSzm6zhuFwhwDfBHcpPhD05Pwase/3rciSEa3uh/0=; b=Vfbr638K23Tkpy4chmhkz5LuwrNuGxo4NEmcxSAginABFnm1kw7ZenYH4gp9QgSO11 VQEQb+QJWIIWf6UTTkicMqeyvNOQ352ckTefjpJ6hIeef8iMLigA+xLhQvl76CZDldhG /vfNHHh2mcYSByw2dshKNTQmiFQt9db7lIIZKsspjCSPfGPOckOaqu0Re/ZpfkAgeV5V 4s01ixfykyOpDlcceEuPlsaGCSvIfjrIOGdaPFeqj8oct91A/9XvjCMreyP4kDEFUQJM d+Txg1y1hD7UEgILJf9ZMrBjmrH+CglkTPz4qTdePAikPG7sByHhuN1IRTByAXxE7Gvv F1Eg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=lUvSzm6zhuFwhwDfBHcpPhD05Pwase/3rciSEa3uh/0=; b=Uao2IzAY22IRShfTO6IdbOxPv1KMJTY/msREbAVOw/u8SV9sbzBTFuCLJzpWO4y91U DnqW4XnYguvsaX/lgbIWLE27wl11CwmS12r0ReOl9orV43GHInglhu8Ynxn7jZHogK3k BnWurU9BApRDN8xENgqx4i9ldp1Znfi9+jciloFsA8OTzUhsW8tAfcXHEdLMmZYfoRM6 9qHiDmORR3PhAj6shjuq5mtjtsoqeBxcyDQz5JU/2BeDSvAxsVn9cvC1mmQ15nKxqNQ8 3OLD4/V3rGzo+Hepi9mPBIL7rXVea3rIsnK+0Ft+PqFAMV/zTHWgc9/AQfT3WPt84k+D M3hw== X-Gm-Message-State: AFqh2koJYT3OomofT40x2DT0P7O/tAEVt8UBb9FoOnY6WZAgnZlRKByf 35rQw7635TCvvh2zU8W4JEcInqGKNVeH7w9CgUEttTWgiM8/Eo13Vx55eDvtAbp5pA9lMz6TMzY ZklV5VVyRjAwJzp81/6+q19XieWosQqOqDgt90E2c75QUg92u1H7eC0+NPgh50IR5qh+nW9TA X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a05:6a00:1d81:b0:576:ba28:29a8 with SMTP id z1-20020a056a001d8100b00576ba2829a8mr204908pfw.47.1671661484083; Wed, 21 Dec 2022 14:24:44 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:17 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-14-bgardon@google.com> Subject: [RFC 13/14] KVM: x86/MMU: Wrap uses of kvm_handle_gfn_range in mmu.c From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864345207118854?= X-GMAIL-MSGID: =?utf-8?q?1752864345207118854?= handle_gfn_range + callback is not a bad interface, but it requires exporting the whole callback scheme to mmu.c. Simplify the interface with some basic wrapper functions, making the callback scheme internal to shadow_mmu.c. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 8 +++--- arch/x86/kvm/mmu/shadow_mmu.c | 54 +++++++++++++++++++++++++---------- arch/x86/kvm/mmu/shadow_mmu.h | 25 ++++------------ 3 files changed, 48 insertions(+), 39 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index ce2a6dd38c67..ceb3146016d0 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -530,7 +530,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) bool flush = false; if (kvm_memslots_have_rmaps(kvm)) - flush = kvm_handle_gfn_range(kvm, range, kvm_zap_rmap); + flush = kvm_shadow_mmu_unmap_gfn_range(kvm, range); if (is_tdp_mmu_enabled(kvm)) flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush); @@ -543,7 +543,7 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) bool flush = false; if (kvm_memslots_have_rmaps(kvm)) - flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap); + flush = kvm_shadow_mmu_set_spte_gfn(kvm, range); if (is_tdp_mmu_enabled(kvm)) flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range); @@ -556,7 +556,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) bool young = false; if (kvm_memslots_have_rmaps(kvm)) - young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); + young = kvm_shadow_mmu_age_gfn_range(kvm, range); if (is_tdp_mmu_enabled(kvm)) young |= kvm_tdp_mmu_age_gfn_range(kvm, range); @@ -569,7 +569,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) bool young = false; if (kvm_memslots_have_rmaps(kvm)) - young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap); + young = kvm_shadow_mmu_test_age_gfn(kvm, range); if (is_tdp_mmu_enabled(kvm)) young |= kvm_tdp_mmu_test_age_gfn(kvm, range); diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 77472eb9b06a..1c6ff6fe3d2c 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -862,16 +862,16 @@ static bool __kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, return kvm_zap_all_rmap_sptes(kvm, rmap_head); } -bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t unused) +static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t unused) { return __kvm_zap_rmap(kvm, rmap_head, slot); } -bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t pte) +static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t pte) { u64 *sptep; struct rmap_iterator iter; @@ -978,9 +978,13 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator) slot_rmap_walk_okay(_iter_); \ slot_rmap_walk_next(_iter_)) -__always_inline bool kvm_handle_gfn_range(struct kvm *kvm, - struct kvm_gfn_range *range, - rmap_handler_t handler) +typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, + int level, pte_t pte); + +static __always_inline bool +kvm_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, + rmap_handler_t handler) { struct slot_rmap_walk_iterator iterator; bool ret = false; @@ -993,9 +997,9 @@ __always_inline bool kvm_handle_gfn_range(struct kvm *kvm, return ret; } -bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t unused) +static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, int level, + pte_t unused) { u64 *sptep; struct rmap_iterator iter; @@ -1007,9 +1011,9 @@ bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, return young; } -bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, - int level, pte_t unused) +static bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, + struct kvm_memory_slot *slot, gfn_t gfn, + int level, pte_t unused) { u64 *sptep; struct rmap_iterator iter; @@ -3508,3 +3512,23 @@ void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm, slot_handle_level(kvm, memslot, slot_rmap_write_protect, start_level, KVM_MAX_HUGEPAGE_LEVEL, false); } + +bool kvm_shadow_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) +{ + return kvm_handle_gfn_range(kvm, range, kvm_zap_rmap); +} + +bool kvm_shadow_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) +{ + return kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap); +} + +bool kvm_shadow_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) +{ + return kvm_handle_gfn_range(kvm, range, kvm_age_rmap); +} + +bool kvm_shadow_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) +{ + return kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap); +} diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 397fb463ef54..2ded3d674cb0 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -26,26 +26,6 @@ struct pte_list_desc { /* Only exported for debugfs.c. */ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); -bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t unused); -bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t pte); - -typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, - int level, pte_t pte); -bool kvm_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, - rmap_handler_t handler); - -bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t unused); -bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, - int level, pte_t unused); - void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp); bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, @@ -101,6 +81,11 @@ void kvm_shadow_mmu_wrprot_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, int start_level); +bool kvm_shadow_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); +bool kvm_shadow_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); +bool kvm_shadow_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); +bool kvm_shadow_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); + /* Exports from paging_tmpl.h */ gpa_t paging32_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t vaddr, u64 access, From patchwork Wed Dec 21 22:24:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gardon X-Patchwork-Id: 35546 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:8188:b0:89:790f:f786 with SMTP id m8csp4947857dye; Wed, 21 Dec 2022 14:31:34 -0800 (PST) X-Google-Smtp-Source: AMrXdXsMryN0MUDWuX1COz3lxI+PEhinz303n8sYn+aks5EUFO+5kFqOu90sq70EBTJaxfJjykZO X-Received: by 2002:a62:2741:0:b0:577:a0d:b091 with SMTP id n62-20020a622741000000b005770a0db091mr3201925pfn.14.1671661894492; Wed, 21 Dec 2022 14:31:34 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671661894; cv=none; d=google.com; s=arc-20160816; b=qRwjMiau+iBLo1SrdNyWSDRpU/5g2nLWOodR9lRGQUFFURhdfWFWN4oR4zN69FwUiZ YQuKwM//CwJGuXxxKyUItgmvdFMXY102Qf8NhjYdILQpfaO1I7koaWrg+90GLm/n1Lca OLbexeaRsULmpvBKMwJOm6HYPkrFqU9J20ma/MSMMtT57ZQtOgLIDXYYAnVTkNzIscIC gvbp1ITmcHMBUwGhvCs3Hz1pDTOtKwOQ1hFiwp60skT3HzanDLJBsvFZEh8xixVaN+py 1QxaFUDkLo8Ybawd/St/YcJ+jD+gkIjZKn2XUIITyHW0bO8Nvjg0iyXpt6TOHUhJj/U1 Ezmg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=4TdMmKeh9xdWHOqsXuzcwDMeXjRCekoXNuBTvbORk4Y=; b=R5P1yJ11CVRFtd+LfFphBsyN9Aj2DvfQrWiA5Q/e4LnlNhesu9I0kSL7UkvG0ejq6i RO2mO0FGAxv8/FU+TxHi65kBDneO8Ofvni5KOisGxiJcKHnVMF02EQkck7DYJu8YCtIE UCpu/9XUJhg9vq0EK2vi2AEBiX246C4AXX2/yO34jCaZ7BRzNdewxp6fQ1H0VZQYka7P o2rdnXaJ3jvfKrPjzI9Jogu4HDRaII2e8pocTqxarCdK8F3GEajl9jobJNAODtB0RHAv yNkS86KM5IieCYgNdRqHchM8UzWZaUrc3eOIHBAioUzPcUY1z4IjY2kWL/6qj2HjojS0 ZAcw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=KtV0pVIM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id w8-20020aa79548000000b0057fd81d0cc3si8888396pfq.37.2022.12.21.14.31.16; Wed, 21 Dec 2022 14:31:34 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=KtV0pVIM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235090AbiLUW0K (ORCPT + 99 others); Wed, 21 Dec 2022 17:26:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37234 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235022AbiLUWZj (ORCPT ); Wed, 21 Dec 2022 17:25:39 -0500 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 518A727B09 for ; Wed, 21 Dec 2022 14:24:45 -0800 (PST) Received: by mail-pj1-x104a.google.com with SMTP id z4-20020a17090ab10400b002195a146546so1946742pjq.9 for ; Wed, 21 Dec 2022 14:24:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=4TdMmKeh9xdWHOqsXuzcwDMeXjRCekoXNuBTvbORk4Y=; b=KtV0pVIMfC+oSFEy+6xnaTAWg/otjtJQ3S9SFSJiaqziSxkFyLCWYd++OxLcdI3jLy 4C3EcV/Ykpd+HnLaclxh7dAhhVAMH+ziDlp47PaFQ91e+QjdlZFuVFPdGMEr8SQDDGbx 4/Qpcj76ezTxKGP/PoSZw5IxJNR4zFVMJw+5YccygGPHkQL/5G9lr6mY+qVDAXiRMPvO ySvjpodRibFvIKqH+VUZxLjHX8bpBBDQMz2OR1ENE/Bah4ZAlIy3N86pUGBWVap6Wjn7 WB8Yrqmpcq1Rq2w5AxKxOvBTJUTDeI9MXknD17Fi+uhVGb3ypwV5HytB2WWzpjdGu1z2 BZTw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=4TdMmKeh9xdWHOqsXuzcwDMeXjRCekoXNuBTvbORk4Y=; b=wDCiLKFB4K04unlAzNer3tPMisU2OuWi9PATLf63dwxQZvWr4EvtcgL5fltiaLASlW X3sqkfvgYo3DGKrZmkFAzENB/VffT9G5BklvN5PyQSoQj8TmLzGHhBZf2xO7CjK77fnP rSP22t8XqjM/iTrVu3mZIWmYKaGqEJ/TN3dK2oCUiRW7NY2aFPwppPmL931Y82ZdD2+W u7V7u8VSB0B7cMwNx1nCA35C5BT58th/WJ9ZaGd+oX7C4u0CKKLNUABHZGTWzrFaybYE sFIVzifcUTvHpLCWtTPvVu0ItLtZo5bQDgF5pWlYKP4KkJy+/jgZuUrxKQVk8bXQ3OcR EcuQ== X-Gm-Message-State: AFqh2kpg1z9rQ16tHhPOyQfgtvIWysg6EjLK3Jiz8rNx/Lu6jYwBvv0V Et6z59OsPxWz4fmkeDyvtKvjCMYEQFvQloblu8wNy91ZS9EmLAbVbnZ0ahW4Y2WiLMGJYEPF/A6 GlG2HgRRkDefyjIRVHzn5Dp03GyVK4zvrAof1b3jewzBV8gf8QurGk0AYH+tmLGoIH4h0WCLV X-Received: from sweer.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:e45]) (user=bgardon job=sendgmr) by 2002:a17:902:ab4e:b0:189:5c50:ce5 with SMTP id ij14-20020a170902ab4e00b001895c500ce5mr204757plb.14.1671661485557; Wed, 21 Dec 2022 14:24:45 -0800 (PST) Date: Wed, 21 Dec 2022 22:24:18 +0000 In-Reply-To: <20221221222418.3307832-1-bgardon@google.com> Mime-Version: 1.0 References: <20221221222418.3307832-1-bgardon@google.com> X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog Message-ID: <20221221222418.3307832-15-bgardon@google.com> Subject: [RFC 14/14] KVM: x86/MMU: Add kvm_shadow_mmu_ to the last few functions in shadow_mmu.h From: Ben Gardon To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: Paolo Bonzini , Peter Xu , Sean Christopherson , David Matlack , Vipin Sharma , Nagareddy Reddy , Ben Gardon X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752864542481251847?= X-GMAIL-MSGID: =?utf-8?q?1752864542481251847?= Fix up the names of the last few Shadow MMU functions in shadow_mmu.h. This gives a clean and obvious interface between the shared x86 MMU code and the Shadow MMU. There are still a few functions exported from paging_tmpl.h that are left as-is, but changing those will need to be done separately, if at all. No functional change intended. Signed-off-by: Ben Gardon --- arch/x86/kvm/mmu/mmu.c | 23 ++++++++++-------- arch/x86/kvm/mmu/shadow_mmu.c | 44 +++++++++++++++++++---------------- arch/x86/kvm/mmu/shadow_mmu.h | 16 +++++++------ 3 files changed, 46 insertions(+), 37 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index ceb3146016d0..8f3b96af470d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -921,9 +921,11 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) u64 new_spte; if (is_tdp_mmu(vcpu->arch.mmu)) - sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->addr, &spte); + sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, + fault->addr, &spte); else - sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte); + sptep = kvm_shadow_mmu_fast_pf_get_last_sptep(vcpu, + fault->addr, &spte); if (!is_shadow_present_pte(spte)) break; @@ -1113,7 +1115,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); mmu->root.hpa = root; } else if (shadow_root_level >= PT64_ROOT_4LEVEL) { - root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level); + root = kvm_shadow_mmu_alloc_root(vcpu, 0, 0, shadow_root_level); mmu->root.hpa = root; } else if (shadow_root_level == PT32E_ROOT_LEVEL) { if (WARN_ON_ONCE(!mmu->pae_root)) { @@ -1124,8 +1126,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) for (i = 0; i < 4; ++i) { WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); - root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0, - PT32_ROOT_LEVEL); + root = kvm_shadow_mmu_alloc_root(vcpu, + i << (30 - PAGE_SHIFT), 0, PT32_ROOT_LEVEL); mmu->pae_root[i] = root | PT_PRESENT_MASK | shadow_me_value; } @@ -1665,7 +1667,7 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd) * count. Otherwise, clear the write flooding count. */ if (!new_role.direct) - __clear_sp_write_flooding_count( + kvm_shadow_mmu_clear_sp_write_flooding_count( to_shadow_page(vcpu->arch.mmu->root.hpa)); } EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd); @@ -2447,13 +2449,13 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct); if (r) goto out; - r = mmu_alloc_special_roots(vcpu); + r = kvm_shadow_mmu_alloc_special_roots(vcpu); if (r) goto out; if (vcpu->arch.mmu->root_role.direct) r = mmu_alloc_direct_roots(vcpu); else - r = mmu_alloc_shadow_roots(vcpu); + r = kvm_shadow_mmu_alloc_shadow_roots(vcpu); if (r) goto out; @@ -2679,7 +2681,8 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) * generally doesn't use PAE paging and can skip allocating the PDP * table. The main exception, handled here, is SVM's 32-bit NPT. The * other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit - * KVM; that horror is handled on-demand by mmu_alloc_special_roots(). + * KVM; that horror is handled on-demand by + * kvm_shadow_mmu_alloc_special_roots(). */ if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL) return 0; @@ -2820,7 +2823,7 @@ int kvm_mmu_init_vm(struct kvm *kvm) if (r < 0) return r; - node->track_write = kvm_mmu_pte_write; + node->track_write = kvm_shadow_mmu_pte_write; node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot; kvm_page_track_register_notifier(kvm, node); diff --git a/arch/x86/kvm/mmu/shadow_mmu.c b/arch/x86/kvm/mmu/shadow_mmu.c index 1c6ff6fe3d2c..6f3e201af670 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.c +++ b/arch/x86/kvm/mmu/shadow_mmu.c @@ -1402,14 +1402,14 @@ static int mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *parent, return 0; } -void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp) +void kvm_shadow_mmu_clear_sp_write_flooding_count(struct kvm_mmu_page *sp) { atomic_set(&sp->write_flooding_count, 0); } static void clear_sp_write_flooding_count(u64 *spte) { - __clear_sp_write_flooding_count(sptep_to_sp(spte)); + kvm_shadow_mmu_clear_sp_write_flooding_count(sptep_to_sp(spte)); } /* @@ -1480,7 +1480,7 @@ static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm, kvm_flush_remote_tlbs(kvm); } - __clear_sp_write_flooding_count(sp); + kvm_shadow_mmu_clear_sp_write_flooding_count(sp); goto out; } @@ -1605,12 +1605,13 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE * consumes bits 29:21. To consume bits 31:30, KVM's uses 4 shadow * PDPTEs; those 4 PAE page directories are pre-allocated and their - * quadrant is assigned in mmu_alloc_root(). A 4-byte PTE consumes - * bits 21:12, while an 8-byte PTE consumes bits 20:12. To consume - * bit 21 in the PTE (the child here), KVM propagates that bit to the - * quadrant, i.e. sets quadrant to '0' or '1'. The parent 8-byte PDE - * covers bit 21 (see above), thus the quadrant is calculated from the - * _least_ significant bit of the PDE index. + * quadrant is assigned in kvm_shadow_mmu_alloc_root(). + * A 4-byte PTE consumes bits 21:12, while an 8-byte PTE consumes + * bits 20:12. To consume bit 21 in the PTE (the child here), KVM + * propagates that bit to the quadrant, i.e. sets quadrant to + * '0' or '1'. The parent 8-byte PDE covers bit 21 (see above), thus + * the quadrant is calculated from the _least_ significant bit of the + * PDE index. */ if (role.has_4_byte_gpte) { WARN_ON_ONCE(role.level != PG_LEVEL_4K); @@ -2377,7 +2378,8 @@ int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *faul * - Must be called between walk_shadow_page_lockless_{begin,end}. * - The returned sptep must not be used after walk_shadow_page_lockless_end. */ -u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte) +u64 *kvm_shadow_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, + u64 *spte) { struct kvm_shadow_walk_iterator iterator; u64 old_spte; @@ -2430,7 +2432,8 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn) return ret; } -hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level) +hpa_t kvm_shadow_mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, + u8 level) { union kvm_mmu_page_role role = vcpu->arch.mmu->root_role; struct kvm_mmu_page *sp; @@ -2447,7 +2450,7 @@ hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level) return __pa(sp->spt); } -static int mmu_first_shadow_root_alloc(struct kvm *kvm) +static int kvm_shadow_mmu_first_shadow_root_alloc(struct kvm *kvm) { struct kvm_memslots *slots; struct kvm_memory_slot *slot; @@ -2508,7 +2511,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm) return r; } -int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) +int kvm_shadow_mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.mmu; u64 pdptrs[4], pm_mask; @@ -2537,7 +2540,7 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) } } - r = mmu_first_shadow_root_alloc(vcpu->kvm); + r = kvm_shadow_mmu_first_shadow_root_alloc(vcpu->kvm); if (r) return r; @@ -2551,8 +2554,8 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) * write-protect the guests page table root. */ if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) { - root = mmu_alloc_root(vcpu, root_gfn, 0, - mmu->root_role.level); + root = kvm_shadow_mmu_alloc_root(vcpu, root_gfn, 0, + mmu->root_role.level); mmu->root.hpa = root; goto set_root_pgd; } @@ -2605,7 +2608,8 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) */ quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0; - root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL); + root = kvm_shadow_mmu_alloc_root(vcpu, root_gfn, quadrant, + PT32_ROOT_LEVEL); mmu->pae_root[i] = root | pm_mask; } @@ -2624,7 +2628,7 @@ int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) return r; } -int mmu_alloc_special_roots(struct kvm_vcpu *vcpu) +int kvm_shadow_mmu_alloc_special_roots(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.mmu; bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL; @@ -2997,8 +3001,8 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte) return spte; } -void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, - int bytes, struct kvm_page_track_notifier_node *node) +void kvm_shadow_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, + int bytes, struct kvm_page_track_notifier_node *node) { gfn_t gfn = gpa >> PAGE_SHIFT; struct kvm_mmu_page *sp; diff --git a/arch/x86/kvm/mmu/shadow_mmu.h b/arch/x86/kvm/mmu/shadow_mmu.h index 2ded3d674cb0..a3e6daa36236 100644 --- a/arch/x86/kvm/mmu/shadow_mmu.h +++ b/arch/x86/kvm/mmu/shadow_mmu.h @@ -26,7 +26,7 @@ struct pte_list_desc { /* Only exported for debugfs.c. */ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); -void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp); +void kvm_shadow_mmu_clear_sp_write_flooding_count(struct kvm_mmu_page *sp); bool __kvm_shadow_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, struct list_head *invalid_list, @@ -41,17 +41,19 @@ int kvm_shadow_mmu_make_pages_available(struct kvm_vcpu *vcpu); int kvm_shadow_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); int kvm_shadow_mmu_direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); -u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte); +u64 *kvm_shadow_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, + u64 *spte); -hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, u8 level); -int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu); -int mmu_alloc_special_roots(struct kvm_vcpu *vcpu); +hpa_t kvm_shadow_mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant, + u8 level); +int kvm_shadow_mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu); +int kvm_shadow_mmu_alloc_special_roots(struct kvm_vcpu *vcpu); int kvm_shadow_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level); -void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, - int bytes, struct kvm_page_track_notifier_node *node); +void kvm_shadow_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, + int bytes, struct kvm_page_track_notifier_node *node); void kvm_shadow_mmu_zap_obsolete_pages(struct kvm *kvm); bool kvm_shadow_mmu_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);