Message ID | 20230602005859.784190-1-seanjc@google.com |
---|---|
State | New |
Series | KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages |
Commit Message
Sean Christopherson
June 2, 2023, 12:58 a.m. UTC
Add a "never" option to the nx_huge_pages module param to allow userspace
to do a one-way hard disabling of the mitigation, and don't create the
per-VM recovery threads when the mitigation is hard disabled. Letting
userspace pinky swear that userspace doesn't want to enable NX mitigation
(without reloading KVM) allows certain use cases to avoid the latency
problems associated with spawning a kthread for each VM.
E.g. in FaaS use cases, the guest kernel is trusted and the host may
create 100+ VMs per logical CPU, which can result in 100ms+ latencies when
a burst of VMs is created.
Reported-by: Li RongQing <lirongqing@baidu.com>
Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com
Cc: Yong He <zhuangel570@gmail.com>
Cc: Robert Hoo <robert.hoo.linux@gmail.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++++++++++++++++++++++++-----
1 file changed, 36 insertions(+), 5 deletions(-)
base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
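The diff itself is not rendered in this view, but the heart of the change is a "sticky" state machine in the module param setter. The following is a rough userspace model of that logic in plain C — the variable and function names mirror the kernel code for readability, but this is only an illustration, not the kernel implementation:

```c
#include <errno.h>
#include <string.h>

/* Userspace model of the patch's sticky-disable logic. */
static int hard_disabled;  /* models nx_hugepage_mitigation_hard_disabled */
static int nx_huge_pages;  /* models the module param's effective value */
static int active_vms;     /* stands in for !list_empty(&vm_list) */

static int set_nx_huge_pages(const char *val)
{
	if (hard_disabled)
		return -EPERM;          /* "never" is one-way: all later writes fail */

	if (!strcmp(val, "off")) {
		nx_huge_pages = 0;
	} else if (!strcmp(val, "force")) {
		nx_huge_pages = 1;
	} else if (!strcmp(val, "never")) {
		/* the kernel takes kvm_lock here so this can't race KVM_CREATE_VM */
		if (active_vms)
			return -EBUSY;  /* refuse once any VM already exists */
		nx_huge_pages = 0;
		hard_disabled = 1;      /* sticky until KVM is reloaded */
	} else {
		return -EINVAL;
	}
	return 0;
}
```

Once `hard_disabled` is set, both the setter and the recovery-param setter bail out with -EPERM, and `kvm_mmu_post_init_vm()` skips creating the per-VM "kvm-nx-lpage-recovery" thread — which is where the VM-creation latency savings come from.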
Comments
On 6/2/2023 8:58 AM, Sean Christopherson wrote:
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the
> per-VM recovery threads when the mitigation is hard disabled.
>
> [...]
>
> @@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
>  	bool old_val = nx_huge_pages;
>  	bool new_val;
>
> +	if (nx_hugepage_mitigation_hard_disabled)
> +		return -EPERM;
> +
>  	/* In "auto" mode deploy workaround only if CPU has the bug. */
> -	if (sysfs_streq(val, "off"))
> +	if (sysfs_streq(val, "off")) {
>  		new_val = 0;
> -	else if (sysfs_streq(val, "force"))
> +	} else if (sysfs_streq(val, "force")) {
>  		new_val = 1;
> -	else if (sysfs_streq(val, "auto"))
> +	} else if (sysfs_streq(val, "auto")) {
>  		new_val = get_nx_auto_mode();
> -	else if (kstrtobool(val, &new_val) < 0)
> +	} else if (sysfs_streq(val, "never")) {
> +		new_val = 0;
> +
> +		mutex_lock(&kvm_lock);
> +		if (!list_empty(&vm_list)) {
> +			mutex_unlock(&kvm_lock);
> +			return -EBUSY;
> +		}
> +		nx_hugepage_mitigation_hard_disabled = true;
> +		mutex_unlock(&kvm_lock);
> +	} else if (kstrtobool(val, &new_val) < 0) {
>  		return -EINVAL;
> +	}

IIUC, (initially) "auto"/"off" --> create some VM --> turn to "never": the
created VMs still have those kthreads, which can never be used until they are
destroyed with the VM. Is this a designed/accepted pattern?
On Fri, Jun 02, 2023, Robert Hoo wrote:
> On 6/2/2023 8:58 AM, Sean Christopherson wrote:
> > [...]
>
> IIUC, (Initially) "auto"/"off" --> create some VM --> turn to "never",
> the created VMs still have those kthreads, but can never be used, until
> destroyed with VM.

Shouldn't be able to happen. The above rejects "never" if vm_list isn't empty,
i.e. if there are any VMs, and sets nx_hugepage_mitigation_hard_disabled under
kvm_lock to ensure it can't race with KVM_CREATE_VM. I forgot to call this out
in the changelog though.
On 6/2/2023 11:09 PM, Sean Christopherson wrote:
> On Fri, Jun 02, 2023, Robert Hoo wrote:
> > IIUC, (Initially) "auto"/"off" --> create some VM --> turn to "never",
> > the created VMs still have those kthreads, but can never be used, until
> > destroyed with VM.
>
> Shouldn't be able to happen. The above rejects "never" if vm_list isn't
> empty, i.e. if there are any VMs, and sets
> nx_hugepage_mitigation_hard_disabled under kvm_lock to ensure it can't race
> with KVM_CREATE_VM. I forgot to call this out in the changelog though.

Ah, right, thanks for the explanation.

Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com>
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: Friday, June 2, 2023 8:59 AM
> Subject: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling
> of nx_huge_pages
>
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the
> per-VM recovery threads when the mitigation is hard disabled.
>
> [...]

Reviewed-by: Li RongQing <lirongqing@baidu.com>

And I hope nx_huge_pages defaults to "never" if the CPU reports that it
doesn't have the bug.

Thanks

-Li RongQing
On Fri, 2023-06-02 at 08:09 -0700, Sean Christopherson wrote:
> On Fri, Jun 02, 2023, Robert Hoo wrote:
> > IIUC, (Initially) "auto"/"off" --> create some VM --> turn to "never",
> > the created VMs still have those kthreads, but can never be used, until
> > destroyed with VM.
>
> Shouldn't be able to happen. The above rejects "never" if vm_list isn't
> empty, i.e. if there are any VMs, and sets
> nx_hugepage_mitigation_hard_disabled under kvm_lock to ensure it can't race
> with KVM_CREATE_VM. I forgot to call this out in the changelog though.

Also, IIUC, once set to "never" userspace will not be able to switch to other
modes unless it reloads KVM. Worth calling that out too?

Acked-by: Kai Huang <kai.huang@intel.com>
On Thu, Jun 01, 2023 at 05:58:59PM -0700, Sean Christopherson wrote:
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the
> per-VM recovery threads when the mitigation is hard disabled. Letting
> userspace pinky swear that userspace doesn't want to enable NX mitigation
> (without reloading KVM) allows certain use cases to avoid the latency
> problems associated with spawning a kthread for each VM.
>
> E.g. in FaaS use cases, the guest kernel is trusted and the host may
> create 100+ VMs per logical CPU, which can result in 100ms+ latencies when
> a burst of VMs is created.

Tested-by: Luiz Capitulino <luizcap@amazon.com>

Without this patch I can see the 100ms+ latencies on KVM_CREATE_VM even with
a single VM. Just run a VM with strace -T and grep for KVM_CREATE_VM. When
using kvmtool I get (latency in seconds - kernel HEAD is a4d7d70112):

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.023567>
ioctl(3, KVM_CREATE_VM, 0) = 4 <0.076709>
ioctl(3, KVM_CREATE_VM, 0) = 4 <0.109109>

With this patch and nx_huge_pages=never:

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.000518>
ioctl(3, KVM_CREATE_VM, 0) = 4 <0.000495>
ioctl(3, KVM_CREATE_VM, 0) = 4 <0.000513>

Now, I debugged down the single-VM case before seeing this patch, and it can
be avoided by building the kernel with CONFIG_CGROUP_FAVOR_DYNMODS=y or by
mounting the cgroup v2 mount point with the favordynmods mount option. This
is because the high latency is coming from a call to cgroup_attach_task_all()
in:

kvm_vm_worker_thread()
  cgroup_attach_task_all()
    percpu_down_write(&cgroup_threadgroup_rwsem) /* calls synchronize_rcu() */

This happens while kvm_vm_create_worker_thread() is waiting on a completion.
See commit 6a010a49b63a for more information.

This patch is preferable because the favordynmods solution has a trade-off.
However, why don't we make nx_huge_pages=never the default behavior if the
CPU is not vulnerable?

If there are concerns about not being able to restart the worker thread, then
maybe we could make this a .config option?

- Luiz

> [...]
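For reference, the measurement above can be reproduced with something like the following (the kvmtool invocation is illustrative; strace -T appends each syscall's elapsed time in angle brackets, which a sed expression can pull out):

```shell
# Trace only ioctls and keep the VM-creation lines, e.g.:
#   strace -T -e trace=ioctl lkvm run 2>&1 | grep KVM_CREATE_VM
# Each matching line ends with the elapsed time in <seconds>; extract it:
line='ioctl(3, KVM_CREATE_VM, 0) = 4 <0.076709>'
latency=$(printf '%s\n' "$line" | sed -n 's/.*<\([0-9.]*\)>.*/\1/p')
echo "$latency"
```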
On Tue, Jun 06, 2023, Luiz Capitulino wrote:
> On Thu, Jun 01, 2023 at 05:58:59PM -0700, Sean Christopherson wrote:
> However, why don't we make nx_huge_pages=never the default behavior if the
> CPU is not vulnerable?

Mainly because the mitigation has been around for 3.5 years, and there's a
non-zero chance that making "never" the default could cause hiccups for
unsuspecting users. If this were brand new code, I would definitely opt for
"never" as the default.

> If there are concerns about not being able to restart the worker thread,
> then maybe we could make this a .config option?

Eh, a Kconfig is unnecessarily complex, and wouldn't really change anything,
e.g. for users in the know, it's just as easy to force a module param as it
is to force a Kconfig, and to gain any benefit from the param being !never by
default, the Kconfig would also have to be off by default.

If "everyone" wants never to be the default, and Paolo doesn't object, I'd
rather just tack on a patch to make that happen, and cross my fingers there's
no fallout :-)
On Tue, Jun 06, 2023 at 03:03:38PM -0700, Sean Christopherson wrote:
> Mainly because the mitigation has been around for 3.5 years, and there's a
> non-zero chance that making "never" the default could cause hiccups for
> unsuspecting users. If this were brand new code, I would definitely opt for
> "never" as the default.

OK.

> Eh, a Kconfig is unnecessarily complex, and wouldn't really change
> anything, e.g. for users in the know, it's just as easy to force a module
> param as it is to force a Kconfig, and to gain any benefit from the param
> being !never by default, the Kconfig would also have to be off by default.

I agree it adds some complexity. The benefit is to allow shipping a kernel
with a good default, where KVM users with non-vulnerable CPUs get low latency
out of the box (vs. having to figure out whether they are vulnerable and
change a run-time configuration). But the idea would be to set "never" by
default as long as the CPU is not vulnerable.

> If "everyone" wants never to be the default, and Paolo doesn't object, I'd
> rather just tack on a patch to make that happen, and cross my fingers
> there's no fallout :-)

This would work too :)
On Thu, 01 Jun 2023 17:58:59 -0700, Sean Christopherson wrote:
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the
> per-VM recovery threads when the mitigation is hard disabled.
>
> [...]

Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go
on top and I don't want to introduce that change this late in the cycle. If
no one beats me to the punch (hint, hint ;-) ), I'll post a patch to make
"never" the default for unaffected hosts so that we can discuss/consider that
change for 6.6.

Thanks for the reviews!

[1/1] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
      https://github.com/kvm-x86/linux/commit/0b210faf3373

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes
On 2023-06-13 19:21, Sean Christopherson wrote:
> Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can
> go on top and I don't want to introduce that change this late in the cycle.
> If no one beats me to the punch (hint, hint ;-) ), I'll post a patch to
> make "never" the default for unaffected hosts so that we can
> discuss/consider that change for 6.6.

Thanks Sean, I agree with the plan. I could give the patch a try if you'd
like.

- Luiz
On Wed, Jun 14, 2023, Luiz Capitulino wrote:
> Thanks Sean, I agree with the plan. I could give a try on the patch if
> you'd like.

Yes please, thanks!
On 2023/6/15 03:07, Sean Christopherson wrote: > On Wed, Jun 14, 2023, Luiz Capitulino wrote: >> >> >> On 2023-06-13 19:21, Sean Christopherson wrote: >> >>> >>> >>> >>> On Thu, 01 Jun 2023 17:58:59 -0700, Sean Christopherson wrote: >>>> Add a "never" option to the nx_huge_pages module param to allow userspace >>>> to do a one-way hard disabling of the mitigation, and don't create the >>>> per-VM recovery threads when the mitigation is hard disabled. Letting >>>> userspace pinky swear that userspace doesn't want to enable NX mitigation >>>> (without reloading KVM) allows certain use cases to avoid the latency >>>> problems associated with spawning a kthread for each VM. >>>> >>>> [...] >>> >>> Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go on >>> top and I don't want to introduce that change this late in the cycle. If no one >>> beats me to the punch (hint, hint ;-) ), I'll post a patch to make "never" the >>> default for unaffected hosts so that we can discuss/consider that change for 6.6. >> >> Thanks Sean, I agree with the plan. I could give a try on the patch if you'd like. > > Yes please, thanks! > > As a KVM/x86 *feature*, playing with splitting and reconstructing large pages have other potential user scenarios, e.g. for performance test comparisons in a easier approach, not just for itlb_multihit mitigation. On unaffected machines (ICX and later), nx_huge_pages is already "N", and turning it into "never" doesn't help materially in the mitigation implementation, but loses flexibility. IMO, the real issue here is that the kernel thread "kvm-nx-lpage- recovery" is created unconditionally. We also need to be aware of the existence of this commit 084cc29f8bbb ("KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis"). 
One of the technical proposals is to defer kvm_vm_create_worker_thread()
to kvm_mmu_create() or kvm_init_mmu(), based on
kvm->arch.disable_nx_huge_pages, or even until guest paging mode is
enabled on the first vCPU.

Is this step worth taking?
On Wed, Jul 12, 2023, Like Xu wrote:
> On 2023/6/15 03:07, Sean Christopherson wrote:
> > On Wed, Jun 14, 2023, Luiz Capitulino wrote:
> > > > Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go on
> > > > top and I don't want to introduce that change this late in the cycle. If no one
> > > > beats me to the punch (hint, hint ;-) ), I'll post a patch to make "never" the
> > > > default for unaffected hosts so that we can discuss/consider that change for 6.6.
> > >
> > > Thanks Sean, I agree with the plan. I could give a try on the patch if you'd like.
> >
> > Yes please, thanks!
>
> As a KVM/x86 *feature*, splitting and reconstructing large pages has
> other potential use cases, e.g. easier performance comparisons, not
> just the itlb_multihit mitigation.

Enabling and disabling dirty logging is a far better tool for that, as it
gives userspace much more explicit control over which pages are
split/reconstituted, and when.

> On unaffected machines (ICX and later), nx_huge_pages is already "N",
> and turning it into "never" doesn't help materially in the mitigation
> implementation, but loses flexibility.

I'm becoming more and more convinced that losing the flexibility is perfectly
acceptable.  There's a very good argument to be made that mitigating DoS
attacks from the guest kernel should be done several levels up, e.g. by
refusing to create VMs for a customer that is bringing down hosts.  As Jim
has pointed out, plugging the hole only works if you are 100% confident there
are no other holes, and will never be other holes.

> IMO, the real issue here is that the kernel thread "kvm-nx-lpage-recovery"
> is created unconditionally. We also need to be aware of the existence of
> commit 084cc29f8bbb ("KVM: x86/MMU: Allow NX huge pages to be disabled on
> a per-vm basis").
> One of the technical proposals is to defer kvm_vm_create_worker_thread()
> to kvm_mmu_create() or kvm_init_mmu(), based on
> kvm->arch.disable_nx_huge_pages, or even until guest paging mode is
> enabled on the first vCPU.
>
> Is this step worth taking?

IMO, no.  In hindsight, adding KVM_CAP_VM_DISABLE_NX_HUGE_PAGES was likely a
mistake; requiring CAP_SYS_BOOT makes it annoyingly difficult to safely use
the capability.  My preference at this point is to make changes to the NX
hugepage mitigation only when there is a substantial benefit to an
already-deployed use case.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..2ed38916b904 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -58,6 +58,8 @@
 extern bool itlb_multihit_kvm_mitigation;
 
+static bool nx_hugepage_mitigation_hard_disabled;
+
 int __read_mostly nx_huge_pages = -1;
 static uint __read_mostly nx_huge_pages_recovery_period_ms;
 #ifdef CONFIG_PREEMPT_RT
@@ -67,12 +69,13 @@ static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
 static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
 #endif
 
+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
 static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
 static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp);
 
 static const struct kernel_param_ops nx_huge_pages_ops = {
 	.set = set_nx_huge_pages,
-	.get = param_get_bool,
+	.get = get_nx_huge_pages,
 };
 
 static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
@@ -6844,6 +6847,14 @@ static void mmu_destroy_caches(void)
 	kmem_cache_destroy(mmu_page_header_cache);
 }
 
+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
+{
+	if (nx_hugepage_mitigation_hard_disabled)
+		return sprintf(buffer, "never\n");
+
+	return param_get_bool(buffer, kp);
+}
+
 static bool get_nx_auto_mode(void)
 {
 	/* Return true when CPU has the bug, and mitigations are ON */
@@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	bool old_val = nx_huge_pages;
 	bool new_val;
 
+	if (nx_hugepage_mitigation_hard_disabled)
+		return -EPERM;
+
 	/* In "auto" mode deploy workaround only if CPU has the bug. */
-	if (sysfs_streq(val, "off"))
+	if (sysfs_streq(val, "off")) {
 		new_val = 0;
-	else if (sysfs_streq(val, "force"))
+	} else if (sysfs_streq(val, "force")) {
 		new_val = 1;
-	else if (sysfs_streq(val, "auto"))
+	} else if (sysfs_streq(val, "auto")) {
 		new_val = get_nx_auto_mode();
-	else if (kstrtobool(val, &new_val) < 0)
+	} else if (sysfs_streq(val, "never")) {
+		new_val = 0;
+
+		mutex_lock(&kvm_lock);
+		if (!list_empty(&vm_list)) {
+			mutex_unlock(&kvm_lock);
+			return -EBUSY;
+		}
+		nx_hugepage_mitigation_hard_disabled = true;
+		mutex_unlock(&kvm_lock);
+	} else if (kstrtobool(val, &new_val) < 0) {
 		return -EINVAL;
+	}
 
 	__set_nx_huge_pages(new_val);
 
@@ -7006,6 +7031,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 	uint old_period, new_period;
 	int err;
 
+	if (nx_hugepage_mitigation_hard_disabled)
+		return -EPERM;
+
 	was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);
 
 	err = param_set_uint(val, kp);
@@ -7161,6 +7189,9 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 {
 	int err;
 
+	if (nx_hugepage_mitigation_hard_disabled)
+		return 0;
+
 	err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
 					  "kvm-nx-lpage-recovery",
 					  &kvm->arch.nx_huge_page_recovery_thread);