Message ID | 20230116060134.80259-1-srivatsa@csail.mit.edu |
---|---|
State | New |
Series | [v2] x86/hotplug: Do not put offline vCPUs in mwait idle state |
Commit Message
Srivatsa S. Bhat
Jan. 16, 2023, 6:01 a.m. UTC
From: "Srivatsa S. Bhat (VMware)" <srivatsa@csail.mit.edu> Under hypervisors that support mwait passthrough, a vCPU in mwait CPU-idle state remains in guest context (instead of yielding to the hypervisor via VMEXIT), which helps speed up wakeups from idle. However, this runs into problems with CPU hotplug, because the Linux CPU offline path prefers to put the vCPU-to-be-offlined in mwait state, whenever mwait is available. As a result, since a vCPU in mwait remains in guest context and does not yield to the hypervisor, an offline vCPU *appears* to be 100% busy as viewed from the host, which prevents the hypervisor from running other vCPUs or workloads on the corresponding pCPU. [ Note that such a vCPU is not actually busy spinning though; it remains in mwait idle state in the guest ]. Fix this by preventing the use of mwait idle state in the vCPU offline play_dead() path for any hypervisor, even if mwait support is available. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Wyes Karny <wyes.karny@amd.com> Cc: Lewis Caroll <lewis.carroll@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Alexey Makhalov <amakhalov@vmware.com> Cc: Juergen Gross <jgross@suse.com> Cc: x86@kernel.org Cc: VMware PV-Drivers Reviewers <pv-drivers@vmware.com> Cc: virtualization@lists.linux-foundation.org Cc: kvm@vger.kernel.org Cc: xen-devel@lists.xenproject.org --- v1: https://lore.kernel.org/lkml/165843627080.142207.12667479241667142176.stgit@csail.mit.edu/ arch/x86/kernel/smpboot.c | 9 +++++++++ 1 file changed, 9 insertions(+)
Comments
On 16.01.23 07:01, Srivatsa S. Bhat wrote:
> From: "Srivatsa S. Bhat (VMware)" <srivatsa@csail.mit.edu>
>
> Under hypervisors that support mwait passthrough, a vCPU in mwait
> CPU-idle state remains in guest context (instead of yielding to the
> hypervisor via VMEXIT), which helps speed up wakeups from idle.
>
> However, this runs into problems with CPU hotplug, because the Linux
> CPU offline path prefers to put the vCPU-to-be-offlined in mwait
> state, whenever mwait is available. As a result, since a vCPU in mwait
> remains in guest context and does not yield to the hypervisor, an
> offline vCPU *appears* to be 100% busy as viewed from the host, which
> prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU. [ Note that such a vCPU is not actually busy
> spinning though; it remains in mwait idle state in the guest ].
>
> Fix this by preventing the use of mwait idle state in the vCPU offline
> play_dead() path for any hypervisor, even if mwait support is
> available.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Wyes Karny <wyes.karny@amd.com>
> Cc: Lewis Caroll <lewis.carroll@amd.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Alexey Makhalov <amakhalov@vmware.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: x86@kernel.org
> Cc: VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen
On Sun, 15 Jan 2023 22:01:34 -0800
"Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:

> From: "Srivatsa S. Bhat (VMware)" <srivatsa@csail.mit.edu>
>
> Under hypervisors that support mwait passthrough, a vCPU in mwait
> CPU-idle state remains in guest context (instead of yielding to the
> hypervisor via VMEXIT), which helps speed up wakeups from idle.
>
> However, this runs into problems with CPU hotplug, because the Linux
> CPU offline path prefers to put the vCPU-to-be-offlined in mwait
> state, whenever mwait is available. As a result, since a vCPU in mwait
> remains in guest context and does not yield to the hypervisor, an
> offline vCPU *appears* to be 100% busy as viewed from the host, which
> prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU. [ Note that such a vCPU is not actually busy
> spinning though; it remains in mwait idle state in the guest ].
>
> Fix this by preventing the use of mwait idle state in the vCPU offline
> play_dead() path for any hypervisor, even if mwait support is
> available.

if mwait is enabled, it's very likely guest to have cpuidle
enabled and using the same mwait as well. So exiting early from
mwait_play_dead(), might just punt workflow down:
native_play_dead()
...
mwait_play_dead();
if (cpuidle_play_dead()) <- possible mwait here
hlt_play_dead();

and it will end up in mwait again and only if that fails
it will go HLT route and maybe transition to VMM.

Instead of workaround on guest side,
shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
context when it is removing vCPU, among other things)

> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Wyes Karny <wyes.karny@amd.com>
> Cc: Lewis Caroll <lewis.carroll@amd.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Alexey Makhalov <amakhalov@vmware.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: x86@kernel.org
> Cc: VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> ---
>
> v1: https://lore.kernel.org/lkml/165843627080.142207.12667479241667142176.stgit@csail.mit.edu/
>
> arch/x86/kernel/smpboot.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 55cad72715d9..125a5d4bfded 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1763,6 +1763,15 @@ static inline void mwait_play_dead(void)
> 		return;
> 	if (!this_cpu_has(X86_FEATURE_CLFLUSH))
> 		return;
> +
> +	/*
> +	 * Do not use mwait in CPU offline play_dead if running under
> +	 * any hypervisor, to make sure that the offline vCPU actually
> +	 * yields to the hypervisor (which may not happen otherwise if
> +	 * the hypervisor supports mwait passthrough).
> +	 */
> +	if (this_cpu_has(X86_FEATURE_HYPERVISOR))
> +		return;
> 	if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
> 		return;
>
On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
>> Fix this by preventing the use of mwait idle state in the vCPU offline
>> play_dead() path for any hypervisor, even if mwait support is
>> available.
>
> if mwait is enabled, it's very likely guest to have cpuidle
> enabled and using the same mwait as well. So exiting early from
> mwait_play_dead(), might just punt workflow down:
> native_play_dead()
> ...
> mwait_play_dead();
> if (cpuidle_play_dead()) <- possible mwait here
> hlt_play_dead();
>
> and it will end up in mwait again and only if that fails
> it will go HLT route and maybe transition to VMM.

Good point.

> Instead of workaround on guest side,
> shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
> actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
> context when it is removing vCPU, among other things)

For a pure guest side CPU unplug operation:

guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online

the hypervisor is not involved at all. The vCPU is not removed in that
case.

So to ensure that this ends up in HLT something like the below is
required.

Note, the removal of the comment after mwait_play_dead() is intentional
because the comment is completely bogus. Not having MWAIT is not a
failure. But that wants to be a seperate patch.

Thanks,

        tglx
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 55cad72715d9..3f1f20f71ec5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1833,7 +1833,10 @@ void native_play_dead(void)
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 
-	mwait_play_dead();	/* Only returns on failure */
+	if (this_cpu_has(X86_FEATURE_HYPERVISOR))
+		hlt_play_dead();
+
+	mwait_play_dead();
 	if (cpuidle_play_dead())
 		hlt_play_dead();
 }
Hi Igor and Thomas,

Thank you for your review!

On 1/19/23 1:12 PM, Thomas Gleixner wrote:
> On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
>> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
>>> Fix this by preventing the use of mwait idle state in the vCPU offline
>>> play_dead() path for any hypervisor, even if mwait support is
>>> available.
>>
>> if mwait is enabled, it's very likely guest to have cpuidle
>> enabled and using the same mwait as well. So exiting early from
>> mwait_play_dead(), might just punt workflow down:
>> native_play_dead()
>> ...
>> mwait_play_dead();
>> if (cpuidle_play_dead()) <- possible mwait here
>> hlt_play_dead();
>>
>> and it will end up in mwait again and only if that fails
>> it will go HLT route and maybe transition to VMM.
>
> Good point.
>
>> Instead of workaround on guest side,
>> shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
>> actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
>> context when it is removing vCPU, among other things)
>
> For a pure guest side CPU unplug operation:
>
> guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
>
> the hypervisor is not involved at all. The vCPU is not removed in that
> case.
>

Agreed, and this is indeed the scenario I was targeting with this patch,
as opposed to vCPU removal from the host side. I'll add this clarification
to the commit message.

> So to ensure that this ends up in HLT something like the below is
> required.
>
> Note, the removal of the comment after mwait_play_dead() is intentional
> because the comment is completely bogus. Not having MWAIT is not a
> failure. But that wants to be a seperate patch.
>

Sounds good, will do and post a new version.

Thank you!

Regards,
Srivatsa
VMware Photon OS

> Thanks,
>
> tglx
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 55cad72715d9..3f1f20f71ec5 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1833,7 +1833,10 @@ void native_play_dead(void)
> 	play_dead_common();
> 	tboot_shutdown(TB_SHUTDOWN_WFS);
>
> -	mwait_play_dead();	/* Only returns on failure */
> +	if (this_cpu_has(X86_FEATURE_HYPERVISOR))
> +		hlt_play_dead();
> +
> +	mwait_play_dead();
> 	if (cpuidle_play_dead())
> 		hlt_play_dead();
> }
>
On Fri, 20 Jan 2023 05:55:11 -0800
"Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:

> Hi Igor and Thomas,
>
> Thank you for your review!
>
> On 1/19/23 1:12 PM, Thomas Gleixner wrote:
> > On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
> >> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
> >>> Fix this by preventing the use of mwait idle state in the vCPU offline
> >>> play_dead() path for any hypervisor, even if mwait support is
> >>> available.
> >>
> >> if mwait is enabled, it's very likely guest to have cpuidle
> >> enabled and using the same mwait as well. So exiting early from
> >> mwait_play_dead(), might just punt workflow down:
> >> native_play_dead()
> >> ...
> >> mwait_play_dead();
> >> if (cpuidle_play_dead()) <- possible mwait here
> >> hlt_play_dead();
> >>
> >> and it will end up in mwait again and only if that fails
> >> it will go HLT route and maybe transition to VMM.
> >
> > Good point.
> >
> >> Instead of workaround on guest side,
> >> shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
> >> actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
> >> context when it is removing vCPU, among other things)
> >
> > For a pure guest side CPU unplug operation:
> >
> > guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
> >
> > the hypervisor is not involved at all. The vCPU is not removed in that
> > case.
> >
>
> Agreed, and this is indeed the scenario I was targeting with this patch,
> as opposed to vCPU removal from the host side. I'll add this clarification
> to the commit message.

commit message explicitly said:
"which prevents the hypervisor from running other vCPUs or workloads on the
corresponding pCPU."

and that implies unplug on hypervisor side as well.
Why? That's because when hypervisor exposes mwait to guest, it has to reserve/pin
a pCPU for each of present vCPUs. And you can safely run other VMs/workloads
on that pCPU only after it's not possible for it to be reused by VM where
it was used originally.

Now consider following worst (and most likely) case without unplug
on hypervisor side:

1. vm1mwait: pin pCPU2 to vCPU2
2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
   -> HLT -> VMEXIT
--
3. vm2mwait: pin pCPU2 to vCPUx and start VM
4. vm2mwait: guest OS onlines Vcpu and starts using it incl.
   going into idle=>mwait state
--
5. vm1mwait: it still thinks that vCPU is present it can rightfully do:
   guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
--
6.1 best case vm1mwait online fails after timeout
6.2 worse case: vm2mwait does VMEXIT on vCPUx around time-frame when
    vm1mwait onlines vCPU2, the online may succeed and then vm2mwait's
    vCPUx will be stuck (possibly indefinitely) until for some reason
    VMEXIT happens on vm1mwait's vCPU2 _and_ host decides to schedule
    vCPUx on pCPU2 which would make vm1mwait stuck on vCPU2.
So either way it's expected behavior.

And if there is no intention to unplug vCPU on hypervisor side,
then VMEXIT on play_dead is not really necessary (mwait is better
then HLT), since hypervisor can't safely reuse pCPU elsewhere and
VCPU goes into deep sleep within guest context.

PS:
The only case where making HLT/VMEXIT on play_dead might work out,
would be if new workload weren't pinned to the same pCPU nor
used mwait (i.e. host can migrate it elsewhere and schedule
vCPU2 back on pCPU2).

> > So to ensure that this ends up in HLT something like the below is
> > required.
> >
> > Note, the removal of the comment after mwait_play_dead() is intentional
> > because the comment is completely bogus. Not having MWAIT is not a
> > failure. But that wants to be a seperate patch.
> >
>
> Sounds good, will do and post a new version.
>
> Thank you!
>
> Regards,
> Srivatsa
> VMware Photon OS
>
> > Thanks,
> >
> > tglx
> > ---
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 55cad72715d9..3f1f20f71ec5 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1833,7 +1833,10 @@ void native_play_dead(void)
> > 	play_dead_common();
> > 	tboot_shutdown(TB_SHUTDOWN_WFS);
> >
> > -	mwait_play_dead();	/* Only returns on failure */
> > +	if (this_cpu_has(X86_FEATURE_HYPERVISOR))
> > +		hlt_play_dead();
> > +
> > +	mwait_play_dead();
> > 	if (cpuidle_play_dead())
> > 		hlt_play_dead();
> > }
> >
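For readers unfamiliar with the "pin pCPU2 to vCPU2" step in the scenario above: a hypervisor that dedicates a pCPU to a vCPU typically does so by setting the CPU affinity of the vCPU's host thread. The sketch below is illustrative only, is not taken from any particular VMM, and the helper name is made up for the example:

```c
/*
 * Illustration only: pin the calling (vCPU) thread to physical CPU 2,
 * roughly what a VMM does when it dedicates a pCPU to a vCPU.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_current_thread_to(int pcpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(pcpu, &set);

	/* A pid of 0 means "the calling thread" for sched_setaffinity(). */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}

int main(void)
{
	return pin_current_thread_to(2) ? 1 : 0;
}
```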
On Fri, Jan 20, 2023, Igor Mammedov wrote:
> On Fri, 20 Jan 2023 05:55:11 -0800
> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
>
> > Hi Igor and Thomas,
> >
> > Thank you for your review!
> >
> > On 1/19/23 1:12 PM, Thomas Gleixner wrote:
> > > On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
> > >> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
> > >>> Fix this by preventing the use of mwait idle state in the vCPU offline
> > >>> play_dead() path for any hypervisor, even if mwait support is
> > >>> available.
> > >>
> > >> if mwait is enabled, it's very likely guest to have cpuidle
> > >> enabled and using the same mwait as well. So exiting early from
> > >> mwait_play_dead(), might just punt workflow down:
> > >> native_play_dead()
> > >> ...
> > >> mwait_play_dead();
> > >> if (cpuidle_play_dead()) <- possible mwait here
> > >> hlt_play_dead();
> > >>
> > >> and it will end up in mwait again and only if that fails
> > >> it will go HLT route and maybe transition to VMM.
> > >
> > > Good point.
> > >
> > >> Instead of workaround on guest side,
> > >> shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
> > >> actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
> > >> context when it is removing vCPU, among other things)
> > >
> > > For a pure guest side CPU unplug operation:
> > >
> > > guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
> > >
> > > the hypervisor is not involved at all. The vCPU is not removed in that
> > > case.
> > >
> >
> > Agreed, and this is indeed the scenario I was targeting with this patch,
> > as opposed to vCPU removal from the host side. I'll add this clarification
> > to the commit message.

Forcing HLT doesn't solve anything, it's perfectly legal to passthrough HLT. I
guarantee there are use cases that passthrough HLT but _not_ MONITOR/MWAIT, and
that passthrough all of them.

> commit message explicitly said:
> "which prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU."
>
> and that implies unplug on hypervisor side as well.
> Why? That's because when hypervisor exposes mwait to guest, it has to reserve/pin
> a pCPU for each of present vCPUs. And you can safely run other VMs/workloads
> on that pCPU only after it's not possible for it to be reused by VM where
> it was used originally.

Pinning isn't strictly required from a safety perspective. The latency of context
switching may suffer due to wake times, but preempting a vCPU that it's C1 (or
deeper) won't cause functional problems. Passing through an entire socket
(or whatever scope triggers extra fun) might be a different story, but pinning
isn't strictly required.

That said, I 100% agree that this is expected behavior and not a bug. Letting the
guest execute MWAIT or HLT means the host won't have perfect visibility into guest
activity state.

Oversubscribing a pCPU and exposing MWAIT and/or HLT to vCPUs is generally not done
precisely because the guest will always appear busy without extra effort on the
host. E.g. KVM requires an explicit opt-in from userspace to expose MWAIT and/or
HLT.

If someone really wants to effeciently oversubscribe pCPUs and passthrough MWAIT,
then their best option is probably to have a paravirt interface so that the guest
can tell the host its offlining a vCPU. Barring that the host could inspect the
guest when preempting a vCPU to try and guesstimate how much work the vCPU is
actually doing in order to make better scheduling decisions.

> Now consider following worst (and most likely) case without unplug
> on hypervisor side:
>
> 1. vm1mwait: pin pCPU2 to vCPU2
> 2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
>    -> HLT -> VMEXIT
> --
> 3. vm2mwait: pin pCPU2 to vCPUx and start VM
> 4. vm2mwait: guest OS onlines Vcpu and starts using it incl.
>    going into idle=>mwait state
> --
> 5. vm1mwait: it still thinks that vCPU is present it can rightfully do:
>    guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
> --
> 6.1 best case vm1mwait online fails after timeout
> 6.2 worse case: vm2mwait does VMEXIT on vCPUx around time-frame when
>     vm1mwait onlines vCPU2, the online may succeed and then vm2mwait's
>     vCPUx will be stuck (possibly indefinitely) until for some reason
>     VMEXIT happens on vm1mwait's vCPU2 _and_ host decides to schedule
>     vCPUx on pCPU2 which would make vm1mwait stuck on vCPU2.
> So either way it's expected behavior.
>
> And if there is no intention to unplug vCPU on hypervisor side,
> then VMEXIT on play_dead is not really necessary (mwait is better
> then HLT), since hypervisor can't safely reuse pCPU elsewhere and
> VCPU goes into deep sleep within guest context.
>
> PS:
> The only case where making HLT/VMEXIT on play_dead might work out,
> would be if new workload weren't pinned to the same pCPU nor
> used mwait (i.e. host can migrate it elsewhere and schedule
> vCPU2 back on pCPU2).
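The userspace opt-in Sean refers to is KVM's KVM_CAP_X86_DISABLE_EXITS capability, which lets the VMM ask KVM not to intercept MWAIT and/or HLT. A minimal sketch of that opt-in follows; it is illustrative only, assumes vm_fd was obtained from KVM_CREATE_VM, and the wrapper function name is made up for the example (real VMMs such as QEMU hide this behind their own options):

```c
/*
 * Illustration only: opt in to MWAIT/HLT passthrough on an existing KVM VM fd.
 * Without this opt-in, KVM intercepts MWAIT and HLT, so idle vCPUs do not
 * appear busy to the host.
 */
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdio.h>

int allow_mwait_hlt_passthrough(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_X86_DISABLE_EXITS,
		.args[0] = KVM_X86_DISABLE_EXITS_MWAIT | KVM_X86_DISABLE_EXITS_HLT,
	};

	/* Typically enabled before any vCPUs are created on this VM. */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap)) {
		perror("KVM_ENABLE_CAP(KVM_CAP_X86_DISABLE_EXITS)");
		return -1;
	}
	return 0;
}
```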
Hi Igor and Sean,

On 1/20/23 10:35 AM, Sean Christopherson wrote:
> On Fri, Jan 20, 2023, Igor Mammedov wrote:
>> On Fri, 20 Jan 2023 05:55:11 -0800
>> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
>>
>>> Hi Igor and Thomas,
>>>
>>> Thank you for your review!
>>>
>>> On 1/19/23 1:12 PM, Thomas Gleixner wrote:
>>>> On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:
>>>>> "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> wrote:
>>>>>> Fix this by preventing the use of mwait idle state in the vCPU offline
>>>>>> play_dead() path for any hypervisor, even if mwait support is
>>>>>> available.
>>>>>
>>>>> if mwait is enabled, it's very likely guest to have cpuidle
>>>>> enabled and using the same mwait as well. So exiting early from
>>>>> mwait_play_dead(), might just punt workflow down:
>>>>> native_play_dead()
>>>>> ...
>>>>> mwait_play_dead();
>>>>> if (cpuidle_play_dead()) <- possible mwait here
>>>>> hlt_play_dead();
>>>>>
>>>>> and it will end up in mwait again and only if that fails
>>>>> it will go HLT route and maybe transition to VMM.
>>>>
>>>> Good point.
>>>>
>>>>> Instead of workaround on guest side,
>>>>> shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
>>>>> actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
>>>>> context when it is removing vCPU, among other things)
>>>>
>>>> For a pure guest side CPU unplug operation:
>>>>
>>>> guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
>>>>
>>>> the hypervisor is not involved at all. The vCPU is not removed in that
>>>> case.
>>>>
>>>
>>> Agreed, and this is indeed the scenario I was targeting with this patch,
>>> as opposed to vCPU removal from the host side. I'll add this clarification
>>> to the commit message.
>
> Forcing HLT doesn't solve anything, it's perfectly legal to passthrough HLT. I
> guarantee there are use cases that passthrough HLT but _not_ MONITOR/MWAIT, and
> that passthrough all of them.
>
>> commit message explicitly said:
>> "which prevents the hypervisor from running other vCPUs or workloads on the
>> corresponding pCPU."
>>
>> and that implies unplug on hypervisor side as well.
>> Why? That's because when hypervisor exposes mwait to guest, it has to reserve/pin
>> a pCPU for each of present vCPUs. And you can safely run other VMs/workloads
>> on that pCPU only after it's not possible for it to be reused by VM where
>> it was used originally.
>
> Pinning isn't strictly required from a safety perspective. The latency of context
> switching may suffer due to wake times, but preempting a vCPU that it's C1 (or
> deeper) won't cause functional problems. Passing through an entire socket
> (or whatever scope triggers extra fun) might be a different story, but pinning
> isn't strictly required.
>
> That said, I 100% agree that this is expected behavior and not a bug. Letting the
> guest execute MWAIT or HLT means the host won't have perfect visibility into guest
> activity state.
>
> Oversubscribing a pCPU and exposing MWAIT and/or HLT to vCPUs is generally not done
> precisely because the guest will always appear busy without extra effort on the
> host. E.g. KVM requires an explicit opt-in from userspace to expose MWAIT and/or
> HLT.
>
> If someone really wants to effeciently oversubscribe pCPUs and passthrough MWAIT,
> then their best option is probably to have a paravirt interface so that the guest
> can tell the host its offlining a vCPU. Barring that the host could inspect the
> guest when preempting a vCPU to try and guesstimate how much work the vCPU is
> actually doing in order to make better scheduling decisions.
>
>> Now consider following worst (and most likely) case without unplug
>> on hypervisor side:
>>
>> 1. vm1mwait: pin pCPU2 to vCPU2
>> 2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
>>    -> HLT -> VMEXIT
>> --
>> 3. vm2mwait: pin pCPU2 to vCPUx and start VM
>> 4. vm2mwait: guest OS onlines Vcpu and starts using it incl.
>>    going into idle=>mwait state
>> --
>> 5. vm1mwait: it still thinks that vCPU is present it can rightfully do:
>>    guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
>> --
>> 6.1 best case vm1mwait online fails after timeout
>> 6.2 worse case: vm2mwait does VMEXIT on vCPUx around time-frame when
>>     vm1mwait onlines vCPU2, the online may succeed and then vm2mwait's
>>     vCPUx will be stuck (possibly indefinitely) until for some reason
>>     VMEXIT happens on vm1mwait's vCPU2 _and_ host decides to schedule
>>     vCPUx on pCPU2 which would make vm1mwait stuck on vCPU2.
>> So either way it's expected behavior.
>>
>> And if there is no intention to unplug vCPU on hypervisor side,
>> then VMEXIT on play_dead is not really necessary (mwait is better
>> then HLT), since hypervisor can't safely reuse pCPU elsewhere and
>> VCPU goes into deep sleep within guest context.
>>
>> PS:
>> The only case where making HLT/VMEXIT on play_dead might work out,
>> would be if new workload weren't pinned to the same pCPU nor
>> used mwait (i.e. host can migrate it elsewhere and schedule
>> vCPU2 back on pCPU2).

That makes sense. Thank you both for the detailed explanation!
Let's drop this patch.

Regards,
Srivatsa
VMware Photon OS
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 55cad72715d9..125a5d4bfded 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1763,6 +1763,15 @@ static inline void mwait_play_dead(void)
 		return;
 	if (!this_cpu_has(X86_FEATURE_CLFLUSH))
 		return;
+
+	/*
+	 * Do not use mwait in CPU offline play_dead if running under
+	 * any hypervisor, to make sure that the offline vCPU actually
+	 * yields to the hypervisor (which may not happen otherwise if
+	 * the hypervisor supports mwait passthrough).
+	 */
+	if (this_cpu_has(X86_FEATURE_HYPERVISOR))
+		return;
 	if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
 		return;
 