Message ID | 20231207010302.2240506-1-jmattson@google.com |
---|---|
State | New |
Subject | [PATCH v4 10/12] KVM: x86: never write to memory from kvm_vcpu_check_block() |
From | Jim Mattson <jmattson@google.com> |
Date | Wed, 6 Dec 2023 17:03:02 -0800 |
In-Reply-To | <20220921003201.1441511-11-seanjc@google.com> |
Series | None |
Commit Message
Jim Mattson
Dec. 7, 2023, 1:03 a.m. UTC
kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore it cannot sleep. Writing to guest memory is therefore forbidden, but it can happen on AMD processors if kvm_check_nested_events() causes a vmexit.

Fortunately, all events that are caught by kvm_check_nested_events() are also recognized by kvm_vcpu_has_events() through vendor callbacks such as kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so remove the call and postpone the actual processing to vcpu_block().

Opportunistically honor the return of kvm_check_nested_events(). KVM punted on the check in kvm_vcpu_running() because the only error path is if vmx_complete_nested_posted_interrupt() fails, in which case KVM exits to userspace with "internal error" i.e. the VM is likely dead anyways so it wasn't worth overloading the return of kvm_vcpu_running().

Add the check mostly so that KVM is consistent with itself; the return of the call via kvm_apic_accept_events()=>kvm_check_nested_events() that immediately follows _is_ checked.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[sean: check and handle return of kvm_check_nested_events()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

This commit breaks delivery of a (virtualized) posted interrupt from an L1 vCPU to a halted L2 vCPU.

Looking back at commit e6c67d8cf117 ("KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv"), Liran wrote:

    Note that this also handles the case of nested posted-interrupt by the fact RVI is updated in vmx_complete_nested_posted_interrupt() which is called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() -> kvm_vcpu_running() -> vmx_check_nested_events() -> vmx_complete_nested_posted_interrupt().

Clearly, that is no longer the case.
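For context, the "cannot sleep" constraint comes from how the generic halt path blocks: the arch callback runs after the task has already marked itself non-running. A simplified sketch of that loop (illustrative only, not a verbatim excerpt of virt/kvm/kvm_main.c):

	/*
	 * Simplified shape of kvm_vcpu_block(): by the time the arch
	 * callback runs, the task is TASK_INTERRUPTIBLE, so any path that
	 * can fault on swapped-out guest memory may sleep and clobber the
	 * task state.
	 */
	prepare_to_rcuwait(kvm_arch_vcpu_get_wait(vcpu));
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (kvm_vcpu_check_block(vcpu) < 0)	/* must not sleep or fault */
			break;
		schedule();
	}
	finish_rcuwait(kvm_arch_vcpu_get_wait(vcpu));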
Comments
On Wed, Dec 06, 2023, Jim Mattson wrote:
> [...]
>
> This commit breaks delivery of a (virtualized) posted interrupt from an L1 vCPU to a halted L2 vCPU.
>
> Looking back at commit e6c67d8cf117 ("KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv"), Liran wrote:
>
>     Note that this also handles the case of nested posted-interrupt by the fact RVI is updated in vmx_complete_nested_posted_interrupt() which is called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() -> kvm_vcpu_running() -> vmx_check_nested_events() -> vmx_complete_nested_posted_interrupt().
>
> Clearly, that is no longer the case.

Doh. We got the less obvious cases and missed the obvious one.

Ugh, and we also missed a related mess in kvm_guest_apic_has_interrupt(). That thing should really be folded into vmx_has_nested_events().

Good gravy. And vmx_interrupt_blocked() does the wrong thing because that specifically checks if L1 interrupts are blocked.

Compile tested only, and definitely needs to be chunked into multiple patches, but I think something like this mess?
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h    |  1 -
 arch/x86/kvm/lapic.c               | 20 ++---------------
 arch/x86/kvm/lapic.h               | 12 ++++++++++
 arch/x86/kvm/vmx/nested.c          | 36 ++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.c             | 34 ++++++++--------------------
 arch/x86/kvm/vmx/vmx.h             |  1 +
 arch/x86/kvm/x86.c                 | 10 +--------
 8 files changed, 59 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 378ed944b849..6f81774c1dd0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -85,7 +85,6 @@ KVM_X86_OP_OPTIONAL(update_cr8_intercept)
 KVM_X86_OP(refresh_apicv_exec_ctrl)
 KVM_X86_OP_OPTIONAL(hwapic_irr_update)
 KVM_X86_OP_OPTIONAL(hwapic_isr_update)
-KVM_X86_OP_OPTIONAL_RET0(guest_apic_has_interrupt)
 KVM_X86_OP_OPTIONAL(load_eoi_exitmap)
 KVM_X86_OP_OPTIONAL(set_virtual_apic_mode)
 KVM_X86_OP_OPTIONAL(set_apic_access_page_addr)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c8c7e2475a18..fc1466035a8c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1685,7 +1685,6 @@ struct kvm_x86_ops {
 	void (*refresh_apicv_exec_ctrl)(struct kvm_vcpu *vcpu);
 	void (*hwapic_irr_update)(struct kvm_vcpu *vcpu, int max_irr);
 	void (*hwapic_isr_update)(int isr);
-	bool (*guest_apic_has_interrupt)(struct kvm_vcpu *vcpu);
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
 	void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 245b20973cae..6d74c42accdf 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -55,7 +55,6 @@
 #define APIC_VERSION			0x14UL
 #define LAPIC_MMIO_LENGTH		(1 << 12)
 /* followed define is not in apicdef.h */
-#define MAX_APIC_VECTOR			256
 #define APIC_VECTORS_PER_REG		32
 
 static bool lapic_timer_advance_dynamic __read_mostly;
@@ -619,21 +618,6 @@ static const unsigned int apic_lvt_mask[KVM_APIC_MAX_NR_LVT_ENTRIES] = {
 	[LVT_CMCI] = LVT_MASK | APIC_MODE_MASK
 };
 
-static int find_highest_vector(void *bitmap)
-{
-	int vec;
-	u32 *reg;
-
-	for (vec = MAX_APIC_VECTOR - APIC_VECTORS_PER_REG;
-	     vec >= 0; vec -= APIC_VECTORS_PER_REG) {
-		reg = bitmap + REG_POS(vec);
-		if (*reg)
-			return __fls(*reg) + vec;
-	}
-
-	return -1;
-}
-
 static u8 count_vectors(void *bitmap)
 {
 	int vec;
@@ -697,7 +681,7 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
 static inline int apic_search_irr(struct kvm_lapic *apic)
 {
-	return find_highest_vector(apic->regs + APIC_IRR);
+	return kvm_apic_find_highest_vector(apic->regs + APIC_IRR);
 }
 
 static inline int apic_find_highest_irr(struct kvm_lapic *apic)
@@ -775,7 +759,7 @@ static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 	if (likely(apic->highest_isr_cache != -1))
 		return apic->highest_isr_cache;
 
-	result = find_highest_vector(apic->regs + APIC_ISR);
+	result = kvm_apic_find_highest_vector(apic->regs + APIC_ISR);
 	ASSERT(result == -1 || result >= 16);
 
 	return result;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 0a0ea4b5dd8c..4239bf329748 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -12,6 +12,8 @@
 #define KVM_APIC_INIT		0
 #define KVM_APIC_SIPI		1
 
+#define MAX_APIC_VECTOR		256
+
 #define APIC_SHORT_MASK			0xc0000
 #define APIC_DEST_NOSHORT		0x0
 #define APIC_DEST_MASK			0x800
@@ -151,6 +153,16 @@ u64 kvm_lapic_readable_reg_mask(struct kvm_lapic *apic);
 #define VEC_POS(v) ((v) & (32 - 1))
 #define REG_POS(v) (((v) >> 5) << 4)
 
+static inline int kvm_apic_find_highest_vector(unsigned long *bitmap)
+{
+	unsigned long bit = find_last_bit(bitmap, MAX_APIC_VECTOR);
+
+	if (bit == MAX_APIC_VECTOR)
+		return -1;
+
+	return bit;
+}
+
 static inline void kvm_lapic_clear_vector(int vec, void *bitmap)
 {
 	clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4ba46e1b29d2..28b3c1c0830f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3964,8 +3964,40 @@ static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
 
 static bool vmx_has_nested_events(struct kvm_vcpu *vcpu)
 {
-	return nested_vmx_preemption_timer_pending(vcpu) ||
-	       to_vmx(vcpu)->nested.mtf_pending;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	void *vapic = vmx->nested.virtual_apic_map.hva;
+	int max_irr, vppr;
+
+	if (nested_vmx_preemption_timer_pending(vcpu) ||
+	    vmx->nested.mtf_pending)
+		return true;
+
+	if (!nested_cpu_has_vid(get_vmcs12(vcpu)) ||
+	    __vmx_interrupt_blocked(vcpu))
+		return false;
+
+	if (!vapic)
+		return false;
+
+	vppr = *((u32 *)(vapic + APIC_PROCPRI));
+
+	max_irr = vmx_get_rvi();
+	if ((max_irr & 0xf0) > (vppr & 0xf0))
+		return true;
+
+	max_irr = kvm_apic_find_highest_vector(vapic + APIC_IRR);
+	if (max_irr > 0 && (max_irr & 0xf0) > (vppr & 0xf0))
+		return true;
+
+	if (vmx->nested.pi_desc && pi_test_on(vmx->nested.pi_desc)) {
+		void *pir = vmx->nested.pi_desc->pir;
+
+		max_irr = kvm_apic_find_highest_vector(pir);
+		if (max_irr > 0 && (max_irr & 0xf0) > (vppr & 0xf0))
+			return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d30df9b3fe3e..be8ab49e9965 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4107,26 +4107,6 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
 	}
 }
 
-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	void *vapic_page;
-	u32 vppr;
-	int rvi;
-
-	if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
-	    !nested_cpu_has_vid(get_vmcs12(vcpu)) ||
-	    WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
-		return false;
-
-	rvi = vmx_get_rvi();
-
-	vapic_page = vmx->nested.virtual_apic_map.hva;
-	vppr = *((u32 *)(vapic_page + APIC_PROCPRI));
-
-	return ((rvi & 0xf0) > (vppr & 0xf0));
-}
-
 static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5040,16 +5020,21 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 	return !vmx_nmi_blocked(vcpu);
 }
 
-bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
+bool __vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
-		return false;
-
 	return !(vmx_get_rflags(vcpu) & X86_EFLAGS_IF) ||
 	       (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
 
+bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
+{
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		return false;
+
+	return __vmx_interrupt_blocked(vcpu);
+}
+
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -8335,7 +8320,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
 	.hwapic_irr_update = vmx_hwapic_irr_update,
 	.hwapic_isr_update = vmx_hwapic_isr_update,
-	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
 	.sync_pir_to_irr = vmx_sync_pir_to_irr,
 	.deliver_interrupt = vmx_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 45cee1a8bc0a..4097afed4425 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -400,6 +400,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu);
 void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
 bool vmx_nmi_blocked(struct kvm_vcpu *vcpu);
+bool __vmx_interrupt_blocked(struct kvm_vcpu *vcpu);
 bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu);
 bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
 void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1983947b8965..b9e599a4b487 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12974,12 +12974,6 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	kvm_arch_free_memslot(kvm, old);
 }
 
-static inline bool kvm_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
-{
-	return (is_guest_mode(vcpu) &&
-		static_call(kvm_x86_guest_apic_has_interrupt)(vcpu));
-}
-
 static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 {
 	if (!list_empty_careful(&vcpu->async_pf.done))
@@ -13010,9 +13004,7 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	if (kvm_test_request(KVM_REQ_PMI, vcpu))
 		return true;
 
-	if (kvm_arch_interrupt_allowed(vcpu) &&
-	    (kvm_cpu_has_interrupt(vcpu) ||
-	     kvm_guest_apic_has_interrupt(vcpu)))
+	if (kvm_arch_interrupt_allowed(vcpu) && kvm_cpu_has_interrupt(vcpu))
 		return true;
 
 	if (kvm_hv_has_stimer_pending(vcpu))

base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
--
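As an aside on the helper above: find_last_bit() returns the bitmap size when no bit is set, which is why kvm_apic_find_highest_vector() compares against MAX_APIC_VECTOR. A standalone userspace illustration of the same highest-vector scan, assuming a contiguous 256-bit bitmap such as a posted-interrupt PIR (hypothetical example, not part of the patch):

	#include <stdint.h>
	#include <stdio.h>

	/* Highest set vector in a contiguous 256-bit bitmap (e.g. a PIR), or -1. */
	static int find_highest_vector(const uint64_t bitmap[4])
	{
		for (int i = 3; i >= 0; i--) {
			if (bitmap[i])
				return 63 - __builtin_clzll(bitmap[i]) + i * 64;
		}
		return -1;
	}

	int main(void)
	{
		uint64_t pir[4] = { 0 };

		pir[0] |= 1ull << 0x20;	/* pend vector 0x20 */
		pir[3] |= 1ull << 1;	/* pend vector 0xc1 */
		printf("highest pending vector: 0x%x\n", find_highest_vector(pir));
		return 0;
	}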
On Thu, Dec 7, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
> Doh. We got the less obvious cases and missed the obvious one.
>
> Ugh, and we also missed a related mess in kvm_guest_apic_has_interrupt(). That thing should really be folded into vmx_has_nested_events().
>
> Good gravy. And vmx_interrupt_blocked() does the wrong thing because that specifically checks if L1 interrupts are blocked.
>
> Compile tested only, and definitely needs to be chunked into multiple patches, but I think something like this mess?

The proposed patch does not fix the problem. In fact, it messes things up so much that I don't get any test results back.

Google has an internal K-U-T test that demonstrates the problem. I will post it soon.
On Sun, Dec 10, 2023, Jim Mattson wrote:
> On Thu, Dec 7, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
> > [...]
> > Compile tested only, and definitely needs to be chunked into multiple patches, but I think something like this mess?
>
> The proposed patch does not fix the problem. In fact, it messes things up so much that I don't get any test results back.

Drat.

> Google has an internal K-U-T test that demonstrates the problem. I will post it soon.

Received, I'll dig in soonish, though "soonish" might unfortunately mean 2024.
On Tue, 2023-12-12 at 07:28 -0800, Sean Christopherson wrote:
> On Sun, Dec 10, 2023, Jim Mattson wrote:
> > [...]
> > Google has an internal K-U-T test that demonstrates the problem. I will post it soon.
>
> Received, I'll dig in soonish, though "soonish" might unfortunately mean 2024.

Hi,

So this is what I think:

KVM does have kvm_guest_apic_has_interrupt() for this exact purpose, to check if nested APICv has a pending interrupt before halting.

However the problem is bigger: with APICv we have in essence two pending-interrupt bitmaps, the PIR and the IRR, and to know if the guest has a pending interrupt one has in theory to copy PIR to IRR, then see if the max is larger than the current PPR.

Since we don't want to write to guest memory, and the IRR here resides in guest memory, I guess we have to do a 'dry-run' version of vmx_complete_nested_posted_interrupt() and call it from kvm_guest_apic_has_interrupt().

What do you think? I can prepare a patch for this.

Can you share a reproducer, or write a new one that can be shared?

Best regards,
	Maxim Levitsky
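For illustration, a dry-run along those lines might take the following shape (a rough sketch only, reusing the kvm_apic_find_highest_vector() helper proposed earlier in the thread, and assuming the PI descriptor is already mapped; this is not actual KVM code):

	/*
	 * Rough sketch, not actual KVM code: report whether a nested posted
	 * interrupt would be deliverable, without copying PIR into the
	 * in-guest-memory vIRR.
	 */
	static bool nested_pi_deliverable(struct vcpu_vmx *vmx, u32 vppr)
	{
		int vec;

		if (!vmx->nested.pi_desc || !pi_test_on(vmx->nested.pi_desc))
			return false;

		/* The PIR is a contiguous 256-bit bitmap, so a flat scan works. */
		vec = kvm_apic_find_highest_vector((void *)vmx->nested.pi_desc->pir);

		/* Deliverable only if its priority class is above the vPPR's. */
		return vec >= 0 && (vec & 0xf0) > (vppr & 0xf0);
	}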
On Wed, Dec 13, 2023 at 2:25 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> [...]
>
> Can you share a reproducer, or write a new one that can be shared?

See https://lore.kernel.org/kvm/20231211185552.3856862-1-jmattson@google.com/.
On Wed, 2023-12-13 at 14:31 -0800, Jim Mattson wrote:
> On Wed, Dec 13, 2023 at 2:25 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > [...]
> > Can you share a reproducer, or write a new one that can be shared?
>
> See https://lore.kernel.org/kvm/20231211185552.3856862-1-jmattson@google.com/.

Thank you!

Best regards,
	Maxim Levitsky
On Thu, Dec 14, 2023, Maxim Levitsky wrote:
> [...]
>
> So this is what I think:
>
> KVM does have kvm_guest_apic_has_interrupt() for this exact purpose, to check if nested APICv has a pending interrupt before halting.

For all intents and purposes, so was nested_ops->has_events(). I don't see any reason to have two APIs that do the same thing, and the call to kvm_guest_apic_has_interrupt() is wrong in that it doesn't verify that IRQs are enabled for _L2_. That's why my preference is to fold the two together.

> However the problem is bigger: with APICv we have in essence two pending-interrupt bitmaps, the PIR and the IRR, and to know if the guest has a pending interrupt one has in theory to copy PIR to IRR, then see if the max is larger than the current PPR.

Yeah, this is what my untested hack-a-patch tried to do.

> Since we don't want to write to guest memory,

The changelog is misleading/wrong. Writing guest memory is ok; what isn't safe is blocking or sleeping, i.e. KVM must not trigger a host page fault due to accessing a page that's been swapped out. Read vs. write doesn't matter.

So KVM can safely read and write guest memory so long as it is already mapped by kvm_vcpu_map() (or I suppose if we wrapped an access with pagefault_disable(), but I can't think of a sane reason to do that). E.g. nVMX can access a vCPU's PID mapping, but synthesizing a nested VM-Exit will cause explosions on nSVM.

> and the IRR here resides in guest memory, I guess we have to do a 'dry-run' version of vmx_complete_nested_posted_interrupt() and call it from kvm_guest_apic_has_interrupt().

nested_ops->has_events() is the much better fit, e.g. the naming won't get weird and we can gate the whole thing on is_guest_mode(). Though we probably need a wrapper to handle any commonalities between nVMX and nSVM.

> What do you think? I can prepare a patch for this.

As above, this is what I tried to do, sort of. Though it's obviously broken. We don't need a full dry-run because KVM only needs to detect events that are unique to L2, e.g. nVMX's preemption timer, MTF, and pending virtual interrupts (hmm, I suspect nSVM's vNMI is broken too). Things like INIT and SMI don't require nested virtualization awareness because the event itself is tracked for the vCPU as a whole.
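Sketching Sean's suggestion, the common-code side of kvm_vcpu_has_events() would then gate everything on is_guest_mode() and defer L2-only event detection to the vendor callback (assuming the callback grows the vID/PIR checks; sketch only, not the final patch):

	/*
	 * Sketch only: L2-only events (preemption timer, MTF, pending
	 * virtual interrupts) come from the vendor nested ops; INIT, SMI,
	 * etc. are tracked for the vCPU as a whole and need no nested
	 * awareness here.
	 */
	if (is_guest_mode(vcpu) &&
	    kvm_x86_ops.nested_ops->has_events &&
	    kvm_x86_ops.nested_ops->has_events(vcpu))
		return true;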
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dcc675d4e44b..8aeacbc2bff9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10815,6 +10815,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 		return 1;
 	}
 
+	/*
+	 * Evaluate nested events before exiting the halted state. This allows
+	 * the halt state to be recorded properly in the VMCS12's activity
+	 * state field (AMD does not have a similar field and a VM-Exit always
+	 * causes a spurious wakeup from HLT).
+	 */
+	if (is_guest_mode(vcpu)) {
+		if (kvm_check_nested_events(vcpu) < 0)
+			return 0;
+	}
+
 	if (kvm_apic_accept_events(vcpu) < 0)
 		return 0;
 	switch(vcpu->arch.mp_state) {
@@ -10837,9 +10848,6 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 
 static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu))
-		kvm_check_nested_events(vcpu);
-
 	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
 		!vcpu->arch.apf.halted);
 }