Message ID | a3989e7ff9cca77f680f9bdfbaee52b707693221.camel@infradead.org |
---|---|
State | New |
From | David Woodhouse <dwmw2@infradead.org> |
Date | Fri, 29 Sep 2023 12:36:11 +0100 |
Series | [v2] KVM: x86: Use fast path for Xen timer delivery |
Commit Message
David Woodhouse
Sept. 29, 2023, 11:36 a.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

Most of the time there's no need to kick the vCPU and deliver the timer
event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
directly from the timer callback, and only fall back to the slow path
when it's necessary to do so.

This gives a significant improvement in timer latency testing (using
nanosleep() for various periods and then measuring the actual time
elapsed).

However, there was a reason¹ the fast path was dropped when this support
was first added. The current code holds vcpu->mutex for all operations on
the kvm->arch.timer_expires field, and the fast path introduces potential
race conditions. So... ensure the hrtimer is *cancelled* before making
changes in kvm_xen_start_timer(), and also when reading the values out
for KVM_XEN_VCPU_ATTR_TYPE_TIMER.

Add some sanity checks to ensure the truth of the claim that all the
other code paths are run with the vcpu loaded. And use hrtimer_cancel()
directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
check in kvm_xen_stop_timer().

¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---

 • v2: Remember, and deal with, those races.

 arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 6 deletions(-)
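The latency test itself is not part of the patch. A minimal userspace sketch
of the kind of measurement described above (nanosleep() for a given period,
then compare against the actual elapsed time) might look like the following;
this is a hypothetical harness, not the author's actual test:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
        /* Assumed set of sleep periods; the posting doesn't specify them. */
        const long periods_ns[] = { 100000, 1000000, 10000000 };

        for (size_t i = 0; i < sizeof(periods_ns) / sizeof(periods_ns[0]); i++) {
                struct timespec req = { 0, periods_ns[i] }, t0, t1;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                nanosleep(&req, NULL);
                clock_gettime(CLOCK_MONOTONIC, &t1);

                /* Overshoot beyond the requested period approximates the
                 * guest-visible timer delivery latency. */
                int64_t elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
                                  (t1.tv_nsec - t0.tv_nsec);
                printf("requested %ld ns, elapsed %lld ns, overshoot %lld ns\n",
                       periods_ns[i], (long long)elapsed,
                       (long long)(elapsed - periods_ns[i]));
        }
        return 0;
}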
Comments
On 29/09/2023 12:36, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.
>
> This gives a significant improvement in timer latency testing (using
> nanosleep() for various periods and then measuring the actual time
> elapsed).
>
> However, there was a reason¹ the fast path was dropped when this support
> was first added. The current code holds vcpu->mutex for all operations on
> the kvm->arch.timer_expires field, and the fast path introduces potential
> race conditions. So... ensure the hrtimer is *cancelled* before making
> changes in kvm_xen_start_timer(), and also when reading the values out
> for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
>
> Add some sanity checks to ensure the truth of the claim that all the
> other code paths are run with the vcpu loaded. And use hrtimer_cancel()
> directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
> check in kvm_xen_stop_timer().
>
> ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>
>  • v2: Remember, and deal with, those races.
>
>  arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 6 deletions(-)
>

Reviewed-by: Paul Durrant <paul@xen.org>
On Fri, Sep 29, 2023, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.

It'd be helpful for non-Xen folks to explain "when it's necessary". IIUC, the
only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.

> This gives a significant improvement in timer latency testing (using
> nanosleep() for various periods and then measuring the actual time
> elapsed).
>
> However, there was a reason¹ the fast path was dropped when this support

Heh, please use [1] or [*] like everyone else. I can barely see that tiny
little ¹.

> was first added. The current code holds vcpu->mutex for all operations on
> the kvm->arch.timer_expires field, and the fast path introduces potential
> race conditions. So... ensure the hrtimer is *cancelled* before making
> changes in kvm_xen_start_timer(), and also when reading the values out
> for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
>
> Add some sanity checks to ensure the truth of the claim that all the
> other code paths are run with the vcpu loaded. And use hrtimer_cancel()
> directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
> check in kvm_xen_stop_timer().

This should really be at least 2 patches, probably 3:

  1. add the assertions and the destroy_vcpu() change
  2. cancel the timer before starting a new one or reading the value from
     userspace
  3. use the fastpath delivery from the timer callback

> ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>
>  • v2: Remember, and deal with, those races.
>
>  arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index fb1110b2385a..9d0d602a2466 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -117,6 +117,8 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>
>  void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
>  {
> +    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> +
>      if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
>          struct kvm_xen_evtchn e;
>
> @@ -136,18 +138,41 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>  {
>      struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
>                           arch.xen.timer);
> +    struct kvm_xen_evtchn e;
> +    int rc;
> +
>      if (atomic_read(&vcpu->arch.xen.timer_pending))
>          return HRTIMER_NORESTART;
>
> -    atomic_inc(&vcpu->arch.xen.timer_pending);
> -    kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> -    kvm_vcpu_kick(vcpu);
> +    e.vcpu_id = vcpu->vcpu_id;
> +    e.vcpu_idx = vcpu->vcpu_idx;
> +    e.port = vcpu->arch.xen.timer_virq;
> +    e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> +
> +    rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> +    if (rc == -EWOULDBLOCK) {
> +        atomic_inc(&vcpu->arch.xen.timer_pending);
> +        kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> +        kvm_vcpu_kick(vcpu);
> +    } else {
> +        vcpu->arch.xen.timer_expires = 0;

Eww. IIUC, timer_expires isn't cleared so that the pending IRQ is captured by
kvm_xen_vcpu_get_attr(), i.e. because xen.timer_pending itself isn't migrated.

> +    }
>
>      return HRTIMER_NORESTART;
>  }
>
>  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
>  {
> +    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> +
> +    /*
> +     * Avoid races with the old timer firing. Checking timer_expires
> +     * to avoid calling hrtimer_cancel() will only have false positives
> +     * so is fine.
> +     */
> +    if (vcpu->arch.xen.timer_expires)
> +        hrtimer_cancel(&vcpu->arch.xen.timer);
> +
>      atomic_set(&vcpu->arch.xen.timer_pending, 0);
>      vcpu->arch.xen.timer_expires = guest_abs;
>
> @@ -163,6 +188,8 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
>
>  static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
>  {
> +    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> +
>      hrtimer_cancel(&vcpu->arch.xen.timer);
>      vcpu->arch.xen.timer_expires = 0;
>      atomic_set(&vcpu->arch.xen.timer_pending, 0);
> @@ -1019,13 +1046,38 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>          r = 0;
>          break;
>
> -    case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> +    case KVM_XEN_VCPU_ATTR_TYPE_TIMER: {
> +        bool pending = false;
> +
> +        /*
> +         * Ensure a consistent snapshot of state is captures, with a

s/captures/captured

> +         * timer either being pending, or fully delivered. Not still

Kind of a nit, but IMO "fully delivered" isn't accurate, at least not without
more clarification. I would consider "fully delivered" to mean that the IRQ
has caused the guest to start executing its IRQ handler. Maybe "fully queued
in the event channel"?

> +         * lurking in the timer_pending flag for deferred delivery.
> +         */
> +        if (vcpu->arch.xen.timer_expires) {
> +            pending = hrtimer_cancel(&vcpu->arch.xen.timer);
> +            kvm_xen_inject_timer_irqs(vcpu);

If the goal is to not have pending timers, then kvm_xen_inject_timer_irqs()
should be called unconditionally after canceling the hrtimer, no?

> +        }
> +
>          data->u.timer.port = vcpu->arch.xen.timer_virq;
>          data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
>          data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> +
> +        /*
> +         * The timer may be delivered immediately, while the returned
> +         * state causes it to be set up and delivered again on the

Similar to the "fully delivered" stuff above, maybe s/timer/hrtimer to make it
a bit more clear that the host hrtimer can fire twice, but it won't ever
result in two timer IRQs being delivered from the guest's perspective.

> +         * destination system after migration. That's fine, as the
> +         * guest will not even have had a chance to run and process
> +         * the interrupt by that point, so it won't even notice the
> +         * duplicate IRQ.

Rather than say "so it won't even notice the duplicate IRQ", maybe be more
explicit and say "so the queued IRQ is guaranteed to be coalesced in the event
channel and/or guest local APIC". Because I read the whole "delivered IRQ"
stuff as there really being two injected IRQs into the guest.

FWIW, this is all really gross, but I agree that even if the queued IRQ makes
it all the way to the guest's APIC, the IRQs will still be coalesced in the
end.

> +         */
> +        if (pending)
> +            hrtimer_start_expires(&vcpu->arch.xen.timer,
> +                          HRTIMER_MODE_ABS_HARD);
> +
>          r = 0;
>          break;
> -
> +    }
>      case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR:
>          data->u.vector = vcpu->arch.xen.upcall_vector;
>          r = 0;
> @@ -2085,7 +2137,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
>  void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
>  {
>      if (kvm_xen_timer_enabled(vcpu))

IIUC, this can more precisely be:

	if (vcpu->arch.xen.timer_expires)
		hrtimer_cancel(&vcpu->arch.xen.timer);

at which point it might make sense to add a small helper

	static void kvm_xen_cancel_timer(struct kvm_vcpu *vcpu)
	{
		if (vcpu->arch.xen.timer_expires)
			hrtimer_cancel(&vcpu->arch.xen.timer);
	}

to share code with

> -        kvm_xen_stop_timer(vcpu);
> +        hrtimer_cancel(&vcpu->arch.xen.timer);
>
>      kvm_gpc_deactivate(&vcpu->arch.xen.runstate_cache);
>      kvm_gpc_deactivate(&vcpu->arch.xen.runstate2_cache);
> --
> 2.40.1
>
>
On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> On Fri, Sep 29, 2023, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > Most of the time there's no need to kick the vCPU and deliver the timer
> > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > directly from the timer callback, and only fall back to the slow path
> > when it's necessary to do so.
>
> It'd be helpful for non-Xen folks to explain "when it's necessary". IIUC, the
> only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.

That's an implementation detail. Like all of the fast path functions that can
be called from kvm_arch_set_irq_inatomic(), it has its own criteria for why it
might return -EWOULDBLOCK or not. Those are *its* business. And in fact one of
Paul's current patches is tweaking them subtly, but that isn't relevant here.

(But yes, you are broadly correct in your understanding.)

> > This gives a significant improvement in timer latency testing (using
> > nanosleep() for various periods and then measuring the actual time
> > elapsed).
> >
> > However, there was a reason¹ the fast path was dropped when this support
>
> Heh, please use [1] or [*] like everyone else. I can barely see that tiny little ¹.

Isn't that the *point*? The reference to the footnote isn't supposed to
detract from the flow of the main text. It's exactly how you'll see it when
typeset properly.

I've always assumed the people using [1] or [*] just haven't yet realised that
it's the 21st century and we are no longer limited to 7-bit ASCII. Or haven't
worked out how to type anything but ASCII.

> > was first added. The current code holds vcpu->mutex for all operations on
> > the kvm->arch.timer_expires field, and the fast path introduces potential
> > race conditions. So... ensure the hrtimer is *cancelled* before making
> > changes in kvm_xen_start_timer(), and also when reading the values out
> > for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
> >
> > Add some sanity checks to ensure the truth of the claim that all the
> > other code paths are run with the vcpu loaded. And use hrtimer_cancel()
> > directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
> > check in kvm_xen_stop_timer().
>
> This should really be at least 2 patches, probably 3:
>
>   1. add the assertions and the destroy_vcpu() change
>   2. cancel the timer before starting a new one or reading the value from
>      userspace
>   3. use the fastpath delivery from the timer callback

Hm, I think that's borderline. I pondered it and came to the opposite
conclusion. Cancelling the timer wasn't needed without the fastpath delivery;
it isn't a separate fix. You could consider it a preparatory patch I
suppose... but I didn't. It's not like adding the fast path in itself is
complex enough that the other parts need to be broken out.

> > ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/
> >
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >
> >  • v2: Remember, and deal with, those races.
> >
> >  arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 58 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index fb1110b2385a..9d0d602a2466 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -117,6 +117,8 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
> >
> >  void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
> >  {
> > +    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> >      if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
> >          struct kvm_xen_evtchn e;
> >
> > @@ -136,18 +138,41 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
> >  {
> >      struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
> >                           arch.xen.timer);
> > +    struct kvm_xen_evtchn e;
> > +    int rc;
> > +
> >      if (atomic_read(&vcpu->arch.xen.timer_pending))
> >          return HRTIMER_NORESTART;
> >
> > -    atomic_inc(&vcpu->arch.xen.timer_pending);
> > -    kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> > -    kvm_vcpu_kick(vcpu);
> > +    e.vcpu_id = vcpu->vcpu_id;
> > +    e.vcpu_idx = vcpu->vcpu_idx;
> > +    e.port = vcpu->arch.xen.timer_virq;
> > +    e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> > +
> > +    rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> > +    if (rc == -EWOULDBLOCK) {
> > +        atomic_inc(&vcpu->arch.xen.timer_pending);
> > +        kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> > +        kvm_vcpu_kick(vcpu);
> > +    } else {
> > +        vcpu->arch.xen.timer_expires = 0;
>
> Eww. IIUC, timer_expires isn't cleared so that the pending IRQ is captured by
> kvm_xen_vcpu_get_attr(), i.e. because xen.timer_pending itself isn't migrated.

-EPARSE. Er... yes? The timer_expires field is non-zero when there's a pending
timer. When the timer interrupt has fired, the timer is no longer pending and
the timer_expires field is set to zero.

The xen.timer_pending field is indeed not migrated. We *flush* it in
kvm_xen_vcpu_get_attr(). It's now only used for the *deferral* to the slow
path.

So... yes, timer_expires *wasn't* previously cleared, because the timer IRQ
hadn't been delivered yet, and yes that was to avoid races with
kvm_xen_vcpu_get_attr(). Partly because the timer_pending field was internal
and not migrated.

As far as userspace is concerned, the timer has either fired, or it has not.
Implementation details of the timer_pending field — and the per-vcpu
evtchn_pending_sel — are not exposed.

> > +    }
> >
> >      return HRTIMER_NORESTART;
> >  }
> >
> >  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> >  {
> > +    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> > +    /*
> > +     * Avoid races with the old timer firing. Checking timer_expires
> > +     * to avoid calling hrtimer_cancel() will only have false positives
> > +     * so is fine.
> > +     */
> > +    if (vcpu->arch.xen.timer_expires)
> > +        hrtimer_cancel(&vcpu->arch.xen.timer);
> > +
> >      atomic_set(&vcpu->arch.xen.timer_pending, 0);
> >      vcpu->arch.xen.timer_expires = guest_abs;
> >
> > @@ -163,6 +188,8 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
> >
> >  static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
> >  {
> > +    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> >      hrtimer_cancel(&vcpu->arch.xen.timer);
> >      vcpu->arch.xen.timer_expires = 0;
> >      atomic_set(&vcpu->arch.xen.timer_pending, 0);
> > @@ -1019,13 +1046,38 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> >      r = 0;
> >      break;
> >
> > -    case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> > +    case KVM_XEN_VCPU_ATTR_TYPE_TIMER: {
> > +        bool pending = false;
> > +
> > +        /*
> > +         * Ensure a consistent snapshot of state is captures, with a
>
> s/captures/captured

Ack.

> > +         * timer either being pending, or fully delivered. Not still
>
> Kind of a nit, but IMO "fully delivered" isn't accurate, at least not without
> more clarification. I would consider "fully delivered" to mean that the IRQ
> has caused the guest to start executing its IRQ handler. Maybe "fully queued
> in the event channel"?

OK, I'll reword.

> > +         * lurking in the timer_pending flag for deferred delivery.
> > +         */
> > +        if (vcpu->arch.xen.timer_expires) {
> > +            pending = hrtimer_cancel(&vcpu->arch.xen.timer);
> > +            kvm_xen_inject_timer_irqs(vcpu);
>
> If the goal is to not have pending timers, then kvm_xen_inject_timer_irqs()
> should be called unconditionally after canceling the hrtimer, no?

It *is* called unconditionally after cancelling the hrtimer. Or did you mean
unconditionally whether we cancel the hrtimer or not? The comment explains the
logic for not needing that.

If *either* the timer is still active, *or* it's already fired and has taken
the slow path and set the timer_pending flag, then timer_expires won't have
been zeroed yet. So the race conditions inherent in doing this conditional on
vcpu->arch.xen.timer_expires are fine because there are only false positives
(which cause us to cancel a timer, and try to inject a pending IRQ, which
wasn't running and wasn't pending respectively).

Sounds like I need to expand that comment?

> > +        }
> > +
> >          data->u.timer.port = vcpu->arch.xen.timer_virq;
> >          data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> >          data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> > +
> > +        /*
> > +         * The timer may be delivered immediately, while the returned
> > +         * state causes it to be set up and delivered again on the
>
> Similar to the "fully delivered" stuff above, maybe s/timer/hrtimer to make it
> a bit more clear that the host hrtimer can fire twice, but it won't ever
> result in two timer IRQs being delivered from the guest's perspective.

Ack.

> > +         * destination system after migration. That's fine, as the
> > +         * guest will not even have had a chance to run and process
> > +         * the interrupt by that point, so it won't even notice the
> > +         * duplicate IRQ.
>
> Rather than say "so it won't even notice the duplicate IRQ", maybe be more
> explicit and say "so the queued IRQ is guaranteed to be coalesced in the
> event channel and/or guest local APIC". Because I read the whole "delivered
> IRQ" stuff as there really being two injected IRQs into the guest.
>
> FWIW, this is all really gross, but I agree that even if the queued IRQ makes
> it all the way to the guest's APIC, the IRQs will still be coalesced in the
> end.
>

As discussed before, we *could* have made fetching the timer attr also *pause*
the timer. It just seemed like extra complexity for no good reason. The shared
info page containing the event channel bitmap has to be migrated after the
vCPU state anyway.

> > +         */
> > +        if (pending)
> > +            hrtimer_start_expires(&vcpu->arch.xen.timer,
> > +                          HRTIMER_MODE_ABS_HARD);
> > +
> >          r = 0;
> >          break;
> > -
> > +    }
> >      case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR:
> >          data->u.vector = vcpu->arch.xen.upcall_vector;
> >          r = 0;
> > @@ -2085,7 +2137,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
> >  void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
> >  {
> >      if (kvm_xen_timer_enabled(vcpu))
>
> IIUC, this can more precisely be:
>
> 	if (vcpu->arch.xen.timer_expires)
> 		hrtimer_cancel(&vcpu->arch.xen.timer);
>
> at which point it might make sense to add a small helper
>
> 	static void kvm_xen_cancel_timer(struct kvm_vcpu *vcpu)
> 	{
> 		if (vcpu->arch.xen.timer_expires)
> 			hrtimer_cancel(&vcpu->arch.xen.timer);
> 	}
>
> to share code with
>
> > -        kvm_xen_stop_timer(vcpu);
> > +        hrtimer_cancel(&vcpu->arch.xen.timer);
> >
> >      kvm_gpc_deactivate(&vcpu->arch.xen.runstate_cache);
> >      kvm_gpc_deactivate(&vcpu->arch.xen.runstate2_cache);

In fact if I make the helper return a bool, I think I can use it *three*
times. At a cost of having a third function kvm_xen_cancel_timer() alongside
the existing kvm_xen_start_timer() and kvm_xen_stop_timer(), and those *three*
now make for a slightly confusing set.

I'll take a look and see how it makes me feel.
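For reference, a bool-returning version of the helper floated above might look
like the following. This is a sketch of the idea under discussion, not code
from this patch: the name kvm_xen_cancel_timer() is Sean's suggestion, and the
bool return is David's refinement.

static bool kvm_xen_cancel_timer(struct kvm_vcpu *vcpu)
{
	/*
	 * Checking timer_expires before hrtimer_cancel() can only give
	 * false positives (see the comment in kvm_xen_start_timer()), so
	 * a harmless extra cancel is the worst case. Returns true if the
	 * hrtimer was still queued, i.e. had not yet fired.
	 */
	if (vcpu->arch.xen.timer_expires)
		return hrtimer_cancel(&vcpu->arch.xen.timer) != 0;

	return false;
}

The three prospective call sites would be kvm_xen_start_timer(), the
KVM_XEN_VCPU_ATTR_TYPE_TIMER read in kvm_xen_vcpu_get_attr() (using the return
value to decide whether to re-arm the hrtimer), and kvm_xen_destroy_vcpu().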
On Fri, Sep 29, 2023, David Woodhouse wrote:
> On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > >
> > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > directly from the timer callback, and only fall back to the slow path
> > > when it's necessary to do so.
> >
> > It'd be helpful for non-Xen folks to explain "when it's necessary". IIUC, the
> > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
>
> That's an implementation detail.

And? The target audience of changelogs are almost always people that care
about the implementation.

> Like all of the fast path functions that can be called from
> kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> -EWOULDBLOCK or not. Those are *its* business.

And all of the KVM code is the business of the people who contribute to the
kernel, now and in the future. Yeah, there's a small chance that a detailed
changelog can become stale if the patch races with some other in-flight
change, but even *that* is a useful data point. E.g. if Paul's patches somehow
broke/degraded this code, then knowing that what the author (you)
intended/observed didn't match reality when the patch was applied would be
extremely useful information for whoever encountered the hypothetical
breakage.

> And in fact one of Paul's current patches is tweaking them subtly, but that
> isn't relevant here. (But yes, you are broadly correct in your
> understanding.)
>
> > > This gives a significant improvement in timer latency testing (using
> > > nanosleep() for various periods and then measuring the actual time
> > > elapsed).
> > >
> > > However, there was a reason¹ the fast path was dropped when this support
> >
> > Heh, please use [1] or [*] like everyone else. I can barely see that tiny little ¹.
>
> Isn't that the *point*? The reference to the footnote isn't supposed to
> detract from the flow of the main text. It's exactly how you'll see it
> when typeset properly.

Footnotes that are "typeset properly" have the entire footnote in a different
font+size. A tiny number next to normal sized text just looks weird to me.

And I often do a "reverse lookup" when I get to footnotes that are links, e.g.
to gauge whether or not it's worth my time to follow the link. Trying to find
the tiny ¹ via a quick visual scan is an exercise in frustration, at least for
the monospace font I use for reading mail, e.g. it's much more readable on my
end in an editor using a different font.

Which is a big benefit to sticking to the old and kludgy ASCII: it provides a
fairly consistent experience regardless of what client/font/etc each reader is
using. I'm not completely against using unicode characters, e.g. for names
with characters not found in the Latin alphabet, but for code and things like
this, IMO simpler is better.

> I've always assumed the people using [1] or [*] just haven't yet realised
> that it's the 21st century and we are no longer limited to 7-bit ASCII. Or
> haven't worked out how to type anything but ASCII.

Please don't devolve into ad hominem attacks against other reviewers and
contributors. If you want to argue that using footnote notation unicode is
superior in some way, then by all means, present your arguments.
On Mon, 2023-10-02 at 10:41 -0700, Sean Christopherson wrote:
> On Fri, Sep 29, 2023, David Woodhouse wrote:
> > On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > >
> > > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > > directly from the timer callback, and only fall back to the slow path
> > > > when it's necessary to do so.
> > >
> > > It'd be helpful for non-Xen folks to explain "when it's necessary". IIUC, the
> > > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
> >
> > That's an implementation detail.
>
> And? The target audience of changelogs are almost always people that care
> about the implementation.
>
> > Like all of the fast path functions that can be called from
> > kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> > -EWOULDBLOCK or not. Those are *its* business.
>
> And all of the KVM code is the business of the people who contribute to the
> kernel, now and in the future. Yeah, there's a small chance that a detailed
> changelog can become stale if the patch races with some other in-flight
> change, but even *that* is a useful data point. E.g. if Paul's patches somehow
> broke/degraded this code, then knowing that what the author (you)
> intended/observed didn't match reality when the patch was applied would be
> extremely useful information for whoever encountered the hypothetical
> breakage.

Fair enough, but on this occasion it truly doesn't matter. It has nothing to
do with the implementation of *this* patch. This code makes no assumptions and
has no dependency on *when* that fast path might return -EWOULDBLOCK.
Sometimes it does, sometimes it doesn't. This code just doesn't care one iota.

If this code had *dependencies* on the precise behaviour of
kvm_xen_set_evtchn_fast() that we needed to reason about, then sure, I'd have
written those explicitly into the commit comment *and* tried to find some way
of enforcing them with runtime warnings etc.

But it doesn't. So I am no more inclined to document the precise behaviour of
kvm_xen_set_evtchn_fast() in a patch which just happens to call it, than I am
inclined to document hrtimer_cancel() or any other function called from the
new code :)

> > And in fact one of Paul's current patches is tweaking them subtly, but that
> > isn't relevant here. (But yes, you are broadly correct in your
> > understanding.)
> >
> > > > This gives a significant improvement in timer latency testing (using
> > > > nanosleep() for various periods and then measuring the actual time
> > > > elapsed).
> > > >
> > > > However, there was a reason¹ the fast path was dropped when this support
> > >
> > > Heh, please use [1] or [*] like everyone else. I can barely see that tiny little ¹.
> >
> > Isn't that the *point*? The reference to the footnote isn't supposed to
> > detract from the flow of the main text. It's exactly how you'll see it
> > when typeset properly.
>
> Footnotes that are "typeset properly" have the entire footnote in a different
> font+size. A tiny number next to normal sized text just looks weird to me.
>
> And I often do a "reverse lookup" when I get to footnotes that are links, e.g.
> to gauge whether or not it's worth my time to follow the link. Trying to find
> the tiny ¹ via a quick visual scan is an exercise in frustration, at least for
> the monospace font I use for reading mail, e.g. it's much more readable on my
> end in an editor using a different font.
>
> Which is a big benefit to sticking to the old and kludgy ASCII: it provides a
> fairly consistent experience regardless of what client/font/etc each reader is
> using. I'm not completely against using unicode characters, e.g. for names
> with characters not found in the Latin alphabet, but for code and things like
> this, IMO simpler is better.
>
> > I've always assumed the people using [1] or [*] just haven't yet realised
> > that it's the 21st century and we are no longer limited to 7-bit ASCII. Or
> > haven't worked out how to type anything but ASCII.
>
> Please don't devolve into ad hominem attacks against other reviewers and
> contributors. If you want to argue that using footnote notation unicode is
> superior in some way, then by all means, present your arguments.

Hey, you started the logical fallacies with the ad populum when you said
"everyone else" :)

Not that that was true; there are examples of ¹ being used in the kernel
changelog going back decades.

I can just drop the footnote; there's not a huge amount of benefit in
referring back to the old mail thread anyway. The gist of it was covered in
the commit comment.
On Mon, Oct 02, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 10:41 -0700, Sean Christopherson wrote:
> > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > >
> > > > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > > > directly from the timer callback, and only fall back to the slow path
> > > > > when it's necessary to do so.
> > > >
> > > > It'd be helpful for non-Xen folks to explain "when it's necessary". IIUC, the
> > > > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
> > >
> > > That's an implementation detail.
> >
> > And? The target audience of changelogs are almost always people that care
> > about the implementation.
> >
> > > Like all of the fast path functions that can be called from
> > > kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> > > -EWOULDBLOCK or not. Those are *its* business.
> >
> > And all of the KVM code is the business of the people who contribute to the
> > kernel, now and in the future. Yeah, there's a small chance that a detailed
> > changelog can become stale if the patch races with some other in-flight
> > change, but even *that* is a useful data point. E.g. if Paul's patches somehow
> > broke/degraded this code, then knowing that what the author (you)
> > intended/observed didn't match reality when the patch was applied would be
> > extremely useful information for whoever encountered the hypothetical
> > breakage.
>
> Fair enough, but on this occasion it truly doesn't matter. It has
> nothing to do with the implementation of *this* patch. This code makes
> no assumptions and has no dependency on *when* that fast path might
> return -EWOULDBLOCK. Sometimes it does, sometimes it doesn't. This code
> just doesn't care one iota.
>
> If this code had *dependencies* on the precise behaviour of
> kvm_xen_set_evtchn_fast() that we needed to reason about, then sure,
> I'd have written those explicitly into the commit comment *and* tried
> to find some way of enforcing them with runtime warnings etc.
>
> But it doesn't. So I am no more inclined to document the precise
> behaviour of kvm_xen_set_evtchn_fast() in a patch which just happens to
> call it, than I am inclined to document hrtimer_cancel() or any other
> function called from the new code :)

Just because some bit of code doesn't care/differentiate doesn't mean the
behavior of said code is correct. I agree that adding a comment to explain the
gory details is unnecessary and would lead to stale code. But changelogs
essentially capture a single point in time, and a big role of the changelog is
to help reviewers and readers understand (a) the *intent* of the change and
(b) whether or not that change is correct.

E.g. there's an assumption that -EWOULDBLOCK is the only non-zero return code
where the correct response is to go down the slow path.

I'm not asking to spell out every single condition, I'm just asking for
clarification on what the intended behavior is, e.g.

  Use kvm_xen_set_evtchn_fast() directly from the timer callback, and fall
  back to the slow path if the event is valid but fast delivery isn't
  possible, which currently can only happen if delivery needs to block,
  e.g. because the gfn=>pfn cache is invalid or stale.

instead of simply saying "when it's necessary to do so" and leaving it up to
the reader to figure what _they_ think that means, which might not always
align with what the author actually meant.

> > > And in fact one of Paul's current patches is tweaking them subtly, but that
> > > isn't relevant here. (But yes, you are broadly correct in your
> > > understanding.)
> > >
> > > > > This gives a significant improvement in timer latency testing (using
> > > > > nanosleep() for various periods and then measuring the actual time
> > > > > elapsed).
> > > > >
> > > > > However, there was a reason¹ the fast path was dropped when this support
> > > >
> > > > Heh, please use [1] or [*] like everyone else. I can barely see that tiny little ¹.
> > >
> > > Isn't that the *point*? The reference to the footnote isn't supposed to
> > > detract from the flow of the main text. It's exactly how you'll see it
> > > when typeset properly.
> >
> > Footnotes that are "typeset properly" have the entire footnote in a different
> > font+size. A tiny number next to normal sized text just looks weird to me.
> >
> > And I often do a "reverse lookup" when I get to footnotes that are links, e.g.
> > to gauge whether or not it's worth my time to follow the link. Trying to find
> > the tiny ¹ via a quick visual scan is an exercise in frustration, at least for
> > the monospace font I use for reading mail, e.g. it's much more readable on my
> > end in an editor using a different font.
> >
> > Which is a big benefit to sticking to the old and kludgy ASCII: it provides a
> > fairly consistent experience regardless of what client/font/etc each reader is
> > using. I'm not completely against using unicode characters, e.g. for names
> > with characters not found in the Latin alphabet, but for code and things like
> > this, IMO simpler is better.
> >
> > > I've always assumed the people using [1] or [*] just haven't yet realised
> > > that it's the 21st century and we are no longer limited to 7-bit ASCII. Or
> > > haven't worked out how to type anything but ASCII.
> >
> > Please don't devolve into ad hominem attacks against other reviewers and
> > contributors. If you want to argue that using footnote notation unicode is
> > superior in some way, then by all means, present your arguments.
>
> Hey, you started the logical fallacies with the ad populum when you
> said "everyone else" :)
>
> Not that that was true; there are examples of ¹ being used in the
> kernel changelog going back decades.

LOL, fine, "almost everyone else".
On Mon, 2023-10-02 at 11:45 -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
> > On Mon, 2023-10-02 at 10:41 -0700, Sean Christopherson wrote:
> > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > > > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > > >
> > > > > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > > > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > > > > directly from the timer callback, and only fall back to the slow path
> > > > > > when it's necessary to do so.
> > > > >
> > > > > It'd be helpful for non-Xen folks to explain "when it's necessary". IIUC, the
> > > > > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
> > > >
> > > > That's an implementation detail.
> > >
> > > And? The target audience of changelogs are almost always people that care
> > > about the implementation.
> > >
> > > > Like all of the fast path functions that can be called from
> > > > kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> > > > -EWOULDBLOCK or not. Those are *its* business.
> > >
> > > And all of the KVM code is the business of the people who contribute to the
> > > kernel, now and in the future. Yeah, there's a small chance that a detailed
> > > changelog can become stale if the patch races with some other in-flight
> > > change, but even *that* is a useful data point. E.g. if Paul's patches somehow
> > > broke/degraded this code, then knowing that what the author (you)
> > > intended/observed didn't match reality when the patch was applied would be
> > > extremely useful information for whoever encountered the hypothetical
> > > breakage.
> >
> > Fair enough, but on this occasion it truly doesn't matter. It has
> > nothing to do with the implementation of *this* patch. This code makes
> > no assumptions and has no dependency on *when* that fast path might
> > return -EWOULDBLOCK. Sometimes it does, sometimes it doesn't. This code
> > just doesn't care one iota.
> >
> > If this code had *dependencies* on the precise behaviour of
> > kvm_xen_set_evtchn_fast() that we needed to reason about, then sure,
> > I'd have written those explicitly into the commit comment *and* tried
> > to find some way of enforcing them with runtime warnings etc.
> >
> > But it doesn't. So I am no more inclined to document the precise
> > behaviour of kvm_xen_set_evtchn_fast() in a patch which just happens to
> > call it, than I am inclined to document hrtimer_cancel() or any other
> > function called from the new code :)
>
> Just because some bit of code doesn't care/differentiate doesn't mean the
> behavior of said code is correct. I agree that adding a comment to explain
> the gory details is unnecessary and would lead to stale code. But changelogs
> essentially capture a single point in time, and a big role of the changelog
> is to help reviewers and readers understand (a) the *intent* of the change
> and (b) whether or not that change is correct.
>
> E.g. there's an assumption that -EWOULDBLOCK is the only non-zero return code
> where the correct response is to go down the slow path.
>
> I'm not asking to spell out every single condition, I'm just asking for
> clarification on what the intended behavior is, e.g.
>
>   Use kvm_xen_set_evtchn_fast() directly from the timer callback, and fall
>   back to the slow path if the event is valid but fast delivery isn't
>   possible, which currently can only happen if delivery needs to block,
>   e.g. because the gfn=>pfn cache is invalid or stale.
>
> instead of simply saying "when it's necessary to do so" and leaving it up to
> the reader to figure what _they_ think that means, which might not always
> align with what the author actually meant.

Fair enough. There's certainly scope for something along the lines of

+	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
+	if (rc != -EWOULDBLOCK) {

	/*
	 * If kvm_xen_set_evtchn_fast() returned -EWOULDBLOCK, then set the
	 * timer_pending flag and kick the vCPU, to defer delivery of the
	 * event channel to a context which can sleep. If it fails for any
	 * other reasons, just let it fail silently. The slow path fails
	 * silently too; a warning in that case may be guest triggerable,
	 * should never happen anyway, and guests are generally going to
	 * *notice* timers going missing.
	 */

+		vcpu->arch.xen.timer_expires = 0;
+		return HRTIMER_NORESTART;
+	}

That's documenting *this* code, not the function it happens to call. It's more
verbose than I would normally have bothered to be, but I'm all for improving
the level of commenting in our code as long as it's adding value.
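Folded into the callback, that proposal might read roughly as follows. This is
a sketch of the wording under discussion, not the hunk that was ultimately
applied:

	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
	if (rc != -EWOULDBLOCK) {
		/*
		 * Fast delivery either succeeded or failed for some reason
		 * other than needing to sleep; let any other failure fail
		 * silently, as the slow path does. A warning here could be
		 * guest-triggerable, and a guest will *notice* a missing
		 * timer anyway.
		 */
		vcpu->arch.xen.timer_expires = 0;
		return HRTIMER_NORESTART;
	}

	/*
	 * -EWOULDBLOCK: set the timer_pending flag and kick the vCPU, to
	 * defer delivery of the event channel to a context which can sleep.
	 */
	atomic_inc(&vcpu->arch.xen.timer_pending);
	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
	kvm_vcpu_kick(vcpu);

	return HRTIMER_NORESTART;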
On Mon, Oct 02, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 11:45 -0700, Sean Christopherson wrote:
> > E.g. there's an assumption that -EWOULDBLOCK is the only non-zero return
> > code where the correct response is to go down the slow path.
> >
> > I'm not asking to spell out every single condition, I'm just asking for
> > clarification on what the intended behavior is, e.g.
> >
> >   Use kvm_xen_set_evtchn_fast() directly from the timer callback, and fall
> >   back to the slow path if the event is valid but fast delivery isn't
> >   possible, which currently can only happen if delivery needs to block,
> >   e.g. because the gfn=>pfn cache is invalid or stale.
> >
> > instead of simply saying "when it's necessary to do so" and leaving it up
> > to the reader to figure what _they_ think that means, which might not
> > always align with what the author actually meant.
>
> Fair enough. There's certainly scope for something along the lines of
>
> +	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> +	if (rc != -EWOULDBLOCK) {
>
> 	/*
> 	 * If kvm_xen_set_evtchn_fast() returned -EWOULDBLOCK, then set the
> 	 * timer_pending flag and kick the vCPU, to defer delivery of the
> 	 * event channel to a context which can sleep. If it fails for any
> 	 * other reasons, just let it fail silently. The slow path fails
> 	 * silently too; a warning in that case may be guest triggerable,
> 	 * should never happen anyway, and guests are generally going to
> 	 * *notice* timers going missing.
> 	 */
>
> +		vcpu->arch.xen.timer_expires = 0;
> +		return HRTIMER_NORESTART;
> +	}
>
> That's documenting *this* code, not the function it happens to call.
> It's more verbose than I would normally have bothered to be, but I'm
> all for improving the level of commenting in our code as long as it's
> adding value.

I'm completely ok with no comment, I just want something in the changelog. I'm
also not opposed to a comment, but I don't think it's necessary. I don't have
a problem with digging around code to understand the subtleties, or even the
high level "what" in many cases. What I don't like is encountering code where
*nothing* explains the author's intent. All too often I've encountered
historical code in KVM where it's not at all obvious if code does what the
author intended, e.g. if a bug was a simple goof or a completely misguided
design choice.

Holler if you plan on sending a v4 with the comment. I'm a-ok applying v3 with
a massaged changelog to fold in the gist of the comment.
On Tue, 2023-10-03 at 09:12 -0700, Sean Christopherson wrote:
>
> Holler if you plan on sending a v4 with the comment. I'm a-ok applying v3
> with a massaged changelog to fold in the gist of the comment.

Yes please. I think you have a far better understanding of precisely what it
is that you're looking for in said comment. I'm all for good comments, but on
this occasion it seems I can't see the wood for the trees, and I can't quite
work out what it is that's non-obvious.

Thanks!
Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index fb1110b2385a..9d0d602a2466 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -117,6 +117,8 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
 
 void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
 {
+    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
+
     if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
         struct kvm_xen_evtchn e;
 
@@ -136,18 +138,41 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 {
     struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
                          arch.xen.timer);
+    struct kvm_xen_evtchn e;
+    int rc;
+
     if (atomic_read(&vcpu->arch.xen.timer_pending))
         return HRTIMER_NORESTART;
 
-    atomic_inc(&vcpu->arch.xen.timer_pending);
-    kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
-    kvm_vcpu_kick(vcpu);
+    e.vcpu_id = vcpu->vcpu_id;
+    e.vcpu_idx = vcpu->vcpu_idx;
+    e.port = vcpu->arch.xen.timer_virq;
+    e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
+
+    rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
+    if (rc == -EWOULDBLOCK) {
+        atomic_inc(&vcpu->arch.xen.timer_pending);
+        kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+        kvm_vcpu_kick(vcpu);
+    } else {
+        vcpu->arch.xen.timer_expires = 0;
+    }
 
     return HRTIMER_NORESTART;
 }
 
 static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
+    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
+
+    /*
+     * Avoid races with the old timer firing. Checking timer_expires
+     * to avoid calling hrtimer_cancel() will only have false positives
+     * so is fine.
+     */
+    if (vcpu->arch.xen.timer_expires)
+        hrtimer_cancel(&vcpu->arch.xen.timer);
+
     atomic_set(&vcpu->arch.xen.timer_pending, 0);
     vcpu->arch.xen.timer_expires = guest_abs;
 
@@ -163,6 +188,8 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
 
 static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
 {
+    WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
+
     hrtimer_cancel(&vcpu->arch.xen.timer);
     vcpu->arch.xen.timer_expires = 0;
     atomic_set(&vcpu->arch.xen.timer_pending, 0);
@@ -1019,13 +1046,38 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
         r = 0;
         break;
 
-    case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
+    case KVM_XEN_VCPU_ATTR_TYPE_TIMER: {
+        bool pending = false;
+
+        /*
+         * Ensure a consistent snapshot of state is captures, with a
+         * timer either being pending, or fully delivered. Not still
+         * lurking in the timer_pending flag for deferred delivery.
+         */
+        if (vcpu->arch.xen.timer_expires) {
+            pending = hrtimer_cancel(&vcpu->arch.xen.timer);
+            kvm_xen_inject_timer_irqs(vcpu);
+        }
+
         data->u.timer.port = vcpu->arch.xen.timer_virq;
         data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
         data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+
+        /*
+         * The timer may be delivered immediately, while the returned
+         * state causes it to be set up and delivered again on the
+         * destination system after migration. That's fine, as the
+         * guest will not even have had a chance to run and process
+         * the interrupt by that point, so it won't even notice the
+         * duplicate IRQ.
+         */
+        if (pending)
+            hrtimer_start_expires(&vcpu->arch.xen.timer,
+                          HRTIMER_MODE_ABS_HARD);
+
         r = 0;
         break;
-
+    }
     case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR:
         data->u.vector = vcpu->arch.xen.upcall_vector;
         r = 0;
@@ -2085,7 +2137,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
 void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
 {
     if (kvm_xen_timer_enabled(vcpu))
-        kvm_xen_stop_timer(vcpu);
+        hrtimer_cancel(&vcpu->arch.xen.timer);
 
     kvm_gpc_deactivate(&vcpu->arch.xen.runstate_cache);
     kvm_gpc_deactivate(&vcpu->arch.xen.runstate2_cache);