Message ID | 20230608090348.414990-1-gshan@redhat.com |
---|---|
State | New |
Series | KVM: Avoid illegal stage2 mapping on invalid memory slot |
Commit Message
Gavin Shan
June 8, 2023, 9:03 a.m. UTC
We run into a guest hang in the edk2 firmware when KSM is running on
the host. The edk2 firmware waits for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during a sector-erase or buffered-write
operation. The status is returned by reading the memory region of the
pflash device, and the read request should be forwarded to QEMU and
emulated by it. Unfortunately, when the guest hang occurs, the read
request is covered by an illegal stage2 mapping: it completes with QEMU
bypassed and a wrong status is fetched.

The illegal stage2 mapping is populated by same-page merging (KSM)
at (C), even though the associated memory slot has already been marked
invalid at (B).

   CPU-A                                    CPU-B
   -----                                    -----
   ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
   kvm_vm_ioctl_set_memory_region
   kvm_set_memory_region
   __kvm_set_memory_region
   kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
     kvm_invalidate_memslot
       kvm_copy_memslot
       kvm_replace_memslot
       kvm_swap_active_memslots        (A)
       kvm_arch_flush_shadow_memslot   (B)
                                            same page merging by KSM
                                            kvm_mmu_notifier_change_pte
                                              kvm_handle_hva_range
                                                __kvm_handle_hva_range  (C)

Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping. Without the illegal stage2 mapping, the read
request for the pflash's status is forwarded to QEMU and emulated by
it, so the correct pflash status reaches the guest and breaks the
infinite wait in the edk2 firmware.
Cc: stable@vger.kernel.org # v5.13+
Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
virt/kvm/kvm_main.c | 3 +++
1 file changed, 3 insertions(+)
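
The functional change is the three-line hunk shown in the diff at the
bottom of this page. For orientation, here is the same hunk in context,
annotated with comments that are not part of the original patch:

	/* In __kvm_handle_hva_range(), inside the walk over the interval
	 * tree of memslots overlapping [range->start, range->end): */
	slot = container_of(node, struct kvm_memory_slot,
			    hva_node[slots->node_idx]);

	/* Proposed fix: skip a slot that kvm_invalidate_memslot() has
	 * marked KVM_MEMSLOT_INVALID, so that change_pte() cannot
	 * repopulate a stage2 mapping which (B) above has already torn
	 * down via kvm_arch_flush_shadow_memslot(). */
	if (slot->flags & KVM_MEMSLOT_INVALID)
		continue;

	/* Clamp the notifier range to the slot's HVA range. */
	hva_start = max(range->start, slot->userspace_addr);
	hva_end = min(range->end, slot->userspace_addr +
				  (slot->npages << PAGE_SHIFT));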
Comments
Hi Gavin,

On Thu, 08 Jun 2023 10:03:48 +0100,
Gavin Shan <gshan@redhat.com> wrote:
>
> We run into a guest hang in the edk2 firmware when KSM is running on
> the host. The edk2 firmware waits for status 0x80 from QEMU's pflash
> device (TYPE_PFLASH_CFI01) during a sector-erase or buffered-write
> operation. The status is returned by reading the memory region of the
> pflash device, and the read request should be forwarded to QEMU and
> emulated by it. Unfortunately, when the guest hang occurs, the read
> request is covered by an illegal stage2 mapping: it completes with QEMU
> bypassed and a wrong status is fetched.
>
> The illegal stage2 mapping is populated by same-page merging (KSM)
> at (C), even though the associated memory slot has already been marked
> invalid at (B).
>
>    CPU-A                                    CPU-B
>    -----                                    -----
>    ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>    kvm_vm_ioctl_set_memory_region
>    kvm_set_memory_region
>    __kvm_set_memory_region
>    kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>      kvm_invalidate_memslot
>        kvm_copy_memslot
>        kvm_replace_memslot
>        kvm_swap_active_memslots        (A)
>        kvm_arch_flush_shadow_memslot   (B)
>                                             same page merging by KSM
>                                             kvm_mmu_notifier_change_pte
>                                               kvm_handle_hva_range
>                                                 __kvm_handle_hva_range  (C)
>
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping. Without the illegal stage2 mapping, the read
> request for the pflash's status is forwarded to QEMU and emulated by
> it, so the correct pflash status reaches the guest and breaks the
> infinite wait in the edk2 firmware.

Huh, nice one :-(.

> Cc: stable@vger.kernel.org # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> Reported-by: Shuai Hu <hshuai@redhat.com>
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 479802a892d4..7f81a3a209b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  		unsigned long hva_start, hva_end;
>
>  		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +		if (slot->flags & KVM_MEMSLOT_INVALID)
> +			continue;
> +
>  		hva_start = max(range->start, slot->userspace_addr);
>  		hva_end = min(range->end, slot->userspace_addr +
>  					  (slot->npages << PAGE_SHIFT));

I don't immediately see what makes it safer. If we're not holding one
of slots_{,arch_}lock in the notifier, we can still race against the
update, can't we? I don't think holding the srcu lock helps us here.

Thanks,

	M.
Hi Marc,

[Cc Andrea/David/Peter Xu]

On 6/9/23 12:31 AM, Marc Zyngier wrote:
> On Thu, 08 Jun 2023 10:03:48 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>>
>> We run into a guest hang in the edk2 firmware when KSM is running on
>> the host. [...]
>
> Huh, nice one :-(.
>

Yeah, it's a sneaky one :)

>> [...]
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 479802a892d4..7f81a3a209b6 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>  		unsigned long hva_start, hva_end;
>>
>>  		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>> +		if (slot->flags & KVM_MEMSLOT_INVALID)
>> +			continue;
>> +
>>  		hva_start = max(range->start, slot->userspace_addr);
>>  		hva_end = min(range->end, slot->userspace_addr +
>>  					  (slot->npages << PAGE_SHIFT));
>
> I don't immediately see what makes it safer. If we're not holding one
> of slots_{,arch_}lock in the notifier, we can still race against the
> update, can't we? I don't think holding the srcu lock helps us here.
>

A dead-lock would be triggered if we held @slots_lock in
__kvm_handle_hva_range(), given a call site like the one below. There
is a similar reason why we can't hold @slots_arch_lock in the function.

  CPU-A                                  CPU-B
  -----                                  -----
  invalidate_range_start()
    kvm->mn_active_invalidate_count++
    __kvm_handle_hva_range
                                         ioctl(kvm, KVM_SET_USER_MEMORY_REGION)  // Delete
                                           kvm_vm_ioctl_set_memory_region
                                             mutex_lock(&kvm->slots_lock);
                                             __kvm_set_memory_region
                                               kvm_set_memslot(..., KVM_MR_DELETE)
                                                 kvm_invalidate_memslot
                                                   kvm_replace_memslot
                                                   kvm_swap_active_memslots
                                                   // infinite wait until
                                                   // kvm->mn_active_invalidate_count is 0
                                             mutex_unlock(&kvm->slots_lock);
  invalidate_range_end()
    __kvm_handle_hva_range
      mutex_lock(&kvm->slots_lock);      // lock taken by CPU-B
      mutex_unlock(&kvm->slots_lock);
    --kvm->mn_active_invalidate_count

change_pte() is always surrounded by invalidate_range_start() and
invalidate_range_end(). It means kvm->mn_active_invalidate_count is
always larger than zero when change_pte() is called. Under this
condition (kvm->mn_active_invalidate_count > 0), the swap between the
inactive and active memory slots by kvm_swap_active_memslots() can't
be completed. So there are two cases for a memory slot when
change_pte() is called: (a) it has been marked KVM_MEMSLOT_INVALID in
the active memory slots by kvm_invalidate_memslot(), called before
invalidate_range_start(); (b) the memory slot has been deleted from
the active memory slots. We're only concerned with (a) when the active
memory slots are iterated in __kvm_handle_hva_range().

static void kvm_mmu_notifier_change_pte(...)
{
        :
        /*
         * .change_pte() must be surrounded by .invalidate_range_{start,end}().
         * If mmu_invalidate_in_progress is zero, then no in-progress
         * invalidations, including this one, found a relevant memslot at
         * start(); rechecking memslots here is unnecessary.  Note, a false
         * positive (count elevated by a different invalidation) is sub-optimal
         * but functionally ok.
         */
        WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
        if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
                return;
        :
}

The srcu lock in __kvm_handle_hva_range() prevents the swap of the
active and inactive memory slots by kvm_swap_active_memslots(). For
this particular case, it's not relevant, because the swap between the
inactive and active memory slots has already been done once, before
invalidate_range_start() is called.

Thanks,
Gavin
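
(The "can't be completed" claim above rests on the wait loop in
kvm_swap_active_memslots(). A simplified, paraphrased sketch of that
loop, based on virt/kvm/kvm_main.c of this era; details and exact
helper names may differ slightly from any given tree:)

	static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
	{
		struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, as_id);

		/*
		 * Don't publish the swapped memslots while invalidations are
		 * in progress, otherwise the locking in invalidate_range_start()
		 * and invalidate_range_end() becomes unbalanced.  This is the
		 * "infinite wait" on CPU-B in the trace above: the swap cannot
		 * finish while kvm->mn_active_invalidate_count is elevated,
		 * e.g. by an in-flight change_pte() sequence.
		 */
		spin_lock(&kvm->mn_invalidate_lock);
		prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
		while (kvm->mn_active_invalidate_count) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			spin_unlock(&kvm->mn_invalidate_lock);
			schedule();
			spin_lock(&kvm->mn_invalidate_lock);
		}
		finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
		rcu_assign_pointer(kvm->memslots[as_id], slots);
		spin_unlock(&kvm->mn_invalidate_lock);

		/*
		 * Wait for all SRCU readers of the old memslots to finish
		 * before the update is considered complete; this is the
		 * synchronize_srcu_expedited() referred to in the next
		 * message.
		 */
		synchronize_srcu_expedited(&kvm->srcu);
		...
	}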
On Fri, 09 Jun 2023 00:17:34 +0100,
Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Marc,
>
> [Cc Andrea/David/Peter Xu]
>
> On 6/9/23 12:31 AM, Marc Zyngier wrote:
> > On Thu, 08 Jun 2023 10:03:48 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> >>
> >> We run into a guest hang in the edk2 firmware when KSM is running on
> >> the host. [...]
> >
> > I don't immediately see what makes it safer. If we're not holding one
> > of slots_{,arch_}lock in the notifier, we can still race against the
> > update, can't we? I don't think holding the srcu lock helps us here.

[...]

> change_pte() is always surrounded by invalidate_range_start() and
> invalidate_range_end(). It means kvm->mn_active_invalidate_count is
> always larger than zero when change_pte() is called. Under this
> condition (kvm->mn_active_invalidate_count > 0), the swap between the
> inactive and active memory slots by kvm_swap_active_memslots() can't
> be completed. So there are two cases for a memory slot when
> change_pte() is called: (a) it has been marked KVM_MEMSLOT_INVALID in
> the active memory slots by kvm_invalidate_memslot(), called before
> invalidate_range_start(); (b) the memory slot has been deleted from
> the active memory slots. We're only concerned with (a) when the active
> memory slots are iterated in __kvm_handle_hva_range().

OK, so to sum it up:

- the memslot cannot be swapped while we're walking the active
  memslots because kvm->mn_active_invalidate_count is elevated, and
  kvm_swap_active_memslots() will busy loop until this has reached 0
  again

- holding the srcu read_lock prevents an overlapping memslot update
  from being published at the wrong time (synchronize_srcu_expedited()
  in kvm_swap_active_memslots())

If the above holds, then I agree the fix looks correct. I'd definitely
want to see some of this rationale captured in the commit message
though.

Thanks,

	M.
On 6/9/23 6:06 PM, Marc Zyngier wrote:
> On Fri, 09 Jun 2023 00:17:34 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>> On 6/9/23 12:31 AM, Marc Zyngier wrote:
>>> On Thu, 08 Jun 2023 10:03:48 +0100,
>>> Gavin Shan <gshan@redhat.com> wrote:

[...]

> OK, so to sum it up:
>
> - the memslot cannot be swapped while we're walking the active
>   memslots because kvm->mn_active_invalidate_count is elevated, and
>   kvm_swap_active_memslots() will busy loop until this has reached 0
>   again
>
> - holding the srcu read_lock prevents an overlapping memslot update
>   from being published at the wrong time (synchronize_srcu_expedited()
>   in kvm_swap_active_memslots())
>
> If the above holds, then I agree the fix looks correct. I'd definitely
> want to see some of this rationale captured in the commit message
> though.
>

Yes, your summary is exactly what I understood. I will improve the
changelog to include the rationale in v2.

Thanks,
Gavin
On Thu, Jun 08, 2023, Gavin Shan wrote:
> We run into a guest hang in the edk2 firmware when KSM is running on
> the host. [...]
>
> Cc: stable@vger.kernel.org # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")

This Fixes isn't correct. That change only affected x86, which doesn't have this
bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
notifier callbacks"), arm64 did NOT skip invalid slots:

	slots = kvm_memslots(kvm);

	/* we only care about the pages that the guest sees */
	kvm_for_each_memslot(memslot, slots) {
		unsigned long hva_start, hva_end;
		gfn_t gpa;

		hva_start = max(start, memslot->userspace_addr);
		hva_end = min(end, memslot->userspace_addr +
					(memslot->npages << PAGE_SHIFT));
		if (hva_start >= hva_end)
			continue;

		gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
		ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
	}

#define kvm_for_each_memslot(memslot, slots)				\
	for (memslot = &slots->memslots[0];				\
	     memslot < slots->memslots + slots->used_slots; memslot++)	\
		if (WARN_ON_ONCE(!memslot->npages)) {			\
		} else

Unless I'm missing something, this goes all the way back to initial arm64 support
added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").

> Reported-by: Shuai Hu <hshuai@redhat.com>
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 479802a892d4..7f81a3a209b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  		unsigned long hva_start, hva_end;
>
>  		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +		if (slot->flags & KVM_MEMSLOT_INVALID)
> +			continue;

Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
comes along between installing the invalid slot and zapping SPTEs, KVM will
return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
references to the range.

I.e.

	kvm_copy_memslot(invalid_slot, old);
	invalid_slot->flags |= KVM_MEMSLOT_INVALID;
	kvm_replace_memslot(kvm, old, invalid_slot);

	/*
	 * Activate the slot that is now marked INVALID, but don't propagate
	 * the slot to the now inactive slots. The slot is either going to be
	 * deleted or recreated as a new slot.
	 */
	kvm_swap_active_memslots(kvm, old->as_id);

==> invalidate_range_start()

	/*
	 * From this point no new shadow pages pointing to a deleted, or moved,
	 * memslot will be created. Validation of sp->gfn happens in:
	 * - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
	 * - kvm_is_visible_gfn (mmu_check_root)
	 */
	kvm_arch_flush_shadow_memslot(kvm, old);

And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
to unmap the prior SPTE. Of course, that can't happen in the current code base
because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
count is elevated in .change_pte()" for details). That's also why this doesn't
show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
SPTE is present, which is never true due to the invalidation.

I'd honestly love to just delete the change_pte() callback, but my opinion is more
than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
memslot, but with a sanity check and comment explaining the dependency on there
being no SPTEs due to the invalidation. E.g.

---
 virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..b4987b49fac3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -686,6 +686,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 	return __kvm_handle_hva_range(kvm, &range);
 }
 
+static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	/*
+	 * Skipping invalid memslots is correct if and only if change_pte() is
+	 * surrounded by invalidate_range_{start,end}(), which is currently
+	 * guaranteed by the primary MMU.  If that ever changes, KVM needs to
+	 * unmap the memslot instead of skipping the memslot to ensure that KVM
+	 * doesn't hold references to the old PFN.
+	 */
+	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+
+	if (range->slot->flags & KVM_MEMSLOT_INVALID)
+		return false;
+
+	return kvm_set_spte_gfn(kvm, range);
+}
+
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -707,7 +725,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
 		return;
 
-	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
 }
 
 void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,

base-commit: 94cdeebd82111d7b7da5bd4da053eed9e0f65d72
--
Hi Sean,

On 6/13/23 12:41 AM, Sean Christopherson wrote:
> On Thu, Jun 08, 2023, Gavin Shan wrote:
>> We run into a guest hang in the edk2 firmware when KSM is running on
>> the host. [...]
>>
>> Cc: stable@vger.kernel.org # v5.13+
>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>
> This Fixes isn't correct. That change only affected x86, which doesn't have this
> bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
> notifier callbacks"), arm64 did NOT skip invalid slots:
>
> [...]
>
> Unless I'm missing something, this goes all the way back to initial arm64 support
> added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").
>

The fixes tag was sorted out based on 'git-bisect', not static code
analysis. I agree it should be d5d8184d35c9 ("KVM: ARM: Memory
virtualization setup") from the code. From the 'git-bisect', we found
the issue disappears when the head is commit 3039bcc74498 ("KVM: Move
x86's MMU notifier memslot walkers to generic code"). And yes, the
fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
MMU notifier callbacks"). I'm inclined to fix the issue only for ARM64;
more details are given below. If we're going to limit the issue to
ARM64 and fix it for ARM64 only, the fixes tag should be the one you
pointed at. Let's correct it in the next revision with:

  Cc: stable@vger.kernel.org # v3.9+
  Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")

>> [...]
>>
>>  		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>> +		if (slot->flags & KVM_MEMSLOT_INVALID)
>> +			continue;
>
> Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
> comes along between installing the invalid slot and zapping SPTEs, KVM will
> return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
> references to the range.
>
> [...]
>
> And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
> to unmap the prior SPTE. Of course, that can't happen in the current code base
> because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
> more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
> count is elevated in .change_pte()" for details). That's also why this doesn't
> show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
> SPTE is present, which is never true due to the invalidation.
>

Right, those architectural dependencies are really something I worried
about. It's safe to skip the invalid memory slots for ARM64, but it may
be unsafe to do so for other architectures. You've listed the potential
risks of doing so for x86. It might be risky with PowerPC's reverse
mapping stuff as well. I didn't look into the code for the remaining
architectures. In order to escape from the architectural conflicts, I
would move the check and skip the invalid memory slot in
arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(), something like below. In this
way, the guest hang issue on ARM64 is fixed. We may have a similar
issue on other architectures, but we can figure out the fix when we
have to. Sean, please let me know if you're happy with this?

arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
----------------------------------------

bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	kvm_pfn_t pfn = pte_pfn(range->pte);

	if (!kvm->arch.mmu.pgt)
		return false;

	/*
	 * The memory slot can become invalid temporarily or permanently
	 * when it's being moved or deleted. Avoid the stage2 mapping so
	 * that all the read and write requests to the region of the memory
	 * slot can be forwarded to VMM and emulated there.
	 */
	if (range->slot->flags & KVM_MEMSLOT_INVALID)
		return false;

	WARN_ON(range->end - range->start != 1);

	:
}

> I'd honestly love to just delete the change_pte() callback, but my opinion is more
> than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
> option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
> memslot, but with a sanity check and comment explaining the dependency on there
> being no SPTEs due to the invalidation. E.g.
>

It's a good idea, but I'm afraid other architectures like PowerPC will
still be affected. So I would like to limit this issue to ARM64 and fix
it in its kvm_set_spte_gfn() variant, as above. One question about "we
don't use KSM": could you please share more information about this? I'm
blindly guessing you're saying KSM isn't used in google cloud?

[...]

Thanks,
Gavin
On Tue, Jun 13, 2023, Gavin Shan wrote:
> Hi Sean,
>
> On 6/13/23 12:41 AM, Sean Christopherson wrote:
> > > Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> >
> > This Fixes isn't correct. That change only affected x86, which doesn't have this
> > bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
> > notifier callbacks"), arm64 did NOT skip invalid slots

...

> > Unless I'm missing something, this goes all the way back to initial arm64 support
> > added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").
>
> The fixes tag was sorted out based on 'git-bisect', not static code
> analysis. I agree it should be d5d8184d35c9 ("KVM: ARM: Memory
> virtualization setup") from the code. From the 'git-bisect', we found
> the issue disappears when the head is commit 3039bcc74498 ("KVM: Move
> x86's MMU notifier memslot walkers to generic code"). And yes, the
> fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
> MMU notifier callbacks").

No, just because bisect points at a commit doesn't mean that's the commit that
introduced a bug. It's not uncommon for a completely valid change to expose a
pre-existing bug, which is what I suspect happened here, e.g. after switching to
the generic framework, clean_dcache_guest_page() is called *after* retrieving the
memslot, versus clean_dcache_guest_page() being called before walking memslots.

Blaming the correct commit matters, both so that it's clear to future readers what
went wrong, and also so that people maintaining older kernels are at least aware
that their kernel may be affected. E.g. a fix in generic KVM obviously won't
backport to 5.10, but someone who cares deeply about a 5.10-based kernel on arm64
could manually port the fix to KVM arm64 code.

> I'm inclined to fix the issue only for ARM64;

I never suggested that an ARM-only fix be made, in fact I explicitly suggested
the exact opposite.

> more details are given below. If we're going to limit the issue to ARM64 and
> fix it for ARM64 only, the fixes tag should be the one you pointed at. Let's
> correct it in the next revision with:

For a generic fix, just blame multiple commits, e.g. the guilty arm64, RISC-V,
and MIPS commits, which at a glance are the affected architectures. At one point
in time, x86 was also likely affected, but that's probably not worth more than a
brief mention in the changelog as I don't see any value in tracking down a very
theoretical window of time where x86 was affected.

> Cc: stable@vger.kernel.org # v3.9+
> Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>
> > Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
> > comes along between installing the invalid slot and zapping SPTEs, KVM will
> > return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
> > references to the range.
> >
> > [...]
>
> Right, those architectural dependencies are really something I worried
> about.

The zap case isn't architecture specific, that's true for all KVM architectures.

> It's safe to skip the invalid memory slots for ARM64,

No, it's not. The problematic window where an invalidation can be incorrectly
skipped is very tiny, and would have zero chance of being seen without something
generating invalidations, e.g. page migration. But that doesn't mean there's no
bug.

> but it may be unsafe to do so for other architectures. You've listed the
> potential risks of doing so for x86. It might be risky with PowerPC's reverse
> mapping stuff as well. I didn't look into the code for the remaining
> architectures. In order to escape from the architectural conflicts, I
> would move the check and skip the invalid memory slot in
> arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(), something like below. In this
> way, the guest hang issue on ARM64 is fixed. We may have a similar
> issue on other architectures, but we can figure out the fix when we
> have to. Sean, please let me know if you're happy with this?
>
> arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
> ----------------------------------------
>
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> 	kvm_pfn_t pfn = pte_pfn(range->pte);
>
> 	if (!kvm->arch.mmu.pgt)
> 		return false;
>
> 	/*
> 	 * The memory slot can become invalid temporarily or permanently
> 	 * when it's being moved or deleted. Avoid the stage2 mapping so
> 	 * that all the read and write requests to the region of the memory
> 	 * slot can be forwarded to VMM and emulated there.
> 	 */
> 	if (range->slot->flags & KVM_MEMSLOT_INVALID)
> 		return false;

Please no. (a) it papers over common KVM's reliance on the SPTE being zapped by
invalidate_range_start(), and (b) it leaves the same bug in RISC-V (which copy+pasted
much of arm64's MMU code) and in MIPS (which also installs SPTEs in its callback).

> WARN_ON(range->end - range->start != 1);
>
> 	:
> }
>
> > I'd honestly love to just delete the change_pte() callback, but my opinion is more
> > than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
> > option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
> > memslot, but with a sanity check and comment explaining the dependency on there
> > being no SPTEs due to the invalidation. E.g.
>
> It's a good idea, but I'm afraid other architectures like PowerPC will
> still be affected.

I don't follow. PPC unmaps in response to a PTE change, but that's effectively
dead code due to change_pte() being wrapped with invalidate_range_{start,end}().

> So I would like to limit this issue to ARM64 and fix it in its
> kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
> could you please share more information about this? I'm blindly guessing
> you're saying KSM isn't used in google cloud?

Yeah, we == Google/GCE. Sorry, should have clarified that.
Hi Sean,

On 6/14/23 12:58 AM, Sean Christopherson wrote:
> On Tue, Jun 13, 2023, Gavin Shan wrote:
>> On 6/13/23 12:41 AM, Sean Christopherson wrote:

[...]

> No, just because bisect points at a commit doesn't mean that's the commit that
> introduced a bug. It's not uncommon for a completely valid change to expose a
> pre-existing bug, which is what I suspect happened here, e.g. after switching to
> the generic framework, clean_dcache_guest_page() is called *after* retrieving the
> memslot, versus clean_dcache_guest_page() being called before walking memslots.
>
> Blaming the correct commit matters, both so that it's clear to future readers what
> went wrong, and also so that people maintaining older kernels are at least aware
> that their kernel may be affected. E.g. a fix in generic KVM obviously won't
> backport to 5.10, but someone who cares deeply about a 5.10-based kernel on arm64
> could manually port the fix to KVM arm64 code.
>

Yes, the change of the call site of clean_dcache_guest_page() enlarges
the race window between memory slot removal and change_pte()
theoretically. With this, we were able to reproduce the issue in a
practical test. In the next revision, let's put a note about this in
the changelog, like below.

  We tried a git-bisect and the first problematic commit is cd4c71835228
  ("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With
  this, clean_dcache_guest_page() is called after the memory slots are
  iterated in kvm_mmu_notifier_change_pte(). Without this commit,
  clean_dcache_guest_page() is called before the iteration on the memory
  slots. This change literally enlarges the racy window between
  kvm_mmu_notifier_change_pte() and memory slot removal so that we were
  able to reproduce the issue in a practical test case. However, the
  issue exists since d5d8184d35c9 ("KVM: ARM: Memory virtualization
  setup").

  Cc: stable@vger.kernel.org # v3.9+
  Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")

>> I'm inclined to fix the issue only for ARM64;
>
> I never suggested that an ARM-only fix be made, in fact I explicitly suggested
> the exact opposite.
>

Well, the original patch ignores the invalid memory slot for all MMU
notifiers. You suggested having this ignorance only for change_pte().
The scope was decreased, and I was following this direction to narrow
the issue/fix to change_pte() on ARM64 only. Sorry that I misunderstood
your suggestion.

> For a generic fix, just blame multiple commits, e.g. the guilty arm64, RISC-V,
> and MIPS commits, which at a glance are the affected architectures. At one point
> in time, x86 was also likely affected, but that's probably not worth more than a
> brief mention in the changelog as I don't see any value in tracking down a very
> theoretical window of time where x86 was affected.
>

Right. Please refer to the reply above; I plan to add a section to the
changelog in the next revision.

> The zap case isn't architecture specific, that's true for all KVM architectures.
>

Yes.

>> It's safe to skip the invalid memory slots for ARM64,
>
> No, it's not. The problematic window where an invalidation can be incorrectly
> skipped is very tiny, and would have zero chance of being seen without something
> generating invalidations, e.g. page migration. But that doesn't mean there's no
> bug.
>

Yeah, agreed.

> Please no. (a) it papers over common KVM's reliance on the SPTE being zapped by
> invalidate_range_start(), and (b) it leaves the same bug in RISC-V (which copy+pasted
> much of arm64's MMU code) and in MIPS (which also installs SPTEs in its callback).
>

Ok. Now I understand your suggestion to fix the issue for all
architectures. I will post v3 with your suggested changes.

> I don't follow. PPC unmaps in response to a PTE change, but that's effectively
> dead code due to change_pte() being wrapped with invalidate_range_{start,end}().
>

You're right. I forgot the fact that change_pte() is surrounded by
invalidate_range_{start,end}().

>> One question about "we don't use KSM": could you please share more
>> information about this? I'm blindly guessing you're saying KSM isn't
>> used in google cloud?
>
> Yeah, we == Google/GCE. Sorry, should have clarified that.
>

Ok, thanks for your clarification.

Thanks,
Gavin
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..7f81a3a209b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 		unsigned long hva_start, hva_end;
 
 		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+		if (slot->flags & KVM_MEMSLOT_INVALID)
+			continue;
+
 		hva_start = max(range->start, slot->userspace_addr);
 		hva_end = min(range->end, slot->userspace_addr +
 					  (slot->npages << PAGE_SHIFT));