KVM: Avoid illegal stage2 mapping on invalid memory slot

Message ID 20230608090348.414990-1-gshan@redhat.com

Commit Message

Gavin Shan June 8, 2023, 9:03 a.m. UTC
  We run into a guest hang in the edk2 firmware when KSM is running
on the host. The edk2 firmware waits for status 0x80 from QEMU's
pflash device (TYPE_PFLASH_CFI01) during a sector erase or buffered
write operation. The status is returned by reading the memory region
of the pflash device, and the read request should have been forwarded
to QEMU and emulated by it. Unfortunately, when the guest hang occurs,
the read request is covered by an illegal stage2 mapping. The read
request is completed with QEMU bypassed and a wrong status is fetched.

The illegal stage2 mapping is populated due to same-page merging by
KSM at (C) even though the associated memory slot has been marked as
invalid at (B).

  CPU-A                    CPU-B
  -----                    -----
                           ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                           kvm_vm_ioctl_set_memory_region
                           kvm_set_memory_region
                           __kvm_set_memory_region
                           kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                             kvm_invalidate_memslot
                               kvm_copy_memslot
                               kvm_replace_memslot
                               kvm_swap_active_memslots        (A)
                               kvm_arch_flush_shadow_memslot   (B)
  same page merging by KSM
  kvm_mmu_notifier_change_pte
  kvm_handle_hva_range
  __kvm_handle_hva_range       (C)

Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping. Without the illegal stage2 mapping, the read
request for the pflash's status is forwarded to QEMU and emulated by
it. The correct pflash's status can be returned from QEMU to break
the infinite wait in edk2 firmware.

Cc: stable@vger.kernel.org # v5.13+
Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 virt/kvm/kvm_main.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Marc Zyngier June 8, 2023, 2:31 p.m. UTC | #1
Hi Gavin,

On Thu, 08 Jun 2023 10:03:48 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> We run into guest hang in edk2 firmware when KSM is kept as running
> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
> erasing or buffered write. The status is returned by reading the
> memory region of the pflash device and the read request should
> have been forwarded to QEMU and emulated by it. Unfortunately, the
> read request is covered by an illegal stage2 mapping when the guest
> hang issue occurs. The read request is completed with QEMU bypassed and
> wrong status is fetched.
> 
> The illegal stage2 mapping is populated due to same-page merging by
> KSM at (C) even though the associated memory slot has been marked as
> invalid at (B).
> 
>   CPU-A                    CPU-B
>   -----                    -----
>                            ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>                            kvm_vm_ioctl_set_memory_region
>                            kvm_set_memory_region
>                            __kvm_set_memory_region
>                            kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>                              kvm_invalidate_memslot
>                                kvm_copy_memslot
>                                kvm_replace_memslot
>                                kvm_swap_active_memslots        (A)
>                                kvm_arch_flush_shadow_memslot   (B)
>   same page merging by KSM
>   kvm_mmu_notifier_change_pte
>   kvm_handle_hva_range
>   __kvm_handle_hva_range       (C)
> 
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping. Without the illegal stage2 mapping, the read
> request for the pflash's status is forwarded to QEMU and emulated by
> it. The correct pflash's status can be returned from QEMU to break
> the infinite wait in edk2 firmware.

Huh, nice one :-(.

> 
> Cc: stable@vger.kernel.org # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> Reported-by: Shuai Hu <hshuai@redhat.com>
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 479802a892d4..7f81a3a209b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			unsigned long hva_start, hva_end;
>  
>  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +			if (slot->flags & KVM_MEMSLOT_INVALID)
> +				continue;
> +
>  			hva_start = max(range->start, slot->userspace_addr);
>  			hva_end = min(range->end, slot->userspace_addr +
>  						  (slot->npages << PAGE_SHIFT));

I don't immediately see what makes it safer. If we're not holding one
of slots_{,arch_}lock in the notifier, we can still race against the
update, can't we?  I don't think holding the srcu lock helps us here.

Thanks,

	M.
  
Gavin Shan June 8, 2023, 11:17 p.m. UTC | #2
Hi Marc,

[Cc Andrea/David/Peter Xu]

On 6/9/23 12:31 AM, Marc Zyngier wrote:
> On Thu, 08 Jun 2023 10:03:48 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>>
>> We run into guest hang in edk2 firmware when KSM is kept as running
>> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
>> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
>> erasing or buffered write. The status is returned by reading the
>> memory region of the pflash device and the read request should
>> have been forwarded to QEMU and emulated by it. Unfortunately, the
>> read request is covered by an illegal stage2 mapping when the guest
>> hang issue occurs. The read request is completed with QEMU bypassed and
>> wrong status is fetched.
>>
>> The illegal stage2 mapping is populated due to same-page merging by
>> KSM at (C) even though the associated memory slot has been marked as
>> invalid at (B).
>>
>>    CPU-A                    CPU-B
>>    -----                    -----
>>                             ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>>                             kvm_vm_ioctl_set_memory_region
>>                             kvm_set_memory_region
>>                             __kvm_set_memory_region
>>                             kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>>                               kvm_invalidate_memslot
>>                                 kvm_copy_memslot
>>                                 kvm_replace_memslot
>>                                 kvm_swap_active_memslots        (A)
>>                                 kvm_arch_flush_shadow_memslot   (B)
>>    same page merging by KSM
>>    kvm_mmu_notifier_change_pte
>>    kvm_handle_hva_range
>>    __kvm_handle_hva_range       (C)
>>
>> Fix the issue by skipping the invalid memory slot at (C) to avoid the
>> illegal stage2 mapping. Without the illegal stage2 mapping, the read
>> request for the pflash's status is forwarded to QEMU and emulated by
>> it. The correct pflash's status can be returned from QEMU to break
>> the infinite wait in edk2 firmware.
> 
> Huh, nice one :-(.
> 

Yeah, it's a sneaky one :)

>>
>> Cc: stable@vger.kernel.org # v5.13+
>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>> Reported-by: Shuai Hu <hshuai@redhat.com>
>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   virt/kvm/kvm_main.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 479802a892d4..7f81a3a209b6 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>   			unsigned long hva_start, hva_end;
>>   
>>   			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>> +			if (slot->flags & KVM_MEMSLOT_INVALID)
>> +				continue;
>> +
>>   			hva_start = max(range->start, slot->userspace_addr);
>>   			hva_end = min(range->end, slot->userspace_addr +
>>   						  (slot->npages << PAGE_SHIFT));
> 
> I don't immediately see what makes it safer. If we're not holding one
> of slots_{,arch_}lock in the notifier, we can still race against the
> update, can't we?  I don't think holding the srcu lock helps us here.
> 

A deadlock would be triggered if we held @slots_lock in
__kvm_handle_hva_range(), given a call site like the one below. There is
a similar reason why we can't hold @slots_arch_lock in the function.

   CPU-A                                CPU-B
   -----                                ------
   invalidate_range_start()
     kvm->mn_active_invalidate_count++
     __kvm_handle_hva_range
                                        ioctl(kvm, KVM_SET_USER_MEMORY_REGION)   // Delete
                                        kvm_vm_ioctl_set_memory_region
                                        mutex_lock(&kvm->slots_lock);
                                        __kvm_set_memory_region
                                          kvm_set_memslot(..., KVM_MR_DELETE)
                                            kvm_invalidate_memslot
                                              kvm_replace_memslot
                                              kvm_swap_active_memslots          // infinite wait until
                                        mutex_unlock(&kvm->slots_lock);         // kvm->mn_active_invalidate_count is 0
   invalidate_range_end
     __kvm_handle_hva_range
       mutex_lock(&kvm->slots_lock);   // lock taken by CPU-B
       mutex_unlock(&kvm->slots_lock);
     --kvm->mn_active_invalidate_count

change_pte() is always surrounded by invalidate_range_start() and
invalidate_range_end(). It means kvm->mn_active_invalidate_count is always
larger than zero when change_pte() is called. Under this condition
(kvm->mn_active_invalidate_count > 0), the swap between the inactive
and active memory slots by kvm_swap_active_memslots() can't happen.
So there are two possible cases for a memory slot when change_pte() is
called: (a) it has been marked as KVM_MEMSLOT_INVALID in the active
memory slots by kvm_invalidate_memslot(), called before
invalidate_range_start(); or (b) it has already been deleted from the
active memory slots. Only (a) is of concern when the active memory
slots are iterated in __kvm_handle_hva_range().

static void kvm_mmu_notifier_change_pte(...)
{
         :
         /*
          * .change_pte() must be surrounded by .invalidate_range_{start,end}().
          * If mmu_invalidate_in_progress is zero, then no in-progress
          * invalidations, including this one, found a relevant memslot at
          * start(); rechecking memslots here is unnecessary.  Note, a false
          * positive (count elevated by a different invalidation) is sub-optimal
          * but functionally ok.
          */
         WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
         if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
                 return;
         :
}


The srcu lock in __kvm_handle_hva_range() prevents the swap of the
active and inactive memory slots by kvm_swap_active_memslots(). For
this particular case, it's not relevant because the swap between the
inactive and active memory slots has already been done once, before
invalidate_range_start() is called.

Thanks,
Gavin
  
Marc Zyngier June 9, 2023, 8:06 a.m. UTC | #3
On Fri, 09 Jun 2023 00:17:34 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> Hi Marc,
> 
> [Cc Andrea/David/Peter Xu]
> 
> On 6/9/23 12:31 AM, Marc Zyngier wrote:
> > On Thu, 08 Jun 2023 10:03:48 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> >> 
> >> We run into guest hang in edk2 firmware when KSM is kept as running
> >> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
> >> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
> >> erasing or buffered write. The status is returned by reading the
> >> memory region of the pflash device and the read request should
> >> have been forwarded to QEMU and emulated by it. Unfortunately, the
> >> read request is covered by an illegal stage2 mapping when the guest
> >> hang issue occurs. The read request is completed with QEMU bypassed and
> >> wrong status is fetched.
> >> 
> >> The illegal stage2 mapping is populated due to same-page merging by
> >> KSM at (C) even though the associated memory slot has been marked as
> >> invalid at (B).
> >> 
> >>    CPU-A                    CPU-B
> >>    -----                    -----
> >>                             ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> >>                             kvm_vm_ioctl_set_memory_region
> >>                             kvm_set_memory_region
> >>                             __kvm_set_memory_region
> >>                             kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> >>                               kvm_invalidate_memslot
> >>                                 kvm_copy_memslot
> >>                                 kvm_replace_memslot
> >>                                 kvm_swap_active_memslots        (A)
> >>                                 kvm_arch_flush_shadow_memslot   (B)
> >>    same page merging by KSM
> >>    kvm_mmu_notifier_change_pte
> >>    kvm_handle_hva_range
> >>    __kvm_handle_hva_range       (C)
> >> 
> >> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> >> illegal stage2 mapping. Without the illegal stage2 mapping, the read
> >> request for the pflash's status is forwarded to QEMU and emulated by
> >> it. The correct pflash's status can be returned from QEMU to break
> >> the infinite wait in edk2 firmware.
> > 
> > Huh, nice one :-(.
> > 
> 
> Yeah, it's a sneaky one :)
> 
> >> 
> >> Cc: stable@vger.kernel.org # v5.13+
> >> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> >> Reported-by: Shuai Hu <hshuai@redhat.com>
> >> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >>   virt/kvm/kvm_main.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >> 
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 479802a892d4..7f81a3a209b6 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> >>   			unsigned long hva_start, hva_end;
> >>     			slot = container_of(node, struct
> >> kvm_memory_slot, hva_node[slots->node_idx]);
> >> +			if (slot->flags & KVM_MEMSLOT_INVALID)
> >> +				continue;
> >> +
> >>   			hva_start = max(range->start, slot->userspace_addr);
> >>   			hva_end = min(range->end, slot->userspace_addr +
> >>   						  (slot->npages << PAGE_SHIFT));
> > 
> > I don't immediately see what makes it safer. If we're not holding one
> > of slots_{,arch_}lock in the notifier, we can still race against the
> > update, can't we?  I don't think holding the srcu lock helps us here.
> > 

[...]

> change_pte() is always surrounded by invalidate_range_start and
> invalidate_range_end(). It means kvm->mn_active_invalidate_count is always
> larger than zero when change_pte() is called. With this condition
> (kvm->mn_active_invalidate_count > 0), The swapping between the inactive
> and active memory slots by kvm_swap_active_memslots() can't be done.
> So there are two cases for one memory slot when change_pte() is called:
> (a) it has been marked as KVM_MEMSLOT_INVALID in the active memory slots
> by kvm_invalidate_memslot(), called before invalidate_range_start();
> (b) the memory slot has been deleted from the active memory slots. We're
> only concerned by (a) when the active memory slots are iterated in
> __kvm_handle_hva_range().

OK, so to sum it up:

- the memslot cannot be swapped while we're walking the active
  memslots because kvm->mn_active_invalidate_count is elevated, and
  kvm_swap_active_memslots() will busy loop until this has reached 0
  again

- holding the srcu read_lock prevents an overlapping memslot update
  from being published at the wrong time (synchronize_srcu_expedited()
  in kvm_swap_active_memslots())

If the above holds, then I agree the fix looks correct. I'd definitely
want to see some of this rationale captured in the commit message
though.

Thanks,

	M.

> 
> static void kvm_mmu_notifier_change_pte(...)
> {
>         :
>         /*
>          * .change_pte() must be surrounded by .invalidate_range_{start,end}().
>          * If mmu_invalidate_in_progress is zero, then no in-progress
>          * invalidations, including this one, found a relevant memslot at
>          * start(); rechecking memslots here is unnecessary.  Note, a false
>          * positive (count elevated by a different invalidation) is sub-optimal
>          * but functionally ok.
>          */
>         WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>         if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>                 return;
>         :
> }
> 
> 
> The srcu lock in __kvm_handle_hva_range() prevents the swapping of
> the active and inactive memory slots by kvm_swap_active_memslots(). For
> this particular case, it's not relevant because the swapping between
> the inactive and active memory slots has been done for once, before
> invalidate_range_start() is called.
  
Gavin Shan June 9, 2023, 9:02 a.m. UTC | #4
On 6/9/23 6:06 PM, Marc Zyngier wrote:
> On Fri, 09 Jun 2023 00:17:34 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>> On 6/9/23 12:31 AM, Marc Zyngier wrote:
>>> On Thu, 08 Jun 2023 10:03:48 +0100,
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>
>>>> We run into guest hang in edk2 firmware when KSM is kept as running
>>>> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
>>>> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
>>>> erasing or buffered write. The status is returned by reading the
>>>> memory region of the pflash device and the read request should
>>>> have been forwarded to QEMU and emulated by it. Unfortunately, the
>>>> read request is covered by an illegal stage2 mapping when the guest
>>>> hang issue occurs. The read request is completed with QEMU bypassed and
>>>> wrong status is fetched.
>>>>
>>>> The illegal stage2 mapping is populated due to same-page merging by
>>>> KSM at (C) even though the associated memory slot has been marked as
>>>> invalid at (B).
>>>>
>>>>     CPU-A                    CPU-B
>>>>     -----                    -----
>>>>                              ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>>>>                              kvm_vm_ioctl_set_memory_region
>>>>                              kvm_set_memory_region
>>>>                              __kvm_set_memory_region
>>>>                              kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>>>>                                kvm_invalidate_memslot
>>>>                                  kvm_copy_memslot
>>>>                                  kvm_replace_memslot
>>>>                                  kvm_swap_active_memslots        (A)
>>>>                                  kvm_arch_flush_shadow_memslot   (B)
>>>>     same page merging by KSM
>>>>     kvm_mmu_notifier_change_pte
>>>>     kvm_handle_hva_range
>>>>     __kvm_handle_hva_range       (C)
>>>>
>>>> Fix the issue by skipping the invalid memory slot at (C) to avoid the
>>>> illegal stage2 mapping. Without the illegal stage2 mapping, the read
>>>> request for the pflash's status is forwarded to QEMU and emulated by
>>>> it. The correct pflash's status can be returned from QEMU to break
>>>> the infinite wait in edk2 firmware.
>>>
>>> Huh, nice one :-(.
>>>
>>
>> Yeah, it's a sneaky one :)
>>
>>>>
>>>> Cc: stable@vger.kernel.org # v5.13+
>>>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>>>> Reported-by: Shuai Hu <hshuai@redhat.com>
>>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>    virt/kvm/kvm_main.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 479802a892d4..7f81a3a209b6 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>>>    			unsigned long hva_start, hva_end;
>>>>      			slot = container_of(node, struct
>>>> kvm_memory_slot, hva_node[slots->node_idx]);
>>>> +			if (slot->flags & KVM_MEMSLOT_INVALID)
>>>> +				continue;
>>>> +
>>>>    			hva_start = max(range->start, slot->userspace_addr);
>>>>    			hva_end = min(range->end, slot->userspace_addr +
>>>>    						  (slot->npages << PAGE_SHIFT));
>>>
>>> I don't immediately see what makes it safer. If we're not holding one
>>> of slots_{,arch_}lock in the notifier, we can still race against the
>>> update, can't we?  I don't think holding the srcu lock helps us here.
>>>
> 
> [...]
> 
>> change_pte() is always surrounded by invalidate_range_start and
>> invalidate_range_end(). It means kvm->mn_active_invalidate_count is always
>> larger than zero when change_pte() is called. With this condition
>> (kvm->mn_active_invalidate_count > 0), The swapping between the inactive
>> and active memory slots by kvm_swap_active_memslots() can't be done.
>> So there are two cases for one memory slot when change_pte() is called:
>> (a) it has been marked as KVM_MEMSLOT_INVALID in the active memory slots
>> by kvm_invalidate_memslot(), called before invalidate_range_start();
>> (b) the memory slot has been deleted from the active memory slots. We're
>> only concerned by (a) when the active memory slots are iterated in
>> __kvm_handle_hva_range().
> 
> OK, so to sum it up:
> 
> - the memslot cannot be swapped while we're walking the active
>    memslots because kvm->mn_active_invalidate_count is elevated, and
>    kvm_swap_active_memslots() will busy loop until this has reached 0
>    again
> 
> - holding the srcu read_lock prevents an overlapping memslot update
>    from being published at the wrong time (synchronize_srcu_expedited()
>    in kvm_swap_active_memslots())
> 
> If the above holds, then I agree the fix looks correct. I'd definitely
> want to see some of this rationale captured in the commit message
> though.
> 

Yes, your summary is exactly what I understood. I will improve the
changelog to include the rationale in v2.

Thanks,
Gavin
> 
>>
>> static void kvm_mmu_notifier_change_pte(...)
>> {
>>          :
>>          /*
>>           * .change_pte() must be surrounded by .invalidate_range_{start,end}().
>>           * If mmu_invalidate_in_progress is zero, then no in-progress
>>           * invalidations, including this one, found a relevant memslot at
>>           * start(); rechecking memslots here is unnecessary.  Note, a false
>>           * positive (count elevated by a different invalidation) is sub-optimal
>>           * but functionally ok.
>>           */
>>          WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>>          if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>>                  return;
>>          :
>> }
>>
>>
>> The srcu lock in __kvm_handle_hva_range() prevents the swapping of
>> the active and inactive memory slots by kvm_swap_active_memslots(). For
>> this particular case, it's not relevant because the swapping between
>> the inactive and active memory slots has been done for once, before
>> invalidate_range_start() is called.
> 
>
  
Sean Christopherson June 12, 2023, 2:41 p.m. UTC | #5
On Thu, Jun 08, 2023, Gavin Shan wrote:
> We run into guest hang in edk2 firmware when KSM is kept as running
> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
> erasing or buffered write. The status is returned by reading the
> memory region of the pflash device and the read request should
> have been forwarded to QEMU and emulated by it. Unfortunately, the
> read request is covered by an illegal stage2 mapping when the guest
> hang issue occurs. The read request is completed with QEMU bypassed and
> wrong status is fetched.
> 
> The illegal stage2 mapping is populated due to same-page merging by
> KSM at (C) even though the associated memory slot has been marked as
> invalid at (B).
> 
>   CPU-A                    CPU-B
>   -----                    -----
>                            ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>                            kvm_vm_ioctl_set_memory_region
>                            kvm_set_memory_region
>                            __kvm_set_memory_region
>                            kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>                              kvm_invalidate_memslot
>                                kvm_copy_memslot
>                                kvm_replace_memslot
>                                kvm_swap_active_memslots        (A)
>                                kvm_arch_flush_shadow_memslot   (B)
>   same page merging by KSM
>   kvm_mmu_notifier_change_pte
>   kvm_handle_hva_range
>   __kvm_handle_hva_range       (C)
> 
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping. Without the illegal stage2 mapping, the read
> request for the pflash's status is forwarded to QEMU and emulated by
> it. The correct pflash's status can be returned from QEMU to break
> the infinite wait in edk2 firmware.
> 
> Cc: stable@vger.kernel.org # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")

This Fixes isn't correct.  That change only affected x86, which doesn't have this
bug.  And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
notifier callbacks"), arm64 did NOT skip invalid slots:

        slots = kvm_memslots(kvm);

        /* we only care about the pages that the guest sees */
        kvm_for_each_memslot(memslot, slots) {
                unsigned long hva_start, hva_end;
                gfn_t gpa;

                hva_start = max(start, memslot->userspace_addr);
                hva_end = min(end, memslot->userspace_addr +
                                        (memslot->npages << PAGE_SHIFT));
                if (hva_start >= hva_end)
                        continue;

                gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
                ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
        }

#define kvm_for_each_memslot(memslot, slots)                            \
        for (memslot = &slots->memslots[0];                             \
             memslot < slots->memslots + slots->used_slots; memslot++)  \
                if (WARN_ON_ONCE(!memslot->npages)) {                   \
                } else

Unless I'm missing something, this goes all the way back to initial arm64 support
added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").

> Reported-by: Shuai Hu <hshuai@redhat.com>
> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  virt/kvm/kvm_main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 479802a892d4..7f81a3a209b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			unsigned long hva_start, hva_end;
>  
>  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +			if (slot->flags & KVM_MEMSLOT_INVALID)
> +				continue;

Skipping the memslot will lead to use-after-free.  E.g. if an invalidate_range_start()
comes along between installing the invalid slot and zapping SPTEs, KVM will
return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
references to the range.

I.e.

	kvm_copy_memslot(invalid_slot, old);
	invalid_slot->flags |= KVM_MEMSLOT_INVALID;
	kvm_replace_memslot(kvm, old, invalid_slot);

	/*
	 * Activate the slot that is now marked INVALID, but don't propagate
	 * the slot to the now inactive slots. The slot is either going to be
	 * deleted or recreated as a new slot.
	 */
	kvm_swap_active_memslots(kvm, old->as_id);


==> invalidate_range_start()

	/*
	 * From this point no new shadow pages pointing to a deleted, or moved,
	 * memslot will be created.  Validation of sp->gfn happens in:
	 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
	 *	- kvm_is_visible_gfn (mmu_check_root)
	 */
	kvm_arch_flush_shadow_memslot(kvm, old);

And even for change_pte(), skipping the memslot is wrong, as KVM would then fail to
unmap the prior SPTE.  Of course, that can't happen in the current code base
because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
count is elevated in .change_pte()" for details).  That's also why this doesn't
show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
SPTE is present, which is never true due to the invalidation.

I'd honestly love to just delete the change_pte() callback, but my opinion is more
than a bit biased since we don't use KSM.  Assuming we keep change_pte(), the best
option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
memslot, but with a sanity check and comment explaining the dependency on there
being no SPTEs due to the invalidation.  E.g.

---
 virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..b4987b49fac3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -686,6 +686,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 
 	return __kvm_handle_hva_range(kvm, &range);
 }
+
+static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	/*
+	 * Skipping invalid memslots is correct if and only if change_pte() is
+	 * surrounded by invalidate_range_{start,end}(), which is currently
+	 * guaranteed by the primary MMU.  If that ever changes, KVM needs to
+	 * unmap the memslot instead of skipping the memslot to ensure that KVM
+	 * doesn't hold references to the old PFN.
+	 */
+	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+
+        if (range->slot->flags & KVM_MEMSLOT_INVALID)
+		return false;
+
+	return kvm_set_spte_gfn(kvm, range);
+}
+
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -707,7 +725,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
 		return;
 
-	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
 }
 
 void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,

base-commit: 94cdeebd82111d7b7da5bd4da053eed9e0f65d72
--
  
Gavin Shan June 13, 2023, 1:06 a.m. UTC | #6
Hi Sean,

On 6/13/23 12:41 AM, Sean Christopherson wrote:
> On Thu, Jun 08, 2023, Gavin Shan wrote:
>> We run into guest hang in edk2 firmware when KSM is kept as running
>> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
>> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
>> erasing or buffered write. The status is returned by reading the
>> memory region of the pflash device and the read request should
>> have been forwarded to QEMU and emulated by it. Unfortunately, the
>> read request is covered by an illegal stage2 mapping when the guest
>> hang issue occurs. The read request is completed with QEMU bypassed and
>> wrong status is fetched.
>>
>> The illegal stage2 mapping is populated due to same page mering by
>> KSM at (C) even the associated memory slot has been marked as invalid
>> at (B).
>>
>>    CPU-A                    CPU-B
>>    -----                    -----
>>                             ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>>                             kvm_vm_ioctl_set_memory_region
>>                             kvm_set_memory_region
>>                             __kvm_set_memory_region
>>                             kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>>                               kvm_invalidate_memslot
>>                                 kvm_copy_memslot
>>                                 kvm_replace_memslot
>>                                 kvm_swap_active_memslots        (A)
>>                                 kvm_arch_flush_shadow_memslot   (B)
>>    same page merging by KSM
>>    kvm_mmu_notifier_change_pte
>>    kvm_handle_hva_range
>>    __kvm_handle_hva_range       (C)
>>
>> Fix the issue by skipping the invalid memory slot at (C) to avoid the
>> illegal stage2 mapping. Without the illegal stage2 mapping, the read
>> request for the pflash's status is forwarded to QEMU and emulated by
>> it. The correct pflash's status can be returned from QEMU to break
>> the infinite wait in edk2 firmware.
>>
>> Cc: stable@vger.kernel.org # v5.13+
>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> 
> This Fixes isn't correct.  That change only affected x86, which doesn't have this
> bug.  And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
> notifier callbacks"), arm64 did NOT skip invalid slots
> 
>          slots = kvm_memslots(kvm);
> 
>          /* we only care about the pages that the guest sees */
>          kvm_for_each_memslot(memslot, slots) {
>                  unsigned long hva_start, hva_end;
>                  gfn_t gpa;
> 
>                  hva_start = max(start, memslot->userspace_addr);
>                  hva_end = min(end, memslot->userspace_addr +
>                                          (memslot->npages << PAGE_SHIFT));
>                  if (hva_start >= hva_end)
>                          continue;
> 
>                  gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
>                  ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
>          }
> 
> #define kvm_for_each_memslot(memslot, slots)                            \
>          for (memslot = &slots->memslots[0];                             \
>               memslot < slots->memslots + slots->used_slots; memslot++)  \
>                  if (WARN_ON_ONCE(!memslot->npages)) {                   \
>                  } else
> 
> Unless I'm missing something, this goes all the way back to initial arm64 support
> added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").
> 

The fixes tag was sorted out based on 'git-bisect', not static code analysis. I
agree that, from the code, it should be d5d8184d35c9 ("KVM: ARM: Memory
virtualization setup"). From the 'git-bisect', we found the issue disappears when
the head is commit 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to
generic code"). And yes, the fixes tag should be cd4c71835228 ("KVM: arm64:
Convert to the gfn-based MMU notifier callbacks").

I'm reluctant to fix the issue only for ARM64; more details are given below. If
we're going to limit the issue to ARM64 and fix it for ARM64 only, the fixes tag
should be the one you pointed out. Let's correct it in the next revision with:

   Cc: stable@vger.kernel.org # v3.9+
   Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")

>> Reported-by: Shuai Hu <hshuai@redhat.com>
>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   virt/kvm/kvm_main.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 479802a892d4..7f81a3a209b6 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>   			unsigned long hva_start, hva_end;
>>   
>>   			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>> +			if (slot->flags & KVM_MEMSLOT_INVALID)
>> +				continue;
> 
> Skipping the memslot will lead to use-after-free.  E.g. if an invalidate_range_start()
> comes along between installing the invalid slot and zapping SPTEs, KVM will
> return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
> references to the range.
> 
> I.e.
> 
> 	kvm_copy_memslot(invalid_slot, old);
> 	invalid_slot->flags |= KVM_MEMSLOT_INVALID;
> 	kvm_replace_memslot(kvm, old, invalid_slot);
> 
> 	/*
> 	 * Activate the slot that is now marked INVALID, but don't propagate
> 	 * the slot to the now inactive slots. The slot is either going to be
> 	 * deleted or recreated as a new slot.
> 	 */
> 	kvm_swap_active_memslots(kvm, old->as_id);
> 
> 
> ==> invalidate_range_start()
> 
> 	/*
> 	 * From this point no new shadow pages pointing to a deleted, or moved,
> 	 * memslot will be created.  Validation of sp->gfn happens in:
> 	 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
> 	 *	- kvm_is_visible_gfn (mmu_check_root)
> 	 */
> 	kvm_arch_flush_shadow_memslot(kvm, old);
> 
> And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
> unmap the prior SPTE.  Of course, that can't happen in the current code base
> because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
> more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
> count is elevated in .change_pte()" for details).  That's also why this doesn't
> show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
> SPTE is present, which is never true due to the invalidation.
> 

Right, those architectural dependencies are really something I worried about.
It's safe to skip the invalid memory slots for ARM64, but it may be unsafe to
do so for other architectures. You've listed the potential risks of doing so for
x86. It might be risky with PowerPC's reverse mapping stuff too. I didn't
look into the code for the remaining architectures. In order to avoid the
architectural conflicts, I would move the check and skip the invalid memory
slot in arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(), something like below. In
this way, the guest hang issue in the ARM64 guest is fixed. We may have similar
issues on other architectures, but we can figure out the fix when we have to.
Sean, please let me know if you're happy with this.

arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
----------------------------------------

bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
         kvm_pfn_t pfn = pte_pfn(range->pte);

         if (!kvm->arch.mmu.pgt)
                 return false;

         /*
          * The memory slot can become invalid temporarily or permanently
          * when it's being moved or deleted. Avoid the stage2 mapping so
          * that all the read and write requests to the region of the memory
          * slot can be forwarded to VMM and emulated there.
          */
        if (range->slot->flags & KVM_MEMSLOT_INVALID)
                return false;

        WARN_ON(range->end - range->start != 1);

        :
}

> I'd honestly love to just delete the change_pte() callback, but my opinion is more
> than a bit biased since we don't use KSM.  Assuming we keep change_pte(), the best
> option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
> memslot, but with a sanity check and comment explaining the dependency on there
> being no SPTEs due to the invalidation.  E.g.
> 

It's a good idea, but I'm afraid other architectures like PowerPC will still be
affected. So I would like to limit this issue to ARM64 and fix it in its
kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
could you please share more information about this? I'm blindly guessing you're
saying KSM isn't used in Google Cloud?

[...]

Thanks,
Gavin
  
Sean Christopherson June 13, 2023, 2:58 p.m. UTC | #7
On Tue, Jun 13, 2023, Gavin Shan wrote:
> Hi Sean,
> 
> On 6/13/23 12:41 AM, Sean Christopherson wrote:
> > > Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> > 
> > This Fixes isn't correct.  That change only affected x86, which doesn't have this
> > bug.  And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
> > notifier callbacks"), arm64 did NOT skip invalid slots

...

> > Unless I'm missing something, this goes all the way back to initial arm64 support
> > added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").
> > 
> 
> The fixes tag was sorted out based on 'git-bisect', not static code analysis. I
> agree it should be d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") from
> the code. From the 'git-bisect', we found the issue disappears when the head is
> commit 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code").
> And yes, the fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
> MMU notifier callbacks").

No, just because bisect points at a commit doesn't mean that's the commit that
introduced a bug.  It's not uncommon for a completely valid change to expose a
pre-existing bug, which is what I suspect happened here, e.g. after switching to
the generic framework, clean_dcache_guest_page() is called *after* retrieving the
memslot, versus clean_dcache_guest_page() being called before walking memslots.

Blaming the correct commit matters, both so that it's clear to future readers what
went wrong, and also so that people maintaining older kernels at least are aware
that their kernel may be affected.  E.g. a fix in generic KVM obviously won't
backport to 5.10, but someone who cares deeply about a 5.10-based kernel on arm64
could manually port the fix to KVM arm64 code.

> I'm declined to fix the issue only for ARM64,

I never suggested that an ARM-only fix be made, in fact I explicitly suggested
the exact opposite.

> more details are given below. If we're going to limit the issue to ARM64 and
> fix it for ARM64 only, the fixes tag should be the one as you pointed. Lets
> correct it in next revision with:

For a generic fix, just blame multiple commits, e.g. the guilty arm64, RISC-V,
and MIPS commits, which at a glance are the affected architectures.  At one point
in time, x86 was also likely affected, but that's probably not worth more than a
brief mention in the changelog as I don't see any value in tracking down a very
theoretical window of time where x86 was affected.

>   Cc: stable@vger.kernel.org # v3.9+
>   Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
> 
> > > Reported-by: Shuai Hu <hshuai@redhat.com>
> > > Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > >   virt/kvm/kvm_main.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 479802a892d4..7f81a3a209b6 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > >   			unsigned long hva_start, hva_end;
> > >   			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > +			if (slot->flags & KVM_MEMSLOT_INVALID)
> > > +				continue;
> > 
> > Skipping the memslot will lead to use-after-free.  E.g. if an invalidate_range_start()
> > comes along between installing the invalid slot and zapping SPTEs, KVM will
> > return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
> > references to the range.
> > 
> > I.e.
> > 
> > 	kvm_copy_memslot(invalid_slot, old);
> > 	invalid_slot->flags |= KVM_MEMSLOT_INVALID;
> > 	kvm_replace_memslot(kvm, old, invalid_slot);
> > 
> > 	/*
> > 	 * Activate the slot that is now marked INVALID, but don't propagate
> > 	 * the slot to the now inactive slots. The slot is either going to be
> > 	 * deleted or recreated as a new slot.
> > 	 */
> > 	kvm_swap_active_memslots(kvm, old->as_id);
> > 
> > 
> > ==> invalidate_range_start()
> > 
> > 	/*
> > 	 * From this point no new shadow pages pointing to a deleted, or moved,
> > 	 * memslot will be created.  Validation of sp->gfn happens in:
> > 	 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
> > 	 *	- kvm_is_visible_gfn (mmu_check_root)
> > 	 */
> > 	kvm_arch_flush_shadow_memslot(kvm, old);
> > 
> > And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
> > unmap the prior SPTE.  Of course, that can't happen in the current code base
> > because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
> > more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
> > count is elevated in .change_pte()" for details).  That's also why this doesn't
> > show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
> > SPTE is present, which is never true due to the invalidation.
> > 
> 
> Right, those architectural dependencies are really something I worried about.

The zap case isn't architecture specific; that's true for all KVM architectures.

> It's safe to skip the invalid memory slots for ARM64,

No, it's not.  The problematic window where an invalidation can be incorrectly
skipped is very tiny, and would have zero chance of being seen without something
generating invalidations, e.g. page migration.  But that doesn't mean there's no
bug.

> but it may be unsafe to do so for other architectures. You've listed the
> potential risks to do so for x86. It might be risky with PowerPC's reverse
> mapping stuff either. I didn't look into the code for the left architectures.
> In order to escape from the architectural conflicts, I would move the check
> and skip the invalid memory slot in arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(),
> something like below. In this way, the guest hang issue in ARM64 guest is
> fixed. We may have similar issue on other architectures, but we can figure
> out the fix when we have to.  Sean, please let me know if you're happy with
> this?
> 
> arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
> ----------------------------------------
> 
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
>         kvm_pfn_t pfn = pte_pfn(range->pte);
> 
>         if (!kvm->arch.mmu.pgt)
>                 return false;
> 
>         /*
>          * The memory slot can become invalid temporarily or permanently
>          * when it's being moved or deleted. Avoid the stage2 mapping so
>          * that all the read and write requests to the region of the memory
>          * slot can be forwarded to VMM and emulated there.
>          */
>          if (range->slot->flags & KVM_MEMSLOT_INVALID)
>              return false;

Please no.  (a) it papers over common KVM's reliance on the SPTE being zapped by
invalidate_range_start(), and (b) it leaves the same bug in RISC-V (which copy+pasted
much of arm64's MMU code) and in MIPS (which also installs SPTEs in its callback).

>          WARN_ON(range->end - range->start != 1);
> 
>          :
> }
> 
> > I'd honestly love to just delete the change_pte() callback, but my opinion is more
> > than a bit biased since we don't use KSM.  Assuming we keep change_pte(), the best
> > option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
> > memslot, but with a sanity check and comment explaining the dependency on there
> > being no SPTEs due to the invalidation.  E.g.
> > 
> 
> It's good idea, but I'm afraid other architectures like PowerPC will still be
> affected.

I don't follow.  PPC unmaps in response to a PTE change, but that's effectively
dead code due to change_pte() being wrapped with invalidate_range_{start,end}().

> So I would like to limit this issue to ARM64 and fix it in its
> kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
> could you please share more information about this? I'm blindly guessing
> you're saying KSM isn't used in google cloud?

Yeah, we == Google/GCE.  Sorry, should have clarified that.
  
Gavin Shan June 14, 2023, 12:01 a.m. UTC | #8
Hi Sean,

On 6/14/23 12:58 AM, Sean Christopherson wrote:
> On Tue, Jun 13, 2023, Gavin Shan wrote:
>> On 6/13/23 12:41 AM, Sean Christopherson wrote:
>>>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>>>
>>> This Fixes isn't correct.  That change only affected x86, which doesn't have this
>>> bug.  And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
>>> notifier callbacks"), arm64 did NOT skip invalid slots
> 
> ...
> 
>>> Unless I'm missing something, this goes all the way back to initial arm64 support
>>> added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").
>>>
>>
>> The fixes tag was sorted out based on 'git-bisect', not static code analysis. I
>> agree it should be d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") from
>> the code. From the 'git-bisect', we found the issue disappears when the head is
>> commit 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code").
>> And yes, the fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
>> MMU notifier callbacks").
> 
> No, just because bisect points at a commit doesn't mean that's the commit that
> introduced a bug.  It's not uncommon for a completely valid change to expose a
> pre-existing bug, which is what I suspect happened here, e.g. after switching to
> the generic framework, clean_dcache_guest_page() is called *after* retrieving the
> memslot, versus clean_dcache_guest_page() being called before walking memslots.
> 
> Blaming the correct commit matters, both so that it's clear to future readers what
> went wrong, and also so that people maintaining older kernels at least are aware
> that there kernel may be affected.  E.g. a fix in generic KVM obviously won't
> backport to 5.10, but someone who cares deeply about a 5.10-based kernel on arm64
> could manually port the fix to KVM arm64 code.
> 

Yes, the change of the call site of clean_dcache_guest_page() theoretically
enlarges the race window between memory slot removal and change_pte(). With
this, we were able to reproduce the issue in a practical test. In the next
revision, let's put a note about this in the changelog, like below.

   We tried a git-bisect and the first problematic commit is cd4c71835228
   ("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
   clean_dcache_guest_page() is called after the memory slots are iterated
   in kvm_mmu_notifier_change_pte(). Without this commit, clean_dcache_guest_page()
   is called before the iteration over the memory slots. This change effectively
   enlarges the race window between kvm_mmu_notifier_change_pte() and memory
   slot removal so that we're able to reproduce the issue in a practical test
   case. However, the issue exists since d5d8184d35c9 ("KVM: ARM: Memory
   virtualization setup").
   
   Cc: stable@vger.kernel.org # v3.9+
   Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")

>> I'm declined to fix the issue only for ARM64,
> 
> I never suggested that an ARM-only fix be made, in fact I explicitly suggested
> the exact opposite.
> 

Well, the original patch ignores the invalid memory slot for all MMU notifiers.
You suggested skipping invalid slots only for change_pte(). The scope was narrowed,
and I was following this direction to limit the issue/fix to change_pte() on
ARM64 only. Sorry that I misunderstood your suggestion.

>> more details are given below. If we're going to limit the issue to ARM64 and
>> fix it for ARM64 only, the fixes tag should be the one as you pointed. Lets
>> correct it in next revision with:
> 
> For a generic fix, just blame multiple commits, e.g. the guilty arm64, RISC-V,
> and MIPS commits, which at a glance are the affected architectures.  At one point
> in time, x86 was also likely affected, but that's probably not worth more than a
> brief mention in the changelog as I don't see any value in tracking down a very
> theoretical window of time where x86 was affected.
> 

Right. Please refer to the reply above. I plan to add a section to the changelog
in the next revision.

>>    Cc: stable@vger.kernel.org # v3.9+
>>    Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>
>>>> Reported-by: Shuai Hu <hshuai@redhat.com>
>>>> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>    virt/kvm/kvm_main.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 479802a892d4..7f81a3a209b6 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>>>    			unsigned long hva_start, hva_end;
>>>>    			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>>>> +			if (slot->flags & KVM_MEMSLOT_INVALID)
>>>> +				continue;
>>>
>>> Skipping the memslot will lead to use-after-free.  E.g. if an invalidate_range_start()
>>> comes along between installing the invalid slot and zapping SPTEs, KVM will
>>> return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
>>> references to the range.
>>>
>>> I.e.
>>>
>>> 	kvm_copy_memslot(invalid_slot, old);
>>> 	invalid_slot->flags |= KVM_MEMSLOT_INVALID;
>>> 	kvm_replace_memslot(kvm, old, invalid_slot);
>>>
>>> 	/*
>>> 	 * Activate the slot that is now marked INVALID, but don't propagate
>>> 	 * the slot to the now inactive slots. The slot is either going to be
>>> 	 * deleted or recreated as a new slot.
>>> 	 */
>>> 	kvm_swap_active_memslots(kvm, old->as_id);
>>>
>>>
>>> ==> invalidate_range_start()
>>>
>>> 	/*
>>> 	 * From this point no new shadow pages pointing to a deleted, or moved,
>>> 	 * memslot will be created.  Validation of sp->gfn happens in:
>>> 	 *	- gfn_to_hva (kvm_read_guest, gfn_to_pfn)
>>> 	 *	- kvm_is_visible_gfn (mmu_check_root)
>>> 	 */
>>> 	kvm_arch_flush_shadow_memslot(kvm, old);
>>>
>>> And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
>>> unmap the prior SPTE.  Of course, that can't happen in the current code base
>>> because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
>>> more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
>>> count is elevated in .change_pte()" for details).  That's also why this doesn't
>>> show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
>>> SPTE is present, which is never true due to the invalidation.
>>>
>>
>> Right, those architectural dependencies are really something I worried about.
> 
> The zap case isn't a architecture specific, that's true for all KVM architectures.
> 

Yes.

>> It's safe to skip the invalid memory slots for ARM64,
> 
> No, it's not.  The problematic window where an invalidation can be incorrectly
> skipped is very tiny, and would have zero chance of being seen without something
> generating invalidations, e.g. page migration.  But that doesn't mean there's no
> bug.
> 

Yeah, agreed.

>> but it may be unsafe to do so for other architectures. You've listed the
>> potential risks to do so for x86. It might be risky with PowerPC's reverse
>> mapping stuff either. I didn't look into the code for the left architectures.
>> In order to escape from the architectural conflicts, I would move the check
>> and skip the invalid memory slot in arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(),
>> something like below. In this way, the guest hang issue in ARM64 guest is
>> fixed. We may have similar issue on other architectures, but we can figure
>> out the fix when we have to.  Sean, please let me know if you're happy with
>> this?
>>
>> arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
>> ----------------------------------------
>>
>> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>> {
>>          kvm_pfn_t pfn = pte_pfn(range->pte);
>>
>>          if (!kvm->arch.mmu.pgt)
>>                  return false;
>>
>>          /*
>>           * The memory slot can become invalid temporarily or permanently
>>           * when it's being moved or deleted. Avoid the stage2 mapping so
>>           * that all the read and write requests to the region of the memory
>>           * slot can be forwarded to VMM and emulated there.
>>           */
>>           if (range->slot->flags & KVM_MEMSLOT_INVALID)
>>               return false;
> 
> Please no.  (a) it papers over common KVM's reliance on the SPTE being zapped by
> invalidate_range_start(), and (b) it leaves the same bug in RISC-V (which copy+pasted
> much of arm64's MMU code) and in MIPS (which also installs SPTEs in its callback).
> 

Ok. Now I understand your suggestion for fixing the issue across all architectures.
I will post v3 with your suggested changes.

>>           WARN_ON(range->end - range->start != 1);
>>
>>           :
>> }
>>
>>> I'd honestly love to just delete the change_pte() callback, but my opinion is more
>>> than a bit biased since we don't use KSM.  Assuming we keep change_pte(), the best
>>> option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
>>> memslot, but with a sanity check and comment explaining the dependency on there
>>> being no SPTEs due to the invalidation.  E.g.
>>>
>>
>> It's good idea, but I'm afraid other architectures like PowerPC will still be
>> affected.
> 
> I don't follow.  PPC unmaps in response to a PTE change, but that's effectively
> dead code due to change_pte() being wrapped with invalidate_range_{start,end}().
> 

You're right. I forgot the fact that change_pte() is surrounded by invalidate_range_{start,end}().

>> So I would like to limit this issue to ARM64 and fix it in its
>> kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
>> could you please share more information about this? I'm blindly guessing
>> you're saying KSM isn't used in google cloud?
> 
> Yeah, we == Google/GCE.  Sorry, should have clarified that.
> 

Ok, thanks for your clarification.

Thanks,
Gavin
  

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..7f81a3a209b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -598,6 +598,9 @@  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+			if (slot->flags & KVM_MEMSLOT_INVALID)
+				continue;
+
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));