Message ID: <20221022154819.1823133-1-eesposit@redhat.com>
From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
To: kvm@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>, Jonathan Corbet <corbet@lwn.net>, Maxim Levitsky <mlevitsk@redhat.com>, Sean Christopherson <seanjc@google.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, David Hildenbrand <david@redhat.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm
Date: Sat, 22 Oct 2022 11:48:15 -0400
Series: KVM: API to block and resume all running vcpus in a vm
Message
Emanuele Giuseppe Esposito
Oct. 22, 2022, 3:48 p.m. UTC
This new API allows userspace to stop all running vcpus using the KVM_KICK_ALL_RUNNING_VCPUS ioctl, and resume them with KVM_RESUME_ALL_KICKED_VCPUS. A "running" vcpu is a vcpu that is executing the KVM_RUN ioctl.

This series is especially helpful to userspace hypervisors like QEMU when they need to perform operations on memslots without the risk of a vcpu reading them in the meantime. By "memslot operations" we mean grow, shrink, merge and split of memslots, which are not atomic because there is a time window between the DELETE memslot operation and the CREATE one.

Currently, each memslot operation is performed with one or more ioctls. For example, merging two memslots into one would imply:

  DELETE(m1)
  DELETE(m2)
  CREATE(m1+m2)

A vcpu could attempt to read m2 right after it is deleted, but before the new slot is created.

Therefore the simplest solution is to pause all vcpus on the KVM side, so that:
- userspace just needs to call the new API before making memslot changes, keeping modifications to a minimum
- dirty page updates are also performed while vcpus are blocked, so there is no time window between the dirty page ioctl and the memslot modifications, since the vcpus are all stopped
- there is no need to modify the existing memslot API

Emanuele Giuseppe Esposito (4):
  linux-headers/linux/kvm.h: introduce kvm_userspace_memory_region_list ioctl
  KVM: introduce kvm_clear_all_cpus_request
  KVM: introduce memory transaction semaphore
  KVM: use signals to abort enter_guest/blocking and retry

 Documentation/virt/kvm/vcpu-requests.rst |  3 ++
 arch/x86/include/asm/kvm_host.h          |  2 ++
 arch/x86/kvm/x86.c                       |  8 +++++
 include/uapi/linux/kvm.h                 |  3 ++
 virt/kvm/kvm_main.c                      | 45 ++++++++++++++++++++++++
 5 files changed, 61 insertions(+)
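For concreteness, the merge example above maps onto the existing KVM_SET_USER_MEMORY_REGION ioctl (setting memory_size to 0 deletes a slot), bracketed by the two ioctls this series proposes. This is a hedged sketch, not runnable on its own: vm_fd must come from KVM_CREATE_VM, and the GPA1/SZ1/SZ2/hva1 values are placeholders.

```c
/* Sketch only: requires a live VM fd; GPA1, SZ1, SZ2, hva1, hva2 are placeholders. */
struct kvm_userspace_memory_region m1 = {
        .slot = 1, .guest_phys_addr = GPA1, .memory_size = SZ1,
        .userspace_addr = (__u64)hva1,
};
struct kvm_userspace_memory_region m2 = {
        .slot = 2, .guest_phys_addr = GPA1 + SZ1, .memory_size = SZ2,
        .userspace_addr = (__u64)hva2,
};

ioctl(vm_fd, KVM_KICK_ALL_RUNNING_VCPUS, 0);       /* proposed: park every vcpu in KVM_RUN */

m1.memory_size = 0;
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &m1);     /* DELETE(m1) */
m2.memory_size = 0;
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &m2);     /* DELETE(m2) */

struct kvm_userspace_memory_region merged = {
        .slot = 1, .guest_phys_addr = GPA1, .memory_size = SZ1 + SZ2,
        .userspace_addr = (__u64)hva1,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &merged); /* CREATE(m1+m2) */

ioctl(vm_fd, KVM_RESUME_ALL_KICKED_VCPUS, 0);      /* proposed: let vcpus re-enter the guest */
```

Without the kick/resume pair, a vcpu can fault on m2's GPA range between the second DELETE and the CREATE.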
Comments
On 22.10.22 17:48, Emanuele Giuseppe Esposito wrote:
> This new API allows the userspace to stop all running
> vcpus using KVM_KICK_ALL_RUNNING_VCPUS ioctl, and resume them with
> KVM_RESUME_ALL_KICKED_VCPUS.
> A "running" vcpu is a vcpu that is executing the KVM_RUN ioctl.
> [...]
> Therefore the simplest solution is to pause all vcpus in the kvm
> side, so that:
> - userspace just needs to call the new API before making memslots
>   changes, keeping modifications to the minimum
> - dirty page updates are also performed when vcpus are blocked, so
>   there is no time window between the dirty page ioctl and memslots
>   modifications, since vcpus are all stopped.
> - no need to modify the existing memslots API

Isn't QEMU able to achieve the same goal today by forcing all vCPUs into userspace with a signal? Can you provide some rationale why this is better in the cover letter or patch description?
On 24/10/2022 09:56, Christian Borntraeger wrote:
> On 22.10.22 17:48, Emanuele Giuseppe Esposito wrote:
>> This new API allows the userspace to stop all running
>> vcpus using KVM_KICK_ALL_RUNNING_VCPUS ioctl, and resume them with
>> KVM_RESUME_ALL_KICKED_VCPUS.
>> [...]
> Isn't QEMU able to achieve the same goal today by forcing all vCPUs
> into userspace with a signal? Can you provide some rationale why this
> is better in the cover letter or patch description?

David Hildenbrand tried to propose something similar here:
https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a

While it is not optimized, I think it's more complex than the current series, since QEMU would also have to make sure all running ioctls finish and prevent new ones from getting executed.

Also, we can't use pause_all_vcpus()/resume_all_vcpus() because they drop the BQL.

Would that be ok as rationale?

Thank you,
Emanuele
On 24.10.22 10:33, Emanuele Giuseppe Esposito wrote:
> On 24/10/2022 09:56, Christian Borntraeger wrote:
>> Isn't QEMU able to achieve the same goal today by forcing all vCPUs
>> into userspace with a signal? Can you provide some rationale why this
>> is better in the cover letter or patch description?
>
> David Hildenbrand tried to propose something similar here:
> https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a
>
> While it is not optimized, I think it's more complex than the current
> series, since QEMU would also have to make sure all running ioctls finish
> and prevent new ones from getting executed.
>
> Also we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
> the BQL.
>
> Would that be ok as rationale?

Yes, that helps and should be part of the cover letter for the next iterations.
On Mon, Oct 24, 2022, Christian Borntraeger wrote:
> On 24.10.22 10:33, Emanuele Giuseppe Esposito wrote:
> > On 24/10/2022 09:56, Christian Borntraeger wrote:
> > > > Therefore the simplest solution is to pause all vcpus in the kvm
> > > > side, so that:

Simplest for QEMU maybe, most definitely not simplest for KVM.

> > > > - userspace just needs to call the new API before making memslots
> > > >   changes, keeping modifications to the minimum
> > > > - dirty page updates are also performed when vcpus are blocked, so
> > > >   there is no time window between the dirty page ioctl and memslots
> > > >   modifications, since vcpus are all stopped.
> > > > - no need to modify the existing memslots API
> > > Isn't QEMU able to achieve the same goal today by forcing all vCPUs
> > > into userspace with a signal? Can you provide some rationale why this
> > > is better in the cover letter or patch description?
> >
> > David Hildenbrand tried to propose something similar here:
> > https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a
> >
> > While it is not optimized, I think it's more complex than the current
> > series, since QEMU would also have to make sure all running ioctls finish
> > and prevent new ones from getting executed.
> >
> > Also we can't use pause_all_vcpus()/resume_all_vcpus() because they drop
> > the BQL.
> >
> > Would that be ok as rationale?
>
> Yes, that helps and should be part of the cover letter for the next iterations.

But that doesn't explain why KVM needs to get involved, it only explains why QEMU can't use its existing pause_all_vcpus(). I do not understand why this is a problem QEMU needs KVM's help to solve.
On 10/25/22 00:45, Sean Christopherson wrote:
>> Yes, that helps and should be part of the cover letter for the next iterations.
>
> But that doesn't explain why KVM needs to get involved, it only explains why QEMU
> can't use its existing pause_all_vcpus(). I do not understand why this is a
> problem QEMU needs KVM's help to solve.

I agree that it's not KVM's problem that QEMU cannot use pause_all_vcpus(). Having an ioctl in KVM, rather than coding the same in QEMU, is *mostly* a matter of programmer and computer efficiency, because the code is pretty simple.

That said, I believe the limited memslot API makes it more than just a QEMU problem. Because KVM_GET_DIRTY_LOG cannot be combined atomically with KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log regions while the VM is running is liable to lose the dirty status of some pages. That's also a reason to provide this API in KVM.

Paolo
On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> I agree that it's not KVM's problem that QEMU cannot use pause_all_vcpus().
> Having an ioctl in KVM, rather than coding the same in QEMU, is *mostly* a
> matter of programmer and computer efficiency because the code is pretty
> simple.
>
> That said, I believe the limited memslot API makes it more than just a QEMU
> problem. Because KVM_GET_DIRTY_LOG cannot be combined atomically with
> KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log regions
> while the VM is running is liable to lose the dirty status of some pages.

... and doesn't already do the sane thing and pause vCPUs _and anything else that can touch guest memory_ before modifying memslots. I honestly think QEMU is the only VMM that would ever use this API.

> That's also a reason to provide this API in KVM.

It's frankly a terrible API though. Providing a way to force vCPUs out of KVM_RUN is at best half of the solution. Userspace still needs:

 - a refcounting scheme to track the number of "holds" put on the system
 - serialization to ensure KVM_RESUME_ALL_KICKED_VCPUS completes before a new
   KVM_KICK_ALL_RUNNING_VCPUS is initiated
 - to prevent _all_ ioctls(), because it's not just KVM_RUN that consumes memslots
 - to stop anything else in the system that consumes KVM memslots, e.g. KVM-GT
 - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
   outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)

And because of the nature of KVM, to support this API on all architectures, KVM needs to make changes on all architectures, whereas userspace should be able to implement a generic solution.
On 10/25/22 17:55, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Paolo Bonzini wrote:
>> That said, I believe the limited memslot API makes it more than just a QEMU
>> problem. Because KVM_GET_DIRTY_LOG cannot be combined atomically with
>> KVM_SET_USER_MEMORY_REGION(MR_DELETE), any VMM that uses dirty-log regions
>> while the VM is running is liable to lose the dirty status of some pages.
>
> ... and doesn't already do the sane thing and pause vCPUs _and anything else that
> can touch guest memory_ before modifying memslots. I honestly think QEMU is the
> only VMM that would ever use this API.
>
> Providing a way to force vCPUs out of KVM_RUN is at best half of the solution.

I agree this is not a full solution (and I do want to remove KVM_RESUME_ALL_KICKED_VCPUS).

> - a refcounting scheme to track the number of "holds" put on the system
> - serialization to ensure KVM_RESUME_ALL_KICKED_VCPUS completes before a new
>   KVM_KICK_ALL_RUNNING_VCPUS is initiated

Both of these can be just a mutex. The others are potentially more interesting, but I'm not sure I understand them:

> - to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots

This is perhaps an occasion to solve another disagreement: I still think that accessing memory outside KVM_RUN (for example KVM_SET_NESTED_STATE loading the APICv pages from VMCS12) is a bug; on the other hand, we disagreed on that and you wanted to kill KVM_REQ_GET_NESTED_STATE_PAGES.

> - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT

Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it a guest bug to access the memory (i.e. ignore the strange read-only changes which only happen at boot, and which I agree are QEMU-specific)?

> - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
>   outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)

This is the more important one, but why would it livelock?

> And because of the nature of KVM, to support this API on all architectures, KVM
> needs to make changes on all architectures, whereas userspace should be able to
> implement a generic solution.

Yes, I agree that this is essentially just a more efficient kill(). Emanuele, perhaps you can put together a patch to x86/vmexit.c in kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in a for(;;) busy wait, to measure the various ways to do it?

Paolo
On Tue, Oct 25, 2022, Paolo Bonzini wrote:
> > - to prevent _all_ ioctls() because it's not just KVM_RUN that consumes memslots
>
> This is perhaps an occasion to solve another disagreement: I still think
> that accessing memory outside KVM_RUN (for example KVM_SET_NESTED_STATE
> loading the APICv pages from VMCS12) is a bug; on the other hand, we
> disagreed on that and you wanted to kill KVM_REQ_GET_NESTED_STATE_PAGES.

I don't think it's realistic to make accesses outside of KVM_RUN go away, e.g. see the ARM ITS discussion in the dirty ring thread. kvm_xen_set_evtchn() also explicitly depends on writing guest memory without going through KVM_RUN (and apparently can be invoked from a kernel thread?!?).

In theory, I do actually like the idea of restricting memory access to KVM_RUN, but in reality I just think that forcing everything into KVM_RUN creates far more problems than it solves. E.g. my complaint with KVM_REQ_GET_NESTED_STATE_PAGES is that instead of synchronously telling userspace it has a problem, KVM chugs along as if everything is fine and only fails at a later point in time. I doubt userspace would actually do anything differently, i.e. the VM is likely hosed no matter what, but deferring work adds complexity in KVM and makes it more difficult to debug problems when they occur.

> > - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
>
> Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
> a guest bug to access the memory (i.e. ignore the strange read-only changes
> which only happen at boot, and which I agree are QEMU-specific)?

Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is.

> > - to signal vCPU tasks so that the system doesn't livelock if a vCPU is stuck
> >   outside of KVM, e.g. in get_user_pages_unlocked() (Peter Xu's series)
>
> This is the more important one, but why would it livelock?

Livelock may not be the right word. Peter's series is addressing a scenario where a vCPU gets stuck faulting in a page because the page never arrives over the network. The solution is to recognize non-fatal signals while trying to fault in the page. KVM_KICK_ALL_RUNNING_VCPUS doesn't handle that case because it's obviously not realistic to check for pending KVM requests while buried deep in mm/ code. I.e. userspace also needs to send SIGUSR1 or whatever to ensure all vCPUs get kicked out of non-KVM code. That's not the end of the world, and they probably end up being orthogonal things in userspace code, but it yields a weird API, because KVM_KICK_ALL_RUNNING_VCPUS ends up with the caveat of "oh, by the way, userspace also needs to signal all vCPU tasks too, otherwise KVM_KICK_ALL_RUNNING_VCPUS might hang".

> > And because of the nature of KVM, to support this API on all architectures, KVM
> > needs to make changes on all architectures, whereas userspace should be able to
> > implement a generic solution.
>
> Yes, I agree that this is essentially just a more efficient kill().
> Emanuele, perhaps you can put together a patch to x86/vmexit.c in
> kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
> a for(;;) busy wait, to measure the various ways to do it?

I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide a more performant solution, something else entirely? I.e. why measure the performance of x86/vmexit.c? I have a hard time believing the overhead of pausing vCPUs is going to be the long pole when it comes to memslot changes. I assume rebuilding KVM's page tables because of the "zap all" behavior would completely dwarf any overhead from pausing vCPUs.
On 10/26/22 01:07, Sean Christopherson wrote:
> I don't think it's realistic to make accesses outside of KVM_RUN go away, e.g.
> see the ARM ITS discussion in the dirty ring thread. kvm_xen_set_evtchn() also
> explicitly depends on writing guest memory without going through KVM_RUN (and
> apparently can be invoked from a kernel thread?!?).

Yeah, those are the pages that must be considered dirty when using the dirty ring.

> [...]
> > > - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
> >
> > Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
> > a guest bug to access the memory (i.e. ignore the strange read-only changes
> > which only happen at boot, and which I agree are QEMU-specific)?
>
> Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is.

It is not possible to atomically read the dirty bitmap and delete a memslot: when you delete a memslot, the bitmap is gone. In this case, however, memory accesses to the deleted memslot are a guest bug, so stopping KVM-GT would not be necessary.

So while I'm being slowly convinced that QEMU should find a way to pause its vCPUs around memslot changes, I'm not sure that pausing everything is needed in general.

> > Yes, I agree that this is essentially just a more efficient kill().
> > Emanuele, perhaps you can put together a patch to x86/vmexit.c in
> > kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in
> > a for(;;) busy wait, to measure the various ways to do it?
>
> I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide
> a more performant solution, something else entirely?

Well, a bit of all of them, and perhaps that's the problem. And while the issues at hand *are* self-inflicted wounds on the part of QEMU, it seems to me that the underlying issues are general.

For example, Alex Graf and I looked back at your proposal of a userspace exit for "bad" accesses to memory, wondering if it could help with Hyper-V VTLs too. To recap, the "higher privileged" code at VTL1 can set up VM-wide restrictions on access to some pages through a hypercall (HvModifyVtlProtectionMask). After the hypercall, VTL0 would not be able to access those pages. The hypercall would be handled in userspace and would invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX permissions, and this ioctl would set up a VM-wide permission bitmap that would be used when building page tables.

Using such a bitmap instead of memslots makes it possible to cause userspace vmexits on VTL mapping violations with efficient data structures. And it would also be possible to use this mechanism around KVM_GET_DIRTY_LOG, to read the KVM dirty bitmap just before removing a memslot. However, external accesses to the regions (ITS, Xen, KVM-GT, non-KVM_RUN ioctls) would not be blocked, due to the lack of a way to report the exit.

The intersection of these features with VTLs should be very small (sometimes zero, since VTLs are x86-only), but the ioctls would be a problem, so I'm wondering what your thoughts are on this.

Also, while the exit API could be the same, it is not clear to me that the permission bitmap would be a good match for entirely "void" memslots used to work around non-atomic memslot changes. So for now let's leave this aside and only consider the KVM_GET_DIRTY_LOG case.

Paolo
On Wed, Oct 26, 2022, Paolo Bonzini wrote:
> > > - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT
> >
> > Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it
> > a guest bug to access the memory (i.e. ignore the strange read-only changes
> > which only happen at boot, and which I agree are QEMU-specific)?
>
> It is not possible to atomically read the dirty bitmap and delete a memslot:
> when you delete a memslot, the bitmap is gone. In this case, however, memory
> accesses to the deleted memslot are a guest bug, so stopping KVM-GT would
> not be necessary.

If accesses to the deleted memslot are a guest bug, why do you care about pausing vCPUs? I don't mean to be belligerent, I'm genuinely confused.

> So while I'm being slowly convinced that QEMU should find a way to pause its
> vCPUs around memslot changes, I'm not sure that pausing everything is needed
> in general.
>
> [...]
>
> Using such a bitmap instead of memslots makes it possible to cause userspace
> vmexits on VTL mapping violations with efficient data structures. And it
> would also be possible to use this mechanism around KVM_GET_DIRTY_LOG, to
> read the KVM dirty bitmap just before removing a memslot.

What exactly is the behavior you're trying to achieve for KVM_GET_DIRTY_LOG => delete? If KVM provides KVM_EXIT_MEMORY_FAULT, can you not achieve the desired behavior by doing mprotect(PROT_NONE) => KVM_GET_DIRTY_LOG => delete? If PROT_NONE causes the memory to be freed, won't mprotect(PROT_READ) do what you want even without KVM_EXIT_MEMORY_FAULT?

> However, external accesses to the regions (ITS, Xen, KVM-GT, non-KVM_RUN
> ioctls) would not be blocked, due to the lack of a way to report the exit.

Aren't all of those out of scope? E.g. in a very hypothetical world where Xen's event channel is being used with VTLs, if VTL1 makes the event channel inaccessible, that's a guest and/or userspace configuration issue and the guest is hosed no matter what KVM does. Ditto for the case where KVM-GT's buffer is blocked. I'm guessing the ITS is similar?

> The intersection of these features with VTLs should be very small (sometimes
> zero, since VTLs are x86-only), but the ioctls would be a problem, so I'm
> wondering what your thoughts are on this.

How do the ioctls() map to VTLs? I.e. are they considered VTL0, VTL1, out-of-band?

> Also, while the exit API could be the same, it is not clear to me that the
> permission bitmap would be a good match for entirely "void" memslots used to
> work around non-atomic memslot changes. So for now let's leave this aside
> and only consider the KVM_GET_DIRTY_LOG case.

As above, can't userspace just mprotect() the entire memslot to prevent writes between getting the dirty log and deleting the memslot?