[06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
Message ID: 20230921203331.3746712-7-seanjc@google.com
State: New
Series: KVM: guest_memfd fixes
Commit Message
Sean Christopherson
Sept. 21, 2023, 8:33 p.m. UTC
Remove the restriction that a guest_memfd instance that supports hugepages
can *only* be bound by memslots that are 100% compatible with hugepage
mappings, and instead force KVM to use an order-0 mapping if the binding
isn't compatible with hugepages.
The intent of the draconian binding restriction was purely to simplify the
guest_memfd implementation, e.g. to avoid repeating the existing logic in
KVM x86 for precisely tracking which GFNs support hugepages. But
checking that the binding's offset and size are compatible is just as easy
to do when KVM wants to create a mapping.
And on the other hand, completely rejecting bindings that are incompatible
with hugepages makes it practically impossible for userspace to use a
single guest_memfd instance for all guest memory, e.g. on x86 it would be
impossible to skip the legacy VGA hole while still allowing hugepage
mappings for the rest of guest memory.
Suggested-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/all/20230918163647.m6bjgwusc7ww5tyu@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/guest_mem.c | 54 ++++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 27 deletions(-)
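
To make the effect concrete, here is a minimal userspace sketch of the scenario the changelog describes: one hugepage-capable guest_memfd backing all guest RAM, with the post-VGA-hole slot bound at a file offset that is only 4KiB-aligned. The UAPI names (KVM_CREATE_GUEST_MEMFD, KVM_SET_USER_MEMORY_REGION2, KVM_MEM_PRIVATE, the struct field names) are those used elsewhere in this in-flight series and were still subject to change, so treat them as illustrative rather than a stable ABI. With this patch the second bind succeeds and KVM simply falls back to order-0 mappings for that slot.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative only: UAPI names per this in-flight series, not a stable ABI. */
static int bind_guest_ram(int vm_fd, uint64_t shared_va)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = 0x80000000,	/* 2GiB; 2MiB-aligned, so the flag is OK */
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* Low RAM below the VGA hole: offset 0x0 is hugepage-aligned. */
	struct kvm_userspace_memory_region2 slot0 = {
		.slot = 0, .flags = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0x0,      .memory_size = 0xc0000,
		.userspace_addr  = shared_va,
		.guest_memfd = gmem_fd,      .guest_memfd_offset = 0x0,
	};

	/*
	 * RAM above the hole: offset 0x100000 is only 4KiB-aligned. Before
	 * this patch the bind was rejected outright; now it succeeds and
	 * kvm_gmem_get_pfn() clamps *max_order to 0 for this binding.
	 */
	struct kvm_userspace_memory_region2 slot5 = {
		.slot = 5, .flags = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0x100000, .memory_size = 0x7ff00000,
		.userspace_addr  = shared_va + 0x100000,
		.guest_memfd = gmem_fd,      .guest_memfd_offset = 0x100000,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &slot0) ||
	    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &slot5))
		return -1;
	return gmem_fd;
}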
Comments
On Thu, Sep 21, 2023 at 01:33:23PM -0700, Sean Christopherson wrote:
> Remove the restriction that a guest_memfd instance that supports hugepages
> can *only* be bound by memslots that are 100% compatible with hugepage
> mappings, and instead force KVM to use an order-0 mapping if the binding
> isn't compatible with hugepages.
>
> [...]
>
> +	/*
> +	 * For simplicity, allow mapping a hugepage if and only if the entire
> +	 * binding is compatible, i.e. don't bother supporting mapping interior
> +	 * sub-ranges with hugepages (unless userspace comes up with a *really*
> +	 * strong use case for needing hugepages within unaligned bindings).
> +	 */
> +	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> +	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
> +		*max_order = 0;

Thanks for working this in. Unfortunately on x86 the bulk of guest memory
ends up getting slotted directly above legacy regions at GFN 0x100, so the
associated slot still ends up failing these alignment checks if it tries
to match the gmemfd offsets up with the shared RAM/memfd offsets.

I tried to work around it in userspace by padding the gmemfd offset of
each slot to the next 2M boundary, but that also requires dynamically
growing the gmemfd inode to account for the padding of each new slot and
it gets ugly enough that I'm not sure it's any better than your
suggested alternative of using a unique gmemfd for each slot.

But what if we relax the check to simply make sure that any large folio
must be fully contained by the range the slot is bound to? It *seems*
like that would still avoid stuff like mapping 2M pages in the NPT (or
setting up 2M RMP table entries) that aren't fully contained by a slot
while still allowing the bulk of guest memory to get mapped as 2M. Are
there other edge cases to consider?

The following seems to work for a basic 16GB SNP guest at least:

diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 9109bf5751ee..e73128d4ebc2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -618,6 +618,7 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
 {
 	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+	pgoff_t huge_index;
 	struct kvm_gmem *gmem;
 	struct folio *folio;
 	struct page *page;
@@ -662,9 +663,12 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	 * sub-ranges with hugepages (unless userspace comes up with a *really*
 	 * strong use case for needing hugepages within unaligned bindings).
 	 */
-	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
-	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
+	huge_index = round_down(index, 1ull << *max_order);
+	if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) {
+		pr_debug("%s: GFN %llx failed alignment checks\n", __func__, gfn);
 		*max_order = 0;
+	}

 success:
 	r = 0;

-Mike

> +success:
> 	r = 0;
>
> out_unlock:
> --
> 2.42.0.515.g380fc7ccd1-goog
>
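
To sanity-check the containment rule proposed above, the arithmetic can be exercised in isolation. The sketch below is a standalone userspace program, not kernel code: ALIGN_UP and ROUND_DOWN are local stand-ins for the kernel's ALIGN() and round_down(), and gmem_order() is a hypothetical extraction of the check. It uses the slot numbers that come up later in the thread, a binding at file offset 0x100000 (pgoff 0x100) spanning 0x7ff00 pages: a fault near the unaligned head of the slot is demoted to order 0, while an interior fault keeps the 2MiB order.

#include <stdio.h>
#include <stdint.h>

/* Local stand-ins for the kernel's ALIGN() (round up) and round_down(). */
#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((uint64_t)(a) - 1))
#define ROUND_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))

/* Hypothetical extraction: usable mapping order for a fault at 'index' (pages). */
static int gmem_order(uint64_t pgoff, uint64_t npages, uint64_t index, int order)
{
	uint64_t nr = 1ULL << order;
	uint64_t huge_index = ROUND_DOWN(index, nr);

	/* The huge folio must lie entirely inside [pgoff, pgoff + npages). */
	if (huge_index < ALIGN_UP(pgoff, nr) || huge_index + nr > pgoff + npages)
		return 0;
	return order;
}

int main(void)
{
	uint64_t pgoff = 0x100, npages = 0x7ff00;	/* slot 5 in the thread */

	/* Head of the slot: the covering 2MiB folio starts at page 0, below
	 * pgoff 0x100, so the fault is demoted to 4KiB (order 0). */
	printf("index 0x180 -> order %d\n", gmem_order(pgoff, npages, 0x180, 9));

	/* Interior fault: folio [0x400, 0x600) sits inside the binding,
	 * so the full 2MiB order (9) survives. */
	printf("index 0x480 -> order %d\n", gmem_order(pgoff, npages, 0x480, 9));
	return 0;
}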
On Fri, Sep 22, 2023, Michael Roth wrote:
> On Thu, Sep 21, 2023 at 01:33:23PM -0700, Sean Christopherson wrote:
> > +	/*
> > +	 * For simplicity, allow mapping a hugepage if and only if the entire
> > +	 * binding is compatible, i.e. don't bother supporting mapping interior
> > +	 * sub-ranges with hugepages (unless userspace comes up with a *really*
> > +	 * strong use case for needing hugepages within unaligned bindings).
> > +	 */
> > +	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> > +	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
> > +		*max_order = 0;
>
> Thanks for working this in. Unfortunately on x86 the bulk of guest memory
> ends up getting slotted directly above legacy regions at GFN 0x100,

Can you provide an example? I'm struggling to understand what the layout actually
is. I don't think it changes the story for the kernel, but it sounds like there
might be room for optimization in QEMU? Or more likely, I just don't understand
what you're saying :-)

> so the associated slot still ends up failing these alignment checks if it
> tries to match the gmemfd offsets up with the shared RAM/memfd offsets.
>
> [...]
>
> -	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> -	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
> +	huge_index = round_down(index, 1ull << *max_order);

Why not use ALIGN() here? The size is obviously a power-of-2. Or is my math
even worse than I thought?

> +	if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
> +	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) {

Argh, I keep forgetting that the MMU is responsible for handling misaligned gfns.
Yeah, this looks right.

Can you post this as a proper patch, on top of my fixes? And without the pr_debug().
That'll be the easiest for me to apply+squash when the time comes.

Thanks much!
On Thu, Sep 28, 2023 at 11:31:51AM -0700, Sean Christopherson wrote:
> On Fri, Sep 22, 2023, Michael Roth wrote:
> > Thanks for working this in. Unfortunately on x86 the bulk of guest memory
> > ends up getting slotted directly above legacy regions at GFN 0x100,
>
> Can you provide an example? I'm struggling to understand what the layout actually
> is. I don't think it changes the story for the kernel, but it sounds like there
> might be room for optimization in QEMU? Or more likely, I just don't understand
> what you're saying :-)

Here's one example, which seems to be fairly normal for an x86 boot:

  kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
    ^ QEMU creates Slot 0 for all of main guest RAM
  kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x0 gpa=0x0 size=0x0 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
  kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0xc0000 ua=0x7f24afc00000 ret=0 restricted_fd=19 restricted_offset=0x0
  kvm_set_user_memory AddrSpace#0 Slot#3 flags=0x6 gpa=0xc0000 size=0x20000 ua=0x7f2575000000 ret=0 restricted_fd=33 restricted_offset=0x0
  kvm_set_user_memory AddrSpace#0 Slot#4 flags=0x6 gpa=0xe0000 size=0x20000 ua=0x7f2575400000 ret=0 restricted_fd=31 restricted_offset=0x0
    ^ legacy regions are created and mapped on top of GPA ranges [0xc0000:0xe0000) and [0xe0000:0x100000)
  kvm_set_user_memory AddrSpace#0 Slot#5 flags=0x4 gpa=0x100000 size=0x7ff00000 ua=0x7f24afd00000 ret=0 restricted_fd=19 restricted_offset=0x100000
    ^ QEMU divides Slot 0 into Slot 0 at [0x0:0xc0000) and Slot 5 at [0x100000:0x80000000)

Both slots still share the same backing memory allocation, so the same gmem
fd 19 is used, but Slot 5 is assigned to offset 0x100000, which is not
2M-aligned.

I tried messing with QEMU handling to pad out guest_memfd offsets to 2MB
boundaries but then the inode size needs to be enlarged to account for it
and things get a bit messy. Not sure if there are alternative approaches
that can be taken from userspace, but with normal malloc()'d or mmap()'d
backing memory the kernel can still allocate a 2MB backing page for the
[0x0:0x200000) range and I think KVM still handles that when setting up
NPT of sub-ranges, so there might not be much room for further optimization
there.

> > [...]
> > -	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> > -	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
> > +	huge_index = round_down(index, 1ull << *max_order);
>
> Why not use ALIGN() here? The size is obviously a power-of-2. Or is my math
> even worse than I thought?

I actually only originally used round_down() because kvm_gmem_get_huge_folio()
was taking that approach, but I still ended up using ALIGN() below so sorry if
the inconsistency caused any confusion. I switched to using ALIGN() above and
it works fine.

> > +	if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
> > +	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) {
>
> Argh, I keep forgetting that the MMU is responsible for handling misaligned gfns.
> Yeah, this looks right.
>
> Can you post this as a proper patch, on top of my fixes? And without the pr_debug().
> That'll be the easiest for me to apply+squash when the time comes.

Sure, I submitted a revised patch on top of kvm-x86:

  https://lore.kernel.org/lkml/20231002133342.195882-1-michael.roth@amd.com/T/#u

I ran into a separate issue trying to test it and submitted a patch for
that here:

  https://lore.kernel.org/lkml/20231002133230.195738-1-michael.roth@amd.com/T/#u

-Mike

> Thanks much!
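
For reference, Slot 5's failure against the posted check falls straight out of the trace above. A quick standalone sketch of the arithmetic (IS_ALIGNED_U64 is a local stand-in for the kernel's IS_ALIGNED(); this is illustrative, not kernel code):

#include <stdio.h>
#include <stdint.h>

#define IS_ALIGNED_U64(x, a) (((x) & ((uint64_t)(a) - 1)) == 0)

int main(void)
{
	/* Slot 5 from the trace: offset 0x100000, size 0x7ff00000, 4KiB pages. */
	uint64_t pgoff  = 0x100000 >> 12;	/* 0x100 pages   */
	uint64_t npages = 0x7ff00000 >> 12;	/* 0x7ff00 pages */
	uint64_t nr_2m  = 1ULL << 9;		/* 512 pages per 2MiB folio */

	/* The posted check demands both be 2MiB-aligned; neither is. */
	printf("pgoff aligned:  %d\n", IS_ALIGNED_U64(pgoff, nr_2m));	/* 0 */
	printf("npages aligned: %d\n", IS_ALIGNED_U64(npages, nr_2m));	/* 0 */
	return 0;
}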
On Mon, Oct 02, 2023, Michael Roth wrote:
> On Thu, Sep 28, 2023 at 11:31:51AM -0700, Sean Christopherson wrote:
> > Can you provide an example? I'm struggling to understand what the layout actually
> > is. I don't think it changes the story for the kernel, but it sounds like there
> > might be room for optimization in QEMU? Or more likely, I just don't understand
> > what you're saying :-)
>
> Here's one example, which seems to be fairly normal for an x86 boot:
>
> [...]
>
> Both slots still share the same backing memory allocation, so the same gmem
> fd 19 is used, but Slot 5 is assigned to offset 0x100000, which is not
> 2M-aligned.
>
> I tried messing with QEMU handling to pad out guest_memfd offsets to 2MB
> boundaries but then the inode size needs to be enlarged to account for it
> and things get a bit messy. Not sure if there are alternative approaches
> that can be taken from userspace, but with normal malloc()'d or mmap()'d
> backing memory the kernel can still allocate a 2MB backing page for the
> [0x0:0x200000) range and I think KVM still handles that when setting up
> NPT of sub-ranges, so there might not be much room for further optimization
> there.

Oooh, duh. QEMU intentionally creates a gap for the VGA and/or BIOS holes,
and so the lower DRAM chunk that goes from the end of the system reserved
chunk to TOLUD is started at an unaligned offset, even though 99% of the
slot is properly aligned.

Yeah, KVM definitely needs to support that.

Requiring userspace to align based on the hugepage size could work, e.g.
QEMU could divide slot 5 into N slots, to end up with a series of slots to
get from 4KiB aligned => 2MiB aligned => 1GiB aligned. But pushing for that
would be beyond stubborn. Thanks for being patient :-)
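
For completeness, the dismissed userspace alternative would look roughly like the hypothetical helper below: carve the unaligned head off into its own order-0 slot so the remainder starts on a 2MiB boundary. With the offsets from the trace, the body range [0x200000, 0x80000000) happens to satisfy both alignment checks of the posted patch, which is exactly the extra slot juggling the containment check makes unnecessary.

#include <stdint.h>

#define SZ_2M 0x200000ULL
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Hypothetical range descriptor for a guest_memfd binding (offset in bytes). */
struct gmem_range { uint64_t offset, size; };

/*
 * Sketch of the rejected workaround: split one unaligned binding into an
 * order-0 head slot plus a 2MiB-aligned body slot. Returns the slot count.
 */
static int split_for_alignment(uint64_t offset, uint64_t size,
			       struct gmem_range out[2])
{
	uint64_t body = ALIGN_UP(offset, SZ_2M);	/* 0x100000 -> 0x200000 */

	if (body >= offset + size) {			/* too small to split */
		out[0] = (struct gmem_range){ offset, size };
		return 1;
	}
	out[0] = (struct gmem_range){ offset, body - offset };	   /* 4KiB maps  */
	out[1] = (struct gmem_range){ body, offset + size - body }; /* 2MiB-able */
	return 2;
}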
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 68528e9cddd7..4f3a313f5532 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -434,20 +434,6 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
 	return err;
 }
 
-static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
-{
-	if (size < 0 || !PAGE_ALIGNED(size))
-		return false;
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
-	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
-		return false;
-#endif
-
-	return true;
-}
-
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
@@ -460,9 +446,15 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-	if (!kvm_gmem_is_valid_size(size, flags))
+	if (size < 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+		return -EINVAL;
+#endif
+
 	return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
 }
 
@@ -470,7 +462,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset)
 {
 	loff_t size = slot->npages << PAGE_SHIFT;
-	unsigned long start, end, flags;
+	unsigned long start, end;
 	struct kvm_gmem *gmem;
 	struct inode *inode;
 	struct file *file;
@@ -489,16 +481,9 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		goto err;
 
 	inode = file_inode(file);
-	flags = (unsigned long)inode->i_private;
 
-	/*
-	 * For simplicity, require the offset into the file and the size of the
-	 * memslot to be aligned to the largest possible page size used to back
-	 * the file (same as the size of the file itself).
-	 */
-	if (!kvm_gmem_is_valid_size(offset, flags) ||
-	    !kvm_gmem_is_valid_size(size, flags))
-		goto err;
+	if (offset < 0 || !PAGE_ALIGNED(offset))
+		return -EINVAL;
 
 	if (offset + size > i_size_read(inode))
 		goto err;
@@ -599,8 +584,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	page = folio_file_page(folio, index);
 
 	*pfn = page_to_pfn(page);
-	if (max_order)
-		*max_order = compound_order(compound_head(page));
+	if (!max_order)
+		goto success;
+
+	*max_order = compound_order(compound_head(page));
+	if (!*max_order)
+		goto success;
+
+	/*
+	 * For simplicity, allow mapping a hugepage if and only if the entire
+	 * binding is compatible, i.e. don't bother supporting mapping interior
+	 * sub-ranges with hugepages (unless userspace comes up with a *really*
+	 * strong use case for needing hugepages within unaligned bindings).
+	 */
+	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
+	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
+		*max_order = 0;
+success:
 	r = 0;
 
 out_unlock: