Message ID | 20231027182217.3615211-18-seanjc@google.com |
---|---|
State | New |
Subject | [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory |
Series | KVM: guest_memfd() and per-page attributes |
Commit Message
Sean Christopherson
Oct. 27, 2023, 6:21 p.m. UTC
Extended guest_memfd to allow backing guest memory with transparent
hugepages. Require userspace to opt-in via a flag even though there's no
known/anticipated use case for forcing small pages as THP is optional,
i.e. to avoid ending up in a situation where userspace is unaware that
KVM can't provide hugepages.
For simplicity, require the guest_memfd size to be a multiple of the
hugepage size, e.g. so that KVM doesn't need to do bounds checking when
deciding whether or not to allocate a huge folio.
When reporting the max order when KVM gets a pfn from guest_memfd, force
order-0 pages if the hugepage is not fully contained by the memslot
binding, e.g. if userspace requested hugepages but punches a hole in the
memslot bindings in order to emulate x86's VGA hole.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
Documentation/virt/kvm/api.rst | 7 ++++
include/uapi/linux/kvm.h | 2 +
virt/kvm/guest_memfd.c | 73 ++++++++++++++++++++++++++++++----
3 files changed, 75 insertions(+), 7 deletions(-)
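For orientation before the review discussion below, here is a minimal, hypothetical userspace sketch of the opt-in described in the commit message. It assumes <linux/kvm.h> from a kernel with this series applied; the helper name and lack of error handling are illustrative only and not part of the patch.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Create a guest_memfd that opts in to hugepages.  "vm_fd" is a VM file
 * descriptor from KVM_CREATE_VM; "size" must be a multiple of the hugepage
 * size (the 2 MiB PMD size on x86) per the patch's alignment check.
 */
static int create_gmem_with_hugepages(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = size,
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE, /* opt-in flag added by this patch */
	};

	/* Returns a new guest_memfd file descriptor on success, -1 and errno on failure. */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}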
Comments
On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> Extended guest_memfd to allow backing guest memory with transparent
> hugepages. Require userspace to opt-in via a flag even though there's no
> known/anticipated use case for forcing small pages as THP is optional,
> i.e. to avoid ending up in a situation where userspace is unaware that
> KVM can't provide hugepages.

Personally, it seems not so "transparent" if it requires userspace to opt in.

People need to 1) check if the kernel was built with TRANSPARENT_HUGEPAGE
support, or check if the transparent hugepage sysfs entries exist; 2) get the
maximum supported hugepage size; 3) ensure the size satisfies the alignment;
before opting in.

Even simpler, userspace can blindly try to create the guest memfd with the
transparent hugepage flag. If it gets an error, fall back to creating it
without the transparent hugepage flag.

However, it doesn't look transparent to me.
On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > Extended guest_memfd to allow backing guest memory with transparent
> > hugepages. Require userspace to opt-in via a flag even though there's no
> > known/anticipated use case for forcing small pages as THP is optional,
> > i.e. to avoid ending up in a situation where userspace is unaware that
> > KVM can't provide hugepages.
>
> Personally, it seems not so "transparent" if requiring userspace to opt-in.
>
> People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE
> support, or check is the sysfs of transparent hugepage exists; 2)get the
> maximum support hugepage size 3) ensure the size satisfies the alignment;
> before opt-in it.
>
> Even simpler, userspace can blindly try to create guest memfd with
> transparent hugapage flag. If getting error, fallback to create without the
> transparent hugepage flag.
>
> However, it doesn't look transparent to me.

The "transparent" part is referring to the underlying kernel mechanism, it's not
saying anything about the API. The "transparent" part of THP is that the kernel
doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
(mostly) transparent to userspace.

Paolo also isn't the biggest fan[*], but there are also downsides to always
allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
and there's precedent in the form of MADV_HUGEPAGE.

[*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com
On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Xiaoyao Li wrote:
>> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
>>> Extended guest_memfd to allow backing guest memory with transparent
>>> hugepages. Require userspace to opt-in via a flag even though there's no
>>> known/anticipated use case for forcing small pages as THP is optional,
>>> i.e. to avoid ending up in a situation where userspace is unaware that
>>> KVM can't provide hugepages.
>>
>> Personally, it seems not so "transparent" if requiring userspace to opt-in.
>>
>> People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE
>> support, or check is the sysfs of transparent hugepage exists; 2)get the
>> maximum support hugepage size 3) ensure the size satisfies the alignment;
>> before opt-in it.
>>
>> Even simpler, userspace can blindly try to create guest memfd with
>> transparent hugapage flag. If getting error, fallback to create without the
>> transparent hugepage flag.
>>
>> However, it doesn't look transparent to me.
>
> The "transparent" part is referring to the underlying kernel mechanism, it's not
> saying anything about the API. The "transparent" part of THP is that the kernel
> doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> (mostly) transparent to userspace.
>
> Paolo also isn't the biggest fan[*], but there are also downsides to always
> allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> and there's precedent in the form of MADV_HUGEPAGE.
>
> [*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com

But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
failure of MADV_HUGEPAGE is not fatal, user space can ignore it and continue.

However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
to failure of guest memfd creation.

For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?
On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > Extended guest_memfd to allow backing guest memory with transparent
> > > > hugepages. Require userspace to opt-in via a flag even though there's no
> > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > KVM can't provide hugepages.
> > >
> > > Personally, it seems not so "transparent" if requiring userspace to opt-in.
> > >
> > > People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE
> > > support, or check is the sysfs of transparent hugepage exists; 2)get the
> > > maximum support hugepage size 3) ensure the size satisfies the alignment;
> > > before opt-in it.
> > >
> > > Even simpler, userspace can blindly try to create guest memfd with
> > > transparent hugapage flag. If getting error, fallback to create without the
> > > transparent hugepage flag.
> > >
> > > However, it doesn't look transparent to me.
> >
> > The "transparent" part is referring to the underlying kernel mechanism, it's not
> > saying anything about the API. The "transparent" part of THP is that the kernel
> > doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
> > (mostly) transparent to userspace.
> >
> > Paolo also isn't the biggest fan[*], but there are also downsides to always
> > allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
> > and there's precedent in the form of MADV_HUGEPAGE.
> >
> > [*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com
>
> But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> continue.
>
> However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> to failure of guest memfd creation.

Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
action from userspace, i.e. instead of ignoring the error, userspace could redo
KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.

We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
userspace provide advice/hints after creating a guest_memfd. But I suspect that
guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint
is actually less desirable.

KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
compatible size or the kernel doesn't support THP. An incompatible size is likely
a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP
support is likely a configuration bug. I.e. many/most uses *want* failures due to
KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.

> For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?

Why? Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest
the flag is a hint, and hints are usually best effort only, i.e. are ignored if
there is a fundamental incompatibility.

"Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
(yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
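As a concrete illustration of the fallback Sean describes above (redoing KVM_CREATE_GUEST_MEMFD with the hugepage flag cleared), here is a minimal, hypothetical sketch; the helper name is made up and error handling is simplified.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Try to create the guest_memfd with hugepages, retry without the flag on error. */
static int create_gmem_best_effort(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = size,
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	if (fd < 0) {
		/* e.g. size not hugepage-aligned, or the kernel lacks THP support. */
		gmem.flags = 0;
		fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	}
	return fd;
}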
On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, Nov 01, 2023, Xiaoyao Li wrote: > > On 10/31/2023 10:16 PM, Sean Christopherson wrote: > > > On Tue, Oct 31, 2023, Xiaoyao Li wrote: > > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote: > > > > > Extended guest_memfd to allow backing guest memory with transparent > > > > > hugepages. Require userspace to opt-in via a flag even though there's no > > > > > known/anticipated use case for forcing small pages as THP is optional, > > > > > i.e. to avoid ending up in a situation where userspace is unaware that > > > > > KVM can't provide hugepages. > > > > > > > > Personally, it seems not so "transparent" if requiring userspace to opt-in. > > > > > > > > People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE > > > > support, or check is the sysfs of transparent hugepage exists; 2)get the > > > > maximum support hugepage size 3) ensure the size satisfies the alignment; > > > > before opt-in it. > > > > > > > > Even simpler, userspace can blindly try to create guest memfd with > > > > transparent hugapage flag. If getting error, fallback to create without the > > > > transparent hugepage flag. > > > > > > > > However, it doesn't look transparent to me. > > > > > > The "transparent" part is referring to the underlying kernel mechanism, it's not > > > saying anything about the API. The "transparent" part of THP is that the kernel > > > doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is > > > (mostly) transparent to userspace. > > > > > > Paolo also isn't the biggest fan[*], but there are also downsides to always > > > allowing hugepages, e.g. silent failure due to lack of THP or unaligned size, > > > and there's precedent in the form of MADV_HUGEPAGE. > > > > > > [*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a189@redhat.com > > > > But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the > > failure of MADV_HUGEPAGE is not fatal, user space can ignore it and > > continue. > > > > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads > > to failure of guest memfd creation. > > Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different > action from userspace, i.e. instead of ignoring the error, userspace could redo > KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0. > > We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could > extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let > userspace provide advice/hints after creating a guest_memfd. But I suspect that > guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint > is actually less desirable. > > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a > compatible size or the kernel doesn't support THP. An incompatible size is likely > a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP > support is likely a configuration bug. I.e. many/most uses *want* failures due to > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal. > > > For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE > > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*? > > Why? Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest > the flag is a hint, and hints are usually best effort only, i.e. are ignored if > there is a fundamental incompatibility. > > "Allow" isn't perfect, e.g. 
I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't > (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than > a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE. I think that the current patch is fine, but I will adjust it to always allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE. If hugepages are not guaranteed, and (theoretically) you could have no hugepage at all in the result, it's okay to get this result even if THP is not available in the kernel. Paolo
On Wed, Nov 01, 2023, Paolo Bonzini wrote: > On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Wed, Nov 01, 2023, Xiaoyao Li wrote: > > > On 10/31/2023 10:16 PM, Sean Christopherson wrote: > > > > On Tue, Oct 31, 2023, Xiaoyao Li wrote: > > > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote: > > > But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the > > > failure of MADV_HUGEPAGE is not fatal, user space can ignore it and > > > continue. > > > > > > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads > > > to failure of guest memfd creation. > > > > Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different > > action from userspace, i.e. instead of ignoring the error, userspace could redo > > KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0. > > > > We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could > > extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let > > userspace provide advice/hints after creating a guest_memfd. But I suspect that > > guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation hint > > is actually less desirable. > > > > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a > > compatible size or the kernel doesn't support THP. An incompatible size is likely > > a userspace bug, and for most setups that want to utilize guest_memfd, lack of THP > > support is likely a configuration bug. I.e. many/most uses *want* failures due to > > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal. > > > > > For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE > > > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*? > > > > Why? Verbs like "prefer" and "desire" aren't a good fit IMO because they suggest > > the flag is a hint, and hints are usually best effort only, i.e. are ignored if > > there is a fundamental incompatibility. > > > > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't > > (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than > > a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE. > > I think that the current patch is fine, but I will adjust it to always > allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE. > If hugepages are not guaranteed, and (theoretically) you could have no > hugepage at all in the result, it's okay to get this result even if THP is not > available in the kernel. Can you post a fixup patch? It's not clear to me exactly what behavior you intend to end up with.
On 11/1/23 17:36, Sean Christopherson wrote:
>>> "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES
>>> or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't
>>> (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
>>> a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory
>>> pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
>> I think that the current patch is fine, but I will adjust it to always
>> allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
>> If hugepages are not guaranteed, and (theoretically) you could have no
>> hugepage at all in the result, it's okay to get this result even if THP is not
>> available in the kernel.
> Can you post a fixup patch? It's not clear to me exactly what behavior you intend
> to end up with.

Sure, just this:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7d1a33c2ad42..34fd070e03d9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
-
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+	u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;
@@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (size < 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
 	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
 		return -EINVAL;
-#endif
 
 	return __kvm_gmem_create(kvm, size, flags);
 }

Paolo
On Wed, Nov 01, 2023, Paolo Bonzini wrote: > On 11/1/23 17:36, Sean Christopherson wrote: > > > > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES > > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't > > > > (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than > > > > a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory > > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE. > > > I think that the current patch is fine, but I will adjust it to always > > > allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE. > > > If hugepages are not guaranteed, and (theoretically) you could have no > > > hugepage at all in the result, it's okay to get this result even if THP is not > > > available in the kernel. > > Can you post a fixup patch? It's not clear to me exactly what behavior you intend > > to end up with. > > Sure, just this: > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 7d1a33c2ad42..34fd070e03d9 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > { > loff_t size = args->size; > u64 flags = args->flags; > - u64 valid_flags = 0; > - > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > if (flags & ~valid_flags) > return -EINVAL; > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > if (size < 0 || !PAGE_ALIGNED(size)) > return -EINVAL; > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) && > !IS_ALIGNED(size, HPAGE_PMD_SIZE)) > return -EINVAL; > -#endif That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y. #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) ... > return __kvm_gmem_create(kvm, size, flags); > } > > Paolo >
On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, Nov 01, 2023, Paolo Bonzini wrote: > > On 11/1/23 17:36, Sean Christopherson wrote: > > > > > "Allow" isn't perfect, e.g. I would much prefer a straight KVM_GUEST_MEMFD_USE_HUGEPAGES > > > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM doesn't > > > > > (yet) guarantee hugepages. I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than > > > > > a hint, but weaker than a requirement. And if/when KVM supports a dedicated memory > > > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE. > > > > I think that the current patch is fine, but I will adjust it to always > > > > allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE. > > > > If hugepages are not guaranteed, and (theoretically) you could have no > > > > hugepage at all in the result, it's okay to get this result even if THP is not > > > > available in the kernel. > > > Can you post a fixup patch? It's not clear to me exactly what behavior you intend > > > to end up with. > > > > Sure, just this: > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > index 7d1a33c2ad42..34fd070e03d9 100644 > > --- a/virt/kvm/guest_memfd.c > > +++ b/virt/kvm/guest_memfd.c > > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > > { > > loff_t size = args->size; > > u64 flags = args->flags; > > - u64 valid_flags = 0; > > - > > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) > > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > > if (flags & ~valid_flags) > > return -EINVAL; > > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > > if (size < 0 || !PAGE_ALIGNED(size)) > > return -EINVAL; > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) && > > !IS_ALIGNED(size, HPAGE_PMD_SIZE)) > > return -EINVAL; > > -#endif > > That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y. > > #else /* CONFIG_TRANSPARENT_HUGEPAGE */ > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) > #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) Would have caught it when actually testing it, I guess. :) It has to be PMD_SIZE, possibly with #ifdef CONFIG_TRANSPARENT_HUGEPAGE BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE); #endif for extra safety. Paolo
On Thu, Nov 02, 2023, Paolo Bonzini wrote: > On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Wed, Nov 01, 2023, Paolo Bonzini wrote: > > > On 11/1/23 17:36, Sean Christopherson wrote: > > > > Can you post a fixup patch? It's not clear to me exactly what behavior you intend > > > > to end up with. > > > > > > Sure, just this: > > > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > > index 7d1a33c2ad42..34fd070e03d9 100644 > > > --- a/virt/kvm/guest_memfd.c > > > +++ b/virt/kvm/guest_memfd.c > > > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > > > { > > > loff_t size = args->size; > > > u64 flags = args->flags; > > > - u64 valid_flags = 0; > > > - > > > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) > > > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > > > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > > > if (flags & ~valid_flags) > > > return -EINVAL; > > > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > > > if (size < 0 || !PAGE_ALIGNED(size)) > > > return -EINVAL; > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) && > > > !IS_ALIGNED(size, HPAGE_PMD_SIZE)) > > > return -EINVAL; > > > -#endif > > > > That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y. > > > > #else /* CONFIG_TRANSPARENT_HUGEPAGE */ > > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > > #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) > > #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) > > Would have caught it when actually testing it, I guess. :) It has to > be PMD_SIZE, possibly with > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE); > #endif Yeah, that works for me. Actually, looking that this again, there's not actually a hard dependency on THP. A THP-enabled kernel _probably_ gives a higher probability of using hugepages, but mostly because THP selects COMPACTION, and I suppose because using THP for other allocations reduces overall fragmentation. So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think we should do the below (I verified KVM can create hugepages with THP=n). We'll need another capability, but (a) we probably should have that anyways and (b) it provides a cleaner path to adding PUD-sized hugepage support in the future. 
And then adjust the tests like so: diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index c15de9852316..c9f449718fce 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -201,6 +201,10 @@ int main(int argc, char *argv[]) TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD)); + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured()) + TEST_ASSERT_EQ(get_trans_hugepagesz(), + kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)); + page_size = getpagesize(); total_size = page_size * 4; diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c index be311944e90a..245901587ed2 100644 --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c @@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE)); - if (backing_src_can_be_huge(src_type)) + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)) memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; else memfd_flags = 0; -- From: Sean Christopherson <seanjc@google.com> Date: Wed, 25 Oct 2023 16:26:41 -0700 Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest memory Extend guest_memfd to allow backing guest memory with hugepages. For now, make hugepage utilization best-effort, i.e. fall back to non-huge mappings if a hugepage can't be allocated. Guaranteeing hugepages would require a dedicated memory pool and significantly more complexity and churn.. Require userspace to opt-in via a flag even though it's unlikely real use cases will ever want to use order-0 pages, e.g. to give userspace a safety valve in case hugepage support is buggy, and to allow for easier testing of both paths. Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling primarily deals with userspace page tables, which are explicitly not in play for guest_memfd. Selecting THP does make obtaining hugepages more likely, but only because THP selects CONFIG_COMPACTION. Don't select CONFIG_COMPACTION either, because again it's not a hard dependency. For simplicity, require the guest_memfd size to be a multiple of the hugepage size, e.g. so that KVM doesn't need to do bounds checking when deciding whether or not to allocate a huge folio. When reporting the max order when KVM gets a pfn from guest_memfd, force order-0 pages if the hugepage is not fully contained by the memslot binding, e.g. if userspace requested hugepages but punches a hole in the memslot bindings in order to emulate x86's VGA hole. Signed-off-by: Sean Christopherson <seanjc@google.com> --- Documentation/virt/kvm/api.rst | 17 +++++++++ include/uapi/linux/kvm.h | 3 ++ virt/kvm/guest_memfd.c | 69 +++++++++++++++++++++++++++++----- virt/kvm/kvm_main.c | 4 ++ 4 files changed, 84 insertions(+), 9 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index e82c69d5e755..ccdd5413920d 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6176,6 +6176,8 @@ and cannot be resized (guest_memfd files do however support PUNCH_HOLE). __u64 reserved[6]; }; + #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0) + Conceptually, the inode backing a guest_memfd file represents physical memory, i.e. 
is coupled to the virtual machine as a thing, not to a "struct kvm". The file itself, which is bound to a "struct kvm", is that instance's view of the @@ -6192,6 +6194,12 @@ most one mapping per page, i.e. binding multiple memory regions to a single guest_memfd range is not allowed (any number of memory regions can be bound to a single guest_memfd file, but the bound ranges must not overlap). +If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate +and map PMD-size hugepages for the guest_memfd file. This is currently best +effort. If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, size must be aligned to at +least the size reported by KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE (which also +enumerates support for KVM_GUEST_MEMFD_ALLOW_HUGEPAGE). + See KVM_SET_USER_MEMORY_REGION2 for additional details. 5. The kvm_run structure @@ -8639,6 +8647,15 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 64-bit bitmap (each bit describing a block size). The default value is 0, to disable the eager page splitting. + +8.41 KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE +------------------------------------------ + +This is an information-only capability that returns guest_memfd's hugepage size +for PMD hugepages. Returns '0' if guest_memfd is not supported, or if KVM +doesn't support creating hugepages for guest_memfd. Note, guest_memfd doesn't +currently support PUD-sized hugepages. + 9. Known KVM API problems ========================= diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 25caee8d1a80..b78d0e3bf22a 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1217,6 +1217,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_MEMORY_FAULT_INFO 231 #define KVM_CAP_MEMORY_ATTRIBUTES 232 #define KVM_CAP_GUEST_MEMFD 233 +#define KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE 234 #ifdef KVM_CAP_IRQ_ROUTING @@ -2303,4 +2304,6 @@ struct kvm_create_guest_memfd { __u64 reserved[6]; }; +#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0) + #endif /* __LINUX_KVM_H */ diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 98a12da80214..31b5e94d461a 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -13,14 +13,44 @@ struct kvm_gmem { struct list_head entry; }; +#define NR_PAGES_PER_PMD (1 << PMD_ORDER) + +static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index) +{ + unsigned long huge_index = round_down(index, NR_PAGES_PER_PMD); + unsigned long flags = (unsigned long)inode->i_private; + struct address_space *mapping = inode->i_mapping; + gfp_t gfp = mapping_gfp_mask(mapping); + struct folio *folio; + + if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE)) + return NULL; + + if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT, + (huge_index + NR_PAGES_PER_PMD - 1) << PAGE_SHIFT)) + return NULL; + + folio = filemap_alloc_folio(gfp, PMD_ORDER); + if (!folio) + return NULL; + + if (filemap_add_folio(mapping, folio, huge_index, gfp)) { + folio_put(folio); + return NULL; + } + return folio; +} + static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index) { struct folio *folio; - /* TODO: Support huge pages. 
*/ - folio = filemap_grab_folio(inode->i_mapping, index); - if (IS_ERR_OR_NULL(folio)) - return NULL; + folio = kvm_gmem_get_huge_folio(inode, index); + if (!folio) { + folio = filemap_grab_folio(inode->i_mapping, index); + if (IS_ERR_OR_NULL(folio)) + return NULL; + } /* * Use the up-to-date flag to track whether or not the memory has been @@ -373,6 +403,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) inode->i_mode |= S_IFREG; inode->i_size = size; mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + mapping_set_large_folios(inode->i_mapping); mapping_set_unmovable(inode->i_mapping); /* Unmovable mappings are supposed to be marked unevictable as well. */ WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); @@ -394,14 +425,18 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) { + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; loff_t size = args->size; u64 flags = args->flags; - u64 valid_flags = 0; if (flags & ~valid_flags) return -EINVAL; - if (size < 0 || !PAGE_ALIGNED(size)) + if (size <= 0 || !PAGE_ALIGNED(size)) + return -EINVAL; + + if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) && + !IS_ALIGNED(size, PMD_SIZE)) return -EINVAL; return __kvm_gmem_create(kvm, size, flags); @@ -501,7 +536,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot) int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, kvm_pfn_t *pfn, int *max_order) { - pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff; + pgoff_t index, huge_index; struct kvm_gmem *gmem; struct folio *folio; struct page *page; @@ -514,6 +549,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, gmem = file->private_data; + index = gfn - slot->base_gfn + slot->gmem.pgoff; if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) { r = -EIO; goto out_fput; @@ -533,9 +569,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, page = folio_file_page(folio, index); *pfn = page_to_pfn(page); - if (max_order) + if (!max_order) + goto success; + + *max_order = compound_order(compound_head(page)); + if (!*max_order) + goto success; + + /* + * The folio can be mapped with a hugepage if and only if the folio is + * fully contained by the range the memslot is bound to. Note, the + * caller is responsible for handling gfn alignment, this only deals + * with the file binding. + */ + huge_index = ALIGN(index, 1ull << *max_order); + if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) || + huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) *max_order = 0; - +success: r = 0; out_unlock: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 5d1a2f1b4e94..0711f2c75667 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4888,6 +4888,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) #ifdef CONFIG_KVM_PRIVATE_MEM case KVM_CAP_GUEST_MEMFD: return !kvm || kvm_arch_has_private_mem(kvm); + case KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE: + if (kvm && !kvm_arch_has_private_mem(kvm)) + return 0; + return PMD_SIZE; #endif default: break; base-commit: fcbef1e5e5d2a60dacac0d16c06ac00bedaefc0f --
On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote: > Actually, looking that this again, there's not actually a hard dependency on THP. > A THP-enabled kernel _probably_ gives a higher probability of using hugepages, > but mostly because THP selects COMPACTION, and I suppose because using THP for > other allocations reduces overall fragmentation. Yes, that's why I didn't even bother enabling it unless THP is enabled, but it makes even more sense to just try. > So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think > we should do the below (I verified KVM can create hugepages with THP=n). We'll > need another capability, but (a) we probably should have that anyways and (b) it > provides a cleaner path to adding PUD-sized hugepage support in the future. I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This should be a generic kernel API and in fact the sizes are available in a not-so-friendly format in /sys/kernel/mm/hugepages. We should just add /sys/kernel/mm/hugepages/sizes that contains "2097152 1073741824" on x86 (only the former if 1G pages are not supported). Plus: is this the best API if we need something else for 1G pages? Let's drop *this* patch and proceed incrementally. (Again, this is what I want to do with this final review: identify places that are stil sticky, and don't let them block the rest). Coincidentially we have an open spot next week at plumbers. Let's extend Fuad's section to cover more guestmem work. Paolo > diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c > index c15de9852316..c9f449718fce 100644 > --- a/tools/testing/selftests/kvm/guest_memfd_test.c > +++ b/tools/testing/selftests/kvm/guest_memfd_test.c > @@ -201,6 +201,10 @@ int main(int argc, char *argv[]) > > TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD)); > > + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured()) > + TEST_ASSERT_EQ(get_trans_hugepagesz(), > + kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)); > + > page_size = getpagesize(); > total_size = page_size * 4; > > diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c > index be311944e90a..245901587ed2 100644 > --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c > +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c > @@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t > > vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE)); > > - if (backing_src_can_be_huge(src_type)) > + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)) > memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > else > memfd_flags = 0; > > -- > From: Sean Christopherson <seanjc@google.com> > Date: Wed, 25 Oct 2023 16:26:41 -0700 > Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest > memory > > Extend guest_memfd to allow backing guest memory with hugepages. For now, > make hugepage utilization best-effort, i.e. fall back to non-huge mappings > if a hugepage can't be allocated. Guaranteeing hugepages would require a > dedicated memory pool and significantly more complexity and churn.. > > Require userspace to opt-in via a flag even though it's unlikely real use > cases will ever want to use order-0 pages, e.g. 
to give userspace a safety > valve in case hugepage support is buggy, and to allow for easier testing > of both paths. > > Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling > primarily deals with userspace page tables, which are explicitly not in > play for guest_memfd. Selecting THP does make obtaining hugepages more > likely, but only because THP selects CONFIG_COMPACTION. Don't select > CONFIG_COMPACTION either, because again it's not a hard dependency. > > For simplicity, require the guest_memfd size to be a multiple of the > hugepage size, e.g. so that KVM doesn't need to do bounds checking when > deciding whether or not to allocate a huge folio. > > When reporting the max order when KVM gets a pfn from guest_memfd, force > order-0 pages if the hugepage is not fully contained by the memslot > binding, e.g. if userspace requested hugepages but punches a hole in the > memslot bindings in order to emulate x86's VGA hole. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > Documentation/virt/kvm/api.rst | 17 +++++++++ > include/uapi/linux/kvm.h | 3 ++ > virt/kvm/guest_memfd.c | 69 +++++++++++++++++++++++++++++----- > virt/kvm/kvm_main.c | 4 ++ > 4 files changed, 84 insertions(+), 9 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index e82c69d5e755..ccdd5413920d 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6176,6 +6176,8 @@ and cannot be resized (guest_memfd files do however support PUNCH_HOLE). > __u64 reserved[6]; > }; > > + #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0) > + > Conceptually, the inode backing a guest_memfd file represents physical memory, > i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The > file itself, which is bound to a "struct kvm", is that instance's view of the > @@ -6192,6 +6194,12 @@ most one mapping per page, i.e. binding multiple memory regions to a single > guest_memfd range is not allowed (any number of memory regions can be bound to > a single guest_memfd file, but the bound ranges must not overlap). > > +If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate > +and map PMD-size hugepages for the guest_memfd file. This is currently best > +effort. If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, size must be aligned to at > +least the size reported by KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE (which also > +enumerates support for KVM_GUEST_MEMFD_ALLOW_HUGEPAGE). > + > See KVM_SET_USER_MEMORY_REGION2 for additional details. > > 5. The kvm_run structure > @@ -8639,6 +8647,15 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a > 64-bit bitmap (each bit describing a block size). The default value is > 0, to disable the eager page splitting. > > + > +8.41 KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE > +------------------------------------------ > + > +This is an information-only capability that returns guest_memfd's hugepage size > +for PMD hugepages. Returns '0' if guest_memfd is not supported, or if KVM > +doesn't support creating hugepages for guest_memfd. Note, guest_memfd doesn't > +currently support PUD-sized hugepages. > + > 9. 
Known KVM API problems > ========================= > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 25caee8d1a80..b78d0e3bf22a 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -1217,6 +1217,7 @@ struct kvm_ppc_resize_hpt { > #define KVM_CAP_MEMORY_FAULT_INFO 231 > #define KVM_CAP_MEMORY_ATTRIBUTES 232 > #define KVM_CAP_GUEST_MEMFD 233 > +#define KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE 234 > > #ifdef KVM_CAP_IRQ_ROUTING > > @@ -2303,4 +2304,6 @@ struct kvm_create_guest_memfd { > __u64 reserved[6]; > }; > > +#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0) > + > #endif /* __LINUX_KVM_H */ > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 98a12da80214..31b5e94d461a 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -13,14 +13,44 @@ struct kvm_gmem { > struct list_head entry; > }; > > +#define NR_PAGES_PER_PMD (1 << PMD_ORDER) > + > +static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index) > +{ > + unsigned long huge_index = round_down(index, NR_PAGES_PER_PMD); > + unsigned long flags = (unsigned long)inode->i_private; > + struct address_space *mapping = inode->i_mapping; > + gfp_t gfp = mapping_gfp_mask(mapping); > + struct folio *folio; > + > + if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE)) > + return NULL; > + > + if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT, > + (huge_index + NR_PAGES_PER_PMD - 1) << PAGE_SHIFT)) > + return NULL; > + > + folio = filemap_alloc_folio(gfp, PMD_ORDER); > + if (!folio) > + return NULL; > + > + if (filemap_add_folio(mapping, folio, huge_index, gfp)) { > + folio_put(folio); > + return NULL; > + } > + return folio; > +} > + > static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index) > { > struct folio *folio; > > - /* TODO: Support huge pages. */ > - folio = filemap_grab_folio(inode->i_mapping, index); > - if (IS_ERR_OR_NULL(folio)) > - return NULL; > + folio = kvm_gmem_get_huge_folio(inode, index); > + if (!folio) { > + folio = filemap_grab_folio(inode->i_mapping, index); > + if (IS_ERR_OR_NULL(folio)) > + return NULL; > + } > > /* > * Use the up-to-date flag to track whether or not the memory has been > @@ -373,6 +403,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) > inode->i_mode |= S_IFREG; > inode->i_size = size; > mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); > + mapping_set_large_folios(inode->i_mapping); > mapping_set_unmovable(inode->i_mapping); > /* Unmovable mappings are supposed to be marked unevictable as well. 
*/ > WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); > @@ -394,14 +425,18 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) > > int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) > { > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > loff_t size = args->size; > u64 flags = args->flags; > - u64 valid_flags = 0; > > if (flags & ~valid_flags) > return -EINVAL; > > - if (size < 0 || !PAGE_ALIGNED(size)) > + if (size <= 0 || !PAGE_ALIGNED(size)) > + return -EINVAL; > + > + if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) && > + !IS_ALIGNED(size, PMD_SIZE)) > return -EINVAL; > > return __kvm_gmem_create(kvm, size, flags); > @@ -501,7 +536,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot) > int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > gfn_t gfn, kvm_pfn_t *pfn, int *max_order) > { > - pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff; > + pgoff_t index, huge_index; > struct kvm_gmem *gmem; > struct folio *folio; > struct page *page; > @@ -514,6 +549,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > gmem = file->private_data; > > + index = gfn - slot->base_gfn + slot->gmem.pgoff; > if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) { > r = -EIO; > goto out_fput; > @@ -533,9 +569,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > page = folio_file_page(folio, index); > > *pfn = page_to_pfn(page); > - if (max_order) > + if (!max_order) > + goto success; > + > + *max_order = compound_order(compound_head(page)); > + if (!*max_order) > + goto success; > + > + /* > + * The folio can be mapped with a hugepage if and only if the folio is > + * fully contained by the range the memslot is bound to. Note, the > + * caller is responsible for handling gfn alignment, this only deals > + * with the file binding. > + */ > + huge_index = ALIGN(index, 1ull << *max_order); > + if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) || > + huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) > *max_order = 0; > - > +success: > r = 0; > > out_unlock: > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 5d1a2f1b4e94..0711f2c75667 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -4888,6 +4888,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) > #ifdef CONFIG_KVM_PRIVATE_MEM > case KVM_CAP_GUEST_MEMFD: > return !kvm || kvm_arch_has_private_mem(kvm); > + case KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE: > + if (kvm && !kvm_arch_has_private_mem(kvm)) > + return 0; > + return PMD_SIZE; > #endif > default: > break; > > base-commit: fcbef1e5e5d2a60dacac0d16c06ac00bedaefc0f > -- >
On 11/2/23 16:46, Paolo Bonzini wrote: > On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote: >> Actually, looking that this again, there's not actually a hard dependency on THP. >> A THP-enabled kernel _probably_ gives a higher probability of using hugepages, >> but mostly because THP selects COMPACTION, and I suppose because using THP for >> other allocations reduces overall fragmentation. > > Yes, that's why I didn't even bother enabling it unless THP is > enabled, but it makes even more sense to just try. > >> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think >> we should do the below (I verified KVM can create hugepages with THP=n). We'll >> need another capability, but (a) we probably should have that anyways and (b) it >> provides a cleaner path to adding PUD-sized hugepage support in the future. > > I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This > should be a generic kernel API and in fact the sizes are available in > a not-so-friendly format in /sys/kernel/mm/hugepages. > > We should just add /sys/kernel/mm/hugepages/sizes that contains > "2097152 1073741824" on x86 (only the former if 1G pages are not > supported). > > Plus: is this the best API if we need something else for 1G pages? > > Let's drop *this* patch and proceed incrementally. (Again, this is > what I want to do with this final review: identify places that are > stil sticky, and don't let them block the rest). > > Coincidentially we have an open spot next week at plumbers. Let's > extend Fuad's section to cover more guestmem work. Hi, was there any outcome wrt this one? Based on my experience with THP's it would be best if userspace didn't have to opt-in, nor care about the supported size. If the given size is unaligned, provide a mix of large pages up to an aligned size, and for the rest fallback to base pages, which should be better than -EINVAL on creation (is it possible with the current implementation? I'd hope so so?). A way to opt-out from huge pages could be useful although there's always the risk of some initial troubles resulting in various online sources cargo-cult recommending to opt-out forever. Vlastimil
On Mon, Nov 27, 2023, Vlastimil Babka wrote:
> On 11/2/23 16:46, Paolo Bonzini wrote:
> > On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
> >> Actually, looking at this again, there's not actually a hard dependency on THP.
> >> A THP-enabled kernel _probably_ gives a higher probability of using hugepages,
> >> but mostly because THP selects COMPACTION, and I suppose because using THP for
> >> other allocations reduces overall fragmentation.
> > 
> > Yes, that's why I didn't even bother enabling it unless THP is
> > enabled, but it makes even more sense to just try.
> > 
> >> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
> >> we should do the below (I verified KVM can create hugepages with THP=n).  We'll
> >> need another capability, but (a) we probably should have that anyways and (b) it
> >> provides a cleaner path to adding PUD-sized hugepage support in the future.
> > 
> > I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though.  This
> > should be a generic kernel API and in fact the sizes are available in
> > a not-so-friendly format in /sys/kernel/mm/hugepages.
> > 
> > We should just add /sys/kernel/mm/hugepages/sizes that contains
> > "2097152 1073741824" on x86 (only the former if 1G pages are not
> > supported).
> > 
> > Plus: is this the best API if we need something else for 1G pages?
> > 
> > Let's drop *this* patch and proceed incrementally.  (Again, this is
> > what I want to do with this final review: identify places that are
> > still sticky, and don't let them block the rest).
> > 
> > Coincidentally we have an open spot next week at plumbers.  Let's
> > extend Fuad's section to cover more guestmem work.
> 
> Hi,
> 
> was there any outcome wrt this one?

No, we punted on hugepage support for the initial guest_memfd merge.  We
definitely plan on adding hugepage support sooner rather than later, but we
haven't yet agreed on exactly what that will look like.

> Based on my experience with THPs it would be best if userspace didn't have
> to opt in, nor care about the supported size. If the given size is unaligned,
> provide a mix of large pages up to an aligned size, and for the rest fall back
> to base pages, which should be better than -EINVAL on creation (is it
> possible with the current implementation? I'd hope so?).

guest_memfd serves a different use case than THP.  For modern VMs, and especially
for slice-of-hardware VMs that are one of the main targets for guest_memfd, if
not _the_ main target, guest memory should _always_ be backed by hugepages in
the physical domain.  The actual guest mappings might not be huge, e.g. x86 needs
to do partial mappings to skip over (legacy) memory holes, but KVM already
gracefully handles that.

In other words, for most guest_memfd use cases, if userspace wants hugepages but
KVM can't provide hugepages, then it is much more desirable to return an error
than to silently fall back to small pages.

I 100% agree that having to opt in is suboptimal, but IMO providing "error on an
incompatible configuration" semantics without requiring userspace to opt in is
an even worse experience for userspace.

> A way to opt out from huge pages could be useful, although there's always the
> risk of some initial troubles resulting in various online sources cargo-cult
> recommending to opt out forever.
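To make the opt-in semantics described above concrete, here is a hedged userspace sketch. It assumes the KVM_GUEST_MEMFD_ALLOW_HUGEPAGE flag from the proposed (and since dropped) patch together with the KVM_CREATE_GUEST_MEMFD ioctl from the guest_memfd series; vm_fd is a hypothetical, already-created VM file descriptor. The point is that an incompatible configuration surfaces as an error instead of silently degrading to small pages.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
/* Flag from the proposed patch; not present in mainline headers. */
#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE	(1ULL << 0)
#endif

static int create_gmem(int vm_fd, __u64 size, bool want_hugepages)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = size,
		.flags = want_hugepages ? KVM_GUEST_MEMFD_ALLOW_HUGEPAGE : 0,
	};
	int fd;

	/* On success the ioctl returns a new guest_memfd file descriptor. */
	fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	if (fd < 0 && want_hugepages && errno == EINVAL)
		fprintf(stderr, "hugepage-backed guest_memfd rejected (alignment or unsupported flag): %s\n",
			strerror(errno));
	return fd;
}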
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e82c69d5e755..7f00c310c24a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6176,6 +6176,8 @@ and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
 	__u64 reserved[6];
   };
 
+  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
+
 Conceptually, the inode backing a guest_memfd file represents physical memory,
 i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
 file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6192,6 +6194,11 @@ most one mapping per page, i.e. binding multiple memory regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map hugepages for the guest_memfd file.  This is currently best effort.  If
+KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, the size must be aligned to the maximum
+transparent hugepage size supported by the kernel.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 5. The kvm_run structure
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 25caee8d1a80..33d542de0a61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2303,4 +2303,6 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE	(1ULL << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 98a12da80214..94bc478c26f3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,14 +13,47 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long huge_index = round_down(index, HPAGE_PMD_NR);
+	unsigned long flags = (unsigned long)inode->i_private;
+	struct address_space *mapping = inode->i_mapping;
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	struct folio *folio;
+
+	if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
+		return NULL;
+
+	if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+				   (huge_index + HPAGE_PMD_NR - 1) << PAGE_SHIFT))
+		return NULL;
+
+	folio = filemap_alloc_folio(gfp, HPAGE_PMD_ORDER);
+	if (!folio)
+		return NULL;
+
+	if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+		folio_put(folio);
+		return NULL;
+	}
+
+	return folio;
+#else
+	return NULL;
+#endif
+}
+
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
 	struct folio *folio;
 
-	/* TODO: Support huge pages. */
-	folio = filemap_grab_folio(inode->i_mapping, index);
-	if (IS_ERR_OR_NULL(folio))
-		return NULL;
+	folio = kvm_gmem_get_huge_folio(inode, index);
+	if (!folio) {
+		folio = filemap_grab_folio(inode->i_mapping, index);
+		if (IS_ERR_OR_NULL(folio))
+			return NULL;
+	}
 
 	/*
 	 * Use the up-to-date flag to track whether or not the memory has been
@@ -373,6 +406,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	inode->i_mode |= S_IFREG;
 	inode->i_size = size;
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_large_folios(inode->i_mapping);
 	mapping_set_unmovable(inode->i_mapping);
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -398,12 +432,21 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;
 
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
 	if (size < 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+		return -EINVAL;
+#endif
+
 	return __kvm_gmem_create(kvm, size, flags);
 }
 
@@ -501,7 +544,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
 {
-	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+	pgoff_t index, huge_index;
 	struct kvm_gmem *gmem;
 	struct folio *folio;
 	struct page *page;
@@ -514,6 +557,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	gmem = file->private_data;
 
+	index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
 		r = -EIO;
 		goto out_fput;
@@ -533,9 +577,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	page = folio_file_page(folio, index);
 
 	*pfn = page_to_pfn(page);
-	if (max_order)
+	if (!max_order)
+		goto success;
+
+	*max_order = compound_order(compound_head(page));
+	if (!*max_order)
+		goto success;
+
+	/*
+	 * The folio can be mapped with a hugepage if and only if the folio is
+	 * fully contained by the range the memslot is bound to.  Note, the
+	 * caller is responsible for handling gfn alignment, this only deals
+	 * with the file binding.
+	 */
+	huge_index = ALIGN(index, 1ull << *max_order);
+	if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
 		*max_order = 0;
-
+success:
 	r = 0;
 
 out_unlock:
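As a rough illustration of the containment check at the end of kvm_gmem_get_pfn() above, the standalone sketch below (plain userspace C, with made-up example values) mirrors the same arithmetic: the order is clamped back to 0 whenever the order-aligned range derived from the faulting index is not fully covered by the memslot's binding [pgoff, pgoff + npages).

#include <stdio.h>

/* Kernel-style ALIGN() for power-of-two alignments. */
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	/* Hypothetical binding: a 3 MiB memslot (768 4 KiB pages) at file offset 0. */
	unsigned long long pgoff = 0, npages = 768;
	unsigned long long index = 600;		/* faulting page-cache index */
	unsigned long long nr;
	int max_order = 9;			/* PMD-sized folio: 512 base pages */

	nr = 1ull << max_order;
	if (ALIGN(index, nr) < ALIGN(pgoff, nr) ||
	    ALIGN(index, nr) + nr > pgoff + npages)
		max_order = 0;

	/* Prints 0: the 2 MiB range [1024, 1536) extends past the 768-page binding. */
	printf("max_order = %d\n", max_order);
	return 0;
}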