Message ID: <20230718234512.1690985-1-seanjc@google.com>
Headers:
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 18 Jul 2023 16:44:43 -0700
Subject: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes
To: Paolo Bonzini <pbonzini@redhat.com>, Marc Zyngier <maz@kernel.org>, Oliver Upton <oliver.upton@linux.dev>, Huacai Chen <chenhuacai@kernel.org>, Michael Ellerman <mpe@ellerman.id.au>, Anup Patel <anup@brainfault.org>, Paul Walmsley <paul.walmsley@sifive.com>, Palmer Dabbelt <palmer@dabbelt.com>, Albert Ou <aou@eecs.berkeley.edu>, Sean Christopherson <seanjc@google.com>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, Paul Moore <paul@paul-moore.com>, James Morris <jmorris@namei.org>, "Serge E. Hallyn" <serge@hallyn.com>
Cc: kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, Chao Peng <chao.p.peng@linux.intel.com>, Fuad Tabba <tabba@google.com>, Jarkko Sakkinen <jarkko@kernel.org>, Yu Zhang <yu.c.zhang@linux.intel.com>, Vishal Annapurve <vannapurve@google.com>, Ackerley Tng <ackerleytng@google.com>, Maciej Szmigiero <mail@maciej.szmigiero.name>, Vlastimil Babka <vbabka@suse.cz>, David Hildenbrand <david@redhat.com>, Quentin Perret <qperret@google.com>, Michael Roth <michael.roth@amd.com>, Wei Wang <wei.w.wang@intel.com>, Liam Merwick <liam.merwick@oracle.com>, Isaku Yamahata <isaku.yamahata@gmail.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Series: KVM: guest_memfd() and per-page attributes
Message
Sean Christopherson
July 18, 2023, 11:44 p.m. UTC
This is the next iteration of implementing fd-based (instead of vma-based)
memory for KVM guests. If you want the full background of why we are doing
this, please go read the v10 cover letter[1].

The biggest change from v10 is to implement the backing storage in KVM
itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
See link[2] for details on why we pivoted to a KVM-specific approach.

Key word is "biggest". Relative to v10, there are many big changes.
Highlights below (I can't remember everything that got changed at
this point).

Tagged RFC as there are a lot of empty changelogs, and a lot of missing
documentation. And ideally, we'll have even more tests before merging.
There are also several gaps/opens (to be discussed in tomorrow's PUCK).

v11:
 - Test private<=>shared conversions *without* doing fallocate()
 - PUNCH_HOLE all memory between iterations of the conversion test so that
   KVM doesn't retain pages in the guest_memfd
 - Rename the hugepage control to the very generic ALLOW_HUGEPAGE, instead
   of giving it a THP- or PMD-specific name
 - Fold in fixes from a lot of people (thank you!)
 - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
   KVM handles a page fault and looks at inconsistent attributes
 - Refactor the MMU interaction with attribute updates to reuse much of
   KVM's framework for mmu_notifiers

[1] https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
[2] https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (7):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Introduce per-page memory attributes
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86/mmu: Handle page fault for private memory
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Expand set_memory_region_test to validate guest_memfd()

Sean Christopherson (18):
  KVM: Wrap kvm_gfn_range.pte in a per-action union
  KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges
  KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  security: Export security_inode_init_security_anon() for use by KVM
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  KVM: Add transparent hugepage support for dedicated guest memory
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: x86: Add support for "protected VMs" that can utilize private memory
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
  KVM: selftests: Add basic selftest for guest_memfd()

Vishal Annapurve (3):
  KVM: selftests: Add helpers to convert guest memory b/w private and shared
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)
  KVM: selftests: Add x86-only selftest for private memory conversions

 Documentation/virt/kvm/api.rst                      | 114 ++++
 arch/arm64/include/asm/kvm_host.h                   |   2 -
 arch/arm64/kvm/Kconfig                              |   2 +-
 arch/arm64/kvm/mmu.c                                |   2 +-
 arch/mips/include/asm/kvm_host.h                    |   2 -
 arch/mips/kvm/Kconfig                               |   2 +-
 arch/mips/kvm/mmu.c                                 |   2 +-
 arch/powerpc/include/asm/kvm_host.h                 |   2 -
 arch/powerpc/kvm/Kconfig                            |   8 +-
 arch/powerpc/kvm/book3s_hv.c                        |   2 +-
 arch/powerpc/kvm/powerpc.c                          |   5 +-
 arch/riscv/include/asm/kvm_host.h                   |   2 -
 arch/riscv/kvm/Kconfig                              |   2 +-
 arch/riscv/kvm/mmu.c                                |   2 +-
 arch/x86/include/asm/kvm_host.h                     |  17 +-
 arch/x86/include/uapi/asm/kvm.h                     |   3 +
 arch/x86/kvm/Kconfig                                |  14 +-
 arch/x86/kvm/debugfs.c                              |   2 +-
 arch/x86/kvm/mmu/mmu.c                              | 287 +++++++-
 arch/x86/kvm/mmu/mmu_internal.h                     |   4 +
 arch/x86/kvm/mmu/mmutrace.h                         |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c                          |   8 +-
 arch/x86/kvm/vmx/vmx.c                              |  11 +-
 arch/x86/kvm/x86.c                                  |  24 +-
 include/linux/kvm_host.h                            | 129 +++-
 include/linux/pagemap.h                             |  11 +
 include/uapi/linux/kvm.h                            |  50 ++
 include/uapi/linux/magic.h                          |   1 +
 mm/compaction.c                                     |   4 +
 mm/migrate.c                                        |   2 +
 security/security.c                                 |   1 +
 tools/testing/selftests/kvm/Makefile                |   3 +
 tools/testing/selftests/kvm/dirty_log_test.c        |   2 +-
 .../testing/selftests/kvm/guest_memfd_test.c        | 114 ++++
 .../selftests/kvm/include/kvm_util_base.h           | 141 +++-
 .../testing/selftests/kvm/include/test_util.h       |   5 +
 .../selftests/kvm/include/ucall_common.h            |  12 +
 .../selftests/kvm/include/x86_64/processor.h        |  15 +
 .../selftests/kvm/kvm_page_table_test.c             |   2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c          | 230 ++++--
 tools/testing/selftests/kvm/lib/memstress.c         |   3 +-
 .../selftests/kvm/set_memory_region_test.c          |  99 +++
 .../kvm/x86_64/private_mem_conversions_test.c       | 408 +++++++++++
 .../kvm/x86_64/private_mem_kvm_exits_test.c         | 115 ++++
 .../kvm/x86_64/ucna_injection_test.c                |   2 +-
 virt/kvm/Kconfig                                    |  17 +
 virt/kvm/Makefile.kvm                               |   1 +
 virt/kvm/dirty_ring.c                               |   2 +-
 virt/kvm/guest_mem.c                                | 635 ++++++++++++++++++
 virt/kvm/kvm_main.c                                 | 384 +++++++++--
 virt/kvm/kvm_mm.h                                   |  38 ++
 51 files changed, 2700 insertions(+), 246 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
 create mode 100644 virt/kvm/guest_mem.c

base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
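[Editor's note: for readers new to the series, the userspace flow the cover
letter describes looks roughly like the sketch below. The ioctl names come
from the shortlog above; the struct layouts, field names, and flags are
paraphrased from this RFC and may not match the posted patches field-for-field
(e.g. KVM_MEM_PRIVATE and the gmem_fd/gmem_offset fields). These ioctls only
exist with the series applied.]

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Rough sketch: give a memslot a private guest_memfd half and a normal
 * mmap()ed shared half, then mark the range private via attributes. */
static int back_slot_with_gmem(int vm_fd, uint64_t gpa, uint64_t size,
			       void *shared_va)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = size,
		.flags = 0,	/* e.g. the series' ALLOW_HUGEPAGE control */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	if (gmem_fd < 0)
		return -1;

	struct kvm_userspace_memory_region2 region = {
		.slot            = 0,
		.flags           = KVM_MEM_PRIVATE,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (uint64_t)(uintptr_t)shared_va, /* shared half */
		.gmem_fd         = gmem_fd,                        /* private half */
		.gmem_offset     = 0,
	};
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
		return -1;

	/* Private<=>shared conversions are driven by per-page attributes,
	 * not by mmap()/munmap() of the private fd. */
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE, /* or 0 for shared */
	};
	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}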
Comments
On 7/19/2023 5:14 AM, Sean Christopherson wrote:
> This is the next iteration of implementing fd-based (instead of vma-based)
> memory for KVM guests. If you want the full background of why we are doing
> this, please go read the v10 cover letter[1].
>
> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> See link[2] for details on why we pivoted to a KVM-specific approach.
>
> Key word is "biggest". Relative to v10, there are many big changes.
> Highlights below (I can't remember everything that got changed at
> this point).
>
> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> documentation. And ideally, we'll have even more tests before merging.
> There are also several gaps/opens (to be discussed in tomorrow's PUCK).

As per our discussion on the PUCK call, here are the memory/NUMA accounting
related observations that I had while working on SNP guest secure page
migration:

* gmem allocations are currently treated as file page allocations
  accounted to the kernel and not to the QEMU process.

  Starting an SNP guest with 40G memory with memory interleaved between
  Node2 and Node3:

  $ numactl -i 2,3 ./bootg_snp.sh

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86

  -> Incorrect process resident memory and shared memory is reported.

  Accounting of the memory happens in the host page fault handler path,
  but for private guest pages we will never hit that.

* NUMA allocation does use the process mempolicy for appropriate node
  allocation (Node2 and Node3), but the pages again do not get attributed
  to the QEMU process:

  Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomati: Mon Jul 24 11:51:34 2023

  Per-node process memory usage (in MBs)
  PID                         Node 0  Node 1  Node 2  Node 3   Total
  242179 (qemu-system-x86)     21.14    1.61   39.44   39.38  101.57

  Per-node system memory usage (in MBs):
               Node 0   Node 1    Node 2    Node 3     Total
  FilePages   2475.63  2395.83  23999.46  23373.22  52244.14

* Most of the memory accounting relies on the VMAs, and as the private fd
  of gmem doesn't have a VMA (and that was the design goal), user-space
  fails to attribute the memory appropriately to the process:

  /proc/<qemu pid>/numa_maps
  7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
  7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
  7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
  7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4

  /proc/<qemu pid>/smaps
  7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
  7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
  7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
  7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025  /memfd:rom-backend-memfd-shared (deleted)

* QEMU-based NUMA bindings will not work. The memory backend uses mbind()
  to set the policy for a particular virtual memory range, but the gmem
  private fd does not have a virtual memory range visible in the host.

Regards,
Nikunj
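[Editor's note: a minimal sketch of Nikunj's last point. mbind(2) operates on
a host virtual address range, which only exists for backends that are
mmap()ed into the process; the helper name here is invented for illustration.
Build with -lnuma.]

#include <numaif.h>
#include <sys/mman.h>

/* QEMU-style NUMA binding of a *shared* memfd backend: works because the
 * backend has a VMA in the host process. */
static long bind_shared_backend(int memfd, size_t size,
				const unsigned long *nodemask,
				unsigned long maxnode)
{
	void *va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			memfd, 0);	/* a regular memfd: has a VMA */
	if (va == MAP_FAILED)
		return -1;
	/* A guest_memfd private fd is never mmap()ed, so there is no
	 * address to hand to mbind() -- hence the fbind() idea below. */
	return mbind(va, size, MPOL_INTERLEAVE, nodemask, maxnode, 0);
}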
On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
> > This is the next iteration of implementing fd-based (instead of vma-based)
> > memory for KVM guests. If you want the full background of why we are doing
> > this, please go read the v10 cover letter[1].
> >
> > The biggest change from v10 is to implement the backing storage in KVM
> > itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> > See link[2] for details on why we pivoted to a KVM-specific approach.
> >
> > Key word is "biggest". Relative to v10, there are many big changes.
> > Highlights below (I can't remember everything that got changed at
> > this point).
> >
> > Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> > documentation. And ideally, we'll have even more tests before merging.
> > There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>
> As per our discussion on the PUCK call, here are the memory/NUMA accounting
> related observations that I had while working on SNP guest secure page
> migration:
>
> * gmem allocations are currently treated as file page allocations
>   accounted to the kernel and not to the QEMU process.

We need to level set on terminology: these are all *stats*, not accounting. That
distinction matters because we have wiggle room on stats, e.g. we can probably get
away with just about any definition of how guest_memfd memory impacts stats, so
long as the information that is surfaced to userspace is useful and expected.

But we absolutely need to get accounting correct, specifically the allocations
need to be correctly accounted in memcg. And unless I'm missing something,
nothing in here shows anything related to memcg.

>   Starting an SNP guest with 40G memory with memory interleaved between
>   Node2 and Node3:
>
>   $ numactl -i 2,3 ./bootg_snp.sh
>
>       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
>
>   -> Incorrect process resident memory and shared memory is reported.

I don't know that I would call these "incorrect". Shared memory definitely is
correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
memslots).

>   Accounting of the memory happens in the host page fault handler path,
>   but for private guest pages we will never hit that.
>
> * NUMA allocation does use the process mempolicy for appropriate node
>   allocation (Node2 and Node3), but the pages again do not get attributed
>   to the QEMU process:
>
>   Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomati: Mon Jul 24 11:51:34 2023
>
>   Per-node process memory usage (in MBs)
>   PID                         Node 0  Node 1  Node 2  Node 3   Total
>   242179 (qemu-system-x86)     21.14    1.61   39.44   39.38  101.57
>
>   Per-node system memory usage (in MBs):
>                Node 0   Node 1    Node 2    Node 3     Total
>   FilePages   2475.63  2395.83  23999.46  23373.22  52244.14
>
> * Most of the memory accounting relies on the VMAs, and as the private fd
>   of gmem doesn't have a VMA (and that was the design goal), user-space
>   fails to attribute the memory appropriately to the process:
>
>   /proc/<qemu pid>/numa_maps
>   7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>   7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>   7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>   7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>
>   /proc/<qemu pid>/smaps
>   7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>   7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>   7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>   7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025  /memfd:rom-backend-memfd-shared (deleted)

This is all expected, and IMO correct. There are no userspace mappings, and so
not accounting anything is working as intended.

> * QEMU-based NUMA bindings will not work. The memory backend uses mbind()
>   to set the policy for a particular virtual memory range, but the gmem
>   private fd does not have a virtual memory range visible in the host.

Yes, adding a generic fbind() is the way to solve this.
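[Editor's note: fbind() does not exist; Sean is floating it as an idea. A
plausible shape, purely hypothetical, would simply mirror mbind(2) while
swapping the virtual address range for a range of the file, so a NUMA policy
could be attached to a guest_memfd that is never mmap()ed:]

/* Purely hypothetical -- no such syscall exists today. */
long fbind(int fd, unsigned long offset, unsigned long len,
	   int mode,			/* MPOL_BIND, MPOL_INTERLEAVE, ... */
	   const unsigned long *nodemask, unsigned long maxnode,
	   unsigned int flags);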
Dropped non-KVM folks from Cc: so as not to bother them too much.

On Tue, Jul 18, 2023, Sean Christopherson wrote:
> This is the next iteration of implementing fd-based (instead of vma-based)
> memory for KVM guests. If you want the full background of why we are doing
> this, please go read the v10 cover letter[1].
>
> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> See link[2] for details on why we pivoted to a KVM-specific approach.
>
> Key word is "biggest". Relative to v10, there are many big changes.
> Highlights below (I can't remember everything that got changed at
> this point).
>
> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> documentation. And ideally, we'll have even more tests before merging.
> There are also several gaps/opens (to be discussed in tomorrow's PUCK).

I've pushed this to https://github.com/kvm-x86/linux/tree/guest_memfd along
with Isaku's fix for the lock ordering bug on top.

As discussed at PUCK, I'll apply fixes/tweaks/changes on top until development
stabilizes, and will only squash/fixup when we're ready to post v12 for broad
review.

Please "formally" post patches just like you normally would, i.e. don't *just*
respond to the buggy mail (though that is also helpful). Standalone patches
make it easier for me to manage things via lore/b4.

If you can, put gmem or guest_memfd inside the square braces, e.g.

  [PATCH gmem] KVM: <shortlog>

so that it's obvious the patch is intended for the guest_memfd branch. For
fixes, please also be sure to use Fixes: tags and split patches to fix exactly
one base commit, again to make my life easier. I'll likely add my own
annotations when applying, e.g. [FIXUP] and whatnot, but that's purely notes
for myself for the future squash/rebase.

Thanks!
Hi Sean,

On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>> memory for KVM guests. If you want the full background of why we are doing
>>> this, please go read the v10 cover letter[1].
>>>
>>> The biggest change from v10 is to implement the backing storage in KVM
>>> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>
>>> Key word is "biggest". Relative to v10, there are many big changes.
>>> Highlights below (I can't remember everything that got changed at
>>> this point).
>>>
>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>> documentation. And ideally, we'll have even more tests before merging.
>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>
>> As per our discussion on the PUCK call, here are the memory/NUMA accounting
>> related observations that I had while working on SNP guest secure page
>> migration:
>>
>> * gmem allocations are currently treated as file page allocations
>>   accounted to the kernel and not to the QEMU process.
>
> We need to level set on terminology: these are all *stats*, not accounting. That
> distinction matters because we have wiggle room on stats, e.g. we can probably get
> away with just about any definition of how guest_memfd memory impacts stats, so
> long as the information that is surfaced to userspace is useful and expected.
>
> But we absolutely need to get accounting correct, specifically the allocations
> need to be correctly accounted in memcg. And unless I'm missing something,
> nothing in here shows anything related to memcg.

I tried out memcg after creating a separate cgroup for the qemu process. Guest
memory is accounted in memcg.

  $ egrep -w "file|file_thp|unevictable" memory.stat
  file 42978775040
  file_thp 42949672960
  unevictable 42953588736

NUMA allocations are coming from the right nodes as set by numactl.

  $ egrep -w "file|file_thp|unevictable" memory.numa_stat
  file N0=0 N1=20480 N2=21489377280 N3=21489377280
  file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
  unevictable N0=0 N1=0 N2=21474697216 N3=21478891520

>> Starting an SNP guest with 40G memory with memory interleaved between
>> Node2 and Node3:
>>
>> $ numactl -i 2,3 ./bootg_snp.sh
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
>>
>> -> Incorrect process resident memory and shared memory is reported.
>
> I don't know that I would call these "incorrect". Shared memory definitely is
> correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> memslots).

I am not sure why RSS will exceed VIRT; it should be at most 40G (assuming all
the memory is private).

As per my experiments with the hack below, MM_FILEPAGES does get accounted to
RSS/SHR in top:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4339 root      20   0   40.4g  40.1g  40.1g S  76.7  16.0   0:13.83 qemu-system-x86

diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..5b1f48a2e714 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
 	trace_rss_stat(mm, member);
 }
+EXPORT_SYMBOL(mm_trace_rss_stat);

 /*
  * Note: this doesn't free the actual pages themselves. That
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a7e926af4255..e4f268bf9ce2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
 		clear_highpage(folio_page(folio, i));
 	}

+	/* Account only once for the first time */
+	if (!folio_test_dirty(folio))
+		add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));
+
 	folio_mark_accessed(folio);
 	folio_mark_dirty(folio);
 	folio_mark_uptodate(folio);

We can update the rss_stat appropriately to get correct reporting in userspace.

>> Accounting of the memory happens in the host page fault handler path,
>> but for private guest pages we will never hit that.
>>
>> * NUMA allocation does use the process mempolicy for appropriate node
>>   allocation (Node2 and Node3), but the pages again do not get attributed
>>   to the QEMU process:
>>
>> Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomati: Mon Jul 24 11:51:34 2023
>>
>> Per-node process memory usage (in MBs)
>> PID                         Node 0  Node 1  Node 2  Node 3   Total
>> 242179 (qemu-system-x86)     21.14    1.61   39.44   39.38  101.57
>>
>> Per-node system memory usage (in MBs):
>>              Node 0   Node 1    Node 2    Node 3     Total
>> FilePages   2475.63  2395.83  23999.46  23373.22  52244.14
>>
>> * Most of the memory accounting relies on the VMAs, and as the private fd
>>   of gmem doesn't have a VMA (and that was the design goal), user-space
>>   fails to attribute the memory appropriately to the process:
>>
>> /proc/<qemu pid>/numa_maps
>> 7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>> 7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>> 7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>> 7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>>
>> /proc/<qemu pid>/smaps
>> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025  /memfd:rom-backend-memfd-shared (deleted)
>
> This is all expected, and IMO correct. There are no userspace mappings, and so
> not accounting anything is working as intended.

Doesn't sound that correct: if 10 SNP guests are running, each using 10GB, how
would we know who is using 100GB of memory?

>> * QEMU-based NUMA bindings will not work. The memory backend uses mbind()
>>   to set the policy for a particular virtual memory range, but the gmem
>>   private fd does not have a virtual memory range visible in the host.
>
> Yes, adding a generic fbind() is the way to solve this.

Regards,
Nikunj
On Wed, Jul 26, 2023, Nikunj A. Dadhania wrote:
> Hi Sean,
>
> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> >> Starting an SNP guest with 40G memory with memory interleaved between
> >> Node2 and Node3:
> >>
> >> $ numactl -i 2,3 ./bootg_snp.sh
> >>
> >>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
> >>
> >> -> Incorrect process resident memory and shared memory is reported.
> >
> > I don't know that I would call these "incorrect". Shared memory definitely is
> > correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
> > gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> > scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> > assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> > memslots).
>
> I am not sure why RSS will exceed VIRT; it should be at most 40G (assuming all
> the memory is private).

And also assuming that (a) userspace mmap()'d the shared side of things 1:1 with
private memory and (b) that the shared mappings have not been populated. Those
assumptions will mostly probably hold true for QEMU, but kernel correctness
shouldn't depend on assumptions about one specific userspace application.

> >> /proc/<qemu pid>/smaps
> >> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
> >> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
> >> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
> >> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025  /memfd:rom-backend-memfd-shared (deleted)
> >
> > This is all expected, and IMO correct. There are no userspace mappings, and so
> > not accounting anything is working as intended.
>
> Doesn't sound that correct: if 10 SNP guests are running, each using 10GB, how
> would we know who is using 100GB of memory?

It's correct with respect to what the interfaces show, which is how much memory
is *mapped* into userspace.

As I said (or at least tried to say) in my first reply, I am not against exposing
memory usage to userspace via stats, only that it's not obvious to me that the
existing VMA-based stats are the most appropriate way to surface this information.
On 7/26/2023 7:54 PM, Sean Christopherson wrote:
> On Wed, Jul 26, 2023, Nikunj A. Dadhania wrote:
>> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
>>>> /proc/<qemu pid>/smaps
>>>> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>>>> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>>>> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>>>> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025  /memfd:rom-backend-memfd-shared (deleted)
>>>
>>> This is all expected, and IMO correct. There are no userspace mappings, and so
>>> not accounting anything is working as intended.
>>
>> Doesn't sound that correct: if 10 SNP guests are running, each using 10GB, how
>> would we know who is using 100GB of memory?
>
> It's correct with respect to what the interfaces show, which is how much memory
> is *mapped* into userspace.
>
> As I said (or at least tried to say) in my first reply, I am not against exposing
> memory usage to userspace via stats, only that it's not obvious to me that the
> existing VMA-based stats are the most appropriate way to surface this information.

Right. Should we then think along the lines of a VM ioctl() for querying the
current memory usage of a guest-memfd?

We could use memcg for statistics, but memory cgroups can be disabled, so memcg
isn't really a dependable option. Do you have ideas on how to expose the memory
usage to user space other than via VMA-based stats?

Regards,
Nikunj
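[Editor's note: for concreteness, the kind of interface Nikunj floats above
might look like the fragment below. Nothing of the sort exists in this series;
every name, field, and ioctl number here is invented purely for illustration.]

/* Hypothetical UAPI fragment -- invented for illustration only. */
struct kvm_guest_memfd_usage {
	__u32 gmem_fd;		/* in: which guest_memfd to query */
	__u32 pad;
	__u64 bytes_allocated;	/* out: bytes currently backed by pages */
	__u64 reserved[6];
};
#define KVM_GUEST_MEMFD_USAGE _IOWR(KVMIO, 0xd4, struct kvm_guest_memfd_usage)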
On 7/26/23 13:20, Nikunj A. Dadhania wrote:
> Hi Sean,
>
> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
>> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>>> memory for KVM guests. If you want the full background of why we are doing
>>>> this, please go read the v10 cover letter[1].
>>>>
>>>> The biggest change from v10 is to implement the backing storage in KVM
>>>> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
>>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>>
>>>> Key word is "biggest". Relative to v10, there are many big changes.
>>>> Highlights below (I can't remember everything that got changed at
>>>> this point).
>>>>
>>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>>> documentation. And ideally, we'll have even more tests before merging.
>>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>>
>>> As per our discussion on the PUCK call, here are the memory/NUMA accounting
>>> related observations that I had while working on SNP guest secure page
>>> migration:
>>>
>>> * gmem allocations are currently treated as file page allocations
>>>   accounted to the kernel and not to the QEMU process.
>>
>> We need to level set on terminology: these are all *stats*, not accounting. That
>> distinction matters because we have wiggle room on stats, e.g. we can probably get
>> away with just about any definition of how guest_memfd memory impacts stats, so
>> long as the information that is surfaced to userspace is useful and expected.
>>
>> But we absolutely need to get accounting correct, specifically the allocations
>> need to be correctly accounted in memcg. And unless I'm missing something,
>> nothing in here shows anything related to memcg.
>
> I tried out memcg after creating a separate cgroup for the qemu process. Guest
> memory is accounted in memcg.
>
>   $ egrep -w "file|file_thp|unevictable" memory.stat
>   file 42978775040
>   file_thp 42949672960
>   unevictable 42953588736
>
> NUMA allocations are coming from the right nodes as set by numactl.
>
>   $ egrep -w "file|file_thp|unevictable" memory.numa_stat
>   file N0=0 N1=20480 N2=21489377280 N3=21489377280
>   file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
>   unevictable N0=0 N1=0 N2=21474697216 N3=21478891520
>
>>> Starting an SNP guest with 40G memory with memory interleaved between
>>> Node2 and Node3:
>>>
>>> $ numactl -i 2,3 ./bootg_snp.sh
>>>
>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
>>>
>>> -> Incorrect process resident memory and shared memory is reported.
>>
>> I don't know that I would call these "incorrect". Shared memory definitely is
>> correct, because by definition guest_memfd isn't shared. RSS is less clear cut;
>> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
>> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
>> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
>> memslots).
>
> I am not sure why RSS will exceed VIRT; it should be at most 40G (assuming all
> the memory is private).
>
> As per my experiments with the hack below, MM_FILEPAGES does get accounted to
> RSS/SHR in top:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    4339 root      20   0   40.4g  40.1g  40.1g S  76.7  16.0   0:13.83 qemu-system-x86
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f456f3b5049c..5b1f48a2e714 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
>  {
>  	trace_rss_stat(mm, member);
>  }
> +EXPORT_SYMBOL(mm_trace_rss_stat);
>
>  /*
>   * Note: this doesn't free the actual pages themselves. That
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index a7e926af4255..e4f268bf9ce2 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
>  		clear_highpage(folio_page(folio, i));
>  	}
>
> +	/* Account only once for the first time */
> +	if (!folio_test_dirty(folio))
> +		add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));

I think this alone would cause "Bad rss-counter" messages when the process
exits, because there's no corresponding decrement when page tables are torn
down. We would probably have to instantiate the page tables (i.e. with
PROT_NONE so userspace can't really do accesses through them) for this to work
properly. So then it wouldn't technically be "unmapped private memory" anymore,
but effectively still would be.

Maybe there would be more benefits, like the mbind() working. But where would
the PROT_NONE page tables be instantiated if there's no page fault? During the
ioctl? And is that perhaps too much (CPU) work for little benefit? Maybe, but
we could say it makes things simpler and can be optimized later?

Anyway, IMHO it would be really great if the memory usage was attributable in
the usual way, without new ioctls or something. Each time some memory appears
"unaccounted" somewhere, it causes confusion.

> +
>  	folio_mark_accessed(folio);
>  	folio_mark_dirty(folio);
>  	folio_mark_uptodate(folio);
>
> We can update the rss_stat appropriately to get correct reporting in userspace.
>
>>> Accounting of the memory happens in the host page fault handler path,
>>> but for private guest pages we will never hit that.
>>>
>>> * NUMA allocation does use the process mempolicy for appropriate node
>>>   allocation (Node2 and Node3), but the pages again do not get attributed
>>>   to the QEMU process:
>>>
>>> Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomati: Mon Jul 24 11:51:34 2023
>>>
>>> Per-node process memory usage (in MBs)
>>> PID                         Node 0  Node 1  Node 2  Node 3   Total
>>> 242179 (qemu-system-x86)     21.14    1.61   39.44   39.38  101.57
>>>
>>> Per-node system memory usage (in MBs):
>>>              Node 0   Node 1    Node 2    Node 3     Total
>>> FilePages   2475.63  2395.83  23999.46  23373.22  52244.14
>>>
>>> * Most of the memory accounting relies on the VMAs, and as the private fd
>>>   of gmem doesn't have a VMA (and that was the design goal), user-space
>>>   fails to attribute the memory appropriately to the process:
>>>
>>> /proc/<qemu pid>/numa_maps
>>> 7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>>> 7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>>> 7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>>> 7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>>>
>>> /proc/<qemu pid>/smaps
>>> 7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629 /memfd:memory-backend-memfd-shared (deleted)
>>> 7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033 /memfd:rom-backend-memfd-shared (deleted)
>>> 7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032 /memfd:rom-backend-memfd-shared (deleted)
>>> 7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025  /memfd:rom-backend-memfd-shared (deleted)
>>
>> This is all expected, and IMO correct. There are no userspace mappings, and so
>> not accounting anything is working as intended.
>
> Doesn't sound that correct: if 10 SNP guests are running, each using 10GB, how
> would we know who is using 100GB of memory?
>
>>> * QEMU-based NUMA bindings will not work. The memory backend uses mbind()
>>>   to set the policy for a particular virtual memory range, but the gmem
>>>   private fd does not have a virtual memory range visible in the host.
>>
>> Yes, adding a generic fbind() is the way to solve this.
>
> Regards,
> Nikunj
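[Editor's note: to illustrate Vlastimil's point about the missing decrement:
the hack charges MM_FILEPAGES at allocation, but nothing ever uncharges it, so
mm teardown would warn. A counterpart would have to run from guest_memfd's
truncate/PUNCH_HOLE path, roughly as below. This is a sketch only, with an
invented helper name, and it sidesteps the harder problem that current->mm at
free time need not be the mm that was charged at allocation time.]

/* Sketch of the missing decrement, assuming the charged mm is tracked
 * somewhere (here, hypothetically, passed in by the gmem instance). */
static void kvm_gmem_uncharge_folio(struct mm_struct *charged_mm,
				    struct folio *folio)
{
	/* The hack above only charges folios it then marks dirty. */
	if (folio_test_dirty(folio))
		add_mm_counter(charged_mm, MM_FILEPAGES,
			       -folio_nr_pages(folio));
}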