[v10,3/9] KVM: Extend the memslot to support fd-based private memory

Message ID 20221202061347.1070246-4-chao.p.peng@linux.intel.com
State New
Headers
Series KVM: mm: fd-based approach for supporting KVM |

Commit Message

Chao Peng Dec. 2, 2022, 6:13 a.m. UTC
  In memory encryption usage, guest memory may be encrypted with special
key and can be accessed only by the guest itself. We call such memory
private memory. It's valueless and sometimes can cause problem to allow
userspace to access guest private memory. This new KVM memslot extension
allows guest private memory being provided through a restrictedmem
backed file descriptor(fd) and userspace is restricted to access the
bookmarked memory in the fd.

This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields restricted_fd/restricted_offset to allow
userspace to instruct KVM to provide guest memory through restricted_fd.
'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
and the size is 'memory_size'.

The extended memslot can still have the userspace_addr(hva). When use, a
single memslot can maintain both private memory through restricted_fd
and shared memory through userspace_addr. Whether the private or shared
part is visible to guest is maintained by other KVM code.

A restrictedmem_notifier field is also added to the memslot structure to
allow the restricted_fd's backing store to notify KVM the memory change,
KVM then can invalidate its page table entries or handle memory errors.

Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
and right now it is selected on X86_64 only.

To make future maintenance easy, internally use a binary compatible
alias struct kvm_user_mem_region to handle both the normal and the
'_ext' variants.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
---
 Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
 arch/x86/kvm/Kconfig           |  2 ++
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       |  8 ++++--
 include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
 7 files changed, 114 insertions(+), 18 deletions(-)
  

Comments

Fuad Tabba Dec. 5, 2022, 9:03 a.m. UTC | #1
Hi Chao,

On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
>
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries or handle memory errors.
>
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>

V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
patch series anymore. Any reason you removed it, or is it just an
omission?

[*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/

Thanks,
/fuad

> ---
>  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
>  arch/x86/kvm/Kconfig           |  2 ++
>  arch/x86/kvm/x86.c             |  2 +-
>  include/linux/kvm_host.h       |  8 ++++--
>  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
>  virt/kvm/Kconfig               |  3 +++
>  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
>  7 files changed, 114 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index bb2f709c0900..99352170c130 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>
>  ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>
> +  struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +  };
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
>    #define KVM_MEM_READONLY     (1UL << 1)
> +  #define KVM_MEM_PRIVATE              (1UL << 2)
>
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +kvm_userspace_memory_region_ext struct includes all fields of
> +kvm_userspace_memory_region struct, while also adds additional fields for some
> +other features. See below description of flags field for more information.
> +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> +
> +The flags field supports following flags:
> +
> +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> +
> +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> +  read-only. In this case, writes to this memory will be posted to userspace as
> +  KVM_EXIT_MMIO exits.
> +
> +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> +  memory backed by a file descriptor(fd) and userspace access to the fd may be
> +  restricted. Userspace should use restricted_fd/restricted_offset in the
> +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> +  to guest. Userspace should guarantee not to map the same host physical address
> +  indicated by restricted_fd/restricted_offset to different guest physical
> +  addresses within multiple memslots. Failed to do this may result undefined
> +  behavior.
>
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
>         select HAVE_KVM_MEMORY_ATTRIBUTES
> +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7f850dfb4086..9a07380f8d3c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>         }
>
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -               struct kvm_userspace_memory_region m;
> +               struct kvm_user_mem_region m;
>
>                 m.slot = id | (i << 16);
>                 m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a784e2b06625..02347e386ea2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -44,6 +44,7 @@
>
>  #include <asm/kvm_host.h>
>  #include <linux/kvm_dirty_ring.h>
> +#include <linux/restrictedmem.h>
>
>  #ifndef KVM_MAX_VCPU_IDS
>  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> @@ -585,6 +586,9 @@ struct kvm_memory_slot {
>         u32 flags;
>         short id;
>         u16 as_id;
> +       struct file *restricted_file;
> +       loff_t restricted_offset;
> +       struct restrictedmem_notifier notifier;
>  };
>
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
>  };
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem);
> +                         const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem);
> +                           const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5d0941acb5bb..13bff963b8b0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>
> +struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/*
> + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> + * all fields from the top-level "extended" region.
> + */
> +struct kvm_user_mem_region {
> +       __u32 slot;
> +       __u32 flags;
> +       __u64 guest_phys_addr;
> +       __u64 memory_size;
> +       __u64 userspace_addr;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +#endif
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
>  #define KVM_MEM_READONLY       (1UL << 1)
> +#define KVM_MEM_PRIVATE                (1UL << 2)
>
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index effdea5dd4f0..d605545d6dd1 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
>
>  config HAVE_KVM_PM_NOTIFIER
>         bool
> +
> +config HAVE_KVM_RESTRICTED_MEM
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7f0f5e9f2406..b882eb2c76a2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   * Must be called holding kvm->slots_lock for write.
>   */
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem)
> +                           const struct kvm_user_mem_region *mem)
>  {
>         struct kvm_memory_slot *old, *new;
>         struct kvm_memslots *slots;
> @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem)
> +                         const struct kvm_user_mem_region *mem)
>  {
>         int r;
>
> @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
>  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -                                         struct kvm_userspace_memory_region *mem)
> +                                         struct kvm_user_mem_region *mem)
>  {
>         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
> @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region, field));      \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> +do {                                                                                   \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
> -               struct kvm_userspace_memory_region kvm_userspace_mem;
> +               struct kvm_user_mem_region mem;
> +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> +               kvm_sanity_check_user_mem_region_alias();
>
>                 r = -EFAULT;
> -               if (copy_from_user(&kvm_userspace_mem, argp,
> -                                               sizeof(kvm_userspace_mem)))
> +               if (copy_from_user(&mem, argp, size))
> +                       goto out;
> +
> +               r = -EINVAL;
> +               if (mem.flags & KVM_MEM_PRIVATE)
>                         goto out;
>
> -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
>         case KVM_GET_DIRTY_LOG: {
> --
> 2.25.1
>
  
Chao Peng Dec. 6, 2022, 11:53 a.m. UTC | #2
On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> Hi Chao,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> >
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> >
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries or handle memory errors.
> >
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Reviewed-by: Fuad Tabba <tabba@google.com>
> > Tested-by: Fuad Tabba <tabba@google.com>
> 
> V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> patch series anymore. Any reason you removed it, or is it just an
> omission?

We had some discussion in v9 [1] to add generic memory attributes ioctls
and KVM_CAP_PRIVATE_MEM can be implemented as a new 
KVM_MEMORY_ATTRIBUTE_PRIVATE flag via KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
ioctl [2]. The api doc has been updated:

+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …


[1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
[2]
https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/

Thanks,
Chao
> 
> [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> 
> Thanks,
> /fuad
> 
> > ---
> >  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> >  arch/x86/kvm/Kconfig           |  2 ++
> >  arch/x86/kvm/x86.c             |  2 +-
> >  include/linux/kvm_host.h       |  8 ++++--
> >  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
> >  virt/kvm/Kconfig               |  3 +++
> >  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
> >  7 files changed, 114 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index bb2f709c0900..99352170c130 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >
> >  ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> >         __u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >
> > +  struct kvm_userspace_memory_region_ext {
> > +       struct kvm_userspace_memory_region region;
> > +       __u64 restricted_offset;
> > +       __u32 restricted_fd;
> > +       __u32 pad1;
> > +       __u64 pad2[14];
> > +  };
> > +
> >    /* for kvm_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> >    #define KVM_MEM_READONLY     (1UL << 1)
> > +  #define KVM_MEM_PRIVATE              (1UL << 2)
> >
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> >  be identical.  This allows large pages in the guest to be backed by large
> >  pages in the host.
> >
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only.  In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +kvm_userspace_memory_region_ext struct includes all fields of
> > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > +other features. See below description of flags field for more information.
> > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > +
> > +The flags field supports following flags:
> > +
> > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > +
> > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > +  read-only. In this case, writes to this memory will be posted to userspace as
> > +  KVM_EXIT_MMIO exits.
> > +
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > +  memory backed by a file descriptor(fd) and userspace access to the fd may be
> > +  restricted. Userspace should use restricted_fd/restricted_offset in the
> > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > +  to guest. Userspace should guarantee not to map the same host physical address
> > +  indicated by restricted_fd/restricted_offset to different guest physical
> > +  addresses within multiple memslots. Failed to do this may result undefined
> > +  behavior.
> >
> >  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >  the memory region are automatically reflected into the guest.  For example, an
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> >         select INTERVAL_TREE
> >         select HAVE_KVM_PM_NOTIFIER if PM
> >         select HAVE_KVM_MEMORY_ATTRIBUTES
> > +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> > +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> >         help
> >           Support hosting fully virtualized guest machines using hardware
> >           virtualization extensions.  You will need a fairly recent
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7f850dfb4086..9a07380f8d3c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> >         }
> >
> >         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > -               struct kvm_userspace_memory_region m;
> > +               struct kvm_user_mem_region m;
> >
> >                 m.slot = id | (i << 16);
> >                 m.flags = 0;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a784e2b06625..02347e386ea2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -44,6 +44,7 @@
> >
> >  #include <asm/kvm_host.h>
> >  #include <linux/kvm_dirty_ring.h>
> > +#include <linux/restrictedmem.h>
> >
> >  #ifndef KVM_MAX_VCPU_IDS
> >  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> >         u32 flags;
> >         short id;
> >         u16 as_id;
> > +       struct file *restricted_file;
> > +       loff_t restricted_offset;
> > +       struct restrictedmem_notifier notifier;
> >  };
> >
> >  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> >  };
> >
> >  int kvm_set_memory_region(struct kvm *kvm,
> > -                         const struct kvm_userspace_memory_region *mem);
> > +                         const struct kvm_user_mem_region *mem);
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -                           const struct kvm_userspace_memory_region *mem);
> > +                           const struct kvm_user_mem_region *mem);
> >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 5d0941acb5bb..13bff963b8b0 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> >         __u64 userspace_addr; /* start of the userspace allocated memory */
> >  };
> >
> > +struct kvm_userspace_memory_region_ext {
> > +       struct kvm_userspace_memory_region region;
> > +       __u64 restricted_offset;
> > +       __u32 restricted_fd;
> > +       __u32 pad1;
> > +       __u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/*
> > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > + * all fields from the top-level "extended" region.
> > + */
> > +struct kvm_user_mem_region {
> > +       __u32 slot;
> > +       __u32 flags;
> > +       __u64 guest_phys_addr;
> > +       __u64 memory_size;
> > +       __u64 userspace_addr;
> > +       __u64 restricted_offset;
> > +       __u32 restricted_fd;
> > +       __u32 pad1;
> > +       __u64 pad2[14];
> > +};
> > +#endif
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> >  #define KVM_MEM_READONLY       (1UL << 1)
> > +#define KVM_MEM_PRIVATE                (1UL << 2)
> >
> >  /* for KVM_IRQ_LINE */
> >  struct kvm_irq_level {
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index effdea5dd4f0..d605545d6dd1 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> >
> >  config HAVE_KVM_PM_NOTIFIER
> >         bool
> > +
> > +config HAVE_KVM_RESTRICTED_MEM
> > +       bool
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 7f0f5e9f2406..b882eb2c76a2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> >         }
> >  }
> >
> > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >  {
> >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >
> > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> >   * Must be called holding kvm->slots_lock for write.
> >   */
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -                           const struct kvm_userspace_memory_region *mem)
> > +                           const struct kvm_user_mem_region *mem)
> >  {
> >         struct kvm_memory_slot *old, *new;
> >         struct kvm_memslots *slots;
> > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> >
> >  int kvm_set_memory_region(struct kvm *kvm,
> > -                         const struct kvm_userspace_memory_region *mem)
> > +                         const struct kvm_user_mem_region *mem)
> >  {
> >         int r;
> >
> > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> >  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> >
> >  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > -                                         struct kvm_userspace_memory_region *mem)
> > +                                         struct kvm_user_mem_region *mem)
> >  {
> >         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> >                 return -EINVAL;
> > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> >         return fd;
> >  }
> >
> > +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> > +do {                                                                           \
> > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> > +                    offsetof(struct kvm_userspace_memory_region, field));      \
> > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> > +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> > +} while (0)
> > +
> > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> > +do {                                                                                   \
> > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> > +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> > +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> > +} while (0)
> > +
> > +static void kvm_sanity_check_user_mem_region_alias(void)
> > +{
> > +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> > +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> > +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > +}
> > +
> >  static long kvm_vm_ioctl(struct file *filp,
> >                            unsigned int ioctl, unsigned long arg)
> >  {
> > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 break;
> >         }
> >         case KVM_SET_USER_MEMORY_REGION: {
> > -               struct kvm_userspace_memory_region kvm_userspace_mem;
> > +               struct kvm_user_mem_region mem;
> > +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +               kvm_sanity_check_user_mem_region_alias();
> >
> >                 r = -EFAULT;
> > -               if (copy_from_user(&kvm_userspace_mem, argp,
> > -                                               sizeof(kvm_userspace_mem)))
> > +               if (copy_from_user(&mem, argp, size))
> > +                       goto out;
> > +
> > +               r = -EINVAL;
> > +               if (mem.flags & KVM_MEM_PRIVATE)
> >                         goto out;
> >
> > -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >                 break;
> >         }
> >         case KVM_GET_DIRTY_LOG: {
> > --
> > 2.25.1
> >
  
Fuad Tabba Dec. 6, 2022, 12:39 p.m. UTC | #3
Hi Chao,

On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > Hi Chao,
> >
> > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > In memory encryption usage, guest memory may be encrypted with special
> > > key and can be accessed only by the guest itself. We call such memory
> > > private memory. It's valueless and sometimes can cause problem to allow
> > > userspace to access guest private memory. This new KVM memslot extension
> > > allows guest private memory being provided through a restrictedmem
> > > backed file descriptor(fd) and userspace is restricted to access the
> > > bookmarked memory in the fd.
> > >
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > and the size is 'memory_size'.
> > >
> > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > single memslot can maintain both private memory through restricted_fd
> > > and shared memory through userspace_addr. Whether the private or shared
> > > part is visible to guest is maintained by other KVM code.
> > >
> > > A restrictedmem_notifier field is also added to the memslot structure to
> > > allow the restricted_fd's backing store to notify KVM the memory change,
> > > KVM then can invalidate its page table entries or handle memory errors.
> > >
> > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > and right now it is selected on X86_64 only.
> > >
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> > >
> > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > Tested-by: Fuad Tabba <tabba@google.com>
> >
> > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > patch series anymore. Any reason you removed it, or is it just an
> > omission?
>
> We had some discussion in v9 [1] to add generic memory attributes ioctls
> and KVM_CAP_PRIVATE_MEM can be implemented as a new
> KVM_MEMORY_ATTRIBUTE_PRIVATE flag via KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> ioctl [2]. The api doc has been updated:
>
> +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
>
>
> [1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
> [2]
> https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/

I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
and my Reviewed/Tested-by still apply.

Cheers,
/fuad

>
> Thanks,
> Chao
> >
> > [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> >
> > Thanks,
> > /fuad
> >
> > > ---
> > >  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > >  arch/x86/kvm/Kconfig           |  2 ++
> > >  arch/x86/kvm/x86.c             |  2 +-
> > >  include/linux/kvm_host.h       |  8 ++++--
> > >  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
> > >  virt/kvm/Kconfig               |  3 +++
> > >  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
> > >  7 files changed, 114 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index bb2f709c0900..99352170c130 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > >  :Capability: KVM_CAP_USER_MEMORY
> > >  :Architectures: all
> > >  :Type: vm ioctl
> > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > >  :Returns: 0 on success, -1 on error
> > >
> > >  ::
> > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > >    };
> > >
> > > +  struct kvm_userspace_memory_region_ext {
> > > +       struct kvm_userspace_memory_region region;
> > > +       __u64 restricted_offset;
> > > +       __u32 restricted_fd;
> > > +       __u32 pad1;
> > > +       __u64 pad2[14];
> > > +  };
> > > +
> > >    /* for kvm_memory_region::flags */
> > >    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> > >    #define KVM_MEM_READONLY     (1UL << 1)
> > > +  #define KVM_MEM_PRIVATE              (1UL << 2)
> > >
> > >  This ioctl allows the user to create, modify or delete a guest physical
> > >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > >  be identical.  This allows large pages in the guest to be backed by large
> > >  pages in the host.
> > >
> > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > -to make a new slot read-only.  In this case, writes to this memory will be
> > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > +kvm_userspace_memory_region_ext struct includes all fields of
> > > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > > +other features. See below description of flags field for more information.
> > > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > > +
> > > +The flags field supports following flags:
> > > +
> > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > +
> > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > +  read-only. In this case, writes to this memory will be posted to userspace as
> > > +  KVM_EXIT_MMIO exits.
> > > +
> > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > > +  memory backed by a file descriptor(fd) and userspace access to the fd may be
> > > +  restricted. Userspace should use restricted_fd/restricted_offset in the
> > > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > > +  to guest. Userspace should guarantee not to map the same host physical address
> > > +  indicated by restricted_fd/restricted_offset to different guest physical
> > > +  addresses within multiple memslots. Failed to do this may result undefined
> > > +  behavior.
> > >
> > >  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > >  the memory region are automatically reflected into the guest.  For example, an
> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > index a8e379a3afee..690cb21010e7 100644
> > > --- a/arch/x86/kvm/Kconfig
> > > +++ b/arch/x86/kvm/Kconfig
> > > @@ -50,6 +50,8 @@ config KVM
> > >         select INTERVAL_TREE
> > >         select HAVE_KVM_PM_NOTIFIER if PM
> > >         select HAVE_KVM_MEMORY_ATTRIBUTES
> > > +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > >         help
> > >           Support hosting fully virtualized guest machines using hardware
> > >           virtualization extensions.  You will need a fairly recent
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 7f850dfb4086..9a07380f8d3c 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > >         }
> > >
> > >         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > -               struct kvm_userspace_memory_region m;
> > > +               struct kvm_user_mem_region m;
> > >
> > >                 m.slot = id | (i << 16);
> > >                 m.flags = 0;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a784e2b06625..02347e386ea2 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -44,6 +44,7 @@
> > >
> > >  #include <asm/kvm_host.h>
> > >  #include <linux/kvm_dirty_ring.h>
> > > +#include <linux/restrictedmem.h>
> > >
> > >  #ifndef KVM_MAX_VCPU_IDS
> > >  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > >         u32 flags;
> > >         short id;
> > >         u16 as_id;
> > > +       struct file *restricted_file;
> > > +       loff_t restricted_offset;
> > > +       struct restrictedmem_notifier notifier;
> > >  };
> > >
> > >  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > >  };
> > >
> > >  int kvm_set_memory_region(struct kvm *kvm,
> > > -                         const struct kvm_userspace_memory_region *mem);
> > > +                         const struct kvm_user_mem_region *mem);
> > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > -                           const struct kvm_userspace_memory_region *mem);
> > > +                           const struct kvm_user_mem_region *mem);
> > >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 5d0941acb5bb..13bff963b8b0 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > >  };
> > >
> > > +struct kvm_userspace_memory_region_ext {
> > > +       struct kvm_userspace_memory_region region;
> > > +       __u64 restricted_offset;
> > > +       __u32 restricted_fd;
> > > +       __u32 pad1;
> > > +       __u64 pad2[14];
> > > +};
> > > +
> > > +#ifdef __KERNEL__
> > > +/*
> > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > + * all fields from the top-level "extended" region.
> > > + */
> > > +struct kvm_user_mem_region {
> > > +       __u32 slot;
> > > +       __u32 flags;
> > > +       __u64 guest_phys_addr;
> > > +       __u64 memory_size;
> > > +       __u64 userspace_addr;
> > > +       __u64 restricted_offset;
> > > +       __u32 restricted_fd;
> > > +       __u32 pad1;
> > > +       __u64 pad2[14];
> > > +};
> > > +#endif
> > > +
> > >  /*
> > >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > >   * other bits are reserved for kvm internal use which are defined in
> > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > >   */
> > >  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> > >  #define KVM_MEM_READONLY       (1UL << 1)
> > > +#define KVM_MEM_PRIVATE                (1UL << 2)
> > >
> > >  /* for KVM_IRQ_LINE */
> > >  struct kvm_irq_level {
> > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > index effdea5dd4f0..d605545d6dd1 100644
> > > --- a/virt/kvm/Kconfig
> > > +++ b/virt/kvm/Kconfig
> > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > >
> > >  config HAVE_KVM_PM_NOTIFIER
> > >         bool
> > > +
> > > +config HAVE_KVM_RESTRICTED_MEM
> > > +       bool
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > >         }
> > >  }
> > >
> > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > >  {
> > >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > >
> > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > >   * Must be called holding kvm->slots_lock for write.
> > >   */
> > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > -                           const struct kvm_userspace_memory_region *mem)
> > > +                           const struct kvm_user_mem_region *mem)
> > >  {
> > >         struct kvm_memory_slot *old, *new;
> > >         struct kvm_memslots *slots;
> > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > >
> > >  int kvm_set_memory_region(struct kvm *kvm,
> > > -                         const struct kvm_userspace_memory_region *mem)
> > > +                         const struct kvm_user_mem_region *mem)
> > >  {
> > >         int r;
> > >
> > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > >  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > >
> > >  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > -                                         struct kvm_userspace_memory_region *mem)
> > > +                                         struct kvm_user_mem_region *mem)
> > >  {
> > >         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > >                 return -EINVAL;
> > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > >         return fd;
> > >  }
> > >
> > > +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> > > +do {                                                                           \
> > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> > > +                    offsetof(struct kvm_userspace_memory_region, field));      \
> > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> > > +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> > > +} while (0)
> > > +
> > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> > > +do {                                                                                   \
> > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> > > +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> > > +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> > > +} while (0)
> > > +
> > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > +{
> > > +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > +}
> > > +
> > >  static long kvm_vm_ioctl(struct file *filp,
> > >                            unsigned int ioctl, unsigned long arg)
> > >  {
> > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > >                 break;
> > >         }
> > >         case KVM_SET_USER_MEMORY_REGION: {
> > > -               struct kvm_userspace_memory_region kvm_userspace_mem;
> > > +               struct kvm_user_mem_region mem;
> > > +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > +
> > > +               kvm_sanity_check_user_mem_region_alias();
> > >
> > >                 r = -EFAULT;
> > > -               if (copy_from_user(&kvm_userspace_mem, argp,
> > > -                                               sizeof(kvm_userspace_mem)))
> > > +               if (copy_from_user(&mem, argp, size))
> > > +                       goto out;
> > > +
> > > +               r = -EINVAL;
> > > +               if (mem.flags & KVM_MEM_PRIVATE)
> > >                         goto out;
> > >
> > > -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > >                 break;
> > >         }
> > >         case KVM_GET_DIRTY_LOG: {
> > > --
> > > 2.25.1
> > >
  
Chao Peng Dec. 7, 2022, 3:10 p.m. UTC | #4
On Tue, Dec 06, 2022 at 12:39:18PM +0000, Fuad Tabba wrote:
> Hi Chao,
> 
> On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > > Hi Chao,
> > >
> > > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > In memory encryption usage, guest memory may be encrypted with special
> > > > key and can be accessed only by the guest itself. We call such memory
> > > > private memory. It's valueless and sometimes can cause problem to allow
> > > > userspace to access guest private memory. This new KVM memslot extension
> > > > allows guest private memory being provided through a restrictedmem
> > > > backed file descriptor(fd) and userspace is restricted to access the
> > > > bookmarked memory in the fd.
> > > >
> > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > > and the size is 'memory_size'.
> > > >
> > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > single memslot can maintain both private memory through restricted_fd
> > > > and shared memory through userspace_addr. Whether the private or shared
> > > > part is visible to guest is maintained by other KVM code.
> > > >
> > > > A restrictedmem_notifier field is also added to the memslot structure to
> > > > allow the restricted_fd's backing store to notify KVM the memory change,
> > > > KVM then can invalidate its page table entries or handle memory errors.
> > > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > >
> > > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > > Tested-by: Fuad Tabba <tabba@google.com>
> > >
> > > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > > patch series anymore. Any reason you removed it, or is it just an
> > > omission?
> >
> > We had some discussion in v9 [1] to add generic memory attributes ioctls
> > and KVM_CAP_PRIVATE_MEM can be implemented as a new
> > KVM_MEMORY_ATTRIBUTE_PRIVATE flag via KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> > ioctl [2]. The api doc has been updated:
> >
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
> >
> >
> > [1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
> > [2]
> > https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/
> 
> I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
> and my Reviewed/Tested-by still apply.

Thanks for the info.

Chao
> 
> Cheers,
> /fuad
> 
> >
> > Thanks,
> > Chao
> > >
> > > [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> > >
> > > Thanks,
> > > /fuad
> > >
> > > > ---
> > > >  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > > >  arch/x86/kvm/Kconfig           |  2 ++
> > > >  arch/x86/kvm/x86.c             |  2 +-
> > > >  include/linux/kvm_host.h       |  8 ++++--
> > > >  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
> > > >  virt/kvm/Kconfig               |  3 +++
> > > >  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
> > > >  7 files changed, 114 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index bb2f709c0900..99352170c130 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > > >  :Capability: KVM_CAP_USER_MEMORY
> > > >  :Architectures: all
> > > >  :Type: vm ioctl
> > > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > > >  :Returns: 0 on success, -1 on error
> > > >
> > > >  ::
> > > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > > >    };
> > > >
> > > > +  struct kvm_userspace_memory_region_ext {
> > > > +       struct kvm_userspace_memory_region region;
> > > > +       __u64 restricted_offset;
> > > > +       __u32 restricted_fd;
> > > > +       __u32 pad1;
> > > > +       __u64 pad2[14];
> > > > +  };
> > > > +
> > > >    /* for kvm_memory_region::flags */
> > > >    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> > > >    #define KVM_MEM_READONLY     (1UL << 1)
> > > > +  #define KVM_MEM_PRIVATE              (1UL << 2)
> > > >
> > > >  This ioctl allows the user to create, modify or delete a guest physical
> > > >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > > >  be identical.  This allows large pages in the guest to be backed by large
> > > >  pages in the host.
> > > >
> > > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > > > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > > > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > > -to make a new slot read-only.  In this case, writes to this memory will be
> > > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > > +kvm_userspace_memory_region_ext struct includes all fields of
> > > > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > > > +other features. See below description of flags field for more information.
> > > > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > > > +
> > > > +The flags field supports following flags:
> > > > +
> > > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > > +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > > +
> > > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > > +  read-only. In this case, writes to this memory will be posted to userspace as
> > > > +  KVM_EXIT_MMIO exits.
> > > > +
> > > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > > > +  memory backed by a file descriptor(fd) and userspace access to the fd may be
> > > > +  restricted. Userspace should use restricted_fd/restricted_offset in the
> > > > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > > > +  to guest. Userspace should guarantee not to map the same host physical address
> > > > +  indicated by restricted_fd/restricted_offset to different guest physical
> > > > +  addresses within multiple memslots. Failed to do this may result undefined
> > > > +  behavior.
> > > >
> > > >  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > > >  the memory region are automatically reflected into the guest.  For example, an
> > > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > > index a8e379a3afee..690cb21010e7 100644
> > > > --- a/arch/x86/kvm/Kconfig
> > > > +++ b/arch/x86/kvm/Kconfig
> > > > @@ -50,6 +50,8 @@ config KVM
> > > >         select INTERVAL_TREE
> > > >         select HAVE_KVM_PM_NOTIFIER if PM
> > > >         select HAVE_KVM_MEMORY_ATTRIBUTES
> > > > +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > > +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > >         help
> > > >           Support hosting fully virtualized guest machines using hardware
> > > >           virtualization extensions.  You will need a fairly recent
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 7f850dfb4086..9a07380f8d3c 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > > >         }
> > > >
> > > >         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > > -               struct kvm_userspace_memory_region m;
> > > > +               struct kvm_user_mem_region m;
> > > >
> > > >                 m.slot = id | (i << 16);
> > > >                 m.flags = 0;
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index a784e2b06625..02347e386ea2 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -44,6 +44,7 @@
> > > >
> > > >  #include <asm/kvm_host.h>
> > > >  #include <linux/kvm_dirty_ring.h>
> > > > +#include <linux/restrictedmem.h>
> > > >
> > > >  #ifndef KVM_MAX_VCPU_IDS
> > > >  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > > >         u32 flags;
> > > >         short id;
> > > >         u16 as_id;
> > > > +       struct file *restricted_file;
> > > > +       loff_t restricted_offset;
> > > > +       struct restrictedmem_notifier notifier;
> > > >  };
> > > >
> > > >  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > > >  };
> > > >
> > > >  int kvm_set_memory_region(struct kvm *kvm,
> > > > -                         const struct kvm_userspace_memory_region *mem);
> > > > +                         const struct kvm_user_mem_region *mem);
> > > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > > -                           const struct kvm_userspace_memory_region *mem);
> > > > +                           const struct kvm_user_mem_region *mem);
> > > >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > > >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > > >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > index 5d0941acb5bb..13bff963b8b0 100644
> > > > --- a/include/uapi/linux/kvm.h
> > > > +++ b/include/uapi/linux/kvm.h
> > > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > > >  };
> > > >
> > > > +struct kvm_userspace_memory_region_ext {
> > > > +       struct kvm_userspace_memory_region region;
> > > > +       __u64 restricted_offset;
> > > > +       __u32 restricted_fd;
> > > > +       __u32 pad1;
> > > > +       __u64 pad2[14];
> > > > +};
> > > > +
> > > > +#ifdef __KERNEL__
> > > > +/*
> > > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > > + * all fields from the top-level "extended" region.
> > > > + */
> > > > +struct kvm_user_mem_region {
> > > > +       __u32 slot;
> > > > +       __u32 flags;
> > > > +       __u64 guest_phys_addr;
> > > > +       __u64 memory_size;
> > > > +       __u64 userspace_addr;
> > > > +       __u64 restricted_offset;
> > > > +       __u32 restricted_fd;
> > > > +       __u32 pad1;
> > > > +       __u64 pad2[14];
> > > > +};
> > > > +#endif
> > > > +
> > > >  /*
> > > >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > > >   * other bits are reserved for kvm internal use which are defined in
> > > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > > >   */
> > > >  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> > > >  #define KVM_MEM_READONLY       (1UL << 1)
> > > > +#define KVM_MEM_PRIVATE                (1UL << 2)
> > > >
> > > >  /* for KVM_IRQ_LINE */
> > > >  struct kvm_irq_level {
> > > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > > index effdea5dd4f0..d605545d6dd1 100644
> > > > --- a/virt/kvm/Kconfig
> > > > +++ b/virt/kvm/Kconfig
> > > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > > >
> > > >  config HAVE_KVM_PM_NOTIFIER
> > > >         bool
> > > > +
> > > > +config HAVE_KVM_RESTRICTED_MEM
> > > > +       bool
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > >         }
> > > >  }
> > > >
> > > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > >  {
> > > >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > >
> > > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > >   * Must be called holding kvm->slots_lock for write.
> > > >   */
> > > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > > -                           const struct kvm_userspace_memory_region *mem)
> > > > +                           const struct kvm_user_mem_region *mem)
> > > >  {
> > > >         struct kvm_memory_slot *old, *new;
> > > >         struct kvm_memslots *slots;
> > > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > > >
> > > >  int kvm_set_memory_region(struct kvm *kvm,
> > > > -                         const struct kvm_userspace_memory_region *mem)
> > > > +                         const struct kvm_user_mem_region *mem)
> > > >  {
> > > >         int r;
> > > >
> > > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > > >  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > > >
> > > >  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > > -                                         struct kvm_userspace_memory_region *mem)
> > > > +                                         struct kvm_user_mem_region *mem)
> > > >  {
> > > >         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > > >                 return -EINVAL;
> > > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > > >         return fd;
> > > >  }
> > > >
> > > > +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> > > > +do {                                                                           \
> > > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> > > > +                    offsetof(struct kvm_userspace_memory_region, field));      \
> > > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> > > > +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> > > > +} while (0)
> > > > +
> > > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> > > > +do {                                                                                   \
> > > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> > > > +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> > > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> > > > +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> > > > +} while (0)
> > > > +
> > > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > > +{
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > > +}
> > > > +
> > > >  static long kvm_vm_ioctl(struct file *filp,
> > > >                            unsigned int ioctl, unsigned long arg)
> > > >  {
> > > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > > >                 break;
> > > >         }
> > > >         case KVM_SET_USER_MEMORY_REGION: {
> > > > -               struct kvm_userspace_memory_region kvm_userspace_mem;
> > > > +               struct kvm_user_mem_region mem;
> > > > +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > > +
> > > > +               kvm_sanity_check_user_mem_region_alias();
> > > >
> > > >                 r = -EFAULT;
> > > > -               if (copy_from_user(&kvm_userspace_mem, argp,
> > > > -                                               sizeof(kvm_userspace_mem)))
> > > > +               if (copy_from_user(&mem, argp, size))
> > > > +                       goto out;
> > > > +
> > > > +               r = -EINVAL;
> > > > +               if (mem.flags & KVM_MEM_PRIVATE)
> > > >                         goto out;
> > > >
> > > > -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > > +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > >                 break;
> > > >         }
> > > >         case KVM_GET_DIRTY_LOG: {
> > > > --
> > > > 2.25.1
> > > >
  
Xiaoyao Li Dec. 8, 2022, 8:37 a.m. UTC | #5
On 12/2/2022 2:13 PM, Chao Peng wrote:

..

> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
> 

 From the patch implementation, I have no idea why 
HAVE_KVM_RESTRICTED_MEM is needed.
  
Chao Peng Dec. 8, 2022, 11:30 a.m. UTC | #6
On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> On 12/2/2022 2:13 PM, Chao Peng wrote:
> 
> ..
> 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> > 
> 
> From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> needed.

The reason is we want KVM further controls the feature enabling. An
opt-in CONFIG_RESTRICTEDMEM can cause problem if user sets that for
unsupported architectures.

Here is the original discussion:
https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/

Thanks,
Chao
  
Xiaoyao Li Dec. 13, 2022, 12:04 p.m. UTC | #7
On 12/8/2022 7:30 PM, Chao Peng wrote:
> On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
>> On 12/2/2022 2:13 PM, Chao Peng wrote:
>>
>> ..
>>
>>> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
>>> and right now it is selected on X86_64 only.
>>>
>>
>>  From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
>> needed.
> 
> The reason is we want KVM further controls the feature enabling. An
> opt-in CONFIG_RESTRICTEDMEM can cause problem if user sets that for
> unsupported architectures.

HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to 
introduce it in the patch that actually uses it.

> Here is the original discussion:
> https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
> 
> Thanks,
> Chao
  
Chao Peng Dec. 19, 2022, 7:50 a.m. UTC | #8
On Tue, Dec 13, 2022 at 08:04:14PM +0800, Xiaoyao Li wrote:
> On 12/8/2022 7:30 PM, Chao Peng wrote:
> > On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> > > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > > 
> > > ..
> > > 
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > > 
> > > 
> > >  From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> > > needed.
> > 
> > The reason is we want KVM further controls the feature enabling. An
> > opt-in CONFIG_RESTRICTEDMEM can cause problem if user sets that for
> > unsupported architectures.
> 
> HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to introduce
> it in the patch that actually uses it.

It's being 'used' in this patch by reverse selecting RESTRICTEDMEM in
arch/x86/kvm/Kconfig, this gives people a sense where
restrictedmem_notifier comes from. Introducing the config with other
private/restricted memslot stuff together can also help future
supporting architectures better identify what they need do. But those
are trivial and moving to patch 08 sounds also good to me.

Thanks,
Chao
> 
> > Here is the original discussion:
> > https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
> > 
> > Thanks,
> > Chao
  
Borislav Petkov Dec. 19, 2022, 2:36 p.m. UTC | #9
On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow

valueless?

I can't parse that.

> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.

bookmarked?

> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
> 
> The extended memslot can still have the userspace_addr(hva). When use, a

"When un use, ..."

...

> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
>  	select INTERVAL_TREE
>  	select HAVE_KVM_PM_NOTIFIER if PM
>  	select HAVE_KVM_MEMORY_ATTRIBUTES
> +	select HAVE_KVM_RESTRICTED_MEM if X86_64
> +	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM

Those deps here look weird.

RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
it.

Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
X86_64 - you need that functionality when the respective guest support
is enabled in KVM.

Then, looking forward into your patchset, I'm not sure you even
need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
less Kconfig items because we have waay too many.

Thx.
  
Chao Peng Dec. 20, 2022, 7:43 a.m. UTC | #10
On Mon, Dec 19, 2022 at 03:36:28PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> 
> valueless?
> 
> I can't parse that.

It's unnecessary and ...

> 
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> 
> bookmarked?

userspace is restricted to access the memory content in the fd.

> 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr(hva). When use, a
> 
> "When un use, ..."

When both userspace_addr and restricted_fd/offset were used, ...

> 
> ...
> 
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> >  	select INTERVAL_TREE
> >  	select HAVE_KVM_PM_NOTIFIER if PM
> >  	select HAVE_KVM_MEMORY_ATTRIBUTES
> > +	select HAVE_KVM_RESTRICTED_MEM if X86_64
> > +	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> 
> Those deps here look weird.
> 
> RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
> it.

RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

> 
> Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
> X86_64 - you need that functionality when the respective guest support
> is enabled in KVM.

Letting the actual feature(e.g. TDX or pKVM) select it or add dependency
sounds a viable and clearer solution. Sean, let me know your opinion.

> 
> Then, looking forward into your patchset, I'm not sure you even
> need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
> CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
> less Kconfig items because we have waay too many.

The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
that.

[*] https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/

Thanks,
Chao
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
  
Borislav Petkov Dec. 20, 2022, 9:55 a.m. UTC | #11
On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

Which basically means that RESTRICTEDMEM should simply depend on KVM.
Because you can't know upfront whether KVM will run a TDX guest or a SNP
guest and so on.

Which then means that RESTRICTEDMEM will practically end up always
enabled in KVM HV configs.

> The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> that.

This is what I mean with "we have too many Kconfig items". :-\
  
Chao Peng Dec. 21, 2022, 1:42 p.m. UTC | #12
On Tue, Dec 20, 2022 at 10:55:44AM +0100, Borislav Petkov wrote:
> On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> > RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.
> 
> Which basically means that RESTRICTEDMEM should simply depend on KVM.
> Because you can't know upfront whether KVM will run a TDX guest or a SNP
> guest and so on.
> 
> Which then means that RESTRICTEDMEM will practically end up always
> enabled in KVM HV configs.

That's right, CONFIG_RESTRICTEDMEM is always selected for supported KVM
architectures (currently x86_64).

> 
> > The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> > works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> > that.
> 
> This is what I mean with "we have too many Kconfig items". :-\

Yes I agree. One way to remove this is probably additionally checking
CONFIG_64BIT instead.

Thanks,
Chao
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
  
Jarkko Sakkinen Jan. 5, 2023, 11:23 a.m. UTC | #13
On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
> 
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
> 
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
> 
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries or handle memory errors.
> 
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
> 
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.

Feels bit hacky IMHO, and more like a completely new feature than
an extension.

Why not just add a new ioctl? The commit message does not address
the most essential design here.

BR, Jarkko
  
Chao Peng Jan. 6, 2023, 9:40 a.m. UTC | #14
On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> > 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> > 
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries or handle memory errors.
> > 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> > 
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> 
> Feels bit hacky IMHO, and more like a completely new feature than
> an extension.
> 
> Why not just add a new ioctl? The commit message does not address
> the most essential design here.

Yes, people can always choose to add a new ioctl for this kind of change
and the balance point here is we want to also avoid 'too many ioctls' if
the functionalities are similar.  The '_ext' variant reuses all the
existing fields in the 'normal' variant and most importantly KVM
internally can reuse most of the code. I certainly can add some words in
the commit message to explain this design choice.

Thanks,
Chao
> 
> BR, Jarkko
  
Sean Christopherson Jan. 9, 2023, 7:32 p.m. UTC | #15
On Fri, Jan 06, 2023, Chao Peng wrote:
> On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> > 
> > Feels bit hacky IMHO, and more like a completely new feature than
> > an extension.
> > 
> > Why not just add a new ioctl? The commit message does not address
> > the most essential design here.
> 
> Yes, people can always choose to add a new ioctl for this kind of change
> and the balance point here is we want to also avoid 'too many ioctls' if
> the functionalities are similar.  The '_ext' variant reuses all the
> existing fields in the 'normal' variant and most importantly KVM
> internally can reuse most of the code. I certainly can add some words in
> the commit message to explain this design choice.

After seeing the userspace side of this, I agree with Jarkko; overloading
KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
itself.

It feels absolutely ridiculous, but I think the best option is to do:

#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
					 struct kvm_userspace_memory_region2)

/* for KVM_SET_USER_MEMORY_REGION2 */
struct kvm_user_mem_region2 {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size;
	__u64 userspace_addr;
	__u64 restricted_offset;
	__u32 restricted_fd;
	__u32 pad1;
	__u64 pad2[14];
}

And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.

Regarding the userspace side of things, please include Vishal's selftests in v11,
it's impossible to properly review the uAPI changes without seeing the userspace
side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
massage it into a set of patches that you can incorporate into your series.

[*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com
  
Chao Peng Jan. 10, 2023, 9:14 a.m. UTC | #16
On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > 
> > > Feels bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > > 
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> > 
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar.  The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
> 
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> itself.

How is the size validation being bogus? I don't quite follow. Then we
will use kvm_userspace_memory_region2 as the KVM internal alias, right?
I see similar examples use different functions to handle different
versions but it does look easier if we use alias for this function.

> 
> It feels absolutely ridiculous, but I think the best option is to do:
> 
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> 					 struct kvm_userspace_memory_region2)

Just interesting, is 0x49 a safe number we can use? 

> 
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_user_mem_region2 {
> 	__u32 slot;
> 	__u32 flags;
> 	__u64 guest_phys_addr;
> 	__u64 memory_size;
> 	__u64 userspace_addr;
> 	__u64 restricted_offset;
> 	__u32 restricted_fd;
> 	__u32 pad1;
> 	__u64 pad2[14];
> }
> 
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.

Okay, agree from KVM userspace API perspective this is more consistent
with similar existing examples. I see several of them.

I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
ioctl.

> 
> Regarding the userspace side of things, please include Vishal's selftests in v11,
> it's impossible to properly review the uAPI changes without seeing the userspace
> side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.

Previously I included Vishal's selftests in the github repo, but not
include them in this patch series. It's OK for me to incorporate them
directly into this series and review together if Vishal is fine.

Chao
> 
> [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com
  
Vishal Annapurve Jan. 10, 2023, 10:51 p.m. UTC | #17
On Tue, Jan 10, 2023 at 1:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Regarding the userspace side of things, please include Vishal's selftests in v11,
> > it's impossible to properly review the uAPI changes without seeing the userspace
> > side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> > massage it into a set of patches that you can incorporate into your series.
>
> Previously I included Vishal's selftests in the github repo, but not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review together if Vishal is fine.
>

Yeah, I am ok with incorporating selftest patches into this series and
reviewing them together.

Regards,
Vishal

> Chao
> >
> > [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com
  
Sean Christopherson Jan. 13, 2023, 10:37 p.m. UTC | #18
On Tue, Jan 10, 2023, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > > 
> > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > > 
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > > 
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar.  The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> > 
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > itself.
> 
> How is the size validation being bogus? I don't quite follow.

The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
in this case) in the ioctl() number, and that information is visible to userspace
via _IOCTL_SIZE().  Attempting to take a larger size can mess up sanity checks,
e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
passed an "extended" struct.

	#define kvm_do_ioctl(fd, cmd, arg)						\
	({										\
		kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));	\
		ioctl(fd, cmd, arg);							\
	})

> Then we will use kvm_userspace_memory_region2 as the KVM internal alias,
> right?

Yep.

> I see similar examples use different functions to handle different versions
> but it does look easier if we use alias for this function.
> 
> > 
> > It feels absolutely ridiculous, but I think the best option is to do:
> > 
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> > 					 struct kvm_userspace_memory_region2)
> 
> Just interesting, is 0x49 a safe number we can use? 

Yes?  So long as its not used by KVM, it's safe.  AFAICT, it's unused.
  
Chao Peng Jan. 17, 2023, 12:42 p.m. UTC | #19
On Fri, Jan 13, 2023 at 10:37:39PM +0000, Sean Christopherson wrote:
> On Tue, Jan 10, 2023, Chao Peng wrote:
> > On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > > To make future maintenance easy, internally use a binary compatible
> > > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > > '_ext' variants.
> > > > > 
> > > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > > an extension.
> > > > > 
> > > > > Why not just add a new ioctl? The commit message does not address
> > > > > the most essential design here.
> > > > 
> > > > Yes, people can always choose to add a new ioctl for this kind of change
> > > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > > the functionalities are similar.  The '_ext' variant reuses all the
> > > > existing fields in the 'normal' variant and most importantly KVM
> > > > internally can reuse most of the code. I certainly can add some words in
> > > > the commit message to explain this design choice.
> > > 
> > > After seeing the userspace side of this, I agree with Jarkko; overloading
> > > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> > > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > > itself.
> > 
> > How is the size validation being bogus? I don't quite follow.
> 
> The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
> in this case) in the ioctl() number, and that information is visible to userspace
> via _IOCTL_SIZE().  Attempting to take a larger size can mess up sanity checks,
> e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
> passed an "extended" struct.
> 
> 	#define kvm_do_ioctl(fd, cmd, arg)						\
> 	({										\
> 		kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));	\
> 		ioctl(fd, cmd, arg);							\
> 	})

Got it. Thanks for the explanation.

Chao
  
Jarkko Sakkinen Jan. 20, 2023, 11:28 p.m. UTC | #20
On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > 
> > > Feels bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > > 
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> > 
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar.  The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
> 
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> itself.
> 
> It feels absolutely ridiculous, but I think the best option is to do:
> 
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> 					 struct kvm_userspace_memory_region2)
> 
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_user_mem_region2 {
> 	__u32 slot;
> 	__u32 flags;
> 	__u64 guest_phys_addr;
> 	__u64 memory_size;
> 	__u64 userspace_addr;
> 	__u64 restricted_offset;
> 	__u32 restricted_fd;
> 	__u32 pad1;
> 	__u64 pad2[14];
> }
> 
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
> 
> Regarding the userspace side of things, please include Vishal's selftests in v11,
> it's impossible to properly review the uAPI changes without seeing the userspace
> side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.
> 
> [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com

+1

BR, Jarkko
  
Jarkko Sakkinen Jan. 20, 2023, 11:42 p.m. UTC | #21
On Tue, Jan 10, 2023 at 05:14:32PM +0800, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > > 
> > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > > 
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > > 
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar.  The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> > 
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > itself.
> 
> How is the size validation being bogus? I don't quite follow. Then we
> will use kvm_userspace_memory_region2 as the KVM internal alias, right?
> I see similar examples use different functions to handle different
> versions but it does look easier if we use alias for this function.
> 
> > 
> > It feels absolutely ridiculous, but I think the best option is to do:
> > 
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> > 					 struct kvm_userspace_memory_region2)
> 
> Just interesting, is 0x49 a safe number we can use? 
> 
> > 
> > /* for KVM_SET_USER_MEMORY_REGION2 */
> > struct kvm_user_mem_region2 {
> > 	__u32 slot;
> > 	__u32 flags;
> > 	__u64 guest_phys_addr;
> > 	__u64 memory_size;
> > 	__u64 userspace_addr;
> > 	__u64 restricted_offset;
> > 	__u32 restricted_fd;
> > 	__u32 pad1;
> > 	__u64 pad2[14];
> > }
> > 
> > And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
> 
> Okay, agree from KVM userspace API perspective this is more consistent
> with similar existing examples. I see several of them.
> 
> I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
> ioctl.

The current API in the patch set is trivial for C user space but for
any other more "constrained" language such as Rust a new ioctl would be
easier to adapt.

> > 
> > Regarding the userspace side of things, please include Vishal's selftests in v11,
> > it's impossible to properly review the uAPI changes without seeing the userspace
> > side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> > massage it into a set of patches that you can incorporate into your series.
> 
> Previously I included Vishal's selftests in the github repo, but not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review together if Vishal is fine.
> 
> Chao
> > 
> > [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com

BR, Jarkko
  

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bb2f709c0900..99352170c130 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@  yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1332,9 +1332,18 @@  yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,29 @@  It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext struct includes all fields of
+kvm_userspace_memory_region struct, while also adds additional fields for some
+other features. See below description of flags field for more information.
+It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
+  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
+
+- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
+  read-only. In this case, writes to this memory will be posted to userspace as
+  KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
+  memory backed by a file descriptor(fd) and userspace access to the fd may be
+  restricted. Userspace should use restricted_fd/restricted_offset in the
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
+  to guest. Userspace should guarantee not to map the same host physical address
+  indicated by restricted_fd/restricted_offset to different guest physical
+  addresses within multiple memslots. Failed to do this may result undefined
+  behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a8e379a3afee..690cb21010e7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,8 @@  config KVM
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select HAVE_KVM_MEMORY_ATTRIBUTES
+	select HAVE_KVM_RESTRICTED_MEM if X86_64
+	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7f850dfb4086..9a07380f8d3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12224,7 +12224,7 @@  void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a784e2b06625..02347e386ea2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@ 
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/restrictedmem.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -585,6 +586,9 @@  struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *restricted_file;
+	loff_t restricted_offset;
+	struct restrictedmem_notifier notifier;
 };
 
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
@@ -1123,9 +1127,9 @@  enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5d0941acb5bb..13bff963b8b0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@  struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@  struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index effdea5dd4f0..d605545d6dd1 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -89,3 +89,6 @@  config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
        bool
+
+config HAVE_KVM_RESTRICTED_MEM
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7f0f5e9f2406..b882eb2c76a2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1532,7 +1532,7 @@  static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1934,7 +1934,7 @@  static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2038,7 +2038,7 @@  int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -2050,7 +2050,7 @@  int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4698,6 +4698,33 @@  static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
+		     sizeof_field(struct kvm_userspace_memory_region, field));	\
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
+do {											\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
+		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
+		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4721,14 +4748,20 @@  static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+
+		kvm_sanity_check_user_mem_region_alias();
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if (mem.flags & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {