[v9,3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

Message ID 20221025151344.3784230-4-chao.p.peng@linux.intel.com
State New
Headers
Series KVM: mm: fd-based approach for supporting KVM |

Commit Message

Chao Peng Oct. 25, 2022, 3:13 p.m. UTC
  This new KVM exit allows userspace to handle memory-related errors. It
indicates an error happens in KVM at guest memory range [gpa, gpa+size).
The flags includes additional information for userspace to handle the
error. Currently bit 0 is defined as 'private memory' where '1'
indicates error happens due to private memory access and '0' indicates
error happens due to shared memory access.

When private memory is enabled, this new exit will be used for KVM to
exit to userspace for shared <-> private memory conversion in memory
encryption usage. In such usage, typically there are two kind of memory
conversions:
  - explicit conversion: happens when guest explicitly calls into KVM
    to map a range (as private or shared), KVM then exits to userspace
    to perform the map/unmap operations.
  - implicit conversion: happens in KVM page fault handler where KVM
    exits to userspace for an implicit conversion when the page is in a
    different state than requested (private or shared).

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++++
 2 files changed, 32 insertions(+)
  

Comments

Peter Maydell Oct. 25, 2022, 3:26 p.m. UTC | #1
On Tue, 25 Oct 2022 at 16:21, Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> The flags includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory' where '1'
> indicates error happens due to private memory access and '0' indicates
> error happens due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kind of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
>     to map a range (as private or shared), KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> +                       __u32 flags;
> +                       __u32 padding;
> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> +   private memory access when the bit is set. Otherwise the memory error is
> +   caused by shared memory access when the bit is clear.
> +
> +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> +may handle the error and return to KVM to retry the previous memory access.
> +

What's the difference between this and a plain old MMIO exit ?
Just that we can specify a wider size and some flags ?

-- PMM
  
Sean Christopherson Oct. 25, 2022, 4:17 p.m. UTC | #2
On Tue, Oct 25, 2022, Peter Maydell wrote:
> On Tue, 25 Oct 2022 at 16:21, Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +                       __u32 flags;
> > +                       __u32 padding;
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> 
> What's the difference between this and a plain old MMIO exit ?
> Just that we can specify a wider size and some flags ?

KVM_EXIT_MMIO is purely for cases where there is no memslot.  KVM_EXIT_MEMORY_FAULT
will be used for scenarios where there is a valid memslot for a GPA, but for
whatever reason KVM cannot map the memslot into the guest.

In this series, the new exit type is use to handle guest-initiated conversions
between shared and private memory.  By design, conversion requires explicit action
from userspace, and so even though KVM has a valid memslot, KVM needs to exit to
userspace to effectively forward the conversion request to userspace.

Long term, I also hope to convert all guest-triggered -EFAULT paths to instead
return KVM_EXIT_MEMORY_FAULT.  At minimum, returning KVM_EXIT_MEMORY_FAULT instead
of -EFAULT will allow KVM to provide userspace with the "bad" GPA when something
goes sideways, e.g. if faulting in the page failed because there's no valid
userspace mapping.

There have also been two potential use cases[1][2], though they both appear to have
been abandoned, where userspace would do something more than just kill the guest
in response to KVM_EXIT_MEMORY_FAULT.

[1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
[2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
  
Fuad Tabba Oct. 27, 2022, 10:27 a.m. UTC | #3
Hi,

On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> The flags includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory' where '1'
> indicates error happens due to private memory access and '0' indicates
> error happens due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kind of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
>     to map a range (as private or shared), KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>

I have tested the V8 version of this patch on arm64/qemu, and
considering this hasn't changed:
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad



>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> +                       __u32 flags;
> +                       __u32 padding;
> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> +   private memory access when the bit is set. Otherwise the memory error is
> +   caused by shared memory access when the bit is clear.
> +
> +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> +may handle the error and return to KVM to retry the previous memory access.
> +
>  ::
>
>      /* KVM_EXIT_NOTIFY */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f1ae45c10c94..fa60b032a405 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -538,6 +539,14 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
>                         __u32 flags;
>                 } notify;
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> +                       __u32 flags;
> +                       __u32 padding;
> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
>                 /* Fix the size of the union. */
>                 char padding[256];
>         };
> --
> 2.25.1
>
  
Chao Peng Oct. 28, 2022, 6:14 a.m. UTC | #4
On Thu, Oct 27, 2022 at 11:27:05AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kind of memory
> > conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM
> >     to map a range (as private or shared), KVM then exits to userspace
> >     to perform the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler where KVM
> >     exits to userspace for an implicit conversion when the page is in a
> >     different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> 
> I have tested the V8 version of this patch on arm64/qemu, and
> considering this hasn't changed:
> Tested-by: Fuad Tabba <tabba@google.com>

Appreciate your review and testing!

Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +                       __u32 flags;
> > +                       __u32 padding;
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >
> >      /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f1ae45c10c94..fa60b032a405 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI        35
> >  #define KVM_EXIT_RISCV_CSR        36
> >  #define KVM_EXIT_NOTIFY           37
> > +#define KVM_EXIT_MEMORY_FAULT     38
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -538,6 +539,14 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
> >                         __u32 flags;
> >                 } notify;
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> > +                       __u32 flags;
> > +                       __u32 padding;
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> >                 /* Fix the size of the union. */
> >                 char padding[256];
> >         };
> > --
> > 2.25.1
> >
  
Alex Bennée Nov. 15, 2022, 4:56 p.m. UTC | #5
Chao Peng <chao.p.peng@linux.intel.com> writes:

> This new KVM exit allows userspace to handle memory-related errors. It
> indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> The flags includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory' where '1'
> indicates error happens due to private memory access and '0' indicates
> error happens due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kind of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
>     to map a range (as private or shared), KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> +   private memory access when the bit is set. Otherwise the memory error is
> +   caused by shared memory access when the bit is clear.

What does a shared memory access failure entail?

If you envision any other failure modes it might be worth making it
explicit with additional flags. I also wonder if a bitmask makes sense if
there can only be one reason for a failure? Maybe all that is needed is
a reason enum?

> +
> +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> +may handle the error and return to KVM to retry the previous memory access.
> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f1ae45c10c94..fa60b032a405 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -538,6 +539,14 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
  
Chao Peng Nov. 16, 2022, 3:14 a.m. UTC | #6
On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kind of memory
> > conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM
> >     to map a range (as private or shared), KVM then exits to userspace
> >     to perform the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler where KVM
> >     exits to userspace for an implicit conversion when the page is in a
> >     different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >  
> > +::
> > +
> > +		/* KVM_EXIT_MEMORY_FAULT */
> > +		struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > +			__u32 flags;
> > +			__u32 padding;
> > +			__u64 gpa;
> > +			__u64 size;
> > +		} memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> 
> What does a shared memory access failure entail?

In the context of confidential computing usages, guest can issue a
shared memory access while the memory is actually private from the host
point of view. This exit with bit 0 cleared gives userspace a chance to
convert the private memory to shared memory on host.

> 
> If you envision any other failure modes it might be worth making it
> explicit with additional flags.

Sean mentioned some more usages[1][]2] other than the memory conversion
for confidential usage. But I would leave those flags being added in the
future after those usages being well discussed.

[1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
[2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com

> I also wonder if a bitmask makes sense if
> there can only be one reason for a failure? Maybe all that is needed is
> a reason enum?

Tough we only have one reason right now but we still want to leave room
for future extension. Enum can express a single value at once well but
bitmask makes it possible to express multiple orthogonal flags.

Chao
> 
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >  
> >      /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f1ae45c10c94..fa60b032a405 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI        35
> >  #define KVM_EXIT_RISCV_CSR        36
> >  #define KVM_EXIT_NOTIFY           37
> > +#define KVM_EXIT_MEMORY_FAULT     38
> >  
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -538,6 +539,14 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
> >  			__u32 flags;
> >  		} notify;
> > +		/* KVM_EXIT_MEMORY_FAULT */
> > +		struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > +			__u32 flags;
> > +			__u32 padding;
> > +			__u64 gpa;
> > +			__u64 size;
> > +		} memory;
> >  		/* Fix the size of the union. */
> >  		char padding[256];
> >  	};
> 
> 
> -- 
> Alex Bennée
  
Andy Lutomirski Nov. 16, 2022, 6:15 p.m. UTC | #7
On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> The flags includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory' where '1'
> indicates error happens due to private memory access and '0' indicates
> error happens due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kind of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
>     to map a range (as private or shared), KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst 
> b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The 
> userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on 
> RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
> 
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +

Would it make sense to also have a field for the access type (read, write, execute, etc)?  I realize that shared <-> private conversion doesn't strictly need this, but it seems like it could be useful for logging failures and also for avoiding a second immediate fault if the type gets converted but doesn't have the right protection yet.

(Obviously, if this were changed, KVM would need the ability to report that it doesn't actually know the mode.)

--Andy
  
Sean Christopherson Nov. 16, 2022, 6:48 p.m. UTC | #8
On Wed, Nov 16, 2022, Andy Lutomirski wrote:
> 
> 
> On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> > diff --git a/Documentation/virt/kvm/api.rst 
> > b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> > userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on 
> > RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > 
> > +::
> > +
> > +		/* KVM_EXIT_MEMORY_FAULT */
> > +		struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > +			__u32 flags;
> > +			__u32 padding;
> > +			__u64 gpa;
> > +			__u64 size;
> > +		} memory;
> > +
> 
> Would it make sense to also have a field for the access type (read, write,
> execute, etc)?  I realize that shared <-> private conversion doesn't strictly
> need this, but it seems like it could be useful for logging failures and also
> for avoiding a second immediate fault if the type gets converted but doesn't
> have the right protection yet.

I don't think a separate field is necessary, that info can be conveyed via flags.
Though maybe we should go straight to a u64 for flags.  Hmm, and maybe avoid bits
0-3 so that if/when RWX info is conveyed the flags can align with
PROT_{READ,WRITE,EXEC} and the EPT flags, e.g.

	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)

> (Obviously, if this were changed, KVM would need the ability to report that
> it doesn't actually know the mode.)
> 
> --Andy
  
Alex Bennée Nov. 16, 2022, 7:03 p.m. UTC | #9
Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
>> 
>> Chao Peng <chao.p.peng@linux.intel.com> writes:
>> 
>> > This new KVM exit allows userspace to handle memory-related errors. It
>> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
>> > The flags includes additional information for userspace to handle the
>> > error. Currently bit 0 is defined as 'private memory' where '1'
>> > indicates error happens due to private memory access and '0' indicates
>> > error happens due to shared memory access.
>> >
>> > When private memory is enabled, this new exit will be used for KVM to
>> > exit to userspace for shared <-> private memory conversion in memory
>> > encryption usage. In such usage, typically there are two kind of memory
>> > conversions:
>> >   - explicit conversion: happens when guest explicitly calls into KVM
>> >     to map a range (as private or shared), KVM then exits to userspace
>> >     to perform the map/unmap operations.
>> >   - implicit conversion: happens in KVM page fault handler where KVM
>> >     exits to userspace for an implicit conversion when the page is in a
>> >     different state than requested (private or shared).
>> >
>> > Suggested-by: Sean Christopherson <seanjc@google.com>
>> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> > ---
>> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>> >  include/uapi/linux/kvm.h       |  9 +++++++++
>> >  2 files changed, 32 insertions(+)
>> >
>> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> > index f3fa75649a78..975688912b8c 100644
>> > --- a/Documentation/virt/kvm/api.rst
>> > +++ b/Documentation/virt/kvm/api.rst
>> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
>> >  
>> > +::
>> > +
>> > +		/* KVM_EXIT_MEMORY_FAULT */
>> > +		struct {
>> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> > +			__u32 flags;
>> > +			__u32 padding;
>> > +			__u64 gpa;
>> > +			__u64 size;
>> > +		} memory;
>> > +
>> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
>> > +encountered a memory error which is not handled by KVM kernel module and
>> > +userspace may choose to handle it. The 'flags' field indicates the memory
>> > +properties of the exit.
>> > +
>> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
>> > +   private memory access when the bit is set. Otherwise the memory error is
>> > +   caused by shared memory access when the bit is clear.
>> 
>> What does a shared memory access failure entail?
>
> In the context of confidential computing usages, guest can issue a
> shared memory access while the memory is actually private from the host
> point of view. This exit with bit 0 cleared gives userspace a chance to
> convert the private memory to shared memory on host.

I think this should be explicit rather than implied by the absence of
another flag. Sean suggested you might want flags for RWX failures so
maybe something like:

	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
        KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)

which would allow you to signal the various failure modes of the shared
region, or that you had accessed private memory.

>
>> 
>> If you envision any other failure modes it might be worth making it
>> explicit with additional flags.
>
> Sean mentioned some more usages[1][]2] other than the memory conversion
> for confidential usage. But I would leave those flags being added in the
> future after those usages being well discussed.
>
> [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
>
>> I also wonder if a bitmask makes sense if
>> there can only be one reason for a failure? Maybe all that is needed is
>> a reason enum?
>
> Tough we only have one reason right now but we still want to leave room
> for future extension. Enum can express a single value at once well but
> bitmask makes it possible to express multiple orthogonal flags.

I agree if multiple orthogonal failures can occur at once a bitmask is
the right choice.

>
> Chao
>> 
>> > +
>> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
>> > +may handle the error and return to KVM to retry the previous memory access.
>> > +
>> >  ::
>> >  
>> >      /* KVM_EXIT_NOTIFY */
>> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> > index f1ae45c10c94..fa60b032a405 100644
>> > --- a/include/uapi/linux/kvm.h
>> > +++ b/include/uapi/linux/kvm.h
>> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>> >  #define KVM_EXIT_RISCV_SBI        35
>> >  #define KVM_EXIT_RISCV_CSR        36
>> >  #define KVM_EXIT_NOTIFY           37
>> > +#define KVM_EXIT_MEMORY_FAULT     38
>> >  
>> >  /* For KVM_EXIT_INTERNAL_ERROR */
>> >  /* Emulate instruction failed. */
>> > @@ -538,6 +539,14 @@ struct kvm_run {
>> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>> >  			__u32 flags;
>> >  		} notify;
>> > +		/* KVM_EXIT_MEMORY_FAULT */
>> > +		struct {
>> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> > +			__u32 flags;
>> > +			__u32 padding;
>> > +			__u64 gpa;
>> > +			__u64 size;
>> > +		} memory;
>> >  		/* Fix the size of the union. */
>> >  		char padding[256];
>> >  	};
>> 
>> 
>> -- 
>> Alex Bennée
  
Chao Peng Nov. 17, 2022, 1:42 p.m. UTC | #10
On Wed, Nov 16, 2022 at 06:48:43PM +0000, Sean Christopherson wrote:
> On Wed, Nov 16, 2022, Andy Lutomirski wrote:
> > 
> > 
> > On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> > > diff --git a/Documentation/virt/kvm/api.rst 
> > > b/Documentation/virt/kvm/api.rst
> > > index f3fa75649a78..975688912b8c 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> > > userspace should update the return
> > >  values of SBI call before resuming the VCPU. For more details on 
> > > RISC-V SBI
> > >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > > 
> > > +::
> > > +
> > > +		/* KVM_EXIT_MEMORY_FAULT */
> > > +		struct {
> > > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > > +			__u32 flags;
> > > +			__u32 padding;
> > > +			__u64 gpa;
> > > +			__u64 size;
> > > +		} memory;
> > > +
> > 
> > Would it make sense to also have a field for the access type (read, write,
> > execute, etc)?  I realize that shared <-> private conversion doesn't strictly
> > need this, but it seems like it could be useful for logging failures and also
> > for avoiding a second immediate fault if the type gets converted but doesn't
> > have the right protection yet.
> 
> I don't think a separate field is necessary, that info can be conveyed via flags.
> Though maybe we should go straight to a u64 for flags.

Yeah, I can do that.

> Hmm, and maybe avoid bits
> 0-3 so that if/when RWX info is conveyed the flags can align with
> PROT_{READ,WRITE,EXEC} and the EPT flags, e.g.

You mean avoiding bits 0-2, right, bit3 is not so special and can be used
for KVM_MEMORY_EXIT_FLAG_PRIVATE.

Chao
> 
> 	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
> 	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
> 	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
> 
> > (Obviously, if this were changed, KVM would need the ability to report that
> > it doesn't actually know the mode.)
> > 
> > --Andy
  
Chao Peng Nov. 17, 2022, 1:45 p.m. UTC | #11
On Wed, Nov 16, 2022 at 07:03:49PM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
> >> 
> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
> >> 
> >> > This new KVM exit allows userspace to handle memory-related errors. It
> >> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> >> > The flags includes additional information for userspace to handle the
> >> > error. Currently bit 0 is defined as 'private memory' where '1'
> >> > indicates error happens due to private memory access and '0' indicates
> >> > error happens due to shared memory access.
> >> >
> >> > When private memory is enabled, this new exit will be used for KVM to
> >> > exit to userspace for shared <-> private memory conversion in memory
> >> > encryption usage. In such usage, typically there are two kind of memory
> >> > conversions:
> >> >   - explicit conversion: happens when guest explicitly calls into KVM
> >> >     to map a range (as private or shared), KVM then exits to userspace
> >> >     to perform the map/unmap operations.
> >> >   - implicit conversion: happens in KVM page fault handler where KVM
> >> >     exits to userspace for an implicit conversion when the page is in a
> >> >     different state than requested (private or shared).
> >> >
> >> > Suggested-by: Sean Christopherson <seanjc@google.com>
> >> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> > ---
> >> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >> >  2 files changed, 32 insertions(+)
> >> >
> >> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> > index f3fa75649a78..975688912b8c 100644
> >> > --- a/Documentation/virt/kvm/api.rst
> >> > +++ b/Documentation/virt/kvm/api.rst
> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >> >  
> >> > +::
> >> > +
> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> > +		struct {
> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> > +			__u32 flags;
> >> > +			__u32 padding;
> >> > +			__u64 gpa;
> >> > +			__u64 size;
> >> > +		} memory;
> >> > +
> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> >> > +encountered a memory error which is not handled by KVM kernel module and
> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
> >> > +properties of the exit.
> >> > +
> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> >> > +   private memory access when the bit is set. Otherwise the memory error is
> >> > +   caused by shared memory access when the bit is clear.
> >> 
> >> What does a shared memory access failure entail?
> >
> > In the context of confidential computing usages, guest can issue a
> > shared memory access while the memory is actually private from the host
> > point of view. This exit with bit 0 cleared gives userspace a chance to
> > convert the private memory to shared memory on host.
> 
> I think this should be explicit rather than implied by the absence of
> another flag. Sean suggested you might want flags for RWX failures so
> maybe something like:
> 
> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)

Yes, but I would not add 'SHARED' to RWX, they are not share memory
specific, private memory can also set them once introduced.

Thanks,
Chao
> 
> which would allow you to signal the various failure modes of the shared
> region, or that you had accessed private memory.
> 
> >
> >> 
> >> If you envision any other failure modes it might be worth making it
> >> explicit with additional flags.
> >
> > Sean mentioned some more usages[1][]2] other than the memory conversion
> > for confidential usage. But I would leave those flags being added in the
> > future after those usages being well discussed.
> >
> > [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> > [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
> >
> >> I also wonder if a bitmask makes sense if
> >> there can only be one reason for a failure? Maybe all that is needed is
> >> a reason enum?
> >
> > Tough we only have one reason right now but we still want to leave room
> > for future extension. Enum can express a single value at once well but
> > bitmask makes it possible to express multiple orthogonal flags.
> 
> I agree if multiple orthogonal failures can occur at once a bitmask is
> the right choice.
> 
> >
> > Chao
> >> 
> >> > +
> >> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> >> > +may handle the error and return to KVM to retry the previous memory access.
> >> > +
> >> >  ::
> >> >  
> >> >      /* KVM_EXIT_NOTIFY */
> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> > index f1ae45c10c94..fa60b032a405 100644
> >> > --- a/include/uapi/linux/kvm.h
> >> > +++ b/include/uapi/linux/kvm.h
> >> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >> >  #define KVM_EXIT_RISCV_SBI        35
> >> >  #define KVM_EXIT_RISCV_CSR        36
> >> >  #define KVM_EXIT_NOTIFY           37
> >> > +#define KVM_EXIT_MEMORY_FAULT     38
> >> >  
> >> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >> >  /* Emulate instruction failed. */
> >> > @@ -538,6 +539,14 @@ struct kvm_run {
> >> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
> >> >  			__u32 flags;
> >> >  		} notify;
> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> > +		struct {
> >> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> > +			__u32 flags;
> >> > +			__u32 padding;
> >> > +			__u64 gpa;
> >> > +			__u64 size;
> >> > +		} memory;
> >> >  		/* Fix the size of the union. */
> >> >  		char padding[256];
> >> >  	};
> >> 
> >> 
> >> -- 
> >> Alex Bennée
> 
> 
> -- 
> Alex Bennée
  
Alex Bennée Nov. 17, 2022, 3:08 p.m. UTC | #12
Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Wed, Nov 16, 2022 at 07:03:49PM +0000, Alex Bennée wrote:
>> 
>> Chao Peng <chao.p.peng@linux.intel.com> writes:
>> 
>> > On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
>> >> 
>> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
>> >> 
>> >> > This new KVM exit allows userspace to handle memory-related errors. It
>> >> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
>> >> > The flags includes additional information for userspace to handle the
>> >> > error. Currently bit 0 is defined as 'private memory' where '1'
>> >> > indicates error happens due to private memory access and '0' indicates
>> >> > error happens due to shared memory access.
>> >> >
>> >> > When private memory is enabled, this new exit will be used for KVM to
>> >> > exit to userspace for shared <-> private memory conversion in memory
>> >> > encryption usage. In such usage, typically there are two kind of memory
>> >> > conversions:
>> >> >   - explicit conversion: happens when guest explicitly calls into KVM
>> >> >     to map a range (as private or shared), KVM then exits to userspace
>> >> >     to perform the map/unmap operations.
>> >> >   - implicit conversion: happens in KVM page fault handler where KVM
>> >> >     exits to userspace for an implicit conversion when the page is in a
>> >> >     different state than requested (private or shared).
>> >> >
>> >> > Suggested-by: Sean Christopherson <seanjc@google.com>
>> >> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> >> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> >> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> >> > ---
>> >> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>> >> >  include/uapi/linux/kvm.h       |  9 +++++++++
>> >> >  2 files changed, 32 insertions(+)
>> >> >
>> >> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> >> > index f3fa75649a78..975688912b8c 100644
>> >> > --- a/Documentation/virt/kvm/api.rst
>> >> > +++ b/Documentation/virt/kvm/api.rst
>> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>> >> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
>> >> >  
>> >> > +::
>> >> > +
>> >> > +		/* KVM_EXIT_MEMORY_FAULT */
>> >> > +		struct {
>> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> >> > +			__u32 flags;
>> >> > +			__u32 padding;
>> >> > +			__u64 gpa;
>> >> > +			__u64 size;
>> >> > +		} memory;
>> >> > +
>> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
>> >> > +encountered a memory error which is not handled by KVM kernel module and
>> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
>> >> > +properties of the exit.
>> >> > +
>> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
>> >> > +   private memory access when the bit is set. Otherwise the memory error is
>> >> > +   caused by shared memory access when the bit is clear.
>> >> 
>> >> What does a shared memory access failure entail?
>> >
>> > In the context of confidential computing usages, guest can issue a
>> > shared memory access while the memory is actually private from the host
>> > point of view. This exit with bit 0 cleared gives userspace a chance to
>> > convert the private memory to shared memory on host.
>> 
>> I think this should be explicit rather than implied by the absence of
>> another flag. Sean suggested you might want flags for RWX failures so
>> maybe something like:
>> 
>> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
>> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
>> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
>>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
>
> Yes, but I would not add 'SHARED' to RWX, they are not share memory
> specific, private memory can also set them once introduced.

OK so how about:

 	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
 	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
 	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
        KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
        KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)

>
> Thanks,
> Chao
>> 
>> which would allow you to signal the various failure modes of the shared
>> region, or that you had accessed private memory.
>> 
>> >
>> >> 
>> >> If you envision any other failure modes it might be worth making it
>> >> explicit with additional flags.
>> >
>> > Sean mentioned some more usages[1][]2] other than the memory conversion
>> > for confidential usage. But I would leave those flags being added in the
>> > future after those usages being well discussed.
>> >
>> > [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
>> > [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
>> >
>> >> I also wonder if a bitmask makes sense if
>> >> there can only be one reason for a failure? Maybe all that is needed is
>> >> a reason enum?
>> >
>> > Tough we only have one reason right now but we still want to leave room
>> > for future extension. Enum can express a single value at once well but
>> > bitmask makes it possible to express multiple orthogonal flags.
>> 
>> I agree if multiple orthogonal failures can occur at once a bitmask is
>> the right choice.
>> 
>> >
>> > Chao
>> >> 
>> >> > +
>> >> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
>> >> > +may handle the error and return to KVM to retry the previous memory access.
>> >> > +
>> >> >  ::
>> >> >  
>> >> >      /* KVM_EXIT_NOTIFY */
>> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> >> > index f1ae45c10c94..fa60b032a405 100644
>> >> > --- a/include/uapi/linux/kvm.h
>> >> > +++ b/include/uapi/linux/kvm.h
>> >> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>> >> >  #define KVM_EXIT_RISCV_SBI        35
>> >> >  #define KVM_EXIT_RISCV_CSR        36
>> >> >  #define KVM_EXIT_NOTIFY           37
>> >> > +#define KVM_EXIT_MEMORY_FAULT     38
>> >> >  
>> >> >  /* For KVM_EXIT_INTERNAL_ERROR */
>> >> >  /* Emulate instruction failed. */
>> >> > @@ -538,6 +539,14 @@ struct kvm_run {
>> >> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>> >> >  			__u32 flags;
>> >> >  		} notify;
>> >> > +		/* KVM_EXIT_MEMORY_FAULT */
>> >> > +		struct {
>> >> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> >> > +			__u32 flags;
>> >> > +			__u32 padding;
>> >> > +			__u64 gpa;
>> >> > +			__u64 size;
>> >> > +		} memory;
>> >> >  		/* Fix the size of the union. */
>> >> >  		char padding[256];
>> >> >  	};
>> >> 
>> >> 
>> >> -- 
>> >> Alex Bennée
>> 
>> 
>> -- 
>> Alex Bennée
  
Chao Peng Nov. 18, 2022, 1:32 a.m. UTC | #13
On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Wed, Nov 16, 2022 at 07:03:49PM +0000, Alex Bennée wrote:
> >> 
> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
> >> 
> >> > On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
> >> >> 
> >> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
> >> >> 
> >> >> > This new KVM exit allows userspace to handle memory-related errors. It
> >> >> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> >> >> > The flags includes additional information for userspace to handle the
> >> >> > error. Currently bit 0 is defined as 'private memory' where '1'
> >> >> > indicates error happens due to private memory access and '0' indicates
> >> >> > error happens due to shared memory access.
> >> >> >
> >> >> > When private memory is enabled, this new exit will be used for KVM to
> >> >> > exit to userspace for shared <-> private memory conversion in memory
> >> >> > encryption usage. In such usage, typically there are two kind of memory
> >> >> > conversions:
> >> >> >   - explicit conversion: happens when guest explicitly calls into KVM
> >> >> >     to map a range (as private or shared), KVM then exits to userspace
> >> >> >     to perform the map/unmap operations.
> >> >> >   - implicit conversion: happens in KVM page fault handler where KVM
> >> >> >     exits to userspace for an implicit conversion when the page is in a
> >> >> >     different state than requested (private or shared).
> >> >> >
> >> >> > Suggested-by: Sean Christopherson <seanjc@google.com>
> >> >> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> >> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> >> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> >> > ---
> >> >> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >> >> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >> >> >  2 files changed, 32 insertions(+)
> >> >> >
> >> >> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> >> > index f3fa75649a78..975688912b8c 100644
> >> >> > --- a/Documentation/virt/kvm/api.rst
> >> >> > +++ b/Documentation/virt/kvm/api.rst
> >> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >> >> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >> >> >  
> >> >> > +::
> >> >> > +
> >> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> >> > +		struct {
> >> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> >> > +			__u32 flags;
> >> >> > +			__u32 padding;
> >> >> > +			__u64 gpa;
> >> >> > +			__u64 size;
> >> >> > +		} memory;
> >> >> > +
> >> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> >> >> > +encountered a memory error which is not handled by KVM kernel module and
> >> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
> >> >> > +properties of the exit.
> >> >> > +
> >> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> >> >> > +   private memory access when the bit is set. Otherwise the memory error is
> >> >> > +   caused by shared memory access when the bit is clear.
> >> >> 
> >> >> What does a shared memory access failure entail?
> >> >
> >> > In the context of confidential computing usages, guest can issue a
> >> > shared memory access while the memory is actually private from the host
> >> > point of view. This exit with bit 0 cleared gives userspace a chance to
> >> > convert the private memory to shared memory on host.
> >> 
> >> I think this should be explicit rather than implied by the absence of
> >> another flag. Sean suggested you might want flags for RWX failures so
> >> maybe something like:
> >> 
> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
> >
> > Yes, but I would not add 'SHARED' to RWX, they are not share memory
> > specific, private memory can also set them once introduced.
> 
> OK so how about:
> 
>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)

We don't actually need a new bit, the opposite side of private is
shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
'shared'.

Chao
> 
> >
> > Thanks,
> > Chao
> >> 
> >> which would allow you to signal the various failure modes of the shared
> >> region, or that you had accessed private memory.
> >> 
> >> >
> >> >> 
> >> >> If you envision any other failure modes it might be worth making it
> >> >> explicit with additional flags.
> >> >
> >> > Sean mentioned some more usages[1][]2] other than the memory conversion
> >> > for confidential usage. But I would leave those flags being added in the
> >> > future after those usages being well discussed.
> >> >
> >> > [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> >> > [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
> >> >
> >> >> I also wonder if a bitmask makes sense if
> >> >> there can only be one reason for a failure? Maybe all that is needed is
> >> >> a reason enum?
> >> >
> >> > Tough we only have one reason right now but we still want to leave room
> >> > for future extension. Enum can express a single value at once well but
> >> > bitmask makes it possible to express multiple orthogonal flags.
> >> 
> >> I agree if multiple orthogonal failures can occur at once a bitmask is
> >> the right choice.
> >> 
> >> >
> >> > Chao
> >> >> 
> >> >> > +
> >> >> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> >> >> > +may handle the error and return to KVM to retry the previous memory access.
> >> >> > +
> >> >> >  ::
> >> >> >  
> >> >> >      /* KVM_EXIT_NOTIFY */
> >> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> >> > index f1ae45c10c94..fa60b032a405 100644
> >> >> > --- a/include/uapi/linux/kvm.h
> >> >> > +++ b/include/uapi/linux/kvm.h
> >> >> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >> >> >  #define KVM_EXIT_RISCV_SBI        35
> >> >> >  #define KVM_EXIT_RISCV_CSR        36
> >> >> >  #define KVM_EXIT_NOTIFY           37
> >> >> > +#define KVM_EXIT_MEMORY_FAULT     38
> >> >> >  
> >> >> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >> >> >  /* Emulate instruction failed. */
> >> >> > @@ -538,6 +539,14 @@ struct kvm_run {
> >> >> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
> >> >> >  			__u32 flags;
> >> >> >  		} notify;
> >> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> >> > +		struct {
> >> >> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> >> > +			__u32 flags;
> >> >> > +			__u32 padding;
> >> >> > +			__u64 gpa;
> >> >> > +			__u64 size;
> >> >> > +		} memory;
> >> >> >  		/* Fix the size of the union. */
> >> >> >  		char padding[256];
> >> >> >  	};
> >> >> 
> >> >> 
> >> >> -- 
> >> >> Alex Bennée
> >> 
> >> 
> >> -- 
> >> Alex Bennée
> 
> 
> -- 
> Alex Bennée
  
Alex Bennée Nov. 18, 2022, 1:23 p.m. UTC | #14
Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Bennée wrote:
>> 
<snip>
>> >> >> > +
>> >> >> > +		/* KVM_EXIT_MEMORY_FAULT */
>> >> >> > +		struct {
>> >> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> >> >> > +			__u32 flags;
>> >> >> > +			__u32 padding;
>> >> >> > +			__u64 gpa;
>> >> >> > +			__u64 size;
>> >> >> > +		} memory;
>> >> >> > +
>> >> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
>> >> >> > +encountered a memory error which is not handled by KVM kernel module and
>> >> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
>> >> >> > +properties of the exit.
>> >> >> > +
>> >> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
>> >> >> > +   private memory access when the bit is set. Otherwise the memory error is
>> >> >> > +   caused by shared memory access when the bit is clear.
>> >> >> 
>> >> >> What does a shared memory access failure entail?
>> >> >
>> >> > In the context of confidential computing usages, guest can issue a
>> >> > shared memory access while the memory is actually private from the host
>> >> > point of view. This exit with bit 0 cleared gives userspace a chance to
>> >> > convert the private memory to shared memory on host.
>> >> 
>> >> I think this should be explicit rather than implied by the absence of
>> >> another flag. Sean suggested you might want flags for RWX failures so
>> >> maybe something like:
>> >> 
>> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
>> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
>> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
>> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
>> >
>> > Yes, but I would not add 'SHARED' to RWX, they are not share memory
>> > specific, private memory can also set them once introduced.
>> 
>> OK so how about:
>> 
>>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
>>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
>>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
>>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
>>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)
>
> We don't actually need a new bit, the opposite side of private is
> shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> 'shared'.

If that is always true and we never expect a 3rd type of memory that is
fine. But given we are leaving room for expansion having an explicit bit
allows for that as well as making cases of forgetting to set the flags
more obvious.
  
Sean Christopherson Nov. 18, 2022, 3:59 p.m. UTC | #15
On Fri, Nov 18, 2022, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Bennée wrote:
> >> >> I think this should be explicit rather than implied by the absence of
> >> >> another flag. Sean suggested you might want flags for RWX failures so
> >> >> maybe something like:
> >> >> 
> >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
> >> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
> >> >
> >> > Yes, but I would not add 'SHARED' to RWX, they are not share memory
> >> > specific, private memory can also set them once introduced.
> >> 
> >> OK so how about:
> >> 
> >>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
> >>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
> >>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
> >>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)
> >
> > We don't actually need a new bit, the opposite side of private is
> > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > 'shared'.
> 
> If that is always true and we never expect a 3rd type of memory that is
> fine. But given we are leaving room for expansion having an explicit bit
> allows for that as well as making cases of forgetting to set the flags
> more obvious.

Hrm, I'm on the fence.

A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types in
this category, the baseline could always be "private".

I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
to implement and maintain, but evetually we'll up with flags that are paired with
an implicit state, e.g. see the many #PF error codes in x86.  In other words,
inevitably KVM will need to define the default/base state of the access, at which
point the base state for SHARED vs. PRIVATE is "undefined".  

The RWX bits are in the same boat, e.g. the READ flag isn't strictly necessary.
I was thinking more of the KVM_SET_MEMORY_ATTRIBUTES ioctl(), which does need
the full RWX gamut, when I typed out that response.

So I would say if we add an explicit READ flag, then we might as well add an explicit
PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.
  
Chao Peng Nov. 22, 2022, 9:50 a.m. UTC | #16
On Fri, Nov 18, 2022 at 03:59:12PM +0000, Sean Christopherson wrote:
> On Fri, Nov 18, 2022, Alex Benn?e wrote:
> > 
> > Chao Peng <chao.p.peng@linux.intel.com> writes:
> > 
> > > On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Benn?e wrote:
> > >> >> I think this should be explicit rather than implied by the absence of
> > >> >> another flag. Sean suggested you might want flags for RWX failures so
> > >> >> maybe something like:
> > >> >> 
> > >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> > >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> > >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
> > >> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
> > >> >
> > >> > Yes, but I would not add 'SHARED' to RWX, they are not share memory
> > >> > specific, private memory can also set them once introduced.
> > >> 
> > >> OK so how about:
> > >> 
> > >>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
> > >>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
> > >>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
> > >>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
> > >>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)
> > >
> > > We don't actually need a new bit, the opposite side of private is
> > > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > > 'shared'.
> > 
> > If that is always true and we never expect a 3rd type of memory that is
> > fine. But given we are leaving room for expansion having an explicit bit
> > allows for that as well as making cases of forgetting to set the flags
> > more obvious.
> 
> Hrm, I'm on the fence.
> 
> A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types in
> this category, the baseline could always be "private".

The baseline for the current code is actually "shared".

> 
> I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
> to implement and maintain, but evetually we'll up with flags that are paired with
> an implicit state, e.g. see the many #PF error codes in x86.  In other words,
> inevitably KVM will need to define the default/base state of the access, at which
> point the base state for SHARED vs. PRIVATE is "undefined".  

Current memory conversion for confidential usage is bi-directional so we
already need both private and shared states and if we use one bit for
both "shared" and "private" then we will have to define the default
state, e.g, currently the default state is "shared" when we define

	KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)

> 
> The RWX bits are in the same boat, e.g. the READ flag isn't strictly necessary.
> I was thinking more of the KVM_SET_MEMORY_ATTRIBUTES ioctl(), which does need
> the full RWX gamut, when I typed out that response.

For KVM_SET_MEMORY_ATTRIBUTES it's reasonable to add RWX bits and match
that to the permission bits definition in EPT entry.

> 
> So I would say if we add an explicit READ flag, then we might as well add an explicit
> PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.

Since we assume the default state is shared, so we actually only need a
PRIVATE flag, e.g. there is no SHARED flag and will ignore the RWX for now.

Chao
  
Sean Christopherson Nov. 23, 2022, 6:02 p.m. UTC | #17
On Tue, Nov 22, 2022, Chao Peng wrote:
> On Fri, Nov 18, 2022 at 03:59:12PM +0000, Sean Christopherson wrote:
> > On Fri, Nov 18, 2022, Alex Benn?e wrote:
> > > > We don't actually need a new bit, the opposite side of private is
> > > > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > > > 'shared'.
> > > 
> > > If that is always true and we never expect a 3rd type of memory that is
> > > fine. But given we are leaving room for expansion having an explicit bit
> > > allows for that as well as making cases of forgetting to set the flags
> > > more obvious.
> > 
> > Hrm, I'm on the fence.
> > 
> > A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types in
> > this category, the baseline could always be "private".
> 
> The baseline for the current code is actually "shared".

Ah, right, the baseline needs to be "shared" so that legacy code doesn't end up
with impossible states.

> > I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
> > to implement and maintain, but evetually we'll up with flags that are paired with
> > an implicit state, e.g. see the many #PF error codes in x86.  In other words,
> > inevitably KVM will need to define the default/base state of the access, at which
> > point the base state for SHARED vs. PRIVATE is "undefined".  
> 
> Current memory conversion for confidential usage is bi-directional so we
> already need both private and shared states and if we use one bit for
> both "shared" and "private" then we will have to define the default
> state, e.g, currently the default state is "shared" when we define
> 
> 	KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)

...

> > So I would say if we add an explicit READ flag, then we might as well add an explicit
> > PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.
> 
> Since we assume the default state is shared, so we actually only need a
> PRIVATE flag, e.g. there is no SHARED flag and will ignore the RWX for now.

Yeah, I'm leading towards "shared" being the implied default state.  Ditto for
"read" if/when we need to communicate write/execute information  E.g. for VMs
that don't support guest private memory, the "shared" flag is in some ways
nonsensical.  Worst case scenario, e.g. if we end up with variations of "shared",
we'll need something like KVM_MEMORY_EXIT_FLAG_SHARED_RESTRICTIVE or whatever,
but the basic "shared" default will still work.
  

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f3fa75649a78..975688912b8c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6537,6 +6537,29 @@  array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
+encountered a memory error which is not handled by KVM kernel module and
+userspace may choose to handle it. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+   private memory access when the bit is set. Otherwise the memory error is
+   caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range the error occurs at. The userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f1ae45c10c94..fa60b032a405 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -538,6 +539,14 @@  struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union. */
 		char padding[256];
 	};