[RFC,V1,00/13] vdpa live update

Message ID 1704919215-91319-1-git-send-email-steven.sistare@oracle.com
Headers
Series vdpa live update |

Message

Steven Sistare Jan. 10, 2024, 8:40 p.m. UTC
  Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of 
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as qemu.
The device must be suspended during the update, but its dma mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each dma mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet available.  The vdpa device
descriptor, fd, remains open across the exec.

  ioctl(fd, VHOST_VDPA_SUSPEND)
  ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
  exec 

  ioctl(fd, VHOST_NEW_OWNER)

  issue ioctls to re-create vrings

  if VHOST_BACKEND_F_IOTLB_REMAP
      foreach dma mapping
          write(fd, {VHOST_IOTLB_REMAP, new_addr})

  ioctl(fd, VHOST_VDPA_SET_STATUS,
            ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)


Steve Sistare (13):
  vhost-vdpa: count pinned memory
  vhost-vdpa: pass mm to bind
  vhost-vdpa: VHOST_NEW_OWNER
  vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
  vhost-vdpa: VHOST_IOTLB_REMAP
  vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
  vhost-vdpa: flush workers on suspend
  vduse: flush workers on suspend
  vdpa_sim: reset must not run
  vdpa_sim: flush workers on suspend
  vdpa/mlx5: new owner capability
  vdpa_sim: new owner capability
  vduse: new owner capability

 drivers/vdpa/mlx5/net/mlx5_vnet.c  |   3 +-
 drivers/vdpa/vdpa_sim/vdpa_sim.c   |  24 ++++++-
 drivers/vdpa/vdpa_user/vduse_dev.c |  32 +++++++++
 drivers/vhost/vdpa.c               | 101 +++++++++++++++++++++++++++--
 drivers/vhost/vhost.c              |  15 +++++
 drivers/vhost/vhost.h              |   1 +
 include/uapi/linux/vhost.h         |  10 +++
 include/uapi/linux/vhost_types.h   |  15 ++++-
 8 files changed, 191 insertions(+), 10 deletions(-)
  

Comments

Jason Wang Jan. 11, 2024, 2:55 a.m. UTC | #1
On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> Live update is a technique wherein an application saves its state, exec's
> to an updated version of itself, and restores its state.  Clients of the
> application experience a brief suspension of service, on the order of
> 100's of milliseconds, but are otherwise unaffected.
>
> Define and implement interfaces that allow vdpa devices to be preserved
> across fork or exec, to support live update for applications such as qemu.
> The device must be suspended during the update, but its dma mappings are
> preserved, so the suspension is brief.
>
> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> accounting from one process to another.
>
> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> VHOST_NEW_OWNER is supported.
>
> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> address in the new process.
>
> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> require it, because the userland address of each dma mapping is discarded
> after being translated to a physical address.
>
> Here is a pseudo-code sequence for performing live update, based on
> suspend + reset because resume is not yet available.  The vdpa device
> descriptor, fd, remains open across the exec.
>
>   ioctl(fd, VHOST_VDPA_SUSPEND)
>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>   exec

Is there a userspace implementation as a reference?

>
>   ioctl(fd, VHOST_NEW_OWNER)
>
>   issue ioctls to re-create vrings
>
>   if VHOST_BACKEND_F_IOTLB_REMAP
>       foreach dma mapping
>           write(fd, {VHOST_IOTLB_REMAP, new_addr})

I think I need to understand the advantages of this approach. For
example, why it is better than

ioctl(VHOST_RESET_OWNER)
exec

ioctl(VHOST_SET_OWNER)

for each dma mapping
     ioctl(VHOST_IOTLB_UPDATE)

Thanks

>
>   ioctl(fd, VHOST_VDPA_SET_STATUS,
>             ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>
>
> Steve Sistare (13):
>   vhost-vdpa: count pinned memory
>   vhost-vdpa: pass mm to bind
>   vhost-vdpa: VHOST_NEW_OWNER
>   vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
>   vhost-vdpa: VHOST_IOTLB_REMAP
>   vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
>   vhost-vdpa: flush workers on suspend
>   vduse: flush workers on suspend
>   vdpa_sim: reset must not run
>   vdpa_sim: flush workers on suspend
>   vdpa/mlx5: new owner capability
>   vdpa_sim: new owner capability
>   vduse: new owner capability
>
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  |   3 +-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c   |  24 ++++++-
>  drivers/vdpa/vdpa_user/vduse_dev.c |  32 +++++++++
>  drivers/vhost/vdpa.c               | 101 +++++++++++++++++++++++++++--
>  drivers/vhost/vhost.c              |  15 +++++
>  drivers/vhost/vhost.h              |   1 +
>  include/uapi/linux/vhost.h         |  10 +++
>  include/uapi/linux/vhost_types.h   |  15 ++++-
>  8 files changed, 191 insertions(+), 10 deletions(-)
>
> --
> 2.39.3
>
  
Steven Sistare Jan. 17, 2024, 8:31 p.m. UTC | #2
On 1/10/2024 9:55 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> Live update is a technique wherein an application saves its state, exec's
>> to an updated version of itself, and restores its state.  Clients of the
>> application experience a brief suspension of service, on the order of
>> 100's of milliseconds, but are otherwise unaffected.
>>
>> Define and implement interfaces that allow vdpa devices to be preserved
>> across fork or exec, to support live update for applications such as qemu.
>> The device must be suspended during the update, but its dma mappings are
>> preserved, so the suspension is brief.
>>
>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
>> accounting from one process to another.
>>
>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
>> VHOST_NEW_OWNER is supported.
>>
>> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
>> address in the new process.
>>
>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
>> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
>> require it, because the userland address of each dma mapping is discarded
>> after being translated to a physical address.
>>
>> Here is a pseudo-code sequence for performing live update, based on
>> suspend + reset because resume is not yet available.  The vdpa device
>> descriptor, fd, remains open across the exec.
>>
>>   ioctl(fd, VHOST_VDPA_SUSPEND)
>>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>>   exec
> 
> Is there a userspace implementation as a reference?

I have working patches for qemu that use these ioctl's, but they depend on other 
qemu cpr patches that are a work in progress, and not posted yet.  I'm working on
that.

>>   ioctl(fd, VHOST_NEW_OWNER)
>>
>>   issue ioctls to re-create vrings
>>
>>   if VHOST_BACKEND_F_IOTLB_REMAP
>>       foreach dma mapping
>>           write(fd, {VHOST_IOTLB_REMAP, new_addr})
> 
> I think I need to understand the advantages of this approach. For
> example, why it is better than
> 
> ioctl(VHOST_RESET_OWNER)
> exec
> 
> ioctl(VHOST_SET_OWNER)
> 
> for each dma mapping
>      ioctl(VHOST_IOTLB_UPDATE)

That is slower.  VHOST_RESET_OWNER unbinds physical pages, and VHOST_IOTLB_UPDATE
rebinds them.  It costs multiple seconds for large memories, and is incurred during the
virtual machine's pause time during live update.  For comparison, the total pause time
for live update with vfio interfaces is ~100 millis.

However, the interaction with userland is so similar that the same code paths can be used.
In my qemu prototype, after cpr exec's new qemu:
  - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
  - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE

- Steve
  
Jason Wang Jan. 22, 2024, 4:12 a.m. UTC | #3
On Thu, Jan 18, 2024 at 4:32 AM Steven Sistare
<steven.sistare@oracle.com> wrote:
>
> On 1/10/2024 9:55 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare <steven.sistare@oracle.com> wrote:
> >>
> >> Live update is a technique wherein an application saves its state, exec's
> >> to an updated version of itself, and restores its state.  Clients of the
> >> application experience a brief suspension of service, on the order of
> >> 100's of milliseconds, but are otherwise unaffected.
> >>
> >> Define and implement interfaces that allow vdpa devices to be preserved
> >> across fork or exec, to support live update for applications such as qemu.
> >> The device must be suspended during the update, but its dma mappings are
> >> preserved, so the suspension is brief.
> >>
> >> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >> accounting from one process to another.
> >>
> >> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >> VHOST_NEW_OWNER is supported.
> >>
> >> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> >> address in the new process.
> >>
> >> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> >> require it, because the userland address of each dma mapping is discarded
> >> after being translated to a physical address.
> >>
> >> Here is a pseudo-code sequence for performing live update, based on
> >> suspend + reset because resume is not yet available.  The vdpa device
> >> descriptor, fd, remains open across the exec.
> >>
> >>   ioctl(fd, VHOST_VDPA_SUSPEND)
> >>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >>   exec
> >
> > Is there a userspace implementation as a reference?
>
> I have working patches for qemu that use these ioctl's, but they depend on other
> qemu cpr patches that are a work in progress, and not posted yet.  I'm working on
> that.

Ok.

>
> >>   ioctl(fd, VHOST_NEW_OWNER)
> >>
> >>   issue ioctls to re-create vrings
> >>
> >>   if VHOST_BACKEND_F_IOTLB_REMAP
> >>       foreach dma mapping
> >>           write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >
> > I think I need to understand the advantages of this approach. For
> > example, why it is better than
> >
> > ioctl(VHOST_RESET_OWNER)
> > exec
> >
> > ioctl(VHOST_SET_OWNER)
> >
> > for each dma mapping
> >      ioctl(VHOST_IOTLB_UPDATE)
>
> That is slower.  VHOST_RESET_OWNER unbinds physical pages, and VHOST_IOTLB_UPDATE
> rebinds them.  It costs multiple seconds for large memories, and is incurred during the
> virtual machine's pause time during live update.  For comparison, the total pause time
> for live update with vfio interfaces is ~100 millis.
>
> However, the interaction with userland is so similar that the same code paths can be used.
> In my qemu prototype, after cpr exec's new qemu:
>   - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
>   - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE
>
> - Steve
>

Ok, let's document this in the changlog.

Thanks