[00/11] Use copy_process in vhost layer

Message ID 20230310220332.5309-1-michael.christie@oracle.com
Headers
Series Use copy_process in vhost layer |

Message

Mike Christie March 10, 2023, 10:03 p.m. UTC
  The following patches were made over Linus's tree and apply over next. They
allow the vhost layer to use copy_process instead of using
workqueue_structs to create worker threads for VM's devices.

Details:
Qemu will create vhost devices in the kernel which perform network or SCSI,
IO and perform management operations from worker threads created with the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The patches allow the vhost layer to do a copy_process from the thread that
does the VHOST_SET_OWNER ioctl like how io_uring does a copy_process against
its userspace thread. This allows the vhost layer's worker threads to inherit
cgroups, namespaces, address space, etc. This worker thread will also be
accounted for against that owner/parent process's RLIMIT_NPROC limit which
will prevent malicious users from creating VMs with almost unlimited threads
when these patches are used:

https://lore.kernel.org/all/20211207025117.23551-1-michael.christie@oracle.com/

which allow us to create a worker thread per N virtqueues.

V12:
- Change how new fields were added to kernel_clone_args so they don't
unnecessarily expand the size of the struct.
- Use named bitfields and make kthread and io_thread work similarly as
the new fields.
- Allow copy_process users to pass in the name of the new task and
convert kthreads and vhost_tasks.
V11:
- Rebase.
V10:
- Eric's cleanup patches my vhost flush cleanup patches are merged
upstream, so rebase against Linus's tree which has everything.
V9:
- Rebase against Eric's kthread-cleanups-for-v5.19 branch. Drop patches
no longer needed due to kernel clone arg and pf io worker patches in that
branch.
V8:
- Fix kzalloc GFP use.
- Fix email subject version number.
V7:
- Drop generic user_worker_* helpers and replace with vhost_task specific
  ones.
- Drop autoreap patch. Use kernel_wait4 instead.
- Fix issue where vhost.ko could be removed while the worker function is
  still running.
V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
  has all deps merged. Also add patch to remove create_io_thread code as
  it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
  case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
  thread creation for the wake up, vm and singal cses.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
  added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
  had that flag set. Also dropped patches that passed worker flags to
  copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
  where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
  function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
  make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
  vhost conversion patch.
  

Comments

Linus Torvalds March 11, 2023, 5:21 p.m. UTC | #1
On Fri, Mar 10, 2023 at 2:04 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> The following patches were made over Linus's tree and apply over next. They
> allow the vhost layer to use copy_process instead of using
> workqueue_structs to create worker threads for VM's devices.

Ok, all these patches looked fine to me from a quick scan - nothing
that I reacted to as objectionable, and several of them looked like
nice cleanups.

The only one I went "Why do you do it that way" for was in 10/11
(entirely internal to vhost, so I don't feel too strongly about this)
how you made "struct vhost_worker" be a pointer in "struct vhost_dev".

It _looks_ to me like it could just have been an embedded structure
rather than a separate allocation.

IOW, why do

   vhost_dev->worker

instead of doing

  vhost_dev.worker

and just having it all in the same allocation?

Not a big deal. Maybe you wanted the 'test if worker pointer is NULL'
code to stay around, and basically use that pointer as a flag too. Or
maybe there is some other reason you want to keep that separate..

               Linus
  
Mike Christie March 11, 2023, 5:49 p.m. UTC | #2
On 3/11/23 11:21 AM, Linus Torvalds wrote:
> On Fri, Mar 10, 2023 at 2:04 PM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> The following patches were made over Linus's tree and apply over next. They
>> allow the vhost layer to use copy_process instead of using
>> workqueue_structs to create worker threads for VM's devices.
> 
> Ok, all these patches looked fine to me from a quick scan - nothing
> that I reacted to as objectionable, and several of them looked like
> nice cleanups.
> 
> The only one I went "Why do you do it that way" for was in 10/11
> (entirely internal to vhost, so I don't feel too strongly about this)
> how you made "struct vhost_worker" be a pointer in "struct vhost_dev".
> 
> It _looks_ to me like it could just have been an embedded structure
> rather than a separate allocation.
> 
> IOW, why do
> 
>    vhost_dev->worker
> 
> instead of doing
> 
>   vhost_dev.worker
> 
> and just having it all in the same allocation?
> 
> Not a big deal. Maybe you wanted the 'test if worker pointer is NULL'
> code to stay around, and basically use that pointer as a flag too. Or
> maybe there is some other reason you want to keep that separate..
> 

There were 2 reasons:
1. Yeah, we needed a flag to indicate that the worker was not setup
for the cases like where userspace just opens the dev then closes it
without doing the IOCTL that does vhost_dev_set_owner.

2. I could have handled #1 by embedding the worker in the vhost_dev
and then just testing worker.vtsk. However, I have a followup patchset
that allows us to create multiple worker threads per device. For
that patchset I then do:

- if (vhost_dev->worker)

+ if (vhost_dev->workers)

so I think it just saved me some typing.
  
Michael S. Tsirkin March 11, 2023, 7:15 p.m. UTC | #3
On Sat, Mar 11, 2023 at 09:21:14AM -0800, Linus Torvalds wrote:
> On Fri, Mar 10, 2023 at 2:04 PM Mike Christie
> <michael.christie@oracle.com> wrote:
> >
> > The following patches were made over Linus's tree and apply over next. They
> > allow the vhost layer to use copy_process instead of using
> > workqueue_structs to create worker threads for VM's devices.
> 
> Ok, all these patches looked fine to me from a quick scan - nothing
> that I reacted to as objectionable, and several of them looked like
> nice cleanups.
> 
> The only one I went "Why do you do it that way" for was in 10/11
> (entirely internal to vhost, so I don't feel too strongly about this)
> how you made "struct vhost_worker" be a pointer in "struct vhost_dev".
> 
> It _looks_ to me like it could just have been an embedded structure
> rather than a separate allocation.
> 
> IOW, why do
> 
>    vhost_dev->worker
> 
> instead of doing
> 
>   vhost_dev.worker
> 
> and just having it all in the same allocation?
> 
> Not a big deal. Maybe you wanted the 'test if worker pointer is NULL'
> code to stay around, and basically use that pointer as a flag too. Or
> maybe there is some other reason you want to keep that separate..
> 
>                Linus

I agree with Linus here, slightly better embedded, but no huge deal.
Which tree is this going on?
If not mine here's my ack:

Acked-by: Michael S. Tsirkin <mst@redhat.com>
  
Christian Brauner March 12, 2023, 10:07 a.m. UTC | #4
From: Christian Brauner (Microsoft) <brauner@kernel.org>


On Fri, 10 Mar 2023 16:03:21 -0600, Mike Christie wrote:
> The following patches were made over Linus's tree and apply over next. They
> allow the vhost layer to use copy_process instead of using
> workqueue_structs to create worker threads for VM's devices.
> 
> Details:
> Qemu will create vhost devices in the kernel which perform network or SCSI,
> IO and perform management operations from worker threads created with the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
> 
> [...]

I've picked this up now and it should show up in -next tomorrow.
Thanks for the patience,

tree: git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux kernel.user_worker

[01/11] csky: Remove kernel_thread declaration
        commit: e0a98139c162af9601ffa8d6db88dbe745f64b3c
[02/11] kernel: Allow a kernel thread's name to be set in copy_process
        commit: cf587db2ee0261c74d04f61f39783db88a0b65e4
[03/11] kthread: Pass in the thread's name during creation
        commit: 73e0c116594d99f807754b15e474690635a87249
[04/11] kernel: Make io_thread and kthread bit fields
        commit: c81cc5819faf5dd77124f5086aa654482281ac37
[05/11] fork/vm: Move common PF_IO_WORKER behavior to new flag
        commit: 54e6842d0775ba76db65cbe21311c3ca466e663d
[06/11] fork: add kernel_clone_args flag to not dup/clone files
        commit: 11f3f500ec8a75c96087f3bed87aa2b1c5de7498
[07/11] fork: Add kernel_clone_args flag to ignore signals
        commit: 094717586bf71ac20ae3b240d2654d826634b21e
[08/11] fork: allow kernel code to call copy_process
        commit: 89c8e98d8cfb0656dbeb648572df5b13e372247d
[09/11] vhost_task: Allow vhost layer to use copy_process
        commit: 77feab3c4156229511b378f15c0a16574d9c08ea
[10/11] vhost: move worker thread fields to new struct
        commit: d45e2b73ead0daec350680fbf20144a2d3670186
[11/11] vhost: use vhost_tasks for worker threads
        commit: 5ab18f4b061ef24a71eea9ffafebd1a82ae2f514