[RFC,0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND

Message ID: 20230518000920.191583-1-michael.christie@oracle.com
Series: vhost_tasks: Use CLONE_THREAD/SIGHAND

Message

Mike Christie May 18, 2023, 12:09 a.m. UTC
  This patch allows the vhost and vhost_task code to use CLONE_THREAD,
CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
normal testing, haven't converted vsock and vdpa, and I know you guys
will not like the first patch. However, I think it better shows what
we need from the signal code and how we can support signals in the
vhost_task layer.

Note that I took the super simple route and kicked off some work to
the system workqueue. We can do more invasive approaches:
1. Modify the vhost drivers so they can check for IO completions using
a non-blocking interface. We then don't need to run from the system
workqueue and can run from the vhost_task.

2. We could drop patch 1 and just say we are doing a polling type
of approach. We then modify the vhost layer similar to #1 where we
can check for completions using a non-blocking interface and use
the vhost_task task.
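
For reference, the "super simple route" above boils down to an
INIT_WORK()/schedule_work() pair. A minimal sketch of that pattern,
purely illustrative and not the actual patch (the vhost_cleanup names
are made up here):

#include <linux/workqueue.h>
#include <linux/slab.h>

struct vhost_cleanup {
        struct work_struct work;
        /* driver-specific state to flush would live here */
};

static void vhost_cleanup_fn(struct work_struct *work)
{
        struct vhost_cleanup *c = container_of(work, struct vhost_cleanup, work);

        /* flush outstanding IO / tear down device state here */
        kfree(c);
}

/* Called from the vhost_task when it has to stop processing work. */
static void vhost_defer_cleanup(struct vhost_cleanup *c)
{
        INIT_WORK(&c->work, vhost_cleanup_fn);
        schedule_work(&c->work);        /* runs from the system workqueue */
}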
  

Comments

Christian Brauner May 18, 2023, 8:25 a.m. UTC | #1
On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> normal testing, haven't converted vsock and vdpa, and I know you guys
> will not like the first patch. However, I think it better shows what

Just to summarize the core idea behind my proposal is that no signal
handling changes are needed unless there's a bug in the current way
io_uring workers already work. All that should be needed is
s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.

If you follow my proposal then vhost and io_uring workers should almost
collapse into the same concept. Specifically, io_uring workers and vhost
workers should behave the same when it comes to handling signals.

See 
https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
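
To make the shape of that substitution concrete: in get_signal()'s
fatal-signal path, for example, where io_uring workers are already
allowed to catch the fatal signal and do their own cleanup rather than
having do_exit() called on their behalf, the change would look roughly
like this (illustrative sketch, not an exact hunk):

-	if (current->flags & PF_IO_WORKER)
-		goto out;
+	/* user workers (io_uring and vhost alike) catch fatal signals
+	 * themselves and exit cleanly from their worker function */
+	if (current->flags & PF_USER_WORKER)
+		goto out;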


> we need from the signal code and how we can support signals in the
> vhost_task layer.
> 
> Note that I took the super simple route and kicked off some work to
> the system workqueue. We can do more invasive approaches:
> 1. Modify the vhost drivers so they can check for IO completions using
> a non-blocking interface. We then don't need to run from the system
> workqueue and can run from the vhost_task.
> 
> 2. We could drop patch 1 and just say we are doing a polling type
> of approach. We then modify the vhost layer similar to #1 where we
> can check for completions using a non-blocking interface and use
> the vhost_task task.

My preference would be to do whatever is the minimal thing now and has
the least bug potential and is the easiest to review for us non-vhost
experts. Then you can take all the time to rework and improve the vhost
infra based on the possibilities that using user workers offers. Plus,
that can easily happen in the next kernel cycle.

Remember, that we're trying to fix a regression here. A regression on an
unreleased kernel but still.
  
Christian Brauner May 18, 2023, 8:40 a.m. UTC | #2
On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
> 
> Just to summarize the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> 
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
> 
> See 
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
> 
> 
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> > 
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> > 
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
> 
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
> 
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.

It's a public holiday here today so I'll try to find time to review this
tomorrow.
  
Christian Brauner May 18, 2023, 2:30 p.m. UTC | #3
On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
> 
> Just to summarize the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> 
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
> 
> See 
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
> 
> 
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> > 
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> > 
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
> 
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
> 
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.

Just two more thoughts:

The following places currently check for PF_IO_WORKER:

arch/x86/include/asm/fpu/sched.h: !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
arch/x86/kernel/fpu/context.h:    if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
arch/x86/kernel/fpu/core.c:       if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&

Both PF_KTHREAD and PF_IO_WORKER tasks don't need TIF_NEED_FPU_LOAD
because they never return to userspace. But that's not specific to
PF_IO_WORKER. Please generalize this to just check for PF_USER_WORKER
via a simple s/PF_IO_WORKER/PF_USER_WORKER/g in these places.
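
In other words, the test becomes "kthread or any user worker" instead
of "kthread or io_uring worker". As a sketch (not the verbatim arch/x86
hunks), the condition those three places share would end up as:

/*
 * Neither kthreads nor user workers ever return to userspace, so none
 * of them need their FPU state preserved or any TIF_NEED_FPU_LOAD
 * handling.
 */
static inline bool task_needs_user_fpu_state(struct task_struct *tsk)
{
        return !(tsk->flags & (PF_KTHREAD | PF_USER_WORKER));
}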

Another thing: in the sched code we have hooks into sched_submit_work()
and sched_update_worker() specific to PF_IO_WORKER. But again, I don't
think this needs to be special to PF_IO_WORKER. This might be
generally useful for PF_USER_WORKER. So we should probably generalize
this and have generic user_worker_sleeping() and user_worker_running()
helpers that figure out internally which specific helper to call. That's
not something that needs to be done right now though, since I don't think
vhost needs this functionality.
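
Hypothetically, those helpers could look something like the following.
Only io_wq_worker_sleeping()/io_wq_worker_running() exist today; the
user_worker_*() wrappers and the vhost comment are purely illustrative:

static void user_worker_sleeping(struct task_struct *tsk)
{
        if (tsk->flags & PF_IO_WORKER)
                io_wq_worker_sleeping(tsk);
        /* a future vhost/user-worker sleeping hook would go here */
}

static void user_worker_running(struct task_struct *tsk)
{
        if (tsk->flags & PF_IO_WORKER)
                io_wq_worker_running(tsk);
        /* a future vhost/user-worker running hook would go here */
}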

But we should generalize this for the next development cycle so we have
this all nice and clean when someone actually needs this. Overall this
will mean that there would only be a single place left where
PF_IO_WORKER would need to be checked and that's in io_uring code
itself. And if we do things just right we might not even need that
PF_IO_WORKER flag anymore at all. But again, that's just notes for next
cycle.

Thoughts? Rotten apples?
  
Christian Brauner May 19, 2023, 12:15 p.m. UTC | #4
On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
> 
> Just to summarize the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> 
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
> 
> See 
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
> 
> 
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> > 
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> > 
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
> 
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
> 
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.

On Tue, May 16, 2023 at 10:40:01AM +0200, Christian Brauner wrote:
> On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> > On 5/15/23 10:44 AM, Linus Torvalds wrote:
> > > On Mon, May 15, 2023 at 7:23 AM Christian Brauner <brauner@kernel.org> wrote:
> > >>
> > >> So I think we will be able to address (1) and (2) by making vhost tasks
> > >> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> > >> and then having vhost handle get_signal() - as you mentioned - the same
> > >> way io_uring already does. We should also remove the ignore_signals
> > >> thing completely imho. I don't think we ever want to do this with user
> > >> workers.
> > > 
> > > Right. That's what IO_URING does:
> > > 
> > >         if (args->io_thread) {
> > >                 /*
> > >                  * Mark us an IO worker, and block any signal that isn't
> > >                  * fatal or STOP
> > >                  */
> > >                 p->flags |= PF_IO_WORKER;
> > >                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> > >         }
> > > 
> > > and I really think that vhost should basically do exactly what io_uring does.
> > > 
> > > Not because io_uring fundamentally got this right - but simply because
> > > io_uring had almost all the same bugs (and then some), and what the
> > > io_uring worker threads ended up doing was to basically zoom in on
> > > "this works".
> > > 
> > > And it zoomed in on it largely by just going for "make it look as much
> > > as possible as a real user thread", because every time the kernel
> > > thread did something different, it just caused problems.
> > > 
> > > So I think the patch should just look something like the attached.
> > > Mike, can you test this on whatever vhost test-suite?
> > 
> > I tried that approach already and it doesn't work because io_uring and vhost
> > differ in that vhost drivers implement a device where each device has a vhost_task
> > and the drivers have a file_operations for the device. When the vhost_task's
> > parent gets a signal like SIGKILL, then it will exit and call into the vhost
> > driver's file_operations->release function. At this time, we need to do cleanup
> 
> But that's no reason why the vhost worker couldn't just be allowed to
> exit on SIGKILL cleanly, similar to io_uring. That's just describing the
> current architecture, which isn't a necessity afaict. And the helper
> thread could, e.g., crash.
> 
> > like flushing the device which uses the vhost_task. There is also the case where if
> > the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
> 
> In a way I really don't like the patch below. Because this should be
> solvable by adapting vhost workers. Right now, vhost is coming from a
> kthread model and we ported it to a user worker model and the whole
> point of this exercise has been that the workers behave more like
> regular userspace processes. So my tendency is to not massage kernel
> signal handling to now also include a special case for user workers in
> addition to kthreads. That's just the wrong way around and then vhost
> could've just stuck with kthreads in the first place.
> 
> So I'm fine with skipping over the freezing case for now but SIGKILL
> should be handled imho. Only init and kthreads should get the luxury of
> ignoring SIGKILL.
> 
> So, I'm afraid I'm asking for some work from you here, but how feasible
> would a model be where vhost_worker(), similar to io_wq_worker(),
> gracefully handles SIGKILL? Yes, I see there's
> 
> net.c:   .release = vhost_net_release
> scsi.c:  .release = vhost_scsi_release
> test.c:  .release = vhost_test_release
> vdpa.c:  .release = vhost_vdpa_release
> vsock.c: .release = virtio_transport_release
> vsock.c: .release = vhost_vsock_dev_release
> 
> but that means you have all the basic logic in place and all of those
> drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> vhost worker. I'm confident that a lot of this can be leveraged to just
> clean up on SIGKILL.
> 
> So it feels like this should be achievable by adding a callback to
> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
> and that all the users of vhost workers are forced to implement.
> 
> Yes, it is more work but I think that's the right thing to do and not to
> complicate our signal handling.
> 
> Worst case if this can't be done fast enough we'll have to revert the
> vhost parts. I think the user worker parts are mostly sane and are

As mentioned, if we can't settle this cleanly before -rc4 we should
revert the vhost parts unless Linus wants to have it earlier.
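
For illustration, the callback idea quoted above could look roughly
like the sketch below. This is hypothetical: the ->kill member and the
loop structure are made up here and are not the actual drivers/vhost
code; it's only meant to show how a worker that handles SIGKILL itself
could hand cleanup back to the driver:

struct vhost_worker {
        struct vhost_task *vtsk;
        struct llist_head work_list;
        /* driver-provided cleanup, run when the worker gets a fatal signal */
        void (*kill)(struct vhost_worker *worker);
};

static int vhost_worker_fn(void *data)
{
        struct vhost_worker *worker = data;
        struct ksignal ksig;

        for (;;) {
                if (signal_pending(current) && get_signal(&ksig)) {
                        /* SIGKILL: let the driver flush and tear down */
                        worker->kill(worker);
                        break;
                }
                /* ... dequeue and run pending vhost work items ... */
        }
        return 0;
}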
  
Thorsten Leemhuis June 1, 2023, 7:58 a.m. UTC | #5
On 19.05.23 14:15, Christian Brauner wrote:
> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>> will not like the first patch. However, I think it better shows what
>>
>> Just to summarize the core idea behind my proposal is that no signal
>> handling changes are needed unless there's a bug in the current way
>> io_uring workers already work. All that should be needed is
>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
[...]
>> So it feels like this should be achievable by adding a callback to
>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>> and that all the users of vhost workers are forced to implement.
>>
>> Yes, it is more work but I think that's the right thing to do and not to
>> complicate our signal handling.
>>
>> Worst case if this can't be done fast enough we'll have to revert the
>> vhost parts. I think the user worker parts are mostly sane and are
> 
> As mentioned, if we can't settle this cleanly before -rc4 we should
> revert the vhost parts unless Linus wants to have it earlier.

Meanwhile -rc5 is just a few days away and there are still a lot of
discussions in the patch-set proposed to address the issues[1]. Which is
kinda great (albeit also why I haven't given it a spin yet), but on the
other hand makes me wonder:

Is it maybe time to revert the vhost parts for 6.4 and try again next cycle?

[1]
https://lore.kernel.org/all/20230522025124.5863-1-michael.christie@oracle.com/

Ciao, Thorsten "not sure if I'm asking because I'm affected, or because
it's my duty as regression tracker" Leemhuis
  
Nicolas Dichtel June 1, 2023, 10:18 a.m. UTC | #6
On 01/06/2023 at 09:58, Thorsten Leemhuis wrote:
[snip]
> 
> Meanwhile -rc5 is just a few days away and there are still a lot of
> discussions in the patch-set proposed to address the issues[1]. Which is
> kinda great (albeit also why I haven't given it a spin yet), but on the
> other hand makes me wonder:
> 
> Is it maybe time to revert the vhost parts for 6.4 and try again next cycle?
At least it's time to find a way to fix this issue :)


Thank you,
Nicolas
  
Christian Brauner June 1, 2023, 10:47 a.m. UTC | #7
On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
> On 19.05.23 14:15, Christian Brauner wrote:
> > On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> >> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> >>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> >>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> >>> normal testing, haven't converted vsock and vdpa, and I know you guys
> >>> will not like the first patch. However, I think it better shows what
> >>
> >> Just to summarize the core idea behind my proposal is that no signal
> >> handling changes are needed unless there's a bug in the current way
> >> io_uring workers already work. All that should be needed is
> >> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> [...]
> >> So it feels like this should be achievable by adding a callback to
> >> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
> >> and that all the users of vhost workers are forced to implement.
> >>
> >> Yes, it is more work but I think that's the right thing to do and not to
> >> complicate our signal handling.
> >>
> >> Worst case if this can't be done fast enough we'll have to revert the
> >> vhost parts. I think the user worker parts are mostly sane and are
> > 
> > As mentioned, if we can't settle this cleanly before -rc4 we should
> > revert the vhost parts unless Linus wants to have it earlier.
> 
> Meanwhile -rc5 is just a few days away and there are still a lot of
> discussions in the patch-set proposed to address the issues[1]. Which is
> kinda great (albeit also why I haven't given it a spin yet), but on the
> other hand makes me wonder:

You might've missed it in the thread but it seems everyone is currently
operating under the assumption that the preferred way is to fix this
rather than revert. See the mail in [1]:

"So I'd really like to finish this. Even if we end up with a hack or
two in signal handling that we can hopefully fix up later by having
vhost fix up some of its current assumptions."

which is why no revert was sent for -rc4. And there's a temporary fix we
seem to have converged on.

@Mike, do you want to prepare an updated version of the temporary fix?
If @Linus prefers to just apply it directly he can just grab it from the
list rather than delaying it. Make sure to grab a Co-developed-by line
on this, @Mike.

Just in case we misunderstood the intention, I also prepared a revert
at the end of this mail that Linus can use.

@Thorsten, you can test it if you want. The revert only reverts the
vhost bits as the general agreement seems to be that user workers are
otherwise the path forward.

[1]: https://lore.kernel.org/lkml/CAHk-=wj4DS=2F5mW+K2P7cVqrsuGd3rKE_2k2BqnnPeeYhUCvg@mail.gmail.com

---

/* Summary */
Switching vhost workers to user workers broke existing workflows because
vhost workers started showing up in ps output, breaking various scripts.
The reason is that vhost user workers are currently spawned as separate
processes and not as threads. Revert the patches converting vhost from
kthreads to vhost workers until vhost is ready to support user workers
created as actual threads.

The following changes since commit 7877cb91f1081754a1487c144d85dc0d2e2e7fc4:

  Linux 6.4-rc4 (2023-05-28 07:49:00 -0400)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/kernel/v6.4-rc4/vhost

for you to fetch changes up to b20084b6bc90012a8ccce72ef1c0050d5fd42aa8:

  Revert "vhost_task: Allow vhost layer to use copy_process" (2023-06-01 12:33:19 +0200)

----------------------------------------------------------------
kernel/v6.4-rc4/vhost

----------------------------------------------------------------
Christian Brauner (3):
      Revert "vhost: use vhost_tasks for worker threads"
      Revert "vhost: move worker thread fields to new struct"
      Revert "vhost_task: Allow vhost layer to use copy_process"

 MAINTAINERS                      |   1 -
 drivers/vhost/Kconfig            |   5 --
 drivers/vhost/vhost.c            | 124 ++++++++++++++++++++-------------------
 drivers/vhost/vhost.h            |  11 +---
 include/linux/sched/vhost_task.h |  23 --------
 kernel/Makefile                  |   1 -
 kernel/vhost_task.c              | 117 ------------------------------------
 7 files changed, 67 insertions(+), 215 deletions(-)
 delete mode 100644 include/linux/sched/vhost_task.h
 delete mode 100644 kernel/vhost_task.c
  
Thorsten Leemhuis June 1, 2023, 11:29 a.m. UTC | #8
On 01.06.23 12:47, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>>>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
>>>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>>>> will not like the first patch. However, I think it better shows what
>>>>
>>>> Just to summarize the core idea behind my proposal is that no signal
>>>> handling changes are needed unless there's a bug in the current way
>>>> io_uring workers already work. All that should be needed is
>>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
>>>> So it feels like this should be achievable by adding a callback to
>>>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>>>> and that all the users of vhost workers are forced to implement.
>>>>
>>>> Yes, it is more work but I think that's the right thing to do and not to
>>>> complicate our signal handling.
>>>>
>>>> Worst case if this can't be done fast enough we'll have to revert the
>>>> vhost parts. I think the user worker parts are mostly sane and are
>>>
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>>
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes me wonder:
> 
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this
> rather than revert. 

I saw that, but that was also a week ago already, so I slowly started to
wonder if plans might have changed or should change. Anyway: if that's
still the plan going forward, it's totally fine for me if it's fine for
Linus. :-D

BTW: I haven't sat down to test Mike's patches so far, as due to all the
discussions I assumed new ones would be coming sooner or later anyway.
If it's worth giving them a shot, please let me know.

> [...]

Thx for the update!

Ciao, Thorsten
  
Linus Torvalds June 1, 2023, 12:26 p.m. UTC | #9
On Thu, Jun 1, 2023 at 6:47 AM Christian Brauner <brauner@kernel.org> wrote:
>
> @Mike, do you want to prepare an updated version of the temporary fix?
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.

Yeah, let's apply the known "fix the immediate regression" patch wrt
vhost ps output and the freezer. That gets rid of the regression.

I think that we can - and should - then treat the questions about core
dumping and execve as separate issues.

vhost wouldn't have done execve since it's nonsensical and has never
worked anyway since it always left the old mm ref behind, and
similarly core dumping has never been an issue.

So on those things we don't have any "semantic" issues, we just need
to make sure we don't do crazy things like hang uninterruptibly.

            Linus
  
Mike Christie June 1, 2023, 4:10 p.m. UTC | #10
On 6/1/23 5:47 AM, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>>>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
>>>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>>>> will not like the first patch. However, I think it better shows what
>>>> Just to summarize the core idea behind my proposal is that no signal
>>>> handling changes are needed unless there's a bug in the current way
>>>> io_uring workers already work. All that should be needed is
>>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
>>>> So it feels like this should be achievable by adding a callback to
>>>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>>>> and that all the users of vhost workers are forced to implement.
>>>>
>>>> Yes, it is more work but I think that's the right thing to do and not to
>>>> complicate our signal handling.
>>>>
>>>> Worst case if this can't be done fast enough we'll have to revert the
>>>> vhost parts. I think the user worker parts are mostly sane and are
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes me wonder:
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this
> rather than revert. See the mail in [1]:
> 
> "So I'd really like to finish this. Even if we end up with a hack or
> two in signal handling that we can hopefully fix up later by having
> vhost fix up some of its current assumptions."
> 
> which is why no revert was sent for -rc4. And there's a temporary fix we
> seem to have converged on.
> 
> @Mike, do you want to prepare an updated version of the temporary fix?
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.

Yes, I'll send it within a couple of hours.