sched/psi: fix use-after-free in ep_remove_wait_queue()

Message ID 20230202030023.1847084-1-kamatam@amazon.com
State New
Headers
Series sched/psi: fix use-after-free in ep_remove_wait_queue() |

Commit Message

Munehisa Kamata Feb. 2, 2023, 3 a.m. UTC
  If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed without clearing the queue and reference, then it can
result in use-after-free as below. Use wake_up_pollfree() instead to
ensure that the queue and reference are cleared before freeing.

 BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
 Write of size 4 at addr ffff88810e625328 by task a.out/4404

 CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
 Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
 Call Trace:
 <TASK>
 dump_stack_lvl+0x73/0xa0
 print_report+0x16c/0x4e0
 ? _printk+0x59/0x80
 ? __virt_addr_valid+0xb8/0x130
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_report+0xc3/0xf0
 ? _raw_spin_lock_irqsave+0x60/0xc0
 kasan_check_range+0x2d2/0x310
 _raw_spin_lock_irqsave+0x60/0xc0
 remove_wait_queue+0x1a/0xa0
 ep_free+0x12c/0x170
 ep_eventpoll_release+0x26/0x30
 __fput+0x202/0x400
 task_work_run+0x11d/0x170
 do_exit+0x495/0x1130
 ? update_cfs_rq_load_avg+0x2c2/0x2e0
 do_group_exit+0x100/0x100
 get_signal+0xd67/0xde0
 ? finish_task_switch+0x15f/0x3a0
 arch_do_signal_or_restart+0x2a/0x2b0
 exit_to_user_mode_prepare+0x94/0x100
 syscall_exit_to_user_mode+0x20/0x40
 do_syscall_64+0x52/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7f8e392bfb91
 Code: Unable to access opcode bytes at 0x7f8e392bfb67.
 RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
 RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
 RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
 RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
 R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
 R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

 Allocated by task 4404:
 kasan_set_track+0x3d/0x60
 __kasan_kmalloc+0x85/0x90
 psi_trigger_create+0x113/0x3e0
 pressure_write+0x146/0x2e0
 cgroup_file_write+0x11c/0x250
 kernfs_fop_write_iter+0x186/0x220
 vfs_write+0x3d8/0x5c0
 ksys_write+0x90/0x110
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 4407:
 kasan_set_track+0x3d/0x60
 kasan_save_free_info+0x27/0x40
 ____kasan_slab_free+0x11d/0x170
 slab_free_freelist_hook+0x87/0x150
 __kmem_cache_free+0xcb/0x180
 psi_trigger_destroy+0x2e8/0x310
 cgroup_file_release+0x4f/0xb0
 kernfs_drain_open_files+0x165/0x1f0
 kernfs_drain+0x162/0x1a0
 __kernfs_remove+0x1fb/0x310
 kernfs_remove_by_name_ns+0x95/0xe0
 cgroup_addrm_files+0x67f/0x700
 cgroup_destroy_locked+0x283/0x3c0
 cgroup_rmdir+0x29/0x100
 kernfs_iop_rmdir+0xd1/0x140
 vfs_rmdir+0xfe/0x240
 do_rmdir+0x13d/0x280
 __x64_sys_rmdir+0x2c/0x30
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
  

Comments

Eric Biggers Feb. 2, 2023, 4:56 a.m. UTC | #1
On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8ac8b81bfee6..6e66c15f6450 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
>  
>  	group = t->group;
>  	/*
> -	 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> -	 * from under a polling process.
> +	 * Wakeup waiters to stop polling and clear the queue to prevent it from
> +	 * being accessed later. Can happen if cgroup is deleted from under a
> +	 * polling process otherwise.
>  	 */
> -	wake_up_interruptible(&t->event_wait);
> +	wake_up_pollfree(&t->event_wait);
>  
>  	mutex_lock(&group->trigger_lock);

wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
lifetime of the waitqueue be fixed instead?

- Eric
  
Suren Baghdasaryan Feb. 2, 2023, 9:11 p.m. UTC | #2
On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 8ac8b81bfee6..6e66c15f6450 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> >
> >       group = t->group;
> >       /*
> > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > -      * from under a polling process.
> > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > +      * being accessed later. Can happen if cgroup is deleted from under a
> > +      * polling process otherwise.
> >        */
> > -     wake_up_interruptible(&t->event_wait);
> > +     wake_up_pollfree(&t->event_wait);
> >
> >       mutex_lock(&group->trigger_lock);
>
> wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> lifetime of the waitqueue be fixed instead?

waitqueue lifetime in this case is linked to cgroup_file_release(),
which seems appropriate to me here. Unfortunately
cgroup_file_release() is not directly linked to the file's lifetime.
For more details see:
https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
.
So, if we want to fix the lifetime of the waitqueue, we would have to
tie cgroup_file_release() to the fput() somehow. IOW, the fix would
have to be done at the cgroups or higher (kernfs?) layer.
Thanks,
Suren.

>
> - Eric
  
Suren Baghdasaryan Feb. 9, 2023, 5:09 p.m. UTC | #3
On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > --- a/kernel/sched/psi.c
> > > +++ b/kernel/sched/psi.c
> > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > >
> > >       group = t->group;
> > >       /*
> > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > -      * from under a polling process.
> > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > +      * polling process otherwise.
> > >        */
> > > -     wake_up_interruptible(&t->event_wait);
> > > +     wake_up_pollfree(&t->event_wait);
> > >
> > >       mutex_lock(&group->trigger_lock);
> >
> > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > lifetime of the waitqueue be fixed instead?
>
> waitqueue lifetime in this case is linked to cgroup_file_release(),
> which seems appropriate to me here. Unfortunately
> cgroup_file_release() is not directly linked to the file's lifetime.
> For more details see:
> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> .
> So, if we want to fix the lifetime of the waitqueue, we would have to
> tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> have to be done at the cgroups or higher (kernfs?) layer.

Hi Eric,
Do you still object to using wake_up_pollfree() for this case?
Changing higher levels to make cgroup_file_release() be tied to fput()
would be ideal but I think that would be a big change for this one
case. If you agree I'll Ack this patch.
Thanks,
Suren.

> Thanks,
> Suren.
>
> >
> > - Eric
  
Eric Biggers Feb. 9, 2023, 6:46 p.m. UTC | #4
On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > --- a/kernel/sched/psi.c
> > > > +++ b/kernel/sched/psi.c
> > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > >
> > > >       group = t->group;
> > > >       /*
> > > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > -      * from under a polling process.
> > > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > > +      * polling process otherwise.
> > > >        */
> > > > -     wake_up_interruptible(&t->event_wait);
> > > > +     wake_up_pollfree(&t->event_wait);
> > > >
> > > >       mutex_lock(&group->trigger_lock);
> > >
> > > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > > lifetime of the waitqueue be fixed instead?
> >
> > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > which seems appropriate to me here. Unfortunately
> > cgroup_file_release() is not directly linked to the file's lifetime.
> > For more details see:
> > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > .
> > So, if we want to fix the lifetime of the waitqueue, we would have to
> > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > have to be done at the cgroups or higher (kernfs?) layer.
> 
> Hi Eric,
> Do you still object to using wake_up_pollfree() for this case?
> Changing higher levels to make cgroup_file_release() be tied to fput()
> would be ideal but I think that would be a big change for this one
> case. If you agree I'll Ack this patch.
> Thanks,
> Suren.
> 

I haven't read the code closely in this case.  I'm just letting you know that
wake_up_pollfree() is very much a last-resort option for when the waitqueue
lifetime can't be fixed.  So if you want to use wake_up_pollfree(), you need to
explain why no other fix is possible.  For example maybe the UAPI depends on the
waitqueue having a nonstandard lifetime.

- Eric
  
Suren Baghdasaryan Feb. 9, 2023, 7:13 p.m. UTC | #5
On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > >
> > > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > > --- a/kernel/sched/psi.c
> > > > > +++ b/kernel/sched/psi.c
> > > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > > >
> > > > >       group = t->group;
> > > > >       /*
> > > > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > > -      * from under a polling process.
> > > > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > > > +      * polling process otherwise.
> > > > >        */
> > > > > -     wake_up_interruptible(&t->event_wait);
> > > > > +     wake_up_pollfree(&t->event_wait);
> > > > >
> > > > >       mutex_lock(&group->trigger_lock);
> > > >
> > > > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > > > lifetime of the waitqueue be fixed instead?
> > >
> > > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > > which seems appropriate to me here. Unfortunately
> > > cgroup_file_release() is not directly linked to the file's lifetime.
> > > For more details see:
> > > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > > .
> > > So, if we want to fix the lifetime of the waitqueue, we would have to
> > > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > > have to be done at the cgroups or higher (kernfs?) layer.
> >
> > Hi Eric,
> > Do you still object to using wake_up_pollfree() for this case?
> > Changing higher levels to make cgroup_file_release() be tied to fput()
> > would be ideal but I think that would be a big change for this one
> > case. If you agree I'll Ack this patch.
> > Thanks,
> > Suren.
> >
>
> I haven't read the code closely in this case.  I'm just letting you know that
> wake_up_pollfree() is very much a last-resort option for when the waitqueue
> lifetime can't be fixed.

Got it. Thanks for the warning.
I think it can be fixed but the right fix would require a sizable
higher level refactoring which might be more justifiable if we have
more such cases in the future.

>  So if you want to use wake_up_pollfree(), you need to
> explain why no other fix is possible.  For example maybe the UAPI depends on the
> waitqueue having a nonstandard lifetime.

I think the changelog should explain that the waitqueue lifetime in
cases of non-root cgroups is tied to cgroup_file_release() callback,
which in turn is not tied to file's lifetime. That's the reason for
waitqueue and the file having different lifecycles. Would that suffice
as the justification?
Again, I'm not saying that no other fix is possible, but that the
right fix would be much more complex.
Thanks,
Suren.

>
> - Eric
  
Suren Baghdasaryan Feb. 13, 2023, 11:50 p.m. UTC | #6
On Thu, Feb 9, 2023 at 11:13 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > > >
> > > > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > > > --- a/kernel/sched/psi.c
> > > > > > +++ b/kernel/sched/psi.c
> > > > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > > > >
> > > > > >       group = t->group;
> > > > > >       /*
> > > > > > -      * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > > > -      * from under a polling process.
> > > > > > +      * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > > > +      * being accessed later. Can happen if cgroup is deleted from under a
> > > > > > +      * polling process otherwise.
> > > > > >        */
> > > > > > -     wake_up_interruptible(&t->event_wait);
> > > > > > +     wake_up_pollfree(&t->event_wait);
> > > > > >
> > > > > >       mutex_lock(&group->trigger_lock);
> > > > >
> > > > > wake_up_pollfree() should only be used in extremely rare cases.  Why can't the
> > > > > lifetime of the waitqueue be fixed instead?
> > > >
> > > > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > > > which seems appropriate to me here. Unfortunately
> > > > cgroup_file_release() is not directly linked to the file's lifetime.
> > > > For more details see:
> > > > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > > > .
> > > > So, if we want to fix the lifetime of the waitqueue, we would have to
> > > > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > > > have to be done at the cgroups or higher (kernfs?) layer.
> > >
> > > Hi Eric,
> > > Do you still object to using wake_up_pollfree() for this case?
> > > Changing higher levels to make cgroup_file_release() be tied to fput()
> > > would be ideal but I think that would be a big change for this one
> > > case. If you agree I'll Ack this patch.
> > > Thanks,
> > > Suren.
> > >
> >
> > I haven't read the code closely in this case.  I'm just letting you know that
> > wake_up_pollfree() is very much a last-resort option for when the waitqueue
> > lifetime can't be fixed.
>
> Got it. Thanks for the warning.
> I think it can be fixed but the right fix would require a sizable
> higher level refactoring which might be more justifiable if we have
> more such cases in the future.
>
> >  So if you want to use wake_up_pollfree(), you need to
> > explain why no other fix is possible.  For example maybe the UAPI depends on the
> > waitqueue having a nonstandard lifetime.
>
> I think the changelog should explain that the waitqueue lifetime in
> cases of non-root cgroups is tied to cgroup_file_release() callback,
> which in turn is not tied to file's lifetime. That's the reason for
> waitqueue and the file having different lifecycles. Would that suffice
> as the justification?

Ok, in the absence of objections, I would suggest resending this patch
with the changelog including details about waitqueue lifetime and
reasons wake_up_pollfree() is required here.
Munehisa, feel free to reuse
https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
if you find it useful.
Thanks,
Suren.

> Again, I'm not saying that no other fix is possible, but that the
> right fix would be much more complex.
> Thanks,
> Suren.
>
> >
> > - Eric
  

Patch

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8ac8b81bfee6..6e66c15f6450 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1343,10 +1343,11 @@  void psi_trigger_destroy(struct psi_trigger *t)
 
 	group = t->group;
 	/*
-	 * Wakeup waiters to stop polling. Can happen if cgroup is deleted
-	 * from under a polling process.
+	 * Wakeup waiters to stop polling and clear the queue to prevent it from
+	 * being accessed later. Can happen if cgroup is deleted from under a
+	 * polling process otherwise.
 	 */
-	wake_up_interruptible(&t->event_wait);
+	wake_up_pollfree(&t->event_wait);
 
 	mutex_lock(&group->trigger_lock);