[2/3] livepatch,sched: Add livepatch task switching to cond_resched()

Message ID 58cd637f7557e829142a2eb255daa91fa57e6321.1675969869.git.jpoimboe@kernel.org
State New
Series livepatch,sched: Add livepatch task switching to cond_resched()

Commit Message

Josh Poimboeuf Feb. 9, 2023, 7:17 p.m. UTC
There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.

Fix it by patching tasks in cond_resched().

There are four different flavors of cond_resched(), depending on the
kernel configuration.  Hook into all of them.
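
For orientation, a condensed view of the four flavors and of where the
hook lands in each of them (a rough summary of the sched.h and
sched/core.c hunks below, not literal kernel code):

#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
# if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
	/* 1) static call: livepatch retargets the cond_resched static call
	 *    to klp_cond_resched(), which runs __klp_sched_try_switch()
	 *    before __cond_resched(). */
# elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
	/* 2) static key: dynamic_cond_resched() calls klp_sched_try_switch(),
	 *    itself gated by the klp_sched_try_switch_key static key. */
# else
	/* 3) plain voluntary preemption: _cond_resched() calls
	 *    klp_sched_try_switch() and then __cond_resched(). */
# endif
#else
	/* 4) CONFIG_PREEMPTION without PREEMPT_DYNAMIC: _cond_resched() is
	 *    otherwise a no-op, so it only calls klp_sched_try_switch(). */
#endif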

A more elegant solution might be to use a preempt notifier.  However,
non-ORC unwinders can't unwind a preempted task reliably.

[1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
[2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/livepatch.h       |  1 +
 include/linux/livepatch_sched.h | 29 +++++++++++++++
 include/linux/sched.h           | 20 ++++++++---
 kernel/livepatch/transition.c   | 54 ++++++++++++++++++++++++++++
 kernel/sched/core.c             | 64 ++++++++++++++++++++++++++++-----
 5 files changed, 155 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/livepatch_sched.h
  

Comments

Petr Mladek Feb. 15, 2023, 1:30 p.m. UTC | #1
On Thu 2023-02-09 11:17:48, Josh Poimboeuf wrote:
> There have been reports [1][2] of live patches failing to complete
> within a reasonable amount of time due to CPU-bound kthreads.
> 
> Fix it by patching tasks in cond_resched().
> 
> There are four different flavors of cond_resched(), depending on the
> kernel configuration.  Hook into all of them.
> 
> A more elegant solution might be to use a preempt notifier.  However,
> non-ORC unwinders can't unwind a preempted task reliably.
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2085,20 +2090,25 @@ static __always_inline int _cond_resched(void)
>  	return dynamic_cond_resched();
>  }
>  
> -#else
> +#else /* !CONFIG_PREEMPTION */
>  
>  static inline int _cond_resched(void)
>  {
> +	klp_sched_try_switch();
>  	return __cond_resched();

My only concern is if it might cause any performance problems.

On one hand, cond_resched() is used in code paths that are slow
on their own. Also, the new check will do nothing most of the time.

On the other hand, cond_resched() is typically used in loops.
One iteration might be fast; the code might be slow only because
there are many iterations. Repeating the same failing stack check
on every iteration might prolong the time significantly.

An idea is to try the switch only when a real reschedule actually
happened. Something like:

static inline int _cond_resched(void)
{
	int scheduled;

	scheduled = __cond_resched();
	if (scheduled)
		klp_sched_try_switch();

	return scheduled;
}

But it would make it less reliable/predictable. Also, it won't work
in configurations where cond_resched() is always a nop.

I am probably being too careful. We might keep it simple until
real-life problems are reported.

> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -76,6 +96,8 @@ static void klp_complete_transition(void)
>  		 klp_transition_patch->mod->name,
>  		 klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
>  
> +	klp_cond_resched_disable();
> +

Nit: Strictly speaking, this is not needed when klp_complete_transition()
     is called from klp_cancel_transition(). In this case,
     klp_cond_resched_enable() was not called. So it might be moved into
     klp_try_complete_transition().
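
     E.g. something like this (untested sketch; the exact surrounding
     lines in klp_try_complete_transition() are assumed):

	static bool klp_try_complete_transition(void)
	{
		...
		/* we're done, now cleanup the data structures */
		klp_cond_resched_disable();
		klp_complete_transition();
		...
	}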


A more important thing, thinking aloud:

We need to make sure that no task is in the middle of
__klp_sched_try_switch() when we modify anything that is used there.

We seem to be on the safe side in klp_complete_transition(). We are
here only when all tasks have TIF_PATCH_PENDING cleared. In this case,
__klp_sched_try_switch() just returns. Also it calls
klp_synchronize_transition() so that all tasks finish the critical part
in __klp_sched_try_switch() before any new transition starts.

But that is not the case in klp_reverse_transition(). It modifies
klp_target_state while __klp_sched_try_switch() might be in the middle
of klp_check_stack(), which might then give a wrong result.

klp_reverse_transition() already solves a similar race with
klp_update_patch_state() by clearing all TIF_PATCH_PENDING flags
and calling klp_synchronize_transition(). We just need to do
it earlier. Something like:

--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -642,14 +642,11 @@ void klp_reverse_transition(void)
 		 klp_target_state == KLP_PATCHED ? "patching to unpatching" :
 						   "unpatching to patching");
 
-	klp_transition_patch->enabled = !klp_transition_patch->enabled;
-
-	klp_target_state = !klp_target_state;
 
 	/*
 	 * Clear all TIF_PATCH_PENDING flags to prevent races caused by
-	 * klp_update_patch_state() running in parallel with
-	 * klp_start_transition().
+	 * klp_update_patch_state() and __klp_sched_try_switch()
+	 * running in parallel.
 	 */
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task)
@@ -659,9 +656,16 @@ void klp_reverse_transition(void)
 	for_each_possible_cpu(cpu)
 		clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
 
-	/* Let any remaining calls to klp_update_patch_state() complete */
+	/*
+	 * Make sure that both klp_update_patch_state() and
+	 * __klp_sched_try_switch() see the TIF_PATCH_PENDING cleared.
+	 */
 	klp_synchronize_transition();
 
+	klp_transition_patch->enabled = !klp_transition_patch->enabled;
+
+	klp_target_state = !klp_target_state;
+
 	klp_start_transition();
 }
 


>  	if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) {
>  		klp_unpatch_replaced_patches(klp_transition_patch);
>  		klp_discard_nops(klp_transition_patch);


Otherwise, the patch looks fine. I looked at it primarily from
the livepatching code side.

Best Regards,
Petr
  
Josh Poimboeuf Feb. 16, 2023, 2:26 a.m. UTC | #2
On Wed, Feb 15, 2023 at 02:30:36PM +0100, Petr Mladek wrote:
> >  static inline int _cond_resched(void)
> >  {
> > +	klp_sched_try_switch();
> >  	return __cond_resched();
> 
> My only concern is if it might cause any performance problems.
> 
> On one hand, cond_resched() is used in code paths that are slow
> on their own. Also, the new check will do nothing most of the time.
> 
> On the other hand, cond_resched() is typically used in loops.
> One iteration might be fast; the code might be slow only because
> there are many iterations. Repeating the same failing stack check
> on every iteration might prolong the time significantly.

Yes, but it should hopefully be very rare to patch a function in the
call stack of a kthread loop.  In general it's a good idea for the patch
author to avoid that.

> An idea is to try the switch only when a real reschedule actually
> happened. Something like:
> 
> static inline int _cond_resched(void)
> {
> 	int scheduled;
> 
> 	scheduled = __cond_resched();
> 	if (scheduled)
> 		klp_sched_try_switch();
> 
> 	return scheduled;
> }
> 
> But it would make it less reliable/predictable. Also, it won't work
> in configurations where cond_resched() is always a nop.
> 
> I am probably being too careful. We might keep it simple until
> real-life problems are reported.

If we can get away with it, I much prefer the simple unconditional
klp_sched_try_switch() because of the predictability and quickness with
which the kthread gets patched.

> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -76,6 +96,8 @@ static void klp_complete_transition(void)
> >  		 klp_transition_patch->mod->name,
> >  		 klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
> >  
> > +	klp_cond_resched_disable();
> > +
> 
> Nit: Strictly speaking, this is not needed when klp_complete_transition()
>      is called from klp_cancel_transition(). In this case,
>      klp_cond_resched_enable() was not called. So it might be moved into
>      klp_try_complete_transition().

Argh, I always forget about that pesky klp_cancel_transition().

> A more important thing, thinking aloud:
> 
> We need to make sure that no task is in the middle of
> __klp_sched_try_switch() when we modify anything that is used there.
> 
> We seem to be on the safe side in klp_complete_transition(). We are
> here only when all tasks have TIF_PATCH_PENDING cleared. In this case,
> __klp_sched_try_switch() just returns. Also it calls
> klp_synchronize_transition() so that all tasks finish the critical part
> in __klp_sched_try_switch() before any new transition starts.
> 
> But that is not the case in klp_reverse_transition(). It modifies
> klp_target_state while __klp_sched_try_switch() might be in the middle
> of klp_check_stack(), which might then give a wrong result.
> 
> klp_reverse_transition() already solves a similar race with
> klp_update_patch_state() by clearing all TIF_PATCH_PENDING flags
> and calling klp_synchronize_transition(). We just need to do
> it earlier. Something like:

Yes!  Thanks, I can always count on you to find the race conditions ;-)

This highlights the similarities between klp_update_patch_state(current)
and __klp_sched_try_switch(): they both access TIF_PATCH_PENDING
out-of-band.

Also, I'll update the comment in klp_copy_process(). It should be safe
with __klp_sched_try_switch() for the same reason as
klp_update_patch_state(current): they all only work on 'current'.
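
Roughly along these lines (untested sketch; the existing function body
and the final comment wording shown here are assumptions):

void klp_copy_process(struct task_struct *child)
{
	/*
	 * The operation is serialized against all klp_*_transition()
	 * operations by the tasklist_lock.  The only exceptions are
	 * klp_update_patch_state(current) and __klp_sched_try_switch(),
	 * and they can't race with this because they only ever operate
	 * on 'current', which is us.
	 */
	if (test_tsk_thread_flag(current, TIF_PATCH_PENDING))
		set_tsk_thread_flag(child, TIF_PATCH_PENDING);
	else
		clear_tsk_thread_flag(child, TIF_PATCH_PENDING);

	child->patch_state = current->patch_state;
}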
  
Petr Mladek Feb. 17, 2023, 9:14 a.m. UTC | #3
On Wed 2023-02-15 18:26:30, Josh Poimboeuf wrote:
> On Wed, Feb 15, 2023 at 02:30:36PM +0100, Petr Mladek wrote:
> > >  static inline int _cond_resched(void)
> > >  {
> > > +	klp_sched_try_switch();
> > >  	return __cond_resched();
> > 
> > My only concern is if it might cause any performance problems.
> > 
> > On one hand, cond_resched() is used in code paths that are slow
> > on their own. Also, the new check will do nothing most of the time.
> > 
> > On the other hand, cond_resched() is typically used in loops.
> > One iteration might be fast; the code might be slow only because
> > there are many iterations. Repeating the same failing stack check
> > on every iteration might prolong the time significantly.
> 
> Yes, but it should hopefully be very rare to patch a function in the
> call stack of a kthread loop.  In general it's a good idea for the patch
> author to avoid that.

I do not have any data about it.

> > An idea is to try the switch only when a real reschedule actually
> > happened. Something like:
> > 
> > static inline int _cond_resched(void)
> > {
> > 	int scheduled;
> > 
> > 	scheduled = __cond_resched();
> > 	if (scheduled)
> > 		klp_sched_try_switch();
> > 
> > 	return scheduled;
> > }
> > 
> > But it would make it less reliable/predictable. Also, it won't work
> > in configurations where cond_resched() is always a nop.
> > 
> > I am probably being too careful. We might keep it simple until
> > real-life problems are reported.
> 
> If we can get away with it, I much prefer the simple unconditional
> klp_sched_try_switch() because of the predictability and quickness with
> which the kthread gets patched.

I am fine with keeping it simple now. We could always change it when
it causes problems.

I primarily wanted to point out the potential problem and check what
others think about it.

Best Regards,
Petr
  

Patch

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 293e29960c6e..9b9b38e89563 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -13,6 +13,7 @@ 
 #include <linux/ftrace.h>
 #include <linux/completion.h>
 #include <linux/list.h>
+#include <linux/livepatch_sched.h>
 
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
new file mode 100644
index 000000000000..013794fb5da0
--- /dev/null
+++ b/include/linux/livepatch_sched.h
@@ -0,0 +1,29 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_LIVEPATCH_SCHED_H_
+#define _LINUX_LIVEPATCH_SCHED_H_
+
+#include <linux/jump_label.h>
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_LIVEPATCH
+
+void __klp_sched_try_switch(void);
+
+#if !defined(CONFIG_PREEMPT_DYNAMIC) || !defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+
+static __always_inline void klp_sched_try_switch(void)
+{
+	if (static_branch_unlikely(&klp_sched_try_switch_key))
+		__klp_sched_try_switch();
+}
+
+#endif /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+#else /* !CONFIG_LIVEPATCH */
+static inline void klp_sched_try_switch(void) {}
+static inline void __klp_sched_try_switch(void) {}
+#endif /* CONFIG_LIVEPATCH */
+
+#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4df2b3e76b30..2bc5e925a6d8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@ 
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
 #include <linux/rv.h>
+#include <linux/livepatch_sched.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -2070,6 +2071,9 @@  extern int __cond_resched(void);
 
 #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 
+void sched_dynamic_klp_enable(void);
+void sched_dynamic_klp_disable(void);
+
 DECLARE_STATIC_CALL(cond_resched, __cond_resched);
 
 static __always_inline int _cond_resched(void)
@@ -2078,6 +2082,7 @@  static __always_inline int _cond_resched(void)
 }
 
 #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+
 extern int dynamic_cond_resched(void);
 
 static __always_inline int _cond_resched(void)
@@ -2085,20 +2090,25 @@  static __always_inline int _cond_resched(void)
 	return dynamic_cond_resched();
 }
 
-#else
+#else /* !CONFIG_PREEMPTION */
 
 static inline int _cond_resched(void)
 {
+	klp_sched_try_switch();
 	return __cond_resched();
 }
 
-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
 
-#else
+#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
 
-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void)
+{
+	klp_sched_try_switch();
+	return 0;
+}
 
-#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
+#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
 
 #define cond_resched() ({			\
 	__might_resched(__FILE__, __LINE__, 0);	\
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 4d1f443778f7..e629966c6fbe 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@ 
 
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
+#include <linux/static_call.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -24,6 +25,25 @@  static int klp_target_state = KLP_UNDEFINED;
 
 static unsigned int klp_signals_cnt;
 
+/*
+ * When a livepatch is in progress, enable klp stack checking in
+ * cond_resched().  This helps CPU-bound kthreads get patched.
+ */
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+#define klp_cond_resched_enable() sched_dynamic_klp_enable()
+#define klp_cond_resched_disable() sched_dynamic_klp_disable()
+
+#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+EXPORT_SYMBOL(klp_sched_try_switch_key);
+
+#define klp_cond_resched_enable() static_branch_enable(&klp_sched_try_switch_key)
+#define klp_cond_resched_disable() static_branch_disable(&klp_sched_try_switch_key)
+
+#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
 /*
  * This work can be performed periodically to finish patching or unpatching any
  * "straggler" tasks which failed to transition in the first attempt.
@@ -76,6 +96,8 @@  static void klp_complete_transition(void)
 		 klp_transition_patch->mod->name,
 		 klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
 
+	klp_cond_resched_disable();
+
 	if (klp_transition_patch->replace && klp_target_state == KLP_PATCHED) {
 		klp_unpatch_replaced_patches(klp_transition_patch);
 		klp_discard_nops(klp_transition_patch);
@@ -338,6 +360,36 @@  static bool klp_try_switch_task(struct task_struct *task)
 	return !ret;
 }
 
+void __klp_sched_try_switch(void)
+{
+	if (likely(!klp_patch_pending(current)))
+		return;
+
+	/*
+	 * This function is called from cond_resched() which is called in many
+	 * places throughout the kernel.  Using the klp_mutex here might
+	 * deadlock.
+	 *
+	 * Instead, disable preemption to prevent racing with other callers of
+	 * klp_try_switch_task().  Thanks to task_call_func() they won't be
+	 * able to switch this task while it's running.
+	 */
+	preempt_disable();
+
+	/*
+	 * Make sure current didn't get patched between the above check and
+	 * preempt_disable().
+	 */
+	if (unlikely(!klp_patch_pending(current)))
+		goto out;
+
+	klp_try_switch_task(current);
+
+out:
+	preempt_enable();
+}
+EXPORT_SYMBOL(__klp_sched_try_switch);
+
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
  * Kthreads with TIF_PATCH_PENDING set are woken up.
@@ -496,6 +548,8 @@  void klp_start_transition(void)
 			set_tsk_thread_flag(task, TIF_PATCH_PENDING);
 	}
 
+	klp_cond_resched_enable();
+
 	klp_signals_cnt = 0;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3a0ef2fefbd5..ed5bf5ec0477 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8500,6 +8500,7 @@  EXPORT_STATIC_CALL_TRAMP(might_resched);
 static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
 int __sched dynamic_cond_resched(void)
 {
+	klp_sched_try_switch();
 	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
 		return 0;
 	return __cond_resched();
@@ -8648,13 +8649,17 @@  int sched_dynamic_mode(const char *str)
 #error "Unsupported PREEMPT_DYNAMIC mechanism"
 #endif
 
-void sched_dynamic_update(int mode)
+DEFINE_MUTEX(sched_dynamic_mutex);
+static bool klp_override;
+
+static void __sched_dynamic_update(int mode)
 {
 	/*
 	 * Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
 	 * the ZERO state, which is invalid.
 	 */
-	preempt_dynamic_enable(cond_resched);
+	if (!klp_override)
+		preempt_dynamic_enable(cond_resched);
 	preempt_dynamic_enable(might_resched);
 	preempt_dynamic_enable(preempt_schedule);
 	preempt_dynamic_enable(preempt_schedule_notrace);
@@ -8662,36 +8667,79 @@  void sched_dynamic_update(int mode)
 
 	switch (mode) {
 	case preempt_dynamic_none:
-		preempt_dynamic_enable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
 		preempt_dynamic_disable(irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: none\n");
+		if (mode != preempt_dynamic_mode)
+			pr_info("Dynamic Preempt: none\n");
 		break;
 
 	case preempt_dynamic_voluntary:
-		preempt_dynamic_enable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_enable(cond_resched);
 		preempt_dynamic_enable(might_resched);
 		preempt_dynamic_disable(preempt_schedule);
 		preempt_dynamic_disable(preempt_schedule_notrace);
 		preempt_dynamic_disable(irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: voluntary\n");
+		if (mode != preempt_dynamic_mode)
+			pr_info("Dynamic Preempt: voluntary\n");
 		break;
 
 	case preempt_dynamic_full:
-		preempt_dynamic_disable(cond_resched);
+		if (!klp_override)
+			preempt_dynamic_disable(cond_resched);
 		preempt_dynamic_disable(might_resched);
 		preempt_dynamic_enable(preempt_schedule);
 		preempt_dynamic_enable(preempt_schedule_notrace);
 		preempt_dynamic_enable(irqentry_exit_cond_resched);
-		pr_info("Dynamic Preempt: full\n");
+		if (mode != preempt_dynamic_mode)
+			pr_info("Dynamic Preempt: full\n");
 		break;
 	}
 
 	preempt_dynamic_mode = mode;
 }
 
+void sched_dynamic_update(int mode)
+{
+	mutex_lock(&sched_dynamic_mutex);
+	__sched_dynamic_update(mode);
+	mutex_unlock(&sched_dynamic_mutex);
+}
+
+#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
+
+static int klp_cond_resched(void)
+{
+	__klp_sched_try_switch();
+	return __cond_resched();
+}
+
+void sched_dynamic_klp_enable(void)
+{
+	mutex_lock(&sched_dynamic_mutex);
+
+	klp_override = true;
+	static_call_update(cond_resched, klp_cond_resched);
+
+	mutex_unlock(&sched_dynamic_mutex);
+}
+
+void sched_dynamic_klp_disable(void)
+{
+	mutex_lock(&sched_dynamic_mutex);
+
+	klp_override = false;
+	__sched_dynamic_update(preempt_dynamic_mode);
+
+	mutex_unlock(&sched_dynamic_mutex);
+}
+
+#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
 static int __init setup_preempt_mode(char *str)
 {
 	int mode = sched_dynamic_mode(str);