[04/12] clockevent unbind: use smp_call_func_single_fail

Message ID: 20240206185709.928420669@redhat.com
State: New
Series: cpu isolation: infra to block interference to select CPUs

Commit Message

Marcelo Tosatti Feb. 6, 2024, 6:49 p.m. UTC
Convert clockevents_unbind from smp_call_function_single
to smp_call_func_single_fail, which will fail if
the target CPU is tagged as a block-interference CPU.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
  

Comments

Thomas Gleixner Feb. 7, 2024, 11:55 a.m. UTC | #1
On Tue, Feb 06 2024 at 15:49, Marcelo Tosatti wrote:
> Convert clockevents_unbind from smp_call_function_single
> to smp_call_func_single_fail, which will fail if
> the target CPU is tagged as a block-interference CPU.

Seriously? This is a root only operation. So if root wants to interfere
then so be it.
  
Marcelo Tosatti Feb. 7, 2024, 12:51 p.m. UTC | #2
On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote:
> On Tue, Feb 06 2024 at 15:49, Marcelo Tosatti wrote:
> > Convert clockevents_unbind from smp_call_function_single
> > to smp_call_func_single_fail, which will fail if
> > the target CPU is tagged as a block-interference CPU.
> 
> Seriously? This is a root only operation. So if root wants to interfere
> then so be it.

Hi Thomas!

OK, so the problem is the following: due to software complexity, one is
often not aware of all operations that might take place.

For example:

https://lore.kernel.org/lkml/Y6mXDUZkII5OnuE8@tpad/T/

[PATCH] hwmon: coretemp: avoid RDMSR interruptions to isolated CPUs

The coretemp driver uses rdmsr_on_cpu calls to read
MSR_IA32_PACKAGE_THERM_STATUS/MSR_IA32_THERM_STATUS registers,
which contain information about current core temperature.

For certain low latency applications, the RDMSR interruption exceeds
the application's requirements.

So disable reading of the crit_alarm and temp files via /sys when
CPU isolation is enabled.

Temperature information from the housekeeping cores should be
sufficient to infer die temperature.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index 9bee4d33fbdf..30a35f4130d5 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -27,6 +27,7 @@
 #include <asm/msr.h>
 #include <asm/processor.h>
 #include <asm/cpu_device_id.h>
+#include <linux/sched/isolation.h>
 
 #define DRVNAME	"coretemp"
 
@@ -121,6 +122,10 @@ static ssize_t show_crit_alarm(struct device *dev,
 	struct platform_data *pdata = dev_get_drvdata(dev);
 	struct temp_data *tdata = pdata->core_data[attr->index];
 
+
+	if (!housekeeping_cpu(tdata->cpu, HK_TYPE_MISC))
+		return -EINVAL;
+
 	mutex_lock(&tdata->update_lock);
 	rdmsr_on_cpu(tdata->cpu, tdata->status_reg, &eax, &edx);
 	mutex_unlock(&tdata->update_lock);
@@ -158,6 +163,8 @@ static ssize_t show_temp(struct device *dev,
 
 	/* Check whether the time interval has elapsed */
 	if (!tdata->valid || time_after(jiffies, tdata->last_updated + HZ)) {
+		if (!housekeeping_cpu(tdata->cpu, HK_TYPE_MISC))
+			return -EINVAL;
 		rdmsr_on_cpu(tdata->cpu, tdata->status_reg, &eax, &edx);
 		/*
 		 * Ignore the valid bit. In all observed cases the register


In this case, a userspace application (collecting telemetry data) was
responsible for reading the sysfs files.

Now think of all possible paths, from userspace, that lead to kernel
code that ends up in smp_call_function_* variants (or other functions
that cause IPIs to isolated CPUs).

The alternative to blocking this in the kernel would be to validate all
userspace software involved in your application, to ensure it won't end
up in the kernel sending IPIs. Which is impractical, isn't it?
(Or rather, with such an option in the kernel, it would be possible to run
applications which have not been validated, since the kernel would fail
the operation that results in an IPI to an isolated CPU.)

So the idea would be an additional "isolation mode", which, when enabled,
would disallow the IPIs. It's still possible for the root user to disable
this mode, and retry the operation.

So let's say I want to read MSRs on a given CPU, as root.

You'd have to: 

1) readmsr on given CPU (returns -EPERM or whatever), since the
"block interference" mode is enabled for that CPU.

2) Disable that CPU in the block interference cpumask.

3) readmsr on the given CPU (success).

4) Re-enable CPU in block interference cpumask, if desired.
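
A rough userspace sketch of that sequence, purely illustrative: the
block-interference cpumask knob and the -EPERM failure are assumptions
about the proposed (not yet existing) interface; only /dev/cpu/N/msr is
an existing kernel interface.

/* Illustrative only: the block-interference knob below is hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_THERM_STATUS 0x19c

static int read_msr(int cpu, uint32_t msr, uint64_t *val)
{
	char path[64];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* The msr driver reads 8 bytes at an offset equal to the MSR number. */
	if (pread(fd, val, sizeof(*val), msr) != sizeof(*val))
		ret = -1;	/* 1) would fail while interference blocking is active */
	close(fd);
	return ret;
}

int main(void)
{
	uint64_t val;

	if (read_msr(3, IA32_THERM_STATUS, &val) < 0) {
		/*
		 * 2) Root removes CPU 3 from the block-interference cpumask
		 *    (e.g. via a hypothetical sysfs cpumask file),
		 * 3) retries read_msr(), which now succeeds, and
		 * 4) puts CPU 3 back into the cpumask if desired.
		 */
	}
	return 0;
}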


(BTW, better ideas than the cpumask are welcome, but it seems the
reliability aspect of something similar to this is necessary.)

Thanks!
  
Thomas Gleixner Feb. 11, 2024, 8:52 a.m. UTC | #3
On Wed, Feb 07 2024 at 09:51, Marcelo Tosatti wrote:
> On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote:
>
> OK, so the problem is the following: due to software complexity, one is
> often not aware of all operations that might take place.

The problem is that people throw random crap on their systems and avoid
proper system engineering and then complain that their realtime
constraints are violated. So you are proliferating bad engineering
practices and encourage people not to care.

> Now think of all possible paths, from userspace, that lead to kernel
> code that ends up in smp_call_function_* variants (or other functions
> that cause IPIs to isolated CPUs).

So you need to analyze every possible code path and interface and add
your magic functions there after figuring out whether that's valid or
not.

> The alternative to blocking this in the kernel would be to validate all
> userspace software involved in your application, to ensure it won't end
> up in the kernel sending IPIs. Which is impractical, isn't it?

It's absolutely not impractical. It's part of proper system
engineering. The wet dream that you can run random docker containers and
everything works magically is just a wet dream.

> (Or rather, with such an option in the kernel, it would be possible to run
> applications which have not been validated, since the kernel would fail
> the operation that results in an IPI to an isolated CPU.)

That's a fallacy because you _cannot_ define with a single CPU mask
which interface is valid in a particular configuration to end up with an
IPI and which one is not. There are legitimate reasons in realtime or
latency-constrained systems to invoke selective functionality which
interferes with the overall system constraints.

How do you cover that with your magic CPU mask? You can't.

Aside from that, there is a decent chance that you are subtly breaking user
space that way. Just look at that hwmon/coretemp commit you pointed to:

  "Temperature information from the housekeeping cores should be
   sufficient to infer die temperature."

That's just wishful thinking for various reasons:

  - The die temperature on larger packages is not evenly distributed and
    you can run into situations where the housekeeping cores are sitting
    "far" enough away from the worker core which creates the heat spot

  - Some monitoring applications just stop working when they can't read
    the full data set, which means that they break subtly and you can
    infer exactly nothing.

> So the idea would be an additional "isolation mode", which, when enabled,
> would disallow the IPIs. It's still possible for the root user to disable
> this mode, and retry the operation.
>
> So let's say I want to read MSRs on a given CPU, as root.
>
> You'd have to: 
>
> 1) readmsr on given CPU (returns -EPERM or whatever), since the
> "block interference" mode is enabled for that CPU.
>
> 2) Disable that CPU in the block interference cpumask.
>
> 3) readmsr on the given CPU (success).
>
> 4) Re-enable CPU in block interference cpumask, if desired.

That's just wrong. Why?

Once you enable it just to read the MSR you enable the operation for
_ALL_ other non-validated crap too. So while the single MSR read might
be OK under certain circumstances the fact that you open up a window for
all other interfaces to do far more interfering operations is a red
flag.

This whole thing is a really badly defined policy mechanism of very
dubious value.

Thanks,

        tglx
  
Marcelo Tosatti Feb. 14, 2024, 6:58 p.m. UTC | #4
On Sun, Feb 11, 2024 at 09:52:35AM +0100, Thomas Gleixner wrote:
> On Wed, Feb 07 2024 at 09:51, Marcelo Tosatti wrote:
> > On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote:
> >
> > OK, so the problem is the following: due to software complexity, one is
> > often not aware of all operations that might take place.
> 
> The problem is that people throw random crap on their systems and avoid
> proper system engineering and then complain that their realtime
> constraints are violated. So you are proliferating bad engineering
> practices and encourage people not to care.

It's more of a practicality and cost concern: one usually does not have
the resources to fully review software before using it.

> > Now think of all possible paths, from userspace, that lead to kernel
> > code that ends up in smp_call_function_* variants (or other functions
> > that cause IPIs to isolated CPUs).
> 
> So you need to analyze every possible code path and interface and add
> your magic functions there after figuring out whether that's valid or
> not.

"A magic function", yes.

> > The alternative to blocking this in the kernel would be to validate all
> > userspace software involved in your application, to ensure it won't end
> > up in the kernel sending IPIs. Which is impractical, isn't it?
> 
> It's absolutely not impractical. It's part of proper system
> engineering. The wet dream that you can run random docker containers and
> everything works magically is just a wet dream.

Unfortunately that is what people do.

I understand that "full software review" would be the ideal, but in most
situations it does not seem to happen.

> > (Or rather, with such an option in the kernel, it would be possible to run
> > applications which have not been validated, since the kernel would fail
> > the operation that results in an IPI to an isolated CPU.)
> 
> That's a fallacy because you _cannot_ define with a single CPU mask
> which interface is valid in a particular configuration to end up with an
> IPI and which one is not. There are legitimate reasons in realtime or
> latency-constrained systems to invoke selective functionality which
> interferes with the overall system constraints.
> 
> How do you cover that with your magic CPU mask? You can't.
> 
> Aside from that, there is a decent chance that you are subtly breaking user
> space that way. Just look at that hwmon/coretemp commit you pointed to:
> 
>   "Temperature information from the housekeeping cores should be
>    sufficient to infer die temperature."
> 
> That's just wishful thinking for various reasons:
> 
>   - The die temperature on larger packages is not evenly distributed and
>     you can run into situations where the housekeeping cores are sitting
>     "far" enough away from the worker core which creates the heat spot

I know.

>   - Some monitoring applications just stop working when they can't read
>     the full data set, which means that they break subtly and you can
>     infer exactly nothing.
> 
> > So the idea would be an additional "isolation mode", which, when enabled,
> > would disallow the IPIs. It's still possible for the root user to disable
> > this mode, and retry the operation.
> >
> > So let's say I want to read MSRs on a given CPU, as root.
> >
> > You'd have to: 
> >
> > 1) readmsr on given CPU (returns -EPERM or whatever), since the
> > "block interference" mode is enabled for that CPU.
> >
> > 2) Disable that CPU in the block interference cpumask.
> >
> > 3) readmsr on the given CPU (success).
> >
> > 4) Re-enable CPU in block interference cpumask, if desired.
> 
> That's just wrong. Why?
> 
> Once you enable it just to read the MSR you enable the operation for
> _ALL_ other non-validated crap too. So while the single MSR read might
> be OK under certain circumstances the fact that you open up a window for
> all other interfaces to do far more interfering operations is a red
> flag.
> 
> This whole thing is a really badly defined policy mechanism of very
> dubious value.
> 
> Thanks,

OK, fair enough. From your comments, it seems that per-callsite 
toggling would be ideal, for example:

/sys/kernel/interference_blocking/ directory containing one sub-directory
per call site.

Inside each sub-directory, an "enabled" file, controlling a boolean
to enable or disable interference blocking for that particular
callsite.

Also a "cpumask" file in each directory, by default containing the same
cpumask as the nohz_full CPUs, to control which CPUs interference is
blocked for.

How does that sound?
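
A minimal sketch of what such a per-callsite knob could look like on the
kernel side, assuming the layout proposed above; every name here is
hypothetical and only illustrates the shape of the idea, not an actual
implementation.

/*
 * Hypothetical sketch of the proposed per-callsite knobs; none of these
 * names exist in the kernel.
 */
#include <linux/cpumask.h>
#include <linux/types.h>

struct block_interf_site {
	const char	*name;		/* sub-directory name, e.g. "clockevents_unbind" */
	bool		enabled;	/* backs the "enabled" file */
	cpumask_var_t	cpumask;	/* backs the "cpumask" file, default: nohz_full set */
};

/* A converted callsite would consult its knob before sending the IPI: */
static inline bool block_interf_blocked(const struct block_interf_site *site, int cpu)
{
	return site->enabled && cpumask_test_cpu(cpu, site->cpumask);
}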
  

Patch

Index: linux-isolation/kernel/time/clockevents.c
===================================================================
--- linux-isolation.orig/kernel/time/clockevents.c
+++ linux-isolation/kernel/time/clockevents.c
@@ -13,6 +13,7 @@ 
 #include <linux/module.h>
 #include <linux/smp.h>
 #include <linux/device.h>
+#include <linux/sched/isolation.h>
 
 #include "tick-internal.h"
 
@@ -416,9 +417,14 @@  static void __clockevents_unbind(void *a
  */
 static int clockevents_unbind(struct clock_event_device *ced, int cpu)
 {
+	int ret, idx;
 	struct ce_unbind cu = { .ce = ced, .res = -ENODEV };
 
-	smp_call_function_single(cpu, __clockevents_unbind, &cu, 1);
+	idx = block_interf_srcu_read_lock();
+	ret = smp_call_function_single_fail(cpu, __clockevents_unbind, &cu, 1);
+	block_interf_srcu_read_unlock(idx);
+	if (ret)
+		return ret;
 	return cu.res;
 }
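
For context, smp_call_function_single_fail() and the block_interf_srcu_*
helpers used above are introduced by earlier patches of this series, which
are not part of this message. A minimal sketch of plausible semantics
follows; all names and details below are assumptions about that
infrastructure, not the actual code from the series.

/* Sketch only: a guess at the infrastructure this patch relies on. */
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/smp.h>
#include <linux/srcu.h>

static struct cpumask block_interf_cpumask;	/* CPUs shielded from IPIs */
DEFINE_STATIC_SRCU(block_interf_srcu);

int block_interf_srcu_read_lock(void)
{
	return srcu_read_lock(&block_interf_srcu);
}

void block_interf_srcu_read_unlock(int idx)
{
	srcu_read_unlock(&block_interf_srcu, idx);
}

/* Like smp_call_function_single(), but refuse to IPI a blocked CPU. */
int smp_call_function_single_fail(int cpu, smp_call_func_t func,
				  void *info, int wait)
{
	if (cpumask_test_cpu(cpu, &block_interf_cpumask))
		return -EPERM;
	return smp_call_function_single(cpu, func, info, wait);
}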