[v2,2/2] lib/percpu_counter: fix dying cpu compare race

Message ID 20230406015629.1804722-3-yebin@huaweicloud.com
State New
Series fix dying cpu compare race

Commit Message

Ye Bin April 6, 2023, 1:56 a.m. UTC
  From: Ye Bin <yebin10@huawei.com>

In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
condition between a cpu dying and percpu_counter_sum() iterating online CPUs
was identified.
Actually, the same race condition exists between a cpu dying and
__percpu_counter_compare(), which uses 'num_online_cpus()' for its quick
judgment. But 'num_online_cpus()' is decreased before
'percpu_counter_cpu_dead()' is called, so the quick judgment may return an
incorrect result.
To solve this issue, also count dying CPUs when doing the quick judgment
in __percpu_counter_compare().

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 lib/percpu_counter.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
  

Comments

Dennis Zhou April 8, 2023, 7:13 a.m. UTC | #1
Hello,

On Thu, Apr 06, 2023 at 09:56:29AM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
> 
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online CPUs
> was identified.
> Actually, the same race condition exists between a cpu dying and
> __percpu_counter_compare(), which uses 'num_online_cpus()' for its quick
> judgment. But 'num_online_cpus()' is decreased before
> 'percpu_counter_cpu_dead()' is called, so the quick judgment may return an
> incorrect result.
> To solve this issue, also count dying CPUs when doing the quick judgment
> in __percpu_counter_compare().
> 

I've thought a lot about this since you've sent v1. For the general
problem, you haven't addressed Dave's concerns from [1].

I agree you've found a valid race condition, but as Dave mentioned,
there's no synchronization in __percpu_counter_compare() and
consequently no guarantees about the accuracy of the value.

However, I might be missing something, but I do think the use case in
5825bea05265 ("xfs: __percpu_counter_compare() inode count debug too expensive")
is potentially valid. If the rhs is an expected lower bound or upper
bound (depending on whether you're counting up or down, but not both) and
the count you're tracking has the same expectations as percpu_refcount
(you can only subtract what you've already added), then should
percpu_counter_sum() ever be on the wrong side of rhs, that should be an
error and visible via percpu_counter_compare().

I need to think a little longer, but my initial thought is that while you
close a race condition, the function itself is inherently vulnerable.

[1] https://lore.kernel.org/lkml/ZCu9LtdA+NMrfG9x@rh/

Thanks,
Dennis

> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
>  lib/percpu_counter.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 5004463c4f9f..399840cb0012 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
>  
> +static __always_inline unsigned int num_count_cpus(void)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> +	return (num_online_cpus() + num_dying_cpus());
> +#else
> +	return num_online_cpus();
> +#endif
> +}
> +
>  /*
>   * Compare counter against given value.
>   * Return 1 if greater, 0 if equal and -1 if less
> @@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
>  
>  	count = percpu_counter_read(fbc);
>  	/* Check to see if rough count will be sufficient for comparison */
> -	if (abs(count - rhs) > (batch * num_online_cpus())) {
> +	if (abs(count - rhs) > (batch * num_count_cpus())) {
>  		if (count > rhs)
>  			return 1;
>  		else
> -- 
> 2.31.1
> 
>
  
Thomas Gleixner April 10, 2023, 8:53 p.m. UTC | #2
On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online CPUs
> was identified.
> Actually, the same race condition exists between a cpu dying and
> __percpu_counter_compare(), which uses 'num_online_cpus()' for its quick
> judgment. But 'num_online_cpus()' is decreased before
> 'percpu_counter_cpu_dead()' is called, so the quick judgment may return an
> incorrect result.
> To solve this issue, also count dying CPUs when doing the quick judgment
> in __percpu_counter_compare().

This is all bogus including the above commit.

All CPU masks including cpu_online_mask and cpu_dying_mask are only
valid when the context is:

  - A CPU hotplug callback

  - Any other code which holds cpu_hotplug_lock read or write locked.

cpu_online_mask is special in that regard. It is also protected against
modification in any preemption disabled region, but that's a pure
implementation detail.

cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
held. And even with that lock held any cpumask operation on it is silly.
The mask is a core detail:

  commit e40f74c535b8 "cpumask: Introduce DYING mask"
    
    Introduce a cpumask that indicates (for each CPU) what direction the
    CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
    an up fails and we do a roll-back down, it will accurately reflect the
    direction.

It does not tell anything to a user who is not aware of the actual
hotplug state machine state.


The real reason for this percpu counter issue is how percpu counter
hotplug callbacks are implemented: They are asymmetric and at the
completely wrong place.

The above 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") was
done via XFS and without the courtesy of CC'ing the people who care
about the CPU hotplug core. The lengthy analysis of this commit is all
shiny, but fundamentally wrong. See above.

I'm way too tired to come up with a proper fix for this mess, but I'm
going to stare at it tomorrow morning with brain awake.

Thanks,

        tglx
  
Thomas Gleixner April 11, 2023, 6:56 a.m. UTC | #3
On Mon, Apr 10 2023 at 22:53, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
> held. And even with that lock held any cpumask operation on it is silly.
> The mask is a core detail:
>
>   commit e40f74c535b8 "cpumask: Introduce DYING mask"
>     
>     Introduce a cpumask that indicates (for each CPU) what direction the
>     CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
>     an up fails and we do a roll-back down, it will accurately reflect the
>     direction.
>
> It does not tell anything to a user who is not aware of the actual
> hotplug state machine state.

Even if the mask is stable most of the time, it's a total disaster
performance-wise. The bits in cpu_dying_mask are sticky until the next
online operation.

So for a system which has SMT enabled in the BIOS but SMT disabled on
the kernel command line or later via the sysfs knob, the loop in
__percpu_counter_sum() will iterate over all shut-down SMT siblings
forever. IOW, it will touch nr_of_shutdown_cpus() cachelines for
absolutely zero reason.

The same applies for the proposed counter.

Thanks,

        tglx
  

Patch

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 5004463c4f9f..399840cb0012 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
+static __always_inline unsigned int num_count_cpus(void)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	return (num_online_cpus() + num_dying_cpus());
+#else
+	return num_online_cpus();
+#endif
+}
+
 /*
  * Compare counter against given value.
  * Return 1 if greater, 0 if equal and -1 if less
@@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 
 	count = percpu_counter_read(fbc);
 	/* Check to see if rough count will be sufficient for comparison */
-	if (abs(count - rhs) > (batch * num_online_cpus())) {
+	if (abs(count - rhs) > (batch * num_count_cpus())) {
 		if (count > rhs)
 			return 1;
 		else