[v1] lib: objpool: fix head overrun on RK3588 SBC

Message ID 20231114115148.298821-1-wuqiang.matt@bytedance.com
State New
Headers
Series [v1] lib: objpool: fix head overrun on RK3588 SBC |

Commit Message

wuqiang.matt Nov. 14, 2023, 11:51 a.m. UTC
  objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
following kernel warnings:

    WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100

This message is from objpool.c:168:

    WARN_ON_ONCE(tail - head > pool->nr_objs);

The overrun test case is to validate the case that pre-allocated objects
are insufficient: 8 objects are pre-allocated for each node and consumer
thread per node tries to grab 16 objects in a row. The testing system is
OrangePI 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When
disabling either all 4 big or 4 little cores, the overrun tests run well,
and once with big and little cores mixed together, the overrun test would
always cause an overrun loop. It's likely the memory timing differences
of big and little cores cause this trouble. Here are the debugging data
of objpool_try_get_slot after try_cmpxchg_release:

    objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278

The local copies of 'head' and 'last' were 278 and 276, and reloading of
'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release
'slot->head' became 'head + 1', which is correct. But what's wrong here
is the stale value of 'last', and that stale value of 'last' finally led
the overrun of 'head'.

Memory updating of 'last' and 'head' are performed in push() and pop()
independently, which could be the culprit leading this out of order
visibility of 'last' and 'head'. So for objpool_try_get_slot(), it's
not enough only checking the condition of 'head != slot', the implicit
condition 'last - head <= nr_objs' must also be explicitly asserted to
guarantee 'last' is always behind 'head' before the object retrieving.

This patch will check and try reloading of 'head' and 'last' to ensure
'last' is behind 'head' at the time of object retrieving. Performance
testings show the average impact is about 0.1% for X86_64 and 1.12% for
ARM64. Here are the results:

    OS: Debian 10 X86_64, Linux 6.6rc
    HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
                      1T         2T         4T         8T        16T
    native:     49543304   99277826  199017659  399070324  795185848
    objpool:    29909085   59865637  119692073  239750369  478005250
    objpool+:   29879313   59230743  119609856  239067773  478509029
                     32T        48T        64T        96T       128T
    native:   1596927073 2390099988 2929397330 3183875848 3257546602
    objpool:   957553042 1435814086 1680872925 2043126796 2165424198
    objpool+:  956476281 1434491297 1666055740 2041556569 2157415622

    OS: Debian 11 AARCH64, Linux 6.6rc
    HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
                      1T         2T         4T         8T        16T
    native:     30890508   60399915  123111980  242257008  494002946
    objpool:    14742531   28883047   57739948  115886644  232455421
    objpool+:   14107220   29032998   57286084  113730493  232232850
                     24T        32T        48T        64T        96T
    native:    746406039 1000174750 1493236240 1998318364 2942911180
    objpool:   349164852  467284332  702296756  934459713 1387898285
    objpool+:  348388180  462750976  696606096  927865887 1368402195

Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
---
 lib/objpool.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
  

Comments

Masami Hiramatsu (Google) Nov. 20, 2023, 5:18 a.m. UTC | #1
On Tue, 14 Nov 2023 19:51:48 +0800
"wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:

> objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
> following kernel warnings:
> 
>     WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100
> 
> This message is from objpool.c:168:
> 
>     WARN_ON_ONCE(tail - head > pool->nr_objs);
> 
> The overrun test case is to validate the case that pre-allocated objects
> are insufficient: 8 objects are pre-allocated for each node and consumer
> thread per node tries to grab 16 objects in a row. The testing system is
> OrangePI 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When
> disabling either all 4 big or 4 little cores, the overrun tests run well,
> and once with big and little cores mixed together, the overrun test would
> always cause an overrun loop. It's likely the memory timing differences
> of big and little cores cause this trouble. Here are the debugging data
> of objpool_try_get_slot after try_cmpxchg_release:
> 
>     objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278
> 
> The local copies of 'head' and 'last' were 278 and 276, and reloading of
> 'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release
> 'slot->head' became 'head + 1', which is correct. But what's wrong here
> is the stale value of 'last', and that stale value of 'last' finally led
> the overrun of 'head'.

Ah, good catch! So even if the ring size is enough, the head/tail update
value is not updated locally, it can cause the overrun!

> 
> Memory updating of 'last' and 'head' are performed in push() and pop()
> independently, which could be the culprit leading this out of order
> visibility of 'last' and 'head'. So for objpool_try_get_slot(), it's
> not enough only checking the condition of 'head != slot', the implicit
> condition 'last - head <= nr_objs' must also be explicitly asserted to
> guarantee 'last' is always behind 'head' before the object retrieving.

Indeed. Thanks for the investigation!

> 
> This patch will check and try reloading of 'head' and 'last' to ensure
> 'last' is behind 'head' at the time of object retrieving. Performance
> testings show the average impact is about 0.1% for X86_64 and 1.12% for
> ARM64. Here are the results:
> 
>     OS: Debian 10 X86_64, Linux 6.6rc
>     HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
>                       1T         2T         4T         8T        16T
>     native:     49543304   99277826  199017659  399070324  795185848
>     objpool:    29909085   59865637  119692073  239750369  478005250
>     objpool+:   29879313   59230743  119609856  239067773  478509029
>                      32T        48T        64T        96T       128T
>     native:   1596927073 2390099988 2929397330 3183875848 3257546602
>     objpool:   957553042 1435814086 1680872925 2043126796 2165424198
>     objpool+:  956476281 1434491297 1666055740 2041556569 2157415622
> 
>     OS: Debian 11 AARCH64, Linux 6.6rc
>     HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
>                       1T         2T         4T         8T        16T
>     native:     30890508   60399915  123111980  242257008  494002946
>     objpool:    14742531   28883047   57739948  115886644  232455421
>     objpool+:   14107220   29032998   57286084  113730493  232232850
>                      24T        32T        48T        64T        96T
>     native:    746406039 1000174750 1493236240 1998318364 2942911180
>     objpool:   349164852  467284332  702296756  934459713 1387898285
>     objpool+:  348388180  462750976  696606096  927865887 1368402195
> 

OK, looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

And let me pick it. 

> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>

BTW, this is a real bugfix, so it should have Fixes tag :)

Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")

Thank you!

> ---
>  lib/objpool.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/lib/objpool.c b/lib/objpool.c
> index ce0087f64400..cfdc02420884 100644
> --- a/lib/objpool.c
> +++ b/lib/objpool.c
> @@ -201,6 +201,23 @@ static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
>  	while (head != READ_ONCE(slot->last)) {
>  		void *obj;
>  
> +		/*
> +		 * data visibility of 'last' and 'head' could be out of
> +		 * order since memory updating of 'last' and 'head' are
> +		 * performed in push() and pop() independently
> +		 *
> +		 * before any retrieving attempts, pop() must guarantee
> +		 * 'last' is behind 'head', that is to say, there must
> +		 * be available objects in slot, which could be ensured
> +		 * by condition 'last != head && last - head <= nr_objs'
> +		 * that is equivalent to 'last - head - 1 < nr_objs' as
> +		 * 'last' and 'head' are both unsigned int32
> +		 */
> +		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
> +			head = READ_ONCE(slot->head);
> +			continue;
> +		}
> +
>  		/* obj must be retrieved before moving forward head */
>  		obj = READ_ONCE(slot->entries[head & slot->mask]);
>  
> -- 
> 2.40.1
>
  
wuqiang.matt Nov. 20, 2023, 12:24 p.m. UTC | #2
On 2023/11/20 13:18, Masami Hiramatsu (Google) wrote:
> On Tue, 14 Nov 2023 19:51:48 +0800
> "wuqiang.matt" <wuqiang.matt@bytedance.com> wrote:
> 
>> objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
>> following kernel warnings:
>>
>>      WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100
>>
>> This message is from objpool.c:168:
>>
>>      WARN_ON_ONCE(tail - head > pool->nr_objs);
>>
>> The overrun test case is to validate the case that pre-allocated objects
>> are insufficient: 8 objects are pre-allocated for each node and consumer
>> thread per node tries to grab 16 objects in a row. The testing system is
>> OrangePI 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When
>> disabling either all 4 big or 4 little cores, the overrun tests run well,
>> and once with big and little cores mixed together, the overrun test would
>> always cause an overrun loop. It's likely the memory timing differences
>> of big and little cores cause this trouble. Here are the debugging data
>> of objpool_try_get_slot after try_cmpxchg_release:
>>
>>      objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278
>>
>> The local copies of 'head' and 'last' were 278 and 276, and reloading of
>> 'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release
>> 'slot->head' became 'head + 1', which is correct. But what's wrong here
>> is the stale value of 'last', and that stale value of 'last' finally led
>> the overrun of 'head'.
> 
> Ah, good catch! So even if the ring size is enough, the head/tail update
> value is not updated locally, it can cause the overrun!

It's really confusing at the first glance of such an issue. I was assuming
the order between 'last' and 'head' should be implicitly maintained, but
after more digging, then found that wasn't true actually, the order should
be explicitly guaranteed by pop().

I also verified with Amlogic A311D which has 6 cores (4x A73 and 4x A53),
and got same results. I think I just need re-discover the differences of
HMP (heterogeneous multiprocessing) for big.LITTLE or P/E cores cpus.

>>
>> Memory updating of 'last' and 'head' are performed in push() and pop()
>> independently, which could be the culprit leading this out of order
>> visibility of 'last' and 'head'. So for objpool_try_get_slot(), it's
>> not enough only checking the condition of 'head != slot', the implicit
>> condition 'last - head <= nr_objs' must also be explicitly asserted to
>> guarantee 'last' is always behind 'head' before the object retrieving.
> 
> Indeed. Thanks for the investigation!
> 
>>
>> This patch will check and try reloading of 'head' and 'last' to ensure
>> 'last' is behind 'head' at the time of object retrieving. Performance
>> testings show the average impact is about 0.1% for X86_64 and 1.12% for
>> ARM64. Here are the results:
>>
>>      OS: Debian 10 X86_64, Linux 6.6rc
>>      HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
>>                        1T         2T         4T         8T        16T
>>      native:     49543304   99277826  199017659  399070324  795185848
>>      objpool:    29909085   59865637  119692073  239750369  478005250
>>      objpool+:   29879313   59230743  119609856  239067773  478509029
>>                       32T        48T        64T        96T       128T
>>      native:   1596927073 2390099988 2929397330 3183875848 3257546602
>>      objpool:   957553042 1435814086 1680872925 2043126796 2165424198
>>      objpool+:  956476281 1434491297 1666055740 2041556569 2157415622
>>
>>      OS: Debian 11 AARCH64, Linux 6.6rc
>>      HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
>>                        1T         2T         4T         8T        16T
>>      native:     30890508   60399915  123111980  242257008  494002946
>>      objpool:    14742531   28883047   57739948  115886644  232455421
>>      objpool+:   14107220   29032998   57286084  113730493  232232850
>>                       24T        32T        48T        64T        96T
>>      native:    746406039 1000174750 1493236240 1998318364 2942911180
>>      objpool:   349164852  467284332  702296756  934459713 1387898285
>>      objpool+:  348388180  462750976  696606096  927865887 1368402195
>>
> 
> OK, looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> And let me pick it.
> 
>> Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
> 
> BTW, this is a real bugfix, so it should have Fixes tag :)
> 
> Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")
> 

Oh, right! Thanks for your kind reminder. I'll keep that in mind.

> Thank you!

Best regards.

> 
>> ---
>>   lib/objpool.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/lib/objpool.c b/lib/objpool.c
>> index ce0087f64400..cfdc02420884 100644
>> --- a/lib/objpool.c
>> +++ b/lib/objpool.c
>> @@ -201,6 +201,23 @@ static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
>>   	while (head != READ_ONCE(slot->last)) {
>>   		void *obj;
>>   
>> +		/*
>> +		 * data visibility of 'last' and 'head' could be out of
>> +		 * order since memory updating of 'last' and 'head' are
>> +		 * performed in push() and pop() independently
>> +		 *
>> +		 * before any retrieving attempts, pop() must guarantee
>> +		 * 'last' is behind 'head', that is to say, there must
>> +		 * be available objects in slot, which could be ensured
>> +		 * by condition 'last != head && last - head <= nr_objs'
>> +		 * that is equivalent to 'last - head - 1 < nr_objs' as
>> +		 * 'last' and 'head' are both unsigned int32
>> +		 */
>> +		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
>> +			head = READ_ONCE(slot->head);
>> +			continue;
>> +		}
>> +
>>   		/* obj must be retrieved before moving forward head */
>>   		obj = READ_ONCE(slot->entries[head & slot->mask]);
>>   
>> -- 
>> 2.40.1
>>
> 
>
  

Patch

diff --git a/lib/objpool.c b/lib/objpool.c
index ce0087f64400..cfdc02420884 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -201,6 +201,23 @@  static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
 	while (head != READ_ONCE(slot->last)) {
 		void *obj;
 
+		/*
+		 * data visibility of 'last' and 'head' could be out of
+		 * order since memory updating of 'last' and 'head' are
+		 * performed in push() and pop() independently
+		 *
+		 * before any retrieving attempts, pop() must guarantee
+		 * 'last' is behind 'head', that is to say, there must
+		 * be available objects in slot, which could be ensured
+		 * by condition 'last != head && last - head <= nr_objs'
+		 * that is equivalent to 'last - head - 1 < nr_objs' as
+		 * 'last' and 'head' are both unsigned int32
+		 */
+		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
+			head = READ_ONCE(slot->head);
+			continue;
+		}
+
 		/* obj must be retrieved before moving forward head */
 		obj = READ_ONCE(slot->entries[head & slot->mask]);