[v2,02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function

Message ID 20230209153204.683821550@redhat.com
State New
Series fold per-CPU vmstats remotely

Commit Message

Marcelo Tosatti Feb. 9, 2023, 3:01 p.m. UTC
  Goal is to have vmstat_shepherd to transfer from
per-CPU counters to global counters remotely. For this, 
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change ARM's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
  

Comments

David Hildenbrand March 2, 2023, 10:42 a.m. UTC | #1
On 09.02.23 16:01, Marcelo Tosatti wrote:
> Goal is to have vmstat_shepherd to transfer from
> per-CPU counters to global counters remotely. For this,
> an atomic this_cpu_cmpxchg is necessary.
> 
> Following the kernel convention for cmpxchg/cmpxchg_local,
> change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> ===================================================================
> --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
>   	_pcp_protect_return(xchg_relaxed, pcp, val)
>   
>   #define this_cpu_cmpxchg_1(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>   #define this_cpu_cmpxchg_2(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>   #define this_cpu_cmpxchg_4(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>   #define this_cpu_cmpxchg_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
> +
> +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
>   	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +

Call me confused (not necessarily your fault :) ).

We have cmpxchg_local, cmpxchg_relaxed and cmpxchg. 
this_cpu_cmpxchg_local_* now calls ... *drumroll* ... cmpxchg_relaxed.

IIUC, cmpxchg_local is only guaranteed to be atomic WRT the current CPU
(especially, protection against interrupts when the operation is 
implemented using multiple instructions). We do have a generic 
implementation that disables/enables interrupts.

IIUC, cmpxchg_relaxed is an atomic update without any memory ordering
guarantees (in contrast to cmpxchg, cmpxchg_acquire, cmpxchg_release).
We default to arch_cmpxchg if we don't have arch_cmpxchg_relaxed. 
arch_cmpxchg defaults to arch_cmpxchg_local, if not supported.


Naturally I wonder:

(a) Should these new variants be rather called
     this_cpu_cmpxchg_relaxed_* ?

(b) Should these new variants rather call the "_local" variant?


Shedding some light on this would be great.
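
For anyone following along, the three flavours can be approximated in user
space with C11 atomics. This is only a sketch under stated assumptions, not
the kernel implementation: the sketch_* names are hypothetical,
memory_order_seq_cst is a rough stand-in for the kernel's fully ordered
cmpxchg(), and the _local variant assumes nothing else can touch the word
concurrently (the generic kernel fallback gets that by disabling interrupts,
which has no equivalent here).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; NOT the kernel implementations. */

/* Ordered compare-and-exchange; returns the value found at *p. */
static inline uint32_t sketch_cmpxchg(_Atomic uint32_t *p, uint32_t o, uint32_t n)
{
	/* On failure, o is overwritten with the value observed at *p. */
	atomic_compare_exchange_strong_explicit(p, &o, n,
						memory_order_seq_cst,
						memory_order_seq_cst);
	return o;
}

/* Atomic against other threads, but with no ordering guarantees. */
static inline uint32_t sketch_cmpxchg_relaxed(_Atomic uint32_t *p, uint32_t o, uint32_t n)
{
	atomic_compare_exchange_strong_explicit(p, &o, n,
						memory_order_relaxed,
						memory_order_relaxed);
	return o;
}

/*
 * Plain read-compare-write. Only valid if no other context can modify *p
 * between the load and the store (assumption of this sketch).
 */
static inline uint32_t sketch_cmpxchg_local(uint32_t *p, uint32_t o, uint32_t n)
{
	uint32_t prev = *p;

	if (prev == o)
		*p = n;
	return prev;
}
```

All three return the previous value, so a caller detects success by comparing
the return value against the expected old value.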
  
David Hildenbrand March 2, 2023, 10:51 a.m. UTC | #2
On 02.03.23 11:42, David Hildenbrand wrote:
> On 09.02.23 16:01, Marcelo Tosatti wrote:
>> Goal is to have vmstat_shepherd to transfer from
>> per-CPU counters to global counters remotely. For this,
>> an atomic this_cpu_cmpxchg is necessary.
>>
>> Following the kernel convention for cmpxchg/cmpxchg_local,
>> change ARM's this_cpu_cmpxchg_ helpers to be atomic,
>> and add this_cpu_cmpxchg_local_ helpers which are not atomic.
>>
>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>
>> Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
>> ===================================================================
>> --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
>> +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
>> @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
>>    	_pcp_protect_return(xchg_relaxed, pcp, val)
>>    
>>    #define this_cpu_cmpxchg_1(pcp, o, n)	\
>> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>>    #define this_cpu_cmpxchg_2(pcp, o, n)	\
>> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>>    #define this_cpu_cmpxchg_4(pcp, o, n)	\
>> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>>    #define this_cpu_cmpxchg_8(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>> +
>> +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
>>    	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +
> 
> Call me confused (not necessarily your fault :) ).
> 
> We have cmpxchg_local, cmpxchg_relaxed and cmpxchg.
> this_cpu_cmpxchg_local_* now calls ... *drumroll* ... cmpxchg_relaxed.
> 
> IIUC, cmpxchg_local is only guaranteed to be atomic WRT the current CPU
> (especially, protection against interrupts when the operation is
> implemented using multiple instructions). We do have a generic
> implementation that disables/enables interrupts.
> 
> IIUC, cmpxchg_relaxed is an atomic update without any memory ordering
> guarantees (in contrast to cmpxchg, cmpxchg_acquire, cmpxchg_release).
> We default to arch_cmpxchg if we don't have arch_cmpxchg_relaxed.
> arch_cmpxchg defaults to arch_cmpxchg_local, if not supported.
> 
> 
> Naturally I wonder:
> 
> (a) Should these new variants be rather called
>       this_cpu_cmpxchg_relaxed_* ?
> 
> (b) Should these new variants rather call the "_local" variant?
> 
> 
> Shedding some light on this would be great.

Never mind, looking at the other patches I realized that this is
arch-specific. Other archs that have _local variants call the _local 
variants. So I assume we really want the name this_cpu_cmpxchg_local_*, 
and using _relaxed here is just the aarch64 way of implementing _local 
via _relaxed.

Confusing :)
  
Marcelo Tosatti March 2, 2023, 2:32 p.m. UTC | #3
On Thu, Mar 02, 2023 at 11:42:57AM +0100, David Hildenbrand wrote:
> On 09.02.23 16:01, Marcelo Tosatti wrote:
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this,
> > an atomic this_cpu_cmpxchg is necessary.
> > 
> > Following the kernel convention for cmpxchg/cmpxchg_local,
> > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> >   	_pcp_protect_return(xchg_relaxed, pcp, val)
> >   #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >   #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >   #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >   #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > +
> > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> >   	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +
> 
> Call me confused (not necessarily your fault :) ).
> 
> We have cmpxchg_local, cmpxchg_relaxed and cmpxchg. this_cpu_cmpxchg_local_*
> now calls ... *drumroll* ... cmpxchg_relaxed.
> IIUC, cmpxchg_local is only guaranteed to be atomic WRT the current CPU
> (especially, protection against interrupts when the operation is implemented
> using multiple instructions). We do have a generic implementation that
> disables/enables interrupts.
>
> IIUC, cmpxchg_relaxed is an atomic update without any memory ordering
> guarantees (in contrast to cmpxchg, cmpxchg_acquire, cmpxchg_release). We
> default to arch_cmpxchg if we don't have arch_cmpxchg_relaxed. arch_cmpxchg
> defaults to arch_cmpxchg_local, if not supported.
> 
> 
> Naturally I wonder:
> 
> (a) Should these new variants be rather called
>     this_cpu_cmpxchg_relaxed_* ?

No: it happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.

See cf10b79a7d88edc689479af989b3a88e9adf07ff.

> (b) Should these new variants rather call the "_local" variant?

They probably should. But this patchset maintains the current behaviour
of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:

 #define this_cpu_cmpxchg_1(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)  \
+       _pcp_protect_return(cmpxchg, pcp, o, n)


Thanks.
  
Peter Xu March 2, 2023, 8:53 p.m. UTC | #4
On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> Goal is to have vmstat_shepherd to transfer from
> per-CPU counters to global counters remotely. For this, 
> an atomic this_cpu_cmpxchg is necessary.
> 
> Following the kernel convention for cmpxchg/cmpxchg_local,
> change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> and add this_cpu_cmpxchg_local_ helpers which are not atomic.

I can follow on the necessity of having the _local version, however two
questions below.

> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> ===================================================================
> --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
>  	_pcp_protect_return(xchg_relaxed, pcp, val)
>  
>  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg, pcp, o, n)

This makes this_cpu_cmpxchg_*() not only non-local, but also adds memory
barrier implications (especially for arm64), since cmpxchg() implies a
strong memory barrier, which the old this_cpu_cmpxchg*() did not, afaiu.

Maybe it's not a big deal if the audience of this helper is still limited
(e.g. we can add memory barriers if we don't want strict ordering
implication), but just to check with you on whether it's intended, and if
so whether it may be worth some comments.
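
To illustrate what "callers add explicit barriers" would look like if the
helper stayed relaxed, here is a user-space sketch with C11 atomics. The
names (sketch_payload, sketch_ready, sketch_publish, sketch_consume) are
hypothetical, it assumes a single publisher, and the C11 fences only
approximate the kernel's barrier primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int sketch_payload;		/* plain data handed to the reader */
static _Atomic int sketch_ready;	/* 0 = empty, 1 = published */

/* Single publisher (assumption): fill the slot if it is still empty. */
static bool sketch_publish(int value)
{
	int expected = 0;

	if (atomic_load_explicit(&sketch_ready, memory_order_relaxed) != 0)
		return false;	/* already published, leave payload alone */

	sketch_payload = value;
	/* Explicit barrier: order the payload store before the flag update. */
	atomic_thread_fence(memory_order_release);
	/* The cmpxchg itself carries no ordering. */
	return atomic_compare_exchange_strong_explicit(&sketch_ready,
						       &expected, 1,
						       memory_order_relaxed,
						       memory_order_relaxed);
}

/* Reader: the acquire load pairs with the release fence above. */
static bool sketch_consume(int *out)
{
	if (atomic_load_explicit(&sketch_ready, memory_order_acquire) != 1)
		return false;
	*out = sketch_payload;
	return true;
}
```

With a fully ordered cmpxchg the explicit fence would be redundant, which is
why it matters whether the helper is documented as implying a barrier.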

> +
> +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
>  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)

I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?

Nothing about your patch alone since it was the same before, but I'm
wondering whether this is a good time to switchover.

The other thing is would it be good to copy arch-list for each arch patch?
Maybe it'll help to extend the audience too.

Thanks,
  
Marcelo Tosatti March 2, 2023, 9:04 p.m. UTC | #5
On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this, 
> > an atomic this_cpu_cmpxchg is necessary.
> > 
> > Following the kernel convention for cmpxchg/cmpxchg_local,
> > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> 
> I can follow on the necessity of having the _local version, however two
> questions below.
> 
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> >  
> >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> 
> This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> for arm64) memory barrier implications since cmpxchg() has a strong memory
> barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.
> 
> Maybe it's not a big deal if the audience of this helper is still limited
> (e.g. we can add memory barriers if we don't want strict ordering
> implication), but just to check with you on whether it's intended, and if
> so whether it may be worth some comments.

It happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.

See cf10b79a7d88edc689479af989b3a88e9adf07ff.

This patchset maintains the current behaviour
of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:

 #define this_cpu_cmpxchg_1(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)  \
+       _pcp_protect_return(cmpxchg, pcp, o, n)

> > +
> > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> 
> I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?

Since cmpxchg_local = cmpxchg_relaxed, seems like this is not necessary.

> Nothing about your patch alone since it was the same before, but I'm
> wondering whether this is a good time to switchover.

I would say that another patch is more appropriate to change this, 
if desired.

> The other thing is would it be good to copy arch-list for each arch patch?
> Maybe it'll help to extend the audience too.

Yes, should have done that (or CC each individual maintainer). Will do
on next version.

Thanks.
  
Peter Xu March 2, 2023, 9:25 p.m. UTC | #6
On Thu, Mar 02, 2023 at 06:04:25PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> > On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > > Goal is to have vmstat_shepherd to transfer from
> > > per-CPU counters to global counters remotely. For this, 
> > > an atomic this_cpu_cmpxchg is necessary.
> > > 
> > > Following the kernel convention for cmpxchg/cmpxchg_local,
> > > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> > 
> > I can follow on the necessity of having the _local version, however two
> > questions below.
> > 
> > > 
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > 
> > > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > ===================================================================
> > > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> > >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> > >  
> > >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > 
> > This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> > for arm64) memory barrier implications since cmpxchg() has a strong memory
> > barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.
> > 
> > Maybe it's not a big deal if the audience of this helper is still limited
> > (e.g. we can add memory barriers if we don't want strict ordering
> > implication), but just to check with you on whether it's intended, and if
> > so whether it may be worth some comments.
> 
> It happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.
> 
> See cf10b79a7d88edc689479af989b3a88e9adf07ff.

This is more or less a comment in general, rather than for arm only.

Fundamentally starting from this patch it's redefining this_cpu_cmpxchg().
What I meant is whether we should define it properly then implement the
arch patches with what is defined.

We're adding non-local semantics into it, which is obvious to me.

We're (silently, in this patch for aarch64) adding memory barrier semantics
too; it is not obvious to me whether all archs should implement this
API the same way.

It will make a difference IMHO when the helpers are used elsewhere in the
code, because IIUC a proper definition of the memory barrier implications
will decide whether callers need explicit barriers when ordering is required.

> 
> This patchset maintains the current behaviour
> of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:
> 
>  #define this_cpu_cmpxchg_1(pcp, o, n)  \                                                                           
> -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_2(pcp, o, n)  \                                                                           
> -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_4(pcp, o, n)  \                                                                           
> -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_8(pcp, o, n)  \                                                                           
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
> 
> > > +
> > > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> > >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > 
> > I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> > we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?
> 
> Since cmpxchg_local = cmpxchg_relaxed, seems like this is not necessary.
> 
> > Nothing about your patch alone since it was the same before, but I'm
> > wondering whether this is a good time to switchover.
> 
> I would say that another patch is more appropriate to change this, 
> if desired.

Sure on this one.  Thanks,
  
Marcelo Tosatti March 3, 2023, 3:39 p.m. UTC | #7
On Thu, Mar 02, 2023 at 04:25:08PM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 06:04:25PM -0300, Marcelo Tosatti wrote:
> > On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> > > On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > > > Goal is to have vmstat_shepherd to transfer from
> > > > per-CPU counters to global counters remotely. For this, 
> > > > an atomic this_cpu_cmpxchg is necessary.
> > > > 
> > > > Following the kernel convention for cmpxchg/cmpxchg_local,
> > > > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > > > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> > > 
> > > I can follow on the necessity of having the _local version, however two
> > > questions below.
> > > 
> > > > 
> > > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > > 
> > > > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > > ===================================================================
> > > > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > > > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> > > >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> > > >  
> > > >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > 
> > > This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> > > for arm64) memory barrier implications since cmpxchg() has a strong memory
> > > barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.
> > > 
> > > Maybe it's not a big deal if the audience of this helper is still limited
> > > (e.g. we can add memory barriers if we don't want strict ordering
> > > implication), but just to check with you on whether it's intended, and if
> > > so whether it may be worth some comments.
> > 
> > It happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.
> > 
> > See cf10b79a7d88edc689479af989b3a88e9adf07ff.
> 
> This is more or less a comment in general, rather than for arm only.
> 
> Fundamentally starting from this patch it's redefining this_cpu_cmpxchg().
> What I meant is whether we should define it properly then implement the
> arch patches with what is defined.
> 
> We're adding non-local semantics into it, which is obvious to me.

Which match the cmpxchg() function semantics.

> We're (silently, in this patch for aarch64) adding memory barrier semantics
> too, this is not obvious to me on whether all archs should implement this
> api the same way.

Documentation/atomic_t.txt says that _relaxed means "no barriers".

So I'd assume:

cmpxchg_relaxed: no additional barriers
cmpxchg_local:   only guarantees atomicity wrt the local CPU.
cmpxchg:	 atomic in SMP context.

https://lore.kernel.org/linux-arm-kernel/20180505103550.s7xsnto7tgppkmle@gmail.com/#r

There seems to be a lack of clarity in documentation.

> It will make a difference IMHO when the helpers are used in any other code
> clips, because IIUC proper definition of memory barrier implications will
> decide whether the callers need explicit barriers when ordering is required.

Trying to limit the scope of changes to solve the problem at hand.

More specifically what this patch does is:

1) Add this_cpu_cmpxchg_local, uses arch cmpxchg_local implementation
to back it.
2) Add this_cpu_cmpxchg, uses arch cmpxchg implementation to back it.

Note that this now becomes consistent with the cmpxchg and cmpxchg_local
semantics.
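
To make the need for the SMP-atomic variant concrete, here is a user-space
sketch of the remote fold the series is after, using C11 atomics. All names
(sketch_cpu_diff, sketch_global, sketch_mod_state, sketch_fold_remote) are
hypothetical and this is not the vmstat code: the owner CPU updates its own
delta with a cmpxchg loop while another thread, standing in for
vmstat_shepherd, swaps the delta to zero and adds it to the global counter.

```c
#include <stdatomic.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count for the sketch */

/* Hypothetical stand-ins for the per-CPU deltas and the global counter. */
static _Atomic int sketch_cpu_diff[SKETCH_NR_CPUS];
static _Atomic long sketch_global;

/* Owner CPU: add delta to its own slot with a cmpxchg retry loop. */
static void sketch_mod_state(int cpu, int delta)
{
	int old = atomic_load(&sketch_cpu_diff[cpu]);

	/* On failure, old is refreshed with the current value and we retry. */
	while (!atomic_compare_exchange_weak(&sketch_cpu_diff[cpu], &old,
					     old + delta))
		;
}

/* Remote side: move whatever cpu has accumulated into the global counter. */
static int sketch_fold_remote(int cpu)
{
	int old = atomic_load(&sketch_cpu_diff[cpu]);

	/* Swap the delta to 0; stop early if there is nothing to fold. */
	while (old != 0 &&
	       !atomic_compare_exchange_weak(&sketch_cpu_diff[cpu], &old, 0))
		;
	if (old != 0)
		atomic_fetch_add(&sketch_global, old);
	return old;
}
```

If the owner side used a non-atomic read-modify-write instead, an update
racing with the remote swap could be lost or counted twice, which is the case
the atomic this_cpu_cmpxchg is meant to cover.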

> > This patchset maintains the current behaviour
> > of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:
> > 
> >  #define this_cpu_cmpxchg_1(pcp, o, n)  \                                                                           
> > -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_2(pcp, o, n)  \                                                                           
> > -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_4(pcp, o, n)  \                                                                           
> > -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_8(pcp, o, n)  \                                                                           
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> > 
> > > > +
> > > > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> > > >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > 
> > > I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> > > we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?
> > 
> > Since cmpxchg_local = cmpxchg_relaxed, seems like this is not necessary.
> > 
> > > Nothing about your patch alone since it was the same before, but I'm
> > > wondering whether this is a good time to switchover.
> > 
> > I would say that another patch is more appropriate to change this, 
> > if desired.
> 
> Sure on this one.  Thanks,
> 
> -- 
> Peter Xu
> 
>
  
Marcelo Tosatti March 3, 2023, 3:47 p.m. UTC | #8
On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this, 
> > an atomic this_cpu_cmpxchg is necessary.
> > 
> > Following the kernel convention for cmpxchg/cmpxchg_local,
> > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> 
> I can follow on the necessity of having the _local version, however two
> questions below.
> 
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> >  
> >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> 
> This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> for arm64) memory barrier implications since cmpxchg() has a strong memory
> barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.

A later patch changes users of this_cpu_cmpxchg to
this_cpu_cmpxchg_local, which maintains behaviour.

> Maybe it's not a big deal if the audience of this helper is still limited
> (e.g. we can add memory barriers if we don't want strict ordering
> implication), but just to check with you on whether it's intended, and if
> so whether it may be worth some comments.
> 
> > +
> > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> 
> I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?
> 
> Nothing about your patch alone since it was the same before, but I'm
> wondering whether this is a good time to switchover.
> 
> The other thing is would it be good to copy arch-list for each arch patch?
> Maybe it'll help to extend the audience too.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
>
  
Christoph Lameter March 15, 2023, 11:56 p.m. UTC | #9
On Thu, 9 Feb 2023, Marcelo Tosatti wrote:

> Goal is to have vmstat_shepherd to transfer from
> per-CPU counters to global counters remotely. For this,
> an atomic this_cpu_cmpxchg is necessary.

The definition for this_cpu_functionality is that it is *not* incurring
atomic overhead and it was introduced to *avoid* the overhead of atomic
operations.

This sabotages this_cpu functionality,
  
Marcelo Tosatti March 16, 2023, 10:54 a.m. UTC | #10
On Thu, Mar 16, 2023 at 12:56:20AM +0100, Christoph Lameter wrote:
> On Thu, 9 Feb 2023, Marcelo Tosatti wrote:
> 
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this,
> > an atomic this_cpu_cmpxchg is necessary.
> 
> The definition for this_cpu_functionality is that it is *not* incurring
> atomic overhead and it was introduced to *avoid* the overhead of atomic
> operations.
> 
> This sabotages this_cpu functionality,

Christoph,

Two points:

1) If you look at patch 6, users of this_cpu_cmpxchg are converted
to this_cpu_cmpxchg_local (except per-CPU vmstat counters).
It's up to the user of the interface, depending on its requirements,
to decide whether or not atomic operations are necessary
(atomic with reference to other processors).

this_cpu_cmpxchg still has the benefit of using segment registers:

:Author: Christoph Lameter, August 4th, 2014
:Author: Pranith Kumar, Aug 2nd, 2014

this_cpu operations are a way of optimizing access to per cpu
variables associated with the *currently* executing processor. This is
done through the use of segment registers (or a dedicated register where
the cpu permanently stored the beginning of the per cpu area for a
specific processor).

this_cpu operations add a per cpu variable offset to the processor
specific per cpu base and encode that operation in the instruction
operating on the per cpu variable.

This means that there are no atomicity issues between the calculation of
the offset and the operation on the data. Therefore it is not
necessary to disable preemption or interrupts to ensure that the
processor is not changed between the calculation of the address and
the operation on the data.
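
As a rough userspace illustration of point 1 (not the kernel code), the
sketch below uses C11 atomics to show why a remote folder needs a
compare-exchange that is atomic with respect to other processors. All
names here (NR_CPUS, pcp_diff, global_count, local_add, fold_remote) are
hypothetical stand-ins chosen for the example; the real series operates
on the per-CPU vmstat diffs through this_cpu_cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Hypothetical stand-ins for the per-CPU deltas and the global counter. */
static _Atomic int8_t pcp_diff[NR_CPUS];
static atomic_long global_count;

/*
 * Update done by the owning CPU. Because a remote folder may change the
 * slot concurrently, the update is a compare-exchange retry loop rather
 * than a plain read-modify-write.
 */
static void local_add(int cpu, int8_t delta)
{
	int8_t old;

	assert(cpu >= 0 && cpu < NR_CPUS);
	old = atomic_load(&pcp_diff[cpu]);
	/* On failure, 'old' is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak(&pcp_diff[cpu], &old,
					     (int8_t)(old + delta)))
		;
}

/*
 * Fold done from another CPU: atomically claim whatever delta is
 * present (replacing it with 0) and add it to the global counter.
 * Returns the amount that was folded.
 */
static long fold_remote(int cpu)
{
	int8_t old;

	assert(cpu >= 0 && cpu < NR_CPUS);
	old = atomic_load(&pcp_diff[cpu]);
	while (old != 0 &&
	       !atomic_compare_exchange_weak(&pcp_diff[cpu], &old, 0))
		;
	if (old != 0)
		atomic_fetch_add(&global_count, old);
	return old;
}
```

Calling local_add(1, 5) and local_add(1, -2) and then fold_remote(1)
moves a delta of 3 into global_count and leaves the slot at 0. If either
side used a non-atomic read-modify-write instead, an update landing
between the folder's read and its write could be lost, which is the case
the atomic variant is meant to cover.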

2) The performance results seem to indicate that cache locking is
effective on modern processors (in this particular case and in others
as well):

4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

    As preparation for dealing with both of those problems, protect the
    lists with a spinlock.  The IRQ-unsafe version of the lock is used
    because IRQs are already disabled by local_lock_irqsave.  spin_trylock
    is used in combination with local_lock_irqsave() but later will be
    replaced with a spin_trylock_irqsave when the local_lock is removed.

    The per_cpu_pages still fits within the same number of cache lines after
    this patch relative to before the series.

    struct per_cpu_pages {
            spinlock_t                 lock;                 /*     0     4 */
            int                        count;                /*     4     4 */
            int                        high;                 /*     8     4 */
            int                        batch;                /*    12     4 */
            short int                  free_factor;          /*    16     2 */
            short int                  expire;               /*    18     2 */

            /* XXX 4 bytes hole, try to pack */

            struct list_head           lists[13];            /*    24   208 */

            /* size: 256, cachelines: 4, members: 7 */
            /* sum members: 228, holes: 1, sum holes: 4 */
            /* padding: 24 */
    } __attribute__((__aligned__(64)));

    There is overhead in the fast path due to acquiring the spinlock even
    though the spinlock is per-cpu and uncontended in the common case.  Page
    Fault Test (PFT) reported the following results on a 1-socket machine.

                                         5.19.0-rc3               5.19.0-rc3
                                            vanilla      mm-pcpspinirq-v5r16
    Hmean     faults/sec-1   869275.7381 (   0.00%)   874597.5167 *   0.61%*
    Hmean     faults/sec-3  2370266.6681 (   0.00%)  2379802.0362 *   0.40%*
    Hmean     faults/sec-5  2701099.7019 (   0.00%)  2664889.7003 *  -1.34%*
    Hmean     faults/sec-7  3517170.9157 (   0.00%)  3491122.8242 *  -0.74%*
    Hmean     faults/sec-8  3965729.6187 (   0.00%)  3939727.0243 *  -0.66%*

And for this case:

To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla                 Patched         Diff
115.25                  117             1.4%

(To be honest, the results are in the noise as well: during the tests,
the "LOCK cmpxchg" version showed no significant difference from the
"cmpxchg" version for the page allocator benchmark.)
  

Patch

Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
+++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
@@ -232,13 +232,23 @@  PERCPU_RET_OP(add, add, ldadd)
 	_pcp_protect_return(xchg_relaxed, pcp, val)
 
 #define this_cpu_cmpxchg_1(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg, pcp, o, n)
+
+#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
 	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+
 
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);