[v2,18/18] x86/resctrl: Separate arch and fs resctrl locks

Message ID 20230113175459.14825-19-james.morse@arm.com
State New
Series x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Commit Message

James Morse Jan. 13, 2023, 5:54 p.m. UTC
  resctrl has one mutex that is taken by the architecture specific code,
and the filesystem parts. The two interact via cpuhp, where the
architecture code updates the domain list. Filesystem handlers that
walk the domains list should not run concurrently with the cpuhp
callback modifying the list.

Exposing a lock from the filesystem code means the interface is not
cleanly defined, and creates the possibility of cross-architecture
lock ordering headaches. The interaction only exists so that certain
filesystem paths are serialised against cpu hotplug. The cpu hotplug
code already has a mechanism to do this using cpus_read_lock().

MPAM's monitors have an overflow interrupt, so it needs to be possible
to walk the domains list in irq context. RCU is ideal for this,
but some paths need to be able to sleep to allocate memory.

Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
of a cpuhp callback, cpus_read_lock() must always be taken first.
rdtgroup_schemata_write() already does this.

All but one of the filesystem code's domain list walkers are
currently protected by the rdtgroup_mutex taken in
rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
which takes the lock directly.

Make the domain list protected by RCU. An architecture-specific
lock prevents concurrent writers. rdt_bit_usage_show() can
walk the domain list under rcu_read_lock().
The other filesystem list walkers need to be able to sleep.
Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
cpuhp callbacks can't be invoked when file system operations are
occurring.
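
Taken together, these rules amount to the following (an illustrative, kernel-style sketch condensed from the hunks in this patch; it is not compilable on its own):

```c
/*
 * Writer side: a cpuhp callback. The cpuhp core already excludes
 * cpus_read_lock() holders; domain_list_lock orders concurrent writers.
 */
mutex_lock(&domain_list_lock);
list_add_tail_rcu(&d->list, add_pos);		/* publish to RCU readers */
/* ... or, on the last CPU of a domain going offline: */
list_del_rcu(&d->list);
synchronize_rcu();				/* let readers finish before kfree(d) */
mutex_unlock(&domain_list_lock);

/* Reader that cannot sleep (e.g. MPAM's overflow interrupt): */
rcu_read_lock();
list_for_each_entry_rcu(d, &r->domains, list)
	/* ... read-only walk ... */;
rcu_read_unlock();

/* Reader that needs to sleep (most filesystem paths): */
cpus_read_lock();				/* blocks the cpuhp callbacks above */
list_for_each_entry(d, &r->domains, list)
	/* ... may allocate memory ... */;
cpus_read_unlock();
```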

Add lockdep_assert_cpus_held() in the cases where the
rdtgroup_kn_lock_live() call isn't obvious.

Resctrl's domain online/offline calls now need to take the
rdtgroup_mutex themselves.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/x86/kernel/cpu/resctrl/core.c        | 33 ++++++++------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 14 ++++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  3 ++
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  3 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 54 ++++++++++++++++++++---
 include/linux/resctrl.h                   |  2 +-
 6 files changed, 84 insertions(+), 25 deletions(-)
  

Comments

Reinette Chatre Feb. 2, 2023, 11:50 p.m. UTC | #1
Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> resctrl has one mutex that is taken by the architecture specific code,
> and the filesystem parts. The two interact via cpuhp, where the
> architecture code updates the domain list. Filesystem handlers that
> walk the domains list should not run concurrently with the cpuhp
> callback modifying the list.
> 
> Exposing a lock from the filesystem code means the interface is not
> cleanly defined, and creates the possibility of cross-architecture
> lock ordering headaches. The interaction only exists so that certain
> filesystem paths are serialised against cpu hotplug. The cpu hotplug
> code already has a mechanism to do this using cpus_read_lock().
> 
> MPAM's monitors have an overflow interrupt, so it needs to be possible
> to walk the domains list in irq context. RCU is ideal for this,
> but some paths need to be able to sleep to allocate memory.
> 
> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
> of a cpuhp callback, cpus_read_lock() must always be taken first.
> rdtgroup_schemata_write() already does this.
> 
> All but one of the filesystem code's domain list walkers are
> currently protected by the rdtgroup_mutex taken in
> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
> which takes the lock directly.

The new BMEC code also takes the lock directly. You can find it on tip's x86/cache branch,
see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().

> 
> Make the domain list protected by RCU. An architecture-specific
> lock prevents concurrent writers. rdt_bit_usage_show() can
> walk the domain list under rcu_read_lock().
> The other filesystem list walkers need to be able to sleep.
> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
> cpuhp callbacks can't be invoked when file system operations are
> occurring.
> 
> Add lockdep_assert_cpus_held() in the cases where the
> rdtgroup_kn_lock_live() call isn't obvious.
> 
> Resctrl's domain online/offline calls now need to take the
> rdtgroup_mutex themselves.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  arch/x86/kernel/cpu/resctrl/core.c        | 33 ++++++++------
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 14 ++++--
>  arch/x86/kernel/cpu/resctrl/monitor.c     |  3 ++
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  3 ++
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 54 ++++++++++++++++++++---
>  include/linux/resctrl.h                   |  2 +-
>  6 files changed, 84 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 7896fcf11df6..dc1ba580c4db 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -25,8 +25,14 @@
>  #include <asm/resctrl.h>
>  #include "internal.h"
>  
> -/* Mutex to protect rdtgroup access. */
> -DEFINE_MUTEX(rdtgroup_mutex);
> +/*
> + * rdt_domain structures are kfree()d when their last cpu goes offline,
> + * and allocated when the first cpu in a new domain comes online.
> + * The rdt_resource's domain list is updated when this happens. The domain
> + * list is protected by RCU, but callers can also take the cpus_read_lock()
> + * to prevent modification if they need to sleep. All writers take this mutex:

Using "callers can" is not specific (compare to "callers should"). Please provide
clear guidance on how the locks should be used. A reader may wonder "why take cpus_read_lock()
to prevent modification, why not just take the mutex to prevent modification?"

> + */
> +static DEFINE_MUTEX(domain_list_lock);
>  
>  /*
>   * The cached resctrl_pqr_state is strictly per CPU and can never be
> @@ -483,6 +489,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  	struct rdt_domain *d;
>  	int err;
>  
> +	lockdep_assert_held(&domain_list_lock);
> +
>  	d = rdt_find_domain(r, id, &add_pos);
>  	if (IS_ERR(d)) {
>  		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -516,11 +524,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>  		return;
>  	}
>  
> -	list_add_tail(&d->list, add_pos);
> +	list_add_tail_rcu(&d->list, add_pos);
>  
>  	err = resctrl_online_domain(r, d);
>  	if (err) {
> -		list_del(&d->list);
> +		list_del_rcu(&d->list);
> +		synchronize_rcu();
>  		domain_free(hw_dom);
>  	}
>  }
> @@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  	cpumask_clear_cpu(cpu, &d->cpu_mask);
>  	if (cpumask_empty(&d->cpu_mask)) {
>  		resctrl_offline_domain(r, d);
> -		list_del(&d->list);
> +		list_del_rcu(&d->list);
> +		synchronize_rcu();
>  
>  		/*
>  		 * rdt_domain "d" is going to be freed below, so clear

Should domain_remove_cpu() also get a "lockdep_assert_held(&domain_list_lock)"?

> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>  static int resctrl_arch_online_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
> -	int err;
>  
> -	mutex_lock(&rdtgroup_mutex);
> +	mutex_lock(&domain_list_lock);
>  	for_each_capable_rdt_resource(r)
>  		domain_add_cpu(cpu, r);
>  	clear_closid_rmid(cpu);
> +	mutex_unlock(&domain_list_lock);

Why is clear_closid_rmid(cpu) protected by mutex?

>  
> -	err = resctrl_online_cpu(cpu);
> -	mutex_unlock(&rdtgroup_mutex);
> -
> -	return err;
> +	return resctrl_online_cpu(cpu);
>  }
>  
>  static int resctrl_arch_offline_cpu(unsigned int cpu)
>  {
>  	struct rdt_resource *r;
>  
> -	mutex_lock(&rdtgroup_mutex);
>  	resctrl_offline_cpu(cpu);
>  
> +	mutex_lock(&domain_list_lock);
>  	for_each_capable_rdt_resource(r)
>  		domain_remove_cpu(cpu, r);
>  	clear_closid_rmid(cpu);
> -	mutex_unlock(&rdtgroup_mutex);
> +	mutex_unlock(&domain_list_lock);

Same

>  
>  	return 0;
>  }

Reinette
  
James Morse March 6, 2023, 11:34 a.m. UTC | #2
Hi Reinette,

On 02/02/2023 23:50, Reinette Chatre wrote:
> On 1/13/2023 9:54 AM, James Morse wrote:
>> resctrl has one mutex that is taken by the architecture specific code,
>> and the filesystem parts. The two interact via cpuhp, where the
>> architecture code updates the domain list. Filesystem handlers that
>> walk the domains list should not run concurrently with the cpuhp
>> callback modifying the list.
>>
>> Exposing a lock from the filesystem code means the interface is not
>> cleanly defined, and creates the possibility of cross-architecture
>> lock ordering headaches. The interaction only exists so that certain
>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>> code already has a mechanism to do this using cpus_read_lock().
>>
>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>> to walk the domains list in irq context. RCU is ideal for this,
>> but some paths need to be able to sleep to allocate memory.
>>
>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>> rdtgroup_schemata_write() already does this.
>>
>> All but one of the filesystem code's domain list walkers are
>> currently protected by the rdtgroup_mutex taken in
>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>> which takes the lock directly.
> 
> The new BMEC code also takes the lock directly. You can find it on tip's x86/cache branch,
> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
> 
>>
>> Make the domain list protected by RCU. An architecture-specific
>> lock prevents concurrent writers. rdt_bit_usage_show() can
>> walk the domain list under rcu_read_lock().
>> The other filesystem list walkers need to be able to sleep.
>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>> cpuhp callbacks can't be invoked when file system operations are
>> occurring.
>>
>> Add lockdep_assert_cpus_held() in the cases where the
>> rdtgroup_kn_lock_live() call isn't obvious.
>>
>> Resctrl's domain online/offline calls now need to take the
>> rdtgroup_mutex themselves.


>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 7896fcf11df6..dc1ba580c4db 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -25,8 +25,14 @@
>>  #include <asm/resctrl.h>
>>  #include "internal.h"
>>  
>> -/* Mutex to protect rdtgroup access. */
>> -DEFINE_MUTEX(rdtgroup_mutex);
>> +/*
>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>> + * and allocated when the first cpu in a new domain comes online.
>> + * The rdt_resource's domain list is updated when this happens. The domain
>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>> + * to prevent modification if they need to sleep. All writers take this mutex:
> 
> Using "callers can" is not specific (compare to "callers should"). Please provide
> clear guidance on how the locks should be used. A reader may wonder "why take cpus_read_lock()
> to prevent modification, why not just take the mutex to prevent modification?"

'if they need to sleep' is the answer to this. I think a certain amount of background
knowledge can be assumed. My aim here wasn't to write an essay, but to indicate that not
all readers do the same thing. This is already the case in resctrl, and the MPAM PMU
stuff makes that worse.

Is this more robust:
| * rdt_domain structures are kfree()d when their last cpu goes offline,
| * and allocated when the first cpu in a new domain comes online.
| * The rdt_resource's domain list is updated when this happens. Readers of
| * the domain list must either take cpus_read_lock(), or rely on an RCU
| * read-side critical section, to avoid observing concurrent modification.
| * For information about RCU, see Docuemtation/RCU/rcu.rst.
| * All writers take this mutex:

?


>> @@ -541,7 +550,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>>  	cpumask_clear_cpu(cpu, &d->cpu_mask);
>>  	if (cpumask_empty(&d->cpu_mask)) {
>>  		resctrl_offline_domain(r, d);
>> -		list_del(&d->list);
>> +		list_del_rcu(&d->list);
>> +		synchronize_rcu();
>>  
>>  		/*
>>  		 * rdt_domain "d" is going to be freed below, so clear

> Should domain_remove_cpu() also get a "lockdep_assert_held(&domain_list_lock)"?

Yes, not sure why I didn't do that!


>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>  static int resctrl_arch_online_cpu(unsigned int cpu)
>>  {
>>  	struct rdt_resource *r;
>> -	int err;
>>  
>> -	mutex_lock(&rdtgroup_mutex);
>> +	mutex_lock(&domain_list_lock);
>>  	for_each_capable_rdt_resource(r)
>>  		domain_add_cpu(cpu, r);
>>  	clear_closid_rmid(cpu);
>> +	mutex_unlock(&domain_list_lock);

> Why is clear_closid_rmid(cpu) protected by mutex?

It doesn't need to be, it's just an artefact of changing the lock, then moving the
filesystem calls out. (It doesn't need to be protected by rdtgroup_mutex today.)

If you don't think it's churn, I'll move it to make it clearer.


Thanks,

James
  
Reinette Chatre March 11, 2023, 12:22 a.m. UTC | #3
Hi James,

On 3/6/2023 3:34 AM, James Morse wrote:
> Hi Reinette,
> 
> On 02/02/2023 23:50, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:
>>> resctrl has one mutex that is taken by the architecture specific code,
>>> and the filesystem parts. The two interact via cpuhp, where the
>>> architecture code updates the domain list. Filesystem handlers that
>>> walk the domains list should not run concurrently with the cpuhp
>>> callback modifying the list.
>>>
>>> Exposing a lock from the filesystem code means the interface is not
>>> cleanly defined, and creates the possibility of cross-architecture
>>> lock ordering headaches. The interaction only exists so that certain
>>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>>> code already has a mechanism to do this using cpus_read_lock().
>>>
>>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>>> to walk the domains list in irq context. RCU is ideal for this,
>>> but some paths need to be able to sleep to allocate memory.
>>>
>>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>>> rdtgroup_schemata_write() already does this.
>>>
>>> All but one of the filesystem code's domain list walkers are
>>> currently protected by the rdtgroup_mutex taken in
>>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>>> which takes the lock directly.
>>
>> The new BMEC code also takes the lock directly. You can find it on tip's x86/cache branch,
>> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
>>
>>>
>>> Make the domain list protected by RCU. An architecture-specific
>>> lock prevents concurrent writers. rdt_bit_usage_show() can
>>> walk the domain list under rcu_read_lock().
>>> The other filesystem list walkers need to be able to sleep.
>>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>>> cpuhp callbacks can't be invoked when file system operations are
>>> occurring.
>>>
>>> Add lockdep_assert_cpus_held() in the cases where the
>>> rdtgroup_kn_lock_live() call isn't obvious.
>>>
>>> Resctrl's domain online/offline calls now need to take the
>>> rdtgroup_mutex themselves.
> 
> 
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index 7896fcf11df6..dc1ba580c4db 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -25,8 +25,14 @@
>>>  #include <asm/resctrl.h>
>>>  #include "internal.h"
>>>  
>>> -/* Mutex to protect rdtgroup access. */
>>> -DEFINE_MUTEX(rdtgroup_mutex);
>>> +/*
>>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>>> + * and allocated when the first cpu in a new domain comes online.
>>> + * The rdt_resource's domain list is updated when this happens. The domain
>>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>>> + * to prevent modification if they need to sleep. All writers take this mutex:
>>
>> Using "callers can" is not specific (compare to "callers should"). Please provide
>> clear guidance on how the locks should be used. A reader may wonder "why take cpus_read_lock()
>> to prevent modification, why not just take the mutex to prevent modification?"
> 
> 'if they need to sleep' is the answer to this. I think a certain amount of background
> knowledge can be assumed. My aim here wasn't to write an essay, but to indicate that not
> all readers do the same thing. This is already the case in resctrl, and the MPAM PMU
> stuff makes that worse.
> 
> Is this more robust:
> | * rdt_domain structures are kfree()d when their last cpu goes offline,
> | * and allocated when the first cpu in a new domain comes online.
> | * The rdt_resource's domain list is updated when this happens. Readers of
> | * the domain list must either take cpus_read_lock(), or rely on an RCU
> | * read-side critical section, to avoid observing concurrent modification.
> | * For information about RCU, see Docuemtation/RCU/rcu.rst.
> | * All writers take this mutex:
> 
> ?

Yes, I do think this is more robust. Since you do mention, "'if they need to sleep'
is the answer to this", how about "... must take cpus_read_lock() if they need to
sleep, or otherwise rely on an RCU read-side critical section, ..."? I do not
think it is necessary to provide a link to the documentation. If you do prefer
to keep it, please note the typo.

Also, please cpu -> CPU. 

>>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>>  static int resctrl_arch_online_cpu(unsigned int cpu)
>>>  {
>>>  	struct rdt_resource *r;
>>> -	int err;
>>>  
>>> -	mutex_lock(&rdtgroup_mutex);
>>> +	mutex_lock(&domain_list_lock);
>>>  	for_each_capable_rdt_resource(r)
>>>  		domain_add_cpu(cpu, r);
>>>  	clear_closid_rmid(cpu);
>>> +	mutex_unlock(&domain_list_lock);
> 
>> Why is clear_closid_rmid(cpu) protected by mutex?
> 
> It doesn't need to be, it's just an artefact of changing the lock, then moving the
> filesystem calls out. (It doesn't need to be protected by rdtgroup_mutex today.)
> 
> If you don't think it's churn, I'll move it to make it clearer.
> 

I do not see a problem with keeping the lock/unlock as before but
if you do find that you can make the locking clearer then
please do.

Reinette
  
James Morse March 20, 2023, 5:12 p.m. UTC | #4
Hi Reinette

On 11/03/2023 00:22, Reinette Chatre wrote:
> On 3/6/2023 3:34 AM, James Morse wrote:
>> On 02/02/2023 23:50, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>> resctrl has one mutex that is taken by the architecture specific code,
>>>> and the filesystem parts. The two interact via cpuhp, where the
>>>> architecture code updates the domain list. Filesystem handlers that
>>>> walk the domains list should not run concurrently with the cpuhp
>>>> callback modifying the list.
>>>>
>>>> Exposing a lock from the filesystem code means the interface is not
>>>> cleanly defined, and creates the possibility of cross-architecture
>>>> lock ordering headaches. The interaction only exists so that certain
>>>> filesystem paths are serialised against cpu hotplug. The cpu hotplug
>>>> code already has a mechanism to do this using cpus_read_lock().
>>>>
>>>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>>>> to walk the domains list in irq context. RCU is ideal for this,
>>>> but some paths need to be able to sleep to allocate memory.
>>>>
>>>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>>>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>>>> rdtgroup_schemata_write() already does this.
>>>>
>>>> All but one of the filesystem code's domain list walkers are
>>>> currently protected by the rdtgroup_mutex taken in
>>>> rdtgroup_kn_lock_live(). The exception is rdt_bit_usage_show()
>>>> which takes the lock directly.
>>>
>>> The new BMEC code also takes the lock directly. You can find it on tip's x86/cache branch,
>>> see mbm_total_bytes_config_write() and mbm_local_bytes_config_write().
>>>
>>>>
>>>> Make the domain list protected by RCU. An architecture-specific
>>>> lock prevents concurrent writers. rdt_bit_usage_show() can
>>>> walk the domain list under rcu_read_lock().
>>>> The other filesystem list walkers need to be able to sleep.
>>>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>>>> cpuhp callbacks can't be invoked when file system operations are
>>>> occurring.
>>>>
>>>> Add lockdep_assert_cpus_held() in the cases where the
>>>> rdtgroup_kn_lock_live() call isn't obvious.
>>>>
>>>> Resctrl's domain online/offline calls now need to take the
>>>> rdtgroup_mutex themselves.

>>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>>> index 7896fcf11df6..dc1ba580c4db 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>>> @@ -25,8 +25,14 @@
>>>>  #include <asm/resctrl.h>
>>>>  #include "internal.h"
>>>>  
>>>> -/* Mutex to protect rdtgroup access. */
>>>> -DEFINE_MUTEX(rdtgroup_mutex);
>>>> +/*
>>>> + * rdt_domain structures are kfree()d when their last cpu goes offline,
>>>> + * and allocated when the first cpu in a new domain comes online.
>>>> + * The rdt_resource's domain list is updated when this happens. The domain
>>>> + * list is protected by RCU, but callers can also take the cpus_read_lock()
>>>> + * to prevent modification if they need to sleep. All writers take this mutex:
>>>
>>> Using "callers can" is not specific (compare to "callers should"). Please provide
>>> clear guidance on how the locks should be used. A reader may wonder "why take cpus_read_lock()
>>> to prevent modification, why not just take the mutex to prevent modification?"
>>
>> 'if they need to sleep' is the answer to this. I think a certain amount of background
>> knowledge can be assumed. My aim here wasn't to write an essay, but to indicate that not
>> all readers do the same thing. This is already the case in resctrl, and the MPAM PMU
>> stuff makes that worse.
>>
>> Is this more robust:
>> | * rdt_domain structures are kfree()d when their last cpu goes offline,
>> | * and allocated when the first cpu in a new domain comes online.
>> | * The rdt_resource's domain list is updated when this happens. Readers of
>> | * the domain list must either take cpus_read_lock(), or rely on an RCU
>> | * read-side critical section, to avoid observing concurrent modification.
>> | * For information about RCU, see Docuemtation/RCU/rcu.rst.
>> | * All writers take this mutex:
>>
>> ?
> 
> Yes, I do think this is more robust. Since you do mention, "'if they need to sleep'
> is the answer to this", how about "... must take cpus_read_lock() if they need to
> sleep, or otherwise rely on an RCU read-side critical section, ..."?

Yes, I've changed this to
| * The rdt_resource's domain list is updated when this happens. Readers of
| * the domain list must either take cpus_read_lock() if they need to sleep,
| * or rely on an RCU read-side critical section, to avoid observing concurrent
| * modification.


> I do not
> think it is necessary to provide a link to the documentation. If you do prefer
> to keep it, please note the typo.

I'll drop that then.

> Also, please cpu -> CPU. 

Fixed.


>>>> @@ -569,30 +579,27 @@ static void clear_closid_rmid(int cpu)
>>>>  static int resctrl_arch_online_cpu(unsigned int cpu)
>>>>  {
>>>>  	struct rdt_resource *r;
>>>> -	int err;
>>>>  
>>>> -	mutex_lock(&rdtgroup_mutex);
>>>> +	mutex_lock(&domain_list_lock);
>>>>  	for_each_capable_rdt_resource(r)
>>>>  		domain_add_cpu(cpu, r);
>>>>  	clear_closid_rmid(cpu);
>>>> +	mutex_unlock(&domain_list_lock);
>>
>>> Why is clear_closid_rmid(cpu) protected by mutex?
>>
>> It doesn't need to be, it's just an artefact of changing the lock, then moving the
>> filesystem calls out. (It doesn't need to be protected by rdtgroup_mutex today.)
>>
>> If you don't think it's churn, I'll move it to make it clearer.

> I do not see a problem with keeping the lock/unlock as before but
> if you do find that you can make the locking clearer then
> please do.

Done,


Thanks,

James
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 7896fcf11df6..dc1ba580c4db 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -25,8 +25,14 @@ 
 #include <asm/resctrl.h>
 #include "internal.h"
 
-/* Mutex to protect rdtgroup access. */
-DEFINE_MUTEX(rdtgroup_mutex);
+/*
+ * rdt_domain structures are kfree()d when their last cpu goes offline,
+ * and allocated when the first cpu in a new domain comes online.
+ * The rdt_resource's domain list is updated when this happens. The domain
+ * list is protected by RCU, but callers can also take the cpus_read_lock()
+ * to prevent modification if they need to sleep. All writers take this mutex:
+ */
+static DEFINE_MUTEX(domain_list_lock);
 
 /*
  * The cached resctrl_pqr_state is strictly per CPU and can never be
@@ -483,6 +489,8 @@  static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	struct rdt_domain *d;
 	int err;
 
+	lockdep_assert_held(&domain_list_lock);
+
 	d = rdt_find_domain(r, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -516,11 +524,12 @@  static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	list_add_tail(&d->list, add_pos);
+	list_add_tail_rcu(&d->list, add_pos);
 
 	err = resctrl_online_domain(r, d);
 	if (err) {
-		list_del(&d->list);
+		list_del_rcu(&d->list);
+		synchronize_rcu();
 		domain_free(hw_dom);
 	}
 }
@@ -541,7 +550,8 @@  static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
 		resctrl_offline_domain(r, d);
-		list_del(&d->list);
+		list_del_rcu(&d->list);
+		synchronize_rcu();
 
 		/*
 		 * rdt_domain "d" is going to be freed below, so clear
@@ -569,30 +579,27 @@  static void clear_closid_rmid(int cpu)
 static int resctrl_arch_online_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
-	int err;
 
-	mutex_lock(&rdtgroup_mutex);
+	mutex_lock(&domain_list_lock);
 	for_each_capable_rdt_resource(r)
 		domain_add_cpu(cpu, r);
 	clear_closid_rmid(cpu);
+	mutex_unlock(&domain_list_lock);
 
-	err = resctrl_online_cpu(cpu);
-	mutex_unlock(&rdtgroup_mutex);
-
-	return err;
+	return resctrl_online_cpu(cpu);
 }
 
 static int resctrl_arch_offline_cpu(unsigned int cpu)
 {
 	struct rdt_resource *r;
 
-	mutex_lock(&rdtgroup_mutex);
 	resctrl_offline_cpu(cpu);
 
+	mutex_lock(&domain_list_lock);
 	for_each_capable_rdt_resource(r)
 		domain_remove_cpu(cpu, r);
 	clear_closid_rmid(cpu);
-	mutex_unlock(&rdtgroup_mutex);
+	mutex_unlock(&domain_list_lock);
 
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 4ee3da6dced7..6eb74304860f 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -208,6 +208,9 @@  static int parse_line(char *line, struct resctrl_schema *s,
 	struct rdt_domain *d;
 	unsigned long dom_id;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
 	    r->rid == RDT_RESOURCE_MBA) {
 		rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
@@ -313,6 +316,9 @@  int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 	int cpu;
 	u32 idx;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
 		return -ENOMEM;
 
@@ -383,11 +389,9 @@  ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 		return -EINVAL;
 	buf[nbytes - 1] = '\0';
 
-	cpus_read_lock();
 	rdtgrp = rdtgroup_kn_lock_live(of->kn);
 	if (!rdtgrp) {
 		rdtgroup_kn_unlock(of->kn);
-		cpus_read_unlock();
 		return -ENOENT;
 	}
 	rdt_last_cmd_clear();
@@ -451,7 +455,6 @@  ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 
 out:
 	rdtgroup_kn_unlock(of->kn);
-	cpus_read_unlock();
 	return ret ?: nbytes;
 }
 
@@ -471,6 +474,9 @@  static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
 	bool sep = false;
 	u32 ctrl_val;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	seq_printf(s, "%*s:", max_name_width, schema->name);
 	list_for_each_entry(dom, &r->domains, list) {
 		if (sep)
@@ -533,7 +539,7 @@  void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    int evtid, int first)
 {
 	/* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
-	lockdep_assert_held(&rdtgroup_mutex);
+	lockdep_assert_cpus_held();
 
 	/*
 	 * setup the parameters to pass to mon_event_count() to read the data.
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 334fb3f1c6e2..a2d5cf9052e4 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -421,6 +421,9 @@  static void add_rmid_to_limbo(struct rmid_entry *entry)
 	u32 idx;
 	int err;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
 	arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 0b4fdb118643..f8864626d593 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -830,6 +830,9 @@  bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	struct rdt_domain *d_i;
 	bool ret = false;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
 		return true;
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa75c42e0faa..391ac3c56680 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -35,6 +35,10 @@ 
 DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
 DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+
+/* Mutex to protect rdtgroup access. */
+DEFINE_MUTEX(rdtgroup_mutex);
+
 static struct kernfs_root *rdt_root;
 struct rdtgroup rdtgroup_default;
 LIST_HEAD(rdt_all_groups);
@@ -929,7 +933,8 @@  static int rdt_bit_usage_show(struct kernfs_open_file *of,
 
 	mutex_lock(&rdtgroup_mutex);
 	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->domains, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(dom, &r->domains, list) {
 		if (sep)
 			seq_putc(seq, ';');
 		sw_shareable = 0;
@@ -985,8 +990,10 @@  static int rdt_bit_usage_show(struct kernfs_open_file *of,
 		}
 		sep = true;
 	}
+	rcu_read_unlock();
 	seq_putc(seq, '\n');
 	mutex_unlock(&rdtgroup_mutex);
+
 	return 0;
 }
 
@@ -1226,6 +1233,9 @@  static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
 	struct rdt_domain *d;
 	u32 ctrl;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	list_for_each_entry(s, &resctrl_schema_all, list) {
 		r = s->res;
 		if (r->rid == RDT_RESOURCE_MBA)
@@ -1859,6 +1869,9 @@  static int set_cache_qos_cfg(int level, bool enable)
 	struct rdt_domain *d;
 	int cpu;
 
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (level == RDT_RESOURCE_L3)
 		update = l3_qos_cfg_update;
 	else if (level == RDT_RESOURCE_L2)
@@ -2051,6 +2064,7 @@  struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
 	atomic_inc(&rdtgrp->waitcount);
 	kernfs_break_active_protection(kn);
 
+	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
 	/* Was this group deleted while we waited? */
@@ -2068,6 +2082,7 @@  void rdtgroup_kn_unlock(struct kernfs_node *kn)
 		return;
 
 	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
 
 	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
 	    (rdtgrp->flags & RDT_DELETED)) {
@@ -2364,6 +2379,9 @@  static int reset_all_ctrls(struct rdt_resource *r)
 	struct rdt_domain *d;
 	int i, cpu;
 
+	/* Walking r->domains; ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
 		return -ENOMEM;
 
@@ -2644,6 +2662,9 @@  static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 	struct rdt_domain *dom;
 	int ret;
 
+	/* Walking r->domains; ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
 	list_for_each_entry(dom, &r->domains, list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
@@ -3327,7 +3348,8 @@  static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+static void _resctrl_offline_domain(struct rdt_resource *r,
+				    struct rdt_domain *d)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
@@ -3362,6 +3384,13 @@  void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	domain_destroy_mon_state(d);
 }
 
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	mutex_lock(&rdtgroup_mutex);
+	_resctrl_offline_domain(r, d);
+	mutex_unlock(&rdtgroup_mutex);
+}
+
 static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 {
 	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
@@ -3393,7 +3422,7 @@  static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+static int _resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
 	int err;
 
@@ -3424,12 +3453,23 @@  int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+	int err;
+
+	mutex_lock(&rdtgroup_mutex);
+	err = _resctrl_online_domain(r, d);
+	mutex_unlock(&rdtgroup_mutex);
+
+	return err;
+}
+
 int resctrl_online_cpu(unsigned int cpu)
 {
-	lockdep_assert_held(&rdtgroup_mutex);
-
+	mutex_lock(&rdtgroup_mutex);
 	/* The cpu is set in default rdtgroup after online. */
 	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+	mutex_unlock(&rdtgroup_mutex);
 
 	return 0;
 }
@@ -3450,8 +3490,7 @@  void resctrl_offline_cpu(unsigned int cpu)
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 
-	lockdep_assert_held(&rdtgroup_mutex);
-
+	mutex_lock(&rdtgroup_mutex);
 	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
 		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
 			clear_childcpus(rdtgrp, cpu);
@@ -3471,6 +3510,7 @@  void resctrl_offline_cpu(unsigned int cpu)
 			cqm_setup_limbo_handler(d, 0, cpu);
 		}
 	}
+	mutex_unlock(&rdtgroup_mutex);
 }
 
 /*
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 0d21e7ba7c1d..4fedd45738a5 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -153,7 +153,7 @@  struct resctrl_schema;
  * @cache_level:	Which cache level defines scope of this resource
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		All domains for this resource
+ * @domains:		RCU list of all domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.