[v9,00/24] x86/resctrl: monitored closid+rmid together, separate arch/fs locking

Message ID 20240213184438.16675-1-james.morse@arm.com
Headers
Series x86/resctrl: monitored closid+rmid together, separate arch/fs locking |

Message

James Morse Feb. 13, 2024, 6:44 p.m. UTC
  Hello!

It's been back and forth for whether this series should be rebased onto Tony's
SNC series. This version isn't, its based on tip/x86/cache.
(I have the rebased-and-tested versions if anyone needs them)

This version just has minor cleanup from the previous one.
* An unusued parameter in unused code has gone,
* I've added a comment about the sizeing of the index array - it only matters on arm64.

Changes are also noted on each patch.

~

This series does two things, it changes resctrl to call resctrl_arch_rmid_read()
in a way that works for MPAM, and it separates the locking so that the arch code
and filesystem code don't have to share a mutex. I tried to split this as two
series, but these touch similar call sites, so it would create more work.

(What's MPAM? See the cover letter of the first series. [1])

On x86 the RMID is an independent number. MPAMs equivalent is PMG, but this
isn't an independent number - it extends the PARTID (same as CLOSID) space
with bits that aren't used to select the configuration. The monitors can
then be told to match specific PMG values, allowing monitor-groups to be
created.

But, MPAM expects the monitors to always monitor by PARTID. The
Cache-storage-utilisation counters can only work this way.
(In the MPAM spec not setting the MATCH_PARTID bit is made CONSTRAINED
UNPREDICTABLE - which is Arm's term to mean portable software can't rely on
this)

It gets worse, as some SoCs may have very few PMG bits. I've seen the
datasheet for one that has a single bit of PMG space.

To be usable, MPAM's counters always need the PARTID and the PMG.
For resctrl, this means always making the CLOSID available when the RMID
is used.

To ensure RMID are always unique, this series combines the CLOSID and RMID
into an index, and manages RMID based on that. For x86, the index and RMID
would always be the same.


Currently the architecture specific code in the cpuhp callbacks takes the
rdtgroup_mutex. This means the filesystem code would have to export this
lock, resulting in an ill-defined interface between the two, and the possibility
of cross-architecture lock-ordering head aches.

The second part of this series adds a domain_list_lock to protect writes to the
domain list, and protects the domain list with RCU - or cpus_read_lock().

Use of RCU is to allow lockless readers of the domain list. To get MPAMs monitors
working, its very likely they'll need to be plumbed up to perf. An uncore PMU
driver would need to be a lockless reader of the domain list.

This series is based on tip/x86/caches's commit fc747eebef73, and can be retrieved from:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v9

Bugs welcome,

Thanks,

James

[1] https://lore.kernel.org/lkml/20210728170637.25610-1-james.morse@arm.com/
[v1] https://lore.kernel.org/all/20221021131204.5581-1-james.morse@arm.com/
[v2] https://lore.kernel.org/lkml/20230113175459.14825-1-james.morse@arm.com/
[v3] https://lore.kernel.org/r/20230320172620.18254-1-james.morse@arm.com 
[v4] https://lore.kernel.org/r/20230525180209.19497-1-james.morse@arm.com
[v5] https://lore.kernel.org/lkml/20230728164254.27562-1-james.morse@arm.com/
[v6] https://lore.kernel.org/all/20230914172138.11977-1-james.morse@arm.com/
[v7] https://lore.kernel.org/r/20231025180345.28061-1-james.morse@arm.com/
[v8] https://lore.kernel.org/lkml/20231215174343.13872-1-james.morse@arm.com/

James Morse (24):
  tick/nohz: Move tick_nohz_full_mask declaration outside the #ifdef
  x86/resctrl: kfree() rmid_ptrs from resctrl_exit()
  x86/resctrl: Create helper for RMID allocation and mondata dir
    creation
  x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
  x86/resctrl: Track the closid with the rmid
  x86/resctrl: Access per-rmid structures by index
  x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  x86/resctrl: Track the number of dirty RMID a CLOSID has
  x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
  x86/resctrl: Allocate the cleanest CLOSID by searching
    closid_num_dirty_rmid
  x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
  x86/resctrl: Queue mon_event_read() instead of sending an IPI
  x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  x86/resctrl: Allow arch to allocate memory needed in
    resctrl_arch_rmid_read()
  x86/resctrl: Make resctrl_mounted checks explicit
  x86/resctrl: Move alloc/mon static keys into helpers
  x86/resctrl: Make rdt_enable_key the arch's decision to switch
  x86/resctrl: Add helpers for system wide mon/alloc capable
  x86/resctrl: Add CPU online callback for resctrl work
  x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
    cpu
  x86/resctrl: Add CPU offline callback for resctrl work
  x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
  x86/resctrl: Separate arch and fs resctrl locks

 arch/x86/include/asm/resctrl.h            |  90 +++++
 arch/x86/kernel/cpu/resctrl/core.c        | 102 ++---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  48 ++-
 arch/x86/kernel/cpu/resctrl/internal.h    |  67 +++-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 463 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  15 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 359 ++++++++++++-----
 include/linux/resctrl.h                   |  48 ++-
 include/linux/tick.h                      |   9 +-
 9 files changed, 921 insertions(+), 280 deletions(-)
  

Comments

Reinette Chatre Feb. 13, 2024, 11:26 p.m. UTC | #1
(+Tony)

Hi Boris,

Could you please consider this series for inclusion?

Thank you very much.

Reinette

On 2/13/2024 10:44 AM, James Morse wrote:
> Hello!
> 
> It's been back and forth for whether this series should be rebased onto Tony's
> SNC series. This version isn't, its based on tip/x86/cache.
> (I have the rebased-and-tested versions if anyone needs them)
> 
> This version just has minor cleanup from the previous one.
> * An unusued parameter in unused code has gone,
> * I've added a comment about the sizeing of the index array - it only matters on arm64.
> 
> Changes are also noted on each patch.
> 
> ~
> 
> This series does two things, it changes resctrl to call resctrl_arch_rmid_read()
> in a way that works for MPAM, and it separates the locking so that the arch code
> and filesystem code don't have to share a mutex. I tried to split this as two
> series, but these touch similar call sites, so it would create more work.
> 
> (What's MPAM? See the cover letter of the first series. [1])
> 
> On x86 the RMID is an independent number. MPAMs equivalent is PMG, but this
> isn't an independent number - it extends the PARTID (same as CLOSID) space
> with bits that aren't used to select the configuration. The monitors can
> then be told to match specific PMG values, allowing monitor-groups to be
> created.
> 
> But, MPAM expects the monitors to always monitor by PARTID. The
> Cache-storage-utilisation counters can only work this way.
> (In the MPAM spec not setting the MATCH_PARTID bit is made CONSTRAINED
> UNPREDICTABLE - which is Arm's term to mean portable software can't rely on
> this)
> 
> It gets worse, as some SoCs may have very few PMG bits. I've seen the
> datasheet for one that has a single bit of PMG space.
> 
> To be usable, MPAM's counters always need the PARTID and the PMG.
> For resctrl, this means always making the CLOSID available when the RMID
> is used.
> 
> To ensure RMID are always unique, this series combines the CLOSID and RMID
> into an index, and manages RMID based on that. For x86, the index and RMID
> would always be the same.
> 
> 
> Currently the architecture specific code in the cpuhp callbacks takes the
> rdtgroup_mutex. This means the filesystem code would have to export this
> lock, resulting in an ill-defined interface between the two, and the possibility
> of cross-architecture lock-ordering head aches.
> 
> The second part of this series adds a domain_list_lock to protect writes to the
> domain list, and protects the domain list with RCU - or cpus_read_lock().
> 
> Use of RCU is to allow lockless readers of the domain list. To get MPAMs monitors
> working, its very likely they'll need to be plumbed up to perf. An uncore PMU
> driver would need to be a lockless reader of the domain list.
> 
> This series is based on tip/x86/caches's commit fc747eebef73, and can be retrieved from:
> https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v9
> 
> Bugs welcome,
> 
> Thanks,
> 
> James
> 
> [1] https://lore.kernel.org/lkml/20210728170637.25610-1-james.morse@arm.com/
> [v1] https://lore.kernel.org/all/20221021131204.5581-1-james.morse@arm.com/
> [v2] https://lore.kernel.org/lkml/20230113175459.14825-1-james.morse@arm.com/
> [v3] https://lore.kernel.org/r/20230320172620.18254-1-james.morse@arm.com 
> [v4] https://lore.kernel.org/r/20230525180209.19497-1-james.morse@arm.com
> [v5] https://lore.kernel.org/lkml/20230728164254.27562-1-james.morse@arm.com/
> [v6] https://lore.kernel.org/all/20230914172138.11977-1-james.morse@arm.com/
> [v7] https://lore.kernel.org/r/20231025180345.28061-1-james.morse@arm.com/
> [v8] https://lore.kernel.org/lkml/20231215174343.13872-1-james.morse@arm.com/
> 
> James Morse (24):
>   tick/nohz: Move tick_nohz_full_mask declaration outside the #ifdef
>   x86/resctrl: kfree() rmid_ptrs from resctrl_exit()
>   x86/resctrl: Create helper for RMID allocation and mondata dir
>     creation
>   x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
>   x86/resctrl: Track the closid with the rmid
>   x86/resctrl: Access per-rmid structures by index
>   x86/resctrl: Allow RMID allocation to be scoped by CLOSID
>   x86/resctrl: Track the number of dirty RMID a CLOSID has
>   x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
>   x86/resctrl: Allocate the cleanest CLOSID by searching
>     closid_num_dirty_rmid
>   x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
>   x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
>   x86/resctrl: Queue mon_event_read() instead of sending an IPI
>   x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
>   x86/resctrl: Allow arch to allocate memory needed in
>     resctrl_arch_rmid_read()
>   x86/resctrl: Make resctrl_mounted checks explicit
>   x86/resctrl: Move alloc/mon static keys into helpers
>   x86/resctrl: Make rdt_enable_key the arch's decision to switch
>   x86/resctrl: Add helpers for system wide mon/alloc capable
>   x86/resctrl: Add CPU online callback for resctrl work
>   x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
>     cpu
>   x86/resctrl: Add CPU offline callback for resctrl work
>   x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
>   x86/resctrl: Separate arch and fs resctrl locks
> 
>  arch/x86/include/asm/resctrl.h            |  90 +++++
>  arch/x86/kernel/cpu/resctrl/core.c        | 102 ++---
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  48 ++-
>  arch/x86/kernel/cpu/resctrl/internal.h    |  67 +++-
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 463 +++++++++++++++++-----
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  15 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 359 ++++++++++++-----
>  include/linux/resctrl.h                   |  48 ++-
>  include/linux/tick.h                      |   9 +-
>  9 files changed, 921 insertions(+), 280 deletions(-)
>
  
Moger, Babu Feb. 14, 2024, 3:01 p.m. UTC | #2
Sanity tested the series again on AMD system. Everything looks good.

Tested-by: Babu Moger <babu.moger@amd.com>

On 2/13/24 12:44, James Morse wrote:
> Hello!
> 
> It's been back and forth for whether this series should be rebased onto Tony's
> SNC series. This version isn't, its based on tip/x86/cache.
> (I have the rebased-and-tested versions if anyone needs them)
> 
> This version just has minor cleanup from the previous one.
> * An unusued parameter in unused code has gone,
> * I've added a comment about the sizeing of the index array - it only matters on arm64.
> 
> Changes are also noted on each patch.
> 
> ~
> 
> This series does two things, it changes resctrl to call resctrl_arch_rmid_read()
> in a way that works for MPAM, and it separates the locking so that the arch code
> and filesystem code don't have to share a mutex. I tried to split this as two
> series, but these touch similar call sites, so it would create more work.
> 
> (What's MPAM? See the cover letter of the first series. [1])
> 
> On x86 the RMID is an independent number. MPAMs equivalent is PMG, but this
> isn't an independent number - it extends the PARTID (same as CLOSID) space
> with bits that aren't used to select the configuration. The monitors can
> then be told to match specific PMG values, allowing monitor-groups to be
> created.
> 
> But, MPAM expects the monitors to always monitor by PARTID. The
> Cache-storage-utilisation counters can only work this way.
> (In the MPAM spec not setting the MATCH_PARTID bit is made CONSTRAINED
> UNPREDICTABLE - which is Arm's term to mean portable software can't rely on
> this)
> 
> It gets worse, as some SoCs may have very few PMG bits. I've seen the
> datasheet for one that has a single bit of PMG space.
> 
> To be usable, MPAM's counters always need the PARTID and the PMG.
> For resctrl, this means always making the CLOSID available when the RMID
> is used.
> 
> To ensure RMID are always unique, this series combines the CLOSID and RMID
> into an index, and manages RMID based on that. For x86, the index and RMID
> would always be the same.
> 
> 
> Currently the architecture specific code in the cpuhp callbacks takes the
> rdtgroup_mutex. This means the filesystem code would have to export this
> lock, resulting in an ill-defined interface between the two, and the possibility
> of cross-architecture lock-ordering head aches.
> 
> The second part of this series adds a domain_list_lock to protect writes to the
> domain list, and protects the domain list with RCU - or cpus_read_lock().
> 
> Use of RCU is to allow lockless readers of the domain list. To get MPAMs monitors
> working, its very likely they'll need to be plumbed up to perf. An uncore PMU
> driver would need to be a lockless reader of the domain list.
> 
> This series is based on tip/x86/caches's commit fc747eebef73, and can be retrieved from:
> https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/monitors_and_locking/v9
> 
> Bugs welcome,
> 
> Thanks,
> 
> James
> 
> [1] https://lore.kernel.org/lkml/20210728170637.25610-1-james.morse@arm.com/
> [v1] https://lore.kernel.org/all/20221021131204.5581-1-james.morse@arm.com/
> [v2] https://lore.kernel.org/lkml/20230113175459.14825-1-james.morse@arm.com/
> [v3] https://lore.kernel.org/r/20230320172620.18254-1-james.morse@arm.com 
> [v4] https://lore.kernel.org/r/20230525180209.19497-1-james.morse@arm.com
> [v5] https://lore.kernel.org/lkml/20230728164254.27562-1-james.morse@arm.com/
> [v6] https://lore.kernel.org/all/20230914172138.11977-1-james.morse@arm.com/
> [v7] https://lore.kernel.org/r/20231025180345.28061-1-james.morse@arm.com/
> [v8] https://lore.kernel.org/lkml/20231215174343.13872-1-james.morse@arm.com/
> 
> James Morse (24):
>   tick/nohz: Move tick_nohz_full_mask declaration outside the #ifdef
>   x86/resctrl: kfree() rmid_ptrs from resctrl_exit()
>   x86/resctrl: Create helper for RMID allocation and mondata dir
>     creation
>   x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare()
>   x86/resctrl: Track the closid with the rmid
>   x86/resctrl: Access per-rmid structures by index
>   x86/resctrl: Allow RMID allocation to be scoped by CLOSID
>   x86/resctrl: Track the number of dirty RMID a CLOSID has
>   x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
>   x86/resctrl: Allocate the cleanest CLOSID by searching
>     closid_num_dirty_rmid
>   x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
>   x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
>   x86/resctrl: Queue mon_event_read() instead of sending an IPI
>   x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
>   x86/resctrl: Allow arch to allocate memory needed in
>     resctrl_arch_rmid_read()
>   x86/resctrl: Make resctrl_mounted checks explicit
>   x86/resctrl: Move alloc/mon static keys into helpers
>   x86/resctrl: Make rdt_enable_key the arch's decision to switch
>   x86/resctrl: Add helpers for system wide mon/alloc capable
>   x86/resctrl: Add CPU online callback for resctrl work
>   x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but
>     cpu
>   x86/resctrl: Add CPU offline callback for resctrl work
>   x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
>   x86/resctrl: Separate arch and fs resctrl locks
> 
>  arch/x86/include/asm/resctrl.h            |  90 +++++
>  arch/x86/kernel/cpu/resctrl/core.c        | 102 ++---
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  48 ++-
>  arch/x86/kernel/cpu/resctrl/internal.h    |  67 +++-
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 463 +++++++++++++++++-----
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  15 +-
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 359 ++++++++++++-----
>  include/linux/resctrl.h                   |  48 ++-
>  include/linux/tick.h                      |   9 +-
>  9 files changed, 921 insertions(+), 280 deletions(-)
>
  
Luck, Tony Feb. 17, 2024, 12:28 a.m. UTC | #3
On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
> Hello!
> 
> It's been back and forth for whether this series should be rebased onto Tony's
> SNC series. This version isn't, its based on tip/x86/cache.
> (I have the rebased-and-tested versions if anyone needs them)

In case James' patches go first, I took a crack at basing my SNC series
on top of his patches (specifically the mpam/monitors_and_locking/v9
branch of git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git).

Result is here:

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
Branch: james_then_snc

The end result of which ought to be pretty similar to the
"rebased-and-tested" versions that James mentions above.

-Tony
  
Borislav Petkov Feb. 17, 2024, 10:55 a.m. UTC | #4
On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
> Hello!
> 
> It's been back and forth for whether this series should be rebased onto Tony's
> SNC series. This version isn't, its based on tip/x86/cache.
> (I have the rebased-and-tested versions if anyone needs them)

The set applied ontop of tip:x86/cache gives:

vmlinux.o: in function `get_domain_from_cpu':
(.text+0x150f33): undefined reference to `lockdep_is_cpus_held'
ld: vmlinux.o: in function `rdt_ctrl_update':
(.text+0x150fbc): undefined reference to `lockdep_is_cpus_held'

Config

00-14-04-randconfig-x86_64-26892.cfg

attached.

Thx.
  
Thomas Gleixner Feb. 19, 2024, 4:49 p.m. UTC | #5
On Sat, Feb 17 2024 at 11:55, Borislav Petkov wrote:

> On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
>> Hello!
>> 
>> It's been back and forth for whether this series should be rebased onto Tony's
>> SNC series. This version isn't, its based on tip/x86/cache.
>> (I have the rebased-and-tested versions if anyone needs them)
>
> The set applied ontop of tip:x86/cache gives:
>
> vmlinux.o: in function `get_domain_from_cpu':
> (.text+0x150f33): undefined reference to `lockdep_is_cpus_held'
> ld: vmlinux.o: in function `rdt_ctrl_update':
> (.text+0x150fbc): undefined reference to `lockdep_is_cpus_held'

Wants to be folded into patch 24.

Thanks,

        tglx
---
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -368,8 +368,8 @@ struct rdt_domain *get_domain_from_cpu(i
 	 * about locks this thread holds will lead to false positives. Check
 	 * someone is holding the CPUs lock.
 	 */
-	if (IS_ENABLED(CONFIG_LOCKDEP))
-		lockdep_is_cpus_held();
+	if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP))
+		WARN_ON_ONCE(!lockdep_is_cpus_held());
 
 	list_for_each_entry(d, &r->domains, list) {
 		/* Find the domain that contains this CPU */
  
James Morse Feb. 19, 2024, 4:53 p.m. UTC | #6
Hi Thomas,

On 19/02/2024 16:49, Thomas Gleixner wrote:
> On Sat, Feb 17 2024 at 11:55, Borislav Petkov wrote:
>> On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
>>> It's been back and forth for whether this series should be rebased onto Tony's
>>> SNC series. This version isn't, its based on tip/x86/cache.
>>> (I have the rebased-and-tested versions if anyone needs them)
>>
>> The set applied ontop of tip:x86/cache gives:
>>
>> vmlinux.o: in function `get_domain_from_cpu':
>> (.text+0x150f33): undefined reference to `lockdep_is_cpus_held'
>> ld: vmlinux.o: in function `rdt_ctrl_update':
>> (.text+0x150fbc): undefined reference to `lockdep_is_cpus_held'
> 
> Wants to be folded into patch 24.

Thanks - I'm just putting a v10 together to fix this.


Thanks,

James
  
James Morse Feb. 19, 2024, 4:53 p.m. UTC | #7
Hi Babu,

On 14/02/2024 15:01, Moger, Babu wrote:
> Sanity tested the series again on AMD system. Everything looks good.
> 
> Tested-by: Babu Moger <babu.moger@amd.com>


Thanks!

James
  
James Morse Feb. 19, 2024, 4:54 p.m. UTC | #8
Hi Boris,

Thanks for the config, (as Thomas pointed out) this is coming from an awkward lockdep
annotation - turns out it also depends on HOTPLUG_CPU.

I'll post a v10 with the collected tags and this fixed.


Thanks,

James

On 17/02/2024 10:55, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
>> Hello!
>>
>> It's been back and forth for whether this series should be rebased onto Tony's
>> SNC series. This version isn't, its based on tip/x86/cache.
>> (I have the rebased-and-tested versions if anyone needs them)
> 
> The set applied ontop of tip:x86/cache gives:
> 
> vmlinux.o: in function `get_domain_from_cpu':
> (.text+0x150f33): undefined reference to `lockdep_is_cpus_held'
> ld: vmlinux.o: in function `rdt_ctrl_update':
> (.text+0x150fbc): undefined reference to `lockdep_is_cpus_held'
> 
> Config
> 
> 00-14-04-randconfig-x86_64-26892.cfg
> 
> attached.
> 
> Thx.
>
  
Borislav Petkov Feb. 19, 2024, 5:51 p.m. UTC | #9
On Mon, Feb 19, 2024 at 04:53:38PM +0000, James Morse wrote:
> Thanks - I'm just putting a v10 together to fix this.

No need - I'll fold it in.

Thx.
  
James Morse Feb. 19, 2024, 5:55 p.m. UTC | #10
On 19/02/2024 17:51, Borislav Petkov wrote:
> On Mon, Feb 19, 2024 at 04:53:38PM +0000, James Morse wrote:
>> Thanks - I'm just putting a v10 together to fix this.
> 
> No need - I'll fold it in.

Thanks!

James
  
Reinette Chatre Feb. 20, 2024, 6:18 p.m. UTC | #11
Hi Tony,

On 2/16/2024 4:28 PM, Tony Luck wrote:
> On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
>> Hello!
>>
>> It's been back and forth for whether this series should be rebased onto Tony's
>> SNC series. This version isn't, its based on tip/x86/cache.
>> (I have the rebased-and-tested versions if anyone needs them)
> 
> In case James' patches go first, I took a crack at basing my SNC series
> on top of his patches (specifically the mpam/monitors_and_locking/v9
> branch of git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git).
> 
> Result is here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
> Branch: james_then_snc
> 
> The end result of which ought to be pretty similar to the
> "rebased-and-tested" versions that James mentions above.

As I understand Babu withdrew his "Reviewed-by" tag [1]. Will you
be posting this new version (with tip tag ordering)?

Reinette

[1] https://lore.kernel.org/lkml/e94db745-9454-4a10-8398-f3b0bc0128e8@amd.com/
  
Luck, Tony Feb. 20, 2024, 6:48 p.m. UTC | #12
>> The end result of which ought to be pretty similar to the
>> "rebased-and-tested" versions that James mentions above.
>
> As I understand Babu withdrew his "Reviewed-by" tag [1]. Will you
> be posting this new version (with tip tag ordering)?

Reinette

A couple of my patches required extensive massage to apply on top
of James series (which I see is now in TIP x86/cache).

I'll rebase to tip and remove all the Reviewed and Tested tags and post
as v15 soon.

-Tony
  
Luck, Tony Feb. 20, 2024, 8:59 p.m. UTC | #13
On Mon, Feb 19, 2024 at 05:49:29PM +0100, Thomas Gleixner wrote:
> On Sat, Feb 17 2024 at 11:55, Borislav Petkov wrote:
> 
> > On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
> >> Hello!
> >> 
> >> It's been back and forth for whether this series should be rebased onto Tony's
> >> SNC series. This version isn't, its based on tip/x86/cache.
> >> (I have the rebased-and-tested versions if anyone needs them)
> >
> > The set applied ontop of tip:x86/cache gives:
> >
> > vmlinux.o: in function `get_domain_from_cpu':
> > (.text+0x150f33): undefined reference to `lockdep_is_cpus_held'
> > ld: vmlinux.o: in function `rdt_ctrl_update':
> > (.text+0x150fbc): undefined reference to `lockdep_is_cpus_held'
> 
> Wants to be folded into patch 24.
> 
> Thanks,
> 
>         tglx
> ---
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -368,8 +368,8 @@ struct rdt_domain *get_domain_from_cpu(i
>  	 * about locks this thread holds will lead to false positives. Check
>  	 * someone is holding the CPUs lock.
>  	 */
> -	if (IS_ENABLED(CONFIG_LOCKDEP))
> -		lockdep_is_cpus_held();
> +	if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP))
> +		WARN_ON_ONCE(!lockdep_is_cpus_held());
>  
>  	list_for_each_entry(d, &r->domains, list) {
>  		/* Find the domain that contains this CPU */

Testing tip x86/cache that WARN fires while running
tools/tests/selftests/resctrl/resctrl_test.

Everthing runs OK if I drop the top commit:
  fb700810d30b ("x86/resctrl: Separate arch and fs resctrl locks")

-Tony


[  663.817986] ------------[ cut here ]------------
[  663.822667] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/resctrl/core.c:372 get_domain_from_cpu+0x45/0x50
[  663.832332] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif sunrpc binfmt_misc kvm dax_hmem cxl_acpi irqbypass vfat iTCO_wdt rapl intel_pmc_bxt iTCO_vendor_support intel_cstate fat intel_uncore cxl_core pcspkr acpi_ipmi isst_if_mmio isst_if_mbox_pci i2c_i801 isst_if_common mei_me i2c_smbus mei intel_pch_thermal intel_vsec ioatdma ipmi_si ipmi_devintf joydev ipmi_msghandler acpi_power_meter acpi_pad loop zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 ixgbe igb sha256_ssse3 ast
[  663.832534]  mdio sha1_ssse3 i2c_algo_bit dca wmi ip6_tables ip_tables fuse
[  663.929224] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc1+ #247
[  663.935662] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0021.P06.2104260458 04/26/2021
[  663.946175] RIP: 0010:get_domain_from_cpu+0x45/0x50
[  663.951061] Code: 73 40 89 ef 48 39 f0 75 0a eb 16 48 8b 00 48 39 f0 74 0e 48 0f a3 78 18 73 f1 5b 5d c3 cc cc cc cc 31 c0 5b 5d c3 cc cc cc cc <0f> 0b eb cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
[  663.969807] RSP: 0018:ff22c48cc0003f80 EFLAGS: 00010046
[  663.975042] RAX: 0000000000000000 RBX: ffffffff92e4a3a0 RCX: 0000000000000001
[  663.982174] RDX: 0000000000000000 RSI: ffffffff92aa8289 RDI: ffffffff92b56a1e
[  663.989305] RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
[  663.996436] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff92e4a3a0
[  664.003570] R13: 0000000000000000 R14: ffffffff910672c0 R15: ff22c48ce0e17df0
[  664.010699] FS:  0000000000000000(0000) GS:ff1f2c6cdde00000(0000) knlGS:0000000000000000
[  664.018785] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  664.024530] CR2: 00007f608c001048 CR3: 0000000582e38003 CR4: 0000000000771ef0
[  664.031664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  664.038794] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  664.045927] PKRU: 55555554
[  664.048641] Call Trace:
[  664.051093]  <IRQ>
[  664.053113]  ? get_domain_from_cpu+0x45/0x50
[  664.057391]  ? __warn+0x81/0x170
[  664.060637]  ? get_domain_from_cpu+0x45/0x50
[  664.064917]  ? report_bug+0x18d/0x1c0
[  664.068593]  ? handle_bug+0x3c/0x80
[  664.072089]  ? exc_invalid_op+0x13/0x60
[  664.075931]  ? asm_exc_invalid_op+0x16/0x20
[  664.080124]  ? __pfx_rdt_ctrl_update+0x10/0x10
[  664.084574]  ? get_domain_from_cpu+0x45/0x50
[  664.088854]  rdt_ctrl_update+0x20/0x70
[  664.092613]  __flush_smp_call_function_queue+0xdd/0x560
[  664.097848]  __sysvec_call_function+0x32/0x110
[  664.102301]  sysvec_call_function+0x99/0xc0
[  664.106497]  </IRQ>
[  664.108602]  <TASK>
[  664.110708]  asm_sysvec_call_function+0x16/0x20
[  664.115246] RIP: 0010:cpuidle_enter_state+0xfb/0x4f0
[  664.120221] Code: c0 48 0f a3 05 16 4d 0a 01 0f 82 fb 02 00 00 31 ff e8 d9 32 06 ff 45 84 ff 0f 85 cb 02 00 00 e8 ab 70 18 ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 eb 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[  664.138966] RSP: 0018:ffffffff92e03e28 EFLAGS: 00000202
[  664.144191] RAX: 00000000002458c1 RBX: ff54c484be002f38 RCX: 000000000000001f
[  664.151323] RDX: 0000000000000000 RSI: ffffffff92be36c1 RDI: ffffffff92b56a1e
[  664.158454] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000001
[  664.165588] R10: 0000000000000003 R11: ff1f2c6cdde34de4 R12: ffffffff930d0560
[  664.172719] R13: 0000009a8ea28623 R14: 0000000000000002 R15: 0000000000000000
[  664.179860]  ? cpuidle_enter_state+0xf5/0x4f0
[  664.184221]  cpuidle_enter+0x29/0x40
[  664.187808]  do_idle+0x231/0x290
[  664.191052]  cpu_startup_entry+0x26/0x30
[  664.194983]  rest_init+0xf1/0x190
[  664.198303]  arch_call_rest_init+0xa/0x30
[  664.202323]  start_kernel+0x8b8/0xac0
[  664.205993]  x86_64_start_reservations+0x14/0x30
[  664.210617]  x86_64_start_kernel+0x92/0xa0
[  664.214725]  secondary_startup_64_no_verify+0x184/0x18b
[  664.219965]  </TASK>
[  664.222159] irq event stamp: 2382018
[  664.225737] hardirqs last  enabled at (2382017): [<ffffffff9212dea5>] cpuidle_enter_state+0xf5/0x4f0
[  664.234873] hardirqs last disabled at (2382018): [<ffffffff9212b29a>] sysvec_call_function+0xa/0xc0
[  664.243920] softirqs last  enabled at (2382012): [<ffffffff91120705>] __irq_exit_rcu+0xa5/0x110
[  664.252621] softirqs last disabled at (2381797): [<ffffffff91120705>] __irq_exit_rcu+0xa5/0x110
[  664.261311] ---[ end trace 0000000000000000 ]---
  
Reinette Chatre Feb. 20, 2024, 10:58 p.m. UTC | #14
On 2/20/2024 12:59 PM, Tony Luck wrote:
> On Mon, Feb 19, 2024 at 05:49:29PM +0100, Thomas Gleixner wrote:
>> On Sat, Feb 17 2024 at 11:55, Borislav Petkov wrote:
>>
>>> On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
>>>> Hello!
>>>>
>>>> It's been back and forth for whether this series should be rebased onto Tony's
>>>> SNC series. This version isn't, its based on tip/x86/cache.
>>>> (I have the rebased-and-tested versions if anyone needs them)
>>>
>>> The set applied ontop of tip:x86/cache gives:
>>>
>>> vmlinux.o: in function `get_domain_from_cpu':
>>> (.text+0x150f33): undefined reference to `lockdep_is_cpus_held'
>>> ld: vmlinux.o: in function `rdt_ctrl_update':
>>> (.text+0x150fbc): undefined reference to `lockdep_is_cpus_held'
>>
>> Wants to be folded into patch 24.
>>
>> Thanks,
>>
>>         tglx
>> ---
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -368,8 +368,8 @@ struct rdt_domain *get_domain_from_cpu(i
>>  	 * about locks this thread holds will lead to false positives. Check
>>  	 * someone is holding the CPUs lock.
>>  	 */
>> -	if (IS_ENABLED(CONFIG_LOCKDEP))
>> -		lockdep_is_cpus_held();
>> +	if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP))
>> +		WARN_ON_ONCE(!lockdep_is_cpus_held());
>>  
>>  	list_for_each_entry(d, &r->domains, list) {
>>  		/* Find the domain that contains this CPU */
> 
> Testing tip x86/cache that WARN fires while running
> tools/tests/selftests/resctrl/resctrl_test.
> 
> Everthing runs OK if I drop the top commit:
>   fb700810d30b ("x86/resctrl: Separate arch and fs resctrl locks")

The new WARN_ON_ONCE() is why this encountered. The comment notes that
lockdep_is_cpus_held() is used to determine if "someone is holding the
CPUs lock" but it seems that lockdep_is_cpus_held() still only checks
if "current" is holding cpu_hotplug_lock and that is not possible
when running the code via IPI.

The trace that Tony shared notes that this is triggered by get_domain_from_cpu()
called via rdt_ctrl_update(). rdt_ctrl_update() is only run via IPI:

	resctrl_arch_update_domains() {
		...
		lockdep_assert_cpus_held();
		...
		on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1);
		...
	}

and

	reset_all_ctrls() {
		...
		lockdep_assert_cpus_held();
		...
		on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1);
	}

I sprinkled some debug_show_held_locks(current) to confirm and encountered
the following when reproducing the trace using the resctrl tests:

[  202.914334] resctrl_arch_update_domains:355
[  202.919971] 4 locks held by resctrl_tests/3330:
[  202.925169]  #0: ff11001086e09408 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x69/0x100
[  202.934375]  #1: ff110010bb653688 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf7/0x200
[  202.944348]  #2: ffffffff8346c890 (cpu_hotplug_lock){++++}-{0:0}, at: rdtgroup_kn_lock_live+0x4c/0xa0
[  202.954774]  #3: ffffffff8344ae68 (rdtgroup_mutex){+.+.}-{3:3}, at: rdtgroup_kn_lock_live+0x5a/0xa0
[  202.965030] get_domain_from_cpu:366
[  202.969087] no locks held by swapper/0/0.
[  202.973697] ------------[ cut here ]------------
[  202.978979] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/resctrl/core.c:375 get_domain_from_cpu+0x6f/0x80
<SNIP>
[  203.200095] Call Trace:
[  203.202947]  <TASK>
[  203.205406]  ? __warn+0x84/0x180
[  203.209123]  ? get_domain_from_cpu+0x6f/0x80
[  203.214011]  ? report_bug+0x1c7/0x1e0
[  203.218214]  ? handle_bug+0x3c/0x80
[  203.222230]  ? exc_invalid_op+0x18/0x80
[  203.227198]  ? asm_exc_invalid_op+0x1a/0x20
[  203.232529]  ? __pfx_rdt_ctrl_update+0x20/0x20
[  203.238146]  ? get_domain_from_cpu+0x6f/0x80
[  203.243548]  rdt_ctrl_update+0x26/0x80
<SNIP>


So even though it is confirmed via lockdep_assert_cpus_held() that
resctrl_arch_update_domains() holds cpu_hotplug_lock, it does not seem possible
to have a similar lockdep check in the function called by it (resctrl_arch_update_domains())
via IPI. It thus does not look like that lockdep checking within
get_domain_from_cpu() can be accurate and I cannot see what it can be replaced with
to make it accurate. Any guidance will be appreciated. Perhaps we should just drop (but
with detailed context comments remaining) the lockdep check in get_domain_from_cpu()? 

Reinette
  
Luck, Tony Feb. 20, 2024, 11:25 p.m. UTC | #15
> So even though it is confirmed via lockdep_assert_cpus_held() that
> resctrl_arch_update_domains() holds cpu_hotplug_lock, it does not seem possible
> to have a similar lockdep check in the function called by it (resctrl_arch_update_domains())
> via IPI. It thus does not look like that lockdep checking within
> get_domain_from_cpu() can be accurate and I cannot see what it can be replaced with
> to make it accurate. Any guidance will be appreciated. Perhaps we should just drop (but
> with detailed context comments remaining) the lockdep check in get_domain_from_cpu()? 

Reinette

Both the places where this has problems (reset_all_ctrls() and
resctrl_arch_update_domains()) have similar structure:


	 list_for_each_entry(d, &r->domains, list) {
		add some bits to a cpumask
	}

	on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1);


Maybe instead of collecting all CPUs that need to do something, and then each
of them backtrack and search for the domain from a resource (that is passed
in the msr_param argument). The code could be restructured to pass the domain
to the target function. Like this:


	list_for_each_entry(d, &r->domains, list) {
		msr_param.dom = d;
		smp_call_function_single(somecpu, rdt_ctrl_update, &msr_param, 1);
	}

I'll try coding this up to see if it works.

-Tony
  
James Morse Feb. 21, 2024, 12:06 p.m. UTC | #16
Hi Tony, Reinette,

On 20/02/2024 22:58, Reinette Chatre wrote:
> On 2/20/2024 12:59 PM, Tony Luck wrote:
>> On Mon, Feb 19, 2024 at 05:49:29PM +0100, Thomas Gleixner wrote:
>>> On Sat, Feb 17 2024 at 11:55, Borislav Petkov wrote:
>>>
>>>> On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
>>>>> Hello!
>>>>>
>>>>> It's been back and forth for whether this series should be rebased onto Tony's
>>>>> SNC series. This version isn't, its based on tip/x86/cache.
>>>>> (I have the rebased-and-tested versions if anyone needs them)
>>>>
>>>> The set applied ontop of tip:x86/cache gives:

>> Testing tip x86/cache that WARN fires while running
>> tools/tests/selftests/resctrl/resctrl_test.

I evidently need to build a newer version of that tool.


>> Everthing runs OK if I drop the top commit:
>>   fb700810d30b ("x86/resctrl: Separate arch and fs resctrl locks")
> 
> The new WARN_ON_ONCE() is why this encountered. The comment notes that
> lockdep_is_cpus_held() is used to determine if "someone is holding the
> CPUs lock" but it seems that lockdep_is_cpus_held() still only checks
> if "current" is holding cpu_hotplug_lock and that is not possible
> when running the code via IPI.

I was evidently mistaken that this was the difference between
lockdep_is_cpus_held() and lockdep_assert_cpus_held().

It's a false positive, ripping out the check is the simplest thing to do.


> So even though it is confirmed via lockdep_assert_cpus_held() that
> resctrl_arch_update_domains() holds cpu_hotplug_lock, it does not seem possible
> to have a similar lockdep check in the function called by it (resctrl_arch_update_domains())
> via IPI. It thus does not look like that lockdep checking within
> get_domain_from_cpu() can be accurate and I cannot see what it can be replaced with
> to make it accurate. Any guidance will be appreciated. Perhaps we should just drop (but
> with detailed context comments remaining) the lockdep check in get_domain_from_cpu()? 


Thanks,

James
  
Reinette Chatre Feb. 21, 2024, 4:48 p.m. UTC | #17
Hi James,

On 2/21/2024 4:06 AM, James Morse wrote:
> Hi Tony, Reinette,
> 
> On 20/02/2024 22:58, Reinette Chatre wrote:
>> On 2/20/2024 12:59 PM, Tony Luck wrote:
>>> On Mon, Feb 19, 2024 at 05:49:29PM +0100, Thomas Gleixner wrote:
>>>> On Sat, Feb 17 2024 at 11:55, Borislav Petkov wrote:
>>>>
>>>>> On Tue, Feb 13, 2024 at 06:44:14PM +0000, James Morse wrote:
>>>>>> Hello!
>>>>>>
>>>>>> It's been back and forth for whether this series should be rebased onto Tony's
>>>>>> SNC series. This version isn't, its based on tip/x86/cache.
>>>>>> (I have the rebased-and-tested versions if anyone needs them)
>>>>>
>>>>> The set applied ontop of tip:x86/cache gives:
> 
>>> Testing tip x86/cache that WARN fires while running
>>> tools/tests/selftests/resctrl/resctrl_test.
> 
> I evidently need to build a newer version of that tool.

There has been a lot of changes in the last few cycles. You can find the
latest version on the "next" branch of the kselftest repo [1] where most recent
enhancements are queued up for inclusion. If you are interested, there is one
more series [2] pending merge to the kselftest repo, it adds a new test for
the recent non-contiguous CBM support.

Reinette

[1] git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git
[2] https://lore.kernel.org/lkml/cover.1708072203.git.maciej.wieczor-retman@intel.com/
  
Luck, Tony Feb. 21, 2024, 5:30 p.m. UTC | #18
>>>> Testing tip x86/cache that WARN fires while running
>>>> tools/tests/selftests/resctrl/resctrl_test.
>> 
>> I evidently need to build a newer version of that tool.
>
> There has been a lot of changes in the last few cycles. You can find the
> latest version on the "next" branch of the kselftest repo [1] where most recent
> enhancements are queued up for inclusion. If you are interested, there is one
> more series [2] pending merge to the kselftest repo, it adds a new test for
> the recent non-contiguous CBM support.

James,

You didn't see this WARN because it isn't in your patch. Diff between what
is in your tree and what was applied to TIP:

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 9f1aa555a8ea..8a4ef4f5bddc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -368,8 +368,8 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
         * about locks this thread holds will lead to false positives. Check
         * someone is holding the CPUs lock.
         */
-       if (IS_ENABLED(CONFIG_LOCKDEP))
-               lockdep_is_cpus_held();
+       if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP))
+               WARN_ON_ONCE(!lockdep_is_cpus_held());

        list_for_each_entry(d, &r->domains, list) {
                /* Find the domain that contains this CPU */
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index dc59643498bf..7997b47743a2 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -567,7 +567,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
         * cpumask_any_housekeeping() prefers housekeeping CPUs, but
         * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
         * MPAM's resctrl_arch_rmid_read() is unable to read the
-        * counters on some platforms if its called in irq context.
+        * counters on some platforms if its called in IRQ context.
         */
        if (tick_nohz_full_cpu(cpu))
                smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);

-Tony