[v2,00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Message ID cover.1705688538.git.babu.moger@amd.com
Headers
Series x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) |

Message

Moger, Babu Jan. 19, 2024, 6:22 p.m. UTC
  These series adds the support for Assignable Bandwidth Monitoring Counters
(ABMC). It is also called QoS RMID Pinning feature

The feature details are documented in the  APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)

# Introduction

AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
feature only guarantees that RMIDs currently assigned to a processor will
be tracked by hardware. The counters of any other RMIDs which are no longer
being tracked will be reset to zero. The MBM event counters return
"Unavailable" for the RMIDs that are not active.
    
Users can create 256 or more monitor groups. But there can be only limited
number of groups that can be give guaranteed monitoring numbers.  With ever
changing configurations there is no way to definitely know which of these
groups will be active for certain point of time. Users do not have the
option to monitor a group or set of groups for certain period of time
without worrying about RMID being reset in between.
    
The ABMC feature provides an option to the user to assign an RMID to the
hardware counter and monitor the bandwidth for a longer duration.
The assigned RMID will be active until the user unassigns it manually.
There is no need to worry about counters being reset during this period.
Additionally, the user can specify a bitmask identifying the specific
bandwidth types from the given source to track with the counter.

Without ABMC enabled, monitoring will work in current mode without
assignment option.

# Linux Implementation

Linux resctrl subsystem provides the interface to count maximum of two
memory bandwidth events per group, from a combination of available total
and local events. Keeping the current interface, users can assign a maximum
of 2 ABMC counters per group. User will also have the option to assign only
one counter to the group. If the system runs out of assignable ABMC
counters, kernel will display an error. Users need to unassign an already
assigned counter to make space for new assignments.


# Examples

a. Check if ABMC support is available
	#mount -t resctrl resctrl /sys/fs/resctrl/

	#cat /sys/fs/resctrl/info/L3_MON/mon_features 
	llc_occupancy
	mbm_total_bytes
	mbm_total_bytes_config
	mbm_local_bytes
	mbm_local_bytes_config
	mbm_assign_capable ←  Linux kernel detected ABMC feature

b. Check if ABMC is enabled. By default, ABMC feature is disabled.
   Monitoring works in legacy monitor mode when ABMC is not enabled.

	#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
	0

c. There will be new file "monitor_state" for each monitor group when ABMC
   feature is supported. However, monitor_state is not available if ABMC is
   disabled.
	
	#cat /sys/fs/resctrl/monitor_state 
	Unsupported
	
d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
   enabled, monitoring will work in current mode without assignment option.
	
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	779247936
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
	765207488
	
e. Enable ABMC mode.

	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
        #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
        1

f. Read the monitor states. By default, both total and local MBM
	events are in "unassign" state.
	
	#cat /sys/fs/resctrl/monitor_state
	total=unassign;local=unassign

g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
   the MBA events are not available until the user assigns the events
   explicitly.
	
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	Unsupported
	
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
	Unsupported

h. The event llc_occupancy is not affected by ABMC mode. Users can still
   read the llc_occupancy.

	#cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy 
	557056

i. Now assign the total event and read the monitor_state.
	
	#echo total=assign > /sys/fs/resctrl/monitor_state
	#cat /sys/fs/resctrl/monitor_state 
	total=assign;local=unassign
	
j. Now that the total event is assigned. Read the total event.
	
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	6136000
	
k. Now assign the local event and read the monitor_state.
	
	#echo local=assign > /sys/fs/resctrl/monitor_state
	#cat /sys/fs/resctrl/monitor_state
	total=assign;local=assign

        Users can also assign both total and local events in one single
	command.

l. Now that both total and local events are assigned, read the events.
	
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	6136000
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
	58694
	
m. Check the bandwidth configuration for the group. Note that bandwidth
   configuration has a domain scope. Total event defaults to 0x7F (to
   count all the events) and local event defaults to 0x15 (to count all
   the local numa events). The event bitmap decoding is available at
   https://www.kernel.org/doc/Documentation/x86/resctrl.rst
   in section "mbm_total_bytes_config", "mbm_local_bytes_config":
	
	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	0=0x7f;1=0x7f
	
	#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 
	0=0x15;1=0xi15
	
n. Change the bandwidth source for domain 0 for the total event to count only reads.
   Note that this change effects lotal events on the domain 0.
	
	#echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	0=0x33;1=0x7F
	
o. Now read the total event again. The mbm_total_bytes should display
   only the read events.
	
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	314101
	
p. Unmount the resctrl
	 
	#umount /sys/fs/resctrl/

NOTE: For simplicity these examples are run on a default resctrl group.
Similar experiments are can be run non-defaults groups.
---
v2:
   a. Major change is the way ABMC is enabled. Earlier, user needed to remount
      with -o abmc to enable ABMC feature. Removed that option now.
      Now users can enable ABMC by "$echo 1 to /sys/fs/resctrl/info/L3_MON/mbm_assign_enable".
     
   b. Added new word 21 to x86/cpufeatures.h.

   c. Display unsupported if user attempts to read the events when ABMC is enabled
      and event is not assigned.

   d. Display monitor_state as "Unsupported" when ABMC is disabled.
  
   e. Text updates and rebase to latest tip tree (as of Jan 18).
 
   f. This series is still work in progress. I am yet to hear from ARM developers. 

v1 :
   https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/

Babu Moger (17):
  x86/cpufeatures: Add word 21 for scattered CPUID features
  x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters
    (ABMC)
  x86/resctrl: Add ABMC feature in the command line options
  x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
  x86/resctrl: Introduce resctrl_file_fflags_init
  x86/resctrl: Introduce interface to display number of ABMC counters
  x86/resctrl: Add support to enable/disable ABMC feature
  x86/resctrl: Introduce the interface to display ABMC state
  x86/resctrl: Introdruce rdtgroup_assign_enable_write
  x86/resctrl: Add interface to display monitor state of the group
  x86/resctrl: Report Unsupported when MBM events are read
  x86/resctrl: Initialize assignable counters bitmap
  x86/resctrl: Add data structures for ABMC assignment
  x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg
  x86/resctrl: Add the interface to assign the RMID
  x86/resctrl: Add the interface unassign the RMID
  x86/resctrl: Update RMID assignments on event configuration changes

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/arch/x86/resctrl.rst            |  62 +++
 arch/x86/include/asm/cpufeature.h             |   6 +-
 arch/x86/include/asm/cpufeatures.h            |   7 +-
 arch/x86/include/asm/disabled-features.h      |   3 +-
 arch/x86/include/asm/msr-index.h              |   2 +
 arch/x86/include/asm/required-features.h      |   3 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   3 +
 arch/x86/kernel/cpu/resctrl/core.c            |  25 +-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c     |  15 +
 arch/x86/kernel/cpu/resctrl/internal.h        |  54 +-
 arch/x86/kernel/cpu/resctrl/monitor.c         |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        | 466 +++++++++++++++++-
 arch/x86/kernel/cpu/scattered.c               |   1 +
 include/linux/resctrl.h                       |   2 +
 15 files changed, 651 insertions(+), 26 deletions(-)
  

Comments

Reinette Chatre Jan. 19, 2024, 6:32 p.m. UTC | #1
+James

On 1/19/2024 10:22 AM, Babu Moger wrote:
>  
>    f. This series is still work in progress. I am yet to hear from ARM developers. 

Please at least include James in your submissions to make Arm aware of this work.

Reinette
  
Moger, Babu Jan. 19, 2024, 8:35 p.m. UTC | #2
On 1/19/2024 12:32 PM, Reinette Chatre wrote:
> +James
>
> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>   
>>     f. This series is still work in progress. I am yet to hear from ARM developers.
> Please at least include James in your submissions to make Arm aware of this work.

Thank you

Babu Moger
  
Reinette Chatre Feb. 2, 2024, 4:09 a.m. UTC | #3
Hi Babu,

On 1/19/2024 10:22 AM, Babu Moger wrote:
> 
> These series adds the support for Assignable Bandwidth Monitoring Counters

Not a good start ([1]).

> (ABMC). It is also called QoS RMID Pinning feature
> 
> The feature details are documented in the  APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> 
> The patches are based on top of commit
> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
> 
> # Introduction
> 
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no longer
> being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>     
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can be give guaranteed monitoring numbers.  With ever

"can be given"?

> changing configurations there is no way to definitely know which of these
> groups will be active for certain point of time. Users do not have the
> option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>     
> The ABMC feature provides an option to the user to assign an RMID to the
> hardware counter and monitor the bandwidth for a longer duration.
> The assigned RMID will be active until the user unassigns it manually.
> There is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.
> 
> Without ABMC enabled, monitoring will work in current mode without
> assignment option.
> 
> # Linux Implementation
> 
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can assign a maximum
> of 2 ABMC counters per group. User will also have the option to assign only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to unassign an already
> assigned counter to make space for new assignments.
> 
> 
> # Examples
> 
> a. Check if ABMC support is available
> 	#mount -t resctrl resctrl /sys/fs/resctrl/
> 
> 	#cat /sys/fs/resctrl/info/L3_MON/mon_features 
> 	llc_occupancy
> 	mbm_total_bytes
> 	mbm_total_bytes_config
> 	mbm_local_bytes
> 	mbm_local_bytes_config
> 	mbm_assign_capable ←  Linux kernel detected ABMC feature
> 
> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>    Monitoring works in legacy monitor mode when ABMC is not enabled.
> 
> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> 	0
> 

With the introduction of "mbm_assign_enable" the entry in mon_features seems
to provide duplicate information.

> c. There will be new file "monitor_state" for each monitor group when ABMC
>    feature is supported. However, monitor_state is not available if ABMC is
>    disabled.
> 	
> 	#cat /sys/fs/resctrl/monitor_state 
> 	Unsupported

This sounds potentially confusing since users will still be able to monitor
the groups ...

> 	
> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>    enabled, monitoring will work in current mode without assignment option.
> 	
> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	779247936
> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
> 	765207488
> 	
> e. Enable ABMC mode.
> 
> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>         1
> 
> f. Read the monitor states. By default, both total and local MBM
> 	events are in "unassign" state.
> 	
> 	#cat /sys/fs/resctrl/monitor_state
> 	total=unassign;local=unassign

This interface does not seem to take into account that hardware
can support assignment per domain. I understand that this is
not something you want to implement at this time but the user interface
has to accommodate such an enhancement. This was already mentioned, and
you did acknowledge the point [3] to this new version that does not
reflect this is unexpected.

My previous suggestions do seem to still stand and and I also am not able to
see how Peter's requests [2] were considered. This same interface needs to
accommodate usages apart from ABMC. For example, how to use this interface
to address the same counter issue on AMD hardware without ABMC, and MPAM
(pending James's feedback). 

I understand that until we hear from Arm we do not know all the requirements
that this interface needs to support, but I do expect this interface to
at least consider requirements and usage scenarios that are already known.
 
> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>    the MBA events are not available until the user assigns the events
>    explicitly.
> 	
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	Unsupported
> 	
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
> 	Unsupported
> 

This needs some more thought to accommodate Peter's scenario where the counter
can be expected to return the final count after the counter is disabled.

> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>    read the llc_occupancy.
> 
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy 
> 	557056
> 
> i. Now assign the total event and read the monitor_state.
> 	
> 	#echo total=assign > /sys/fs/resctrl/monitor_state
> 	#cat /sys/fs/resctrl/monitor_state 
> 	total=assign;local=unassign
> 	

I do not see the "global assign/unassign" scenario addressed.

This version seems to ignore (without discussion) a lot of earlier
feedback.

Reinette

[1] https://lore.kernel.org/lkml/5ce67d8f-e207-4029-8fb3-0bc7deab1e9f@amd.com/
[2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
[3] https://lore.kernel.org/lkml/38421428-84cb-b67e-f3ce-b7a0233e016b@amd.com/
  
Reinette Chatre Feb. 2, 2024, 5:01 a.m. UTC | #4
+James

On 2/1/2024 8:09 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>
>> These series adds the support for Assignable Bandwidth Monitoring Counters
> 
> Not a good start ([1]).
> 
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> The feature details are documented in the  APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no longer
>> being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>     
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can be give guaranteed monitoring numbers.  With ever
> 
> "can be given"?
> 
>> changing configurations there is no way to definitely know which of these
>> groups will be active for certain point of time. Users do not have the
>> option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>     
>> The ABMC feature provides an option to the user to assign an RMID to the
>> hardware counter and monitor the bandwidth for a longer duration.
>> The assigned RMID will be active until the user unassigns it manually.
>> There is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
>>
>> Without ABMC enabled, monitoring will work in current mode without
>> assignment option.
>>
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can assign a maximum
>> of 2 ABMC counters per group. User will also have the option to assign only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to unassign an already
>> assigned counter to make space for new assignments.
>>
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> 	#mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> 	#cat /sys/fs/resctrl/info/L3_MON/mon_features 
>> 	llc_occupancy
>> 	mbm_total_bytes
>> 	mbm_total_bytes_config
>> 	mbm_local_bytes
>> 	mbm_local_bytes_config
>> 	mbm_assign_capable ←  Linux kernel detected ABMC feature
>>
>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>>    Monitoring works in legacy monitor mode when ABMC is not enabled.
>>
>> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 	0
>>
> 
> With the introduction of "mbm_assign_enable" the entry in mon_features seems
> to provide duplicate information.
> 
>> c. There will be new file "monitor_state" for each monitor group when ABMC
>>    feature is supported. However, monitor_state is not available if ABMC is
>>    disabled.
>> 	
>> 	#cat /sys/fs/resctrl/monitor_state 
>> 	Unsupported
> 
> This sounds potentially confusing since users will still be able to monitor
> the groups ...
> 
>> 	
>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>>    enabled, monitoring will work in current mode without assignment option.
>> 	
>> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	779247936
>> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
>> 	765207488
>> 	
>> e. Enable ABMC mode.
>>
>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>         1
>>
>> f. Read the monitor states. By default, both total and local MBM
>> 	events are in "unassign" state.
>> 	
>> 	#cat /sys/fs/resctrl/monitor_state
>> 	total=unassign;local=unassign
> 
> This interface does not seem to take into account that hardware
> can support assignment per domain. I understand that this is
> not something you want to implement at this time but the user interface
> has to accommodate such an enhancement. This was already mentioned, and
> you did acknowledge the point [3] to this new version that does not
> reflect this is unexpected.
> 
> My previous suggestions do seem to still stand and and I also am not able to
> see how Peter's requests [2] were considered. This same interface needs to
> accommodate usages apart from ABMC. For example, how to use this interface
> to address the same counter issue on AMD hardware without ABMC, and MPAM
> (pending James's feedback). 
> 
> I understand that until we hear from Arm we do not know all the requirements
> that this interface needs to support, but I do expect this interface to
> at least consider requirements and usage scenarios that are already known.
>  
>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>>    the MBA events are not available until the user assigns the events
>>    explicitly.
>> 	
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	Unsupported
>> 	
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
>> 	Unsupported
>>
> 
> This needs some more thought to accommodate Peter's scenario where the counter
> can be expected to return the final count after the counter is disabled.
> 
>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>>    read the llc_occupancy.
>>
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy 
>> 	557056
>>
>> i. Now assign the total event and read the monitor_state.
>> 	
>> 	#echo total=assign > /sys/fs/resctrl/monitor_state
>> 	#cat /sys/fs/resctrl/monitor_state 
>> 	total=assign;local=unassign
>> 	
> 
> I do not see the "global assign/unassign" scenario addressed.
> 
> This version seems to ignore (without discussion) a lot of earlier
> feedback.
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/5ce67d8f-e207-4029-8fb3-0bc7deab1e9f@amd.com/
> [2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/38421428-84cb-b67e-f3ce-b7a0233e016b@amd.com/
  
Moger, Babu Feb. 2, 2024, 9:57 p.m. UTC | #5
Hi Reinette,

On 2/1/2024 10:09 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/19/2024 10:22 AM, Babu Moger wrote:
>> These series adds the support for Assignable Bandwidth Monitoring Counters
> Not a good start ([1]).

Yea. My bad.

>
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> The feature details are documented in the  APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no longer
>> being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>      
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can be give guaranteed monitoring numbers.  With ever
> "can be given"?

"can give guaranteed monitoring numbers."

I feel this looks better.

>
>> changing configurations there is no way to definitely know which of these
>> groups will be active for certain point of time. Users do not have the
>> option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>      
>> The ABMC feature provides an option to the user to assign an RMID to the
>> hardware counter and monitor the bandwidth for a longer duration.
>> The assigned RMID will be active until the user unassigns it manually.
>> There is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
>>
>> Without ABMC enabled, monitoring will work in current mode without
>> assignment option.
>>
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can assign a maximum
>> of 2 ABMC counters per group. User will also have the option to assign only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to unassign an already
>> assigned counter to make space for new assignments.
>>
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> 	#mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> 	#cat /sys/fs/resctrl/info/L3_MON/mon_features
>> 	llc_occupancy
>> 	mbm_total_bytes
>> 	mbm_total_bytes_config
>> 	mbm_local_bytes
>> 	mbm_local_bytes_config
>> 	mbm_assign_capable ←  Linux kernel detected ABMC feature
>>
>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>>     Monitoring works in legacy monitor mode when ABMC is not enabled.
>>
>> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 	0
>>
> With the introduction of "mbm_assign_enable" the entry in mon_features seems
> to provide duplicate information.

ok. We can remove the text in mon_features and keep mbm_assign_enable. 
We need this to enable and disable the feature.

>
>> c. There will be new file "monitor_state" for each monitor group when ABMC
>>     feature is supported. However, monitor_state is not available if ABMC is
>>     disabled.
>> 	
>> 	#cat /sys/fs/resctrl/monitor_state
>> 	Unsupported
> This sounds potentially confusing since users will still be able to monitor
> the groups ...
How about "Assignment-Unsupported"?
>
>> 	
>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>>     enabled, monitoring will work in current mode without assignment option.
>> 	
>> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	779247936
>> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 	765207488
>> 	
>> e. Enable ABMC mode.
>>
>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>          #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>          1
>>
>> f. Read the monitor states. By default, both total and local MBM
>> 	events are in "unassign" state.
>> 	
>> 	#cat /sys/fs/resctrl/monitor_state
>> 	total=unassign;local=unassign
> This interface does not seem to take into account that hardware
> can support assignment per domain. I understand that this is
> not something you want to implement at this time but the user interface
> has to accommodate such an enhancement. This was already mentioned, and
> you did acknowledge the point [3] to this new version that does not
> reflect this is unexpected.

Yea. Domain level assignment is not supported at this point. Do you want 
me to explicitly mention here?

Please elaborate what you meant here.

>
> My previous suggestions do seem to still stand and and I also am not able to
> see how Peter's requests [2] were considered. This same interface needs to
> accommodate usages apart from ABMC. For example, how to use this interface
> to address the same counter issue on AMD hardware without ABMC, and MPAM
> (pending James's feedback).

Yea. Agree. Peter's comments are not addressed. I am not all clear about 
details of Peters and James requirement.

With respect to ABMC here are my requirements.

a.  Assignment needs to be done at group level.

b. User should be able to assign each event individually. Multiple 
events assignment(in one command) should be supported.

c. I have no plans to implement domain level assignment. It is done at 
system level.

d. We need only couple of states.  Assigned and unassigned.

e. monitor_state is name of file for user interface. We can change that 
based on comments.

Peter, James,

Please comment on what you want achieve in "assignment" based on the 
features you are working on.

Do you want to add new states?

>
> I understand that until we hear from Arm we do not know all the requirements
> that this interface needs to support, but I do expect this interface to
> at least consider requirements and usage scenarios that are already known.

Sure. Will try that in the next version. Lets continue the discussion.


>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>>     the MBA events are not available until the user assigns the events
>>     explicitly.
>> 	
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	Unsupported
>> 	
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 	Unsupported
>>
> This needs some more thought to accommodate Peter's scenario where the counter
> can be expected to return the final count after the counter is disabled.

I am not sure how to achieve this with ABMC. This may be applicable to 
soft rmid only. In case of "soft rmid", previous readings are saved in 
the soft rmid state.

>
>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>>     read the llc_occupancy.
>>
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>> 	557056
>>
>> i. Now assign the total event and read the monitor_state.
>> 	
>> 	#echo total=assign > /sys/fs/resctrl/monitor_state
>> 	#cat /sys/fs/resctrl/monitor_state
>> 	total=assign;local=unassign
>> 	
> I do not see the "global assign/unassign" scenario addressed.

I am not all clear on meaning of "global assign/unassign".  Does it mean 
looping thru all the groups and assign the RMIDs?

It may not work in many cases. In case of ABMC, we have only limited 
number of hw counters. It will fail after hardware runs out of counters. 
It is better done selectively based on which group user is interested in.

But it can be done later if we find a use case for that.

>
> This version seems to ignore (without discussion) a lot of earlier
> feedback.

Please feel free comment. There are various threads of discussion. I may 
have missed.

Thanks

Babu

>
> Reinette
>
> [1] https://lore.kernel.org/lkml/5ce67d8f-e207-4029-8fb3-0bc7deab1e9f@amd.com/
> [2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/38421428-84cb-b67e-f3ce-b7a0233e016b@amd.com/
  
Reinette Chatre Feb. 5, 2024, 10:38 p.m. UTC | #6
Hi Babu,

On 2/2/2024 1:57 PM, Moger, Babu wrote:
> On 2/1/2024 10:09 PM, Reinette Chatre wrote:
>> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>> These series adds the support for Assignable Bandwidth Monitoring Counters
>> Not a good start ([1]).
> 
> Yea. My bad.
> 
>>
>>> (ABMC). It is also called QoS RMID Pinning feature
>>>
>>> The feature details are documented in the  APM listed below [1].
>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>> Monitoring (ABMC). The documentation is available at
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>
>>> The patches are based on top of commit
>>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>>
>>> # Introduction
>>>
>>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>>> feature only guarantees that RMIDs currently assigned to a processor will
>>> be tracked by hardware. The counters of any other RMIDs which are no longer
>>> being tracked will be reset to zero. The MBM event counters return
>>> "Unavailable" for the RMIDs that are not active.
>>>      Users can create 256 or more monitor groups. But there can be only limited
>>> number of groups that can be give guaranteed monitoring numbers.  With ever
>> "can be given"?
> 
> "can give guaranteed monitoring numbers."
> 
> I feel this looks better.

Sounds good. Thank you.

> 
>>
>>> changing configurations there is no way to definitely know which of these
>>> groups will be active for certain point of time. Users do not have the
>>> option to monitor a group or set of groups for certain period of time
>>> without worrying about RMID being reset in between.
>>>      The ABMC feature provides an option to the user to assign an RMID to the
>>> hardware counter and monitor the bandwidth for a longer duration.
>>> The assigned RMID will be active until the user unassigns it manually.
>>> There is no need to worry about counters being reset during this period.
>>> Additionally, the user can specify a bitmask identifying the specific
>>> bandwidth types from the given source to track with the counter.
>>>
>>> Without ABMC enabled, monitoring will work in current mode without
>>> assignment option.
>>>
>>> # Linux Implementation
>>>
>>> Linux resctrl subsystem provides the interface to count maximum of two
>>> memory bandwidth events per group, from a combination of available total
>>> and local events. Keeping the current interface, users can assign a maximum
>>> of 2 ABMC counters per group. User will also have the option to assign only
>>> one counter to the group. If the system runs out of assignable ABMC
>>> counters, kernel will display an error. Users need to unassign an already
>>> assigned counter to make space for new assignments.
>>>
>>>
>>> # Examples
>>>
>>> a. Check if ABMC support is available
>>>     #mount -t resctrl resctrl /sys/fs/resctrl/
>>>
>>>     #cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>     llc_occupancy
>>>     mbm_total_bytes
>>>     mbm_total_bytes_config
>>>     mbm_local_bytes
>>>     mbm_local_bytes_config
>>>     mbm_assign_capable ←  Linux kernel detected ABMC feature
>>>
>>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>>>     Monitoring works in legacy monitor mode when ABMC is not enabled.
>>>
>>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>     0
>>>
>> With the introduction of "mbm_assign_enable" the entry in mon_features seems
>> to provide duplicate information.
> 
> ok. We can remove the text in mon_features and keep mbm_assign_enable. We need this to enable and disable the feature.

This could be improved beyond a binary "enable"/"disable" interface to user space.
For example, the hardware can discover which "mbm counter assign" related feature
(I'm counting the "soft RMID" here as one of the "mbm counter assign" related
features) is supported on the platform and it can be presented to the user like:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
[feature_1] feature_2 feature_3

The output indicates which features are supported by the platform and the brackets indicate
which feature is enabled. 


>>> c. There will be new file "monitor_state" for each monitor group when ABMC
>>>     feature is supported. However, monitor_state is not available if ABMC is
>>>     disabled.
>>>     
>>>     #cat /sys/fs/resctrl/monitor_state
>>>     Unsupported
>> This sounds potentially confusing since users will still be able to monitor
>> the groups ...
> How about "Assignment-Unsupported"?

(please see later)

>>
>>>     
>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>>>     enabled, monitoring will work in current mode without assignment option.
>>>     
>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>     779247936
>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>     765207488
>>>     
>>> e. Enable ABMC mode.
>>>
>>>     #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>          #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>          1
>>>
>>> f. Read the monitor states. By default, both total and local MBM
>>>     events are in "unassign" state.
>>>     
>>>     #cat /sys/fs/resctrl/monitor_state
>>>     total=unassign;local=unassign
>> This interface does not seem to take into account that hardware
>> can support assignment per domain. I understand that this is
>> not something you want to implement at this time but the user interface
>> has to accommodate such an enhancement. This was already mentioned, and
>> you did acknowledge the point [3] to this new version that does not
>> reflect this is unexpected.
> 
> Yea. Domain level assignment is not supported at this point. Do you want me to explicitly mention here?
> 
> Please elaborate what you meant here.

You have made it clear on several occasions that you do not intend to support
domain level assignment. That may be ok but the interface you create should
not prevent future support of domain level assignment.

If my point is not clear, could you please share how this interface is able to
support domain level assignment in the future?

I am starting to think that we need a file similar to the schemata file
for group and domain level monitor configurations. 

>> My previous suggestions do seem to still stand and and I also am not able to
>> see how Peter's requests [2] were considered. This same interface needs to
>> accommodate usages apart from ABMC. For example, how to use this interface
>> to address the same counter issue on AMD hardware without ABMC, and MPAM
>> (pending James's feedback).
> 
> Yea. Agree. Peter's comments are not addressed. I am not all clear
> about details of Peters and James requirement.

Peter listed his requirements in [1]. That email thread is a worthwhile read
for the use cases.

I believe that James is aware of this work and do hope to hear from him. 

> 
> With respect to ABMC here are my requirements.
> 
> a.  Assignment needs to be done at group level.
> 
> b. User should be able to assign each event individually. Multiple events assignment(in one command) should be supported.
> 
> c. I have no plans to implement domain level assignment. It is done at system level.
> 
> d. We need only couple of states.  Assigned and unassigned.
> 
> e. monitor_state is name of file for user interface. We can change that based on comments.
> 
> Peter, James,
> 
> Please comment on what you want achieve in "assignment" based on the features you are working on.
> 
> Do you want to add new states?
> 
>>
>> I understand that until we hear from Arm we do not know all the requirements
>> that this interface needs to support, but I do expect this interface to
>> at least consider requirements and usage scenarios that are already known.
> 
> Sure. Will try that in the next version. Lets continue the discussion.
> 
> 
>>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>>>     the MBA events are not available until the user assigns the events
>>>     explicitly.
>>>     
>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>     Unsupported
>>>     
>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>     Unsupported
>>>
>> This needs some more thought to accommodate Peter's scenario where the counter
>> can be expected to return the final count after the counter is disabled.
> 
> I am not sure how to achieve this with ABMC. This may be applicable
> to soft rmid only. In case of "soft rmid", previous readings are
> saved in the soft rmid state.

Right. Please consider this work in two parts, first, there is a generic
interface that aims to support ABMC, "soft RMID", and MPAM. Second, there
is using this interface to support ABMC.

>>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>>>     read the llc_occupancy.
>>>
>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>>>     557056
>>>
>>> i. Now assign the total event and read the monitor_state.
>>>     
>>>     #echo total=assign > /sys/fs/resctrl/monitor_state
>>>     #cat /sys/fs/resctrl/monitor_state
>>>     total=assign;local=unassign
>>>     
>> I do not see the "global assign/unassign" scenario addressed.
> 
> I am not all clear on meaning of "global assign/unassign".  Does it
> mean looping thru all the groups and assign the RMIDs?

Please see [1].

 
> It may not work in many cases. In case of ABMC, we have only limited
> number of hw counters. It will fail after hardware runs out of
> counters. It is better done selectively based on which group user is
> interested in.

Right. This is one more item where the generic interface needs to
accommodate different hardware implementations. Perhaps this could
be one of the "features" exposed by (global) mbm_assign that the
user can "enable"/"disable" on demand?

> But it can be done later if we find a use case for that.

There already exists a use case as presented by Peter in support
of AMD hardware without ABMC, no? 

>> This version seems to ignore (without discussion) a lot of earlier
>> feedback.
> 
> Please feel free comment. There are various threads of discussion. I may have missed.
> 


Reinette

[1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
  
Moger, Babu Feb. 8, 2024, 5:29 p.m. UTC | #7
Hi Reinette,

I am trying to propose few things here to move forward based on my
assumptions. Please point me if I missed something.

On 2/5/24 16:38, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/2/2024 1:57 PM, Moger, Babu wrote:
>> On 2/1/2024 10:09 PM, Reinette Chatre wrote:
>>> On 1/19/2024 10:22 AM, Babu Moger wrote:
>>>> These series adds the support for Assignable Bandwidth Monitoring Counters
>>> Not a good start ([1]).
>>
>> Yea. My bad.
>>
>>>
>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>
>>>> The feature details are documented in the  APM listed below [1].
>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>> Monitoring (ABMC). The documentation is available at
>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>
>>>> The patches are based on top of commit
>>>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>>>
>>>> # Introduction
>>>>
>>>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>>>> feature only guarantees that RMIDs currently assigned to a processor will
>>>> be tracked by hardware. The counters of any other RMIDs which are no longer
>>>> being tracked will be reset to zero. The MBM event counters return
>>>> "Unavailable" for the RMIDs that are not active.
>>>>      Users can create 256 or more monitor groups. But there can be only limited
>>>> number of groups that can be give guaranteed monitoring numbers.  With ever
>>> "can be given"?
>>
>> "can give guaranteed monitoring numbers."
>>
>> I feel this looks better.
> 
> Sounds good. Thank you.
> 
>>
>>>
>>>> changing configurations there is no way to definitely know which of these
>>>> groups will be active for certain point of time. Users do not have the
>>>> option to monitor a group or set of groups for certain period of time
>>>> without worrying about RMID being reset in between.
>>>>      The ABMC feature provides an option to the user to assign an RMID to the
>>>> hardware counter and monitor the bandwidth for a longer duration.
>>>> The assigned RMID will be active until the user unassigns it manually.
>>>> There is no need to worry about counters being reset during this period.
>>>> Additionally, the user can specify a bitmask identifying the specific
>>>> bandwidth types from the given source to track with the counter.
>>>>
>>>> Without ABMC enabled, monitoring will work in current mode without
>>>> assignment option.
>>>>
>>>> # Linux Implementation
>>>>
>>>> Linux resctrl subsystem provides the interface to count maximum of two
>>>> memory bandwidth events per group, from a combination of available total
>>>> and local events. Keeping the current interface, users can assign a maximum
>>>> of 2 ABMC counters per group. User will also have the option to assign only
>>>> one counter to the group. If the system runs out of assignable ABMC
>>>> counters, kernel will display an error. Users need to unassign an already
>>>> assigned counter to make space for new assignments.
>>>>
>>>>
>>>> # Examples
>>>>
>>>> a. Check if ABMC support is available
>>>>     #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>
>>>>     #cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>>     llc_occupancy
>>>>     mbm_total_bytes
>>>>     mbm_total_bytes_config
>>>>     mbm_local_bytes
>>>>     mbm_local_bytes_config
>>>>     mbm_assign_capable ←  Linux kernel detected ABMC feature
>>>>
>>>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>>>>     Monitoring works in legacy monitor mode when ABMC is not enabled.
>>>>
>>>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>     0
>>>>
>>> With the introduction of "mbm_assign_enable" the entry in mon_features seems
>>> to provide duplicate information.
>>
>> ok. We can remove the text in mon_features and keep mbm_assign_enable. We need this to enable and disable the feature.
> 
> This could be improved beyond a binary "enable"/"disable" interface to user space.
> For example, the hardware can discover which "mbm counter assign" related feature
> (I'm counting the "soft RMID" here as one of the "mbm counter assign" related
> features) is supported on the platform and it can be presented to the user like:
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> [feature_1] feature_2 feature_3

How about this?
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
 ABMC:Capable
 SOFT-RMID:Capable

To enable ABMC
# echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign

When ABMC is enabled:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
 ABMC:Enable
 SOFT-RMID:Capable

> The output indicates which features are supported by the platform and the brackets indicate
> which feature is enabled. 
> 
> 
>>>> c. There will be new file "monitor_state" for each monitor group when ABMC
>>>>     feature is supported. However, monitor_state is not available if ABMC is
>>>>     disabled.
>>>>     
>>>>     #cat /sys/fs/resctrl/monitor_state
>>>>     Unsupported
>>> This sounds potentially confusing since users will still be able to monitor
>>> the groups ...
>> How about "Assignment-Unsupported"?
> 
> (please see later)
> 
>>>
>>>>     
>>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>>>>     enabled, monitoring will work in current mode without assignment option.
>>>>     
>>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>     779247936
>>>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>     765207488
>>>>     
>>>> e. Enable ABMC mode.
>>>>
>>>>     #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>          #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>          1
>>>>
>>>> f. Read the monitor states. By default, both total and local MBM
>>>>     events are in "unassign" state.
>>>>     
>>>>     #cat /sys/fs/resctrl/monitor_state
>>>>     total=unassign;local=unassign
>>> This interface does not seem to take into account that hardware
>>> can support assignment per domain. I understand that this is
>>> not something you want to implement at this time but the user interface
>>> has to accommodate such an enhancement. This was already mentioned, and
>>> you did acknowledge the point [3] to this new version that does not
>>> reflect this is unexpected.
>>
>> Yea. Domain level assignment is not supported at this point. Do you want me to explicitly mention here?
>>
>> Please elaborate what you meant here.
> 
> You have made it clear on several occasions that you do not intend to support
> domain level assignment. That may be ok but the interface you create should
> not prevent future support of domain level assignment.
> 
> If my point is not clear, could you please share how this interface is able to
> support domain level assignment in the future?
> 
> I am starting to think that we need a file similar to the schemata file
> for group and domain level monitor configurations.

Something like this?

By default
#cat /sys/fs/resctrl/monitor_state
default:0=total=assign,local=assign;1=total=assign,local=assign

With ABMC,
#cat /sys/fs/resctrl/monitor_state
ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign

> 
>>> My previous suggestions do seem to still stand and and I also am not able to
>>> see how Peter's requests [2] were considered. This same interface needs to
>>> accommodate usages apart from ABMC. For example, how to use this interface
>>> to address the same counter issue on AMD hardware without ABMC, and MPAM
>>> (pending James's feedback).
>>
>> Yea. Agree. Peter's comments are not addressed. I am not all clear
>> about details of Peters and James requirement.
> 
> Peter listed his requirements in [1]. That email thread is a worthwhile read
> for the use cases.
> 
> I believe that James is aware of this work and do hope to hear from him. 
> 
>>
>> With respect to ABMC here are my requirements.
>>
>> a.  Assignment needs to be done at group level.
>>
>> b. User should be able to assign each event individually. Multiple events assignment(in one command) should be supported.
>>
>> c. I have no plans to implement domain level assignment. It is done at system level.
>>
>> d. We need only couple of states.  Assigned and unassigned.
>>
>> e. monitor_state is name of file for user interface. We can change that based on comments.
>>
>> Peter, James,
>>
>> Please comment on what you want achieve in "assignment" based on the features you are working on.
>>
>> Do you want to add new states?
>>
>>>
>>> I understand that until we hear from Arm we do not know all the requirements
>>> that this interface needs to support, but I do expect this interface to
>>> at least consider requirements and usage scenarios that are already known.
>>
>> Sure. Will try that in the next version. Lets continue the discussion.
>>
>>
>>>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>>>>     the MBA events are not available until the user assigns the events
>>>>     explicitly.
>>>>     
>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>     Unsupported
>>>>     
>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>     Unsupported
>>>>
>>> This needs some more thought to accommodate Peter's scenario where the counter
>>> can be expected to return the final count after the counter is disabled.
>>
>> I am not sure how to achieve this with ABMC. This may be applicable
>> to soft rmid only. In case of "soft rmid", previous readings are
>> saved in the soft rmid state.
> 
> Right. Please consider this work in two parts, first, there is a generic
> interface that aims to support ABMC, "soft RMID", and MPAM. Second, there
> is using this interface to support ABMC.

Yea. But it is tough without knowing all the details of the other features.

How about?

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
ABMC:Unassigned
#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
ABMC:Unassigned

> 
>>>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>>>>     read the llc_occupancy.
>>>>
>>>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
>>>>     557056
>>>>
>>>> i. Now assign the total event and read the monitor_state.
>>>>     
>>>>     #echo total=assign > /sys/fs/resctrl/monitor_state
>>>>     #cat /sys/fs/resctrl/monitor_state
>>>>     total=assign;local=unassign
>>>>     
>>> I do not see the "global assign/unassign" scenario addressed.
>>
>> I am not all clear on meaning of "global assign/unassign".  Does it
>> mean looping thru all the groups and assign the RMIDs?
> 
> Please see [1].
> 
>  
>> It may not work in many cases. In case of ABMC, we have only limited
>> number of hw counters. It will fail after hardware runs out of
>> counters. It is better done selectively based on which group user is
>> interested in.
> 
> Right. This is one more item where the generic interface needs to
> accommodate different hardware implementations. Perhaps this could
> be one of the "features" exposed by (global) mbm_assign that the
> user can "enable"/"disable" on demand?
> 
>> But it can be done later if we find a use case for that.
> 
> There already exists a use case as presented by Peter in support
> of AMD hardware without ABMC, no? 

Yes. There is a use case. But seems like the use case is mostly applicable
to soft-rmid feature.

We can tie the global assign only to soft-rmid.

# echo SOFT-RMID:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign

Because this is soft-rmid, call global assign method.

# echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign

Because this is ABMC, do the steps required just to enable ABMC.
Don't do individual assignment

> 
>>> This version seems to ignore (without discussion) a lot of earlier
>>> feedback.
>>
>> Please feel free comment. There are various threads of discussion. I may have missed.
>>
> 
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
  
Peter Newman Feb. 16, 2024, 8:18 p.m. UTC | #8
Hi Babu,

On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote:
> On 2/5/24 16:38, Reinette Chatre wrote:
> > This could be improved beyond a binary "enable"/"disable" interface to user space.
> > For example, the hardware can discover which "mbm counter assign" related feature
> > (I'm counting the "soft RMID" here as one of the "mbm counter assign" related
> > features) is supported on the platform and it can be presented to the user like:
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
> > [feature_1] feature_2 feature_3
>
> How about this?
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>  ABMC:Capable
>  SOFT-RMID:Capable
>
> To enable ABMC
> # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign
>
> When ABMC is enabled:
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>  ABMC:Enable
>  SOFT-RMID:Capable

There would be no need to use soft RMIDs on a system that supports
ABMC, so I can't think of a reason why the underlying implementation
would matter to our users. The user should only have to request the
interface where monitors must be assigned manually. The mount would
succeed if the system has a way to support the interface.


> > You have made it clear on several occasions that you do not intend to support
> > domain level assignment. That may be ok but the interface you create should
> > not prevent future support of domain level assignment.
> >
> > If my point is not clear, could you please share how this interface is able to
> > support domain level assignment in the future?
> >
> > I am starting to think that we need a file similar to the schemata file
> > for group and domain level monitor configurations.
>
> Something like this?
>
> By default
> #cat /sys/fs/resctrl/monitor_state
> default:0=total=assign,local=assign;1=total=assign,local=assign
>
> With ABMC,
> #cat /sys/fs/resctrl/monitor_state
> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign

The benefit from all the string parsing in this interface is only
halving the number of monitor_state sysfs writes we'd need compared to
creating a separate file for mbm_local and mbm_total. Given that our
use case is to assign the 32 assignable counters to read the bandwidth
of ~256 monitoring groups, this isn't a substantial gain to help us. I
think you should just focus on providing the necessary control
granularity without trying to consolidate writes in this interface. I
will propose an additional interface below to optimize our use case.

Whether mbm_total and mbm_local are combined in the group directories
or not, I don't see why you wouldn't just repeat the same file
interface in the domain directories for a user needing finer-grained
controls.


> >> Peter, James,
> >>
> >> Please comment on what you want achieve in "assignment" based on the features you are working on.

I prototyped and tested the following additional interface for the
large-scale, batch use case that we're primarily concerned about:

info/L3_MON/mbm_{local,total}_bytes_assigned

Writing a whitespace-delimited list of mongroup directory paths does
the following:
1. unassign all monitors for the given counter
2. assigns a monitor to each mongroup referenced in the write
3. batches per-domain register updates resulting from the assignments
into a single IPI for each domain

This interface allows us to do less sysfs writes and IPIs on systems
with more assignable monitoring resources, rather than doing more.

The reference to a mongroup when reading/writing the above node is the
resctrl-root-relative path to the monitoring group. There is probably
a more concise way to refer to the groups, but my prototype used
kernfs_walk_and_get() to locate each rdtgroup struct.

I would also like to add that in the software-ABMC prototype I made,
because it's based on assignment of a small number of RMIDs,
assignment results in all counters being assigned at once. On
implementations where per-counter assignments aren't possible,
assignment through such a resource would be allowed to assign more
resources than explicitly requested.

This would allow an implementation only capable of global assignment
to assign resources to all groups when a non-empty string is written
to the proposed file nodes, and all resources to be unassigned when an
empty string is written. Reading back from the file nodes would tell
the user how much was actually assigned.

Thanks!
-Peter
  
Moger, Babu Feb. 19, 2024, 6 p.m. UTC | #9
Hi Peter,

On 2/16/24 14:18, Peter Newman wrote:
> Hi Babu,
> 
> On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote:
>> On 2/5/24 16:38, Reinette Chatre wrote:
>>> This could be improved beyond a binary "enable"/"disable" interface to user space.
>>> For example, the hardware can discover which "mbm counter assign" related feature
>>> (I'm counting the "soft RMID" here as one of the "mbm counter assign" related
>>> features) is supported on the platform and it can be presented to the user like:
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>>> [feature_1] feature_2 feature_3
>>
>> How about this?
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>>  ABMC:Capable
>>  SOFT-RMID:Capable
>>
>> To enable ABMC
>> # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign
>>
>> When ABMC is enabled:
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign
>>  ABMC:Enable
>>  SOFT-RMID:Capable
> 
> There would be no need to use soft RMIDs on a system that supports
> ABMC, so I can't think of a reason why the underlying implementation
> would matter to our users. The user should only have to request the
> interface where monitors must be assigned manually. The mount would
> succeed if the system has a way to support the interface.

Ok Sure. I will exclude Soft-rmid for this interface.

For now, lets keep this only for ABMC.
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
ABMC:Capable

Or

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign
ABMC:Enable

> 
> 
>>> You have made it clear on several occasions that you do not intend to support
>>> domain level assignment. That may be ok but the interface you create should
>>> not prevent future support of domain level assignment.
>>>
>>> If my point is not clear, could you please share how this interface is able to
>>> support domain level assignment in the future?
>>>
>>> I am starting to think that we need a file similar to the schemata file
>>> for group and domain level monitor configurations.
>>
>> Something like this?
>>
>> By default
>> #cat /sys/fs/resctrl/monitor_state
>> default:0=total=assign,local=assign;1=total=assign,local=assign
>>
>> With ABMC,
>> #cat /sys/fs/resctrl/monitor_state
>> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
> 
> The benefit from all the string parsing in this interface is only
> halving the number of monitor_state sysfs writes we'd need compared to
> creating a separate file for mbm_local and mbm_total. Given that our
> use case is to assign the 32 assignable counters to read the bandwidth
> of ~256 monitoring groups, this isn't a substantial gain to help us. I
> think you should just focus on providing the necessary control
> granularity without trying to consolidate writes in this interface. I

Ok. Looks like we need to provide the interface to assign the RMIDs to
individual domains in this interface. I wasn't planning that now. But, it
can be done without much changes.

Something like this(corrected typos: replaced '=' with '-').

#cat /sys/fs/resctrl/monitor_state
ABMC:0=total-unassign,local-unassign;1=total-unassign,local-unassign

To assign:

#echo "ABMC:0=total-assign,local-assign" > /sys/fs/resctrl/monitor_state


> will propose an additional interface below to optimize our use case.
> 
> Whether mbm_total and mbm_local are combined in the group directories
> or not, I don't see why you wouldn't just repeat the same file
> interface in the domain directories for a user needing finer-grained
> controls.

I don't see the need for the same file inside each domain directory in the
group level when the above command can assign the RMIDs per domain.

> 
> 
>>>> Peter, James,
>>>>
>>>> Please comment on what you want achieve in "assignment" based on the features you are working on.
> 
> I prototyped and tested the following additional interface for the
> large-scale, batch use case that we're primarily concerned about:
> 
> info/L3_MON/mbm_{local,total}_bytes_assigned
> 
> Writing a whitespace-delimited list of mongroup directory paths does
> the following:
> 1. unassign all monitors for the given counter
> 2. assigns a monitor to each mongroup referenced in the write
> 3. batches per-domain register updates resulting from the assignments
> into a single IPI for each domain
> 
> This interface allows us to do less sysfs writes and IPIs on systems
> with more assignable monitoring resources, rather than doing more.
> 
> The reference to a mongroup when reading/writing the above node is the
> resctrl-root-relative path to the monitoring group. There is probably
> a more concise way to refer to the groups, but my prototype used
> kernfs_walk_and_get() to locate each rdtgroup struct.
> 
> I would also like to add that in the software-ABMC prototype I made,
> because it's based on assignment of a small number of RMIDs,
> assignment results in all counters being assigned at once. On
> implementations where per-counter assignments aren't possible,
> assignment through such a resource would be allowed to assign more
> resources than explicitly requested.
> 
> This would allow an implementation only capable of global assignment
> to assign resources to all groups when a non-empty string is written
> to the proposed file nodes, and all resources to be unassigned when an
> empty string is written. Reading back from the file nodes would tell
> the user how much was actually assigned.

Yes. This interface can be extended to ABMC as a global assignment option.
If you have your patches ready I can add your patches on top of my ABMC
feature.
Or if you want to add the support later then I will go ahead with current
base ABMC support.
Let me know.
  
James Morse Feb. 20, 2024, 3:21 p.m. UTC | #10
Hi Babu,

On 19/01/2024 18:22, Babu Moger wrote:
> These series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
> 
> The feature details are documented in the  APM listed below [1].
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> Monitoring (ABMC). The documentation is available at
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> 
> The patches are based on top of commit
> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
> 
> # Introduction
> 
> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
> feature only guarantees that RMIDs currently assigned to a processor will
> be tracked by hardware. The counters of any other RMIDs which are no longer
> being tracked will be reset to zero. The MBM event counters return
> "Unavailable" for the RMIDs that are not active.
>
> Users can create 256 or more monitor groups. But there can be only limited
> number of groups that can be give guaranteed monitoring numbers.  With ever
> changing configurations there is no way to definitely know which of these
> groups will be active for certain point of time. Users do not have the
> option to monitor a group or set of groups for certain period of time
> without worrying about RMID being reset in between.
>
> The ABMC feature provides an option to the user to assign an RMID to the
> hardware counter and monitor the bandwidth for a longer duration.
> The assigned RMID will be active until the user unassigns it manually.
> There is no need to worry about counters being reset during this period.
> Additionally, the user can specify a bitmask identifying the specific
> bandwidth types from the given source to track with the counter.

At a high level, if existing software can't use the counters, I'd prefer we move them into
perf. We're currently re-inventing the perf wheel. (this argument doesn't hold for the
llc_occupancy, which is a state not counter!)

But if this lets someone 'pin' the counters for the groups they monitor, then use existing
tools, that seems a good enough argument for doing this.


> Without ABMC enabled, monitoring will work in current mode without
> assignment option.

To check I understand: the counters will get spuriously reset a the whim of the hardware?


> # Linux Implementation
> 
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can assign a maximum
> of 2 ABMC counters per group. User will also have the option to assign only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to unassign an already
> assigned counter to make space for new assignments.
> 
> 
> # Examples
> 
> a. Check if ABMC support is available
> 	#mount -t resctrl resctrl /sys/fs/resctrl/
> 
> 	#cat /sys/fs/resctrl/info/L3_MON/mon_features 
> 	llc_occupancy
> 	mbm_total_bytes
> 	mbm_total_bytes_config
> 	mbm_local_bytes
> 	mbm_local_bytes_config
> 	mbm_assign_capable ←  Linux kernel detected ABMC feature
> 
> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>    Monitoring works in legacy monitor mode when ABMC is not enabled.
> 
> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
> 	0
> 
> c. There will be new file "monitor_state" for each monitor group when ABMC
>    feature is supported. However, monitor_state is not available if ABMC is
>    disabled.
> 	#cat /sys/fs/resctrl/monitor_state 
> 	Unsupported
> 	
> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>    enabled, monitoring will work in current mode without assignment option.
> 	
> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	779247936
> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
> 	765207488
> 	
> e. Enable ABMC mode.
> 
> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>         1

Why does this mode need enabling? Can't it be enabled automatically on hardware that
supports it, or enabled implicitly when the first assignment attempt arrives?

I guess this is really needed for a reset - could we implement that instead? This way
there isn't an extra step user-space has to do to make the assignments work.


> f. Read the monitor states. By default, both total and local MBM
> 	events are in "unassign" state.
> 	
> 	#cat /sys/fs/resctrl/monitor_state
> 	total=unassign;local=unassign


> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>    the MBA events are not available until the user assigns the events
>    explicitly.

How does this fit with "monitoring will work in current mode without assignment option.".
You mentioned the hardware resets the counters when this mode is enabled, does it also
refuse to count until the MSR is programmed?

If so - is there any mileage in auto-assigning the first N RMID to counters when the
groups are created? This way existing user-space tools work until they exceed the limits
of hardware. From that point a counter needs to be unassigned from another group. (we'd
need to make it easy to find which groups have a counter assigned)


> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	Unsupported
> 	
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
> 	Unsupported
> 
> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>    read the llc_occupancy.
> 
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy 
> 	557056

{
	MPAM would be the same - because llc_occupancy isn't a counter its a view of the
	state, its possible to multiplex a single llc_occupancy counter behind the scenes
	to provide the value for as many groups as needed. I suspect any other
	architecture would  have the same property.
}

> i. Now assign the total event and read the monitor_state.
> 	
> 	#echo total=assign > /sys/fs/resctrl/monitor_state
> 	#cat /sys/fs/resctrl/monitor_state 
> 	total=assign;local=unassign
> 	
> j. Now that the total event is assigned. Read the total event.
> 	
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	6136000
> 	
> k. Now assign the local event and read the monitor_state.
> 	
> 	#echo local=assign > /sys/fs/resctrl/monitor_state
> 	#cat /sys/fs/resctrl/monitor_state
> 	total=assign;local=assign
> 
>         Users can also assign both total and local events in one single
> 	command.
> 
> l. Now that both total and local events are assigned, read the events.
> 	
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	6136000
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 	58694

(the bandwidth configuration stuff is the existing BMEC support right?)

From user-space's perspective MPAM could be made to look the same.

There ought to be some indication to user-space of how many counters it can assign, this
number might be different for different resources. This won't be a problem today, but if
we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different.

MPAM platforms are unlikely to support both 'mbm_total' and 'mbm_local',  I think this is
just a documentation problem to say that mbm_local can't be configured if its not
supported - user-space can't blindly assign both.

If the configuration is changed over time - I bet user-space needs a quick way to find
where the counters are currently assigned - walking the tree to find out is a bit rubbish.
A file that lists the "control_group_name[/mon_group_name]" would help.


Thanks,

James
  
James Morse Feb. 20, 2024, 3:21 p.m. UTC | #11
Hi Peter, Babu,

On 16/02/2024 20:18, Peter Newman wrote:
> On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote:
>> On 2/5/24 16:38, Reinette Chatre wrote:
>>> You have made it clear on several occasions that you do not intend to support
>>> domain level assignment. That may be ok but the interface you create should
>>> not prevent future support of domain level assignment.
>>>
>>> If my point is not clear, could you please share how this interface is able to
>>> support domain level assignment in the future?
>>>
>>> I am starting to think that we need a file similar to the schemata file
>>> for group and domain level monitor configurations.
>>
>> Something like this?
>>
>> By default
>> #cat /sys/fs/resctrl/monitor_state
>> default:0=total=assign,local=assign;1=total=assign,local=assign
>>
>> With ABMC,
>> #cat /sys/fs/resctrl/monitor_state
>> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
> 
> The benefit from all the string parsing in this interface is only
> halving the number of monitor_state sysfs writes we'd need compared to
> creating a separate file for mbm_local and mbm_total. Given that our
> use case is to assign the 32 assignable counters to read the bandwidth
> of ~256 monitoring groups, this isn't a substantial gain to help us. I
> think you should just focus on providing the necessary control
> granularity without trying to consolidate writes in this interface. I
> will propose an additional interface below to optimize our use case.
> 
> Whether mbm_total and mbm_local are combined in the group directories
> or not, I don't see why you wouldn't just repeat the same file
> interface in the domain directories for a user needing finer-grained
> controls.

I don't follow why this has to be done globally. resctrl allows CLOSID to have different
configurations for different purposes between different domains (as long as tasks are
pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too.

MPAM can equally allocate monitors/counters per-domain. If we are ever going to have
per-domain assignment, I think its worth the extra work to do that now and avoid the extra
user-space interface baggage from the global version.


>>>> Peter, James,
>>>>
>>>> Please comment on what you want achieve in "assignment" based on the features you are working on.
> 
> I prototyped and tested the following additional interface for the
> large-scale, batch use case that we're primarily concerned about:
> 
> info/L3_MON/mbm_{local,total}_bytes_assigned
> 
> Writing a whitespace-delimited list of mongroup directory paths does

| mkdir /sys/fs/resctrl/my\ group

string parsing in the kernel is rarely fun!


> the following:
> 1. unassign all monitors for the given counter
> 2. assigns a monitor to each mongroup referenced in the write
> 3. batches per-domain register updates resulting from the assignments
> into a single IPI for each domain
> 
> This interface allows us to do less sysfs writes and IPIs on systems
> with more assignable monitoring resources, rather than doing more.
> 
> The reference to a mongroup when reading/writing the above node is the
> resctrl-root-relative path to the monitoring group. There is probably
> a more concise way to refer to the groups, but my prototype used
> kernfs_walk_and_get() to locate each rdtgroup struct.

If this file were re-used for finding where the monitors were currently allocated, using
the name would be a natural fit for building a path to un-assign one group.


> I would also like to add that in the software-ABMC prototype I made,
> because it's based on assignment of a small number of RMIDs,
> assignment results in all counters being assigned at once. On
> implementations where per-counter assignments aren't possible,
> assignment through such a resource would be allowed to assign more
> resources than explicitly requested.
> 
> This would allow an implementation only capable of global assignment

Do we know if this exists? Given the configurations have to be different for a domain, I'd
be surprised if counter configuration is somehow distributed between domains.


> to assign resources to all groups when a non-empty string is written
> to the proposed file nodes, and all resources to be unassigned when an
> empty string is written. Reading back from the file nodes would tell
> the user how much was actually assigned.

What do you mean by 'how much', is this allow to fail early? That feels a bit
counter-intuitive. As this starts with a reset, if the number of counters is known - it
should be easy for user-space to know it can only write X tokens into that file.


Thanks,

James
  
Peter Newman Feb. 20, 2024, 6:11 p.m. UTC | #12
Hi James,

On Tue, Feb 20, 2024 at 7:21 AM James Morse <james.morse@arm.com> wrote:
> On 16/02/2024 20:18, Peter Newman wrote:
> > On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote:
> >> On 2/5/24 16:38, Reinette Chatre wrote:
> >>> You have made it clear on several occasions that you do not intend to support
> >>> domain level assignment. That may be ok but the interface you create should
> >>> not prevent future support of domain level assignment.
> >>>
> >>> If my point is not clear, could you please share how this interface is able to
> >>> support domain level assignment in the future?
> >>>
> >>> I am starting to think that we need a file similar to the schemata file
> >>> for group and domain level monitor configurations.
> >>
> >> Something like this?
> >>
> >> By default
> >> #cat /sys/fs/resctrl/monitor_state
> >> default:0=total=assign,local=assign;1=total=assign,local=assign
> >>
> >> With ABMC,
> >> #cat /sys/fs/resctrl/monitor_state
> >> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
> >
> > The benefit from all the string parsing in this interface is only
> > halving the number of monitor_state sysfs writes we'd need compared to
> > creating a separate file for mbm_local and mbm_total. Given that our
> > use case is to assign the 32 assignable counters to read the bandwidth
> > of ~256 monitoring groups, this isn't a substantial gain to help us. I
> > think you should just focus on providing the necessary control
> > granularity without trying to consolidate writes in this interface. I
> > will propose an additional interface below to optimize our use case.
> >
> > Whether mbm_total and mbm_local are combined in the group directories
> > or not, I don't see why you wouldn't just repeat the same file
> > interface in the domain directories for a user needing finer-grained
> > controls.
>
> I don't follow why this has to be done globally. resctrl allows CLOSID to have different
> configurations for different purposes between different domains (as long as tasks are
> pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too.

Assigning to all domains at once would allow us to better parallelize
the resulting IPIs when we do need to iterate a small set of monitors
over a large list of groups.


> > I prototyped and tested the following additional interface for the
> > large-scale, batch use case that we're primarily concerned about:
> >
> > info/L3_MON/mbm_{local,total}_bytes_assigned
> >
> > Writing a whitespace-delimited list of mongroup directory paths does
>
> | mkdir /sys/fs/resctrl/my\ group
>
> string parsing in the kernel is rarely fun!

Hopefully restricting to a newline-delimited list will keep this fun
and easy then.

Otherwise if referring to many groups in a single write isn't a viable
path forward, I'll still need to find a way to address the
fs/syscall/IPI overhead of measuring the bandwidth of a large number
of groups.

>
>
> > the following:
> > 1. unassign all monitors for the given counter
> > 2. assigns a monitor to each mongroup referenced in the write
> > 3. batches per-domain register updates resulting from the assignments
> > into a single IPI for each domain
> >
> > This interface allows us to do less sysfs writes and IPIs on systems
> > with more assignable monitoring resources, rather than doing more.
> >
> > The reference to a mongroup when reading/writing the above node is the
> > resctrl-root-relative path to the monitoring group. There is probably
> > a more concise way to refer to the groups, but my prototype used
> > kernfs_walk_and_get() to locate each rdtgroup struct.
>
> If this file were re-used for finding where the monitors were currently allocated, using
> the name would be a natural fit for building a path to un-assign one group.
>
>
> > I would also like to add that in the software-ABMC prototype I made,
> > because it's based on assignment of a small number of RMIDs,
> > assignment results in all counters being assigned at once. On
> > implementations where per-counter assignments aren't possible,
> > assignment through such a resource would be allowed to assign more
> > resources than explicitly requested.
> >
> > This would allow an implementation only capable of global assignment
>
> Do we know if this exists? Given the configurations have to be different for a domain, I'd
> be surprised if counter configuration is somehow distributed between domains.

It's currently only a proposal[1] for mitigating the context switch
overhead cost of soft RMIDs. I'm looking at the other alternative
first, though.


> > to assign resources to all groups when a non-empty string is written
> > to the proposed file nodes, and all resources to be unassigned when an
> > empty string is written. Reading back from the file nodes would tell
> > the user how much was actually assigned.
>
> What do you mean by 'how much', is this allow to fail early? That feels a bit
> counter-intuitive. As this starts with a reset, if the number of counters is known - it
> should be easy for user-space to know it can only write X tokens into that file.

I was referring to the operation assigning more groups than requested
if the implementation is only capable of a master enable/disable for
all monitoring: reading back would indicate that all monitoring groups
are in the assigned list.

There would otherwise be an interface telling the user how many
monitors can be assigned, so there's no reason to expect this
operation to fail, short of the user doing something silly like
deleting a group while it's concurrently being assigned.

-Peter

[1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
  
James Morse Feb. 20, 2024, 6:14 p.m. UTC | #13
Hi Babu,

On 20/02/2024 15:21, James Morse wrote:
> There ought to be some indication to user-space of how many counters it can assign, this
> number might be different for different resources. This won't be a problem today, but if
> we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different.

I found this in patch 6 - sorry for the noise!


Thanks,

James
  
Moger, Babu Feb. 20, 2024, 8:48 p.m. UTC | #14
Hi James,

On 2/20/24 09:21, James Morse wrote:
> Hi Babu,
> 
> On 19/01/2024 18:22, Babu Moger wrote:
>> These series adds the support for Assignable Bandwidth Monitoring Counters
>> (ABMC). It is also called QoS RMID Pinning feature
>>
>> The feature details are documented in the  APM listed below [1].
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>> Monitoring (ABMC). The documentation is available at
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>
>> The patches are based on top of commit
>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master)
>>
>> # Introduction
>>
>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring
>> feature only guarantees that RMIDs currently assigned to a processor will
>> be tracked by hardware. The counters of any other RMIDs which are no longer
>> being tracked will be reset to zero. The MBM event counters return
>> "Unavailable" for the RMIDs that are not active.
>>
>> Users can create 256 or more monitor groups. But there can be only limited
>> number of groups that can be give guaranteed monitoring numbers.  With ever
>> changing configurations there is no way to definitely know which of these
>> groups will be active for certain point of time. Users do not have the
>> option to monitor a group or set of groups for certain period of time
>> without worrying about RMID being reset in between.
>>
>> The ABMC feature provides an option to the user to assign an RMID to the
>> hardware counter and monitor the bandwidth for a longer duration.
>> The assigned RMID will be active until the user unassigns it manually.
>> There is no need to worry about counters being reset during this period.
>> Additionally, the user can specify a bitmask identifying the specific
>> bandwidth types from the given source to track with the counter.
> 
> At a high level, if existing software can't use the counters, I'd prefer we move them into
> perf. We're currently re-inventing the perf wheel. (this argument doesn't hold for the
> llc_occupancy, which is a state not counter!)
> 
> But if this lets someone 'pin' the counters for the groups they monitor, then use existing
> tools, that seems a good enough argument for doing this.

Not sure if I understand this. Yes. This feature provides the option to
pin the counters to the monitor group.

> 
> 
>> Without ABMC enabled, monitoring will work in current mode without
>> assignment option.
> 
> To check I understand: the counters will get spuriously reset a the whim of the hardware?

[1] Not spuriously. Hardware can keep track of certain number of counters
active simultaneously (active counters). If there are more monitor groups
than the hardware can track, then only most recent associations are kept
active. The active set can change based on user actions(RMID association
changes from user).

This feature can help to pin a counter so it does not reset.

> 
> 
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can assign a maximum
>> of 2 ABMC counters per group. User will also have the option to assign only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to unassign an already
>> assigned counter to make space for new assignments.
>>
>>
>> # Examples
>>
>> a. Check if ABMC support is available
>> 	#mount -t resctrl resctrl /sys/fs/resctrl/
>>
>> 	#cat /sys/fs/resctrl/info/L3_MON/mon_features 
>> 	llc_occupancy
>> 	mbm_total_bytes
>> 	mbm_total_bytes_config
>> 	mbm_local_bytes
>> 	mbm_local_bytes_config
>> 	mbm_assign_capable ←  Linux kernel detected ABMC feature
>>
>> b. Check if ABMC is enabled. By default, ABMC feature is disabled.
>>    Monitoring works in legacy monitor mode when ABMC is not enabled.
>>
>> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>> 	0
>>
>> c. There will be new file "monitor_state" for each monitor group when ABMC
>>    feature is supported. However, monitor_state is not available if ABMC is
>>    disabled.
>> 	#cat /sys/fs/resctrl/monitor_state 
>> 	Unsupported
>> 	
>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC
>>    enabled, monitoring will work in current mode without assignment option.
>> 	
>> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	779247936
>> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
>> 	765207488
>> 	
>> e. Enable ABMC mode.
>>
>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>         1
> 
> Why does this mode need enabling? Can't it be enabled automatically on hardware that
> supports it, or enabled implicitly when the first assignment attempt arrives?
> 
> I guess this is really needed for a reset - could we implement that instead? This way
> there isn't an extra step user-space has to do to make the assignments work.

Mostly the new features are added as an opt-in method. So, kept it that
way. If we enable this feature automatically, then we have provide an
option to disable it.

> 
> 
>> f. Read the monitor states. By default, both total and local MBM
>> 	events are in "unassign" state.
>> 	
>> 	#cat /sys/fs/resctrl/monitor_state
>> 	total=unassign;local=unassign
> 
> 
>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode,
>>    the MBA events are not available until the user assigns the events
>>    explicitly.
> 
> How does this fit with "monitoring will work in current mode without assignment option.".

See my response above. [1]

> You mentioned the hardware resets the counters when this mode is enabled, does it also
> refuse to count until the MSR is programmed?

Yes. That is correct. We need to program the MSRs to start counting again.

> 
> If so - is there any mileage in auto-assigning the first N RMID to counters when the
> groups are created? This way existing user-space tools work until they exceed the limits
> of hardware. From that point a counter needs to be unassigned from another group. (we'd
> need to make it easy to find which groups have a counter assigned)

Yes. That is correct. To see the state of assignment, I have added a
monitor_state in each group to see if the counters are assigned to that group.

> 
> 
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	Unsupported
>> 	
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
>> 	Unsupported
>>
>> h. The event llc_occupancy is not affected by ABMC mode. Users can still
>>    read the llc_occupancy.
>>
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy 
>> 	557056
> 
> {
> 	MPAM would be the same - because llc_occupancy isn't a counter its a view of the
> 	state, its possible to multiplex a single llc_occupancy counter behind the scenes
> 	to provide the value for as many groups as needed. I suspect any other
> 	architecture would  have the same property.

ok. Good to know.

> }
> 
>> i. Now assign the total event and read the monitor_state.
>> 	
>> 	#echo total=assign > /sys/fs/resctrl/monitor_state
>> 	#cat /sys/fs/resctrl/monitor_state 
>> 	total=assign;local=unassign
>> 	
>> j. Now that the total event is assigned. Read the total event.
>> 	
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	6136000
>> 	
>> k. Now assign the local event and read the monitor_state.
>> 	
>> 	#echo local=assign > /sys/fs/resctrl/monitor_state
>> 	#cat /sys/fs/resctrl/monitor_state
>> 	total=assign;local=assign
>>
>>         Users can also assign both total and local events in one single
>> 	command.
>>
>> l. Now that both total and local events are assigned, read the events.
>> 	
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>> 	6136000
>> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> 	58694
> 
> (the bandwidth configuration stuff is the existing BMEC support right?)

Yes. correct.

> 
> From user-space's perspective MPAM could be made to look the same.
> 
> There ought to be some indication to user-space of how many counters it can assign, this
> number might be different for different resources. This won't be a problem today, but if
> we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different.
> 
> MPAM platforms are unlikely to support both 'mbm_total' and 'mbm_local',  I think this is

Ok. Good to know.

> just a documentation problem to say that mbm_local can't be configured if its not
> supported - user-space can't blindly assign both.
> 
> If the configuration is changed over time - I bet user-space needs a quick way to find
> where the counters are currently assigned - walking the tree to find out is a bit rubbish.
> A file that lists the "control_group_name[/mon_group_name]" would help.

Looks like you already found in here.

https://lore.kernel.org/lkml/c16cac16c813a203390229d77d5ab37ebc923d95.1705688539.git.babu.moger@amd.com/

> 
> 
> Thanks,
> 
> James
  
Reinette Chatre Feb. 23, 2024, 5:17 p.m. UTC | #15
On 2/20/2024 12:48 PM, Moger, Babu wrote:
> On 2/20/24 09:21, James Morse wrote:
>> On 19/01/2024 18:22, Babu Moger wrote:

>>> e. Enable ABMC mode.
>>>
>>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>         1
>>
>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>
>> I guess this is really needed for a reset - could we implement that instead? This way
>> there isn't an extra step user-space has to do to make the assignments work.
> 
> Mostly the new features are added as an opt-in method. So, kept it that
> way. If we enable this feature automatically, then we have provide an
> option to disable it.
> 

At the same time it sounds to me like ABMC can improve current users'
experience without requiring them to do anything. This sounds appealing.
For example, if I understand correctly, it may be possible to start resctrl
with ABMC enabled by default and the number of monitoring groups (currently
exposed to user space via "num_rmids") limited to the number of counters
supported by ABMC. Existing users would then by default obtain better behavior
of counters not resetting.

The "new feature" could then be viewed as adding support for more monitoring
groups than what hardware can support concurrently.

Reinette
  
Moger, Babu Feb. 23, 2024, 8:11 p.m. UTC | #16
Hi Reinette,

On 2/23/24 11:17, Reinette Chatre wrote:
> 
> 
> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>> On 2/20/24 09:21, James Morse wrote:
>>> On 19/01/2024 18:22, Babu Moger wrote:
> 
>>>> e. Enable ABMC mode.
>>>>
>>>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>         1
>>>
>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>
>>> I guess this is really needed for a reset - could we implement that instead? This way
>>> there isn't an extra step user-space has to do to make the assignments work.
>>
>> Mostly the new features are added as an opt-in method. So, kept it that
>> way. If we enable this feature automatically, then we have provide an
>> option to disable it.
>>
> 
> At the same time it sounds to me like ABMC can improve current users'
> experience without requiring them to do anything. This sounds appealing.
> For example, if I understand correctly, it may be possible to start resctrl
> with ABMC enabled by default and the number of monitoring groups (currently
> exposed to user space via "num_rmids") limited to the number of counters
> supported by ABMC. Existing users would then by default obtain better behavior
> of counters not resetting.

Yes, I like the idea. But i will break compatibility with pqos
tool(intel_cmt_cat utility). pqos tool monitoring will not work without
supporting ABMC enablement in the tool. ABMC feature requires an extra
step to assign the counters for monitor to work.

> 
> The "new feature" could then be viewed as adding support for more monitoring
> groups than what hardware can support concurrently.
> 
> Reinette
  
Moger, Babu Feb. 23, 2024, 9:47 p.m. UTC | #17
On 2/20/2024 12:11 PM, Peter Newman wrote:
> Hi James,
>
> On Tue, Feb 20, 2024 at 7:21 AM James Morse <james.morse@arm.com> wrote:
>> On 16/02/2024 20:18, Peter Newman wrote:
>>> On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote:
>>>> On 2/5/24 16:38, Reinette Chatre wrote:
>>>>> You have made it clear on several occasions that you do not intend to support
>>>>> domain level assignment. That may be ok but the interface you create should
>>>>> not prevent future support of domain level assignment.
>>>>>
>>>>> If my point is not clear, could you please share how this interface is able to
>>>>> support domain level assignment in the future?
>>>>>
>>>>> I am starting to think that we need a file similar to the schemata file
>>>>> for group and domain level monitor configurations.
>>>> Something like this?
>>>>
>>>> By default
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> default:0=total=assign,local=assign;1=total=assign,local=assign
>>>>
>>>> With ABMC,
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign
>>> The benefit from all the string parsing in this interface is only
>>> halving the number of monitor_state sysfs writes we'd need compared to
>>> creating a separate file for mbm_local and mbm_total. Given that our
>>> use case is to assign the 32 assignable counters to read the bandwidth
>>> of ~256 monitoring groups, this isn't a substantial gain to help us. I
>>> think you should just focus on providing the necessary control
>>> granularity without trying to consolidate writes in this interface. I
>>> will propose an additional interface below to optimize our use case.
>>>
>>> Whether mbm_total and mbm_local are combined in the group directories
>>> or not, I don't see why you wouldn't just repeat the same file
>>> interface in the domain directories for a user needing finer-grained
>>> controls.
>> I don't follow why this has to be done globally. resctrl allows CLOSID to have different
>> configurations for different purposes between different domains (as long as tasks are
>> pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too.
> Assigning to all domains at once would allow us to better parallelize
> the resulting IPIs when we do need to iterate a small set of monitors
> over a large list of groups.

Planning to work on v3 of this series. For now, I will exclude the 
global assignment option from this series.

We can add the global assignment support when we get time to optimize 
assignments at later point.

Thanks

Babu Moger
  
Reinette Chatre Feb. 23, 2024, 10:21 p.m. UTC | #18
Hi Babu,

On 2/23/2024 12:11 PM, Moger, Babu wrote:
> On 2/23/24 11:17, Reinette Chatre wrote:
>>
>>
>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>> On 2/20/24 09:21, James Morse wrote:
>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>
>>>>> e. Enable ABMC mode.
>>>>>
>>>>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>         1
>>>>
>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>
>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>
>>> Mostly the new features are added as an opt-in method. So, kept it that
>>> way. If we enable this feature automatically, then we have provide an
>>> option to disable it.
>>>
>>
>> At the same time it sounds to me like ABMC can improve current users'
>> experience without requiring them to do anything. This sounds appealing.
>> For example, if I understand correctly, it may be possible to start resctrl
>> with ABMC enabled by default and the number of monitoring groups (currently
>> exposed to user space via "num_rmids") limited to the number of counters
>> supported by ABMC. Existing users would then by default obtain better behavior
>> of counters not resetting.
> 
> Yes, I like the idea. But i will break compatibility with pqos
> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
> supporting ABMC enablement in the tool. ABMC feature requires an extra
> step to assign the counters for monitor to work.

I am considering two scenarios, the "default behavior" is what a user will
experience when booting resctrl on an ABMC system and the "new feature
behavior" where a user can take full advantage of all that ABMC (and soft
RMID, and MPAM) can offer.

So, first, on an ABMC system in the "default behavior" scenario I expect
that resctrl can do required ABMC counter configuration automatically at
the time a monitor group is created. In this "default behavior" scenario
resctrl would expose "num_rmids" to be half of the number of assignable
counters. When a user then creates a monitor group two counters will be
used and configured to count the local and total bytes respectively. If
two counters are not available then ENOSPC returned, just like when system
is out of closid/rmid.  With this "default behavior" user space thus gets
improved behavior without making any changes on its part. I do not have
insight into how many counters ABMC could be expected to expose though ...
so some users may be surprised at how few monitor groups can be created
with new hardware? This may not be an issue since that would accurately
reflect how many _reliable_ monitor groups can be created and if user needs
more monitor groups then that would be a time to explore the "new feature"
that requires changes in how user interacts with resctrl.

Apart from the "default behavior" there are two options to consider ...
(a) the "original" behavior(? I do not know what to call it) - this would be
    where user space wants(?) to have the current non-ABMC behavior on an ABMC
    system, where the previous "num_rmids" monitor groups can be created but
    the counters are reset unpredictably ... should this still be supported
    on ABMC systems though?
(b) the "new feature" behavior where user space gets full benefit of ABMC
    that allows user space to create any number of monitor groups but then
    user space needs to let hardware (via resctrl) know which
    events should be counted.

I expect that only (b) above would require user space change. Considering
that per documentation, "num_rmids" means "This is the upper bound for how
many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
becomes undefined when "new feature" is enabled. When this new feature is enabled
then user space is no longer limited by number of RMIDs on how many monitor
groups can be created and this is the point that the user interface that you
and Peter have ideas about comes into play. Specifically, user space needing
a way to specify:
(a) "let me create more monitor groups that the hardware can support"/"let me
     control which events/monitor groups are counted"
     (like the "mbm_assign" file in your proposal)
(b) "here are the events that need to be counted" 
     (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)

Reinette
  
Moger, Babu Feb. 26, 2024, 5:59 p.m. UTC | #19
Hi Reinette,

On 2/23/24 16:21, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/23/2024 12:11 PM, Moger, Babu wrote:
>> On 2/23/24 11:17, Reinette Chatre wrote:
>>>
>>>
>>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>>> On 2/20/24 09:21, James Morse wrote:
>>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>>
>>>>>> e. Enable ABMC mode.
>>>>>>
>>>>>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>         1
>>>>>
>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>>
>>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>>
>>>> Mostly the new features are added as an opt-in method. So, kept it that
>>>> way. If we enable this feature automatically, then we have provide an
>>>> option to disable it.
>>>>
>>>
>>> At the same time it sounds to me like ABMC can improve current users'
>>> experience without requiring them to do anything. This sounds appealing.
>>> For example, if I understand correctly, it may be possible to start resctrl
>>> with ABMC enabled by default and the number of monitoring groups (currently
>>> exposed to user space via "num_rmids") limited to the number of counters
>>> supported by ABMC. Existing users would then by default obtain better behavior
>>> of counters not resetting.
>>
>> Yes, I like the idea. But i will break compatibility with pqos
>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
>> supporting ABMC enablement in the tool. ABMC feature requires an extra
>> step to assign the counters for monitor to work.
> 
> I am considering two scenarios, the "default behavior" is what a user will
> experience when booting resctrl on an ABMC system and the "new feature
> behavior" where a user can take full advantage of all that ABMC (and soft
> RMID, and MPAM) can offer.
> 
> So, first, on an ABMC system in the "default behavior" scenario I expect
> that resctrl can do required ABMC counter configuration automatically at
> the time a monitor group is created. In this "default behavior" scenario
> resctrl would expose "num_rmids" to be half of the number of assignable
> counters. When a user then creates a monitor group two counters will be
> used and configured to count the local and total bytes respectively. If
> two counters are not available then ENOSPC returned, just like when system
> is out of closid/rmid.  With this "default behavior" user space thus gets
> improved behavior without making any changes on its part. I do not have

We can automatically assign the h/w counter when monitor group is created
until we run out of h/w counters. That is good idea. By default user will
not notice any difference in ABMC mode.

> insight into how many counters ABMC could be expected to expose though ...
> so some users may be surprised at how few monitor groups can be created
> with new hardware? This may not be an issue since that would accurately
> reflect how many _reliable_ monitor groups can be created and if user needs
> more monitor groups then that would be a time to explore the "new feature"
> that requires changes in how user interacts with resctrl.

Currently, 32 h/w counters are available to configure. With two counters
for each group, we can create 16 groups(15 new groups plus the default
group). That should be fine as pqos tool creates only 16 groups when it is
started.

> 
> Apart from the "default behavior" there are two options to consider ...
> (a) the "original" behavior(? I do not know what to call it) - this would be
>     where user space wants(?) to have the current non-ABMC behavior on an ABMC
>     system, where the previous "num_rmids" monitor groups can be created but
>     the counters are reset unpredictably ... should this still be supported
>     on ABMC systems though?

I would say yes. For some reason user(hardware or software issues) is not
able to use ABMC mode, they have an option to go back to legacy mode.

> (b) the "new feature" behavior where user space gets full benefit of ABMC
>     that allows user space to create any number of monitor groups but then
>     user space needs to let hardware (via resctrl) know which
>     events should be counted.

Is this "new feature" is enabled by default when ABMC is available?

Or we need to provide an interface to enable this feature?


> 
> I expect that only (b) above would require user space change. Considering
> that per documentation, "num_rmids" means "This is the upper bound for how
> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
> becomes undefined when "new feature" is enabled. When this new feature is enabled
> then user space is no longer limited by number of RMIDs on how many monitor

With ABMC, we will have a new field "mbm_assignable_counters". We don't
have to change the definition of "num_rmids".

> groups can be created and this is the point that the user interface that you
> and Peter have ideas about comes into play. Specifically, user space needing
> a way to specify:
> (a) "let me create more monitor groups that the hardware can support"/"let me
>      control which events/monitor groups are counted"
>      (like the "mbm_assign" file in your proposal)
> (b) "here are the events that need to be counted" 
>      (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)

With global assignment option out of way for now(may be introduced later),
we can provide two interfaces.

1. /sys/fs/resctrl/info/L3_MON/mbm_assign
This will be enabled by default when ABMC is available. Users can disable
this option to go back to legacy mode.

2. /sys/fs/resctrl/monitor_state.
This can used to individually assign or unassign the counters in each group.

When assigned:
#cat /sys/fs/resctrl/monitor_state
0=total-assign,local-assign;1=total-assign,local-assign

When unassigned:
#cat /sys/fs/resctrl/monitor_state
0=total-unassign,local-unassign;1=total-unassign,local-unassign


Thoughts?
  
Reinette Chatre Feb. 26, 2024, 9:20 p.m. UTC | #20
Hi Babu,

On 2/26/2024 9:59 AM, Moger, Babu wrote:
> On 2/23/24 16:21, Reinette Chatre wrote:
>> On 2/23/2024 12:11 PM, Moger, Babu wrote:
>>> On 2/23/24 11:17, Reinette Chatre wrote:
>>>>
>>>>
>>>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>>>> On 2/20/24 09:21, James Morse wrote:
>>>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>>>
>>>>>>> e. Enable ABMC mode.
>>>>>>>
>>>>>>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>         1
>>>>>>
>>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>>>
>>>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>>>
>>>>> Mostly the new features are added as an opt-in method. So, kept it that
>>>>> way. If we enable this feature automatically, then we have provide an
>>>>> option to disable it.
>>>>>
>>>>
>>>> At the same time it sounds to me like ABMC can improve current users'
>>>> experience without requiring them to do anything. This sounds appealing.
>>>> For example, if I understand correctly, it may be possible to start resctrl
>>>> with ABMC enabled by default and the number of monitoring groups (currently
>>>> exposed to user space via "num_rmids") limited to the number of counters
>>>> supported by ABMC. Existing users would then by default obtain better behavior
>>>> of counters not resetting.
>>>
>>> Yes, I like the idea. But i will break compatibility with pqos
>>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
>>> supporting ABMC enablement in the tool. ABMC feature requires an extra
>>> step to assign the counters for monitor to work.
>>
>> I am considering two scenarios, the "default behavior" is what a user will
>> experience when booting resctrl on an ABMC system and the "new feature
>> behavior" where a user can take full advantage of all that ABMC (and soft
>> RMID, and MPAM) can offer.
>>
>> So, first, on an ABMC system in the "default behavior" scenario I expect
>> that resctrl can do required ABMC counter configuration automatically at
>> the time a monitor group is created. In this "default behavior" scenario
>> resctrl would expose "num_rmids" to be half of the number of assignable
>> counters. When a user then creates a monitor group two counters will be
>> used and configured to count the local and total bytes respectively. If
>> two counters are not available then ENOSPC returned, just like when system
>> is out of closid/rmid.  With this "default behavior" user space thus gets
>> improved behavior without making any changes on its part. I do not have
> 
> We can automatically assign the h/w counter when monitor group is created
> until we run out of h/w counters. That is good idea. By default user will
> not notice any difference in ABMC mode.
> 
>> insight into how many counters ABMC could be expected to expose though ...
>> so some users may be surprised at how few monitor groups can be created
>> with new hardware? This may not be an issue since that would accurately
>> reflect how many _reliable_ monitor groups can be created and if user needs
>> more monitor groups then that would be a time to explore the "new feature"
>> that requires changes in how user interacts with resctrl.
> 
> Currently, 32 h/w counters are available to configure. With two counters
> for each group, we can create 16 groups(15 new groups plus the default
> group). That should be fine as pqos tool creates only 16 groups when it is
> started.

user space can never assume that a certain number of groups can
be created. 

>> Apart from the "default behavior" there are two options to consider ...
>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>     where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>     system, where the previous "num_rmids" monitor groups can be created but
>>     the counters are reset unpredictably ... should this still be supported
>>     on ABMC systems though?
> 
> I would say yes. For some reason user(hardware or software issues) is not
> able to use ABMC mode, they have an option to go back to legacy mode.

I see. Should this perhaps be protected behind the resctrl "debug" mount option?

>> (b) the "new feature" behavior where user space gets full benefit of ABMC
>>     that allows user space to create any number of monitor groups but then
>>     user space needs to let hardware (via resctrl) know which
>>     events should be counted.
> 
> Is this "new feature" is enabled by default when ABMC is available?

Not in this design, no. In these scenarios ABMC will be available and enabled
in both the "default" and "new feature" behavior. The difference is no user
space changes are needed in "default" scenario and resctrl limits the number
of monitor groups to support all monitor groups to be backed by hardware
counters. 
When "new feature" is enabled when ABMC is available and enabled then
user space is able to create more monitor groups than available hardware
counters and new user interface is required to manage associating counters
with monitor events.

> 
> Or we need to provide an interface to enable this feature?

Yes, an interface will be needed to enable this feature.

> 
> 
>>
>> I expect that only (b) above would require user space change. Considering
>> that per documentation, "num_rmids" means "This is the upper bound for how
>> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
>> becomes undefined when "new feature" is enabled. When this new feature is enabled
>> then user space is no longer limited by number of RMIDs on how many monitor
> 
> With ABMC, we will have a new field "mbm_assignable_counters". We don't
> have to change the definition of "num_rmids".

The problem here is that "num_rmids" is (as per Documentation/arch/x86/resctrl.rst)
documented to be an upper bound for how many monitor groups can be created.
As I understand, when ABMC is enabled and its full capability exposed to user
space then there is no limit to how many monitor groups can be created, no?

For example, if I understand correctly, theoretically, when ABMC is enabled then
"num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
is not unsigned, tbd if number of directories may also be limited by kernfs).
User space could theoretically create more monitor groups than the number of
rmids that a resource claims to support using current upstream enumeration.
Instead, it is the "mbm_assignable_counters" that is of interest, that is what
user space uses to determine how many of the (potentially very large number of)
monitor groups/monitor events can be counted at any particular time.

>> groups can be created and this is the point that the user interface that you
>> and Peter have ideas about comes into play. Specifically, user space needing
>> a way to specify:
>> (a) "let me create more monitor groups that the hardware can support"/"let me
>>      control which events/monitor groups are counted"
>>      (like the "mbm_assign" file in your proposal)
>> (b) "here are the events that need to be counted" 
>>      (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)
> 
> With global assignment option out of way for now(may be introduced later),
> we can provide two interfaces.
> 
> 1. /sys/fs/resctrl/info/L3_MON/mbm_assign
> This will be enabled by default when ABMC is available. Users can disable
> this option to go back to legacy mode.

Potentially (all naming placeholders that will only be visible on systems that
actually supports particular mode):
legacy [default] new_feature soft_rmid

> 
> 2. /sys/fs/resctrl/monitor_state.
> This can used to individually assign or unassign the counters in each group.
> 
> When assigned:
> #cat /sys/fs/resctrl/monitor_state
> 0=total-assign,local-assign;1=total-assign,local-assign
> 
> When unassigned:
> #cat /sys/fs/resctrl/monitor_state
> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
> 
> 
> Thoughts?

How do you expect this interface to be used? I understand the mechanics
of this interface but on a higher level, do you expect user space to
once in a while assign a new counter to a single event or monitor group
(for which a fine grained interface works) or do you expect user space to
shift multiple counters across several monitor events at intervals?

Across resctrl's lifetime we have seen examples of user space wanting
to accomplish more with a single resctrl interaction. For example moving
multiple tasks to a group that you added support for and moving a monitor
group feature from Peter.

I thus think that it would be valuable to consider more efficient
interfaces from the beginning. I do not think that this is the type
of work that is an optimization to be delayed until an unspecified later
time, but instead multiple usage of interface can be considered from the
start with a most optimal interface created from the beginning. Specifically,
why does resctrl need to be "extended" to support a global assignment as proposed
by Peter at a later time, why can it not be done as the original and (ideally)
only mechanism?

Reinette
  
Moger, Babu Feb. 27, 2024, 6:12 p.m. UTC | #21
Hi Reinette,

On 2/26/24 15:20, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>> On 2/23/24 16:21, Reinette Chatre wrote:
>>> On 2/23/2024 12:11 PM, Moger, Babu wrote:
>>>> On 2/23/24 11:17, Reinette Chatre wrote:
>>>>>
>>>>>
>>>>> On 2/20/2024 12:48 PM, Moger, Babu wrote:
>>>>>> On 2/20/24 09:21, James Morse wrote:
>>>>>>> On 19/01/2024 18:22, Babu Moger wrote:
>>>>>
>>>>>>>> e. Enable ABMC mode.
>>>>>>>>
>>>>>>>> 	#echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>>         #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable
>>>>>>>>         1
>>>>>>>
>>>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that
>>>>>>> supports it, or enabled implicitly when the first assignment attempt arrives?
>>>>>>>
>>>>>>> I guess this is really needed for a reset - could we implement that instead? This way
>>>>>>> there isn't an extra step user-space has to do to make the assignments work.
>>>>>>
>>>>>> Mostly the new features are added as an opt-in method. So, kept it that
>>>>>> way. If we enable this feature automatically, then we have provide an
>>>>>> option to disable it.
>>>>>>
>>>>>
>>>>> At the same time it sounds to me like ABMC can improve current users'
>>>>> experience without requiring them to do anything. This sounds appealing.
>>>>> For example, if I understand correctly, it may be possible to start resctrl
>>>>> with ABMC enabled by default and the number of monitoring groups (currently
>>>>> exposed to user space via "num_rmids") limited to the number of counters
>>>>> supported by ABMC. Existing users would then by default obtain better behavior
>>>>> of counters not resetting.
>>>>
>>>> Yes, I like the idea. But i will break compatibility with pqos
>>>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without
>>>> supporting ABMC enablement in the tool. ABMC feature requires an extra
>>>> step to assign the counters for monitor to work.
>>>
>>> I am considering two scenarios, the "default behavior" is what a user will
>>> experience when booting resctrl on an ABMC system and the "new feature
>>> behavior" where a user can take full advantage of all that ABMC (and soft
>>> RMID, and MPAM) can offer.
>>>
>>> So, first, on an ABMC system in the "default behavior" scenario I expect
>>> that resctrl can do required ABMC counter configuration automatically at
>>> the time a monitor group is created. In this "default behavior" scenario
>>> resctrl would expose "num_rmids" to be half of the number of assignable
>>> counters. When a user then creates a monitor group two counters will be
>>> used and configured to count the local and total bytes respectively. If
>>> two counters are not available then ENOSPC returned, just like when system
>>> is out of closid/rmid.  With this "default behavior" user space thus gets
>>> improved behavior without making any changes on its part. I do not have
>>
>> We can automatically assign the h/w counter when monitor group is created
>> until we run out of h/w counters. That is good idea. By default user will
>> not notice any difference in ABMC mode.
>>
>>> insight into how many counters ABMC could be expected to expose though ...
>>> so some users may be surprised at how few monitor groups can be created
>>> with new hardware? This may not be an issue since that would accurately
>>> reflect how many _reliable_ monitor groups can be created and if user needs
>>> more monitor groups then that would be a time to explore the "new feature"
>>> that requires changes in how user interacts with resctrl.
>>
>> Currently, 32 h/w counters are available to configure. With two counters
>> for each group, we can create 16 groups(15 new groups plus the default
>> group). That should be fine as pqos tool creates only 16 groups when it is
>> started.
> 
> user space can never assume that a certain number of groups can
> be created. 
> 
>>> Apart from the "default behavior" there are two options to consider ...
>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>     where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>     system, where the previous "num_rmids" monitor groups can be created but
>>>     the counters are reset unpredictably ... should this still be supported
>>>     on ABMC systems though?
>>
>> I would say yes. For some reason user(hardware or software issues) is not
>> able to use ABMC mode, they have an option to go back to legacy mode.
> 
> I see. Should this perhaps be protected behind the resctrl "debug" mount option?

The debug option gives wrong impression. It is better to keep the option
open to enable the feature in normal mode.

> 
>>> (b) the "new feature" behavior where user space gets full benefit of ABMC
>>>     that allows user space to create any number of monitor groups but then
>>>     user space needs to let hardware (via resctrl) know which
>>>     events should be counted.
>>
>> Is this "new feature" is enabled by default when ABMC is available?
> 
> Not in this design, no. In these scenarios ABMC will be available and enabled
> in both the "default" and "new feature" behavior. The difference is no user
> space changes are needed in "default" scenario and resctrl limits the number
> of monitor groups to support all monitor groups to be backed by hardware
> counters. 
> When "new feature" is enabled when ABMC is available and enabled then
> user space is able to create more monitor groups than available hardware
> counters and new user interface is required to manage associating counters
> with monitor events.

ok. That sounds good.

> 
>>
>> Or we need to provide an interface to enable this feature?
> 
> Yes, an interface will be needed to enable this feature.

ok.

> 
>>
>>
>>>
>>> I expect that only (b) above would require user space change. Considering
>>> that per documentation, "num_rmids" means "This is the upper bound for how
>>> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids"
>>> becomes undefined when "new feature" is enabled. When this new feature is enabled
>>> then user space is no longer limited by number of RMIDs on how many monitor
>>
>> With ABMC, we will have a new field "mbm_assignable_counters". We don't
>> have to change the definition of "num_rmids".
> 
> The problem here is that "num_rmids" is (as per Documentation/arch/x86/resctrl.rst)
> documented to be an upper bound for how many monitor groups can be created.
> As I understand, when ABMC is enabled and its full capability exposed to user
> space then there is no limit to how many monitor groups can be created, no?

No. That is not correct. The number of monitor groups is still limited by
num_rmids. But assignment is limited by mbm_assignable_counters. More below.

> 
> For example, if I understand correctly, theoretically, when ABMC is enabled then
> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
> is not unsigned, tbd if number of directories may also be limited by kernfs).
> User space could theoretically create more monitor groups than the number of
> rmids that a resource claims to support using current upstream enumeration.

CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
bits(depends on specific h/w) to represent RMIDs. So, we cannot create
more than this limit(r->num_rmid).

In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
RMID to assign the monitoring. So, assignment limit is
mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.

> Instead, it is the "mbm_assignable_counters" that is of interest, that is what
> user space uses to determine how many of the (potentially very large number of)
> monitor groups/monitor events can be counted at any particular time.
> 
>>> groups can be created and this is the point that the user interface that you
>>> and Peter have ideas about comes into play. Specifically, user space needing
>>> a way to specify:
>>> (a) "let me create more monitor groups that the hardware can support"/"let me
>>>      control which events/monitor groups are counted"
>>>      (like the "mbm_assign" file in your proposal)
>>> (b) "here are the events that need to be counted" 
>>>      (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals)
>>
>> With global assignment option out of way for now(may be introduced later),
>> we can provide two interfaces.
>>
>> 1. /sys/fs/resctrl/info/L3_MON/mbm_assign
>> This will be enabled by default when ABMC is available. Users can disable
>> this option to go back to legacy mode.
> 
> Potentially (all naming placeholders that will only be visible on systems that
> actually supports particular mode):
> legacy [default] new_feature soft_rmid

ok

> 
>>
>> 2. /sys/fs/resctrl/monitor_state.
>> This can used to individually assign or unassign the counters in each group.
>>
>> When assigned:
>> #cat /sys/fs/resctrl/monitor_state
>> 0=total-assign,local-assign;1=total-assign,local-assign
>>
>> When unassigned:
>> #cat /sys/fs/resctrl/monitor_state
>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>
>>
>> Thoughts?
> 
> How do you expect this interface to be used? I understand the mechanics
> of this interface but on a higher level, do you expect user space to
> once in a while assign a new counter to a single event or monitor group
> (for which a fine grained interface works) or do you expect user space to
> shift multiple counters across several monitor events at intervals?

I think we should provide both the options. I was thinking of providing
fine grained interface first.

Few use cases:
1. User wants to assign only one event (total or local) per group.
   In this case, he can assign 32 events in 32 different groups.

   #echo 0=total-assign >  /sys/fs/resctrl/monitor_state
   or
   #echo 0=local-assign >  /sys/fs/resctrl/monitor_state

   When done:

   #echo 0=total-unassign >  /sys/fs/resctrl/monitor_state
   or
   #echo 0=local-unassign >  /sys/fs/resctrl/monitor_state

   Note: 0 the domain here.


2. User wants to assign both "local" and "total" events per group. In this
case, he can assign 32 events in 16 different groups.

   #echo 0=local-assign,total-assign  >  /sys/fs/resctrl/monitor_state

   When done:

   #echo 0=local-unassign,total-unassign  >  /sys/fs/resctrl/monitor_state

3. combination of 1 and 2.

4. Assign multiple group assignment at once. I consider this as global
assignment. This can be achieved by 1 and 2 from user space looping thru
all the interested groups. Peter is worried about system call latency
here. He wants to optimize this. I was thinking this can done later.

> 
> Across resctrl's lifetime we have seen examples of user space wanting
> to accomplish more with a single resctrl interaction. For example moving
> multiple tasks to a group that you added support for and moving a monitor
> group feature from Peter.
> 
> I thus think that it would be valuable to consider more efficient
> interfaces from the beginning. I do not think that this is the type
> of work that is an optimization to be delayed until an unspecified later
> time, but instead multiple usage of interface can be considered from the
> start with a most optimal interface created from the beginning. Specifically,
> why does resctrl need to be "extended" to support a global assignment as proposed
> by Peter at a later time, why can it not be done as the original and (ideally)
> only mechanism?
> 
> Reinette
  
Peter Newman Feb. 27, 2024, 6:26 p.m. UTC | #22
Hi Babu,

On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote:
>
> On 2/26/24 15:20, Reinette Chatre wrote:
> >
> > For example, if I understand correctly, theoretically, when ABMC is enabled then
> > "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
> > is not unsigned, tbd if number of directories may also be limited by kernfs).
> > User space could theoretically create more monitor groups than the number of
> > rmids that a resource claims to support using current upstream enumeration.
>
> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
> more than this limit(r->num_rmid).
>
> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
> RMID to assign the monitoring. So, assignment limit is
> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.

That is not entirely true. As long as you don't need to maintain
bandwidth counts for unassigned monitoring groups, there's no need to
allocate a HW RMID to a monitoring group.

In my soft-ABMC prototype, where a limited number of HW RMIDs are
allocated to assigned monitoring groups, I was forced to replace the
HW RMID value stored in the task_struct to a pointer to the struct
mongroup, since the RMID value assigned to the mongroup would
frequently change, resulting in excessive walks down the tasklist to
find all of the tasks using the previous value.

However, the number of hardware monitor group identifiers supported
(i.e., RMID, PARTID:PMG) is usually high enough that I don't think
there's much motivation to support unlimited monitoring groups. In
both soft-RMID and soft-ABMC, I didn't bother supporting more groups
than num_rmids, because the number was large enough.

-Peter
  
Moger, Babu Feb. 27, 2024, 7:37 p.m. UTC | #23
Hi Peter,

On 2/27/24 12:26, Peter Newman wrote:
> Hi Babu,
> 
> On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>
>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>> User space could theoretically create more monitor groups than the number of
>>> rmids that a resource claims to support using current upstream enumeration.
>>
>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>> more than this limit(r->num_rmid).
>>
>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>> RMID to assign the monitoring. So, assignment limit is
>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
> 
> That is not entirely true. As long as you don't need to maintain
> bandwidth counts for unassigned monitoring groups, there's no need to
> allocate a HW RMID to a monitoring group.

We don't need to allocate a h/w counter for unassigned group.
My proposal is to allocate h/w counter only if user requests a assignment.
 The limit for assigned events at time is mbm_assignable_counters(32 right
now).

> 
> In my soft-ABMC prototype, where a limited number of HW RMIDs are
> allocated to assigned monitoring groups, I was forced to replace the
> HW RMID value stored in the task_struct to a pointer to the struct
> mongroup, since the RMID value assigned to the mongroup would
> frequently change, resulting in excessive walks down the tasklist to
> find all of the tasks using the previous value.
> 
> However, the number of hardware monitor group identifiers supported
> (i.e., RMID, PARTID:PMG) is usually high enough that I don't think
> there's much motivation to support unlimited monitoring groups. In
> both soft-RMID and soft-ABMC, I didn't bother supporting more groups
> than num_rmids, because the number was large enough.

What is soft-ABMC?
  
Peter Newman Feb. 27, 2024, 8:06 p.m. UTC | #24
Hi Babu,

On Tue, Feb 27, 2024 at 11:37 AM Moger, Babu <babu.moger@amd.com> wrote:
> On 2/27/24 12:26, Peter Newman wrote:
> > On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote:
> >>
> >> On 2/26/24 15:20, Reinette Chatre wrote:
> >>>
> >>> For example, if I understand correctly, theoretically, when ABMC is enabled then
> >>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
> >>> is not unsigned, tbd if number of directories may also be limited by kernfs).
> >>> User space could theoretically create more monitor groups than the number of
> >>> rmids that a resource claims to support using current upstream enumeration.
> >>
> >> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
> >> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
> >> more than this limit(r->num_rmid).
> >>
> >> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
> >> RMID to assign the monitoring. So, assignment limit is
> >> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
> >
> > That is not entirely true. As long as you don't need to maintain
> > bandwidth counts for unassigned monitoring groups, there's no need to
> > allocate a HW RMID to a monitoring group.
>
> We don't need to allocate a h/w counter for unassigned group.
> My proposal is to allocate h/w counter only if user requests a assignment.
>  The limit for assigned events at time is mbm_assignable_counters(32 right
> now).

I said "RMID", not "counter". The point is, the main purpose served by
the RMID in an unassigned mongroup is providing a unique value to
write into the task_struct to indicate group membership.

>
> >
> > In my soft-ABMC prototype, where a limited number of HW RMIDs are
> > allocated to assigned monitoring groups, I was forced to replace the
> > HW RMID value stored in the task_struct to a pointer to the struct
> > mongroup, since the RMID value assigned to the mongroup would
> > frequently change, resulting in excessive walks down the tasklist to
> > find all of the tasks using the previous value.
> >
> > However, the number of hardware monitor group identifiers supported
> > (i.e., RMID, PARTID:PMG) is usually high enough that I don't think
> > there's much motivation to support unlimited monitoring groups. In
> > both soft-RMID and soft-ABMC, I didn't bother supporting more groups
> > than num_rmids, because the number was large enough.
>
> What is soft-ABMC?

It's the term I'm using to describe[1] the approach of using the
monitor assignment interface to allocate a small number of RMIDs to
monitoring groups.

-Peter

[1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
  
Moger, Babu Feb. 27, 2024, 8:42 p.m. UTC | #25
Hi Peter,

On 2/27/24 14:06, Peter Newman wrote:
> Hi Babu,
> 
> On Tue, Feb 27, 2024 at 11:37 AM Moger, Babu <babu.moger@amd.com> wrote:
>> On 2/27/24 12:26, Peter Newman wrote:
>>> On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote:
>>>>
>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>
>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>> User space could theoretically create more monitor groups than the number of
>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>
>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>> more than this limit(r->num_rmid).
>>>>
>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>> RMID to assign the monitoring. So, assignment limit is
>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>
>>> That is not entirely true. As long as you don't need to maintain
>>> bandwidth counts for unassigned monitoring groups, there's no need to
>>> allocate a HW RMID to a monitoring group.
>>
>> We don't need to allocate a h/w counter for unassigned group.
>> My proposal is to allocate h/w counter only if user requests a assignment.
>>  The limit for assigned events at time is mbm_assignable_counters(32 right
>> now).
> 
> I said "RMID", not "counter". The point is, the main purpose served by
> the RMID in an unassigned mongroup is providing a unique value to
> write into the task_struct to indicate group membership.

In case of ABMC, cpu(or task) association still uses RMID value stored in
"struct mongroup" data structure. Same value is written to PQR_ASSOC(MSR
C8Fh). It needs to be a valid value. Hope that make sense.

> 
>>
>>>
>>> In my soft-ABMC prototype, where a limited number of HW RMIDs are
>>> allocated to assigned monitoring groups, I was forced to replace the
>>> HW RMID value stored in the task_struct to a pointer to the struct
>>> mongroup, since the RMID value assigned to the mongroup would
>>> frequently change, resulting in excessive walks down the tasklist to
>>> find all of the tasks using the previous value.

You are using this pointer as unique value. This will work as long as you
are not writing this value to PQR_ASSOC MSR.


>>>
>>> However, the number of hardware monitor group identifiers supported
>>> (i.e., RMID, PARTID:PMG) is usually high enough that I don't think
>>> there's much motivation to support unlimited monitoring groups. In
>>> both soft-RMID and soft-ABMC, I didn't bother supporting more groups
>>> than num_rmids, because the number was large enough.
>>
>> What is soft-ABMC?
> 
> It's the term I'm using to describe[1] the approach of using the
> monitor assignment interface to allocate a small number of RMIDs to
> monitoring groups.
> 
> -Peter
> 
> [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
  
Reinette Chatre Feb. 27, 2024, 11:50 p.m. UTC | #26
Hi Babu,

On 2/27/2024 10:12 AM, Moger, Babu wrote:
> On 2/26/24 15:20, Reinette Chatre wrote:
>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>> On 2/23/24 16:21, Reinette Chatre wrote:

>>>> Apart from the "default behavior" there are two options to consider ...
>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>     where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>     system, where the previous "num_rmids" monitor groups can be created but
>>>>     the counters are reset unpredictably ... should this still be supported
>>>>     on ABMC systems though?
>>>
>>> I would say yes. For some reason user(hardware or software issues) is not
>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>
>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
> 
> The debug option gives wrong impression. It is better to keep the option
> open to enable the feature in normal mode.

You mentioned that it would only be needed when there are hardware or
software issues ... so debug does sound appropriate. Could you please give
an example of how debug option gives wrong impression? Why would you want
users to keep using "legacy" mode on an ABMC system?

..

>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>> User space could theoretically create more monitor groups than the number of
>> rmids that a resource claims to support using current upstream enumeration.
> 
> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
> more than this limit(r->num_rmid).
> 
> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
> RMID to assign the monitoring. So, assignment limit is
> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.

I see. Thank you for clarifying. This does make enabling simpler and one
less user interface item that needs changing.

..

>>> 2. /sys/fs/resctrl/monitor_state.
>>> This can used to individually assign or unassign the counters in each group.
>>>
>>> When assigned:
>>> #cat /sys/fs/resctrl/monitor_state
>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>
>>> When unassigned:
>>> #cat /sys/fs/resctrl/monitor_state
>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>
>>>
>>> Thoughts?
>>
>> How do you expect this interface to be used? I understand the mechanics
>> of this interface but on a higher level, do you expect user space to
>> once in a while assign a new counter to a single event or monitor group
>> (for which a fine grained interface works) or do you expect user space to
>> shift multiple counters across several monitor events at intervals?
> 
> I think we should provide both the options. I was thinking of providing
> fine grained interface first.

Could you please provide a motivation for why two interfaces, one inefficient
and one not, should be created and maintained? Users can still do fine grained
assignment with a global assignment interface.

Reinette
  
Moger, Babu Feb. 28, 2024, 5:59 p.m. UTC | #27
Hi Reinette,

On 2/27/24 17:50, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>> On 2/26/24 15:20, Reinette Chatre wrote:
>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>> On 2/23/24 16:21, Reinette Chatre wrote:
> 
>>>>> Apart from the "default behavior" there are two options to consider ...
>>>>> (a) the "original" behavior(? I do not know what to call it) - this would be
>>>>>     where user space wants(?) to have the current non-ABMC behavior on an ABMC
>>>>>     system, where the previous "num_rmids" monitor groups can be created but
>>>>>     the counters are reset unpredictably ... should this still be supported
>>>>>     on ABMC systems though?
>>>>
>>>> I would say yes. For some reason user(hardware or software issues) is not
>>>> able to use ABMC mode, they have an option to go back to legacy mode.
>>>
>>> I see. Should this perhaps be protected behind the resctrl "debug" mount option?
>>
>> The debug option gives wrong impression. It is better to keep the option
>> open to enable the feature in normal mode.
> 
> You mentioned that it would only be needed when there are hardware or
> software issues ... so debug does sound appropriate. Could you please give
> an example of how debug option gives wrong impression? Why would you want
> users to keep using "legacy" mode on an ABMC system?

I don't have a strong argument here. I am fine as long as there is a way
to go back to legacy mode if required. We can provide legacy option in
debug mode.

> 
> ...
> 
>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>> User space could theoretically create more monitor groups than the number of
>>> rmids that a resource claims to support using current upstream enumeration.
>>
>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>> more than this limit(r->num_rmid).
>>
>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>> RMID to assign the monitoring. So, assignment limit is
>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
> 
> I see. Thank you for clarifying. This does make enabling simpler and one
> less user interface item that needs changing.
> 
> ...
> 
>>>> 2. /sys/fs/resctrl/monitor_state.
>>>> This can used to individually assign or unassign the counters in each group.
>>>>
>>>> When assigned:
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>
>>>> When unassigned:
>>>> #cat /sys/fs/resctrl/monitor_state
>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>
>>>>
>>>> Thoughts?
>>>
>>> How do you expect this interface to be used? I understand the mechanics
>>> of this interface but on a higher level, do you expect user space to
>>> once in a while assign a new counter to a single event or monitor group
>>> (for which a fine grained interface works) or do you expect user space to
>>> shift multiple counters across several monitor events at intervals?
>>
>> I think we should provide both the options. I was thinking of providing
>> fine grained interface first.
> 
> Could you please provide a motivation for why two interfaces, one inefficient
> and one not, should be created and maintained? Users can still do fine grained
> assignment with a global assignment interface.

Lets consider one by one.

1. Fine grained assignment.

It will be part of the mongroup(or control mongroup). User has the access
to the group and can query the group's current status before assigning or
unassigning.

   $cd /sys/fs/resctrl/ctrl_mon1
   $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
       0=total-unassign,local-unassign;1=total-unassign,local-unassign;

Assign the total event

  $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Assign the local event

   $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Assign both events:

   $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state

Check the assignment status.

   $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
       0=total-assign,local-assign;1=total-unassign,local-unassign;

-User interface is simple.

-Assignment will fail if all the h/w counters are exhausted. User needs to
unassign a counter from another group and use that counter here. This can
be done just querying the monitor state of another group.

-Monitor group's details(cpus, tasks) are part of the group. So, it is
better to have assignment state inside the group.

Note: Used interface names here just to give example.


2. global assignment:

I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
directory.

In case there are 100 mongroups, we need to have a way to list current
assignment status for these groups. I am not sure how to list status of
these 100 groups.

If user is wants to assign the local event(or total) in a specific group
in this list of 100 groups, I am not sure how to provide interface for
that. Should we pass the name of mongroup? That will involve looping
through using the call kernfs_walk_and_get. This may be ok if we are
dealing with very small number of groups.
  
Reinette Chatre Feb. 28, 2024, 8:04 p.m. UTC | #28
Hi Babu,

On 2/28/2024 9:59 AM, Moger, Babu wrote:
> On 2/27/24 17:50, Reinette Chatre wrote:
>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>

>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>> User space could theoretically create more monitor groups than the number of
>>>> rmids that a resource claims to support using current upstream enumeration.
>>>
>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>> more than this limit(r->num_rmid).
>>>
>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>> RMID to assign the monitoring. So, assignment limit is
>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>
>> I see. Thank you for clarifying. This does make enabling simpler and one
>> less user interface item that needs changing.
>>
>> ...
>>
>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>
>>>>> When assigned:
>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>
>>>>> When unassigned:
>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>
>>>>>
>>>>> Thoughts?
>>>>
>>>> How do you expect this interface to be used? I understand the mechanics
>>>> of this interface but on a higher level, do you expect user space to
>>>> once in a while assign a new counter to a single event or monitor group
>>>> (for which a fine grained interface works) or do you expect user space to
>>>> shift multiple counters across several monitor events at intervals?
>>>
>>> I think we should provide both the options. I was thinking of providing
>>> fine grained interface first.
>>
>> Could you please provide a motivation for why two interfaces, one inefficient
>> and one not, should be created and maintained? Users can still do fine grained
>> assignment with a global assignment interface.
> 
> Lets consider one by one.
> 
> 1. Fine grained assignment.
> 
> It will be part of the mongroup(or control mongroup). User has the access
> to the group and can query the group's current status before assigning or
> unassigning.
> 
>    $cd /sys/fs/resctrl/ctrl_mon1
>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>        0=total-unassign,local-unassign;1=total-unassign,local-unassign;
> 
> Assign the total event
> 
>   $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
> 
> Assign the local event
> 
>    $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
> 
> Assign both events:
> 
>    $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
> 
> Check the assignment status.
> 
>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>        0=total-assign,local-assign;1=total-unassign,local-unassign;
> 
> -User interface is simple.

This should not be the only motivation. Please do not sacrifice efficiency
and usability just to have a simple interface. One can also argue that this
interface can only be considered simple from the kernel implementation perspective,
from user space it seems complicated. For example, as James pointed out earlier [1],
user space would need to walk the entire resctrl to find out where counters are
assigned. Peter also pointed out how the multiple syscalls needed when adjusting
hundreds of monitor groups is inefficient. Please take all feedback into account.

You consider "simple interface" as a motivation, there seems to be at least two
arguments against this interface. Please consider these in your comparison
between interfaces. These are things that should be noted and make their way to
the cover letter.

> 
> -Assignment will fail if all the h/w counters are exhausted. User needs to
> unassign a counter from another group and use that counter here. This can
> be done just querying the monitor state of another group.

Right ... and as you state there can be hundreds of monitor groups that
user space would need to walk and query to get this information.

> 
> -Monitor group's details(cpus, tasks) are part of the group. So, it is
> better to have assignment state inside the group.

The assignment state should be clear from the event file.

> Note: Used interface names here just to give example.
> 
> 
> 2. global assignment:
> 
> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
> directory.
> 
> In case there are 100 mongroups, we need to have a way to list current
> assignment status for these groups. I am not sure how to list status of
> these 100 groups.

The kernel has many examples of interfaces that manages status of a large
number of entities. I am thinking, for example, we can learn a lot from
how dynamic debug works. On my system I see:

$ wc -l /sys/kernel/debug/dynamic_debug/control
5359 /sys/kernel/debug/dynamic_debug/control

> 
> If user is wants to assign the local event(or total) in a specific group
> in this list of 100 groups, I am not sure how to provide interface for
> that. Should we pass the name of mongroup? That will involve looping
> through using the call kernfs_walk_and_get. This may be ok if we are
> dealing with very small number of groups.
> 

What is your concern when needing to modify a large number of groups?
Are you concerned about the size of the writes needing to be parsed? It looks
like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
to me that such large sizes will be required.   

There is also kernfs_find_and_get() that may be more convenient to use.
I believe user space needs to provide control group name for a global
interface (the same name can be used by monitor groups belonging to
different control groups), and that can be used to narrow search.

Reading your message I do not find any motivation _against_ a global
interface, except that it is not obvious to you how such interface may look
or work. That is fair. Peter seems to have ideas and a working implementation
that can be used as reference. So far I have only seen one comment [2] from James
that was skeptical about the global interface but the reason notes that MPAM
allocates counters per domain, which is the same as ABMC so we will need more
information from James here on what is required since he did not respond to
Peter.

Below is a *hypothetical* interface to start a discussion that explores how
to support fine grained assignment in an interface that aims to be easy to use
by user space. Obviously Peter is also working on something so there
are many viewpoints to consider.

File info/L3_MON/mbm_assign_control:
#control_group/mon_group/flags
ctrl_a/mon_a/00=_;01=_
ctrl_a/mon_b/00=l;01=t
ctrl_b/mon_c/00=lt;01=lt

Above file displays to user:
* No counters are assigned to monitor group mon_a within control group ctrl_a
* Counter for local MBM is assigned to domain 0 of monitor group mon_b within
  control group ctrl_a 
* Counter for total MBM is assigned to domain 1 of monitor group mon_b within
  control group ctrl_a 
* Counters for local and total MBM are assigned to both domains of monitor
  group mon_c within control group ctrl_b

With above interface user space can, with a single read, get insight into
how counters are assigned across all monitor groups.
User space can write to the file to modify the flags. If assigning a new
counter when no more counters are available then the write will fail.
Potentially, if changes are made in order provided by the user then
the user will be able to unassign counters from one group and re-assign to
another group with a single write.

I provide this purely to generate some ideas and gather more thoughts on
a global interface.

Reinette

[1] https://lore.kernel.org/lkml/2f373abf-f0c0-4f5d-9e22-1039a40a57f0@arm.com/
[2] https://lore.kernel.org/lkml/1a8c1cd6-a1ce-47a2-bc87-d4cccc84519b@arm.com/
  
Moger, Babu Feb. 29, 2024, 8:37 p.m. UTC | #29
Hi Reinette,

On 2/28/24 14:04, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>> On 2/27/24 17:50, Reinette Chatre wrote:
>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>
> 
>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>> User space could theoretically create more monitor groups than the number of
>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>
>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>> more than this limit(r->num_rmid).
>>>>
>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>> RMID to assign the monitoring. So, assignment limit is
>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>
>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>> less user interface item that needs changing.
>>>
>>> ...
>>>
>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>>
>>>>>> When assigned:
>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>
>>>>>> When unassigned:
>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>> of this interface but on a higher level, do you expect user space to
>>>>> once in a while assign a new counter to a single event or monitor group
>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>> shift multiple counters across several monitor events at intervals?
>>>>
>>>> I think we should provide both the options. I was thinking of providing
>>>> fine grained interface first.
>>>
>>> Could you please provide a motivation for why two interfaces, one inefficient
>>> and one not, should be created and maintained? Users can still do fine grained
>>> assignment with a global assignment interface.
>>
>> Lets consider one by one.
>>
>> 1. Fine grained assignment.
>>
>> It will be part of the mongroup(or control mongroup). User has the access
>> to the group and can query the group's current status before assigning or
>> unassigning.
>>
>>    $cd /sys/fs/resctrl/ctrl_mon1
>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>        0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>
>> Assign the total event
>>
>>   $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>
>> Assign the local event
>>
>>    $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>
>> Assign both events:
>>
>>    $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>
>> Check the assignment status.
>>
>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>        0=total-assign,local-assign;1=total-unassign,local-unassign;
>>
>> -User interface is simple.
> 
> This should not be the only motivation. Please do not sacrifice efficiency
> and usability just to have a simple interface. One can also argue that this
> interface can only be considered simple from the kernel implementation perspective,
> from user space it seems complicated. For example, as James pointed out earlier [1],
> user space would need to walk the entire resctrl to find out where counters are
> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
> hundreds of monitor groups is inefficient. Please take all feedback into account.
> 
> You consider "simple interface" as a motivation, there seems to be at least two
> arguments against this interface. Please consider these in your comparison
> between interfaces. These are things that should be noted and make their way to
> the cover letter.
> 
>>
>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>> unassign a counter from another group and use that counter here. This can
>> be done just querying the monitor state of another group.
> 
> Right ... and as you state there can be hundreds of monitor groups that
> user space would need to walk and query to get this information.
> 
>>
>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>> better to have assignment state inside the group.
> 
> The assignment state should be clear from the event file.
> 
>> Note: Used interface names here just to give example.
>>
>>
>> 2. global assignment:
>>
>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>> directory.
>>
>> In case there are 100 mongroups, we need to have a way to list current
>> assignment status for these groups. I am not sure how to list status of
>> these 100 groups.
> 
> The kernel has many examples of interfaces that manages status of a large
> number of entities. I am thinking, for example, we can learn a lot from
> how dynamic debug works. On my system I see:
> 
> $ wc -l /sys/kernel/debug/dynamic_debug/control
> 5359 /sys/kernel/debug/dynamic_debug/control
> 
>>
>> If user is wants to assign the local event(or total) in a specific group
>> in this list of 100 groups, I am not sure how to provide interface for
>> that. Should we pass the name of mongroup? That will involve looping
>> through using the call kernfs_walk_and_get. This may be ok if we are
>> dealing with very small number of groups.
>>
> 
> What is your concern when needing to modify a large number of groups?
> Are you concerned about the size of the writes needing to be parsed? It looks
> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
> to me that such large sizes will be required.   
> 
> There is also kernfs_find_and_get() that may be more convenient to use.

Will look at this. There is also kernfs_name and kernfs_path.

> I believe user space needs to provide control group name for a global
> interface (the same name can be used by monitor groups belonging to
> different control groups), and that can be used to narrow search.
> 
> Reading your message I do not find any motivation _against_ a global
> interface, except that it is not obvious to you how such interface may look
> or work. That is fair. Peter seems to have ideas and a working implementation
> that can be used as reference. So far I have only seen one comment [2] from James
> that was skeptical about the global interface but the reason notes that MPAM
> allocates counters per domain, which is the same as ABMC so we will need more
> information from James here on what is required since he did not respond to
> Peter.
> 
> Below is a *hypothetical* interface to start a discussion that explores how
> to support fine grained assignment in an interface that aims to be easy to use
> by user space. Obviously Peter is also working on something so there
> are many viewpoints to consider.
> 
> File info/L3_MON/mbm_assign_control:
> #control_group/mon_group/flags
> ctrl_a/mon_a/00=_;01=_
> ctrl_a/mon_b/00=l;01=t
> ctrl_b/mon_c/00=lt;01=lt

I think you left few things here(Like the default control_mon group).

To make more clear, let me list all the groups here based this.

When none of the counters assigned:

$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
resctrl/00=none,none;01=none,none (#default control_mon group)
resctrl/mon_a/00=none,none;01=none,none (#mon group)
resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)


When some counters are assigned:

$echo "resctrl/00=total,local" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
default group)

$echo "resctrl/mon_a/00=total;01=total" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
group)

$echo "resctrl/ctrl_a/00=local;01=local" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control

$echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
/sys/fs/resctrl/nfo/L3_MON/mbm_assign_control

$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
resctrl/00=total,local;01=none,none (#default control_mon group)
resctrl/mon_a/00=total,none;01=total,none (#mon group)
resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)


Few comments about this approach:
1.This will involve lots of text processing in the kernel. Will need to
figure out calls for these processing.

2.In this approach there is no way to list assignment of a single
group(like group resctrl/ctrl_a/mon_ab alone).

3. This is similar to fine grained approach we discussed but in global level.

Want to get Pater/James comments about this approach.

> 
> Above file displays to user:
> * No counters are assigned to monitor group mon_a within control group ctrl_a
> * Counter for local MBM is assigned to domain 0 of monitor group mon_b within
>   control group ctrl_a 
> * Counter for total MBM is assigned to domain 1 of monitor group mon_b within
>   control group ctrl_a 
> * Counters for local and total MBM are assigned to both domains of monitor
>   group mon_c within control group ctrl_b
> 
> With above interface user space can, with a single read, get insight into
> how counters are assigned across all monitor groups.
> User space can write to the file to modify the flags. If assigning a new
> counter when no more counters are available then the write will fail.
> Potentially, if changes are made in order provided by the user then
> the user will be able to unassign counters from one group and re-assign to
> another group with a single write.
> 
> I provide this purely to generate some ideas and gather more thoughts on
> a global interface.
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/2f373abf-f0c0-4f5d-9e22-1039a40a57f0@arm.com/
> [2] https://lore.kernel.org/lkml/1a8c1cd6-a1ce-47a2-bc87-d4cccc84519b@arm.com/
> 
> 
> 
> 
>
  
Reinette Chatre Feb. 29, 2024, 9:50 p.m. UTC | #30
Hi Babu,

On 2/29/2024 12:37 PM, Moger, Babu wrote:
> On 2/28/24 14:04, Reinette Chatre wrote:
>> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>
>>
>>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>>> User space could theoretically create more monitor groups than the number of
>>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>>
>>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>>> more than this limit(r->num_rmid).
>>>>>
>>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>>> RMID to assign the monitoring. So, assignment limit is
>>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>>
>>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>>> less user interface item that needs changing.
>>>>
>>>> ...
>>>>
>>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>>>
>>>>>>> When assigned:
>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>>
>>>>>>> When unassigned:
>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>>
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>>> of this interface but on a higher level, do you expect user space to
>>>>>> once in a while assign a new counter to a single event or monitor group
>>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>>> shift multiple counters across several monitor events at intervals?
>>>>>
>>>>> I think we should provide both the options. I was thinking of providing
>>>>> fine grained interface first.
>>>>
>>>> Could you please provide a motivation for why two interfaces, one inefficient
>>>> and one not, should be created and maintained? Users can still do fine grained
>>>> assignment with a global assignment interface.
>>>
>>> Lets consider one by one.
>>>
>>> 1. Fine grained assignment.
>>>
>>> It will be part of the mongroup(or control mongroup). User has the access
>>> to the group and can query the group's current status before assigning or
>>> unassigning.
>>>
>>>    $cd /sys/fs/resctrl/ctrl_mon1
>>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>        0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>>
>>> Assign the total event
>>>
>>>   $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Assign the local event
>>>
>>>    $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Assign both events:
>>>
>>>    $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>
>>> Check the assignment status.
>>>
>>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>        0=total-assign,local-assign;1=total-unassign,local-unassign;
>>>
>>> -User interface is simple.
>>
>> This should not be the only motivation. Please do not sacrifice efficiency
>> and usability just to have a simple interface. One can also argue that this
>> interface can only be considered simple from the kernel implementation perspective,
>> from user space it seems complicated. For example, as James pointed out earlier [1],
>> user space would need to walk the entire resctrl to find out where counters are
>> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
>> hundreds of monitor groups is inefficient. Please take all feedback into account.
>>
>> You consider "simple interface" as a motivation, there seems to be at least two
>> arguments against this interface. Please consider these in your comparison
>> between interfaces. These are things that should be noted and make their way to
>> the cover letter.
>>
>>>
>>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>>> unassign a counter from another group and use that counter here. This can
>>> be done just querying the monitor state of another group.
>>
>> Right ... and as you state there can be hundreds of monitor groups that
>> user space would need to walk and query to get this information.
>>
>>>
>>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>>> better to have assignment state inside the group.
>>
>> The assignment state should be clear from the event file.
>>
>>> Note: Used interface names here just to give example.
>>>
>>>
>>> 2. global assignment:
>>>
>>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>>> directory.
>>>
>>> In case there are 100 mongroups, we need to have a way to list current
>>> assignment status for these groups. I am not sure how to list status of
>>> these 100 groups.
>>
>> The kernel has many examples of interfaces that manages status of a large
>> number of entities. I am thinking, for example, we can learn a lot from
>> how dynamic debug works. On my system I see:
>>
>> $ wc -l /sys/kernel/debug/dynamic_debug/control
>> 5359 /sys/kernel/debug/dynamic_debug/control
>>
>>>
>>> If user is wants to assign the local event(or total) in a specific group
>>> in this list of 100 groups, I am not sure how to provide interface for
>>> that. Should we pass the name of mongroup? That will involve looping
>>> through using the call kernfs_walk_and_get. This may be ok if we are
>>> dealing with very small number of groups.
>>>
>>
>> What is your concern when needing to modify a large number of groups?
>> Are you concerned about the size of the writes needing to be parsed? It looks
>> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
>> to me that such large sizes will be required.   
>>
>> There is also kernfs_find_and_get() that may be more convenient to use.
> 
> Will look at this. There is also kernfs_name and kernfs_path.
> 
>> I believe user space needs to provide control group name for a global
>> interface (the same name can be used by monitor groups belonging to
>> different control groups), and that can be used to narrow search.
>>
>> Reading your message I do not find any motivation _against_ a global
>> interface, except that it is not obvious to you how such interface may look
>> or work. That is fair. Peter seems to have ideas and a working implementation
>> that can be used as reference. So far I have only seen one comment [2] from James
>> that was skeptical about the global interface but the reason notes that MPAM
>> allocates counters per domain, which is the same as ABMC so we will need more
>> information from James here on what is required since he did not respond to
>> Peter.
>>
>> Below is a *hypothetical* interface to start a discussion that explores how
>> to support fine grained assignment in an interface that aims to be easy to use
>> by user space. Obviously Peter is also working on something so there
>> are many viewpoints to consider.
>>
>> File info/L3_MON/mbm_assign_control:
>> #control_group/mon_group/flags
>> ctrl_a/mon_a/00=_;01=_
>> ctrl_a/mon_b/00=l;01=t
>> ctrl_b/mon_c/00=lt;01=lt
> 
> I think you left few things here(Like the default control_mon group).

No. Similar to proc_resctrl_show() the fields can be empty for
the default group or mon groups belonging to control group.

> 
> To make more clear, let me list all the groups here based this.
> 
> When none of the counters assigned:
> 
> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> resctrl/00=none,none;01=none,none (#default control_mon group)
> resctrl/mon_a/00=none,none;01=none,none (#mon group)
> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)

I am concerned that inconsistent use of "/" will make parsing hard.

I find "resctrl" and all the "none" redundant. It is not clear what
this improves.
Why have:
resctrl/00=none,none;01=none,none
when this could do:
//00=_;01=_


> When some counters are assigned:
> 
> $echo "resctrl/00=total,local" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
> default group)
> 
> $echo "resctrl/mon_a/00=total;01=total" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
> group)
> 
> $echo "resctrl/ctrl_a/00=local;01=local" >
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> 
> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
> 

We could learn some more lessons from dynamic debug (see 
Documentation/admin-guide/dynamic-debug-howto.rst). 
For example, "=" can be used to make an assignment while "+"
can be used to add a counter and "-" can be used to remove a counter.
"=_" can be used to remove counters from all events in that domain.

The interface should also support assign/un-assign to multiple groups with
a single write. To start this could use '\n' as separator as is the custom
with other resctrl interfaces. 

> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> resctrl/00=total,local;01=none,none (#default control_mon group)
> resctrl/mon_a/00=total,none;01=total,none (#mon group)
> resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
> resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)
> 
> 
> Few comments about this approach:
> 1.This will involve lots of text processing in the kernel. Will need to
> figure out calls for these processing.

I see that additional parsing will be needed to determine control group
and monitor group. For these it sounds like you already have a few options
for kernfs API to use.
Apart from that the counter assignment will be similar parsing as what
was done in your previous versions. I think parsing will be easier if it
does not try to use words for the events but just use one letter flags.
For example, there is thus no need to look for "," in the parsing of the
events, just parse one character at a time where each character has a
specific meaning.

> 
> 2.In this approach there is no way to list assignment of a single
> group(like group resctrl/ctrl_a/mon_ab alone).

Should the kernel be responsible for enabling this? User space can just
do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient?

> 
> 3. This is similar to fine grained approach we discussed but in global level.

That is what I have been trying to get across. This has full benefit of the
original implementation while also addressing all problems raised against it.

> 
> Want to get Pater/James comments about this approach.
(Peter)

Of course. I look forward to that. Once agreed it may also be worthwhile to
approach x86 maintainers with an RFC of the proposed new user interface to learn
their guidance. This is where it is important to keep track of all the requirements,
as well as pros and cons of different options.

Reinette
  
Moger, Babu March 1, 2024, 8:36 p.m. UTC | #31
Hi Reinette,

On 2/29/24 15:50, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/29/2024 12:37 PM, Moger, Babu wrote:
>> On 2/28/24 14:04, Reinette Chatre wrote:
>>> On 2/28/2024 9:59 AM, Moger, Babu wrote:
>>>> On 2/27/24 17:50, Reinette Chatre wrote:
>>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote:
>>>>>> On 2/26/24 15:20, Reinette Chatre wrote:
>>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote:
>>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote:
>>>>>
>>>
>>>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then
>>>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid
>>>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs).
>>>>>>> User space could theoretically create more monitor groups than the number of
>>>>>>> rmids that a resource claims to support using current upstream enumeration.
>>>>>>
>>>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11
>>>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create
>>>>>> more than this limit(r->num_rmid).
>>>>>>
>>>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with
>>>>>> RMID to assign the monitoring. So, assignment limit is
>>>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid.
>>>>>
>>>>> I see. Thank you for clarifying. This does make enabling simpler and one
>>>>> less user interface item that needs changing.
>>>>>
>>>>> ...
>>>>>
>>>>>>>> 2. /sys/fs/resctrl/monitor_state.
>>>>>>>> This can used to individually assign or unassign the counters in each group.
>>>>>>>>
>>>>>>>> When assigned:
>>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign
>>>>>>>>
>>>>>>>> When unassigned:
>>>>>>>> #cat /sys/fs/resctrl/monitor_state
>>>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign
>>>>>>>>
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> How do you expect this interface to be used? I understand the mechanics
>>>>>>> of this interface but on a higher level, do you expect user space to
>>>>>>> once in a while assign a new counter to a single event or monitor group
>>>>>>> (for which a fine grained interface works) or do you expect user space to
>>>>>>> shift multiple counters across several monitor events at intervals?
>>>>>>
>>>>>> I think we should provide both the options. I was thinking of providing
>>>>>> fine grained interface first.
>>>>>
>>>>> Could you please provide a motivation for why two interfaces, one inefficient
>>>>> and one not, should be created and maintained? Users can still do fine grained
>>>>> assignment with a global assignment interface.
>>>>
>>>> Lets consider one by one.
>>>>
>>>> 1. Fine grained assignment.
>>>>
>>>> It will be part of the mongroup(or control mongroup). User has the access
>>>> to the group and can query the group's current status before assigning or
>>>> unassigning.
>>>>
>>>>    $cd /sys/fs/resctrl/ctrl_mon1
>>>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>        0=total-unassign,local-unassign;1=total-unassign,local-unassign;
>>>>
>>>> Assign the total event
>>>>
>>>>   $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Assign the local event
>>>>
>>>>    $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Assign both events:
>>>>
>>>>    $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>
>>>> Check the assignment status.
>>>>
>>>>    $cat /sys/fs/resctrl/ctrl_mon1/monitor_state
>>>>        0=total-assign,local-assign;1=total-unassign,local-unassign;
>>>>
>>>> -User interface is simple.
>>>
>>> This should not be the only motivation. Please do not sacrifice efficiency
>>> and usability just to have a simple interface. One can also argue that this
>>> interface can only be considered simple from the kernel implementation perspective,
>>> from user space it seems complicated. For example, as James pointed out earlier [1],
>>> user space would need to walk the entire resctrl to find out where counters are
>>> assigned. Peter also pointed out how the multiple syscalls needed when adjusting
>>> hundreds of monitor groups is inefficient. Please take all feedback into account.
>>>
>>> You consider "simple interface" as a motivation, there seems to be at least two
>>> arguments against this interface. Please consider these in your comparison
>>> between interfaces. These are things that should be noted and make their way to
>>> the cover letter.
>>>
>>>>
>>>> -Assignment will fail if all the h/w counters are exhausted. User needs to
>>>> unassign a counter from another group and use that counter here. This can
>>>> be done just querying the monitor state of another group.
>>>
>>> Right ... and as you state there can be hundreds of monitor groups that
>>> user space would need to walk and query to get this information.
>>>
>>>>
>>>> -Monitor group's details(cpus, tasks) are part of the group. So, it is
>>>> better to have assignment state inside the group.
>>>
>>> The assignment state should be clear from the event file.
>>>
>>>> Note: Used interface names here just to give example.
>>>>
>>>>
>>>> 2. global assignment:
>>>>
>>>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/
>>>> directory.
>>>>
>>>> In case there are 100 mongroups, we need to have a way to list current
>>>> assignment status for these groups. I am not sure how to list status of
>>>> these 100 groups.
>>>
>>> The kernel has many examples of interfaces that manages status of a large
>>> number of entities. I am thinking, for example, we can learn a lot from
>>> how dynamic debug works. On my system I see:
>>>
>>> $ wc -l /sys/kernel/debug/dynamic_debug/control
>>> 5359 /sys/kernel/debug/dynamic_debug/control
>>>
>>>>
>>>> If user is wants to assign the local event(or total) in a specific group
>>>> in this list of 100 groups, I am not sure how to provide interface for
>>>> that. Should we pass the name of mongroup? That will involve looping
>>>> through using the call kernfs_walk_and_get. This may be ok if we are
>>>> dealing with very small number of groups.
>>>>
>>>
>>> What is your concern when needing to modify a large number of groups?
>>> Are you concerned about the size of the writes needing to be parsed? It looks
>>> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear
>>> to me that such large sizes will be required.   
>>>
>>> There is also kernfs_find_and_get() that may be more convenient to use.
>>
>> Will look at this. There is also kernfs_name and kernfs_path.
>>
>>> I believe user space needs to provide control group name for a global
>>> interface (the same name can be used by monitor groups belonging to
>>> different control groups), and that can be used to narrow search.
>>>
>>> Reading your message I do not find any motivation _against_ a global
>>> interface, except that it is not obvious to you how such interface may look
>>> or work. That is fair. Peter seems to have ideas and a working implementation
>>> that can be used as reference. So far I have only seen one comment [2] from James
>>> that was skeptical about the global interface but the reason notes that MPAM
>>> allocates counters per domain, which is the same as ABMC so we will need more
>>> information from James here on what is required since he did not respond to
>>> Peter.
>>>
>>> Below is a *hypothetical* interface to start a discussion that explores how
>>> to support fine grained assignment in an interface that aims to be easy to use
>>> by user space. Obviously Peter is also working on something so there
>>> are many viewpoints to consider.
>>>
>>> File info/L3_MON/mbm_assign_control:
>>> #control_group/mon_group/flags
>>> ctrl_a/mon_a/00=_;01=_
>>> ctrl_a/mon_b/00=l;01=t
>>> ctrl_b/mon_c/00=lt;01=lt
>>
>> I think you left few things here(Like the default control_mon group).
> 
> No. Similar to proc_resctrl_show() the fields can be empty for
> the default group or mon groups belonging to control group.

ok. Need to understand this better. Hope I learn while doing this work.

> 
>>
>> To make more clear, let me list all the groups here based this.
>>
>> When none of the counters assigned:
>>
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> resctrl/00=none,none;01=none,none (#default control_mon group)
>> resctrl/mon_a/00=none,none;01=none,none (#mon group)
>> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
>> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)
> 
> I am concerned that inconsistent use of "/" will make parsing hard.

Do you mean, you don't want to see multiple "/"?

resctrl/ctrl_a/mon_ab/

Change to

mon_ab/

> 
> I find "resctrl" and all the "none" redundant. It is not clear what
> this improves.
> Why have:
> resctrl/00=none,none;01=none,none
> when this could do:
> //00=_;01=_

ok.

"//" meaning root of resctrl filesystem?


> 
> 
>> When some counters are assigned:
>>
>> $echo "resctrl/00=total,local" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
>> default group)
>>
>> $echo "resctrl/mon_a/00=total;01=total" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
>> group)
>>
>> $echo "resctrl/ctrl_a/00=local;01=local" >
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>
> 
> We could learn some more lessons from dynamic debug (see 
> Documentation/admin-guide/dynamic-debug-howto.rst). 
> For example, "=" can be used to make an assignment while "+"
> can be used to add a counter and "-" can be used to remove a counter.
> "=_" can be used to remove counters from all events in that domain.

Yes. Looked at dynamic debug. I am still learning this interface. Some 
examples below based on my understanding.

To assign a counters to default group on domain 0.
$echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign a counters to mon group inside the default group.
$echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign a counters to control mon group inside the default group.
$echo "ctrl_a/00=+l;01=+l"  > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

To assign a counters to control mon group inside another control group.
$echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro

To unassign a counters to control mon group inside another control group.
$echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control

To unassign all the counters on a specific group.
$echo "mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control

It does not matter control group or mon group. We just need to name of 
the group in this interface.

Listing will be

$cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//00=lt;01=lt
/mon_a/00=t;01=t
/ctrl_a/00=l;01=l
/mon_ab/00=_;01=_

> 
> The interface should also support assign/un-assign to multiple groups with
> a single write. To start this could use '\n' as separator as is the custom
> with other resctrl interfaces. 

Yes. that should be fine.

> 
>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> resctrl/00=total,local;01=none,none (#default control_mon group)
>> resctrl/mon_a/00=total,none;01=total,none (#mon group)
>> resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group)
>> resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group)
>>
>>
>> Few comments about this approach:
>> 1.This will involve lots of text processing in the kernel. Will need to
>> figure out calls for these processing.
> 
> I see that additional parsing will be needed to determine control group
> and monitor group. For these it sounds like you already have a few options
> for kernfs API to use.
> Apart from that the counter assignment will be similar parsing as what
> was done in your previous versions. I think parsing will be easier if it
> does not try to use words for the events but just use one letter flags.
> For example, there is thus no need to look for "," in the parsing of the
> events, just parse one character at a time where each character has a
> specific meaning.

ok.

> 
>>
>> 2.In this approach there is no way to list assignment of a single
>> group(like group resctrl/ctrl_a/mon_ab alone).
> 
> Should the kernel be responsible for enabling this? User space can just
> do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient?

That may be ok. Peter, Please comment on this.

> 
>>
>> 3. This is similar to fine grained approach we discussed but in global level.
> 
> That is what I have been trying to get across. This has full benefit of the
> original implementation while also addressing all problems raised against it.
> 
>>
>> Want to get Pater/James comments about this approach.
> (Peter)
> 
> Of course. I look forward to that. Once agreed it may also be worthwhile to
> approach x86 maintainers with an RFC of the proposed new user interface to learn
> their guidance. This is where it is important to keep track of all the requirements,
> as well as pros and cons of different options.

Ok. Sure. I am fine making next version as RFC.

> 
> Reinette
  
Reinette Chatre March 1, 2024, 11:20 p.m. UTC | #32
Hi Babu,

On 3/1/2024 12:36 PM, Moger, Babu wrote:
> On 2/29/24 15:50, Reinette Chatre wrote:
>> On 2/29/2024 12:37 PM, Moger, Babu wrote:

..

>>> To make more clear, let me list all the groups here based this.
>>>
>>> When none of the counters assigned:
>>>
>>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> resctrl/00=none,none;01=none,none (#default control_mon group)
>>> resctrl/mon_a/00=none,none;01=none,none (#mon group)
>>> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group)
>>> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group)
>>
>> I am concerned that inconsistent use of "/" will make parsing hard.
> 
> Do you mean, you don't want to see multiple "/"?

No. I think that having a consistent number of "/" will be easier to
parse. In the above example, there are instances of 1, 2, as well as
three "/" among the lines. That seems complicated to parse.

I was thinking that it will make interpreting and parsing easier if there
consistently are just always two "/".

(You may find things to be different once you work on the parsing code
though.)

In summary:
* for monitoring of default CTRL_MON group: "//<flags>"
* for MON_GROUP inside default CTRL_MON group: "/<MON group>/<flags>"
* for monitoring of non-default CTRL_MON group: "<CTRL_MON group>//flags"
* for MON_GROUP within CTRL_MON group: "<CTRL_MON group>/<MON group>/<flags>"

What do you think?

> 
> resctrl/ctrl_a/mon_ab/
> 
> Change to
> 
> mon_ab/

rather:
ctrl_a/mon_ab/<flags>

> 
>>
>> I find "resctrl" and all the "none" redundant. It is not clear what
>> this improves.
>> Why have:
>> resctrl/00=none,none;01=none,none
>> when this could do:
>> //00=_;01=_
> 
> ok.
> 
> "//" meaning root of resctrl filesystem?

More specifically, monitoring of default control group. It is not intended to
specify a pathname.

>>> When some counters are assigned:
>>>
>>> $echo "resctrl/00=total,local" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to
>>> default group)
>>>
>>> $echo "resctrl/mon_a/00=total;01=total" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon
>>> group)
>>>
>>> $echo "resctrl/ctrl_a/00=local;01=local" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" >
>>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
>>>
>>
>> We could learn some more lessons from dynamic debug (see Documentation/admin-guide/dynamic-debug-howto.rst). For example, "=" can be used to make an assignment while "+"
>> can be used to add a counter and "-" can be used to remove a counter.
>> "=_" can be used to remove counters from all events in that domain.
> 
> Yes. Looked at dynamic debug. I am still learning this interface. Some examples below based on my understanding.
> 
> To assign a counters to default group on domain 0.
> $echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

It should not be necessary to use both "=" and "+" in the same assignment.
I think of "=" as "assign" and "+" as append ("-" as remove).

An example of this, just focusing on default group.

#hypothetical start state of no counters assigned
$ cat mbm_assign_control 
#control_group/monitor_group/flags
//00=_;01=_

#assign counter to total MBM of both domains
$ echo "//00=t;01=t" > mbm_assign_control 
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=t;01=t

#add counter to local MBM of both domains without impacting total MBM counters
$echo "//00+l;01+l" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=tl;01=tl

#remove local MBM counters without impacting total MBM counters
$echo "//00-l;01-l" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=t;01=t

#assign local MBM counters, removing total MBM counters while doing so
$echo "//00=l;01=l" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=l;01=l

#remove all counters
$echo "//00=_;01=_" > mbm_assign_control
$ cat mbm_assign_control
#control_group/monitor_group/flags
//00=_;01=_


> 
> To assign a counters to mon group inside the default group.
> $echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

I think it will simplify parsing if number of "/" is consistent:
$echo "/mon_a/00=t;01=t" > ...

> 
> To assign a counters to control mon group inside the default group.

It is not clear to me what you mean with this.

> $echo "ctrl_a/00=+l;01=+l"  > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

this looks similar to previous example, so I think it will be hard for parser
to know whether it is dealing with control group or monitor group.
I am not sure I understand your example, but this may perhaps be:

echo "ctrl_a//00=l;01=l > ...

> 
> To assign a counters to control mon group inside another control group.

I do not know what you mean with "another control group" 

> $echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro

How will parser know which control group? I was expecting:
$ echo "<CTRL_MON group>/<MON group>/<flags>"

> 
> To unassign a counters to control mon group inside another control group.
> $echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
> 
> To unassign all the counters on a specific group.
> $echo "mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control
> 
> It does not matter control group or mon group. We just need to name of the group in this interface.

It matters because users can have monitor groups with the same name within
different control groups.

Reinette