[v2,2/2] x86/resctrl: Add tracepoint for llc_occupancy tracking

Message ID 20240221092101.90740-3-haifeng.xu@shopee.com
State New
Series Track llc_occupancy of RMIDs in limbo list

Commit Message

Haifeng Xu Feb. 21, 2024, 9:21 a.m. UTC
In our production environment, after removing monitor groups, the unused
RMIDs get stuck in the limbo list forever because their llc_occupancy is
always larger than the threshold. The unused RMIDs can, however, be
successfully freed by raising the threshold.

To determine how high the threshold should be set, the following steps can
be taken to acquire the llc_occupancy of RMIDs in each rdt domain:

1) perf probe -a '__rmid_read eventid rmid'
   perf probe -a '__rmid_read%return $retval'
2) perf record -e probe:__rmid_read -e probe:__rmid_read__return -aR sleep 10
3) perf script > __rmid_read.txt
4) cat __rmid_read.txt | grep "eventid=0x1 " -A 1 | grep "kworker" > llc_occupancy.txt

Instead of using the perf tool to track llc_occupancy and filtering the log
manually, it is more convenient for users to use a tracepoint for this work.
So add a new tracepoint that shows the llc_occupancy of busy RMIDs when
scanning the limbo list.

Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c |  2 ++
 arch/x86/kernel/cpu/resctrl/trace.h   | 13 +++++++++++++
 2 files changed, 15 insertions(+)
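
As a rough illustration of how the tracepoint output could replace the manual
grep filtering described above, here is a minimal user-space sketch that parses
event lines and reports the largest occupancy seen per domain. The field layout
is taken from the TP_printk() format in this patch. The helper names
(parse_limbo_event, max_occupancy) and the idea of using the maximum as a guide
for the threshold are assumptions made for illustration only, not part of the
patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One decoded mon_llc_occupancy_limbo event (names follow TP_printk). */
struct limbo_event {
	unsigned int mon_hw_id;
	int domain;
	unsigned long long llc_occupancy;
};

/*
 * Parse the payload of one trace line. Hypothetical helper: returns 0 on
 * success, -1 if the line does not carry the expected three fields.
 */
int parse_limbo_event(const char *line, struct limbo_event *ev)
{
	const char *p;

	assert(line && ev);

	/* Skip the task/CPU/timestamp prefix that the tracer prepends. */
	p = strstr(line, "mon_hw_id=");
	if (!p)
		return -1;

	if (sscanf(p, "mon_hw_id=%u domain=%d llc_occupancy=%llu",
		   &ev->mon_hw_id, &ev->domain, &ev->llc_occupancy) != 3)
		return -1;

	return 0;
}

/* Largest occupancy among the lines for one domain; 0 if none matched. */
unsigned long long max_occupancy(const char *const *lines, size_t n, int domain)
{
	unsigned long long max = 0;
	struct limbo_event ev;
	size_t i;

	for (i = 0; i < n; i++) {
		if (parse_limbo_event(lines[i], &ev) == 0 &&
		    ev.domain == domain && ev.llc_occupancy > max)
			max = ev.llc_occupancy;
	}

	return max;
}
```

Feeding the function lines read from the trace buffer would give, per domain,
the occupancy of the busiest RMID still in limbo, which is one possible input
when choosing a threshold value.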
  

Comments

Reinette Chatre Feb. 23, 2024, 7:41 p.m. UTC | #1
(+James)

Hi Haifeng and James,

On 2/21/2024 1:21 AM, Haifeng Xu wrote:
> In our production environment, after removing monitor groups, those unused
> RMIDs get stuck in the limbo list forever because their llc_occupancy are
> always larger than the threshold. But the unused RMIDs can be successfully
> freed by turning up the threshold.
> 
> In order to know how much the threshold should be, the following steps can
> be taken to acquire the llc_occupancy of RMIDs in each rdt domain:
> 
> 1) perf probe -a '__rmid_read eventid rmid'
>    perf probe -a '__rmid_read%return $retval'
> 2) perf record -e probe:__rmid_read -e probe:__rmid_read__return -aR sleep 10
> 3) perf script > __rmid_read.txt
> 4) cat __rmid_read.txt | grep "eventid=0x1 " -A 1 | grep "kworker" > llc_occupnacy.txt
> 

The details on how perf can be used were useful during the discussion of this
work but can be omitted from this changelog.

> Instead of using perf tool to track llc_occupancy and filter the log manually,
> it is more convenient for users to use tracepoint to do this work. So add a new
> tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
> list.
> 
> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/monitor.c |  2 ++
>  arch/x86/kernel/cpu/resctrl/trace.h   | 13 +++++++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..1533b1932b49 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -23,6 +23,7 @@
>  #include <asm/resctrl.h>
>  
>  #include "internal.h"
> +#include "trace.h"
>  
>  struct rmid_entry {
>  	u32				rmid;
> @@ -302,6 +303,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>  			}
>  		}
>  		crmid = nrmid + 1;
> +		trace_mon_llc_occupancy_limbo(nrmid, d->id, val);

This area recently received some changes (you can find the latest on the
x86/cache branch of the tip repo). Please see [1] for a good
description of the new "index". For this tracing to be useful to MPAM
I thus expect that the tracepoint will need to print the MPAM equivalent
to CLOSID, the PARTID. We can refer to this CLOSID/PARTID value as
"ctrl_hw_id".

This snippet can then change to use the new resctrl_arch_rmid_idx_decode()
to learn the "ctrl_hw_id" and "mon_hw_id" and print it as part of
tracepoint:
"ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu"

This will be filesystem code so it cannot know how an architecture
treats these numbers. Consequently, this may look strange to x86 users
when ctrl_hw_id will always be X86_RESCTRL_EMPTY_CLOSID ... but it should
be clear that it is invalid? 

James, what do you think? Any thoughts on how MPAM will use the limbo handler
to understand what information can be useful to the user here?

Reinette

[1] https://lore.kernel.org/lkml/20240213184438.16675-7-james.morse@arm.com/
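
To make the x86 concern above concrete, here is a small user-space model of
what the proposed output might look like. It assumes that the x86 flavour of
resctrl_arch_rmid_idx_decode() returns the index unchanged as the RMID and
reports X86_RESCTRL_EMPTY_CLOSID, taken here to be ((u32)~0), as the CLOSID.
The function names rmid_idx_decode_x86 and format_limbo_event are made up for
this sketch; only the format string comes from the review.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: mirrors the x86 definition, i.e. "no CLOSID in the index". */
#define X86_RESCTRL_EMPTY_CLOSID ((uint32_t)~0)

/* User-space model of the assumed x86 behaviour: the index is the RMID. */
void rmid_idx_decode_x86(uint32_t idx, uint32_t *closid, uint32_t *rmid)
{
	*rmid = idx;
	*closid = X86_RESCTRL_EMPTY_CLOSID;
}

/*
 * Format one event with the format string proposed in the review.
 * Returns the value of snprintf(), i.e. the length that was needed.
 */
int format_limbo_event(char *buf, size_t len, uint32_t idx, int domain,
		       unsigned long long occupancy)
{
	uint32_t closid, rmid;

	assert(buf && len > 0);

	rmid_idx_decode_x86(idx, &closid, &rmid);

	return snprintf(buf, len,
			"ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu",
			(unsigned int)closid, (unsigned int)rmid, domain,
			occupancy);
}
```

Under these assumptions an x86 event for index 5 in domain 0 would print
ctrl_hw_id=4294967295, which is the "strange but clearly invalid" value the
paragraph above refers to.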
  
Haifeng Xu Feb. 29, 2024, 3:16 a.m. UTC | #2
On 2024/2/24 03:41, Reinette Chatre wrote:
> (+James)
> 
> Hi Haifeng and James,
> 
> On 2/21/2024 1:21 AM, Haifeng Xu wrote:
>> In our production environment, after removing monitor groups, those unused
>> RMIDs get stuck in the limbo list forever because their llc_occupancy are
>> always larger than the threshold. But the unused RMIDs can be successfully
>> freed by turning up the threshold.
>>
>> In order to know how much the threshold should be, the following steps can
>> be taken to acquire the llc_occupancy of RMIDs in each rdt domain:
>>
>> 1) perf probe -a '__rmid_read eventid rmid'
>>    perf probe -a '__rmid_read%return $retval'
>> 2) perf record -e probe:__rmid_read -e probe:__rmid_read__return -aR sleep 10
>> 3) perf script > __rmid_read.txt
>> 4) cat __rmid_read.txt | grep "eventid=0x1 " -A 1 | grep "kworker" > llc_occupnacy.txt
>>
> 
> The details on how perf can be used was useful during the discussion of this
> work but can be omitted from this changelog.
 
Got it.

> 
>> Instead of using perf tool to track llc_occupancy and filter the log manually,
>> it is more convenient for users to use tracepoint to do this work. So add a new
>> tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
>> list.
>>
>> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
>> Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
>> ---
>>  arch/x86/kernel/cpu/resctrl/monitor.c |  2 ++
>>  arch/x86/kernel/cpu/resctrl/trace.h   | 13 +++++++++++++
>>  2 files changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index f136ac046851..1533b1932b49 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -23,6 +23,7 @@
>>  #include <asm/resctrl.h>
>>  
>>  #include "internal.h"
>> +#include "trace.h"
>>  
>>  struct rmid_entry {
>>  	u32				rmid;
>> @@ -302,6 +303,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>>  			}
>>  		}
>>  		crmid = nrmid + 1;
>> +		trace_mon_llc_occupancy_limbo(nrmid, d->id, val);
> 
> This area recently received some changes (you can find the latest on the
> x86/cache branch of the tip repo). Please see [1] for a good
> description of the new "index". For this tracing to be useful to MPAM
> I thus expect that the tracepoint will need to print the MPAM equivalent
> to CLOSID, the PARTID. We can refer to this CLOSID/PARTID value as
> "ctrl_hw_id".
> 
> This snippet can then change to use the new resctrl_arch_rmid_idx_decode()
> to learn the "ctrl_hw_id" and "mon_hw_id" and print it as part of
> tracepoint:
> "ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu"

OK, I'll post a new patch based on tip repo.

> 
> This will be filesystem code so it cannot know how an architecture
> treats these numbers. Consequently, this may look strange to x86 users
> when ctrl_hw_id will always be X86_RESCTRL_EMPTY_CLOSID ... but it should
> be clear that it is invalid? 
> 
> James, what do you think? Any thoughts on how MPAM will use the limbo handler
> to understand what information can be useful to the user here?
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/20240213184438.16675-7-james.morse@arm.com/
  
James Morse March 1, 2024, 5:47 p.m. UTC | #3
Hi Reinette,

On 23/02/2024 19:41, Reinette Chatre wrote:
> On 2/21/2024 1:21 AM, Haifeng Xu wrote:
>> In our production environment, after removing monitor groups, those unused
>> RMIDs get stuck in the limbo list forever because their llc_occupancy are
>> always larger than the threshold. But the unused RMIDs can be successfully
>> freed by turning up the threshold.
>>
>> In order to know how much the threshold should be, the following steps can
>> be taken to acquire the llc_occupancy of RMIDs in each rdt domain:
>>
>> 1) perf probe -a '__rmid_read eventid rmid'
>>    perf probe -a '__rmid_read%return $retval'
>> 2) perf record -e probe:__rmid_read -e probe:__rmid_read__return -aR sleep 10
>> 3) perf script > __rmid_read.txt
>> 4) cat __rmid_read.txt | grep "eventid=0x1 " -A 1 | grep "kworker" > llc_occupnacy.txt

Ah, this ftrace trickery. It wouldn't be portable - I agree a tracepoint is much better!


> The details on how perf can be used was useful during the discussion of this
> work but can be omitted from this changelog.
> 
>> Instead of using perf tool to track llc_occupancy and filter the log manually,
>> it is more convenient for users to use tracepoint to do this work. So add a new
>> tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
>> list.

>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index f136ac046851..1533b1932b49 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -23,6 +23,7 @@
>>  #include <asm/resctrl.h>
>>  
>>  #include "internal.h"
>> +#include "trace.h"
>>  
>>  struct rmid_entry {
>>  	u32				rmid;
>> @@ -302,6 +303,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>>  			}
>>  		}
>>  		crmid = nrmid + 1;
>> +		trace_mon_llc_occupancy_limbo(nrmid, d->id, val);

> This area recently received some changes (you can find the latest on the
> x86/cache branch of the tip repo). Please see [1] for a good
> description of the new "index". For this tracing to be useful to MPAM
> I thus expect that the tracepoint will need to print the MPAM equivalent
> to CLOSID, the PARTID. We can refer to this CLOSID/PARTID value as
> "ctrl_hw_id".
> 
> This snippet can then change to use the new resctrl_arch_rmid_idx_decode()
> to learn the "ctrl_hw_id" and "mon_hw_id" and print it as part of
> tracepoint:
> "ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu"
> 
> This will be filesystem code so it cannot know how an architecture
> treats these numbers. Consequently, this may look strange to x86 users
> when ctrl_hw_id will always be X86_RESCTRL_EMPTY_CLOSID ... but it should
> be clear that it is invalid? 


> James, what do you think? Any thoughts on how MPAM will use the limbo handler
> to understand what information can be useful to the user here?

Initially it will be exactly the same, and this certainly works. I agree outputting both
the CLOSID and RMID (with more portable names) is the right thing to do.

I'll reply in more detail on what appears to be v3.
lore.kernel.org/r/20240229071125.100991-1-haifeng.xu@shopee.com


Thanks,

James
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..1533b1932b49 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -23,6 +23,7 @@ 
 #include <asm/resctrl.h>
 
 #include "internal.h"
+#include "trace.h"
 
 struct rmid_entry {
 	u32				rmid;
@@ -302,6 +303,7 @@  void __check_limbo(struct rdt_domain *d, bool force_free)
 			}
 		}
 		crmid = nrmid + 1;
+		trace_mon_llc_occupancy_limbo(nrmid, d->id, val);
 	}
 }
 
diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h
index 495fb90c8572..4bf95b7b4db8 100644
--- a/arch/x86/kernel/cpu/resctrl/trace.h
+++ b/arch/x86/kernel/cpu/resctrl/trace.h
@@ -35,6 +35,19 @@  TRACE_EVENT(pseudo_lock_l3,
 	    TP_printk("hits=%llu miss=%llu",
 		      __entry->l3_hits, __entry->l3_miss));
 
+TRACE_EVENT(mon_llc_occupancy_limbo,
+	    TP_PROTO(u32 mon_hw_id, int id, u64 occupancy),
+	    TP_ARGS(mon_hw_id, id, occupancy),
+	    TP_STRUCT__entry(__field(u32, mon_hw_id)
+			     __field(int, id)
+			     __field(u64, occupancy)),
+	    TP_fast_assign(__entry->mon_hw_id = mon_hw_id;
+			   __entry->id = id;
+			   __entry->occupancy = occupancy;),
+	    TP_printk("mon_hw_id=%u domain=%d llc_occupancy=%llu",
+		      __entry->mon_hw_id, __entry->id, __entry->occupancy)
+	   );
+
 #endif /* _TRACE_RESCTRL_H */
 
 #undef TRACE_INCLUDE_PATH