x86/resctrl: Implement new MBA_mbps throttling heuristic

Message ID 20240109220003.29225-1-tony.luck@intel.com
State New
Headers
Series x86/resctrl: Implement new MBA_mbps throttling heuristic |

Commit Message

Luck, Tony Jan. 9, 2024, 10 p.m. UTC
  The MBA_mbps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.

To avoid possibly stepping throttling up and down on every poll a
flag "delta_comp" is set whenever throttling is changed to indicate
that the actual change in bandwidth should be recorded on the next
poll in "delta_bw". Throttling is only reduced if the current bandwidth
plus delta_bw is below the user target.

This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plu delta_bw is above the target set by the user.

Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.

Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

This patch was previously posted in:

  https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t

as part of a proposal to allow the mba_MBps mount option to base its
feedback loop input on total bandwidth instead of local bandwidth.
Sending it now as a standalone patch because Xiaochen reported that
real systems have experienced problems when delta_bw is incorrectly
calculated.

 arch/x86/kernel/cpu/resctrl/internal.h |  4 ---
 arch/x86/kernel/cpu/resctrl/monitor.c  | 42 ++++++--------------------
 2 files changed, 10 insertions(+), 36 deletions(-)


base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
  

Comments

Reinette Chatre Jan. 16, 2024, 7:55 p.m. UTC | #1
Hi Xiaochen,

On 1/9/2024 2:00 PM, Tony Luck wrote:
> The MBA_mbps feedback loop increases throttling when a group is using
> more bandwidth than the target set by the user in the schemata file, and
> decreases throttling when below target.
> 
> To avoid possibly stepping throttling up and down on every poll a
> flag "delta_comp" is set whenever throttling is changed to indicate
> that the actual change in bandwidth should be recorded on the next
> poll in "delta_bw". Throttling is only reduced if the current bandwidth
> plus delta_bw is below the user target.
> 
> This algorithm works well if the workload has steady bandwidth needs.
> But it can go badly wrong if the workload moves to a different phase
> just as the throttling level changed. E.g. if the workload becomes
> essentially idle right as throttling level is increased, the value
> calculated for delta_bw will be more or less the old bandwidth level.
> If the workload then resumes, Linux may never reduce throttling because
> current bandwidth plu delta_bw is above the target set by the user.
> 
> Implement a simpler heuristic by assuming that in the worst case the
> currently measured bandwidth is being controlled by the current level of
> throttling. Compute how much it may increase if throttling is relaxed to
> the next higher level. If that is still below the user target, then it
> is ok to reduce the amount of throttling.
> 
> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
> Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> This patch was previously posted in:
> 
>   https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t
> 
> as part of a proposal to allow the mba_MBps mount option to base its
> feedback loop input on total bandwidth instead of local bandwidth.
> Sending it now as a standalone patch because Xiaochen reported that
> real systems have experienced problems when delta_bw is incorrectly
> calculated.
> 

Does this new heuristic fix the problems you observe? If so, would you be
willing to provide a "Tested-by" tag?

Thank you.

Reinette
  
Xiaochen Shen Jan. 17, 2024, 3:36 a.m. UTC | #2
Hi Reinette and Tony,

On 1/17/2024 3:55, Reinette Chatre wrote:
> Hi Xiaochen,
>
> On 1/9/2024 2:00 PM, Tony Luck wrote:
>> The MBA_mbps feedback loop increases throttling when a group is using
>> more bandwidth than the target set by the user in the schemata file, and
>> decreases throttling when below target.
>>
>> To avoid possibly stepping throttling up and down on every poll a
>> flag "delta_comp" is set whenever throttling is changed to indicate
>> that the actual change in bandwidth should be recorded on the next
>> poll in "delta_bw". Throttling is only reduced if the current bandwidth
>> plus delta_bw is below the user target.
>>
>> This algorithm works well if the workload has steady bandwidth needs.
>> But it can go badly wrong if the workload moves to a different phase
>> just as the throttling level changed. E.g. if the workload becomes
>> essentially idle right as throttling level is increased, the value
>> calculated for delta_bw will be more or less the old bandwidth level.
>> If the workload then resumes, Linux may never reduce throttling because
>> current bandwidth plu delta_bw is above the target set by the user.
>>
>> Implement a simpler heuristic by assuming that in the worst case the
>> currently measured bandwidth is being controlled by the current level of
>> throttling. Compute how much it may increase if throttling is relaxed to
>> the next higher level. If that is still below the user target, then it
>> is ok to reduce the amount of throttling.
>>
>> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
>> Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>> ---
>>
>> This patch was previously posted in:
>>
>>    https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t
>>
>> as part of a proposal to allow the mba_MBps mount option to base its
>> feedback loop input on total bandwidth instead of local bandwidth.
>> Sending it now as a standalone patch because Xiaochen reported that
>> real systems have experienced problems when delta_bw is incorrectly
>> calculated.
>>
> Does this new heuristic fix the problems you observe? If so, would you be
> willing to provide a "Tested-by" tag?

Yes. The patch fixed the problem. I will reply to the original thread to 
add "Tested-by" tag.
Thank you very much for help!

>
> Thank you.
>
> Reinette

Best regards,
Xiaochen
  
Xiaochen Shen Jan. 17, 2024, 3:40 a.m. UTC | #3
Tested-by: Xiaochen Shen <xiaochen.shen@intel.com>

On 1/10/2024 6:00, Tony Luck wrote:
> The MBA_mbps feedback loop increases throttling when a group is using
> more bandwidth than the target set by the user in the schemata file, and
> decreases throttling when below target.
>
> To avoid possibly stepping throttling up and down on every poll a
> flag "delta_comp" is set whenever throttling is changed to indicate
> that the actual change in bandwidth should be recorded on the next
> poll in "delta_bw". Throttling is only reduced if the current bandwidth
> plus delta_bw is below the user target.
>
> This algorithm works well if the workload has steady bandwidth needs.
> But it can go badly wrong if the workload moves to a different phase
> just as the throttling level changed. E.g. if the workload becomes
> essentially idle right as throttling level is increased, the value
> calculated for delta_bw will be more or less the old bandwidth level.
> If the workload then resumes, Linux may never reduce throttling because
> current bandwidth plu delta_bw is above the target set by the user.
>
> Implement a simpler heuristic by assuming that in the worst case the
> currently measured bandwidth is being controlled by the current level of
> throttling. Compute how much it may increase if throttling is relaxed to
> the next higher level. If that is still below the user target, then it
> is ok to reduce the amount of throttling.
>
> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
> Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>
> This patch was previously posted in:
>
>    https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t
>
> as part of a proposal to allow the mba_MBps mount option to base its
> feedback loop input on total bandwidth instead of local bandwidth.
> Sending it now as a standalone patch because Xiaochen reported that
> real systems have experienced problems when delta_bw is incorrectly
> calculated.
>
>   arch/x86/kernel/cpu/resctrl/internal.h |  4 ---
>   arch/x86/kernel/cpu/resctrl/monitor.c  | 42 ++++++--------------------
>   2 files changed, 10 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..71bbd2245cc7 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -296,14 +296,10 @@ struct rftype {
>    * struct mbm_state - status for each MBM counter in each domain
>    * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
>    * @prev_bw:	The most recent bandwidth in MBps
> - * @delta_bw:	Difference between the current and previous bandwidth
> - * @delta_comp:	Indicates whether to compute the delta_bw
>    */
>   struct mbm_state {
>   	u64	prev_bw_bytes;
>   	u32	prev_bw;
> -	u32	delta_bw;
> -	bool	delta_comp;
>   };
>   
>   /**
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..1961823b555b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>   
>   	cur_bw = bytes / SZ_1M;
>   
> -	if (m->delta_comp)
> -		m->delta_bw = abs(cur_bw - m->prev_bw);
> -	m->delta_comp = false;
>   	m->prev_bw = cur_bw;
>   }
>   
> @@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>   {
>   	u32 closid, rmid, cur_msr_val, new_msr_val;
>   	struct mbm_state *pmbm_data, *cmbm_data;
> -	u32 cur_bw, delta_bw, user_bw;
> +	u32 cur_bw, user_bw;
>   	struct rdt_resource *r_mba;
>   	struct rdt_domain *dom_mba;
>   	struct list_head *head;
> @@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>   
>   	cur_bw = pmbm_data->prev_bw;
>   	user_bw = dom_mba->mbps_val[closid];
> -	delta_bw = pmbm_data->delta_bw;
>   
>   	/* MBA resource doesn't support CDP */
>   	cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
> @@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>   	list_for_each_entry(entry, head, mon.crdtgrp_list) {
>   		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
>   		cur_bw += cmbm_data->prev_bw;
> -		delta_bw += cmbm_data->delta_bw;
>   	}
>   
>   	/*
>   	 * Scale up/down the bandwidth linearly for the ctrl group.  The
>   	 * bandwidth step is the bandwidth granularity specified by the
>   	 * hardware.
> -	 *
> -	 * The delta_bw is used when increasing the bandwidth so that we
> -	 * dont alternately increase and decrease the control values
> -	 * continuously.
> -	 *
> -	 * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
> -	 * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
> -	 * switching between 90 and 110 continuously if we only check
> -	 * cur_bw < user_bw.
> +	 * Always increase throttling if current bandwidth is above the
> +	 * target set by user.
> +	 * But avoid thrashing up and down on every poll by checking
> +	 * whether a decrease in throttling is likely to push the group
> +	 * back over target. E.g. if currently throttling to 30% of bandwidth
> +	 * on a system with 10% granularity steps, check whether moving to
> +	 * 40% would go past the limit by multiplying current bandwidth by
> +	 * "(30 + 10) / 30".
>   	 */
>   	if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
>   		new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
>   	} else if (cur_msr_val < MAX_MBA_BW &&
> -		   (user_bw > (cur_bw + delta_bw))) {
> +		   (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
>   		new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
>   	} else {
>   		return;
>   	}
>   
>   	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
> -
> -	/*
> -	 * Delta values are updated dynamically package wise for each
> -	 * rdtgrp every time the throttle MSR changes value.
> -	 *
> -	 * This is because (1)the increase in bandwidth is not perfectly
> -	 * linear and only "approximately" linear even when the hardware
> -	 * says it is linear.(2)Also since MBA is a core specific
> -	 * mechanism, the delta values vary based on number of cores used
> -	 * by the rdtgrp.
> -	 */
> -	pmbm_data->delta_comp = true;
> -	list_for_each_entry(entry, head, mon.crdtgrp_list) {
> -		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> -		cmbm_data->delta_comp = true;
> -	}
>   }
>   
>   static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
>
> base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a

Best regards,
Xiaochen
  
Reinette Chatre Jan. 18, 2024, 12:26 a.m. UTC | #4
Hi Tony,

Please note in the subject and changelog that the mount option
is mba_MBps. The subject and first sentence of changelog seems
to fuse the mount option with the software controller enabled by it,
but I do not think doing so dilutes the message.

On 1/9/2024 2:00 PM, Tony Luck wrote:
> The MBA_mbps feedback loop increases throttling when a group is using
> more bandwidth than the target set by the user in the schemata file, and
> decreases throttling when below target.
> 
> To avoid possibly stepping throttling up and down on every poll a
> flag "delta_comp" is set whenever throttling is changed to indicate
> that the actual change in bandwidth should be recorded on the next
> poll in "delta_bw". Throttling is only reduced if the current bandwidth
> plus delta_bw is below the user target.
> 
> This algorithm works well if the workload has steady bandwidth needs.
> But it can go badly wrong if the workload moves to a different phase
> just as the throttling level changed. E.g. if the workload becomes
> essentially idle right as throttling level is increased, the value
> calculated for delta_bw will be more or less the old bandwidth level.
> If the workload then resumes, Linux may never reduce throttling because
> current bandwidth plu delta_bw is above the target set by the user.

"plu" -> "plus"

> 
> Implement a simpler heuristic by assuming that in the worst case the
> currently measured bandwidth is being controlled by the current level of
> throttling. Compute how much it may increase if throttling is relaxed to
> the next higher level. If that is still below the user target, then it
> is ok to reduce the amount of throttling.
> 
> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
> Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---

Thank you very much for catching and fixing this.

With the issues ("MBA_mbps" -> "mba_MBps", "plu" -> "plus") addressed:

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..71bbd2245cc7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -296,14 +296,10 @@  struct rftype {
  * struct mbm_state - status for each MBM counter in each domain
  * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
  * @prev_bw:	The most recent bandwidth in MBps
- * @delta_bw:	Difference between the current and previous bandwidth
- * @delta_comp:	Indicates whether to compute the delta_bw
  */
 struct mbm_state {
 	u64	prev_bw_bytes;
 	u32	prev_bw;
-	u32	delta_bw;
-	bool	delta_comp;
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..1961823b555b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,9 +440,6 @@  static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
 
 	cur_bw = bytes / SZ_1M;
 
-	if (m->delta_comp)
-		m->delta_bw = abs(cur_bw - m->prev_bw);
-	m->delta_comp = false;
 	m->prev_bw = cur_bw;
 }
 
@@ -520,7 +517,7 @@  static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
-	u32 cur_bw, delta_bw, user_bw;
+	u32 cur_bw, user_bw;
 	struct rdt_resource *r_mba;
 	struct rdt_domain *dom_mba;
 	struct list_head *head;
@@ -543,7 +540,6 @@  static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 
 	cur_bw = pmbm_data->prev_bw;
 	user_bw = dom_mba->mbps_val[closid];
-	delta_bw = pmbm_data->delta_bw;
 
 	/* MBA resource doesn't support CDP */
 	cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
@@ -555,49 +551,31 @@  static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	list_for_each_entry(entry, head, mon.crdtgrp_list) {
 		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
 		cur_bw += cmbm_data->prev_bw;
-		delta_bw += cmbm_data->delta_bw;
 	}
 
 	/*
 	 * Scale up/down the bandwidth linearly for the ctrl group.  The
 	 * bandwidth step is the bandwidth granularity specified by the
 	 * hardware.
-	 *
-	 * The delta_bw is used when increasing the bandwidth so that we
-	 * dont alternately increase and decrease the control values
-	 * continuously.
-	 *
-	 * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
-	 * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
-	 * switching between 90 and 110 continuously if we only check
-	 * cur_bw < user_bw.
+	 * Always increase throttling if current bandwidth is above the
+	 * target set by user.
+	 * But avoid thrashing up and down on every poll by checking
+	 * whether a decrease in throttling is likely to push the group
+	 * back over target. E.g. if currently throttling to 30% of bandwidth
+	 * on a system with 10% granularity steps, check whether moving to
+	 * 40% would go past the limit by multiplying current bandwidth by
+	 * "(30 + 10) / 30".
 	 */
 	if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
 		new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
 	} else if (cur_msr_val < MAX_MBA_BW &&
-		   (user_bw > (cur_bw + delta_bw))) {
+		   (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
 		new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
 	} else {
 		return;
 	}
 
 	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
-
-	/*
-	 * Delta values are updated dynamically package wise for each
-	 * rdtgrp every time the throttle MSR changes value.
-	 *
-	 * This is because (1)the increase in bandwidth is not perfectly
-	 * linear and only "approximately" linear even when the hardware
-	 * says it is linear.(2)Also since MBA is a core specific
-	 * mechanism, the delta values vary based on number of cores used
-	 * by the rdtgrp.
-	 */
-	pmbm_data->delta_comp = true;
-	list_for_each_entry(entry, head, mon.crdtgrp_list) {
-		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
-		cmbm_data->delta_comp = true;
-	}
 }
 
 static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)