[v2] x86/resctrl: Implement new mba_MBps throttling heuristic

Message ID 20240118214213.59596-1-tony.luck@intel.com
State New
Headers
Series [v2] x86/resctrl: Implement new mba_MBps throttling heuristic |

Commit Message

Luck, Tony Jan. 18, 2024, 9:42 p.m. UTC
  The mba_MBps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.

To avoid possibly stepping throttling up and down on every poll a
flag "delta_comp" is set whenever throttling is changed to indicate
that the actual change in bandwidth should be recorded on the next
poll in "delta_bw". Throttling is only reduced if the current bandwidth
plus delta_bw is below the user target.

This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plus delta_bw is above the target set by the user.

Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.

Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Xiaochen Shen <xiaochen.shen@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---

Changes since v1:
Reinette: Subject & commit comment: ("MBA_mbps" -> "mba_MBps", "plu" -> "plus")

Added Xiaochen's Tested-by
Added Reinette's Reviewed-by

 arch/x86/kernel/cpu/resctrl/internal.h |  4 ---
 arch/x86/kernel/cpu/resctrl/monitor.c  | 42 ++++++--------------------
 2 files changed, 10 insertions(+), 36 deletions(-)


base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
  

Comments

Reinette Chatre Jan. 22, 2024, 5:34 p.m. UTC | #1
Hi Tony,

On 1/18/2024 1:42 PM, Tony Luck wrote:

> @@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>  {
>  	u32 closid, rmid, cur_msr_val, new_msr_val;
>  	struct mbm_state *pmbm_data, *cmbm_data;
> -	u32 cur_bw, delta_bw, user_bw;
> +	u32 cur_bw, user_bw;
>  	struct rdt_resource *r_mba;
>  	struct rdt_domain *dom_mba;
>  	struct list_head *head;

I am sorry Tony, this reverse-fir breakage slipped through. I only noticed it after
trying to see how best to present all the pending work to x86 maintainers. Could
you please send a fixup with this addressed? After that I am planning to propose its
inclusion among all the other pending resctrl fixes followed by the SNC work. The SNC
work will need a rebase starting with this snippet, unless you see an easier way forward?

Reinette
  
Luck, Tony Jan. 22, 2024, 6:07 p.m. UTC | #2
> I am sorry Tony, this reverse-fir breakage slipped through. I only noticed it after
> trying to see how best to present all the pending work to x86 maintainers. Could
> you please send a fixup with this addressed? After that I am planning to propose its
> inclusion among all the other pending resctrl fixes followed by the SNC work. The SNC
> work will need a rebase starting with this snippet, unless you see an easier way forward?

Reinette,

Thanks for catching this. I can send v3 of this patch right away (just fixing fir tree order only
needs a compile test, not extensive regression tests).

What are "all the other pending resctrl" fixes?  Since the SNC series is invasive to so many
files & functions, I'll need to re-base SNC on top of everything else that's happening this
cycle.

-Tony
  
Reinette Chatre Jan. 22, 2024, 6:18 p.m. UTC | #3
Hi Tony,

On 1/22/2024 10:07 AM, Luck, Tony wrote:
>> I am sorry Tony, this reverse-fir breakage slipped through. I only noticed it after
>> trying to see how best to present all the pending work to x86 maintainers. Could
>> you please send a fixup with this addressed? After that I am planning to propose its
>> inclusion among all the other pending resctrl fixes followed by the SNC work. The SNC
>> work will need a rebase starting with this snippet, unless you see an easier way forward?
> 
> Reinette,
> 
> Thanks for catching this. I can send v3 of this patch right away (just fixing fir tree order only
> needs a compile test, not extensive regression tests).

Thank you very much Tony.

> 
> What are "all the other pending resctrl" fixes?  Since the SNC series is invasive to so many
> files & functions, I'll need to re-base SNC on top of everything else that's happening this
> cycle.

The fixes I have as pending are:

[PATCH v2] x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
	https://lore.kernel.org/lkml/ZULCd%2FTGJL9Dmncf@agluck-desk3/

[PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit
	https://lore.kernel.org/lkml/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com/

[PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command
	https://lore.kernel.org/lkml/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com/

[PATCH v3] x86/resctrl: Implement new mba_MBps throttling heuristic
	https://lore.kernel.org/lkml/20240122180807.70518-1-tony.luck@intel.com/


Reinette
  
Borislav Petkov Jan. 22, 2024, 6:21 p.m. UTC | #4
On Mon, Jan 22, 2024 at 10:18:01AM -0800, Reinette Chatre wrote:
> The fixes I have as pending are:
> 
> [PATCH v2] x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
> 	https://lore.kernel.org/lkml/ZULCd%2FTGJL9Dmncf@agluck-desk3/
> 
> [PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit
> 	https://lore.kernel.org/lkml/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com/
> 
> [PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command
> 	https://lore.kernel.org/lkml/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com/
> 
> [PATCH v3] x86/resctrl: Implement new mba_MBps throttling heuristic
> 	https://lore.kernel.org/lkml/20240122180807.70518-1-tony.luck@intel.com/

Is that the order I should start queueing them in?

If not, pls send me a note and I can start picking up stuff.

Thx.
  
Reinette Chatre Jan. 22, 2024, 6:41 p.m. UTC | #5
Hi Boris,

On 1/22/2024 10:21 AM, Borislav Petkov wrote:
> On Mon, Jan 22, 2024 at 10:18:01AM -0800, Reinette Chatre wrote:
>> The fixes I have as pending are:
>>
>> [PATCH v2] x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
>> 	https://lore.kernel.org/lkml/ZULCd%2FTGJL9Dmncf@agluck-desk3/
>>
>> [PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit
>> 	https://lore.kernel.org/lkml/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com/
>>
>> [PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command
>> 	https://lore.kernel.org/lkml/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com/
>>
>> [PATCH v3] x86/resctrl: Implement new mba_MBps throttling heuristic
>> 	https://lore.kernel.org/lkml/20240122180807.70518-1-tony.luck@intel.com/
> 
> Is that the order I should start queueing them in?

Thank you for noting this exchange. Yes, this order applies cleanly.
This is the list of pending resctrl fixes. There is also one
pending resctrl feature, the SNC work [1], that Tony plans to rebase
on these fixes.

As a side note: I did have a conversation with James about the
MPAM work and his expectation is that the SNC work goes first.

> If not, pls send me a note and I can start picking up stuff.

I was working on a separate note to you but we can continue discussing
the pending resctrl work here.

Reinette

[1] https://lore.kernel.org/lkml/20231204185357.120501-1-tony.luck@intel.com/
  
Borislav Petkov Jan. 22, 2024, 6:47 p.m. UTC | #6
On Mon, Jan 22, 2024 at 10:41:01AM -0800, Reinette Chatre wrote:
> Thank you for noting this exchange. Yes, this order applies cleanly.
> This is the list of pending resctrl fixes.

Ok, if some of the fixes need to go to Linus now, lemme know.

> There is also one pending resctrl feature, the SNC work [1], that Tony
> plans to rebase on these fixes.

Ack.

> As a side note: I did have a conversation with James about the
> MPAM work and his expectation is that the SNC work goes first.

Ok.

> I was working on a separate note to you but we can continue discussing
> the pending resctrl work here.

Separate note is fine too - however you prefer. I'll start going through
those in the coming days and we can take it from there.

Thx.
  
Luck, Tony Jan. 22, 2024, 8:58 p.m. UTC | #7
>> There is also one pending resctrl feature, the SNC work [1], that Tony
>> plans to rebase on these fixes.
>
> Ack.

It looks like there is just one change in part 0004/0008 that doesn't apply automatically from git am.
Super easy to resolve.

I need to grab an SNC system to re-check that everything still works when re-based.
But right now, this looks like adding the SNC series will be easy (famous last words!).

-Tony
  
James Morse Jan. 23, 2024, 12:12 p.m. UTC | #8
Hi Tony,

On 22/01/2024 20:58, Luck, Tony wrote:
>>> There is also one pending resctrl feature, the SNC work [1], that Tony
>>> plans to rebase on these fixes.
>>
>> Ack.
> 
> It looks like there is just one change in part 0004/0008 that doesn't apply automatically from git am.
> Super easy to resolve.
> 
> I need to grab an SNC system to re-check that everything still works when re-based.
> But right now, this looks like adding the SNC series will be easy (famous last words!).

Once you're ready - can you point me at something I can use as a stable branch to rebase onto?


Thanks,

James
  
Luck, Tony Jan. 23, 2024, 5:07 p.m. UTC | #9
>> I need to grab an SNC system to re-check that everything still works when re-based.
>> But right now, this looks like adding the SNC series will be easy (famous last words!).
>
> Once you're ready - can you point me at something I can use as a stable branch to rebase onto?

I'll be running the tests this afternoon. But I'm somewhat confident that they will pass.

Tree is here: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git branch "hopeful_snc"

Commits are:

793635f10aeb x86/resctrl: Update documentation with Sub-NUMA cluster changes
dca7ba785d6f x86/resctrl: Sub NUMA Cluster detection and enable
e41bd88101c8 x86/resctrl: Introduce snc_nodes_per_l3_cache
ccbab7875197 x86/resctrl: Add node-scope to the options for feature scope
2f0db8c7072b x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
22697627cc5f x86/resctrl: Prepare for different scope for control/monitor operations
f3ff831a9042 x86/resctrl: Prepare to split rdt_domain structure
c0679868ee78 x86/resctrl: Prepare for new domain scope
327b4394f309 x86/resctrl: Implement new mba_MBps throttling heuristic
5b14817cf87e x86/resctrl: Read supported bandwidth sources using CPUID command
5699cd082e1f x86/resctrl: Remove hard-coded memory bandwidth limit
1b908debf53f x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()

Only the last of these (1b908debf53f) had been accepted by Boris into TIP x86/cache
when I cut this tree yesterday.

Two more applied by Boris today without any changes from what is in my tree, but the
commit IDs in TIP are obviously different from in my tree.

54e35eb8611c x86/resctrl: Read supported bandwidth sources from CPUID
0976783bb123 x86/resctrl: Remove hard-coded memory bandwidth limit

All the others have the normal risk that Boris will find some way to make them better.

-Tony
  
Luck, Tony Jan. 24, 2024, 12:29 a.m. UTC | #10
On Tue, Jan 23, 2024 at 05:07:50PM +0000, Luck, Tony wrote:
> >> I need to grab an SNC system to re-check that everything still works when re-based.
> >> But right now, this looks like adding the SNC series will be easy (famous last words!).
> >
> > Once you're ready - can you point me at something I can use as a stable branch to rebase onto?
> 
> I'll be running the tests this afternoon. But I'm somewhat confident that they will pass.
> 
> Tree is here: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git branch "hopeful_snc"
> 
> Commits are:
> 
> 793635f10aeb x86/resctrl: Update documentation with Sub-NUMA cluster changes
> dca7ba785d6f x86/resctrl: Sub NUMA Cluster detection and enable
> e41bd88101c8 x86/resctrl: Introduce snc_nodes_per_l3_cache
> ccbab7875197 x86/resctrl: Add node-scope to the options for feature scope
> 2f0db8c7072b x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
> 22697627cc5f x86/resctrl: Prepare for different scope for control/monitor operations
> f3ff831a9042 x86/resctrl: Prepare to split rdt_domain structure
> c0679868ee78 x86/resctrl: Prepare for new domain scope
> 327b4394f309 x86/resctrl: Implement new mba_MBps throttling heuristic
> 5b14817cf87e x86/resctrl: Read supported bandwidth sources using CPUID command
> 5699cd082e1f x86/resctrl: Remove hard-coded memory bandwidth limit
> 1b908debf53f x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
> 
> Only the last of these (1b908debf53f) had been accepted by Boris into TIP x86/cache
> when I cut this tree yesterday.
> 
> Two more applied by Boris today without any changes from what is in my tree, but the
> commit IDs in TIP are obviously different from in my tree.
> 
> 54e35eb8611c x86/resctrl: Read supported bandwidth sources from CPUID
> 0976783bb123 x86/resctrl: Remove hard-coded memory bandwidth limit
> 
> All the others have the normal risk that Boris will find some way to make them better.

All my SNC tests pass. So this tree is as good as I can make it.

Boris: After you take/reject "x86/resctrl: Implement new mba_MBps throttling heuristic"
in TIP x86/cache I will post v14 of the SNC series rebased to that TIP
branch.

-Tony
  
Luck, Tony Jan. 25, 2024, 5:29 p.m. UTC | #11
On Tue, Jan 23, 2024 at 04:29:42PM -0800, Tony Luck wrote:
> On Tue, Jan 23, 2024 at 05:07:50PM +0000, Luck, Tony wrote:
> > >> I need to grab an SNC system to re-check that everything still works when re-based.
> > >> But right now, this looks like adding the SNC series will be easy (famous last words!).
> > >
> > > Once you're ready - can you point me at something I can use as a stable branch to rebase onto?
> > 
> > I'll be running the tests this afternoon. But I'm somewhat confident that they will pass.
> > 
> > Tree is here: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git branch "hopeful_snc"
> > 
> > Commits are:
> > 
> > 793635f10aeb x86/resctrl: Update documentation with Sub-NUMA cluster changes
> > dca7ba785d6f x86/resctrl: Sub NUMA Cluster detection and enable
> > e41bd88101c8 x86/resctrl: Introduce snc_nodes_per_l3_cache
> > ccbab7875197 x86/resctrl: Add node-scope to the options for feature scope
> > 2f0db8c7072b x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
> > 22697627cc5f x86/resctrl: Prepare for different scope for control/monitor operations
> > f3ff831a9042 x86/resctrl: Prepare to split rdt_domain structure
> > c0679868ee78 x86/resctrl: Prepare for new domain scope
> > 327b4394f309 x86/resctrl: Implement new mba_MBps throttling heuristic
> > 5b14817cf87e x86/resctrl: Read supported bandwidth sources using CPUID command
> > 5699cd082e1f x86/resctrl: Remove hard-coded memory bandwidth limit
> > 1b908debf53f x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
> > 
> > Only the last of these (1b908debf53f) had been accepted by Boris into TIP x86/cache
> > when I cut this tree yesterday.
> > 
> > Two more applied by Boris today without any changes from what is in my tree, but the
> > commit IDs in TIP are obviously different from in my tree.
> > 
> > 54e35eb8611c x86/resctrl: Read supported bandwidth sources from CPUID
> > 0976783bb123 x86/resctrl: Remove hard-coded memory bandwidth limit
> > 
> > All the others have the normal risk that Boris will find some way to make them better.
> 
> All my SNC tests pass. So this tree is as good as I can make it.
> 
> Boris: After you take/reject "x86/resctrl: Implement new mba_MBps throttling heuristic"
> in TIP x86/cache I will post v14 of the SNC series rebased to that TIP
> branch.

Boris,

I got the tip bot message that you have applied:

  c2427e70c163 ("x86/resctrl: Implement new mba_MBps throttling heuristic")

I see that you also applied:

  fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

Does that empty your resctrl queue except for the SNC series? If it
does, then I'll rebase SNC patches onto TIP x86/cache.

-Tony
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..71bbd2245cc7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -296,14 +296,10 @@  struct rftype {
  * struct mbm_state - status for each MBM counter in each domain
  * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
  * @prev_bw:	The most recent bandwidth in MBps
- * @delta_bw:	Difference between the current and previous bandwidth
- * @delta_comp:	Indicates whether to compute the delta_bw
  */
 struct mbm_state {
 	u64	prev_bw_bytes;
 	u32	prev_bw;
-	u32	delta_bw;
-	bool	delta_comp;
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..1961823b555b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,9 +440,6 @@  static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
 
 	cur_bw = bytes / SZ_1M;
 
-	if (m->delta_comp)
-		m->delta_bw = abs(cur_bw - m->prev_bw);
-	m->delta_comp = false;
 	m->prev_bw = cur_bw;
 }
 
@@ -520,7 +517,7 @@  static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 {
 	u32 closid, rmid, cur_msr_val, new_msr_val;
 	struct mbm_state *pmbm_data, *cmbm_data;
-	u32 cur_bw, delta_bw, user_bw;
+	u32 cur_bw, user_bw;
 	struct rdt_resource *r_mba;
 	struct rdt_domain *dom_mba;
 	struct list_head *head;
@@ -543,7 +540,6 @@  static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 
 	cur_bw = pmbm_data->prev_bw;
 	user_bw = dom_mba->mbps_val[closid];
-	delta_bw = pmbm_data->delta_bw;
 
 	/* MBA resource doesn't support CDP */
 	cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
@@ -555,49 +551,31 @@  static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
 	list_for_each_entry(entry, head, mon.crdtgrp_list) {
 		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
 		cur_bw += cmbm_data->prev_bw;
-		delta_bw += cmbm_data->delta_bw;
 	}
 
 	/*
 	 * Scale up/down the bandwidth linearly for the ctrl group.  The
 	 * bandwidth step is the bandwidth granularity specified by the
 	 * hardware.
-	 *
-	 * The delta_bw is used when increasing the bandwidth so that we
-	 * dont alternately increase and decrease the control values
-	 * continuously.
-	 *
-	 * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
-	 * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
-	 * switching between 90 and 110 continuously if we only check
-	 * cur_bw < user_bw.
+	 * Always increase throttling if current bandwidth is above the
+	 * target set by user.
+	 * But avoid thrashing up and down on every poll by checking
+	 * whether a decrease in throttling is likely to push the group
+	 * back over target. E.g. if currently throttling to 30% of bandwidth
+	 * on a system with 10% granularity steps, check whether moving to
+	 * 40% would go past the limit by multiplying current bandwidth by
+	 * "(30 + 10) / 30".
 	 */
 	if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
 		new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
 	} else if (cur_msr_val < MAX_MBA_BW &&
-		   (user_bw > (cur_bw + delta_bw))) {
+		   (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
 		new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
 	} else {
 		return;
 	}
 
 	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
-
-	/*
-	 * Delta values are updated dynamically package wise for each
-	 * rdtgrp every time the throttle MSR changes value.
-	 *
-	 * This is because (1)the increase in bandwidth is not perfectly
-	 * linear and only "approximately" linear even when the hardware
-	 * says it is linear.(2)Also since MBA is a core specific
-	 * mechanism, the delta values vary based on number of cores used
-	 * by the rdtgrp.
-	 */
-	pmbm_data->delta_comp = true;
-	list_for_each_entry(entry, head, mon.crdtgrp_list) {
-		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
-		cmbm_data->delta_comp = true;
-	}
 }
 
 static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)