[15/17] cgroup/drm: Expose GPU utilisation

Message ID 20230712114605.519432-16-tvrtko.ursulin@linux.intel.com
State New
Headers
Series DRM cgroup controller with scheduling control and memory stats |

Commit Message

Tvrtko Ursulin July 12, 2023, 11:46 a.m. UTC
  From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

To support container use cases where external orchestrators want to make
deployment and migration decisions based on GPU load and capacity, we can
expose the GPU load as seen by the controller in a new drm.active_us
field. This field contains a monotonic cumulative time cgroup has spent
executing GPU loads, as reported by the DRM drivers being used by group
members.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  3 +++
 kernel/cgroup/drm.c                     | 26 ++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)
  

Comments

Tejun Heo July 21, 2023, 10:19 p.m. UTC | #1
On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
> +  drm.active_us
> +	GPU time used by the group recursively including all child groups.

Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
consistent with cpu side.

Thanks.
  
Tejun Heo July 21, 2023, 10:20 p.m. UTC | #2
On Fri, Jul 21, 2023 at 12:19:32PM -1000, Tejun Heo wrote:
> On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
> > +  drm.active_us
> > +	GPU time used by the group recursively including all child groups.
> 
> Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
> consistent with cpu side.

Also, shouldn't this be keyed by the drm device?
  
Tvrtko Ursulin July 25, 2023, 2:08 p.m. UTC | #3
On 21/07/2023 23:20, Tejun Heo wrote:
> On Fri, Jul 21, 2023 at 12:19:32PM -1000, Tejun Heo wrote:
>> On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
>>> +  drm.active_us
>>> +	GPU time used by the group recursively including all child groups.
>>
>> Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
>> consistent with cpu side.

Could be, but no strong opinion from my side either way. Perhaps it boils down to what could be put in the file, I mean to decide whether keyed format makes sense or not.
  
> Also, shouldn't this be keyed by the drm device?
  
It could have that too, or it could come later. Fun with GPUs that it not only could be keyed by the device, but also by the type of the GPU engine. (Which are a) vendor specific and b) some aree fully independent, some partially so, and some not at all - so it could get complicated semantics wise really fast.)

If for now I'd go with drm.stat/usage_usec containing the total time spent how would you suggest adding per device granularity? Files as documented are either flag or nested, not both at the same time. So something like:

usage_usec 100000
card0 usage_usec 50000
card1 usage_usec 50000

Would or would not fly? Have two files along the lines of drm.stat and drm.dev_stat?

While on this general topic, you will notice that for memory stats I have _sort of_ nested keyed per device format, for example on integrated Intel GPU:

   $ cat drm.memory.stat
   card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
   card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

If one a discrete Intel GPU two more lines would appear with memory regions of local and local-system. But then on some server class multi-tile GPUs even further regions with more than one device local memory region. And users do want to see this granularity for container use cases at least.

Anyway, this may not be compatible with the nested key format as documented in cgroup-v2.rst, although it does not explicitly say.

Should I cheat and create key names based on device and memory region name and let userspace parse it? Like:

   $ cat drm.memory.stat
   card0.system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
   card0.stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Regards,

Tvrtko
  
Tejun Heo July 25, 2023, 9:44 p.m. UTC | #4
Hello,

On Tue, Jul 25, 2023 at 03:08:40PM +0100, Tvrtko Ursulin wrote:
> > Also, shouldn't this be keyed by the drm device?
>
> It could have that too, or it could come later. Fun with GPUs that it not
> only could be keyed by the device, but also by the type of the GPU engine.
> (Which are a) vendor specific and b) some aree fully independent, some
> partially so, and some not at all - so it could get complicated semantics
> wise really fast.)

I see.

> If for now I'd go with drm.stat/usage_usec containing the total time spent
> how would you suggest adding per device granularity? Files as documented
> are either flag or nested, not both at the same time. So something like:
> 
> usage_usec 100000
> card0 usage_usec 50000
> card1 usage_usec 50000
> 
> Would or would not fly? Have two files along the lines of drm.stat and drm.dev_stat?

Please follow one of the pre-defined formats. If you want to have card
identifier and field key, it should be a nested keyed file like io.stat.

> While on this general topic, you will notice that for memory stats I have
> _sort of_ nested keyed per device format, for example on integrated Intel
> GPU:
> 
>   $ cat drm.memory.stat
>   card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>   card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
> 
> If one a discrete Intel GPU two more lines would appear with memory
> regions of local and local-system. But then on some server class
> multi-tile GPUs even further regions with more than one device local
> memory region. And users do want to see this granularity for container use
> cases at least.
> 
> Anyway, this may not be compatible with the nested key format as
> documented in cgroup-v2.rst, although it does not explicitly say.
> 
> Should I cheat and create key names based on device and memory region name
> and let userspace parse it? Like:
> 
>   $ cat drm.memory.stat
>   card0.system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>   card0.stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Yeah, this looks better to me. If they're reporting different values for the
same fields, they're separate keys.

Thanks.
  

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index da350858c59f..bbe986366f4a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2445,6 +2445,9 @@  will be respected.
 DRM scheduling soft limits interface files
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+  drm.active_us
+	GPU time used by the group recursively including all child groups.
+
   drm.weight
 	Standard cgroup weight based control [1, 10000] used to configure the
 	relative distributing of GPU time between the sibling groups.
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index b244e3d828cc..7c20d4ebc634 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -25,6 +25,8 @@  struct drm_cgroup_state {
 	bool over;
 	bool over_budget;
 
+	u64 total_us;
+
 	u64 per_s_budget_us;
 	u64 prev_active_us;
 	u64 active_us;
@@ -117,6 +119,20 @@  drmcs_write_weight(struct cgroup_subsys_state *css, struct cftype *cftype,
 	return 0;
 }
 
+static u64
+drmcs_read_total_us(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+	u64 val;
+
+	/* Mutex being overkill unless arch cannot atomically read u64.. */
+	mutex_lock(&drmcg_mutex);
+	val = drmcs->total_us;
+	mutex_unlock(&drmcg_mutex);
+
+	return val;
+}
+
 static bool __start_scanning(unsigned int period_us)
 {
 	struct drm_cgroup_state *root = &root_drmcs.drmcs;
@@ -169,11 +185,14 @@  static bool __start_scanning(unsigned int period_us)
 		parent = css_to_drmcs(node->parent);
 
 		active = drmcs_get_active_time_us(drmcs);
-		if (period_us && active > drmcs->prev_active_us)
+		if (period_us && active > drmcs->prev_active_us) {
 			drmcs->active_us += active - drmcs->prev_active_us;
+			drmcs->total_us += drmcs->active_us;
+		}
 		drmcs->prev_active_us = active;
 
 		parent->active_us += drmcs->active_us;
+		parent->total_us += drmcs->active_us;
 		parent->sum_children_weights += drmcs->weight;
 
 		css_put(node);
@@ -551,6 +570,11 @@  struct cftype files[] = {
 		.read_u64 = drmcs_read_weight,
 		.write_u64 = drmcs_write_weight,
 	},
+	{
+		.name = "active_us",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = drmcs_read_total_us,
+	},
 	{ } /* Zero entry terminates. */
 };