[16/17] cgroup/drm: Expose memory stats

Message ID 20230712114605.519432-17-tvrtko.ursulin@linux.intel.com
State New
Series DRM cgroup controller with scheduling control and memory stats

Commit Message

Tvrtko Ursulin July 12, 2023, 11:46 a.m. UTC
  From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

With a few DRM drivers exposing per client memory stats via the fdinfo
interface already, we can add support for exposing the same data (albeit
aggregated for cgroup hierarchies) via the drm cgroup controller.

Add some driver callbacks and controller code to use them, walking the
sub-tree, collating the numbers, and presenting them in a new file
named drm.memory.stat.

Example file content:

  $ cat drm.memory.stat
  card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
  card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Data is generated on demand for simplicity of implementation, i.e. no
running totals are kept or accounted during migrations and such. Various
optimisations such as cheaper collection of data are possible but
deliberately left out for now.

Overall, the feature is deemed to be useful to container orchestration
software (and manual management).

Limits, either soft or hard, are not envisaged to be implemented on top of
this approach due to the on-demand nature of collecting the stats.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  22 ++++
 include/drm/drm_drv.h                   |  61 ++++++++++
 kernel/cgroup/drm.c                     | 149 ++++++++++++++++++++++++
 3 files changed, 232 insertions(+)
  

Comments

Tejun Heo July 21, 2023, 10:21 p.m. UTC | #1
On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>   $ cat drm.memory.stat
>   card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>   card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
> 
> Data is generated on demand for simplicty of implementation ie. no running
> totals are kept or accounted during migrations and such. Various
> optimisations such as cheaper collection of data are possible but
> deliberately left out for now.
> 
> Overall, the feature is deemed to be useful to container orchestration
> software (and manual management).
> 
> Limits, either soft or hard, are not envisaged to be implemented on top of
> this approach due on demand nature of collecting the stats.

So, yeah, if you want to add memory controls, we better think through how
the fd ownership migration should work.

Thanks.
  
Maarten Lankhorst July 26, 2023, 10:14 a.m. UTC | #2
Hey,

On 2023-07-22 00:21, Tejun Heo wrote:
> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>    $ cat drm.memory.stat
>>    card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>>
>> Data is generated on demand for simplicty of implementation ie. no running
>> totals are kept or accounted during migrations and such. Various
>> optimisations such as cheaper collection of data are possible but
>> deliberately left out for now.
>>
>> Overall, the feature is deemed to be useful to container orchestration
>> software (and manual management).
>>
>> Limits, either soft or hard, are not envisaged to be implemented on top of
>> this approach due on demand nature of collecting the stats.
> 
> So, yeah, if you want to add memory controls, we better think through how
> the fd ownership migration should work.
I've taken a look at the series, since I have been working on cgroup 
memory eviction.

The scheduling stuff will work for i915, since it has a purely software 
execlist scheduler, but I don't think it will work for GuC (firmware) 
scheduling or other drivers that use the generic drm scheduler.

For something like this, you would probably want it to work inside the 
drm scheduler first. Presumably, this can be done by setting a weight on 
each runqueue, and perhaps adding a callback to update one for a running 
queue. Calculating the weights hierarchically might be fun..

I have taken a look at how the rest of cgroup controllers change 
ownership when moved to a different cgroup, and the answer was: not at 
all. If we attempt to create the scheduler controls only on the first 
time the fd is used, you could probably get rid of all the tracking.
This can be done very easily with the drm scheduler.

WRT memory, I think the consensus is to track system memory like normal 
memory. Stolen memory doesn't need to be tracked. It's kernel-only 
memory, used for internal bookkeeping only.

The only time userspace can directly manipulate stolen memory, is by 
mapping the pinned initial framebuffer to its own address space. The 
only allocation it can do is when a framebuffer is displayed, and 
framebuffer compression creates some stolen memory. Userspace is not
aware of this though, and has no way to manipulate those contents.

Cheers,
~Maarten
  
Tvrtko Ursulin July 26, 2023, 11:41 a.m. UTC | #3
On 26/07/2023 11:14, Maarten Lankhorst wrote:
> Hey,
> 
> On 2023-07-22 00:21, Tejun Heo wrote:
>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>>    $ cat drm.memory.stat
>>>    card0 region=system total=12898304 shared=0 active=0 
>>> resident=12111872 purgeable=167936
>>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0 
>>> purgeable=0
>>>
>>> Data is generated on demand for simplicty of implementation ie. no 
>>> running
>>> totals are kept or accounted during migrations and such. Various
>>> optimisations such as cheaper collection of data are possible but
>>> deliberately left out for now.
>>>
>>> Overall, the feature is deemed to be useful to container orchestration
>>> software (and manual management).
>>>
>>> Limits, either soft or hard, are not envisaged to be implemented on 
>>> top of
>>> this approach due on demand nature of collecting the stats.
>>
>> So, yeah, if you want to add memory controls, we better think through how
>> the fd ownership migration should work.
> I've taken a look at the series, since I have been working on cgroup 
> memory eviction.
> 
> The scheduling stuff will work for i915, since it has a purely software 
> execlist scheduler, but I don't think it will work for GuC (firmware) 
> scheduling or other drivers that use the generic drm scheduler.

It actually works - I used to have a blurb in the cover letter about it 
but apparently I dropped it. Just a bit less well with many clients, 
since there are fewer priority levels.

All that the design requires from the individual drivers is some way to 
react to the "you are over budget by this much" signal. The rest is 
driver and backend specific.

> For something like this,  you would probably want it to work inside the 
> drm scheduler first. Presumably, this can be done by setting a weight on 
> each runqueue, and perhaps adding a callback to update one for a running 
> queue. Calculating the weights hierarchically might be fun..

It does not need to work in the drm scheduler first. In fact, drm scheduler 
based drivers can plug into what I have since it already has the notion 
of scheduling priorities.

They would only need to implement a hook which allows the cgroup 
controller to query client GPU utilisation and another to receive the 
over budget signal.

Amdgpu and msm AFAIK could be easy candidates because they both support 
per client utilisation and priorities.
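
To illustrate, a rough sketch of the kind of per driver hooks I have in 
mind - the foo_ driver and the utilisation hook name are purely 
illustrative, only signal_budget is the actual callback from the series:

  /*
   * Rough sketch only - everything prefixed foo_ is a made-up driver and
   * the active_time_us hook name is illustrative; signal_budget is the
   * callback from this series.
   */
  static u64 foo_active_time_us(struct drm_file *file_priv)
  {
          struct foo_client *client = file_priv->driver_priv;

          /* Report accumulated GPU time for this client, as fdinfo does. */
          return client->total_runtime_us;
  }

  static int foo_signal_budget(struct drm_file *file_priv, u64 used, u64 budget)
  {
          struct foo_client *client = file_priv->driver_priv;

          /*
           * Over budget - de-prioritise this client's scheduler entity,
           * otherwise restore the normal priority.
           */
          drm_sched_entity_set_priority(&client->entity,
                                        used > budget ?
                                        DRM_SCHED_PRIORITY_MIN :
                                        DRM_SCHED_PRIORITY_NORMAL);

          return 0;
  }

  static const struct drm_cgroup_ops foo_cg_ops = {
          .active_time_us = foo_active_time_us,
          .signal_budget  = foo_signal_budget,
  };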

Looks like I need to put all this info back into the cover letter.

Also, hierarchical weights and time budgets are already there. What 
could be done later is to make this all smarter and respect the time budget 
with more precision. That would however, in many cases including Intel, 
require co-operation with the firmware. In any case it is only work in 
the implementation, while the cgroup control interface remains the same.

> I have taken a look at how the rest of cgroup controllers change 
> ownership when moved to a different cgroup, and the answer was: not at 
> all. If we attempt to create the scheduler controls only on the first 
> time the fd is used, you could probably get rid of all the tracking.

Can you send a CPU file descriptor from process A to process B and have 
CPU usage belonging to process B show up in process A's cgroup, or 
vice-versa? Nope, I am not making any sense, am I? My point being it is 
not like-to-like, the model is different.

No ownership transfer would mean that in wide deployments all GPU 
utilisation would be assigned to Xorg and so there is no point to any of 
this. No way to throttle a cgroup with unimportant GPU clients, for instance.

> This can be done very easily with the drm scheduler.
> 
> WRT memory, I think the consensus is to track system memory like normal 
> memory. Stolen memory doesn't need to be tracked. It's kernel only 
> memory, used for internal bookkeeping  only.
> 
> The only time userspace can directly manipulate stolen memory, is by 
> mapping the pinned initial framebuffer to its own address space. The 
> only allocation it can do is when a framebuffer is displayed, and 
> framebuffer compression creates some stolen memory. Userspace is not
> aware of this though, and has no way to manipulate those contents.

Stolen memory is irrelevant and not something the cgroup controller knows 
about. The point is drivers say which memory regions they have and their 
utilisation.

Imagine instead of stolen it said vram0, or on Intel multi-tile it shows 
local0 and local1. People working with containers are interested to see 
this breakdown. I guess the parallel and use case here is closer to 
memory.numa_stat.

Regards,

Tvrtko
  
Tvrtko Ursulin July 26, 2023, 4:44 p.m. UTC | #4
On 21/07/2023 23:21, Tejun Heo wrote:
> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>    $ cat drm.memory.stat
>>    card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>>
>> Data is generated on demand for simplicty of implementation ie. no running
>> totals are kept or accounted during migrations and such. Various
>> optimisations such as cheaper collection of data are possible but
>> deliberately left out for now.
>>
>> Overall, the feature is deemed to be useful to container orchestration
>> software (and manual management).
>>
>> Limits, either soft or hard, are not envisaged to be implemented on top of
>> this approach due on demand nature of collecting the stats.
> 
> So, yeah, if you want to add memory controls, we better think through how
> the fd ownership migration should work.

It would be quite easy to make the implicit migration fail - just a 
matter of failing the first ioctl, which is what triggers the migration, 
once the file descriptor is accessed by a new owner.

But I don't think I can really add that in the RFC given I have no hard 
controls or anything like that.

With GPU usage throttling it doesn't really apply, at least I don't 
think it does, since even when migrated to a lower budget group it would 
just get immediately de-prioritized.

I don't think hard GPU time limits are feasible in general, and while 
soft might be, again I don't see that any limiting would necessarily 
have to run immediately on implicit migration.

Second part of the story are hypothetical/future memory controls.

I think the first thing to say is that implicit migration is important, but 
it is not really established to use the file descriptor from two places 
or to migrate more than once. It is simply a fresh fd which gets sent to 
clients from Xorg, which is one of the legacy ways of doing things.

So we probably can just ignore that given no significant amount of 
memory ownership would be getting migrated.

And for drm.memory.stat I think what I have is good enough - both 
private and shared data get accounted, for any clients that have handles 
to particular buffers.

Maarten was working on memory controls so maybe he would have more 
thoughts on memory ownership and implicit migration.

But I don't think there is anything incompatible between that and 
drm.memory.stat as proposed here, given how the categories reported are 
the established ones from the DRM fdinfo spec, and it is a fact of the 
matter that we can have multiple memory regions per driver.

The main thing that would change between this RFC and future memory 
controls in the area of drm.memory.stat is the implementation - it would 
have to get changed under the hood from "collect on query" to "account 
at allocation/free/etc". But those are just implementation details.
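
Roughly, the "account at allocation" flavour could look something like 
this - the per region memory_bytes counters and the helper are purely 
illustrative, not part of this RFC:

  /*
   * Rough sketch only - the per region memory_bytes counters and this
   * helper are illustrative, not part of this RFC.
   */
  static void drmcs_charge_memory(struct drm_cgroup_state *drmcs,
                                  unsigned int region, u64 size)
  {
          struct cgroup_subsys_state *css = &drmcs->css;

          /* Walk up so every ancestor sees the sum of its sub-hierarchy. */
          for (; css; css = css->parent)
                  atomic64_add(size, &css_to_drmcs(css)->memory_bytes[region]);
  }

  /* A matching drmcs_uncharge_memory() on free would do the same walk
   * with atomic64_sub(). */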

Regards,

Tvrtko
  
Tejun Heo July 26, 2023, 7:44 p.m. UTC | #5
Hello,

On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote:
> > So, yeah, if you want to add memory controls, we better think through how
> > the fd ownership migration should work.
>
> I've taken a look at the series, since I have been working on cgroup memory
> eviction.
> 
> The scheduling stuff will work for i915, since it has a purely software
> execlist scheduler, but I don't think it will work for GuC (firmware)
> scheduling or other drivers that use the generic drm scheduler.
> 
> For something like this,  you would probably want it to work inside the drm
> scheduler first. Presumably, this can be done by setting a weight on each
> runqueue, and perhaps adding a callback to update one for a running queue.
> Calculating the weights hierarchically might be fun..

I don't have any idea on this front. The basic idea of making high level
distribution decisions in core code and letting individual drivers enforce
that in a way which fits them the best makes sense to me but I don't know
enough to have an opinion here.

> I have taken a look at how the rest of cgroup controllers change ownership
> when moved to a different cgroup, and the answer was: not at all. If we

For persistent resources, that's the general rule. Whoever instantiates a
resource gets to own it until the resource gets freed. There is an exception
with the pid controller and there are discussions around whether we want
some sort of migration behavior with memcg but yes by and large instantiator
being the owner is the general model cgroup follows.

> attempt to create the scheduler controls only on the first time the fd is
> used, you could probably get rid of all the tracking.
> This can be done very easily with the drm scheduler.
>
> WRT memory, I think the consensus is to track system memory like normal
> memory. Stolen memory doesn't need to be tracked. It's kernel only memory,
> used for internal bookkeeping  only.
> 
> The only time userspace can directly manipulate stolen memory, is by mapping
> the pinned initial framebuffer to its own address space. The only allocation
> it can do is when a framebuffer is displayed, and framebuffer compression
> creates some stolen memory. Userspace is not
> aware of this though, and has no way to manipulate those contents.

So, my dumb understanding:

* Ownership of an fd can be established on the first ioctl call and doesn't
  need to be migrated afterwards. There are no persistent resources to
  migrate at the time of the first call.

* Memory then can be tracked in a similar way to memcg. Memory gets charged
  to the initial instantiator and doesn't need to be moved around
  afterwards. There may be some discrepancies around stolen memory but the
  magnitude of inaccuracy introduced that way is limited and bounded and can
  be safely ignored.

Is that correct?

Thanks.
  
Tejun Heo July 26, 2023, 7:49 p.m. UTC | #6
Hello,

On Wed, Jul 26, 2023 at 05:44:28PM +0100, Tvrtko Ursulin wrote:
...
> > So, yeah, if you want to add memory controls, we better think through how
> > the fd ownership migration should work.
> 
> It would be quite easy to make the implicit migration fail - just the matter
> of failing the first ioctl, which is what triggers the migration, after the
> file descriptor access from a new owner.

So, it'd be best if there's no migration involved at all as per the
discussion with Maarten.

> But I don't think I can really add that in the RFC given I have no hard
> controls or anything like that.
> 
> With GPU usage throttling it doesn't really apply, at least I don't think it
> does, since even when migrated to a lower budget group it would just get
> immediately de-prioritized.
> 
> I don't think hard GPU time limits are feasible in general, and while soft
> might be, again I don't see that any limiting would necessarily have to run
> immediately on implicit migration.

Yeah, I wouldn't worry about hard allocation of GPU time. CPU RT control
does that but it's barely used.

> Second part of the story are hypothetical/future memory controls.
> 
> I think first thing to say is that implicit migration is important, but it
> is not really established to use the file descriptor from two places or to
> migrate more than once. It is simply fresh fd which gets sent to clients
> from Xorg, which is one of the legacy ways of doing things.
> 
> So we probably can just ignore that given no significant amount of memory
> ownership would be getting migrated.

So, if this is the case, it'd be better to clarify this, i.e. if the summary is:

fd gets assigned to the user with a certain action at which point the fd
doesn't have significant resources attached to it and the fd can't be moved
to some other cgroup afterwards.

then, everything is pretty simple. No need to worry about migration at all.
fd just gets assigned once at the beginning and everything gets accounted
towards that afterwards.

> And for drm.memory.stat I think what I have is good enough - both private
> and shared data get accounted, for any clients that have handles to
> particular buffers.
> 
> Maarten was working on memory controls so maybe he would have more thoughts
> on memory ownership and implicit migration.
> 
> But I don't think there is anything incompatible with that and
> drm.memory.stats as proposed here, given how the categories reported are the
> established ones from the DRM fdinfo spec, and it is fact of the matter that
> we can have multiple memory regions per driver.
> 
> The main thing that would change between this RFC and future memory controls
> in the area of drm.memory.stat is the implementation - it would have to get
> changed under the hood from "collect on query" to "account at
> allocation/free/etc". But that is just implementation details.

I'd much prefer to straighten this out before adding a preliminary stat-only
thing. If the previously described ownership model is sufficient, none of
this is complicated, right? We can just add counters to track the resources
and print them out.

Thanks.
  
Maarten Lankhorst July 27, 2023, 11:54 a.m. UTC | #7
Hey,

On 2023-07-26 13:41, Tvrtko Ursulin wrote:
> 
> On 26/07/2023 11:14, Maarten Lankhorst wrote:
>> Hey,
>>
>> On 2023-07-22 00:21, Tejun Heo wrote:
>>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>>>    $ cat drm.memory.stat
>>>>    card0 region=system total=12898304 shared=0 active=0 
>>>> resident=12111872 purgeable=167936
>>>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0 
>>>> purgeable=0
>>>>
>>>> Data is generated on demand for simplicty of implementation ie. no 
>>>> running
>>>> totals are kept or accounted during migrations and such. Various
>>>> optimisations such as cheaper collection of data are possible but
>>>> deliberately left out for now.
>>>>
>>>> Overall, the feature is deemed to be useful to container orchestration
>>>> software (and manual management).
>>>>
>>>> Limits, either soft or hard, are not envisaged to be implemented on 
>>>> top of
>>>> this approach due on demand nature of collecting the stats.
>>>
>>> So, yeah, if you want to add memory controls, we better think through 
>>> how
>>> the fd ownership migration should work.
>> I've taken a look at the series, since I have been working on cgroup 
>> memory eviction.
>>
>> The scheduling stuff will work for i915, since it has a purely 
>> software execlist scheduler, but I don't think it will work for GuC 
>> (firmware) scheduling or other drivers that use the generic drm 
>> scheduler.
> 
> It actually works - I used to have a blurb in the cover letter about it 
> but apparently I dropped it. Just a bit less well with many clients, 
> since there are fewer priority levels.
> 
> All that the design requires from the invididual drivers is some way to 
> react to the "you are over budget by this much" signal. The rest is 
> driver and backend specific.

What I mean is that this signal may not be applicable since the drm 
scheduler just schedules jobs that run. Adding a weight might be done in 
hardware, since it's responsible for scheduling which context gets to 
run. The over budget signal is useless in that case, and you just need 
to set a scheduling priority for the hardware instead.

>> For something like this,  you would probably want it to work inside 
>> the drm scheduler first. Presumably, this can be done by setting a 
>> weight on each runqueue, and perhaps adding a callback to update one 
>> for a running queue. Calculating the weights hierarchically might be 
>> fun..
> 
> It is not needed to work in drm scheduler first. In fact drm scheduler 
> based drivers can plug into what I have since it already has the notion 
> of scheduling priorities.
> 
> They would only need to implement a hook which allow the cgroup 
> controller to query client GPU utilisation and another to received the 
> over budget signal.
> 
> Amdgpu and msm AFAIK could be easy candidates because they both support 
> per client utilisation and priorities.
> 
> Looks like I need to put all this info back into the cover letter.
> 
> Also, hierarchic weights and time budgets are all already there. What 
> could be done later is make this all smarter and respect the time budget 
> with more precision. That would however, in many cases including Intel, 
> require co-operation with the firmware. In any case it is only work in 
> the implementation, while the cgroup control interface remains the same.
> 
>> I have taken a look at how the rest of cgroup controllers change 
>> ownership when moved to a different cgroup, and the answer was: not at 
>> all. If we attempt to create the scheduler controls only on the first 
>> time the fd is used, you could probably get rid of all the tracking.
> 
> Can you send a CPU file descriptor from process A to process B and have 
> CPU usage belonging to process B show up in process' A cgroup, or 
> vice-versa? Nope, I am not making any sense, am I? My point being it is 
> not like-to-like, model is different.
> 
> No ownership transfer would mean in wide deployments all GPU utilisation 
> would be assigned to Xorg and so there is no point to any of this. No 
> way to throttle a cgroup with un-important GPU clients for instance.
If you just grab the current process' cgroup when a drm_sched_entity is 
created, you don't have everything charged to X.org. No need for 
complicated ownership tracking in drm_file. The equivalent should 
be done in i915 as well when a context is created, as it's not using the 
drm scheduler.
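
Roughly something like the below - entity->css and drm_cgrp_id are 
made-up names, only task_get_css()/css_put() exist today:

  /*
   * Sketch only - entity->css and drm_cgrp_id do not exist today;
   * task_get_css()/css_put() are the existing cgroup helpers.
   */
  static void drm_sched_entity_track_cgroup(struct drm_sched_entity *entity)
  {
          /*
           * Charge the entity to the cgroup of the task creating it,
           * rather than to the cgroup which owns the drm_file.
           */
          entity->css = task_get_css(current, drm_cgrp_id);
  }

  /* Called from drm_sched_entity_init(); the reference is dropped again
   * in drm_sched_entity_fini() with css_put(entity->css). */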

>> This can be done very easily with the drm scheduler.
>>
>> WRT memory, I think the consensus is to track system memory like 
>> normal memory. Stolen memory doesn't need to be tracked. It's kernel 
>> only memory, used for internal bookkeeping  only.
>>
>> The only time userspace can directly manipulate stolen memory, is by 
>> mapping the pinned initial framebuffer to its own address space. The 
>> only allocation it can do is when a framebuffer is displayed, and 
>> framebuffer compression creates some stolen memory. Userspace is not
>> aware of this though, and has no way to manipulate those contents.
> 
> Stolen memory is irrelevant and not something cgroup controller knows 
> about. Point is drivers say which memory regions they have and their 
> utilisation.
> 
> Imagine instead of stolen it said vram0, or on Intel multi-tile it shows 
> local0 and local1. People working with containers are interested to see 
> this breakdown. I guess the parallel and use case here is closer to 
> memory.numa_stat.
Correct, but for the same reason, I think it might be more useful to 
split up the weight too.

A single scheduling weight for the global GPU might be less useful than 
per engine, or per tile perhaps..

Cheers,
~Maarten
  
Maarten Lankhorst July 27, 2023, 1:42 p.m. UTC | #8
Hey,

On 2023-07-26 21:44, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote:
>>> So, yeah, if you want to add memory controls, we better think through how
>>> the fd ownership migration should work.
>>
>> I've taken a look at the series, since I have been working on cgroup memory
>> eviction.
>>
>> The scheduling stuff will work for i915, since it has a purely software
>> execlist scheduler, but I don't think it will work for GuC (firmware)
>> scheduling or other drivers that use the generic drm scheduler.
>>
>> For something like this,  you would probably want it to work inside the drm
>> scheduler first. Presumably, this can be done by setting a weight on each
>> runqueue, and perhaps adding a callback to update one for a running queue.
>> Calculating the weights hierarchically might be fun..
> 
> I don't have any idea on this front. The basic idea of making high level
> distribution decisions in core code and letting individual drivers enforce
> that in a way which fits them the best makes sense to me but I don't know
> enough to have an opinion here.
> 
>> I have taken a look at how the rest of cgroup controllers change ownership
>> when moved to a different cgroup, and the answer was: not at all. If we
> 
> For persistent resources, that's the general rule. Whoever instantiates a
> resource gets to own it until the resource gets freed. There is an exception
> with the pid controller and there are discussions around whether we want
> some sort of migration behavior with memcg but yes by and large instantiator
> being the owner is the general model cgroup follows.
> 
>> attempt to create the scheduler controls only on the first time the fd is
>> used, you could probably get rid of all the tracking.
>> This can be done very easily with the drm scheduler.
>>
>> WRT memory, I think the consensus is to track system memory like normal
>> memory. Stolen memory doesn't need to be tracked. It's kernel only memory,
>> used for internal bookkeeping  only.
>>
>> The only time userspace can directly manipulate stolen memory, is by mapping
>> the pinned initial framebuffer to its own address space. The only allocation
>> it can do is when a framebuffer is displayed, and framebuffer compression
>> creates some stolen memory. Userspace is not
>> aware of this though, and has no way to manipulate those contents.
> 
> So, my dumb understanding:
> 
> * Ownership of an fd can be established on the first ioctl call and doesn't
>    need to be migrated afterwards. There are no persistent resources to
>    migration on the first call.
> 
> * Memory then can be tracked in a similar way to memcg. Memory gets charged
>    to the initial instantiator and doesn't need to be moved around
>    afterwards. There may be some discrepancies around stolen memory but the
>    magnitude of inaccuracy introduced that way is limited and bound and can
>    be safely ignored.
> 
> Is that correct?

Hey,

Yeah mostly, I think we can stop tracking stolen memory. I stopped doing 
that for Xe, there is literally nothing to control for userspace in there.

Cheers,
Maarten
  
Tvrtko Ursulin July 27, 2023, 4:43 p.m. UTC | #9
On 27/07/2023 14:42, Maarten Lankhorst wrote:
> On 2023-07-26 21:44, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote:
>>>> So, yeah, if you want to add memory controls, we better think 
>>>> through how
>>>> the fd ownership migration should work.
>>>
>>> I've taken a look at the series, since I have been working on cgroup 
>>> memory
>>> eviction.
>>>
>>> The scheduling stuff will work for i915, since it has a purely software
>>> execlist scheduler, but I don't think it will work for GuC (firmware)
>>> scheduling or other drivers that use the generic drm scheduler.
>>>
>>> For something like this,  you would probably want it to work inside 
>>> the drm
>>> scheduler first. Presumably, this can be done by setting a weight on 
>>> each
>>> runqueue, and perhaps adding a callback to update one for a running 
>>> queue.
>>> Calculating the weights hierarchically might be fun..
>>
>> I don't have any idea on this front. The basic idea of making high level
>> distribution decisions in core code and letting individual drivers 
>> enforce
>> that in a way which fits them the best makes sense to me but I don't know
>> enough to have an opinion here.
>>
>>> I have taken a look at how the rest of cgroup controllers change 
>>> ownership
>>> when moved to a different cgroup, and the answer was: not at all. If we
>>
>> For persistent resources, that's the general rule. Whoever instantiates a
>> resource gets to own it until the resource gets freed. There is an 
>> exception
>> with the pid controller and there are discussions around whether we want
>> some sort of migration behavior with memcg but yes by and large 
>> instantiator
>> being the owner is the general model cgroup follows.
>>
>>> attempt to create the scheduler controls only on the first time the 
>>> fd is
>>> used, you could probably get rid of all the tracking.
>>> This can be done very easily with the drm scheduler.
>>>
>>> WRT memory, I think the consensus is to track system memory like normal
>>> memory. Stolen memory doesn't need to be tracked. It's kernel only 
>>> memory,
>>> used for internal bookkeeping  only.
>>>
>>> The only time userspace can directly manipulate stolen memory, is by 
>>> mapping
>>> the pinned initial framebuffer to its own address space. The only 
>>> allocation
>>> it can do is when a framebuffer is displayed, and framebuffer 
>>> compression
>>> creates some stolen memory. Userspace is not
>>> aware of this though, and has no way to manipulate those contents.
>>
>> So, my dumb understanding:
>>
>> * Ownership of an fd can be established on the first ioctl call and 
>> doesn't
>>    need to be migrated afterwards. There are no persistent resources to
>>    migration on the first call.

Yes, keyword is "can". Trouble is migration may or may not happen.

One may choose the "Plasma X.org" session type in the login manager, in 
which case all DRM fds would be under Xorg if not migrated. Or one may 
choose "Plasma Wayland" and migration wouldn't matter. But the former, I 
think, has a huge deployed base, so not supporting implicit migration 
would be a significant asterisk next to the controller.

>> * Memory then can be tracked in a similar way to memcg. Memory gets 
>> charged
>>    to the initial instantiator and doesn't need to be moved around
>>    afterwards. There may be some discrepancies around stolen memory 
>> but the
>>    magnitude of inaccuracy introduced that way is limited and bound 
>> and can
>>    be safely ignored.
>>
>> Is that correct?
> 
> Hey,
> 
> Yeah mostly, I think we can stop tracking stolen memory. I stopped doing 
> that for Xe, there is literally nothing to control for userspace in there.

Right, but for reporting, stolen is a red herring. In this RFC I simply 
report on all memory regions known by the driver. As I said in the other 
reply, imagine the keys are 'system' and 'vram0'. The point was just to 
illustrate the multiplicity of regions.

Regards,

Tvrtko
  
Tvrtko Ursulin July 27, 2023, 5:08 p.m. UTC | #10
On 27/07/2023 12:54, Maarten Lankhorst wrote:
> Hey,
> 
> On 2023-07-26 13:41, Tvrtko Ursulin wrote:
>>
>> On 26/07/2023 11:14, Maarten Lankhorst wrote:
>>> Hey,
>>>
>>> On 2023-07-22 00:21, Tejun Heo wrote:
>>>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>>>>    $ cat drm.memory.stat
>>>>>    card0 region=system total=12898304 shared=0 active=0 
>>>>> resident=12111872 purgeable=167936
>>>>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0 
>>>>> purgeable=0
>>>>>
>>>>> Data is generated on demand for simplicty of implementation ie. no 
>>>>> running
>>>>> totals are kept or accounted during migrations and such. Various
>>>>> optimisations such as cheaper collection of data are possible but
>>>>> deliberately left out for now.
>>>>>
>>>>> Overall, the feature is deemed to be useful to container orchestration
>>>>> software (and manual management).
>>>>>
>>>>> Limits, either soft or hard, are not envisaged to be implemented on 
>>>>> top of
>>>>> this approach due on demand nature of collecting the stats.
>>>>
>>>> So, yeah, if you want to add memory controls, we better think 
>>>> through how
>>>> the fd ownership migration should work.
>>> I've taken a look at the series, since I have been working on cgroup 
>>> memory eviction.
>>>
>>> The scheduling stuff will work for i915, since it has a purely 
>>> software execlist scheduler, but I don't think it will work for GuC 
>>> (firmware) scheduling or other drivers that use the generic drm 
>>> scheduler.
>>
>> It actually works - I used to have a blurb in the cover letter about 
>> it but apparently I dropped it. Just a bit less well with many 
>> clients, since there are fewer priority levels.
>>
>> All that the design requires from the invididual drivers is some way 
>> to react to the "you are over budget by this much" signal. The rest is 
>> driver and backend specific.
> 
> What I mean is that this signal may not be applicable since the drm 
> scheduler just schedules jobs that run. Adding a weight might be done in 
> hardware, since it's responsible for  scheduling which context gets to 
> run. The over budget signal is useless in that case, and you just need 
> to set a scheduling priority for the hardware instead.

The over budget callback lets the driver know its assigned budget and 
its current utilisation. Already with that data drivers could implement 
something smarter than what I did in my RFC. So I don't think the callback 
is completely useless even for a smarter implementation which 
potentially ties into firmware scheduling.

Anyway, I maintain these are implementation details.

>>> For something like this,  you would probably want it to work inside 
>>> the drm scheduler first. Presumably, this can be done by setting a 
>>> weight on each runqueue, and perhaps adding a callback to update one 
>>> for a running queue. Calculating the weights hierarchically might be 
>>> fun..
>>
>> It is not needed to work in drm scheduler first. In fact drm scheduler 
>> based drivers can plug into what I have since it already has the 
>> notion of scheduling priorities.
>>
>> They would only need to implement a hook which allow the cgroup 
>> controller to query client GPU utilisation and another to received the 
>> over budget signal.
>>
>> Amdgpu and msm AFAIK could be easy candidates because they both 
>> support per client utilisation and priorities.
>>
>> Looks like I need to put all this info back into the cover letter.
>>
>> Also, hierarchic weights and time budgets are all already there. What 
>> could be done later is make this all smarter and respect the time 
>> budget with more precision. That would however, in many cases 
>> including Intel, require co-operation with the firmware. In any case 
>> it is only work in the implementation, while the cgroup control 
>> interface remains the same.
>>
>>> I have taken a look at how the rest of cgroup controllers change 
>>> ownership when moved to a different cgroup, and the answer was: not 
>>> at all. If we attempt to create the scheduler controls only on the 
>>> first time the fd is used, you could probably get rid of all the 
>>> tracking.
>>
>> Can you send a CPU file descriptor from process A to process B and 
>> have CPU usage belonging to process B show up in process' A cgroup, or 
>> vice-versa? Nope, I am not making any sense, am I? My point being it 
>> is not like-to-like, model is different.
>>
>> No ownership transfer would mean in wide deployments all GPU 
>> utilisation would be assigned to Xorg and so there is no point to any 
>> of this. No way to throttle a cgroup with un-important GPU clients for 
>> instance.
> If you just grab the current process' cgroup when a drm_sched_entity is 
> created, you don't have everything charged to X.org. No need for 
> complicated ownership tracking in drm_file. The same equivalent should 
> be done in i915 as well when a context is created as it's not using the 
> drm scheduler.

Okay, so essentially nuking the concept of DRM clients belonging to one 
cgroup and instead tracking at the context level. That is an interesting 
idea. I suspect the implementation could require somewhat generalizing the 
concept of an "execution context", or at least expressing it via the DRM 
cgroup controller.

I can give this a spin, or at least some more detailed thought, once we 
close on a few more details regarding charging in general.

>>> This can be done very easily with the drm scheduler.
>>>
>>> WRT memory, I think the consensus is to track system memory like 
>>> normal memory. Stolen memory doesn't need to be tracked. It's kernel 
>>> only memory, used for internal bookkeeping  only.
>>>
>>> The only time userspace can directly manipulate stolen memory, is by 
>>> mapping the pinned initial framebuffer to its own address space. The 
>>> only allocation it can do is when a framebuffer is displayed, and 
>>> framebuffer compression creates some stolen memory. Userspace is not
>>> aware of this though, and has no way to manipulate those contents.
>>
>> Stolen memory is irrelevant and not something cgroup controller knows 
>> about. Point is drivers say which memory regions they have and their 
>> utilisation.
>>
>> Imagine instead of stolen it said vram0, or on Intel multi-tile it 
>> shows local0 and local1. People working with containers are interested 
>> to see this breakdown. I guess the parallel and use case here is 
>> closer to memory.numa_stat.
> Correct, but for the same reason, I think it might be more useful to 
> split up the weight too.
> 
> A single scheduling weight for the global GPU might be less useful than 
> per engine, or per tile perhaps..

Yeah, there is some complexity there for sure and it could be a larger 
write up. In short, per engine stuff tends to work out in practice as is, 
given that each driver can decide what to do upon receiving the signal.

In the i915 RFC for instance if it gets "over budget" signal from the 
group, but it sees that the physical engines belonging to this specific 
GPU are not over-subscribed, it simply omits any throttling. Which in 
practice works out fine for two clients competing for different engines. 
Same would be for multiple GPUs (or tiles with our stuff) in the same 
cgroup.

Going back to the single scheduling weight versus more fine grained: we 
could choose to follow, for instance, the io.weight format. Start with 
drm.weight being "default 100" and later extend to per card (or more):

"""
default 100
card0 20
card1 50
"""

In this case I would convert drm.weight to this format straight away for 
the next respin, just without per-card support yet.

Regards,

Tvrtko
  
Tvrtko Ursulin July 28, 2023, 2:15 p.m. UTC | #11
One additional thought on one sub-topic:

On 27/07/2023 18:08, Tvrtko Ursulin wrote:

[snip]

>>>> For something like this,  you would probably want it to work inside 
>>>> the drm scheduler first. Presumably, this can be done by setting a 
>>>> weight on each runqueue, and perhaps adding a callback to update one 
>>>> for a running queue. Calculating the weights hierarchically might be 
>>>> fun..
>>>
>>> It is not needed to work in drm scheduler first. In fact drm 
>>> scheduler based drivers can plug into what I have since it already 
>>> has the notion of scheduling priorities.
>>>
>>> They would only need to implement a hook which allow the cgroup 
>>> controller to query client GPU utilisation and another to received 
>>> the over budget signal.
>>>
>>> Amdgpu and msm AFAIK could be easy candidates because they both 
>>> support per client utilisation and priorities.
>>>
>>> Looks like I need to put all this info back into the cover letter.
>>>
>>> Also, hierarchic weights and time budgets are all already there. What 
>>> could be done later is make this all smarter and respect the time 
>>> budget with more precision. That would however, in many cases 
>>> including Intel, require co-operation with the firmware. In any case 
>>> it is only work in the implementation, while the cgroup control 
>>> interface remains the same.
>>>
>>>> I have taken a look at how the rest of cgroup controllers change 
>>>> ownership when moved to a different cgroup, and the answer was: not 
>>>> at all. If we attempt to create the scheduler controls only on the 
>>>> first time the fd is used, you could probably get rid of all the 
>>>> tracking.
>>>
>>> Can you send a CPU file descriptor from process A to process B and 
>>> have CPU usage belonging to process B show up in process' A cgroup, 
>>> or vice-versa? Nope, I am not making any sense, am I? My point being 
>>> it is not like-to-like, model is different.
>>>
>>> No ownership transfer would mean in wide deployments all GPU 
>>> utilisation would be assigned to Xorg and so there is no point to any 
>>> of this. No way to throttle a cgroup with un-important GPU clients 
>>> for instance.
>> If you just grab the current process' cgroup when a drm_sched_entity 
>> is created, you don't have everything charged to X.org. No need for 
>> complicated ownership tracking in drm_file. The same equivalent should 
>> be done in i915 as well when a context is created as it's not using 
>> the drm scheduler.
> 
> Okay so essentially nuking the concept of DRM clients belongs to one 
> cgroup and instead tracking at the context level. That is an interesting 
> idea. I suspect implementation could require somewhat generalizing the 
> concept of an "execution context", or at least expressing it via the DRM 
> cgroup controller.
> 
> I can give this a spin, or at least some more detailed thought, once we 
> close on a few more details regarding charging in general.

I didn't get much time to brainstorm this just yet, only one downside 
randomly came to mind later - with this approach, for i915, we wouldn't 
correctly attribute any GPU activity done in the receiving process via 
our default contexts. Those would still be accounted to the sending 
process.

How much of a problem that would be in practice remains to be investigated, 
including whether it applies to other drivers too. If there is a good 
amount of deployed userspace which uses the default context, then it would be a 
bit messy.

Regards,

Tvrtko

*) For non DRM and non i915 people, default context is a GPU submission 
context implicitly created during the device node open. It always 
remains valid, including in the receiving process if SCM_RIGHTS is used.
  

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index bbe986366f4a..1891c7d98206 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2452,6 +2452,28 @@  DRM scheduling soft limits interface files
 	Standard cgroup weight based control [1, 10000] used to configure the
 	relative distributing of GPU time between the sibling groups.
 
+  drm.memory.stat
+	A nested file containing cumulative memory statistics for the whole
+	sub-hierarchy, broken down into separate GPUs and separate memory
+	regions supported by the latter.
+
+	For example::
+
+	  $ cat drm.memory.stat
+	  card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
+	  card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
+
+	Card designation corresponds to the DRM device names and multiple line
+	entries can be present per card.
+
+	Memory region names should be expected to be driver specific with the
+	exception of 'system' which is standardised and applicable for GPUs
+	which can operate on system memory buffers.
+
+	Sub-keys 'resident' and 'purgeable' are optional.
+
+	Per category region usage is reported in bytes.
+
 Misc
 ----
 
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 29e11a87bf75..2ea9a46b5031 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -41,6 +41,7 @@  struct drm_minor;
 struct dma_buf;
 struct dma_buf_attachment;
 struct drm_display_mode;
+struct drm_memory_stats;
 struct drm_mode_create_dumb;
 struct drm_printer;
 struct sg_table;
@@ -175,6 +176,66 @@  struct drm_cgroup_ops {
 	 * messages sent by the DRM cgroup controller.
 	 */
 	int (*signal_budget) (struct drm_file *, u64 used, u64 budget);
+
+
+	/**
+	 * @num_memory_regions:
+	 *
+	 * Optional callback reporting the number of memory regions the
+	 * driver supports.
+	 *
+	 * Callback is allowed to report a larger number than present memory
+	 * regions, but @memory_region_name is then supposed to return NULL for
+	 * those indices.
+	 *
+	 * Used by the DRM core when queried by the DRM cgroup controller.
+	 *
+	 * All three callbacks of @num_memory_regions, @memory_region_name and
+	 * @memory_stats need to be implemented for DRM cgroup memory stats
+	 * support.
+	 */
+	unsigned int (*num_memory_regions) (const struct drm_device *);
+
+	/**
+	 * @memory_region_name:
+	 *
+	 * Optional callback reporting the name of the queried memory region.
+	 *
+	 * May return NULL if the memory region index is not supported by
+	 * the passed in device.
+	 *
+	 * Used by the DRM core when queried by the DRM cgroup controller.
+	 *
+	 * All three callbacks of @num_memory_regions, @memory_region_name and
+	 * @memory_stats need to be implemented for DRM cgroup memory stats
+	 * support.
+	 */
+	const char * (*memory_region_name) (const struct drm_device *,
+					    unsigned int index);
+
+	/**
+	 * @memory_stats:
+	 *
+	 * Optional callback adding to the passed in array of struct
+	 * drm_memory_stats objects.
+	 *
+	 * Number of objects in the array is passed in the @num argument.
+	 *
+	 * Returns a bitmask of supported enum drm_gem_object_status by the
+	 * driver instance.
+	 *
+	 * Callback is only allowed to add to the existing fields and should
+	 * never clear them.
+	 *
+	 * Used by the DRM core when queried by the DRM cgroup controller.
+	 *
+	 * All three callbacks of @num_memory_regions, @memory_region_name and
+	 * @memory_stats need to be implemented for DRM cgroup memory stats
+	 * support.
+	 */
+	unsigned int (*memory_stats) (struct drm_file *,
+				      struct drm_memory_stats *,
+				      unsigned int num);
 };
 
 /**
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 7c20d4ebc634..22fc180dd659 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -5,6 +5,7 @@ 
 
 #include <linux/cgroup.h>
 #include <linux/cgroup_drm.h>
+#include <linux/device.h>
 #include <linux/list.h>
 #include <linux/moduleparam.h>
 #include <linux/mutex.h>
@@ -12,6 +13,8 @@ 
 #include <linux/slab.h>
 
 #include <drm/drm_drv.h>
+#include <drm/drm_file.h>
+#include <drm/drm_gem.h>
 
 struct drm_cgroup_state {
 	struct cgroup_subsys_state css;
@@ -133,6 +136,147 @@  drmcs_read_total_us(struct cgroup_subsys_state *css, struct cftype *cft)
 	return val;
 }
 
+struct drmcs_stat {
+	const struct drm_device *dev;
+	const struct drm_cgroup_ops *cg_ops;
+	const char *device_name;
+	unsigned int regions;
+	enum drm_gem_object_status flags;
+	struct drm_memory_stats *stats;
+};
+
+static int drmcs_seq_show_memory(struct seq_file *sf, void *v)
+{
+	struct cgroup_subsys_state *node;
+	struct drmcs_stat *stats = NULL;
+	unsigned int num_devices, i;
+	int ret;
+
+	/*
+	 * We could avoid taking the cgroup_lock and just walk the tree under
+	 * RCU but then allocating temporary storage becomes a problem. So for
+	 * now keep it simple and take the lock.
+	 */
+	cgroup_lock();
+
+	/* Protect against client migrations and clients disappearing. */
+	ret = mutex_lock_interruptible(&drmcg_mutex);
+	if (ret) {
+		cgroup_unlock();
+		return ret;
+	}
+
+	num_devices = 0;
+	css_for_each_descendant_pre(node, seq_css(sf)) {
+		struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+		struct drm_file *fpriv;
+
+		list_for_each_entry(fpriv, &drmcs->clients, clink) {
+			const struct drm_cgroup_ops *cg_ops =
+				fpriv->minor->dev->driver->cg_ops;
+			const char *device_name = dev_name(fpriv->minor->kdev);
+			struct drmcs_stat *stat;
+			unsigned int regions;
+
+			/* Does this driver supports memory stats? */
+			if (cg_ops &&
+			    cg_ops->num_memory_regions &&
+			    cg_ops->memory_region_name &&
+			    cg_ops->memory_stats)
+				regions =
+				  cg_ops->num_memory_regions(fpriv->minor->dev);
+			else
+				regions = 0;
+
+			if (!regions)
+				continue;
+
+			/* Have we seen this device before? */
+			stat = NULL;
+			for (i = 0; i < num_devices; i++) {
+				if (!strcmp(stats[i].device_name,
+					    device_name)) {
+					stat = &stats[i];
+					break;
+				}
+			}
+
+			/* If not allocate space for it. */
+			if (!stat) {
+				stats = krealloc_array(stats, num_devices + 1,
+						       sizeof(*stats),
+						       GFP_USER);
+				if (!stats) {
+					ret = -ENOMEM;
+					goto out;
+				}
+
+				stat = &stats[num_devices++];
+				stat->dev = fpriv->minor->dev;
+				stat->cg_ops = cg_ops;
+				stat->device_name = device_name;
+				stat->flags = 0;
+				stat->regions = regions;
+				stat->stats =
+					kcalloc(regions,
+						sizeof(struct drm_memory_stats),
+						GFP_USER);
+				if (!stat->stats) {
+					ret = -ENOMEM;
+					goto out;
+				}
+			}
+
+			/* Accumulate the stats for this device+client. */
+			stat->flags |= cg_ops->memory_stats(fpriv,
+							    stat->stats,
+							    stat->regions);
+		}
+	}
+
+	for (i = 0; i < num_devices; i++) {
+		struct drmcs_stat *stat = &stats[i];
+		unsigned int j;
+
+		for (j = 0; j < stat->regions; j++) {
+			const char *name =
+				stat->cg_ops->memory_region_name(stat->dev, j);
+
+			if (!name)
+				continue;
+
+			seq_printf(sf,
+				   "%s region=%s total=%llu shared=%llu active=%llu",
+				   stat->device_name,
+				   name,
+				   stat->stats[j].private +
+				   stat->stats[j].shared,
+				   stat->stats[j].shared,
+				   stat->stats[j].active);
+
+			if (stat->flags & DRM_GEM_OBJECT_RESIDENT)
+				seq_printf(sf, " resident=%llu",
+					   stat->stats[j].resident);
+
+			if (stat->flags & DRM_GEM_OBJECT_PURGEABLE)
+				seq_printf(sf, " purgeable=%llu",
+					   stat->stats[j].purgeable);
+
+			seq_puts(sf, "\n");
+		}
+	}
+
+out:
+	mutex_unlock(&drmcg_mutex);
+	cgroup_unlock();
+
+	for (i = 0; i < num_devices; i++)
+		kfree(stats[i].stats);
+	kfree(stats);
+
+	return ret;
+}
+
 static bool __start_scanning(unsigned int period_us)
 {
 	struct drm_cgroup_state *root = &root_drmcs.drmcs;
@@ -575,6 +719,11 @@  struct cftype files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = drmcs_read_total_us,
 	},
+	{
+		.name = "memory.stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = drmcs_seq_show_memory,
+	},
 	{ } /* Zero entry terminates. */
 };
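
For reference, a minimal driver-side sketch of wiring up the three new 
callbacks - everything prefixed foo_ (the client and buffer structures, 
the region table, foo_bo_is_resident()) is hypothetical, while the 
callback signatures, struct drm_memory_stats and DRM_GEM_OBJECT_RESIDENT 
are as used in the patch:

  /* Illustrative sketch - everything prefixed foo_ is hypothetical. */
  static const char * const foo_region_names[] = { "system", "vram0" };

  static unsigned int foo_num_memory_regions(const struct drm_device *dev)
  {
          return ARRAY_SIZE(foo_region_names);
  }

  static const char *foo_memory_region_name(const struct drm_device *dev,
                                            unsigned int index)
  {
          if (index >= ARRAY_SIZE(foo_region_names))
                  return NULL;

          return foo_region_names[index];
  }

  static unsigned int foo_memory_stats(struct drm_file *file_priv,
                                       struct drm_memory_stats *stats,
                                       unsigned int num)
  {
          struct foo_client *client = file_priv->driver_priv;
          struct foo_bo *bo;

          /*
           * Only add to the passed in array - the controller accumulates
           * the totals across all clients in the sub-hierarchy.
           */
          list_for_each_entry(bo, &client->objects, client_link) {
                  unsigned int r = bo->region; /* index into foo_region_names */

                  if (r >= num)
                          continue;

                  if (bo->base.handle_count > 1)
                          stats[r].shared += bo->base.size;
                  else
                          stats[r].private += bo->base.size;

                  if (foo_bo_is_resident(bo))
                          stats[r].resident += bo->base.size;
          }

          /* Tell the controller which optional counters are valid. */
          return DRM_GEM_OBJECT_RESIDENT;
  }

  static const struct drm_cgroup_ops foo_cg_ops = {
          .num_memory_regions = foo_num_memory_regions,
          .memory_region_name = foo_memory_region_name,
          .memory_stats       = foo_memory_stats,
  };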