[drm-misc-next,v3] drm/sched: implement dynamic job-flow control

Message ID 20231026161431.5934-1-dakr@redhat.com
State New
Series [drm-misc-next,v3] drm/sched: implement dynamic job-flow control

Commit Message

Danilo Krummrich Oct. 26, 2023, 4:13 p.m. UTC
  Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers it can easily happen that job
submissions not even filling the ring by 1% can block subsequent
submissions, which, in the worst case, can lead to the ring running dry.

In order to overcome this issue, allow for tracking the actual job size
instead of the number of jobs. Therefore, add a field to track a job's
credit count, which represents the number of credits a job contributes
to the scheduler's credit limit.

Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
Changes in V2:
==============
  - fixed up influence on scheduling fairness due to consideration of a job's
    size
    - If we reach a ready entity in drm_sched_select_entity() but can't actually
      queue a job from it due to size limitations, just give up and go to sleep
      until woken up due to a pending job finishing, rather than continue to try
      other entities.
  - added a callback to dynamically update a job's credits (Boris)
  - renamed 'units' to 'credits'
  - fixed commit message and comments as requested by Luben

Changes in V3:
==============
  - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
  - move up drm_sched_can_queue() instead of adding a forward declaration
    (Boris)
  - add a drm_sched_available_credits() helper (Boris)
  - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
  - re-phrase a few comments and fix a typo (Luben)
  - change naming of all structures' credit fields and function parameters to the
    following scheme
    - drm_sched_job::credits
    - drm_gpu_scheduler::credit_limit
    - drm_gpu_scheduler::credit_count
    - drm_sched_init(..., u32 credit_limit, ...)
    - drm_sched_job_init(..., u32 credits, ...)
  - add proper documentation for the scheduler's job-flow control mechanism

This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
provides a branch based on drm-misc-next, with the named series and this patch
on top of it.

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
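
As a purely illustrative sketch (not part of the patch), this is roughly how a
driver would use the new interface; all foo_* names below are invented, the only
real API shown is drm_sched_job_init() with its new credits parameter:

#include <drm/gpu_scheduler.h>

/* Hypothetical driver job; only the fields needed for this sketch. */
struct foo_job {
	struct drm_sched_job base;
	u32 ring_units;	/* ring-buffer space this job will consume */
};

static int foo_job_init(struct foo_job *job, struct drm_sched_entity *entity,
			void *owner)
{
	/* One credit per unit of ring space, instead of a flat 1 per job. */
	return drm_sched_job_init(&job->base, entity, job->ring_units, owner);
}
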
---
 Documentation/gpu/drm-mm.rst                  |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
 drivers/gpu/drm/lima/lima_sched.c             |   2 +-
 drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
 .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
 drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
 drivers/gpu/drm/scheduler/sched_main.c        | 142 ++++++++++++++----
 drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
 include/drm/gpu_scheduler.h                   |  33 +++-
 12 files changed, 152 insertions(+), 49 deletions(-)
  

Comments

Luben Tuikov Oct. 26, 2023, 9:13 p.m. UTC | #1
On 2023-10-26 12:13, Danilo Krummrich wrote:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
> Changes in V2:
> ==============
>   - fixed up influence on scheduling fairness due to consideration of a job's
>     size
>     - If we reach a ready entity in drm_sched_select_entity() but can't actually
>       queue a job from it due to size limitations, just give up and go to sleep
>       until woken up due to a pending job finishing, rather than continue to try
>       other entities.
>   - added a callback to dynamically update a job's credits (Boris)
>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==============
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
>     (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures credit fields and function parameters to the
>     following scheme
>     - drm_sched_job::credits
>     - drm_gpu_scheduler::credit_limit
>     - drm_gpu_scheduler::credit_count
>     - drm_sched_init(..., u32 credit_limit, ...)
>     - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
> 
> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
> provides a branch based on drm-misc-next, with the named series and this patch
> on top of it.
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>  Documentation/gpu/drm-mm.rst                  |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>  drivers/gpu/drm/lima/lima_sched.c             |   2 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
>  .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>  drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
>  drivers/gpu/drm/scheduler/sched_main.c        | 142 ++++++++++++++----
>  drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
>  include/drm/gpu_scheduler.h                   |  33 +++-
>  12 files changed, 152 insertions(+), 49 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Flow Control
> +------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 1f357198533f..62bb7fc7448a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  	if (!entity)
>  		return 0;
>  
> -	return drm_sched_job_init(&(*job)->base, entity, owner);
> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>  }
>  
>  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> index 2416c526f9b0..3d0f8d182506 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>  
>  	ret = drm_sched_job_init(&submit->sched_job,
>  				 &ctx->sched_entity[args->pipe],
> -				 submit->ctx);
> +				 1, submit->ctx);
>  	if (ret)
>  		goto err_submit_put;
>  
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 23a6276f1332..cdcb37ff62c3 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>  	for (i = 0; i < num_bos; i++)
>  		drm_gem_object_get(&bos[i]->base.base);
>  
> -	err = drm_sched_job_init(&task->base, &context->base, vm);
> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>  	if (err) {
>  		kfree(task->bos);
>  		return err;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 99744de6c05a..c002cabe7b9c 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>  		return ERR_PTR(ret);
>  	}
>  
> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>  	if (ret) {
>  		kfree(submit->hw_fence);
>  		kfree(submit);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index 7e64b5ef90fb..1b2cc3f2e1c7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>  
>  	}
>  
> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>  	if (ret)
>  		goto err_free_chains;
>  
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index b834777b409b..54d1c19bea84 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>  
>  	ret = drm_sched_job_init(&job->base,
>  				 &file_priv->sched_entity[slot],
> -				 NULL);
> +				 1, NULL);
>  	if (ret)
>  		goto out_put_job;
>  
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> index 3143ecaaff86..f8ed093b7356 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>  			   __assign_str(name, sched_job->sched->name);
>  			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>  			   __entry->hw_job_count = atomic_read(
> -				   &sched_job->sched->hw_rq_count);
> +				   &sched_job->sched->credit_count);
>  			   ),
>  	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>  		      __entry->entity, __entry->id,
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 409e4256f6e7..b79e0672b94f 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  		container_of(cb, struct drm_sched_entity, cb);
>  
>  	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
>  
>  /**
> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  			drm_sched_rq_update_fifo(entity, submit_ts);
>  
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 246213963928..3cc3dc0091fc 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -48,6 +48,30 @@
>   * through the jobs entity pointer.
>   */
>  
> +/**
> + * DOC: Flow Control
> + *
> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> + * in which the jobs fetched from scheduler entities are executed.
> + *
> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
> + * credit limit representing the capacity of this scheduler and a credit count;
> + * every &drm_sched_job carries a driver specified number of credits.
> + *
> + * Once a job is executed (but not yet finished), the job's credits contribute
> + * to the scheduler's credit count until the job is finished. If by executing
> + * one more job the scheduler's credit count would exceed the scheduler's
> + * credit limit, the job won't be executed. Instead, the scheduler will wait
> + * until the credit count has decreased enough to not overflow its credit limit.
> + * This implies waiting for previously executed jobs.
> + *
> + * Optionally, drivers can register a callback (update_job_credits) provided by

"can" --> "may".

> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
> + * scheduler will execute this callback every time the scheduler considers a job

No future tense, "will execute" --> "executes", because when this patch lands,
it all becomes present reality. See below for a reworded paragraph.

> + * for execution and subsequently checks whether the job fits the scheduler's
> + * credit limit.

This is a good explanation of what update_job_credits() does, but the one
where this callback is defined in the scheduler ops is not very clear (see further down
in the patch).

I think we shouldn't use the word "update" as we don't really have any "update" code,
a la,
	if old value != new value, then
		old value = new value # update it
in the code using this new sched op.

Perhaps it should be a "get_job_credits()" as this is how this function is used
in drm_sched_can_queue().

Perhaps this adds some clarity:

* Optionally, drivers may register a sched ops callback, get_job_credits(), which
* returns the number of credits the job passed as an argument would take when
* submitted to driver/hardware. The scheduler executes this callback every time it
* considers a job for execution in order to check if the job's credits fit
* into the scheduler's credit limit.

(FWIW, it may be doing some "update" into the driver/firmware/hardware, but DRM
has no visibility or care for that, since DRM is using it simply as "tell me
the number of credits this job would take.")
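
To make that concrete, a purely hypothetical driver-side implementation of this
op (whatever it ends up being named; the foo_* names are invented) would simply
report the job's current cost, e.g. after deducting native fences which have
signaled meanwhile:

static u32 foo_job_credits(struct drm_sched_job *sched_job)
{
	struct foo_job *job = to_foo_job(sched_job);	/* hypothetical wrapper */

	/* One credit per native fence that hasn't signaled yet; fences which
	 * signaled since submission no longer need ring space.
	 */
	return foo_job_pending_fence_count(job);
}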

> + */
> +
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/completion.h>
> @@ -75,6 +99,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>  MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>  module_param_named(sched_policy, drm_sched_policy, int, 0444);
>  
> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> +{
> +	u32 credits;
> +
> +	WARN_ON(check_sub_overflow(sched->credit_limit,
> +				   atomic_read(&sched->credit_count),
> +				   &credits));
> +
> +	return credits;
> +}
> +
> +/**
> + * drm_sched_can_queue -- Can we queue more to the hardware?
> + * @sched: scheduler instance
> + * @entity: the scheduler entity
> + *
> + * Return true if we can push at least one more job from @entity, false
> + * otherwise.
> + */
> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *s_job;
> +
> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> +	if (!s_job)
> +		return false;

We don't have this gating logic at the moment, and I don't think
we need it.

At the current state of the code, we don't care if there's jobs
in the incoming queue or not.

The only thing we should check here for is the credit availability
for _this job_, as you do in the code below.

> +
> +	if (sched->ops->update_job_credits) {
> +		u32 credits = sched->ops->update_job_credits(s_job);
> +
> +		if (credits)
> +			s_job->credits = credits;
> +	}
> +
> +	WARN_ON(s_job->credits > sched->credit_limit);
> +
> +	return drm_sched_available_credits(sched) >= s_job->credits;
> +}
> +
>  static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>  							    const struct rb_node *b)
>  {
> @@ -186,12 +250,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  /**
>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>   *
> + * @sched: the gpu scheduler
>   * @rq: scheduler run queue to check.
>   *
>   * Try to find a ready entity, returns NULL if none found.
>   */
>  static struct drm_sched_entity *
> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
> +			      struct drm_sched_rq *rq)
>  {
>  	struct drm_sched_entity *entity;
>  
> @@ -201,6 +267,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  	if (entity) {
>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>  			if (drm_sched_entity_is_ready(entity)) {
> +				/* If we can't queue yet, preserve the current
> +				 * entity in terms of fairness.
> +				 */
> +				if (!drm_sched_can_queue(sched, entity))
> +					goto out;
> +
>  				rq->current_entity = entity;
>  				reinit_completion(&entity->entity_idle);
>  				spin_unlock(&rq->lock);
> @@ -210,8 +282,13 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  	}
>  
>  	list_for_each_entry(entity, &rq->entities, list) {
> -
>  		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, preserve the current entity in
> +			 * terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity))
> +				goto out;
> +
>  			rq->current_entity = entity;
>  			reinit_completion(&entity->entity_idle);
>  			spin_unlock(&rq->lock);
> @@ -222,20 +299,22 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  			break;
>  	}
>  
> +out:
>  	spin_unlock(&rq->lock);
> -
>  	return NULL;
>  }
>  
>  /**
>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>   *
> + * @sched: the gpu scheduler
>   * @rq: scheduler run queue to check.
>   *
>   * Find oldest waiting ready entity, returns NULL if none found.
>   */
>  static struct drm_sched_entity *
> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_rq *rq)
>  {
>  	struct rb_node *rb;
>  
> @@ -245,6 +324,15 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>  
>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>  		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, don't pick another entity
> +			 * which's job might fit and wait until we got space for

			 * whose job might not fit and wait until we get space for
Three fixes:                 ^              ^                         ^

> +			 * this one in terms of fairness.

                         * this one in terms of credit availability.

It's not the "fairness", it's the number of credits we check for in drm_sched_can_queue() below.

> +			 */
> +			if (!drm_sched_can_queue(sched, entity)) {
> +				rb = NULL;
> +				break;
> +			}
> +

That's good.

>  			rq->current_entity = entity;
>  			reinit_completion(&entity->entity_idle);
>  			break;
> @@ -266,18 +354,6 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>  		queue_work(sched->submit_wq, &sched->work_run_job);
>  }
>  
> -/**
> - * drm_sched_can_queue -- Can we queue more to the hardware?
> - * @sched: scheduler instance
> - *
> - * Return true if we can push more jobs to the hw, otherwise false.
> - */
> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> -{
> -	return atomic_read(&sched->hw_rq_count) <
> -		sched->hw_submission_limit;
> -}
> -
>  /**
>   * drm_sched_select_entity - Select next entity to process
>   *
> @@ -291,14 +367,11 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>  	struct drm_sched_entity *entity;
>  	int i;
>  
> -	if (!drm_sched_can_queue(sched))
> -		return NULL;
> -
>  	/* Kernel run queue has higher priority than normal run queue*/
>  	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>  		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>  		if (entity)
>  			break;
>  	}
> @@ -353,7 +426,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>  	struct drm_sched_fence *s_fence = s_job->s_fence;
>  	struct drm_gpu_scheduler *sched = s_fence->sched;
>  
> -	atomic_dec(&sched->hw_rq_count);
> +	atomic_sub(s_job->credits, &sched->credit_count);
>  	atomic_dec(sched->score);
>  
>  	trace_drm_sched_process_job(s_fence);
> @@ -576,7 +649,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  					      &s_job->cb)) {
>  			dma_fence_put(s_job->s_fence->parent);
>  			s_job->s_fence->parent = NULL;
> -			atomic_dec(&sched->hw_rq_count);
> +			atomic_sub(s_job->credits, &sched->credit_count);
>  		} else {
>  			/*
>  			 * remove job from pending_list.
> @@ -637,7 +710,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>  	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>  		struct dma_fence *fence = s_job->s_fence->parent;
>  
> -		atomic_inc(&sched->hw_rq_count);
> +		atomic_add(s_job->credits, &sched->credit_count);
>  
>  		if (!full_recovery)
>  			continue;
> @@ -718,6 +791,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>   * drm_sched_job_init - init a scheduler job
>   * @job: scheduler job to init
>   * @entity: scheduler entity to use
> + * @credits: the number of credits this job contributes to the schedulers
> + * credit limit
>   * @owner: job owner for debugging
>   *
>   * Refer to drm_sched_entity_push_job() documentation
> @@ -735,7 +810,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>   */
>  int drm_sched_job_init(struct drm_sched_job *job,
>  		       struct drm_sched_entity *entity,
> -		       void *owner)
> +		       u32 credits, void *owner)
>  {
>  	if (!entity->rq) {
>  		/* This will most likely be followed by missing frames
> @@ -752,6 +827,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>  		return -ENOMEM;
>  
>  	INIT_LIST_HEAD(&job->list);
> +	job->credits = credits ? credits : 1;
>  
>  	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>  
> @@ -961,12 +1037,14 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  /**
>   * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>   * @sched: scheduler instance
> + * @entity: the scheduler entity
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched))
> +	if (drm_sched_can_queue(sched, entity))
>  		drm_sched_run_job_queue(sched);
>  }
>  
> @@ -1104,7 +1182,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  
>  	s_fence = sched_job->s_fence;
>  
> -	atomic_inc(&sched->hw_rq_count);
> +	atomic_add(sched_job->credits, &sched->credit_count);
>  	drm_sched_job_begin(sched_job);
>  
>  	trace_drm_run_job(sched_job, entity);
> @@ -1139,7 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>   * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
>   *	       allocated and used
>   * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
> - * @hw_submission: number of hw submissions that can be in flight
> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>   * @hang_limit: number of times to allow a job to hang before dropping it
>   * @timeout: timeout value in jiffies for the scheduler
>   * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> @@ -1153,14 +1231,14 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   const struct drm_sched_backend_ops *ops,
>  		   struct workqueue_struct *submit_wq,
> -		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
>  		   atomic_t *score, const char *name, struct device *dev)
>  {
>  	int i, ret;
>  
>  	sched->ops = ops;
> -	sched->hw_submission_limit = hw_submission;
> +	sched->credit_limit = credit_limit;
>  	sched->name = name;
>  	if (submit_wq) {
>  		sched->submit_wq = submit_wq;
> @@ -1211,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  	init_waitqueue_head(&sched->job_scheduled);
>  	INIT_LIST_HEAD(&sched->pending_list);
>  	spin_lock_init(&sched->job_list_lock);
> -	atomic_set(&sched->hw_rq_count, 0);
> +	atomic_set(&sched->credit_count, 0);
>  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>  	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>  	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
> index 2e94ce788c71..8479e5302f7b 100644
> --- a/drivers/gpu/drm/v3d/v3d_gem.c
> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>  	job->free = free;
>  
>  	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
> -				 v3d_priv);
> +				 1, v3d_priv);
>  	if (ret)
>  		goto fail;
>  
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index e5a6166eb152..a911b3f5917e 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>   * @sched: the scheduler instance on which this job is scheduled.
>   * @s_fence: contains the fences for the scheduling of job.
>   * @finish_cb: the callback for the finished fence.
> + * @credits: the number of credits this job contributes to the scheduler
>   * @work: Helper to reschdeule job kill to different context.
>   * @id: a unique id assigned to each job scheduled on the scheduler.
>   * @karma: increment on every hang caused by this job. If this exceeds the hang
> @@ -340,6 +341,8 @@ struct drm_sched_job {
>  	struct drm_gpu_scheduler	*sched;
>  	struct drm_sched_fence		*s_fence;
>  
> +	u32				credits;
> +
>  	/*
>  	 * work is used only after finish_cb has been used and will not be
>  	 * accessed anymore.
> @@ -463,13 +466,29 @@ struct drm_sched_backend_ops {
>           * and it's time to clean it up.
>  	 */
>  	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @update_job_credits: Called once the scheduler is considering this
> +	 * job for execution.
> +	 *

This whole paragraph isn't very clear about what update_job_credits does.
Perhaps a simple and straightforward name would be more clear:
	get_job_credits: Returns the number of credits this job would
		take.

> +	 * Drivers may use this to update the job's submission credits, which is
> +	 * useful to e.g. deduct the number of native fences which have been
> +	 * signaled meanwhile.
> +	 *
> +	 * The callback must either return the new number of submission credits
> +	 * for the given job, or zero if no update is required.

The word "update" is confusing here and it implies that DRM should keep track
of the previous value this function had returned.

Perhaps we can just say:
	* This callback returns the number of credits this job would take if pushed
	* to the driver/hardware. It returns 0, if this job would take no credits.

Else, driver writers would have to deduce what to return here by looking at
drm_sched_can_queue() which effectively does:
	s_job->credits = sched->ops->update_job_credits(s_job).

> +	 *
> +	 * This callback is optional.
> +	 */
> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>  };
>  

The rest is good--thanks for the changes!

>  /**
>   * struct drm_gpu_scheduler - scheduler instance-specific data
>   *
>   * @ops: backend operations provided by the driver.
> - * @hw_submission_limit: the max size of the hardware queue.
> + * @credit_limit: the credit limit of this scheduler
> + * @credit_count: the current credit count of this scheduler
>   * @timeout: the time after which a job is removed from the scheduler.
>   * @name: name of the ring for which this scheduler is being used.
>   * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
> @@ -478,7 +497,6 @@ struct drm_sched_backend_ops {
>   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>   *                 waits on this wait queue until all the scheduled jobs are
>   *                 finished.
> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>   * @job_id_count: used to assign unique id to the each job.
>   * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>   * @timeout_wq: workqueue used to queue @work_tdr
> @@ -502,13 +520,13 @@ struct drm_sched_backend_ops {
>   */
>  struct drm_gpu_scheduler {
>  	const struct drm_sched_backend_ops	*ops;
> -	uint32_t			hw_submission_limit;
> +	u32				credit_limit;
> +	atomic_t			credit_count;
>  	long				timeout;
>  	const char			*name;
>  	u32                             num_rqs;
>  	struct drm_sched_rq             **sched_rq;
>  	wait_queue_head_t		job_scheduled;
> -	atomic_t			hw_rq_count;
>  	atomic64_t			job_id_count;
>  	struct workqueue_struct		*submit_wq;
>  	struct workqueue_struct		*timeout_wq;
> @@ -530,14 +548,14 @@ struct drm_gpu_scheduler {
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   const struct drm_sched_backend_ops *ops,
>  		   struct workqueue_struct *submit_wq,
> -		   u32 num_rqs, uint32_t hw_submission, unsigned hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
>  		   atomic_t *score, const char *name, struct device *dev);
>  
>  void drm_sched_fini(struct drm_gpu_scheduler *sched);
>  int drm_sched_job_init(struct drm_sched_job *job,
>  		       struct drm_sched_entity *entity,
> -		       void *owner);
> +		       u32 credits, void *owner);
>  void drm_sched_job_arm(struct drm_sched_job *job);
>  int drm_sched_job_add_dependency(struct drm_sched_job *job,
>  				 struct dma_fence *fence);
> @@ -559,7 +577,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  
>  void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity);
>  bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
  
Luben Tuikov Oct. 27, 2023, 1:03 a.m. UTC | #2
On 2023-10-26 17:13, Luben Tuikov wrote:
> On 2023-10-26 12:13, Danilo Krummrich wrote:
>> [...]
>> +/**
>> + * drm_sched_can_queue -- Can we queue more to the hardware?
>> + * @sched: scheduler instance
>> + * @entity: the scheduler entity
>> + *
>> + * Return true if we can push at least one more job from @entity, false
>> + * otherwise.
>> + */
>> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>> +				struct drm_sched_entity *entity)
>> +{
>> +	struct drm_sched_job *s_job;
>> +
>> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>> +	if (!s_job)
>> +		return false;
> 
> We don't have this gating logic at the moment, and I don't think
> we need it.
> 
> At the current state of the code, we don't care if there's jobs
> in the incoming queue or not.
> 
> The only thing we should check here for is the credit availability
> for _this job_, as you do in the code below.

I see it now--never mind.

(I would've probably added a comment--something like
/* Incoming queue empty--nothing to schedule, return false. */
as it circles all the way back to where we pick the next job to schedule.)
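
For instance, in drm_sched_can_queue() that would look something like:

	/* Incoming queue empty--nothing to schedule, return false. */
	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
	if (!s_job)
		return false;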

So, I ported this patch to drm-misc-next and then to 6.6-rc7 and ran
with it. All is good.

I think this patch is ready to go in after addressing the cosmetic fixes
to comments and clarity--if you also feel the same way.

Thanks! Great job!

Regards,
Luben


> 
>> +
>> +	if (sched->ops->update_job_credits) {
>> +		u32 credits = sched->ops->update_job_credits(s_job);
>> +
>> +		if (credits)
>> +			s_job->credits = credits;
>> +	}
>> +
>> +	WARN_ON(s_job->credits > sched->credit_limit);
>> +
>> +	return drm_sched_available_credits(sched) >= s_job->credits;
>> +}
>> +
>>  static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>>  							    const struct rb_node *b)
>>  {
>> @@ -186,12 +250,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>  /**
>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>   *
>> + * @sched: the gpu scheduler
>>   * @rq: scheduler run queue to check.
>>   *
>>   * Try to find a ready entity, returns NULL if none found.
>>   */
>>  static struct drm_sched_entity *
>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>> +			      struct drm_sched_rq *rq)
>>  {
>>  	struct drm_sched_entity *entity;
>>  
>> @@ -201,6 +267,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>  	if (entity) {
>>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>>  			if (drm_sched_entity_is_ready(entity)) {
>> +				/* If we can't queue yet, preserve the current
>> +				 * entity in terms of fairness.
>> +				 */
>> +				if (!drm_sched_can_queue(sched, entity))
>> +					goto out;
>> +
>>  				rq->current_entity = entity;
>>  				reinit_completion(&entity->entity_idle);
>>  				spin_unlock(&rq->lock);
>> @@ -210,8 +282,13 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>  	}
>>  
>>  	list_for_each_entry(entity, &rq->entities, list) {
>> -
>>  		if (drm_sched_entity_is_ready(entity)) {
>> +			/* If we can't queue yet, preserve the current entity in
>> +			 * terms of fairness.
>> +			 */
>> +			if (!drm_sched_can_queue(sched, entity))
>> +				goto out;
>> +
>>  			rq->current_entity = entity;
>>  			reinit_completion(&entity->entity_idle);
>>  			spin_unlock(&rq->lock);
>> @@ -222,20 +299,22 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>  			break;
>>  	}
>>  
>> +out:
>>  	spin_unlock(&rq->lock);
>> -
>>  	return NULL;
>>  }
>>  
>>  /**
>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>   *
>> + * @sched: the gpu scheduler
>>   * @rq: scheduler run queue to check.
>>   *
>>   * Find oldest waiting ready entity, returns NULL if none found.
>>   */
>>  static struct drm_sched_entity *
>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>> +				struct drm_sched_rq *rq)
>>  {
>>  	struct rb_node *rb;
>>  
>> @@ -245,6 +324,15 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>  
>>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>  		if (drm_sched_entity_is_ready(entity)) {
>> +			/* If we can't queue yet, don't pick another entity
>> +			 * which's job might fit and wait until we got space for
> 
> 			 * whose job might not fit and wait until we get space for
> Three fixes:                 ^              ^                         ^
> 
>> +			 * this one in terms of fairness.
> 
>                          * this one in terms of credit availability.
> 
> It's not the "fairness", it's the number of credits we check for in drm_sched_can_queue() below.
> 
>> +			 */
>> +			if (!drm_sched_can_queue(sched, entity)) {
>> +				rb = NULL;
>> +				break;
>> +			}
>> +
> 
> That's good.
> 
>>  			rq->current_entity = entity;
>>  			reinit_completion(&entity->entity_idle);
>>  			break;
>> @@ -266,18 +354,6 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>  		queue_work(sched->submit_wq, &sched->work_run_job);
>>  }
>>  
>> -/**
>> - * drm_sched_can_queue -- Can we queue more to the hardware?
>> - * @sched: scheduler instance
>> - *
>> - * Return true if we can push more jobs to the hw, otherwise false.
>> - */
>> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>> -{
>> -	return atomic_read(&sched->hw_rq_count) <
>> -		sched->hw_submission_limit;
>> -}
>> -
>>  /**
>>   * drm_sched_select_entity - Select next entity to process
>>   *
>> @@ -291,14 +367,11 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>  	struct drm_sched_entity *entity;
>>  	int i;
>>  
>> -	if (!drm_sched_can_queue(sched))
>> -		return NULL;
>> -
>>  	/* Kernel run queue has higher priority than normal run queue*/
>>  	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>  		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
>> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
>> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
>> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
>> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>>  		if (entity)
>>  			break;
>>  	}
>> @@ -353,7 +426,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>  	struct drm_sched_fence *s_fence = s_job->s_fence;
>>  	struct drm_gpu_scheduler *sched = s_fence->sched;
>>  
>> -	atomic_dec(&sched->hw_rq_count);
>> +	atomic_sub(s_job->credits, &sched->credit_count);
>>  	atomic_dec(sched->score);
>>  
>>  	trace_drm_sched_process_job(s_fence);
>> @@ -576,7 +649,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>  					      &s_job->cb)) {
>>  			dma_fence_put(s_job->s_fence->parent);
>>  			s_job->s_fence->parent = NULL;
>> -			atomic_dec(&sched->hw_rq_count);
>> +			atomic_sub(s_job->credits, &sched->credit_count);
>>  		} else {
>>  			/*
>>  			 * remove job from pending_list.
>> @@ -637,7 +710,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>  	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>  		struct dma_fence *fence = s_job->s_fence->parent;
>>  
>> -		atomic_inc(&sched->hw_rq_count);
>> +		atomic_add(s_job->credits, &sched->credit_count);
>>  
>>  		if (!full_recovery)
>>  			continue;
>> @@ -718,6 +791,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>   * drm_sched_job_init - init a scheduler job
>>   * @job: scheduler job to init
>>   * @entity: scheduler entity to use
>> + * @credits: the number of credits this job contributes to the schedulers
>> + * credit limit
>>   * @owner: job owner for debugging
>>   *
>>   * Refer to drm_sched_entity_push_job() documentation
>> @@ -735,7 +810,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>   */
>>  int drm_sched_job_init(struct drm_sched_job *job,
>>  		       struct drm_sched_entity *entity,
>> -		       void *owner)
>> +		       u32 credits, void *owner)
>>  {
>>  	if (!entity->rq) {
>>  		/* This will most likely be followed by missing frames
>> @@ -752,6 +827,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>  		return -ENOMEM;
>>  
>>  	INIT_LIST_HEAD(&job->list);
>> +	job->credits = credits ? credits : 1;
>>  
>>  	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>>  
>> @@ -961,12 +1037,14 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>  /**
>>   * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>>   * @sched: scheduler instance
>> + * @entity: the scheduler entity
>>   *
>>   * Wake up the scheduler if we can queue jobs.
>>   */
>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>> +				   struct drm_sched_entity *entity)
>>  {
>> -	if (drm_sched_can_queue(sched))
>> +	if (drm_sched_can_queue(sched, entity))
>>  		drm_sched_run_job_queue(sched);
>>  }
>>  
>> @@ -1104,7 +1182,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>  
>>  	s_fence = sched_job->s_fence;
>>  
>> -	atomic_inc(&sched->hw_rq_count);
>> +	atomic_add(sched_job->credits, &sched->credit_count);
>>  	drm_sched_job_begin(sched_job);
>>  
>>  	trace_drm_run_job(sched_job, entity);
>> @@ -1139,7 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>   * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
>>   *	       allocated and used
>>   * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
>> - * @hw_submission: number of hw submissions that can be in flight
>> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>>   * @hang_limit: number of times to allow a job to hang before dropping it
>>   * @timeout: timeout value in jiffies for the scheduler
>>   * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
>> @@ -1153,14 +1231,14 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>>  		   const struct drm_sched_backend_ops *ops,
>>  		   struct workqueue_struct *submit_wq,
>> -		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>  		   long timeout, struct workqueue_struct *timeout_wq,
>>  		   atomic_t *score, const char *name, struct device *dev)
>>  {
>>  	int i, ret;
>>  
>>  	sched->ops = ops;
>> -	sched->hw_submission_limit = hw_submission;
>> +	sched->credit_limit = credit_limit;
>>  	sched->name = name;
>>  	if (submit_wq) {
>>  		sched->submit_wq = submit_wq;
>> @@ -1211,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>  	init_waitqueue_head(&sched->job_scheduled);
>>  	INIT_LIST_HEAD(&sched->pending_list);
>>  	spin_lock_init(&sched->job_list_lock);
>> -	atomic_set(&sched->hw_rq_count, 0);
>> +	atomic_set(&sched->credit_count, 0);
>>  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>  	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>  	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
>> index 2e94ce788c71..8479e5302f7b 100644
>> --- a/drivers/gpu/drm/v3d/v3d_gem.c
>> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
>> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>>  	job->free = free;
>>  
>>  	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
>> -				 v3d_priv);
>> +				 1, v3d_priv);
>>  	if (ret)
>>  		goto fail;
>>  
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index e5a6166eb152..a911b3f5917e 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>>   * @sched: the scheduler instance on which this job is scheduled.
>>   * @s_fence: contains the fences for the scheduling of job.
>>   * @finish_cb: the callback for the finished fence.
>> + * @credits: the number of credits this job contributes to the scheduler
>>   * @work: Helper to reschdeule job kill to different context.
>>   * @id: a unique id assigned to each job scheduled on the scheduler.
>>   * @karma: increment on every hang caused by this job. If this exceeds the hang
>> @@ -340,6 +341,8 @@ struct drm_sched_job {
>>  	struct drm_gpu_scheduler	*sched;
>>  	struct drm_sched_fence		*s_fence;
>>  
>> +	u32				credits;
>> +
>>  	/*
>>  	 * work is used only after finish_cb has been used and will not be
>>  	 * accessed anymore.
>> @@ -463,13 +466,29 @@ struct drm_sched_backend_ops {
>>           * and it's time to clean it up.
>>  	 */
>>  	void (*free_job)(struct drm_sched_job *sched_job);
>> +
>> +	/**
>> +	 * @update_job_credits: Called once the scheduler is considering this
>> +	 * job for execution.
>> +	 *
> 
> This whole paragraph isn't very clear about what update_job_credits does.
> Perhaps a simple and straightforward name would be more clear:
> 	get_job_credits: Returns the number of credits this job would
> 		take.
> 
>> +	 * Drivers may use this to update the job's submission credits, which is
>> +	 * useful to e.g. deduct the number of native fences which have been
>> +	 * signaled meanwhile.
>> +	 *
>> +	 * The callback must either return the new number of submission credits
>> +	 * for the given job, or zero if no update is required.
> 
> The word "update" is confusing here and it implies that DRM should keep track
> of the previous value this function had returned.
> 
> Perhaps we can just say:
> 	* This callback returns the number of credits this job would take if pushed
> 	* to the driver/hardware. It returns 0, if this job would take no credits.
> 
> Else, driver writers would have to deduce what to return here by looking at
> drm_sched_can_queue() which effectively does:
> 	s_job->credits = sched->ops->update_job_credits(s_job).
> 
>> +	 *
>> +	 * This callback is optional.
>> +	 */
>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>>  };
>>  
> 
> The rest is good--thanks for the changes!
> 
>>  /**
>>   * struct drm_gpu_scheduler - scheduler instance-specific data
>>   *
>>   * @ops: backend operations provided by the driver.
>> - * @hw_submission_limit: the max size of the hardware queue.
>> + * @credit_limit: the credit limit of this scheduler
>> + * @credit_count: the current credit count of this scheduler
>>   * @timeout: the time after which a job is removed from the scheduler.
>>   * @name: name of the ring for which this scheduler is being used.
>>   * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
>> @@ -478,7 +497,6 @@ struct drm_sched_backend_ops {
>>   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>>   *                 waits on this wait queue until all the scheduled jobs are
>>   *                 finished.
>> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>>   * @job_id_count: used to assign unique id to the each job.
>>   * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>   * @timeout_wq: workqueue used to queue @work_tdr
>> @@ -502,13 +520,13 @@ struct drm_sched_backend_ops {
>>   */
>>  struct drm_gpu_scheduler {
>>  	const struct drm_sched_backend_ops	*ops;
>> -	uint32_t			hw_submission_limit;
>> +	u32				credit_limit;
>> +	atomic_t			credit_count;
>>  	long				timeout;
>>  	const char			*name;
>>  	u32                             num_rqs;
>>  	struct drm_sched_rq             **sched_rq;
>>  	wait_queue_head_t		job_scheduled;
>> -	atomic_t			hw_rq_count;
>>  	atomic64_t			job_id_count;
>>  	struct workqueue_struct		*submit_wq;
>>  	struct workqueue_struct		*timeout_wq;
>> @@ -530,14 +548,14 @@ struct drm_gpu_scheduler {
>>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>>  		   const struct drm_sched_backend_ops *ops,
>>  		   struct workqueue_struct *submit_wq,
>> -		   u32 num_rqs, uint32_t hw_submission, unsigned hang_limit,
>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>  		   long timeout, struct workqueue_struct *timeout_wq,
>>  		   atomic_t *score, const char *name, struct device *dev);
>>  
>>  void drm_sched_fini(struct drm_gpu_scheduler *sched);
>>  int drm_sched_job_init(struct drm_sched_job *job,
>>  		       struct drm_sched_entity *entity,
>> -		       void *owner);
>> +		       u32 credits, void *owner);
>>  void drm_sched_job_arm(struct drm_sched_job *job);
>>  int drm_sched_job_add_dependency(struct drm_sched_job *job,
>>  				 struct dma_fence *fence);
>> @@ -559,7 +577,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>  
>>  void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
>>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>> +				   struct drm_sched_entity *entity);
>>  bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>>  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>
  
Boris Brezillon Oct. 27, 2023, 7:17 a.m. UTC | #3
Hi Danilo,

On Thu, 26 Oct 2023 18:13:00 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> +
> +	/**
> +	 * @update_job_credits: Called once the scheduler is considering this
> +	 * job for execution.
> +	 *
> +	 * Drivers may use this to update the job's submission credits, which is
> +	 * useful to e.g. deduct the number of native fences which have been
> +	 * signaled meanwhile.
> +	 *
> +	 * The callback must either return the new number of submission credits
> +	 * for the given job, or zero if no update is required.
> +	 *
> +	 * This callback is optional.
> +	 */
> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);

I'm copying my late reply to v2 here so it doesn't get lost:

I keep thinking it'd be simpler to make this a void function that
updates s_job->submission_credits directly. I also don't see the
problem with doing a sanity check on job->submission_credits. I mean,
if the driver is doing something silly, you can't do much to prevent it
anyway, except warn the user that something wrong has happened. If you
want to

	WARN_ON(job->submission_credits == 0 ||
		job->submission_credits > job_old_submission_credits);

that's fine. But none of this sanity checking has to do with the
function prototype/semantics, and I'm still not comfortable with this 0
=> no-change. If there's no change, we should just leave  
job->submission_credits unchanged (or return job->submission_credits)
instead of inventing a new special case.
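
For illustration, a rough sketch of the variant I'm describing (just the
shape, not a real patch; field names follow v3 and the exact WARN_ON()
bounds are only an example):

	/* in drm_sched_backend_ops: the driver updates the job in place */
	void (*update_job_credits)(struct drm_sched_job *sched_job);

	/* ... and in drm_sched_can_queue(): */
	if (sched->ops->update_job_credits)
		sched->ops->update_job_credits(s_job);

	WARN_ON(!s_job->credits || s_job->credits > sched->credit_limit);

	return drm_sched_available_credits(sched) >= s_job->credits;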

Regards,

Boris
  
Christian König Oct. 27, 2023, 7:22 a.m. UTC | #4
Am 26.10.23 um 18:13 schrieb Danilo Krummrich:
> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
>
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
>
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
>
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
>
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
> Changes in V2:
> ==============
>    - fixed up influence on scheduling fairness due to consideration of a job's
>      size
>      - If we reach a ready entity in drm_sched_select_entity() but can't actually
>        queue a job from it due to size limitations, just give up and go to sleep
>        until woken up due to a pending job finishing, rather than continue to try
>        other entities.
>    - added a callback to dynamically update a job's credits (Boris)
>    - renamed 'units' to 'credits'
>    - fixed commit message and comments as requested by Luben
>
> Changes in V3:
> ==============
>    - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>    - move up drm_sched_can_queue() instead of adding a forward declaration
>      (Boris)
>    - add a drm_sched_available_credits() helper (Boris)
>    - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>    - re-phrase a few comments and fix a typo (Luben)
>    - change naming of all structures credit fields and function parameters to the
>      following scheme
>      - drm_sched_job::credits
>      - drm_gpu_scheduler::credit_limit
>      - drm_gpu_scheduler::credit_count
>      - drm_sched_init(..., u32 credit_limit, ...)
>      - drm_sched_job_init(..., u32 credits, ...)
>    - add proper documentation for the scheduler's job-flow control mechanism
>
> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
> provides a branch based on drm-misc-next, with the named series and this patch
> on top of it.
>
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
> ---
>   Documentation/gpu/drm-mm.rst                  |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>   drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>   drivers/gpu/drm/lima/lima_sched.c             |   2 +-
>   drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
>   drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
>   drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
>   .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>   drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
>   drivers/gpu/drm/scheduler/sched_main.c        | 142 ++++++++++++++----
>   drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
>   include/drm/gpu_scheduler.h                   |  33 +++-
>   12 files changed, 152 insertions(+), 49 deletions(-)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index 602010cb6894..acc5901ac840 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>      :doc: Overview
>   
> +Flow Control
> +------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Flow Control
> +
>   Scheduler Function References
>   -----------------------------
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 1f357198533f..62bb7fc7448a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	if (!entity)
>   		return 0;
>   
> -	return drm_sched_job_init(&(*job)->base, entity, owner);
> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>   }
>   
>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> index 2416c526f9b0..3d0f8d182506 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>   
>   	ret = drm_sched_job_init(&submit->sched_job,
>   				 &ctx->sched_entity[args->pipe],
> -				 submit->ctx);
> +				 1, submit->ctx);
>   	if (ret)
>   		goto err_submit_put;
>   
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 23a6276f1332..cdcb37ff62c3 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>   	for (i = 0; i < num_bos; i++)
>   		drm_gem_object_get(&bos[i]->base.base);
>   
> -	err = drm_sched_job_init(&task->base, &context->base, vm);
> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>   	if (err) {
>   		kfree(task->bos);
>   		return err;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 99744de6c05a..c002cabe7b9c 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>   		return ERR_PTR(ret);
>   	}
>   
> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>   	if (ret) {
>   		kfree(submit->hw_fence);
>   		kfree(submit);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index 7e64b5ef90fb..1b2cc3f2e1c7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>   
>   	}
>   
> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>   	if (ret)
>   		goto err_free_chains;
>   
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index b834777b409b..54d1c19bea84 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>   
>   	ret = drm_sched_job_init(&job->base,
>   				 &file_priv->sched_entity[slot],
> -				 NULL);
> +				 1, NULL);
>   	if (ret)
>   		goto out_put_job;
>   
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> index 3143ecaaff86..f8ed093b7356 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>   			   __assign_str(name, sched_job->sched->name);
>   			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>   			   __entry->hw_job_count = atomic_read(
> -				   &sched_job->sched->hw_rq_count);
> +				   &sched_job->sched->credit_count);
>   			   ),
>   	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>   		      __entry->entity, __entry->id,
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 409e4256f6e7..b79e0672b94f 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>   		container_of(cb, struct drm_sched_entity, cb);
>   
>   	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>   }
>   
>   /**
> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>   		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>   			drm_sched_rq_update_fifo(entity, submit_ts);
>   
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>   	}
>   }
>   EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 246213963928..3cc3dc0091fc 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -48,6 +48,30 @@
>    * through the jobs entity pointer.
>    */
>   
> +/**
> + * DOC: Flow Control
> + *
> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> + * in which the jobs fetched from scheduler entities are executed.
> + *
> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
> + * credit limit representing the capacity of this scheduler and a credit count;
> + * every &drm_sched_job carries a driver specified number of credits.
> + *
> + * Once a job is executed (but not yet finished), the job's credits contribute
> + * to the scheduler's credit count until the job is finished. If by executing
> + * one more job the scheduler's credit count would exceed the scheduler's
> + * credit limit, the job won't be executed. Instead, the scheduler will wait
> + * until the credit count has decreased enough to not overflow its credit limit.
> + * This implies waiting for previously executed jobs.
> + *
> + * Optionally, drivers can register a callback (update_job_credits) provided by
> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
> + * scheduler will execute this callback every time the scheduler considers a job
> + * for execution and subsequently checks whether the job fits the scheduler's
> + * credit limit.
> + */
> +
>   #include <linux/wait.h>
>   #include <linux/sched.h>
>   #include <linux/completion.h>
> @@ -75,6 +99,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>   MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>   module_param_named(sched_policy, drm_sched_policy, int, 0444);
>   
> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> +{
> +	u32 credits;
> +
> +	WARN_ON(check_sub_overflow(sched->credit_limit,
> +				   atomic_read(&sched->credit_count),
> +				   &credits));
> +
> +	return credits;
> +}
> +
> +/**
> + * drm_sched_can_queue -- Can we queue more to the hardware?
> + * @sched: scheduler instance
> + * @entity: the scheduler entity
> + *
> + * Return true if we can push at least one more job from @entity, false
> + * otherwise.
> + */
> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *s_job;
> +
> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> +	if (!s_job)
> +		return false;
> +
> +	if (sched->ops->update_job_credits) {
> +		u32 credits = sched->ops->update_job_credits(s_job);
> +
> +		if (credits)
> +			s_job->credits = credits;
> +	}
> +
> +	WARN_ON(s_job->credits > sched->credit_limit);
> +
> +	return drm_sched_available_credits(sched) >= s_job->credits;
> +}
> +
>   static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>   							    const struct rb_node *b)
>   {
> @@ -186,12 +250,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>   /**
>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>    *
> + * @sched: the gpu scheduler
>    * @rq: scheduler run queue to check.
>    *
>    * Try to find a ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
> +			      struct drm_sched_rq *rq)
>   {
>   	struct drm_sched_entity *entity;
>   
> @@ -201,6 +267,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	if (entity) {
>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>   			if (drm_sched_entity_is_ready(entity)) {
> +				/* If we can't queue yet, preserve the current
> +				 * entity in terms of fairness.
> +				 */
> +				if (!drm_sched_can_queue(sched, entity))
> +					goto out;
> +
>   				rq->current_entity = entity;
>   				reinit_completion(&entity->entity_idle);
>   				spin_unlock(&rq->lock);
> @@ -210,8 +282,13 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	}
>   
>   	list_for_each_entry(entity, &rq->entities, list) {
> -
>   		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, preserve the current entity in
> +			 * terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity))
> +				goto out;
> +
>   			rq->current_entity = entity;
>   			reinit_completion(&entity->entity_idle);
>   			spin_unlock(&rq->lock);
> @@ -222,20 +299,22 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   			break;
>   	}
>   
> +out:
>   	spin_unlock(&rq->lock);
> -
>   	return NULL;
>   }
>   
>   /**
>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>    *
> + * @sched: the gpu scheduler
>    * @rq: scheduler run queue to check.
>    *
>    * Find oldest waiting ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
> +				struct drm_sched_rq *rq)
>   {
>   	struct rb_node *rb;
>   
> @@ -245,6 +324,15 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>   
>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>   		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, don't pick another entity
> +			 * which's job might fit and wait until we got space for
> +			 * this one in terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity)) {
> +				rb = NULL;
> +				break;
> +			}
> +
>   			rq->current_entity = entity;
>   			reinit_completion(&entity->entity_idle);
>   			break;
> @@ -266,18 +354,6 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>   		queue_work(sched->submit_wq, &sched->work_run_job);
>   }
>   
> -/**
> - * drm_sched_can_queue -- Can we queue more to the hardware?
> - * @sched: scheduler instance
> - *
> - * Return true if we can push more jobs to the hw, otherwise false.
> - */
> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> -{
> -	return atomic_read(&sched->hw_rq_count) <
> -		sched->hw_submission_limit;
> -}
> -
>   /**
>    * drm_sched_select_entity - Select next entity to process
>    *
> @@ -291,14 +367,11 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>   	struct drm_sched_entity *entity;
>   	int i;
>   
> -	if (!drm_sched_can_queue(sched))
> -		return NULL;
> -
>   	/* Kernel run queue has higher priority than normal run queue*/
>   	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>   		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>   		if (entity)
>   			break;
>   	}
> @@ -353,7 +426,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>   	struct drm_sched_fence *s_fence = s_job->s_fence;
>   	struct drm_gpu_scheduler *sched = s_fence->sched;
>   
> -	atomic_dec(&sched->hw_rq_count);
> +	atomic_sub(s_job->credits, &sched->credit_count);
>   	atomic_dec(sched->score);
>   
>   	trace_drm_sched_process_job(s_fence);
> @@ -576,7 +649,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>   					      &s_job->cb)) {
>   			dma_fence_put(s_job->s_fence->parent);
>   			s_job->s_fence->parent = NULL;
> -			atomic_dec(&sched->hw_rq_count);
> +			atomic_sub(s_job->credits, &sched->credit_count);
>   		} else {
>   			/*
>   			 * remove job from pending_list.
> @@ -637,7 +710,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>   	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>   		struct dma_fence *fence = s_job->s_fence->parent;
>   
> -		atomic_inc(&sched->hw_rq_count);
> +		atomic_add(s_job->credits, &sched->credit_count);
>   
>   		if (!full_recovery)
>   			continue;
> @@ -718,6 +791,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>    * drm_sched_job_init - init a scheduler job
>    * @job: scheduler job to init
>    * @entity: scheduler entity to use
> + * @credits: the number of credits this job contributes to the schedulers
> + * credit limit
>    * @owner: job owner for debugging
>    *
>    * Refer to drm_sched_entity_push_job() documentation
> @@ -735,7 +810,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>    */
>   int drm_sched_job_init(struct drm_sched_job *job,
>   		       struct drm_sched_entity *entity,
> -		       void *owner)
> +		       u32 credits, void *owner)
>   {
>   	if (!entity->rq) {
>   		/* This will most likely be followed by missing frames
> @@ -752,6 +827,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>   		return -ENOMEM;
>   
>   	INIT_LIST_HEAD(&job->list);
> +	job->credits = credits ? credits : 1;
>   
>   	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>   
> @@ -961,12 +1037,14 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>   /**
>    * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>    * @sched: scheduler instance
> + * @entity: the scheduler entity
>    *
>    * Wake up the scheduler if we can queue jobs.
>    */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity)
>   {
> -	if (drm_sched_can_queue(sched))
> +	if (drm_sched_can_queue(sched, entity))
>   		drm_sched_run_job_queue(sched);
>   }
>   
> @@ -1104,7 +1182,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>   
>   	s_fence = sched_job->s_fence;
>   
> -	atomic_inc(&sched->hw_rq_count);
> +	atomic_add(sched_job->credits, &sched->credit_count);
>   	drm_sched_job_begin(sched_job);
>   
>   	trace_drm_run_job(sched_job, entity);
> @@ -1139,7 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>    * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
>    *	       allocated and used
>    * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
> - * @hw_submission: number of hw submissions that can be in flight
> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>    * @hang_limit: number of times to allow a job to hang before dropping it
>    * @timeout: timeout value in jiffies for the scheduler
>    * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> @@ -1153,14 +1231,14 @@ static void drm_sched_run_job_work(struct work_struct *w)
>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>   		   const struct drm_sched_backend_ops *ops,
>   		   struct workqueue_struct *submit_wq,
> -		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>   		   long timeout, struct workqueue_struct *timeout_wq,
>   		   atomic_t *score, const char *name, struct device *dev)
>   {
>   	int i, ret;
>   
>   	sched->ops = ops;
> -	sched->hw_submission_limit = hw_submission;
> +	sched->credit_limit = credit_limit;
>   	sched->name = name;
>   	if (submit_wq) {
>   		sched->submit_wq = submit_wq;
> @@ -1211,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   	init_waitqueue_head(&sched->job_scheduled);
>   	INIT_LIST_HEAD(&sched->pending_list);
>   	spin_lock_init(&sched->job_list_lock);
> -	atomic_set(&sched->hw_rq_count, 0);
> +	atomic_set(&sched->credit_count, 0);
>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>   	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>   	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
> index 2e94ce788c71..8479e5302f7b 100644
> --- a/drivers/gpu/drm/v3d/v3d_gem.c
> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>   	job->free = free;
>   
>   	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
> -				 v3d_priv);
> +				 1, v3d_priv);
>   	if (ret)
>   		goto fail;
>   
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index e5a6166eb152..a911b3f5917e 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>    * @sched: the scheduler instance on which this job is scheduled.
>    * @s_fence: contains the fences for the scheduling of job.
>    * @finish_cb: the callback for the finished fence.
> + * @credits: the number of credits this job contributes to the scheduler
>    * @work: Helper to reschdeule job kill to different context.
>    * @id: a unique id assigned to each job scheduled on the scheduler.
>    * @karma: increment on every hang caused by this job. If this exceeds the hang
> @@ -340,6 +341,8 @@ struct drm_sched_job {
>   	struct drm_gpu_scheduler	*sched;
>   	struct drm_sched_fence		*s_fence;
>   
> +	u32				credits;
> +
>   	/*
>   	 * work is used only after finish_cb has been used and will not be
>   	 * accessed anymore.
> @@ -463,13 +466,29 @@ struct drm_sched_backend_ops {
>            * and it's time to clean it up.
>   	 */
>   	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @update_job_credits: Called once the scheduler is considering this
> +	 * job for execution.
> +	 *
> +	 * Drivers may use this to update the job's submission credits, which is
> +	 * useful to e.g. deduct the number of native fences which have been
> +	 * signaled meanwhile.
> +	 *
> +	 * The callback must either return the new number of submission credits
> +	 * for the given job, or zero if no update is required.
> +	 *
> +	 * This callback is optional.
> +	 */
> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);

Why do we need an extra callback for this?

Just document that prepare_job() is allowed to reduce the number of 
credits the job might need.
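
For illustration, the documentation-only variant could be as small as adding
something like this to the existing @prepare_job kernel-doc (wording is just
a sketch):

	 * Drivers may lower sched_job->credits from this callback if the job
	 * turns out to need less ring space than initially accounted for; the
	 * updated value is used for the subsequent flow-control check.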

Regards,
Christian.

>   };
>   
>   /**
>    * struct drm_gpu_scheduler - scheduler instance-specific data
>    *
>    * @ops: backend operations provided by the driver.
> - * @hw_submission_limit: the max size of the hardware queue.
> + * @credit_limit: the credit limit of this scheduler
> + * @credit_count: the current credit count of this scheduler
>    * @timeout: the time after which a job is removed from the scheduler.
>    * @name: name of the ring for which this scheduler is being used.
>    * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
> @@ -478,7 +497,6 @@ struct drm_sched_backend_ops {
>    * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>    *                 waits on this wait queue until all the scheduled jobs are
>    *                 finished.
> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>    * @job_id_count: used to assign unique id to the each job.
>    * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>    * @timeout_wq: workqueue used to queue @work_tdr
> @@ -502,13 +520,13 @@ struct drm_sched_backend_ops {
>    */
>   struct drm_gpu_scheduler {
>   	const struct drm_sched_backend_ops	*ops;
> -	uint32_t			hw_submission_limit;
> +	u32				credit_limit;
> +	atomic_t			credit_count;
>   	long				timeout;
>   	const char			*name;
>   	u32                             num_rqs;
>   	struct drm_sched_rq             **sched_rq;
>   	wait_queue_head_t		job_scheduled;
> -	atomic_t			hw_rq_count;
>   	atomic64_t			job_id_count;
>   	struct workqueue_struct		*submit_wq;
>   	struct workqueue_struct		*timeout_wq;
> @@ -530,14 +548,14 @@ struct drm_gpu_scheduler {
>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>   		   const struct drm_sched_backend_ops *ops,
>   		   struct workqueue_struct *submit_wq,
> -		   u32 num_rqs, uint32_t hw_submission, unsigned hang_limit,
> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>   		   long timeout, struct workqueue_struct *timeout_wq,
>   		   atomic_t *score, const char *name, struct device *dev);
>   
>   void drm_sched_fini(struct drm_gpu_scheduler *sched);
>   int drm_sched_job_init(struct drm_sched_job *job,
>   		       struct drm_sched_entity *entity,
> -		       void *owner);
> +		       u32 credits, void *owner);
>   void drm_sched_job_arm(struct drm_sched_job *job);
>   int drm_sched_job_add_dependency(struct drm_sched_job *job,
>   				 struct dma_fence *fence);
> @@ -559,7 +577,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>   
>   void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
>   void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
> +				   struct drm_sched_entity *entity);
>   bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>   void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
  
Boris Brezillon Oct. 27, 2023, 7:32 a.m. UTC | #5
On Fri, 27 Oct 2023 09:22:12 +0200
Christian König <christian.koenig@amd.com> wrote:

> > +
> > +	/**
> > +	 * @update_job_credits: Called once the scheduler is considering this
> > +	 * job for execution.
> > +	 *
> > +	 * Drivers may use this to update the job's submission credits, which is
> > +	 * useful to e.g. deduct the number of native fences which have been
> > +	 * signaled meanwhile.
> > +	 *
> > +	 * The callback must either return the new number of submission credits
> > +	 * for the given job, or zero if no update is required.
> > +	 *
> > +	 * This callback is optional.
> > +	 */
> > +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);  
> 
> Why do we need an extra callback for this?
> 
> Just document that prepare_job() is allowed to reduce the number of 
> credits the job might need.

->prepare_job() is called only once if the returned fence is NULL, but
we need this credit-update to happen every time a job is considered for
execution by the scheduler. If you're saying this control-flow should
be implemented with a dma_fence that's signaled when enough space is
available, I fear Danilo's work won't be that useful to the PowerVR
driver, unfortunately.
  
Christian König Oct. 27, 2023, 7:35 a.m. UTC | #6
Am 27.10.23 um 09:32 schrieb Boris Brezillon:
> On Fri, 27 Oct 2023 09:22:12 +0200
> Christian König <christian.koenig@amd.com> wrote:
>
>>> +
>>> +	/**
>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>> +	 * job for execution.
>>> +	 *
>>> +	 * Drivers may use this to update the job's submission credits, which is
>>> +	 * useful to e.g. deduct the number of native fences which have been
>>> +	 * signaled meanwhile.
>>> +	 *
>>> +	 * The callback must either return the new number of submission credits
>>> +	 * for the given job, or zero if no update is required.
>>> +	 *
>>> +	 * This callback is optional.
>>> +	 */
>>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>> Why do we need an extra callback for this?
>>
>> Just document that prepare_job() is allowed to reduce the number of
>> credits the job might need.
> ->prepare_job() is called only once if the returned fence is NULL, but
> we need this credit-update to happen every time a job is considered for
> execution by the scheduler.

But the job is only considered for execution once. How do you see that 
this is called multiple times?

Regards,
Christian.

> If you're saying this control-flow should
> be implemented with a dma_fence that's signaled when enough space is
> available, I fear Danilo's work won't be that useful to the PowerVR
> driver, unfortunately.
  
Boris Brezillon Oct. 27, 2023, 7:39 a.m. UTC | #7
On Fri, 27 Oct 2023 09:35:01 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 27.10.23 um 09:32 schrieb Boris Brezillon:
> > On Fri, 27 Oct 2023 09:22:12 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> >  
> >>> +
> >>> +	/**
> >>> +	 * @update_job_credits: Called once the scheduler is considering this
> >>> +	 * job for execution.
> >>> +	 *
> >>> +	 * Drivers may use this to update the job's submission credits, which is
> >>> +	 * useful to e.g. deduct the number of native fences which have been
> >>> +	 * signaled meanwhile.
> >>> +	 *
> >>> +	 * The callback must either return the new number of submission credits
> >>> +	 * for the given job, or zero if no update is required.
> >>> +	 *
> >>> +	 * This callback is optional.
> >>> +	 */
> >>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);  
> >> Why do we need an extra callback for this?
> >>
> >> Just document that prepare_job() is allowed to reduce the number of
> >> credits the job might need.
> > ->prepare_job() is called only once if the returned fence is NULL, but  
> > we need this credit-update to happen every time a job is considered for
> > execution by the scheduler.  
> 
> But the job is only considered for execution once. How do you see that 
> this is called multiple times?

Nope, it's not. If drm_sched_can_queue() returns false, the scheduler
will go look for another entity that has a job ready for execution, and
get back to this entity later, and test drm_sched_can_queue() again.
Basically, any time drm_sched_can_queue() is called, the job credits
update should happen, so we have an accurate view of how many credits
this job needs.
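
To make the repetition explicit, roughly the following happens with the paths
in the posted patch (simplified, leaving out the FIFO/RR details):

	/*
	 * drm_sched_entity_push_job()
	 *   -> drm_sched_wakeup_if_can_queue()
	 *        -> drm_sched_can_queue()        // 1st check, credits updated
	 *           not enough credits, no run-job work is queued
	 *
	 * a pending job finishes, the scheduler is woken up again
	 *   -> drm_sched_select_entity()
	 *        -> drm_sched_rq_select_entity_fifo()/_rr()
	 *             -> drm_sched_can_queue()    // checked again, credits updated
	 *                enough credits now, the entity is picked
	 */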
  
Boris Brezillon Oct. 27, 2023, 8:22 a.m. UTC | #8
On Fri, 27 Oct 2023 09:44:13 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 27.10.23 um 09:39 schrieb Boris Brezillon:
> > On Fri, 27 Oct 2023 09:35:01 +0200
> > Christian König<christian.koenig@amd.com>  wrote:
> >  
> >> Am 27.10.23 um 09:32 schrieb Boris Brezillon:  
> >>> On Fri, 27 Oct 2023 09:22:12 +0200
> >>> Christian König<christian.koenig@amd.com>  wrote:
> >>>     
> >>>>> +
> >>>>> +	/**
> >>>>> +	 * @update_job_credits: Called once the scheduler is considering this
> >>>>> +	 * job for execution.
> >>>>> +	 *
> >>>>> +	 * Drivers may use this to update the job's submission credits, which is
> >>>>> +	 * useful to e.g. deduct the number of native fences which have been
> >>>>> +	 * signaled meanwhile.
> >>>>> +	 *
> >>>>> +	 * The callback must either return the new number of submission credits
> >>>>> +	 * for the given job, or zero if no update is required.
> >>>>> +	 *
> >>>>> +	 * This callback is optional.
> >>>>> +	 */
> >>>>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);  
> >>>> Why do we need an extra callback for this?
> >>>>
> >>>> Just document that prepare_job() is allowed to reduce the number of
> >>>> credits the job might need.
> >>> ->prepare_job() is called only once if the returned fence is NULL, but  
> >>> we need this credit-update to happen every time a job is considered for
> >>> execution by the scheduler.  
> >> But the job is only considered for execution once. How do you see that
> >> this is called multiple times?  
> > Nope, it's not. If drm_sched_can_queue() returns false, the scheduler
> > will go look for another entity that has a job ready for execution, and
> > get back to this entity later, and test drm_sched_can_queue() again.
> > Basically, any time drm_sched_can_queue() is called, the job credits
> > update should happen, so we have an accurate view of how many credits
> > this job needs.  
> 
> Well, that is the handling which I already rejected because it creates 
> unfairness between processes. When you consider the credits needed 
> *before* scheduling, jobs with a lower credit count are always preferred 
> over jobs with a higher credit count.

My bad, it doesn't pick another entity when an entity with a
ready job that doesn't fit the queue is found, it just bails out from
drm_sched_rq_select_entity_rr() and returns NULL (AKA: no ready entity
found). But we still want to update the job credits before checking if
the job fits or not (next time this entity is tested).

> What you can do is to look at the credits of a job *after* it was picked 
> up for scheduling so that you can schedule more jobs.

Sure, but then you might further delay your job if something made it
smaller (ie. native fences got signaled) between ->prepare_job() and
drm_sched_can_queue(). And any new drm_sched_can_queue() test would
just see the old credits value.

Out of curiosity, what are you worried about with this optional
->update_job_credits() call in the drm_sched_can_queue() path? Is the
if (sched->update_job_credits) overhead considered too high for drivers
that don't need it?
  
Boris Brezillon Oct. 27, 2023, 8:25 a.m. UTC | #9
Hi Danilo,

On Thu, 26 Oct 2023 18:13:00 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> Currently, job flow control is implemented simply by limiting the number
> of jobs in flight. Therefore, a scheduler is initialized with a credit
> limit that corresponds to the number of jobs which can be sent to the
> hardware.
> 
> This implies that for each job, drivers need to account for the maximum
> job size possible in order to not overflow the ring buffer.
> 
> However, there are drivers, such as Nouveau, where the job size has a
> rather large range. For such drivers it can easily happen that job
> submissions not even filling the ring by 1% can block subsequent
> submissions, which, in the worst case, can lead to the ring run dry.
> 
> In order to overcome this issue, allow for tracking the actual job size
> instead of the number of jobs. Therefore, add a field to track a job's
> credit count, which represents the number of credits a job contributes
> to the scheduler's credit limit.
> 
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
> Changes in V2:
> ==============
>   - fixed up influence on scheduling fairness due to consideration of a job's
>     size
>     - If we reach a ready entity in drm_sched_select_entity() but can't actually
>       queue a job from it due to size limitations, just give up and go to sleep
>       until woken up due to a pending job finishing, rather than continue to try
>       other entities.
>   - added a callback to dynamically update a job's credits (Boris)

This callback seems controversial. I'd suggest dropping it, so the
patch can be merged.

Regards,

Boris

>   - renamed 'units' to 'credits'
>   - fixed commit message and comments as requested by Luben
> 
> Changes in V3:
> ==============
>   - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>   - move up drm_sched_can_queue() instead of adding a forward declaration
>     (Boris)
>   - add a drm_sched_available_credits() helper (Boris)
>   - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>   - re-phrase a few comments and fix a typo (Luben)
>   - change naming of all structures credit fields and function parameters to the
>     following scheme
>     - drm_sched_job::credits
>     - drm_gpu_scheduler::credit_limit
>     - drm_gpu_scheduler::credit_count
>     - drm_sched_init(..., u32 credit_limit, ...)
>     - drm_sched_job_init(..., u32 credits, ...)
>   - add proper documentation for the scheduler's job-flow control mechanism
  
Boris Brezillon Oct. 27, 2023, 10:17 a.m. UTC | #10
Hi Christian,

On Fri, 27 Oct 2023 11:06:44 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 27.10.23 um 10:22 schrieb Boris Brezillon:
> > On Fri, 27 Oct 2023 09:44:13 +0200
> > Christian König<christian.koenig@amd.com>  wrote:
> >  
> >> Am 27.10.23 um 09:39 schrieb Boris Brezillon:  
> >>> On Fri, 27 Oct 2023 09:35:01 +0200
> >>> Christian König<christian.koenig@amd.com>   wrote:
> >>>     
> >>>> Am 27.10.23 um 09:32 schrieb Boris Brezillon:  
> >>>>> On Fri, 27 Oct 2023 09:22:12 +0200
> >>>>> Christian König<christian.koenig@amd.com>   wrote:
> >>>>>        
> >>>>>>> +
> >>>>>>> +	/**
> >>>>>>> +	 * @update_job_credits: Called once the scheduler is considering this
> >>>>>>> +	 * job for execution.
> >>>>>>> +	 *
> >>>>>>> +	 * Drivers may use this to update the job's submission credits, which is
> >>>>>>> +	 * useful to e.g. deduct the number of native fences which have been
> >>>>>>> +	 * signaled meanwhile.
> >>>>>>> +	 *
> >>>>>>> +	 * The callback must either return the new number of submission credits
> >>>>>>> +	 * for the given job, or zero if no update is required.
> >>>>>>> +	 *
> >>>>>>> +	 * This callback is optional.
> >>>>>>> +	 */
> >>>>>>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);  
> >>>>>> Why do we need an extra callback for this?
> >>>>>>
> >>>>>> Just document that prepare_job() is allowed to reduce the number of
> >>>>>> credits the job might need.
> >>>>> ->prepare_job() is called only once if the returned fence is NULL, but  
> >>>>> we need this credit-update to happen every time a job is considered for
> >>>>> execution by the scheduler.  
> >>>> But the job is only considered for execution once. How do you see that
> >>>> this is called multiple times?  
> >>> Nope, it's not. If drm_sched_can_queue() returns false, the scheduler
> >>> will go look for another entity that has a job ready for execution, and
> >>> get back to this entity later, and test drm_sched_can_queue() again.
> >>> Basically, any time drm_sched_can_queue() is called, the job credits
> >>> update should happen, so we have an accurate view of how many credits
> >>> this job needs.  
> >> Well, that is the handling which I already rejected because it creates
> >> unfairness between processes. When you consider the credits needed
> >> *before* scheduling, jobs with a lower credit count are always preferred
> >> over jobs with a higher credit count.  
> > My bad, it doesn't pick another entity when an entity with a
> > ready job that doesn't fit the queue is found, it just bails out from
> > drm_sched_rq_select_entity_rr() and returns NULL (AKA: no ready entity
> > found). But we still want to update the job credits before checking if
> > the job fits or not (next time this entity is tested).  
> 
> Why? I only see a few possibilities here:
> 
> 1. You need to wait for submissions to the same scheduler to finish 
> before the one you want to push to the ring.
>      This case can be avoided by trying to cast the dependency fences to 
> drm_sched_fences and checking whether they are already scheduled.
> 
> 2. You need to wait for submissions to a different scheduler instance 
> and in this case you should probably return that as a dependency instead.

It's already described as a dependency, but native dependency waits are
deferred to the FW (we wait on scheduled to run the job, not finished).
The thing is, after ->prepare_job() returned NULL (no non-native deps
remaining), and before ->run_job() gets called, there might be several
of these native deps that get signaled, and we're still considering
job->submission_credits as unchanged, when each of these signalled
fences could have reduced the job credits, potentially allowing it to be
submitted sooner.
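
For the sake of the discussion, a rough sketch of what such a driver-side
hook could look like (all the myhw_* names and fields below are made up,
this is not existing driver code):

	static u32 myhw_update_job_credits(struct drm_sched_job *sched_job)
	{
		struct myhw_job *job = to_myhw_job(sched_job);
		u32 credits = job->base_credits;
		unsigned int i;

		/* Waits for native deps that already signalled no longer need
		 * ring space, so only the still-pending ones keep counting
		 * against the scheduler's credit limit.
		 */
		for (i = 0; i < job->nr_native_deps; i++) {
			if (!dma_fence_is_signaled(job->native_deps[i]))
				credits += MYHW_WAIT_SLOT_CREDITS;
		}

		return credits;
	}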

> 
> So to me it looks like when prepare_job() is called because it is 
> selected as the next job for submission, you should already know how many 
> credits it needs.

You know how many credits it needs when ->prepare_job() is called, but
if this number is too high, the entity will not be picked, and next
time it's checked, you'll still consider the job credits at the time
->prepare_job() was called, which might differ from the current job
credits (native fences might have been signalled in the meantime, and
they could be evicted).

> 
> >> What you can do is to look at the credits of a job *after* it was picked
> >> up for scheduling so that you can schedule more jobs.
> > Sure, but then you might further delay your job if something made it
> > smaller (ie. native fences got signaled) between ->prepare_job() and
> > drm_sched_can_queue(). And any new drm_sched_can_queue() test would
> > just see the old credits value.
> >
> > Out of curiosity, what are you worried about with this optional  
> > ->update_job_credits() call in the drm_sched_can_queue() path? Is the  
> > if (sched->update_job_credits) overhead considered too high for drivers
> > that don't need it?  
> 
> Since the dma_fences are also used for resource management, the scheduler 
> is vital for correct system operation.
> 
> We had massive problems because people tried to over-optimize the 
> dma_fence handling, which led to memory corruptions that were very hard 
> to narrow down.
> 
> So every change you come up with here needs to have a very, very good 
> justification. And just saving a bit of your ring buffer size is 
> certainly not one of them.

I fail to see how calling ->update_job_credits() changes the scheduler
behavior or how it relates to the sort of memory corruption you're
referring to. And yes, we can live with the overhead of making jobs
slightly bigger than they actually are, thus potentially delaying their
execution. It's just that I don't quite understand the rationale behind
this conservatism, as I don't really see what negative impact this extra
->update_job_credits() call in the credit checking path has, other than
the slight overhead of an if-check for drivers that don't need it.

Regards,

Boris
  
Danilo Krummrich Oct. 27, 2023, 2:23 p.m. UTC | #11
On 10/27/23 10:25, Boris Brezillon wrote:
> Hi Danilo,
> 
> On Thu, 26 Oct 2023 18:13:00 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
>> Currently, job flow control is implemented simply by limiting the number
>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>> limit that corresponds to the number of jobs which can be sent to the
>> hardware.
>>
>> This implies that for each job, drivers need to account for the maximum
>> job size possible in order to not overflow the ring buffer.
>>
>> However, there are drivers, such as Nouveau, where the job size has a
>> rather large range. For such drivers it can easily happen that job
>> submissions not even filling the ring by 1% can block subsequent
>> submissions, which, in the worst case, can lead to the ring run dry.
>>
>> In order to overcome this issue, allow for tracking the actual job size
>> instead of the number of jobs. Therefore, add a field to track a job's
>> credit count, which represents the number of credits a job contributes
>> to the scheduler's credit limit.
>>
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>> ---
>> Changes in V2:
>> ==============
>>    - fixed up influence on scheduling fairness due to consideration of a job's
>>      size
>>      - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>        queue a job from it due to size limitations, just give up and go to sleep
>>        until woken up due to a pending job finishing, rather than continue to try
>>        other entities.
>>    - added a callback to dynamically update a job's credits (Boris)
> 
> This callback seems controversial. I'd suggest dropping it, so the
> patch can be merged.

I don't think we should drop it just for the sake of moving forward. If there are objections,
we'll discuss them. I've seen good reasons why the drivers you are working on require this,
while, following the discussion, I have *not* seen any reason to drop it. It helps simplify
driver code and doesn't introduce any complexity or overhead for existing drivers.

> 
> Regards,
> 
> Boris
> 
>>    - renamed 'units' to 'credits'
>>    - fixed commit message and comments as requested by Luben
>>
>> Changes in V3:
>> ==============
>>    - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>>    - move up drm_sched_can_queue() instead of adding a forward declaration
>>      (Boris)
>>    - add a drm_sched_available_credits() helper (Boris)
>>    - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>>    - re-phrase a few comments and fix a typo (Luben)
>>    - change naming of all structures credit fields and function parameters to the
>>      following scheme
>>      - drm_sched_job::credits
>>      - drm_gpu_scheduler::credit_limit
>>      - drm_gpu_scheduler::credit_count
>>      - drm_sched_init(..., u32 credit_limit, ...)
>>      - drm_sched_job_init(..., u32 credits, ...)
>>    - add proper documentation for the scheduler's job-flow control mechanism
>
  
Luben Tuikov Oct. 27, 2023, 2:32 p.m. UTC | #12
On 2023-10-27 04:25, Boris Brezillon wrote:
> Hi Danilo,
> 
> On Thu, 26 Oct 2023 18:13:00 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
>> Currently, job flow control is implemented simply by limiting the number
>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>> limit that corresponds to the number of jobs which can be sent to the
>> hardware.
>>
>> This implies that for each job, drivers need to account for the maximum
>> job size possible in order to not overflow the ring buffer.
>>
>> However, there are drivers, such as Nouveau, where the job size has a
>> rather large range. For such drivers it can easily happen that job
>> submissions not even filling the ring by 1% can block subsequent
>> submissions, which, in the worst case, can lead to the ring run dry.
>>
>> In order to overcome this issue, allow for tracking the actual job size
>> instead of the number of jobs. Therefore, add a field to track a job's
>> credit count, which represents the number of credits a job contributes
>> to the scheduler's credit limit.
>>
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>> ---
>> Changes in V2:
>> ==============
>>   - fixed up influence on scheduling fairness due to consideration of a job's
>>     size
>>     - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>       queue a job from it due to size limitations, just give up and go to sleep
>>       until woken up due to a pending job finishing, rather than continue to try
>>       other entities.
>>   - added a callback to dynamically update a job's credits (Boris)
> 
> This callback seems controversial. I'd suggest dropping it, so the
> patch can be merged.

Sorry, why is it controversial? (I did read the back-and-forth above, but it wasn't clear
why it is /controversial/.)

I believe only drivers are privy to changes in credit availability as their
firmware and hardware execute new jobs and finish others, and so this "update"
here is essential--leaving it only to prepare_job() wouldn't quite fulfill the purpose
for which the credit mechanism was introduced by this patch in the first place.
  
Danilo Krummrich Oct. 27, 2023, 2:34 p.m. UTC | #13
On 10/27/23 09:17, Boris Brezillon wrote:
> Hi Danilo,
> 
> On Thu, 26 Oct 2023 18:13:00 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
>> +
>> +	/**
>> +	 * @update_job_credits: Called once the scheduler is considering this
>> +	 * job for execution.
>> +	 *
>> +	 * Drivers may use this to update the job's submission credits, which is
>> +	 * useful to e.g. deduct the number of native fences which have been
>> +	 * signaled meanwhile.
>> +	 *
>> +	 * The callback must either return the new number of submission credits
>> +	 * for the given job, or zero if no update is required.
>> +	 *
>> +	 * This callback is optional.
>> +	 */
>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
> 
> I'm copying my late reply to v2 here so it doesn't get lost:
> 
> I keep thinking it'd be simpler to make this a void function that
> updates s_job->submission_credits directly. I also don't see the
> problem with doing a sanity check on job->submission_credits. I mean,
> if the driver is doing something silly, you can't do much to prevent it
> anyway, except warn the user that something wrong has happened. If you
> want to
> 
> 	WARN_ON(job->submission_credits == 0 ||
> 		job->submission_credits > job_old_submission_credits);
> 
> that's fine. But none of this sanity checking has to do with the
> function prototype/semantics, and I'm still not comfortable with this 0
> => no-change. If there's no change, we should just leave
> job->submission_credits unchanged (or return job->submission_credits)
> instead of inventing a new special case.

If we can avoid letting drivers change fields of generic structures directly
without any drawbacks, I think we should. Currently, drivers shouldn't
need to mess with job->credits directly. The initial value is set
through drm_sched_job_init() and is updated through the return value of
update_job_credits().

I'm fine getting rid of the 0 => no-change semantics though. Instead we can just
WARN() on 0. However, if we do that, I'd also want to change it for
drm_sched_job_init() (where 0 currently defaults to 1) such that we accept 0, but
WARN() accordingly.

I think the consistent thing is to either give 0 a different meaning everywhere or just
accept it but WARN() on it.
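
In drm_sched_job_init() that would then look roughly like this (sketch):

	/* no silent 0 -> 1 rewrite anymore; misuse is made visible instead */
	WARN_ON(!credits);
	job->credits = credits;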

> 
> Regards,
> 
> Boris
>
  
Danilo Krummrich Oct. 27, 2023, 2:45 p.m. UTC | #14
Hi Luben,

On 10/26/23 23:13, Luben Tuikov wrote:
> On 2023-10-26 12:13, Danilo Krummrich wrote:
>> Currently, job flow control is implemented simply by limiting the number
>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>> limit that corresponds to the number of jobs which can be sent to the
>> hardware.
>>
>> This implies that for each job, drivers need to account for the maximum
>> job size possible in order to not overflow the ring buffer.
>>
>> However, there are drivers, such as Nouveau, where the job size has a
>> rather large range. For such drivers it can easily happen that job
>> submissions not even filling the ring by 1% can block subsequent
>> submissions, which, in the worst case, can lead to the ring run dry.
>>
>> In order to overcome this issue, allow for tracking the actual job size
>> instead of the number of jobs. Therefore, add a field to track a job's
>> credit count, which represents the number of credits a job contributes
>> to the scheduler's credit limit.
>>
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>> ---
>> Changes in V2:
>> ==============
>>    - fixed up influence on scheduling fairness due to consideration of a job's
>>      size
>>      - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>        queue a job from it due to size limitations, just give up and go to sleep
>>        until woken up due to a pending job finishing, rather than continue to try
>>        other entities.
>>    - added a callback to dynamically update a job's credits (Boris)
>>    - renamed 'units' to 'credits'
>>    - fixed commit message and comments as requested by Luben
>>
>> Changes in V3:
>> ==============
>>    - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>>    - move up drm_sched_can_queue() instead of adding a forward declaration
>>      (Boris)
>>    - add a drm_sched_available_credits() helper (Boris)
>>    - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>>    - re-phrase a few comments and fix a typo (Luben)
>>    - change naming of all structures credit fields and function parameters to the
>>      following scheme
>>      - drm_sched_job::credits
>>      - drm_gpu_scheduler::credit_limit
>>      - drm_gpu_scheduler::credit_count
>>      - drm_sched_init(..., u32 credit_limit, ...)
>>      - drm_sched_job_init(..., u32 credits, ...)
>>    - add proper documentation for the scheduler's job-flow control mechanism
>>
>> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
>> provides a branch based on drm-misc-next, with the named series and this patch
>> on top of it.
>>
>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
>> ---
>>   Documentation/gpu/drm-mm.rst                  |   6 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>>   drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>   drivers/gpu/drm/lima/lima_sched.c             |   2 +-
>>   drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
>>   drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
>>   drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
>>   .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>   drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
>>   drivers/gpu/drm/scheduler/sched_main.c        | 142 ++++++++++++++----
>>   drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
>>   include/drm/gpu_scheduler.h                   |  33 +++-
>>   12 files changed, 152 insertions(+), 49 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>> index 602010cb6894..acc5901ac840 100644
>> --- a/Documentation/gpu/drm-mm.rst
>> +++ b/Documentation/gpu/drm-mm.rst
>> @@ -552,6 +552,12 @@ Overview
>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>      :doc: Overview
>>   
>> +Flow Control
>> +------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Flow Control
>> +
>>   Scheduler Function References
>>   -----------------------------
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 1f357198533f..62bb7fc7448a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>   	if (!entity)
>>   		return 0;
>>   
>> -	return drm_sched_job_init(&(*job)->base, entity, owner);
>> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>>   }
>>   
>>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>> index 2416c526f9b0..3d0f8d182506 100644
>> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>>   
>>   	ret = drm_sched_job_init(&submit->sched_job,
>>   				 &ctx->sched_entity[args->pipe],
>> -				 submit->ctx);
>> +				 1, submit->ctx);
>>   	if (ret)
>>   		goto err_submit_put;
>>   
>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>> index 23a6276f1332..cdcb37ff62c3 100644
>> --- a/drivers/gpu/drm/lima/lima_sched.c
>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>>   	for (i = 0; i < num_bos; i++)
>>   		drm_gem_object_get(&bos[i]->base.base);
>>   
>> -	err = drm_sched_job_init(&task->base, &context->base, vm);
>> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>>   	if (err) {
>>   		kfree(task->bos);
>>   		return err;
>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>> index 99744de6c05a..c002cabe7b9c 100644
>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>   		return ERR_PTR(ret);
>>   	}
>>   
>> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
>> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>>   	if (ret) {
>>   		kfree(submit->hw_fence);
>>   		kfree(submit);
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> index 7e64b5ef90fb..1b2cc3f2e1c7 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>>   
>>   	}
>>   
>> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
>> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>>   	if (ret)
>>   		goto err_free_chains;
>>   
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
>> index b834777b409b..54d1c19bea84 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>>   
>>   	ret = drm_sched_job_init(&job->base,
>>   				 &file_priv->sched_entity[slot],
>> -				 NULL);
>> +				 1, NULL);
>>   	if (ret)
>>   		goto out_put_job;
>>   
>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>> index 3143ecaaff86..f8ed093b7356 100644
>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>>   			   __assign_str(name, sched_job->sched->name);
>>   			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>>   			   __entry->hw_job_count = atomic_read(
>> -				   &sched_job->sched->hw_rq_count);
>> +				   &sched_job->sched->credit_count);
>>   			   ),
>>   	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>>   		      __entry->entity, __entry->id,
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 409e4256f6e7..b79e0672b94f 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>   		container_of(cb, struct drm_sched_entity, cb);
>>   
>>   	drm_sched_entity_clear_dep(f, cb);
>> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
>> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>   }
>>   
>>   /**
>> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>   			drm_sched_rq_update_fifo(entity, submit_ts);
>>   
>> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
>> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>   	}
>>   }
>>   EXPORT_SYMBOL(drm_sched_entity_push_job);
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 246213963928..3cc3dc0091fc 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -48,6 +48,30 @@
>>    * through the jobs entity pointer.
>>    */
>>   
>> +/**
>> + * DOC: Flow Control
>> + *
>> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
>> + * in which the jobs fetched from scheduler entities are executed.
>> + *
>> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
>> + * credit limit representing the capacity of this scheduler and a credit count;
>> + * every &drm_sched_job carries a driver specified number of credits.
>> + *
>> + * Once a job is executed (but not yet finished), the job's credits contribute
>> + * to the scheduler's credit count until the job is finished. If by executing
>> + * one more job the scheduler's credit count would exceed the scheduler's
>> + * credit limit, the job won't be executed. Instead, the scheduler will wait
>> + * until the credit count has decreased enough to not overflow its credit limit.
>> + * This implies waiting for previously executed jobs.
>> + *
>> + * Optionally, drivers can register a callback (update_job_credits) provided by
> 
> "can" --> "may".
> 
>> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
>> + * scheduler will execute this callback every time the scheduler considers a job
> 
> No future tense, "will execute" --> "executes", because when this patch lands,
> it all becomes present reality. See below for a reworded paragraph.
> 
>> + * for execution and subsequently checks whether the job fits the scheduler's
>> + * credit limit.
> 
> This here is a good explanation of what update_job_credits() does, but the one
> where this callback is defined in the scheduler ops is not very clear (see further down
> into the patch).
> 
> I think we shouldn't use the word "update" as we don't really have any "update" code,
> a la,
> 	if old value != new value, then
> 		old value = new value # update it
> in the code using this new sched op.
> 
> Perhaps it should be a "get_job_credits()" as this is how this function is used
> in drm_sched_can_queue().
> 
> Perhaps this adds some clarity:
> 
> * Optionally, drivers may register a sched ops callback, get_job_credits(), which
> * returns the number of credits the job passed as an argument would take when
> * submitted to driver/hardware. The scheduler executes this callback every time it
> * considers a job for execution in order to check if the job's credits fit
> * into the scheduler's credit limit.
> 
> (FWIW, it may be doing some "update" into the driver/firmware/hardware, but DRM
> has no visibility or care for that, since DRM is using it simply as "tell me
> the number or credits this job would take.")
> 
>> + */
>> +
>>   #include <linux/wait.h>
>>   #include <linux/sched.h>
>>   #include <linux/completion.h>
>> @@ -75,6 +99,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>>   MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>>   module_param_named(sched_policy, drm_sched_policy, int, 0444);
>>   
>> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>> +{
>> +	u32 credits;
>> +
>> +	WARN_ON(check_sub_overflow(sched->credit_limit,
>> +				   atomic_read(&sched->credit_count),
>> +				   &credits));
>> +
>> +	return credits;
>> +}
>> +
>> +/**
>> + * drm_sched_can_queue -- Can we queue more to the hardware?
>> + * @sched: scheduler instance
>> + * @entity: the scheduler entity
>> + *
>> + * Return true if we can push at least one more job from @entity, false
>> + * otherwise.
>> + */
>> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>> +				struct drm_sched_entity *entity)
>> +{
>> +	struct drm_sched_job *s_job;
>> +
>> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>> +	if (!s_job)
>> +		return false;
> 
> We don't have this gating logic at the moment, and I don't think
> we need it.
> 
> At the current state of the code, we don't care if there's jobs
> in the incoming queue or not.
> 
> The only thing we should check here for is the credit availability
> for _this job_, as you do in the code below.
> 
>> +
>> +	if (sched->ops->update_job_credits) {
>> +		u32 credits = sched->ops->update_job_credits(s_job);
>> +
>> +		if (credits)
>> +			s_job->credits = credits;
>> +	}
>> +
>> +	WARN_ON(s_job->credits > sched->credit_limit);
>> +
>> +	return drm_sched_available_credits(sched) >= s_job->credits;
>> +}
>> +
>>   static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>>   							    const struct rb_node *b)
>>   {
>> @@ -186,12 +250,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>   /**
>>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>    *
>> + * @sched: the gpu scheduler
>>    * @rq: scheduler run queue to check.
>>    *
>>    * Try to find a ready entity, returns NULL if none found.
>>    */
>>   static struct drm_sched_entity *
>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>> +			      struct drm_sched_rq *rq)
>>   {
>>   	struct drm_sched_entity *entity;
>>   
>> @@ -201,6 +267,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>   	if (entity) {
>>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>>   			if (drm_sched_entity_is_ready(entity)) {
>> +				/* If we can't queue yet, preserve the current
>> +				 * entity in terms of fairness.
>> +				 */
>> +				if (!drm_sched_can_queue(sched, entity))
>> +					goto out;
>> +
>>   				rq->current_entity = entity;
>>   				reinit_completion(&entity->entity_idle);
>>   				spin_unlock(&rq->lock);
>> @@ -210,8 +282,13 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>   	}
>>   
>>   	list_for_each_entry(entity, &rq->entities, list) {
>> -
>>   		if (drm_sched_entity_is_ready(entity)) {
>> +			/* If we can't queue yet, preserve the current entity in
>> +			 * terms of fairness.
>> +			 */
>> +			if (!drm_sched_can_queue(sched, entity))
>> +				goto out;
>> +
>>   			rq->current_entity = entity;
>>   			reinit_completion(&entity->entity_idle);
>>   			spin_unlock(&rq->lock);
>> @@ -222,20 +299,22 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>   			break;
>>   	}
>>   
>> +out:
>>   	spin_unlock(&rq->lock);
>> -
>>   	return NULL;
>>   }
>>   
>>   /**
>>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>    *
>> + * @sched: the gpu scheduler
>>    * @rq: scheduler run queue to check.
>>    *
>>    * Find oldest waiting ready entity, returns NULL if none found.
>>    */
>>   static struct drm_sched_entity *
>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>> +				struct drm_sched_rq *rq)
>>   {
>>   	struct rb_node *rb;
>>   
>> @@ -245,6 +324,15 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>   
>>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>   		if (drm_sched_entity_is_ready(entity)) {
>> +			/* If we can't queue yet, don't pick another entity
>> +			 * which's job might fit and wait until we got space for
> 
> 			 * whose job might not fit and wait until we get space for
> Three fixes:                 ^              ^                         ^
> 
>> +			 * this one in terms of fairness.
> 
>                           * this one in terms of credit availability.
> 
> It's not the "fairness", it's the number of credits we check for in drm_sched_can_queue() below.
> 
>> +			 */
>> +			if (!drm_sched_can_queue(sched, entity)) {
>> +				rb = NULL;
>> +				break;
>> +			}
>> +
> 
> That's good.
> 
>>   			rq->current_entity = entity;
>>   			reinit_completion(&entity->entity_idle);
>>   			break;
>> @@ -266,18 +354,6 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>   		queue_work(sched->submit_wq, &sched->work_run_job);
>>   }
>>   
>> -/**
>> - * drm_sched_can_queue -- Can we queue more to the hardware?
>> - * @sched: scheduler instance
>> - *
>> - * Return true if we can push more jobs to the hw, otherwise false.
>> - */
>> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>> -{
>> -	return atomic_read(&sched->hw_rq_count) <
>> -		sched->hw_submission_limit;
>> -}
>> -
>>   /**
>>    * drm_sched_select_entity - Select next entity to process
>>    *
>> @@ -291,14 +367,11 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>   	struct drm_sched_entity *entity;
>>   	int i;
>>   
>> -	if (!drm_sched_can_queue(sched))
>> -		return NULL;
>> -
>>   	/* Kernel run queue has higher priority than normal run queue*/
>>   	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>   		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
>> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
>> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
>> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
>> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>>   		if (entity)
>>   			break;
>>   	}
>> @@ -353,7 +426,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>   	struct drm_sched_fence *s_fence = s_job->s_fence;
>>   	struct drm_gpu_scheduler *sched = s_fence->sched;
>>   
>> -	atomic_dec(&sched->hw_rq_count);
>> +	atomic_sub(s_job->credits, &sched->credit_count);
>>   	atomic_dec(sched->score);
>>   
>>   	trace_drm_sched_process_job(s_fence);
>> @@ -576,7 +649,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>   					      &s_job->cb)) {
>>   			dma_fence_put(s_job->s_fence->parent);
>>   			s_job->s_fence->parent = NULL;
>> -			atomic_dec(&sched->hw_rq_count);
>> +			atomic_sub(s_job->credits, &sched->credit_count);
>>   		} else {
>>   			/*
>>   			 * remove job from pending_list.
>> @@ -637,7 +710,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>   	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>   		struct dma_fence *fence = s_job->s_fence->parent;
>>   
>> -		atomic_inc(&sched->hw_rq_count);
>> +		atomic_add(s_job->credits, &sched->credit_count);
>>   
>>   		if (!full_recovery)
>>   			continue;
>> @@ -718,6 +791,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>    * drm_sched_job_init - init a scheduler job
>>    * @job: scheduler job to init
>>    * @entity: scheduler entity to use
>> + * @credits: the number of credits this job contributes to the schedulers
>> + * credit limit
>>    * @owner: job owner for debugging
>>    *
>>    * Refer to drm_sched_entity_push_job() documentation
>> @@ -735,7 +810,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>    */
>>   int drm_sched_job_init(struct drm_sched_job *job,
>>   		       struct drm_sched_entity *entity,
>> -		       void *owner)
>> +		       u32 credits, void *owner)
>>   {
>>   	if (!entity->rq) {
>>   		/* This will most likely be followed by missing frames
>> @@ -752,6 +827,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>   		return -ENOMEM;
>>   
>>   	INIT_LIST_HEAD(&job->list);
>> +	job->credits = credits ? credits : 1;
>>   
>>   	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>>   
>> @@ -961,12 +1037,14 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>   /**
>>    * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>>    * @sched: scheduler instance
>> + * @entity: the scheduler entity
>>    *
>>    * Wake up the scheduler if we can queue jobs.
>>    */
>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>> +				   struct drm_sched_entity *entity)
>>   {
>> -	if (drm_sched_can_queue(sched))
>> +	if (drm_sched_can_queue(sched, entity))
>>   		drm_sched_run_job_queue(sched);
>>   }
>>   
>> @@ -1104,7 +1182,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>   
>>   	s_fence = sched_job->s_fence;
>>   
>> -	atomic_inc(&sched->hw_rq_count);
>> +	atomic_add(sched_job->credits, &sched->credit_count);
>>   	drm_sched_job_begin(sched_job);
>>   
>>   	trace_drm_run_job(sched_job, entity);
>> @@ -1139,7 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>    * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
>>    *	       allocated and used
>>    * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
>> - * @hw_submission: number of hw submissions that can be in flight
>> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>>    * @hang_limit: number of times to allow a job to hang before dropping it
>>    * @timeout: timeout value in jiffies for the scheduler
>>    * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
>> @@ -1153,14 +1231,14 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>>   		   const struct drm_sched_backend_ops *ops,
>>   		   struct workqueue_struct *submit_wq,
>> -		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>   		   long timeout, struct workqueue_struct *timeout_wq,
>>   		   atomic_t *score, const char *name, struct device *dev)
>>   {
>>   	int i, ret;
>>   
>>   	sched->ops = ops;
>> -	sched->hw_submission_limit = hw_submission;
>> +	sched->credit_limit = credit_limit;
>>   	sched->name = name;
>>   	if (submit_wq) {
>>   		sched->submit_wq = submit_wq;
>> @@ -1211,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>   	init_waitqueue_head(&sched->job_scheduled);
>>   	INIT_LIST_HEAD(&sched->pending_list);
>>   	spin_lock_init(&sched->job_list_lock);
>> -	atomic_set(&sched->hw_rq_count, 0);
>> +	atomic_set(&sched->credit_count, 0);
>>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>   	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>   	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
>> index 2e94ce788c71..8479e5302f7b 100644
>> --- a/drivers/gpu/drm/v3d/v3d_gem.c
>> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
>> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>>   	job->free = free;
>>   
>>   	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
>> -				 v3d_priv);
>> +				 1, v3d_priv);
>>   	if (ret)
>>   		goto fail;
>>   
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index e5a6166eb152..a911b3f5917e 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>>    * @sched: the scheduler instance on which this job is scheduled.
>>    * @s_fence: contains the fences for the scheduling of job.
>>    * @finish_cb: the callback for the finished fence.
>> + * @credits: the number of credits this job contributes to the scheduler
>>    * @work: Helper to reschdeule job kill to different context.
>>    * @id: a unique id assigned to each job scheduled on the scheduler.
>>    * @karma: increment on every hang caused by this job. If this exceeds the hang
>> @@ -340,6 +341,8 @@ struct drm_sched_job {
>>   	struct drm_gpu_scheduler	*sched;
>>   	struct drm_sched_fence		*s_fence;
>>   
>> +	u32				credits;
>> +
>>   	/*
>>   	 * work is used only after finish_cb has been used and will not be
>>   	 * accessed anymore.
>> @@ -463,13 +466,29 @@ struct drm_sched_backend_ops {
>>            * and it's time to clean it up.
>>   	 */
>>   	void (*free_job)(struct drm_sched_job *sched_job);
>> +
>> +	/**
>> +	 * @update_job_credits: Called once the scheduler is considering this
>> +	 * job for execution.
>> +	 *
> 
> This whole paragraph isn't very clear of what update_job_credits does.
> Perhaps a simple and straightforward name would be more clear:
> 	get_job_credits: Returns the number of credits this job would
> 		take.

I still very much prefer 'update' over 'get' for the following reasons.

1. 'Update' describes very well what the driver should do when receiving this
    callback. It should update the number of credits the job requires at the time
    it receives the callback.

2. 'get', as well as 'put', is typically used in a different context in the kernel
    and hence this might create confusion.

> 
>> +	 * Drivers may use this to update the job's submission credits, which is
>> +	 * useful to e.g. deduct the number of native fences which have been
>> +	 * signaled meanwhile.
>> +	 *
>> +	 * The callback must either return the new number of submission credits
>> +	 * for the given job, or zero if no update is required.
> 
> The word "update" is confusing here and it implies that DRM should keep track
> of the previous value this function had returned.
> 
> Perhaps we can just say:
> 	* This callback returns the number of credits this job would take if pushed
> 	* to the driver/hardware. It returns 0, if this job would take no credits.

For this version of the patch this would be wrong: 0 really means that we keep
job->credits as it is. However, I'd probably change this in V4 as requested by Boris,
such that we accept 0 credits, but WARN() on it. Unless we consider it to be a valid
use-case to have jobs that are not flow-controlled at all.
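
If we decided that's a valid use-case (and hence didn't WARN() on 0), it would
basically fall out of the existing check for free--a job initialized with 0
credits would never be held back (sketch of hypothetical driver code):

	/* Hypothetical: a job that should never be throttled by the
	 * scheduler's credit limit.
	 */
	ret = drm_sched_job_init(&job->base, &entity->base, 0, NULL);

	/* drm_sched_can_queue() then checks
	 *   drm_sched_available_credits(sched) >= s_job->credits
	 * which is always true for credits == 0, i.e. the job is never blocked
	 * on credit availability.
	 */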

All other comments look good, gonna address them in a V4.

- Danilo

> 
> Else, driver writers would have to deduce what to return here by looking at
> drm_sched_can_queue() which effectively does:
> 	s_job->credits = sched->ops->update_job_credits(s_job).
> 
>> +	 *
>> +	 * This callback is optional.
>> +	 */
>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>>   };
>>   
> 
> The rest is good--thanks for the changes!
> 
>>   /**
>>    * struct drm_gpu_scheduler - scheduler instance-specific data
>>    *
>>    * @ops: backend operations provided by the driver.
>> - * @hw_submission_limit: the max size of the hardware queue.
>> + * @credit_limit: the credit limit of this scheduler
>> + * @credit_count: the current credit count of this scheduler
>>    * @timeout: the time after which a job is removed from the scheduler.
>>    * @name: name of the ring for which this scheduler is being used.
>>    * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
>> @@ -478,7 +497,6 @@ struct drm_sched_backend_ops {
>>    * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>>    *                 waits on this wait queue until all the scheduled jobs are
>>    *                 finished.
>> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>>    * @job_id_count: used to assign unique id to the each job.
>>    * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>    * @timeout_wq: workqueue used to queue @work_tdr
>> @@ -502,13 +520,13 @@ struct drm_sched_backend_ops {
>>    */
>>   struct drm_gpu_scheduler {
>>   	const struct drm_sched_backend_ops	*ops;
>> -	uint32_t			hw_submission_limit;
>> +	u32				credit_limit;
>> +	atomic_t			credit_count;
>>   	long				timeout;
>>   	const char			*name;
>>   	u32                             num_rqs;
>>   	struct drm_sched_rq             **sched_rq;
>>   	wait_queue_head_t		job_scheduled;
>> -	atomic_t			hw_rq_count;
>>   	atomic64_t			job_id_count;
>>   	struct workqueue_struct		*submit_wq;
>>   	struct workqueue_struct		*timeout_wq;
>> @@ -530,14 +548,14 @@ struct drm_gpu_scheduler {
>>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>>   		   const struct drm_sched_backend_ops *ops,
>>   		   struct workqueue_struct *submit_wq,
>> -		   u32 num_rqs, uint32_t hw_submission, unsigned hang_limit,
>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>   		   long timeout, struct workqueue_struct *timeout_wq,
>>   		   atomic_t *score, const char *name, struct device *dev);
>>   
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched);
>>   int drm_sched_job_init(struct drm_sched_job *job,
>>   		       struct drm_sched_entity *entity,
>> -		       void *owner);
>> +		       u32 credits, void *owner);
>>   void drm_sched_job_arm(struct drm_sched_job *job);
>>   int drm_sched_job_add_dependency(struct drm_sched_job *job,
>>   				 struct dma_fence *fence);
>> @@ -559,7 +577,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>   
>>   void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
>>   void drm_sched_job_cleanup(struct drm_sched_job *job);
>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>> +				   struct drm_sched_entity *entity);
>>   bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>>   void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>
  
Danilo Krummrich Oct. 27, 2023, 2:47 p.m. UTC | #15
On 10/27/23 03:03, Luben Tuikov wrote:
> On 2023-10-26 17:13, Luben Tuikov wrote:
>> On 2023-10-26 12:13, Danilo Krummrich wrote:
>>> Currently, job flow control is implemented simply by limiting the number
>>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>>> limit that corresponds to the number of jobs which can be sent to the
>>> hardware.
>>>
>>> This implies that for each job, drivers need to account for the maximum
>>> job size possible in order to not overflow the ring buffer.
>>>
>>> However, there are drivers, such as Nouveau, where the job size has a
>>> rather large range. For such drivers it can easily happen that job
>>> submissions not even filling the ring by 1% can block subsequent
>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>
>>> In order to overcome this issue, allow for tracking the actual job size
>>> instead of the number of jobs. Therefore, add a field to track a job's
>>> credit count, which represents the number of credits a job contributes
>>> to the scheduler's credit limit.
>>>
>>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>>> ---
>>> Changes in V2:
>>> ==============
>>>    - fixed up influence on scheduling fairness due to consideration of a job's
>>>      size
>>>      - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>>        queue a job from it due to size limitations, just give up and go to sleep
>>>        until woken up due to a pending job finishing, rather than continue to try
>>>        other entities.
>>>    - added a callback to dynamically update a job's credits (Boris)
>>>    - renamed 'units' to 'credits'
>>>    - fixed commit message and comments as requested by Luben
>>>
>>> Changes in V3:
>>> ==============
>>>    - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>>>    - move up drm_sched_can_queue() instead of adding a forward declaration
>>>      (Boris)
>>>    - add a drm_sched_available_credits() helper (Boris)
>>>    - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>>>    - re-phrase a few comments and fix a typo (Luben)
>>>    - change naming of all structures credit fields and function parameters to the
>>>      following scheme
>>>      - drm_sched_job::credits
>>>      - drm_gpu_scheduler::credit_limit
>>>      - drm_gpu_scheduler::credit_count
>>>      - drm_sched_init(..., u32 credit_limit, ...)
>>>      - drm_sched_job_init(..., u32 credits, ...)
>>>    - add proper documentation for the scheduler's job-flow control mechanism
>>>
>>> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
>>> provides a branch based on drm-misc-next, with the named series and this patch
>>> on top of it.
>>>
>>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
>>> ---
>>>   Documentation/gpu/drm-mm.rst                  |   6 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>>>   drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>>   drivers/gpu/drm/lima/lima_sched.c             |   2 +-
>>>   drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
>>>   drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
>>>   drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
>>>   .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>>   drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c        | 142 ++++++++++++++----
>>>   drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
>>>   include/drm/gpu_scheduler.h                   |  33 +++-
>>>   12 files changed, 152 insertions(+), 49 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>>> index 602010cb6894..acc5901ac840 100644
>>> --- a/Documentation/gpu/drm-mm.rst
>>> +++ b/Documentation/gpu/drm-mm.rst
>>> @@ -552,6 +552,12 @@ Overview
>>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>      :doc: Overview
>>>   
>>> +Flow Control
>>> +------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Flow Control
>>> +
>>>   Scheduler Function References
>>>   -----------------------------
>>>   
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 1f357198533f..62bb7fc7448a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>   	if (!entity)
>>>   		return 0;
>>>   
>>> -	return drm_sched_job_init(&(*job)->base, entity, owner);
>>> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>>>   }
>>>   
>>>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> index 2416c526f9b0..3d0f8d182506 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>   
>>>   	ret = drm_sched_job_init(&submit->sched_job,
>>>   				 &ctx->sched_entity[args->pipe],
>>> -				 submit->ctx);
>>> +				 1, submit->ctx);
>>>   	if (ret)
>>>   		goto err_submit_put;
>>>   
>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>>> index 23a6276f1332..cdcb37ff62c3 100644
>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>>>   	for (i = 0; i < num_bos; i++)
>>>   		drm_gem_object_get(&bos[i]->base.base);
>>>   
>>> -	err = drm_sched_job_init(&task->base, &context->base, vm);
>>> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>>>   	if (err) {
>>>   		kfree(task->bos);
>>>   		return err;
>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> index 99744de6c05a..c002cabe7b9c 100644
>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>>   		return ERR_PTR(ret);
>>>   	}
>>>   
>>> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
>>> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>>>   	if (ret) {
>>>   		kfree(submit->hw_fence);
>>>   		kfree(submit);
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> index 7e64b5ef90fb..1b2cc3f2e1c7 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>>>   
>>>   	}
>>>   
>>> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
>>> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>>>   	if (ret)
>>>   		goto err_free_chains;
>>>   
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> index b834777b409b..54d1c19bea84 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>>>   
>>>   	ret = drm_sched_job_init(&job->base,
>>>   				 &file_priv->sched_entity[slot],
>>> -				 NULL);
>>> +				 1, NULL);
>>>   	if (ret)
>>>   		goto out_put_job;
>>>   
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>> index 3143ecaaff86..f8ed093b7356 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>>>   			   __assign_str(name, sched_job->sched->name);
>>>   			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>>>   			   __entry->hw_job_count = atomic_read(
>>> -				   &sched_job->sched->hw_rq_count);
>>> +				   &sched_job->sched->credit_count);
>>>   			   ),
>>>   	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>>>   		      __entry->entity, __entry->id,
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 409e4256f6e7..b79e0672b94f 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>>   		container_of(cb, struct drm_sched_entity, cb);
>>>   
>>>   	drm_sched_entity_clear_dep(f, cb);
>>> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>   }
>>>   
>>>   /**
>>> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>   		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>   			drm_sched_rq_update_fifo(entity, submit_ts);
>>>   
>>> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>   	}
>>>   }
>>>   EXPORT_SYMBOL(drm_sched_entity_push_job);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 246213963928..3cc3dc0091fc 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -48,6 +48,30 @@
>>>    * through the jobs entity pointer.
>>>    */
>>>   
>>> +/**
>>> + * DOC: Flow Control
>>> + *
>>> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
>>> + * in which the jobs fetched from scheduler entities are executed.
>>> + *
>>> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
>>> + * credit limit representing the capacity of this scheduler and a credit count;
>>> + * every &drm_sched_job carries a driver specified number of credits.
>>> + *
>>> + * Once a job is executed (but not yet finished), the job's credits contribute
>>> + * to the scheduler's credit count until the job is finished. If by executing
>>> + * one more job the scheduler's credit count would exceed the scheduler's
>>> + * credit limit, the job won't be executed. Instead, the scheduler will wait
>>> + * until the credit count has decreased enough to not overflow its credit limit.
>>> + * This implies waiting for previously executed jobs.
>>> + *
>>> + * Optionally, drivers can register a callback (update_job_credits) provided by
>>
>> "can" --> "may".
>>
>>> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
>>> + * scheduler will execute this callback every time the scheduler considers a job
>>
>> No future tense, "will execute" --> "executes", because when this patch lands,
>> it all becomes present reality. See below for a reworded paragraph.
>>
>>> + * for execution and subsequently checks whether the job fits the scheduler's
>>> + * credit limit.
>>
>> This here is a good explanation of what update_job_credits() does, but the one
>> where this callback is defined in the scheduler ops is not very clear (see further down
>> into the patch).
>>
>> I think we shouldn't use the word "update" as we don't really have any "update" code,
>> a la,
>> 	if old value != new value, then
>> 		old value = new value # update it
>> in the code using this new sched op.
>>
>> Perhaps it should be a "get_job_credits()" as this is how this function is used
>> in drm_sched_can_queue().
>>
>> Perhaps this adds some clarity:
>>
>> * Optionally, drivers may register a sched ops callback, get_job_credits(), which
>> * returns the number of credits the job passed as an argument would take when
>> * submitted to driver/hardware. The scheduler executes this callback every time it
>> * considers a job for execution in order to check if the job's credits fit
>> * into the scheduler's credit limit.
>>
>> (FWIW, it may be doing some "update" into the driver/firmware/hardware, but DRM
>> has no visibility or care for that, since DRM is using it simply as "tell me
>> the number or credits this job would take.")
>>
>>> + */
>>> +
>>>   #include <linux/wait.h>
>>>   #include <linux/sched.h>
>>>   #include <linux/completion.h>
>>> @@ -75,6 +99,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>>>   MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>>>   module_param_named(sched_policy, drm_sched_policy, int, 0444);
>>>   
>>> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	u32 credits;
>>> +
>>> +	WARN_ON(check_sub_overflow(sched->credit_limit,
>>> +				   atomic_read(&sched->credit_count),
>>> +				   &credits));
>>> +
>>> +	return credits;
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_can_queue -- Can we queue more to the hardware?
>>> + * @sched: scheduler instance
>>> + * @entity: the scheduler entity
>>> + *
>>> + * Return true if we can push at least one more job from @entity, false
>>> + * otherwise.
>>> + */
>>> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_entity *entity)
>>> +{
>>> +	struct drm_sched_job *s_job;
>>> +
>>> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>>> +	if (!s_job)
>>> +		return false;
>>
>> We don't have this gating logic at the moment, and I don't think
>> we need it.
>>
>> At the current state of the code, we don't care if there's jobs
>> in the incoming queue or not.
>>
>> The only thing we should check here for is the credit availability
>> for _this job_, as you do in the code below.
> 
> I see it now--never mind.
> 
> (I would've probably added a comment--something like
> /* Incoming queue empty--nothing to schedule, return false. */
> as it circles all the way back to where we pick the next job to schedule.)
> 
> So, I ported this patch to drm-misc-next and then to 6.6-rc7 and ran
> with it. All is good.

That's great!

> 
> I think this patch is ready to go in after addressing the cosmetic fixes
> to comments and clarity--if you also feel the same way.

Yes, as mentioned in my previous reply, gonna address them soon in a V4.

- Danilo

> 
> Thanks! Great job!
> 
> Regards,
> Luben
> 
> 
>>
>>> +
>>> +	if (sched->ops->update_job_credits) {
>>> +		u32 credits = sched->ops->update_job_credits(s_job);
>>> +
>>> +		if (credits)
>>> +			s_job->credits = credits;
>>> +	}
>>> +
>>> +	WARN_ON(s_job->credits > sched->credit_limit);
>>> +
>>> +	return drm_sched_available_credits(sched) >= s_job->credits;
>>> +}
>>> +
>>>   static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>>>   							    const struct rb_node *b)
>>>   {
>>> @@ -186,12 +250,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>   /**
>>>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>    *
>>> + * @sched: the gpu scheduler
>>>    * @rq: scheduler run queue to check.
>>>    *
>>>    * Try to find a ready entity, returns NULL if none found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> +			      struct drm_sched_rq *rq)
>>>   {
>>>   	struct drm_sched_entity *entity;
>>>   
>>> @@ -201,6 +267,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   	if (entity) {
>>>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>   			if (drm_sched_entity_is_ready(entity)) {
>>> +				/* If we can't queue yet, preserve the current
>>> +				 * entity in terms of fairness.
>>> +				 */
>>> +				if (!drm_sched_can_queue(sched, entity))
>>> +					goto out;
>>> +
>>>   				rq->current_entity = entity;
>>>   				reinit_completion(&entity->entity_idle);
>>>   				spin_unlock(&rq->lock);
>>> @@ -210,8 +282,13 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   	}
>>>   
>>>   	list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>>   		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity))
>>> +				goto out;
>>> +
>>>   			rq->current_entity = entity;
>>>   			reinit_completion(&entity->entity_idle);
>>>   			spin_unlock(&rq->lock);
>>> @@ -222,20 +299,22 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   			break;
>>>   	}
>>>   
>>> +out:
>>>   	spin_unlock(&rq->lock);
>>> -
>>>   	return NULL;
>>>   }
>>>   
>>>   /**
>>>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>    *
>>> + * @sched: the gpu scheduler
>>>    * @rq: scheduler run queue to check.
>>>    *
>>>    * Find oldest waiting ready entity, returns NULL if none found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_rq *rq)
>>>   {
>>>   	struct rb_node *rb;
>>>   
>>> @@ -245,6 +324,15 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>   
>>>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>   		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, don't pick another entity
>>> +			 * which's job might fit and wait until we got space for
>>
>> 			 * whose job might not fit and wait until we get space for
>> Three fixes:                 ^              ^                         ^
>>
>>> +			 * this one in terms of fairness.
>>
>>                           * this one in terms of credit availability.
>>
>> It's not the "fairness", it's the number of credits we check for in drm_sched_can_queue() below.
>>
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				rb = NULL;
>>> +				break;
>>> +			}
>>> +
>>
>> That's good.
>>
>>>   			rq->current_entity = entity;
>>>   			reinit_completion(&entity->entity_idle);
>>>   			break;
>>> @@ -266,18 +354,6 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>   		queue_work(sched->submit_wq, &sched->work_run_job);
>>>   }
>>>   
>>> -/**
>>> - * drm_sched_can_queue -- Can we queue more to the hardware?
>>> - * @sched: scheduler instance
>>> - *
>>> - * Return true if we can push more jobs to the hw, otherwise false.
>>> - */
>>> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>> -{
>>> -	return atomic_read(&sched->hw_rq_count) <
>>> -		sched->hw_submission_limit;
>>> -}
>>> -
>>>   /**
>>>    * drm_sched_select_entity - Select next entity to process
>>>    *
>>> @@ -291,14 +367,11 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>   	struct drm_sched_entity *entity;
>>>   	int i;
>>>   
>>> -	if (!drm_sched_can_queue(sched))
>>> -		return NULL;
>>> -
>>>   	/* Kernel run queue has higher priority than normal run queue*/
>>>   	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>   		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
>>> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
>>> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
>>> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
>>> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>>>   		if (entity)
>>>   			break;
>>>   	}
>>> @@ -353,7 +426,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>   	struct drm_sched_fence *s_fence = s_job->s_fence;
>>>   	struct drm_gpu_scheduler *sched = s_fence->sched;
>>>   
>>> -	atomic_dec(&sched->hw_rq_count);
>>> +	atomic_sub(s_job->credits, &sched->credit_count);
>>>   	atomic_dec(sched->score);
>>>   
>>>   	trace_drm_sched_process_job(s_fence);
>>> @@ -576,7 +649,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>   					      &s_job->cb)) {
>>>   			dma_fence_put(s_job->s_fence->parent);
>>>   			s_job->s_fence->parent = NULL;
>>> -			atomic_dec(&sched->hw_rq_count);
>>> +			atomic_sub(s_job->credits, &sched->credit_count);
>>>   		} else {
>>>   			/*
>>>   			 * remove job from pending_list.
>>> @@ -637,7 +710,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>   	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>   		struct dma_fence *fence = s_job->s_fence->parent;
>>>   
>>> -		atomic_inc(&sched->hw_rq_count);
>>> +		atomic_add(s_job->credits, &sched->credit_count);
>>>   
>>>   		if (!full_recovery)
>>>   			continue;
>>> @@ -718,6 +791,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>    * drm_sched_job_init - init a scheduler job
>>>    * @job: scheduler job to init
>>>    * @entity: scheduler entity to use
>>> + * @credits: the number of credits this job contributes to the schedulers
>>> + * credit limit
>>>    * @owner: job owner for debugging
>>>    *
>>>    * Refer to drm_sched_entity_push_job() documentation
>>> @@ -735,7 +810,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>    */
>>>   int drm_sched_job_init(struct drm_sched_job *job,
>>>   		       struct drm_sched_entity *entity,
>>> -		       void *owner)
>>> +		       u32 credits, void *owner)
>>>   {
>>>   	if (!entity->rq) {
>>>   		/* This will most likely be followed by missing frames
>>> @@ -752,6 +827,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>   		return -ENOMEM;
>>>   
>>>   	INIT_LIST_HEAD(&job->list);
>>> +	job->credits = credits ? credits : 1;
>>>   
>>>   	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>>>   
>>> @@ -961,12 +1037,14 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>   /**
>>>    * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>>>    * @sched: scheduler instance
>>> + * @entity: the scheduler entity
>>>    *
>>>    * Wake up the scheduler if we can queue jobs.
>>>    */
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>>> +				   struct drm_sched_entity *entity)
>>>   {
>>> -	if (drm_sched_can_queue(sched))
>>> +	if (drm_sched_can_queue(sched, entity))
>>>   		drm_sched_run_job_queue(sched);
>>>   }
>>>   
>>> @@ -1104,7 +1182,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>   
>>>   	s_fence = sched_job->s_fence;
>>>   
>>> -	atomic_inc(&sched->hw_rq_count);
>>> +	atomic_add(sched_job->credits, &sched->credit_count);
>>>   	drm_sched_job_begin(sched_job);
>>>   
>>>   	trace_drm_run_job(sched_job, entity);
>>> @@ -1139,7 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>    * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
>>>    *	       allocated and used
>>>    * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
>>> - * @hw_submission: number of hw submissions that can be in flight
>>> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>>>    * @hang_limit: number of times to allow a job to hang before dropping it
>>>    * @timeout: timeout value in jiffies for the scheduler
>>>    * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
>>> @@ -1153,14 +1231,14 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>   		   const struct drm_sched_backend_ops *ops,
>>>   		   struct workqueue_struct *submit_wq,
>>> -		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
>>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>>   		   long timeout, struct workqueue_struct *timeout_wq,
>>>   		   atomic_t *score, const char *name, struct device *dev)
>>>   {
>>>   	int i, ret;
>>>   
>>>   	sched->ops = ops;
>>> -	sched->hw_submission_limit = hw_submission;
>>> +	sched->credit_limit = credit_limit;
>>>   	sched->name = name;
>>>   	if (submit_wq) {
>>>   		sched->submit_wq = submit_wq;
>>> @@ -1211,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>   	init_waitqueue_head(&sched->job_scheduled);
>>>   	INIT_LIST_HEAD(&sched->pending_list);
>>>   	spin_lock_init(&sched->job_list_lock);
>>> -	atomic_set(&sched->hw_rq_count, 0);
>>> +	atomic_set(&sched->credit_count, 0);
>>>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>   	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>>   	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
>>> index 2e94ce788c71..8479e5302f7b 100644
>>> --- a/drivers/gpu/drm/v3d/v3d_gem.c
>>> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
>>> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>>>   	job->free = free;
>>>   
>>>   	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
>>> -				 v3d_priv);
>>> +				 1, v3d_priv);
>>>   	if (ret)
>>>   		goto fail;
>>>   
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index e5a6166eb152..a911b3f5917e 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>>>    * @sched: the scheduler instance on which this job is scheduled.
>>>    * @s_fence: contains the fences for the scheduling of job.
>>>    * @finish_cb: the callback for the finished fence.
>>> + * @credits: the number of credits this job contributes to the scheduler
>>>    * @work: Helper to reschdeule job kill to different context.
>>>    * @id: a unique id assigned to each job scheduled on the scheduler.
>>>    * @karma: increment on every hang caused by this job. If this exceeds the hang
>>> @@ -340,6 +341,8 @@ struct drm_sched_job {
>>>   	struct drm_gpu_scheduler	*sched;
>>>   	struct drm_sched_fence		*s_fence;
>>>   
>>> +	u32				credits;
>>> +
>>>   	/*
>>>   	 * work is used only after finish_cb has been used and will not be
>>>   	 * accessed anymore.
>>> @@ -463,13 +466,29 @@ struct drm_sched_backend_ops {
>>>            * and it's time to clean it up.
>>>   	 */
>>>   	void (*free_job)(struct drm_sched_job *sched_job);
>>> +
>>> +	/**
>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>> +	 * job for execution.
>>> +	 *
>>
>> This whole paragraph isn't very clear of what update_job_credits does.
>> Perhaps a simple and straightforward name would be more clear:
>> 	get_job_credits: Returns the number of credits this job would
>> 		take.
>>
>>> +	 * Drivers may use this to update the job's submission credits, which is
>>> +	 * useful to e.g. deduct the number of native fences which have been
>>> +	 * signaled meanwhile.
>>> +	 *
>>> +	 * The callback must either return the new number of submission credits
>>> +	 * for the given job, or zero if no update is required.
>>
>> The word "update" is confusing here and it implies that DRM should keep track
>> of the previous value this function had returned.
>>
>> Perhaps we can just say:
>> 	* This callback returns the number of credits this job would take if pushed
>> 	* to the driver/hardware. It returns 0, if this job would take no credits.
>>
>> Else, driver writers would have to deduce what to return here by looking at
>> drm_sched_can_queue() which effectively does:
>> 	s_job->credits = sched->ops->update_job_credits(s_job).
>>
>>> +	 *
>>> +	 * This callback is optional.
>>> +	 */
>>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>>>   };
>>>   
>>
>> The rest is good--thanks for the changes!
>>
>>>   /**
>>>    * struct drm_gpu_scheduler - scheduler instance-specific data
>>>    *
>>>    * @ops: backend operations provided by the driver.
>>> - * @hw_submission_limit: the max size of the hardware queue.
>>> + * @credit_limit: the credit limit of this scheduler
>>> + * @credit_count: the current credit count of this scheduler
>>>    * @timeout: the time after which a job is removed from the scheduler.
>>>    * @name: name of the ring for which this scheduler is being used.
>>>    * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
>>> @@ -478,7 +497,6 @@ struct drm_sched_backend_ops {
>>>    * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>>>    *                 waits on this wait queue until all the scheduled jobs are
>>>    *                 finished.
>>> - * @hw_rq_count: the number of jobs currently in the hardware queue.
>>>    * @job_id_count: used to assign unique id to the each job.
>>>    * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>>    * @timeout_wq: workqueue used to queue @work_tdr
>>> @@ -502,13 +520,13 @@ struct drm_sched_backend_ops {
>>>    */
>>>   struct drm_gpu_scheduler {
>>>   	const struct drm_sched_backend_ops	*ops;
>>> -	uint32_t			hw_submission_limit;
>>> +	u32				credit_limit;
>>> +	atomic_t			credit_count;
>>>   	long				timeout;
>>>   	const char			*name;
>>>   	u32                             num_rqs;
>>>   	struct drm_sched_rq             **sched_rq;
>>>   	wait_queue_head_t		job_scheduled;
>>> -	atomic_t			hw_rq_count;
>>>   	atomic64_t			job_id_count;
>>>   	struct workqueue_struct		*submit_wq;
>>>   	struct workqueue_struct		*timeout_wq;
>>> @@ -530,14 +548,14 @@ struct drm_gpu_scheduler {
>>>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>   		   const struct drm_sched_backend_ops *ops,
>>>   		   struct workqueue_struct *submit_wq,
>>> -		   u32 num_rqs, uint32_t hw_submission, unsigned hang_limit,
>>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>>   		   long timeout, struct workqueue_struct *timeout_wq,
>>>   		   atomic_t *score, const char *name, struct device *dev);
>>>   
>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched);
>>>   int drm_sched_job_init(struct drm_sched_job *job,
>>>   		       struct drm_sched_entity *entity,
>>> -		       void *owner);
>>> +		       u32 credits, void *owner);
>>>   void drm_sched_job_arm(struct drm_sched_job *job);
>>>   int drm_sched_job_add_dependency(struct drm_sched_job *job,
>>>   				 struct dma_fence *fence);
>>> @@ -559,7 +577,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>   
>>>   void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
>>>   void drm_sched_job_cleanup(struct drm_sched_job *job);
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>>> +				   struct drm_sched_entity *entity);
>>>   bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>>>   void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>>>   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>>
  
Luben Tuikov Oct. 27, 2023, 2:59 p.m. UTC | #16
Hi Danilo,

On 2023-10-27 10:45, Danilo Krummrich wrote:
> Hi Luben,
> 
> On 10/26/23 23:13, Luben Tuikov wrote:
>> On 2023-10-26 12:13, Danilo Krummrich wrote:
>>> Currently, job flow control is implemented simply by limiting the number
>>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>>> limit that corresponds to the number of jobs which can be sent to the
>>> hardware.
>>>
>>> This implies that for each job, drivers need to account for the maximum
>>> job size possible in order to not overflow the ring buffer.
>>>
>>> However, there are drivers, such as Nouveau, where the job size has a
>>> rather large range. For such drivers it can easily happen that job
>>> submissions not even filling the ring by 1% can block subsequent
>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>
>>> In order to overcome this issue, allow for tracking the actual job size
>>> instead of the number of jobs. Therefore, add a field to track a job's
>>> credit count, which represents the number of credits a job contributes
>>> to the scheduler's credit limit.
>>>
>>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>>> ---
>>> Changes in V2:
>>> ==============
>>>    - fixed up influence on scheduling fairness due to consideration of a job's
>>>      size
>>>      - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>>        queue a job from it due to size limitations, just give up and go to sleep
>>>        until woken up due to a pending job finishing, rather than continue to try
>>>        other entities.
>>>    - added a callback to dynamically update a job's credits (Boris)
>>>    - renamed 'units' to 'credits'
>>>    - fixed commit message and comments as requested by Luben
>>>
>>> Changes in V3:
>>> ==============
>>>    - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>>>    - move up drm_sched_can_queue() instead of adding a forward declaration
>>>      (Boris)
>>>    - add a drm_sched_available_credits() helper (Boris)
>>>    - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>>>    - re-phrase a few comments and fix a typo (Luben)
>>>    - change naming of all structures credit fields and function parameters to the
>>>      following scheme
>>>      - drm_sched_job::credits
>>>      - drm_gpu_scheduler::credit_limit
>>>      - drm_gpu_scheduler::credit_count
>>>      - drm_sched_init(..., u32 credit_limit, ...)
>>>      - drm_sched_job_init(..., u32 credits, ...)
>>>    - add proper documentation for the scheduler's job-flow control mechanism
>>>
>>> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
>>> provides a branch based on drm-misc-next, with the named series and this patch
>>> on top of it.
>>>
>>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
>>> ---
>>>   Documentation/gpu/drm-mm.rst                  |   6 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>>>   drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>>   drivers/gpu/drm/lima/lima_sched.c             |   2 +-
>>>   drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
>>>   drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
>>>   drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
>>>   .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>>   drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c        | 142 ++++++++++++++----
>>>   drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
>>>   include/drm/gpu_scheduler.h                   |  33 +++-
>>>   12 files changed, 152 insertions(+), 49 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>>> index 602010cb6894..acc5901ac840 100644
>>> --- a/Documentation/gpu/drm-mm.rst
>>> +++ b/Documentation/gpu/drm-mm.rst
>>> @@ -552,6 +552,12 @@ Overview
>>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>      :doc: Overview
>>>   
>>> +Flow Control
>>> +------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Flow Control
>>> +
>>>   Scheduler Function References
>>>   -----------------------------
>>>   
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 1f357198533f..62bb7fc7448a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>   	if (!entity)
>>>   		return 0;
>>>   
>>> -	return drm_sched_job_init(&(*job)->base, entity, owner);
>>> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>>>   }
>>>   
>>>   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> index 2416c526f9b0..3d0f8d182506 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>   
>>>   	ret = drm_sched_job_init(&submit->sched_job,
>>>   				 &ctx->sched_entity[args->pipe],
>>> -				 submit->ctx);
>>> +				 1, submit->ctx);
>>>   	if (ret)
>>>   		goto err_submit_put;
>>>   
>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>>> index 23a6276f1332..cdcb37ff62c3 100644
>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>>>   	for (i = 0; i < num_bos; i++)
>>>   		drm_gem_object_get(&bos[i]->base.base);
>>>   
>>> -	err = drm_sched_job_init(&task->base, &context->base, vm);
>>> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>>>   	if (err) {
>>>   		kfree(task->bos);
>>>   		return err;
>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> index 99744de6c05a..c002cabe7b9c 100644
>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>>   		return ERR_PTR(ret);
>>>   	}
>>>   
>>> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
>>> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>>>   	if (ret) {
>>>   		kfree(submit->hw_fence);
>>>   		kfree(submit);
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> index 7e64b5ef90fb..1b2cc3f2e1c7 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>>>   
>>>   	}
>>>   
>>> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
>>> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>>>   	if (ret)
>>>   		goto err_free_chains;
>>>   
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> index b834777b409b..54d1c19bea84 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>>>   
>>>   	ret = drm_sched_job_init(&job->base,
>>>   				 &file_priv->sched_entity[slot],
>>> -				 NULL);
>>> +				 1, NULL);
>>>   	if (ret)
>>>   		goto out_put_job;
>>>   
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>> index 3143ecaaff86..f8ed093b7356 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>>>   			   __assign_str(name, sched_job->sched->name);
>>>   			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>>>   			   __entry->hw_job_count = atomic_read(
>>> -				   &sched_job->sched->hw_rq_count);
>>> +				   &sched_job->sched->credit_count);
>>>   			   ),
>>>   	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>>>   		      __entry->entity, __entry->id,
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 409e4256f6e7..b79e0672b94f 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>>   		container_of(cb, struct drm_sched_entity, cb);
>>>   
>>>   	drm_sched_entity_clear_dep(f, cb);
>>> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>   }
>>>   
>>>   /**
>>> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>   		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>   			drm_sched_rq_update_fifo(entity, submit_ts);
>>>   
>>> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>   	}
>>>   }
>>>   EXPORT_SYMBOL(drm_sched_entity_push_job);
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 246213963928..3cc3dc0091fc 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -48,6 +48,30 @@
>>>    * through the jobs entity pointer.
>>>    */
>>>   
>>> +/**
>>> + * DOC: Flow Control
>>> + *
>>> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
>>> + * in which the jobs fetched from scheduler entities are executed.
>>> + *
>>> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
>>> + * credit limit representing the capacity of this scheduler and a credit count;
>>> + * every &drm_sched_job carries a driver specified number of credits.
>>> + *
>>> + * Once a job is executed (but not yet finished), the job's credits contribute
>>> + * to the scheduler's credit count until the job is finished. If by executing
>>> + * one more job the scheduler's credit count would exceed the scheduler's
>>> + * credit limit, the job won't be executed. Instead, the scheduler will wait
>>> + * until the credit count has decreased enough to not overflow its credit limit.
>>> + * This implies waiting for previously executed jobs.
>>> + *
>>> + * Optionally, drivers can register a callback (update_job_credits) provided by
>>
>> "can" --> "may".
>>
>>> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
>>> + * scheduler will execute this callback every time the scheduler considers a job
>>
>> No future tense, "will execute" --> "executes", because when this patch lands,
>> it all becomes present reality. See below for a reworded paragraph.
>>
>>> + * for execution and subsequently checks whether the job fits the scheduler's
>>> + * credit limit.
>>
>> This here is a good explanation of what update_job_credits() does, but the one
>> where this callback is defined in the scheduler ops is not very clear (see further down
>> into the patch).
>>
>> I think we shouldn't use the word "update" as we don't really have any "update" code,
>> a la,
>> 	if old value != new value, then
>> 		old value = new value # update it
>> in the code using this new sched op.
>>
>> Perhaps it should be a "get_job_credits()" as this is how this function is used
>> in drm_sched_can_queue().
>>
>> Perhaps this adds some clarity:
>>
>> * Optionally, drivers may register a sched ops callback, get_job_credits(), which
>> * returns the number of credits the job passed as an argument would take when
>> * submitted to driver/hardware. The scheduler executes this callback every time it
>> * considers a job for execution in order to check if the job's credits fit
>> * into the scheduler's credit limit.
>>
>> (FWIW, it may be doing some "update" into the driver/firmware/hardware, but DRM
>> has no visibility or care for that, since DRM is using it simply as "tell me
>> the number of credits this job would take.")
>>
>>> + */
>>> +
>>>   #include <linux/wait.h>
>>>   #include <linux/sched.h>
>>>   #include <linux/completion.h>
>>> @@ -75,6 +99,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>>>   MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>>>   module_param_named(sched_policy, drm_sched_policy, int, 0444);
>>>   
>>> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	u32 credits;
>>> +
>>> +	WARN_ON(check_sub_overflow(sched->credit_limit,
>>> +				   atomic_read(&sched->credit_count),
>>> +				   &credits));
>>> +
>>> +	return credits;
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_can_queue -- Can we queue more to the hardware?
>>> + * @sched: scheduler instance
>>> + * @entity: the scheduler entity
>>> + *
>>> + * Return true if we can push at least one more job from @entity, false
>>> + * otherwise.
>>> + */
>>> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_entity *entity)
>>> +{
>>> +	struct drm_sched_job *s_job;
>>> +
>>> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>>> +	if (!s_job)
>>> +		return false;
>>
>> We don't have this gating logic at the moment, and I don't think
>> we need it.
>>
>> At the current state of the code, we don't care if there are jobs
>> in the incoming queue or not.
>>
>> The only thing we should check here for is the credit availability
>> for _this job_, as you do in the code below.
>>
>>> +
>>> +	if (sched->ops->update_job_credits) {
>>> +		u32 credits = sched->ops->update_job_credits(s_job);
>>> +
>>> +		if (credits)
>>> +			s_job->credits = credits;
>>> +	}
>>> +
>>> +	WARN_ON(s_job->credits > sched->credit_limit);
>>> +
>>> +	return drm_sched_available_credits(sched) >= s_job->credits;
>>> +}
>>> +
>>>   static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>>>   							    const struct rb_node *b)
>>>   {
>>> @@ -186,12 +250,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>   /**
>>>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>    *
>>> + * @sched: the gpu scheduler
>>>    * @rq: scheduler run queue to check.
>>>    *
>>>    * Try to find a ready entity, returns NULL if none found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> +			      struct drm_sched_rq *rq)
>>>   {
>>>   	struct drm_sched_entity *entity;
>>>   
>>> @@ -201,6 +267,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   	if (entity) {
>>>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>   			if (drm_sched_entity_is_ready(entity)) {
>>> +				/* If we can't queue yet, preserve the current
>>> +				 * entity in terms of fairness.
>>> +				 */
>>> +				if (!drm_sched_can_queue(sched, entity))
>>> +					goto out;
>>> +
>>>   				rq->current_entity = entity;
>>>   				reinit_completion(&entity->entity_idle);
>>>   				spin_unlock(&rq->lock);
>>> @@ -210,8 +282,13 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   	}
>>>   
>>>   	list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>>   		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, preserve the current entity in
>>> +			 * terms of fairness.
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity))
>>> +				goto out;
>>> +
>>>   			rq->current_entity = entity;
>>>   			reinit_completion(&entity->entity_idle);
>>>   			spin_unlock(&rq->lock);
>>> @@ -222,20 +299,22 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   			break;
>>>   	}
>>>   
>>> +out:
>>>   	spin_unlock(&rq->lock);
>>> -
>>>   	return NULL;
>>>   }
>>>   
>>>   /**
>>>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>    *
>>> + * @sched: the gpu scheduler
>>>    * @rq: scheduler run queue to check.
>>>    *
>>>    * Find oldest waiting ready entity, returns NULL if none found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +				struct drm_sched_rq *rq)
>>>   {
>>>   	struct rb_node *rb;
>>>   
>>> @@ -245,6 +324,15 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>   
>>>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>   		if (drm_sched_entity_is_ready(entity)) {
>>> +			/* If we can't queue yet, don't pick another entity
>>> +			 * which's job might fit and wait until we got space for
>>
>> 			 * whose job might not fit and wait until we get space for
>> Three fixes:                 ^              ^                         ^
>>
>>> +			 * this one in terms of fairness.
>>
>>                           * this one in terms of credit availability.
>>
>> It's not the "fairness", it's the number of credits we check for in drm_sched_can_queue() below.
>>
>>> +			 */
>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>> +				rb = NULL;
>>> +				break;
>>> +			}
>>> +
>>
>> That's good.
>>
>>>   			rq->current_entity = entity;
>>>   			reinit_completion(&entity->entity_idle);
>>>   			break;
>>> @@ -266,18 +354,6 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>   		queue_work(sched->submit_wq, &sched->work_run_job);
>>>   }
>>>   
>>> -/**
>>> - * drm_sched_can_queue -- Can we queue more to the hardware?
>>> - * @sched: scheduler instance
>>> - *
>>> - * Return true if we can push more jobs to the hw, otherwise false.
>>> - */
>>> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>> -{
>>> -	return atomic_read(&sched->hw_rq_count) <
>>> -		sched->hw_submission_limit;
>>> -}
>>> -
>>>   /**
>>>    * drm_sched_select_entity - Select next entity to process
>>>    *
>>> @@ -291,14 +367,11 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>   	struct drm_sched_entity *entity;
>>>   	int i;
>>>   
>>> -	if (!drm_sched_can_queue(sched))
>>> -		return NULL;
>>> -
>>>   	/* Kernel run queue has higher priority than normal run queue*/
>>>   	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>   		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
>>> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
>>> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
>>> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
>>> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>>>   		if (entity)
>>>   			break;
>>>   	}
>>> @@ -353,7 +426,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>   	struct drm_sched_fence *s_fence = s_job->s_fence;
>>>   	struct drm_gpu_scheduler *sched = s_fence->sched;
>>>   
>>> -	atomic_dec(&sched->hw_rq_count);
>>> +	atomic_sub(s_job->credits, &sched->credit_count);
>>>   	atomic_dec(sched->score);
>>>   
>>>   	trace_drm_sched_process_job(s_fence);
>>> @@ -576,7 +649,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>   					      &s_job->cb)) {
>>>   			dma_fence_put(s_job->s_fence->parent);
>>>   			s_job->s_fence->parent = NULL;
>>> -			atomic_dec(&sched->hw_rq_count);
>>> +			atomic_sub(s_job->credits, &sched->credit_count);
>>>   		} else {
>>>   			/*
>>>   			 * remove job from pending_list.
>>> @@ -637,7 +710,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>   	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>   		struct dma_fence *fence = s_job->s_fence->parent;
>>>   
>>> -		atomic_inc(&sched->hw_rq_count);
>>> +		atomic_add(s_job->credits, &sched->credit_count);
>>>   
>>>   		if (!full_recovery)
>>>   			continue;
>>> @@ -718,6 +791,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>    * drm_sched_job_init - init a scheduler job
>>>    * @job: scheduler job to init
>>>    * @entity: scheduler entity to use
>>> + * @credits: the number of credits this job contributes to the schedulers
>>> + * credit limit
>>>    * @owner: job owner for debugging
>>>    *
>>>    * Refer to drm_sched_entity_push_job() documentation
>>> @@ -735,7 +810,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>    */
>>>   int drm_sched_job_init(struct drm_sched_job *job,
>>>   		       struct drm_sched_entity *entity,
>>> -		       void *owner)
>>> +		       u32 credits, void *owner)
>>>   {
>>>   	if (!entity->rq) {
>>>   		/* This will most likely be followed by missing frames
>>> @@ -752,6 +827,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>   		return -ENOMEM;
>>>   
>>>   	INIT_LIST_HEAD(&job->list);
>>> +	job->credits = credits ? credits : 1;
>>>   
>>>   	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>>>   
>>> @@ -961,12 +1037,14 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>   /**
>>>    * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>>>    * @sched: scheduler instance
>>> + * @entity: the scheduler entity
>>>    *
>>>    * Wake up the scheduler if we can queue jobs.
>>>    */
>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>>> +				   struct drm_sched_entity *entity)
>>>   {
>>> -	if (drm_sched_can_queue(sched))
>>> +	if (drm_sched_can_queue(sched, entity))
>>>   		drm_sched_run_job_queue(sched);
>>>   }
>>>   
>>> @@ -1104,7 +1182,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>   
>>>   	s_fence = sched_job->s_fence;
>>>   
>>> -	atomic_inc(&sched->hw_rq_count);
>>> +	atomic_add(sched_job->credits, &sched->credit_count);
>>>   	drm_sched_job_begin(sched_job);
>>>   
>>>   	trace_drm_run_job(sched_job, entity);
>>> @@ -1139,7 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>    * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
>>>    *	       allocated and used
>>>    * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
>>> - * @hw_submission: number of hw submissions that can be in flight
>>> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>>>    * @hang_limit: number of times to allow a job to hang before dropping it
>>>    * @timeout: timeout value in jiffies for the scheduler
>>>    * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
>>> @@ -1153,14 +1231,14 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>   int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>   		   const struct drm_sched_backend_ops *ops,
>>>   		   struct workqueue_struct *submit_wq,
>>> -		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
>>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>>   		   long timeout, struct workqueue_struct *timeout_wq,
>>>   		   atomic_t *score, const char *name, struct device *dev)
>>>   {
>>>   	int i, ret;
>>>   
>>>   	sched->ops = ops;
>>> -	sched->hw_submission_limit = hw_submission;
>>> +	sched->credit_limit = credit_limit;
>>>   	sched->name = name;
>>>   	if (submit_wq) {
>>>   		sched->submit_wq = submit_wq;
>>> @@ -1211,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>   	init_waitqueue_head(&sched->job_scheduled);
>>>   	INIT_LIST_HEAD(&sched->pending_list);
>>>   	spin_lock_init(&sched->job_list_lock);
>>> -	atomic_set(&sched->hw_rq_count, 0);
>>> +	atomic_set(&sched->credit_count, 0);
>>>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>   	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>>   	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
>>> index 2e94ce788c71..8479e5302f7b 100644
>>> --- a/drivers/gpu/drm/v3d/v3d_gem.c
>>> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
>>> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>>>   	job->free = free;
>>>   
>>>   	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
>>> -				 v3d_priv);
>>> +				 1, v3d_priv);
>>>   	if (ret)
>>>   		goto fail;
>>>   
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index e5a6166eb152..a911b3f5917e 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>>>    * @sched: the scheduler instance on which this job is scheduled.
>>>    * @s_fence: contains the fences for the scheduling of job.
>>>    * @finish_cb: the callback for the finished fence.
>>> + * @credits: the number of credits this job contributes to the scheduler
>>>    * @work: Helper to reschdeule job kill to different context.
>>>    * @id: a unique id assigned to each job scheduled on the scheduler.
>>>    * @karma: increment on every hang caused by this job. If this exceeds the hang
>>> @@ -340,6 +341,8 @@ struct drm_sched_job {
>>>   	struct drm_gpu_scheduler	*sched;
>>>   	struct drm_sched_fence		*s_fence;
>>>   
>>> +	u32				credits;
>>> +
>>>   	/*
>>>   	 * work is used only after finish_cb has been used and will not be
>>>   	 * accessed anymore.
>>> @@ -463,13 +466,29 @@ struct drm_sched_backend_ops {
>>>            * and it's time to clean it up.
>>>   	 */
>>>   	void (*free_job)(struct drm_sched_job *sched_job);
>>> +
>>> +	/**
>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>> +	 * job for execution.
>>> +	 *
>>
>> This whole paragraph isn't very clear about what update_job_credits does.
>> Perhaps a simple and straightforward name would be more clear:
>> 	get_job_credits: Returns the number of credits this job would
>> 		take.
> 
> I still very much prefer 'update' over 'get' for the following reasons.
> 
> 1. 'Update' describes very well what the driver should do when receiving this
>     callback. It should update the number of credits the job requires at the time
>     it receives the callback.
> 
> 2. 'get', as well as 'put', is typically used in a different context in the kernel
>     and hence this might create confusion.

That's fine--agreed.

Perhaps add a comment saying "Update from the value set by drm_sched_job_init()",
so that the reader of the code can instantly see what is being updated.
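
Just to illustrate what that update relative to drm_sched_job_init() could
look like on the driver side -- a made-up sketch, not taken from any existing
driver; foo_job, cmd_credits, fences[] and fence_count are invented for the
example, only container_of() and dma_fence_is_signaled() are real kernel
helpers:

	static u32 foo_update_job_credits(struct drm_sched_job *sched_job)
	{
		struct foo_job *job = container_of(sched_job, struct foo_job, base);
		u32 credits = job->cmd_credits; /* fixed cost of the ring commands */
		unsigned int i;

		/* Only count native fences which are still pending; waits for
		 * fences that signaled since drm_sched_job_init() are not
		 * emitted to the ring anymore and thus cost no credits.
		 */
		for (i = 0; i < job->fence_count; i++) {
			if (!dma_fence_is_signaled(job->fences[i]))
				credits++;
		}

		return credits;
	}

At drm_sched_job_init() time such a driver would pass the worst case
(cmd_credits + fence_count), and the callback can then only ever shrink that
value.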

> 
>>
>>> +	 * Drivers may use this to update the job's submission credits, which is
>>> +	 * useful to e.g. deduct the number of native fences which have been
>>> +	 * signaled meanwhile.
>>> +	 *
>>> +	 * The callback must either return the new number of submission credits
>>> +	 * for the given job, or zero if no update is required.
>>
>> The word "update" is confusing here and it implies that DRM should keep track
>> of the previous value this function had returned.
>>
>> Perhaps we can just say:
>> 	* This callback returns the number of credits this job would take if pushed
>> 	* to the driver/hardware. It returns 0, if this job would take no credits.
> 
> For this version of the patch this would be wrong; 0 really means that we keep
> job->credits as it is. However, I'd probably change this in V4 as requested by Boris,
> such that we accept 0 credits, but WARN() on it. Unless we consider it a valid
> use case to have jobs that are not flow-controlled at all.

I agree with your reasoning. Yeah, not sure what the _meaning_ of update_job_credits()
returning 0 is (from a previous non-zero value).

Perhaps just leave it up to the driver--they should know what they're doing and
just warn, but take the 0, and run with it. If drivers become clever with the 0,
we can remove the WARN down the road.

So, yeah, looks good! Thanks.
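
I.e. something along these lines in drm_sched_can_queue(), with the rest of
the function staying as it is -- just a rough sketch of the V4 idea from
above, not the actual patch:

	if (sched->ops->update_job_credits) {
		s_job->credits = sched->ops->update_job_credits(s_job);

		/* Zero credits means this job is not flow-controlled at all;
		 * warn about it, but take the value and run with it.
		 */
		WARN_ON(!s_job->credits);
	}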

> 
> All other comments look good, gonna address them in a V4.

Great!
  
Danilo Krummrich Oct. 27, 2023, 3:01 p.m. UTC | #17
On 10/27/23 16:59, Luben Tuikov wrote:
> Hi Danilo,
> 
> On 2023-10-27 10:45, Danilo Krummrich wrote:
>> Hi Luben,
>>
>> On 10/26/23 23:13, Luben Tuikov wrote:
>>> On 2023-10-26 12:13, Danilo Krummrich wrote:
>>>> Currently, job flow control is implemented simply by limiting the number
>>>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>>>> limit that corresponds to the number of jobs which can be sent to the
>>>> hardware.
>>>>
>>>> This implies that for each job, drivers need to account for the maximum
>>>> job size possible in order to not overflow the ring buffer.
>>>>
>>>> However, there are drivers, such as Nouveau, where the job size has a
>>>> rather large range. For such drivers it can easily happen that job
>>>> submissions not even filling the ring by 1% can block subsequent
>>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>>
>>>> In order to overcome this issue, allow for tracking the actual job size
>>>> instead of the number of jobs. Therefore, add a field to track a job's
>>>> credit count, which represents the number of credits a job contributes
>>>> to the scheduler's credit limit.
>>>>
>>>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>>>> ---
>>>> Changes in V2:
>>>> ==============
>>>>     - fixed up influence on scheduling fairness due to consideration of a job's
>>>>       size
>>>>       - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>>>         queue a job from it due to size limitations, just give up and go to sleep
>>>>         until woken up due to a pending job finishing, rather than continue to try
>>>>         other entities.
>>>>     - added a callback to dynamically update a job's credits (Boris)
>>>>     - renamed 'units' to 'credits'
>>>>     - fixed commit message and comments as requested by Luben
>>>>
>>>> Changes in V3:
>>>> ==============
>>>>     - rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
>>>>     - move up drm_sched_can_queue() instead of adding a forward declaration
>>>>       (Boris)
>>>>     - add a drm_sched_available_credits() helper (Boris)
>>>>     - adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
>>>>     - re-phrase a few comments and fix a typo (Luben)
>>>>     - change naming of all structures credit fields and function parameters to the
>>>>       following scheme
>>>>       - drm_sched_job::credits
>>>>       - drm_gpu_scheduler::credit_limit
>>>>       - drm_gpu_scheduler::credit_count
>>>>       - drm_sched_init(..., u32 credit_limit, ...)
>>>>       - drm_sched_job_init(..., u32 credits, ...)
>>>>     - add proper documentation for the scheduler's job-flow control mechanism
>>>>
>>>> This patch is based on V7 of the "DRM scheduler changes for Xe" series. [1]
>>>> provides a branch based on drm-misc-next, with the named series and this patch
>>>> on top of it.
>>>>
>>>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-credits/
>>>> ---
>>>>    Documentation/gpu/drm-mm.rst                  |   6 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>>>>    drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |   2 +-
>>>>    drivers/gpu/drm/lima/lima_sched.c             |   2 +-
>>>>    drivers/gpu/drm/msm/msm_gem_submit.c          |   2 +-
>>>>    drivers/gpu/drm/nouveau/nouveau_sched.c       |   2 +-
>>>>    drivers/gpu/drm/panfrost/panfrost_drv.c       |   2 +-
>>>>    .../gpu/drm/scheduler/gpu_scheduler_trace.h   |   2 +-
>>>>    drivers/gpu/drm/scheduler/sched_entity.c      |   4 +-
>>>>    drivers/gpu/drm/scheduler/sched_main.c        | 142 ++++++++++++++----
>>>>    drivers/gpu/drm/v3d/v3d_gem.c                 |   2 +-
>>>>    include/drm/gpu_scheduler.h                   |  33 +++-
>>>>    12 files changed, 152 insertions(+), 49 deletions(-)
>>>>
>>>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>>>> index 602010cb6894..acc5901ac840 100644
>>>> --- a/Documentation/gpu/drm-mm.rst
>>>> +++ b/Documentation/gpu/drm-mm.rst
>>>> @@ -552,6 +552,12 @@ Overview
>>>>    .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>>       :doc: Overview
>>>>    
>>>> +Flow Control
>>>> +------------
>>>> +
>>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>> +   :doc: Flow Control
>>>> +
>>>>    Scheduler Function References
>>>>    -----------------------------
>>>>    
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 1f357198533f..62bb7fc7448a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -115,7 +115,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>    	if (!entity)
>>>>    		return 0;
>>>>    
>>>> -	return drm_sched_job_init(&(*job)->base, entity, owner);
>>>> +	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
>>>>    }
>>>>    
>>>>    int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>>> index 2416c526f9b0..3d0f8d182506 100644
>>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
>>>> @@ -535,7 +535,7 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>>    
>>>>    	ret = drm_sched_job_init(&submit->sched_job,
>>>>    				 &ctx->sched_entity[args->pipe],
>>>> -				 submit->ctx);
>>>> +				 1, submit->ctx);
>>>>    	if (ret)
>>>>    		goto err_submit_put;
>>>>    
>>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>>>> index 23a6276f1332..cdcb37ff62c3 100644
>>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>>> @@ -123,7 +123,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>>>>    	for (i = 0; i < num_bos; i++)
>>>>    		drm_gem_object_get(&bos[i]->base.base);
>>>>    
>>>> -	err = drm_sched_job_init(&task->base, &context->base, vm);
>>>> +	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
>>>>    	if (err) {
>>>>    		kfree(task->bos);
>>>>    		return err;
>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>> index 99744de6c05a..c002cabe7b9c 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>> @@ -48,7 +48,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>>>    		return ERR_PTR(ret);
>>>>    	}
>>>>    
>>>> -	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
>>>> +	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
>>>>    	if (ret) {
>>>>    		kfree(submit->hw_fence);
>>>>    		kfree(submit);
>>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>>> index 7e64b5ef90fb..1b2cc3f2e1c7 100644
>>>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>>>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>>>> @@ -89,7 +89,7 @@ nouveau_job_init(struct nouveau_job *job,
>>>>    
>>>>    	}
>>>>    
>>>> -	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
>>>> +	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
>>>>    	if (ret)
>>>>    		goto err_free_chains;
>>>>    
>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>> index b834777b409b..54d1c19bea84 100644
>>>> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
>>>> @@ -274,7 +274,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
>>>>    
>>>>    	ret = drm_sched_job_init(&job->base,
>>>>    				 &file_priv->sched_entity[slot],
>>>> -				 NULL);
>>>> +				 1, NULL);
>>>>    	if (ret)
>>>>    		goto out_put_job;
>>>>    
>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>>> index 3143ecaaff86..f8ed093b7356 100644
>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
>>>> @@ -51,7 +51,7 @@ DECLARE_EVENT_CLASS(drm_sched_job,
>>>>    			   __assign_str(name, sched_job->sched->name);
>>>>    			   __entry->job_count = spsc_queue_count(&entity->job_queue);
>>>>    			   __entry->hw_job_count = atomic_read(
>>>> -				   &sched_job->sched->hw_rq_count);
>>>> +				   &sched_job->sched->credit_count);
>>>>    			   ),
>>>>    	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
>>>>    		      __entry->entity, __entry->id,
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index 409e4256f6e7..b79e0672b94f 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -370,7 +370,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>>>    		container_of(cb, struct drm_sched_entity, cb);
>>>>    
>>>>    	drm_sched_entity_clear_dep(f, cb);
>>>> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>>> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>>    }
>>>>    
>>>>    /**
>>>> @@ -602,7 +602,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>    		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>>    			drm_sched_rq_update_fifo(entity, submit_ts);
>>>>    
>>>> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
>>>> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>>>>    	}
>>>>    }
>>>>    EXPORT_SYMBOL(drm_sched_entity_push_job);
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 246213963928..3cc3dc0091fc 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -48,6 +48,30 @@
>>>>     * through the jobs entity pointer.
>>>>     */
>>>>    
>>>> +/**
>>>> + * DOC: Flow Control
>>>> + *
>>>> + * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
>>>> + * in which the jobs fetched from scheduler entities are executed.
>>>> + *
>>>> + * In this context the &drm_gpu_scheduler keeps track of a driver specified
>>>> + * credit limit representing the capacity of this scheduler and a credit count;
>>>> + * every &drm_sched_job carries a driver specified number of credits.
>>>> + *
>>>> + * Once a job is executed (but not yet finished), the job's credits contribute
>>>> + * to the scheduler's credit count until the job is finished. If by executing
>>>> + * one more job the scheduler's credit count would exceed the scheduler's
>>>> + * credit limit, the job won't be executed. Instead, the scheduler will wait
>>>> + * until the credit count has decreased enough to not overflow its credit limit.
>>>> + * This implies waiting for previously executed jobs.
>>>> + *
>>>> + * Optionally, drivers can register a callback (update_job_credits) provided by
>>>
>>> "can" --> "may".
>>>
>>>> + * struct drm_sched_backend_ops to update the job's credits dynamically. The
>>>> + * scheduler will execute this callback every time the scheduler considers a job
>>>
>>> No future tense, "will execute" --> "executes", because when this patch lands,
>>> it all becomes present reality. See below for a reworded paragraph.
>>>
>>>> + * for execution and subsequently checks whether the job fits the scheduler's
>>>> + * credit limit.
>>>
>>> This here is a good explanation of what update_job_credits() does, but the one
>>> where this callback is defined in the scheduler ops is not very clear (see further down
>>> into the patch).
>>>
>>> I think we shouldn't use the word "update" as we don't really have any "update" code,
>>> a la,
>>> 	if old value != new value, then
>>> 		old value = new value # update it
>>> in the code using this new sched op.
>>>
>>> Perhaps it should be a "get_job_credits()" as this is how this function is used
>>> in drm_sched_can_queue().
>>>
>>> Perhaps this adds some clarity:
>>>
>>> * Optionally, drivers may register a sched ops callback, get_job_credits(), which
>>> * returns the number of credits the job passed as an argument would take when
>>> * submitted to driver/hardware. The scheduler executes this callback every time it
>>> * considers a job for execution in order to check if the job's credits fit
>>> * into the scheduler's credit limit.
>>>
>>> (FWIW, it may be doing some "update" into the driver/firmware/hardware, but DRM
>>> has no visibility or care for that, since DRM is using it simply as "tell me
>>> the number of credits this job would take.")
>>>
>>>> + */
>>>> +
>>>>    #include <linux/wait.h>
>>>>    #include <linux/sched.h>
>>>>    #include <linux/completion.h>
>>>> @@ -75,6 +99,46 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>>>>    MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>>>>    module_param_named(sched_policy, drm_sched_policy, int, 0444);
>>>>    
>>>> +static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>>>> +{
>>>> +	u32 credits;
>>>> +
>>>> +	WARN_ON(check_sub_overflow(sched->credit_limit,
>>>> +				   atomic_read(&sched->credit_count),
>>>> +				   &credits));
>>>> +
>>>> +	return credits;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_sched_can_queue -- Can we queue more to the hardware?
>>>> + * @sched: scheduler instance
>>>> + * @entity: the scheduler entity
>>>> + *
>>>> + * Return true if we can push at least one more job from @entity, false
>>>> + * otherwise.
>>>> + */
>>>> +static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>>>> +				struct drm_sched_entity *entity)
>>>> +{
>>>> +	struct drm_sched_job *s_job;
>>>> +
>>>> +	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>>>> +	if (!s_job)
>>>> +		return false;
>>>
>>> We don't have this gating logic at the moment, and I don't think
>>> we need it.
>>>
>>> At the current state of the code, we don't care if there are jobs
>>> in the incoming queue or not.
>>>
>>> The only thing we should check here for is the credit availability
>>> for _this job_, as you do in the code below.
>>>
>>>> +
>>>> +	if (sched->ops->update_job_credits) {
>>>> +		u32 credits = sched->ops->update_job_credits(s_job);
>>>> +
>>>> +		if (credits)
>>>> +			s_job->credits = credits;
>>>> +	}
>>>> +
>>>> +	WARN_ON(s_job->credits > sched->credit_limit);
>>>> +
>>>> +	return drm_sched_available_credits(sched) >= s_job->credits;
>>>> +}
>>>> +
>>>>    static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
>>>>    							    const struct rb_node *b)
>>>>    {
>>>> @@ -186,12 +250,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>    /**
>>>>     * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>>     *
>>>> + * @sched: the gpu scheduler
>>>>     * @rq: scheduler run queue to check.
>>>>     *
>>>>     * Try to find a ready entity, returns NULL if none found.
>>>>     */
>>>>    static struct drm_sched_entity *
>>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>>> +			      struct drm_sched_rq *rq)
>>>>    {
>>>>    	struct drm_sched_entity *entity;
>>>>    
>>>> @@ -201,6 +267,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>    	if (entity) {
>>>>    		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>>    			if (drm_sched_entity_is_ready(entity)) {
>>>> +				/* If we can't queue yet, preserve the current
>>>> +				 * entity in terms of fairness.
>>>> +				 */
>>>> +				if (!drm_sched_can_queue(sched, entity))
>>>> +					goto out;
>>>> +
>>>>    				rq->current_entity = entity;
>>>>    				reinit_completion(&entity->entity_idle);
>>>>    				spin_unlock(&rq->lock);
>>>> @@ -210,8 +282,13 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>    	}
>>>>    
>>>>    	list_for_each_entry(entity, &rq->entities, list) {
>>>> -
>>>>    		if (drm_sched_entity_is_ready(entity)) {
>>>> +			/* If we can't queue yet, preserve the current entity in
>>>> +			 * terms of fairness.
>>>> +			 */
>>>> +			if (!drm_sched_can_queue(sched, entity))
>>>> +				goto out;
>>>> +
>>>>    			rq->current_entity = entity;
>>>>    			reinit_completion(&entity->entity_idle);
>>>>    			spin_unlock(&rq->lock);
>>>> @@ -222,20 +299,22 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>    			break;
>>>>    	}
>>>>    
>>>> +out:
>>>>    	spin_unlock(&rq->lock);
>>>> -
>>>>    	return NULL;
>>>>    }
>>>>    
>>>>    /**
>>>>     * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>>     *
>>>> + * @sched: the gpu scheduler
>>>>     * @rq: scheduler run queue to check.
>>>>     *
>>>>     * Find oldest waiting ready entity, returns NULL if none found.
>>>>     */
>>>>    static struct drm_sched_entity *
>>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>>> +				struct drm_sched_rq *rq)
>>>>    {
>>>>    	struct rb_node *rb;
>>>>    
>>>> @@ -245,6 +324,15 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>    
>>>>    		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>>    		if (drm_sched_entity_is_ready(entity)) {
>>>> +			/* If we can't queue yet, don't pick another entity
>>>> +			 * which's job might fit and wait until we got space for
>>>
>>> 			 * whose job might not fit and wait until we get space for
>>> Three fixes:                 ^              ^                         ^
>>>
>>>> +			 * this one in terms of fairness.
>>>
>>>                            * this one in terms of credit availability.
>>>
>>> It's not the "fairness", it's the number of credits we check for in drm_sched_can_queue() below.
>>>
>>>> +			 */
>>>> +			if (!drm_sched_can_queue(sched, entity)) {
>>>> +				rb = NULL;
>>>> +				break;
>>>> +			}
>>>> +
>>>
>>> That's good.
>>>
>>>>    			rq->current_entity = entity;
>>>>    			reinit_completion(&entity->entity_idle);
>>>>    			break;
>>>> @@ -266,18 +354,6 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>    		queue_work(sched->submit_wq, &sched->work_run_job);
>>>>    }
>>>>    
>>>> -/**
>>>> - * drm_sched_can_queue -- Can we queue more to the hardware?
>>>> - * @sched: scheduler instance
>>>> - *
>>>> - * Return true if we can push more jobs to the hw, otherwise false.
>>>> - */
>>>> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>> -{
>>>> -	return atomic_read(&sched->hw_rq_count) <
>>>> -		sched->hw_submission_limit;
>>>> -}
>>>> -
>>>>    /**
>>>>     * drm_sched_select_entity - Select next entity to process
>>>>     *
>>>> @@ -291,14 +367,11 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>    	struct drm_sched_entity *entity;
>>>>    	int i;
>>>>    
>>>> -	if (!drm_sched_can_queue(sched))
>>>> -		return NULL;
>>>> -
>>>>    	/* Kernel run queue has higher priority than normal run queue*/
>>>>    	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>    		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
>>>> -			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
>>>> -			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
>>>> +			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
>>>> +			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
>>>>    		if (entity)
>>>>    			break;
>>>>    	}
>>>> @@ -353,7 +426,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>>    	struct drm_sched_fence *s_fence = s_job->s_fence;
>>>>    	struct drm_gpu_scheduler *sched = s_fence->sched;
>>>>    
>>>> -	atomic_dec(&sched->hw_rq_count);
>>>> +	atomic_sub(s_job->credits, &sched->credit_count);
>>>>    	atomic_dec(sched->score);
>>>>    
>>>>    	trace_drm_sched_process_job(s_fence);
>>>> @@ -576,7 +649,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>    					      &s_job->cb)) {
>>>>    			dma_fence_put(s_job->s_fence->parent);
>>>>    			s_job->s_fence->parent = NULL;
>>>> -			atomic_dec(&sched->hw_rq_count);
>>>> +			atomic_sub(s_job->credits, &sched->credit_count);
>>>>    		} else {
>>>>    			/*
>>>>    			 * remove job from pending_list.
>>>> @@ -637,7 +710,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>    	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>>    		struct dma_fence *fence = s_job->s_fence->parent;
>>>>    
>>>> -		atomic_inc(&sched->hw_rq_count);
>>>> +		atomic_add(s_job->credits, &sched->credit_count);
>>>>    
>>>>    		if (!full_recovery)
>>>>    			continue;
>>>> @@ -718,6 +791,8 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>>     * drm_sched_job_init - init a scheduler job
>>>>     * @job: scheduler job to init
>>>>     * @entity: scheduler entity to use
>>>> + * @credits: the number of credits this job contributes to the schedulers
>>>> + * credit limit
>>>>     * @owner: job owner for debugging
>>>>     *
>>>>     * Refer to drm_sched_entity_push_job() documentation
>>>> @@ -735,7 +810,7 @@ EXPORT_SYMBOL(drm_sched_resubmit_jobs);
>>>>     */
>>>>    int drm_sched_job_init(struct drm_sched_job *job,
>>>>    		       struct drm_sched_entity *entity,
>>>> -		       void *owner)
>>>> +		       u32 credits, void *owner)
>>>>    {
>>>>    	if (!entity->rq) {
>>>>    		/* This will most likely be followed by missing frames
>>>> @@ -752,6 +827,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>>    		return -ENOMEM;
>>>>    
>>>>    	INIT_LIST_HEAD(&job->list);
>>>> +	job->credits = credits ? credits : 1;
>>>>    
>>>>    	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
>>>>    
>>>> @@ -961,12 +1037,14 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>>    /**
>>>>     * drm_sched_wakeup_if_can_queue - Wake up the scheduler
>>>>     * @sched: scheduler instance
>>>> + * @entity: the scheduler entity
>>>>     *
>>>>     * Wake up the scheduler if we can queue jobs.
>>>>     */
>>>> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>>> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
>>>> +				   struct drm_sched_entity *entity)
>>>>    {
>>>> -	if (drm_sched_can_queue(sched))
>>>> +	if (drm_sched_can_queue(sched, entity))
>>>>    		drm_sched_run_job_queue(sched);
>>>>    }
>>>>    
>>>> @@ -1104,7 +1182,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>>    
>>>>    	s_fence = sched_job->s_fence;
>>>>    
>>>> -	atomic_inc(&sched->hw_rq_count);
>>>> +	atomic_add(sched_job->credits, &sched->credit_count);
>>>>    	drm_sched_job_begin(sched_job);
>>>>    
>>>>    	trace_drm_run_job(sched_job, entity);
>>>> @@ -1139,7 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>>     * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
>>>>     *	       allocated and used
>>>>     * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
>>>> - * @hw_submission: number of hw submissions that can be in flight
>>>> + * @credit_limit: the number of credits this scheduler can hold from all jobs
>>>>     * @hang_limit: number of times to allow a job to hang before dropping it
>>>>     * @timeout: timeout value in jiffies for the scheduler
>>>>     * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
>>>> @@ -1153,14 +1231,14 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>>    int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>    		   const struct drm_sched_backend_ops *ops,
>>>>    		   struct workqueue_struct *submit_wq,
>>>> -		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
>>>> +		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
>>>>    		   long timeout, struct workqueue_struct *timeout_wq,
>>>>    		   atomic_t *score, const char *name, struct device *dev)
>>>>    {
>>>>    	int i, ret;
>>>>    
>>>>    	sched->ops = ops;
>>>> -	sched->hw_submission_limit = hw_submission;
>>>> +	sched->credit_limit = credit_limit;
>>>>    	sched->name = name;
>>>>    	if (submit_wq) {
>>>>    		sched->submit_wq = submit_wq;
>>>> @@ -1211,7 +1289,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>    	init_waitqueue_head(&sched->job_scheduled);
>>>>    	INIT_LIST_HEAD(&sched->pending_list);
>>>>    	spin_lock_init(&sched->job_list_lock);
>>>> -	atomic_set(&sched->hw_rq_count, 0);
>>>> +	atomic_set(&sched->credit_count, 0);
>>>>    	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>>    	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>>>    	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>>> diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
>>>> index 2e94ce788c71..8479e5302f7b 100644
>>>> --- a/drivers/gpu/drm/v3d/v3d_gem.c
>>>> +++ b/drivers/gpu/drm/v3d/v3d_gem.c
>>>> @@ -417,7 +417,7 @@ v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
>>>>    	job->free = free;
>>>>    
>>>>    	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
>>>> -				 v3d_priv);
>>>> +				 1, v3d_priv);
>>>>    	if (ret)
>>>>    		goto fail;
>>>>    
>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>> index e5a6166eb152..a911b3f5917e 100644
>>>> --- a/include/drm/gpu_scheduler.h
>>>> +++ b/include/drm/gpu_scheduler.h
>>>> @@ -321,6 +321,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>>>>     * @sched: the scheduler instance on which this job is scheduled.
>>>>     * @s_fence: contains the fences for the scheduling of job.
>>>>     * @finish_cb: the callback for the finished fence.
>>>> + * @credits: the number of credits this job contributes to the scheduler
>>>>     * @work: Helper to reschdeule job kill to different context.
>>>>     * @id: a unique id assigned to each job scheduled on the scheduler.
>>>>     * @karma: increment on every hang caused by this job. If this exceeds the hang
>>>> @@ -340,6 +341,8 @@ struct drm_sched_job {
>>>>    	struct drm_gpu_scheduler	*sched;
>>>>    	struct drm_sched_fence		*s_fence;
>>>>    
>>>> +	u32				credits;
>>>> +
>>>>    	/*
>>>>    	 * work is used only after finish_cb has been used and will not be
>>>>    	 * accessed anymore.
>>>> @@ -463,13 +466,29 @@ struct drm_sched_backend_ops {
>>>>             * and it's time to clean it up.
>>>>    	 */
>>>>    	void (*free_job)(struct drm_sched_job *sched_job);
>>>> +
>>>> +	/**
>>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>>> +	 * job for execution.
>>>> +	 *
>>>
>>> This whole paragraph isn't very clear about what update_job_credits does.
>>> Perhaps a simple and straightforward name would be more clear:
>>> 	get_job_credits: Returns the number of credits this job would
>>> 		take.
>>
>> I still very much prefer 'update' over 'get' for the following reasons.
>>
>> 1. 'Update' describes very well what the driver should do when receiving this
>>      callback. It should update the number of credits the job requires at the time
>>      it receives the callback.
>>
>> 2. 'get', as well as 'put', is typically used in a different context in the kernel
>>      and hence this might create confusion.
> 
> That's fine--agreed.
> 
> Perhaps add a comment to say, "Update from the value set by drm_sched_job_init()",
> to make the "update" instantly visualized by the reader of the code.

Sounds good.

> 
>>
>>>
>>>> +	 * Drivers may use this to update the job's submission credits, which is
>>>> +	 * useful to e.g. deduct the number of native fences which have been
>>>> +	 * signaled meanwhile.
>>>> +	 *
>>>> +	 * The callback must either return the new number of submission credits
>>>> +	 * for the given job, or zero if no update is required.
>>>
>>> The word "update" is confusing here and it implies that DRM should keep track
>>> of the previous value this function had returned.
>>>
>>> Perhaps we can just say:
>>> 	* This callback returns the number of credits this job would take if pushed
>>> 	* to the driver/hardware. It returns 0, if this job would take no credits.
>>
>> For this version of the patch this would be wrong, 0 really means that we keep
>> job->credits as it is. However, I'd probably change this in V4 as requested by Boris,
>> such that we accept 0 credits, but WARN() on it. Unless we consider it to be a valid
>> use case to have jobs that are not flow-controlled at all.
> 
> I agree with your reasoning. Yeah, not sure what the _meaning_ of update_job_credits()
> returning 0 is (from a previous non-zero value).
> 
> Perhaps, just leave it up to the driver--they should know what they're doing and
> just warn, but take the 0, and run with it. If drivers become clever with the 0,
> we can remove the WARN down the road.

Yeah, sounds reasonable. Let's do that.
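
For illustration, the semantics agreed on here would boil down to something
like this in drm_sched_job_init() (just a sketch of the idea, not the actual
V4 change):

	/* Accept whatever the driver passed in, including 0, but warn about
	 * a zero credit count instead of silently giving it a special meaning.
	 */
	WARN_ON(!credits);
	job->credits = credits;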

- Danilo

> 
> So, yeah, looks good! Thanks.
> 
>>
>> All other comments look good, gonna address them in a V4.
> 
> Great!
>
  
Boris Brezillon Oct. 27, 2023, 4:26 p.m. UTC | #18
On Fri, 27 Oct 2023 16:34:26 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> On 10/27/23 09:17, Boris Brezillon wrote:
> > Hi Danilo,
> > 
> > On Thu, 26 Oct 2023 18:13:00 +0200
> > Danilo Krummrich <dakr@redhat.com> wrote:
> >   
> >> +
> >> +	/**
> >> +	 * @update_job_credits: Called once the scheduler is considering this
> >> +	 * job for execution.
> >> +	 *
> >> +	 * Drivers may use this to update the job's submission credits, which is
> >> +	 * useful to e.g. deduct the number of native fences which have been
> >> +	 * signaled meanwhile.
> >> +	 *
> >> +	 * The callback must either return the new number of submission credits
> >> +	 * for the given job, or zero if no update is required.
> >> +	 *
> >> +	 * This callback is optional.
> >> +	 */
> >> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);  
> > 
> > I'm copying my late reply to v2 here so it doesn't get lost:
> > 
> > I keep thinking it'd be simpler to make this a void function that
> > updates s_job->submission_credits directly. I also don't see the
> > problem with doing a sanity check on job->submission_credits. I mean,
> > if the driver is doing something silly, you can't do much to prevent it
> > anyway, except warn the user that something wrong has happened. If you
> > want to
> > 
> > 	WARN_ON(job->submission_credits == 0 ||
> > 		job->submission_credits > job_old_submission_credits);
> > 
> > that's fine. But none of this sanity checking has to do with the
> > function prototype/semantics, and I'm still not comfortable with this 0  
> > => no-change. If there's no change, we should just leave  
> > job->submission_credits unchanged (or return job->submission_credits)
> > instead of inventing a new special case.  
> 
> If we can avoid letting drivers change fields of generic structures directly
> without any drawbacks I think we should avoid it. Currently, drivers shouldn't
> have the need to mess with job->credits directly. The initial value is set
> through drm_sched_job_init() and is updated through the return value of
> update_job_credits().

Fair enough. I do agree that keeping internal fields out of driver
hands is a good thing in general, it's just that it's already
free-for-all in so many places in drm_sched (like the fact drivers
iterate the pending list in their stop-queue handling) that I didn't
really see it as an issue. Note that there's always the option of
providing drm_sched_job_{update,get}_credits() helpers, with the update
helper making sure the new credits value is consistent (smaller or
equal to the old one, and not zero).
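
For the record, such helpers could look roughly like the sketch below. The
names follow the suggestion above and are hypothetical, they are not part of
the posted patch:

	u32 drm_sched_job_get_credits(struct drm_sched_job *job)
	{
		return job->credits;
	}

	void drm_sched_job_update_credits(struct drm_sched_job *job, u32 credits)
	{
		/* Only accept consistent updates: non-zero and never growing. */
		if (WARN_ON(!credits || credits > job->credits))
			return;

		job->credits = credits;
	}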

> 
> I'm fine getting rid of the 0 => no-change semantics though. Instead we can just
> WARN() on 0.

Yeah, I think that's preferable. It's pretty easy to return the old
value if the driver has a way to detect when nothing changed (with a
get helper if you don't want drivers to touch the credits field).

> However, if we do that I'd also want to change it for
> drm_sched_job_init() (where 0 currently defaults to 1) such that we accept 0, but
> WARN() accordingly.

Sure. You update all drivers anyway, so passing 1 instead of 0 is not a
big deal, I would say.

> 
>> I think it's only consistent to either give 0 a different meaning throughout or just
> accept it but WARN() on it.

Using default as a default value makes sense when you're passing
zero-initialized objects that are later extended with new fields, but
here you update the function prototype and all the call sites, so we're
better off considering 0 as an invalid value, IMHO.
  
Boris Brezillon Oct. 27, 2023, 4:31 p.m. UTC | #19
On Fri, 27 Oct 2023 16:23:24 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> On 10/27/23 10:25, Boris Brezillon wrote:
> > Hi Danilo,
> > 
> > On Thu, 26 Oct 2023 18:13:00 +0200
> > Danilo Krummrich <dakr@redhat.com> wrote:
> >   
> >> Currently, job flow control is implemented simply by limiting the number
> >> of jobs in flight. Therefore, a scheduler is initialized with a credit
> >> limit that corresponds to the number of jobs which can be sent to the
> >> hardware.
> >>
> >> This implies that for each job, drivers need to account for the maximum
> >> job size possible in order to not overflow the ring buffer.
> >>
> >> However, there are drivers, such as Nouveau, where the job size has a
> >> rather large range. For such drivers it can easily happen that job
> >> submissions not even filling the ring by 1% can block subsequent
> >> submissions, which, in the worst case, can lead to the ring run dry.
> >>
> >> In order to overcome this issue, allow for tracking the actual job size
> >> instead of the number of jobs. Therefore, add a field to track a job's
> >> credit count, which represents the number of credits a job contributes
> >> to the scheduler's credit limit.
> >>
> >> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> >> ---
> >> Changes in V2:
> >> ==============
> >>    - fixed up influence on scheduling fairness due to consideration of a job's
> >>      size
> >>      - If we reach a ready entity in drm_sched_select_entity() but can't actually
> >>        queue a job from it due to size limitations, just give up and go to sleep
> >>        until woken up due to a pending job finishing, rather than continue to try
> >>        other entities.
> >>    - added a callback to dynamically update a job's credits (Boris)  
> > 
> > This callback seems controversial. I'd suggest dropping it, so the
> > patch can be merged.  
> 
> I don't think we should drop it just for the sake of moving forward. If there are objections
> we'll discuss them. I've seen good reasons why the drivers you are working on require this,
> while, following the discussion, I have *not* seen any reasons to drop it. It helps simplify
> driver code and doesn't introduce any complexity or overhead to existing drivers.

Up to you. I'm just saying, moving one step in the right direction is
better than being stuck, and it's not like adding this callback in a
follow-up patch is super complicated either. If you're confident that
we can get all parties to agree on keeping this hook, fine by me.
  
Boris Brezillon Oct. 27, 2023, 4:41 p.m. UTC | #20
On Fri, 27 Oct 2023 10:32:52 -0400
Luben Tuikov <ltuikov89@gmail.com> wrote:

> On 2023-10-27 04:25, Boris Brezillon wrote:
> > Hi Danilo,
> > 
> > On Thu, 26 Oct 2023 18:13:00 +0200
> > Danilo Krummrich <dakr@redhat.com> wrote:
> >   
> >> Currently, job flow control is implemented simply by limiting the number
> >> of jobs in flight. Therefore, a scheduler is initialized with a credit
> >> limit that corresponds to the number of jobs which can be sent to the
> >> hardware.
> >>
> >> This implies that for each job, drivers need to account for the maximum
> >> job size possible in order to not overflow the ring buffer.
> >>
> >> However, there are drivers, such as Nouveau, where the job size has a
> >> rather large range. For such drivers it can easily happen that job
> >> submissions not even filling the ring by 1% can block subsequent
> >> submissions, which, in the worst case, can lead to the ring run dry.
> >>
> >> In order to overcome this issue, allow for tracking the actual job size
> >> instead of the number of jobs. Therefore, add a field to track a job's
> >> credit count, which represents the number of credits a job contributes
> >> to the scheduler's credit limit.
> >>
> >> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> >> ---
> >> Changes in V2:
> >> ==============
> >>   - fixed up influence on scheduling fairness due to consideration of a job's
> >>     size
> >>     - If we reach a ready entity in drm_sched_select_entity() but can't actually
> >>       queue a job from it due to size limitations, just give up and go to sleep
> >>       until woken up due to a pending job finishing, rather than continue to try
> >>       other entities.
> >>   - added a callback to dynamically update a job's credits (Boris)  
> > 
> > This callback seems controversial. I'd suggest dropping it, so the
> > patch can be merged.  
> 
> Sorry, why is it controversial? (I did read back-and-forth above, but it wasn't clear
> why it is /controversial/.)

That's a question for Christian, I guess. I didn't quite get what he
was worried about, other than this hook introducing a new way for
drivers to screw things up by returning funky/invalid credits (which we
can report with WARN_ON()s). But let's be honest, there's probably a
hundred different ways (if not more) drivers can shoot themselves in the
foot with drm_sched already...

> 
> I believe only drivers are privy to changes in the credit availability as their
> firmware and hardware executes new jobs and finishes others, and so this "update"
> here is essential--leaving it only to prepare_job() wouldn't quite fulfill the vision
> of why the credit mechanism was introduced by this patch in the first place.

I kinda agree with you, even if I wouldn't be so pessimistic as to how
useful this patch would be without the ->update_job_credits() hook
(it already makes the situation a lot better for Nouveau and probably
other drivers too).
  
Luben Tuikov Oct. 28, 2023, 3:34 a.m. UTC | #21
Hi,

On 2023-10-27 12:26, Boris Brezillon wrote:
> On Fri, 27 Oct 2023 16:34:26 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
>> On 10/27/23 09:17, Boris Brezillon wrote:
>>> Hi Danilo,
>>>
>>> On Thu, 26 Oct 2023 18:13:00 +0200
>>> Danilo Krummrich <dakr@redhat.com> wrote:
>>>   
>>>> +
>>>> +	/**
>>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>>> +	 * job for execution.
>>>> +	 *
>>>> +	 * Drivers may use this to update the job's submission credits, which is
>>>> +	 * useful to e.g. deduct the number of native fences which have been
>>>> +	 * signaled meanwhile.
>>>> +	 *
>>>> +	 * The callback must either return the new number of submission credits
>>>> +	 * for the given job, or zero if no update is required.
>>>> +	 *
>>>> +	 * This callback is optional.
>>>> +	 */
>>>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);  
>>>
>>> I'm copying my late reply to v2 here so it doesn't get lost:
>>>
>>> I keep thinking it'd be simpler to make this a void function that
>>> updates s_job->submission_credits directly. I also don't see the
>>> problem with doing a sanity check on job->submission_credits. I mean,
>>> if the driver is doing something silly, you can't do much to prevent it
>>> anyway, except warn the user that something wrong has happened. If you
>>> want to
>>>
>>> 	WARN_ON(job->submission_credits == 0 ||
>>> 		job->submission_credits > job_old_submission_credits);
>>>
>>> that's fine. But none of this sanity checking has to do with the
>>> function prototype/semantics, and I'm still not comfortable with this 0  
>>> => no-change. If there's no change, we should just leave  
>>> job->submission_credits unchanged (or return job->submission_credits)
>>> instead of inventing a new special case.  
>>
>> If we can avoid letting drivers change fields of generic structures directly
>> without any drawbacks I think we should avoid it. Currently, drivers shouldn't
>> have the need to mess with job->credits directly. The initial value is set
>> through drm_sched_job_init() and is updated through the return value of
>> update_job_credits().
> 
> Fair enough. I do agree that keeping internal fields out of driver
> hands is a good thing in general, it's just that it's already
> free-for-all in so many places in drm_sched (like the fact drivers

"Free-for-all" doesn't mean we need to follow suit. We should keep
good programming practices, as this patch strives to.

> iterate the pending list in their stop-queue handling) that I didn't
> really see it as an issue. Note that there's always the option of
> providing drm_sched_job_{update,get}_credits() helpers, with the update
> helper making sure the new credits value is consistent (smaller or
> equal to the old one, and not zero).
> 
>>
>> I'm fine getting rid of the 0 => no-change semantics though. Instead we can just
>> WARN() on 0.
> 
> Yeah, I think that's preferable. It's pretty easy to return the old
> value if the driver has a way to detect when nothing changed (with a
> get helper if you don't want drivers to touch the credits field).
> 
>> However, if we do that I'd also want to change it for
>> drm_sched_job_init() (where 0 currently defaults to 1) such that we accept 0, but
>> WARN() accordingly.
> 
> Sure. You update all drivers anyway, so passing 1 instead of 0 is not a
> big deal, I would say.

At this point in time, we should consider 1 as normal, 0 as out of spec, and
WARN on it but carry on and (perhaps) reset it to 1. Drivers in the future may
see a need (i.e. do tricks) to return 0, at which point they'll submit a patch which
does two things: 1) removes the WARN, 2) removes the reset from 0 to 1, and
explains why they need to return 0 to allow (one more) job. But we're nowhere near that yet,
so status quo for now.

I don't see how it makes sense to call drm_sched_job_init(credits:0), and I believe
the code is correct to default to 1 in that case--which falls back to the current
flow control we have, which we want.

> 
>>
>> I think it's consequent to either consistently give 0 a different meaning or just
>> accept it but WARN() on it.
> 
> Using default as a default value makes sense when you're passing

I suppose you meant "using zero as a default value".

> zero-initialized objects that are later extended with new fields, but
> here you update the function prototype and all the call sites, so we're
> better off considering 0 as an invalid value, IMHO.

Yes, absolutely.

You never want to give 0 a meaning, since as you pointed out, it is zero-ed
memory, and as such, can have any meaning you'd like. So yes: WARN on 0;
1 is good and normal.

Regards,
Luben
  
Luben Tuikov Oct. 28, 2023, 3:37 a.m. UTC | #22
On 2023-10-27 12:31, Boris Brezillon wrote:
> On Fri, 27 Oct 2023 16:23:24 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
>> On 10/27/23 10:25, Boris Brezillon wrote:
>>> Hi Danilo,
>>>
>>> On Thu, 26 Oct 2023 18:13:00 +0200
>>> Danilo Krummrich <dakr@redhat.com> wrote:
>>>   
>>>> Currently, job flow control is implemented simply by limiting the number
>>>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>>>> limit that corresponds to the number of jobs which can be sent to the
>>>> hardware.
>>>>
>>>> This implies that for each job, drivers need to account for the maximum
>>>> job size possible in order to not overflow the ring buffer.
>>>>
>>>> However, there are drivers, such as Nouveau, where the job size has a
>>>> rather large range. For such drivers it can easily happen that job
>>>> submissions not even filling the ring by 1% can block subsequent
>>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>>
>>>> In order to overcome this issue, allow for tracking the actual job size
>>>> instead of the number of jobs. Therefore, add a field to track a job's
>>>> credit count, which represents the number of credits a job contributes
>>>> to the scheduler's credit limit.
>>>>
>>>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>>>> ---
>>>> Changes in V2:
>>>> ==============
>>>>    - fixed up influence on scheduling fairness due to consideration of a job's
>>>>      size
>>>>      - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>>>        queue a job from it due to size limitations, just give up and go to sleep
>>>>        until woken up due to a pending job finishing, rather than continue to try
>>>>        other entities.
>>>>    - added a callback to dynamically update a job's credits (Boris)  
>>>
>>> This callback seems controversial. I'd suggest dropping it, so the
>>> patch can be merged.  
>>
>> I don't think we should drop it just for the sake of moving forward. If there are objections
>> we'll discuss them. I've seen good reasons why the drivers you are working on require this,
>> while, following the discussion, I have *not* seen any reasons to drop it. It helps simplify
>> driver code and doesn't introduce any complexity or overhead to existing drivers.
> 
> Up to you. I'm just saying, moving one step in the right direction is
> better than being stuck, and it's not like adding this callback in a
> follow-up patch is super complicated either. If you're confident that
> we can get all parties to agree on keeping this hook, fine by me.

I'd rather have it in now, as it is really *the vision* of this patch. There's no point
in pushing in something half-baked.
  
Luben Tuikov Oct. 28, 2023, 3:51 a.m. UTC | #23
On 2023-10-27 12:41, Boris Brezillon wrote:
> On Fri, 27 Oct 2023 10:32:52 -0400
> Luben Tuikov <ltuikov89@gmail.com> wrote:
> 
>> On 2023-10-27 04:25, Boris Brezillon wrote:
>>> Hi Danilo,
>>>
>>> On Thu, 26 Oct 2023 18:13:00 +0200
>>> Danilo Krummrich <dakr@redhat.com> wrote:
>>>   
>>>> Currently, job flow control is implemented simply by limiting the number
>>>> of jobs in flight. Therefore, a scheduler is initialized with a credit
>>>> limit that corresponds to the number of jobs which can be sent to the
>>>> hardware.
>>>>
>>>> This implies that for each job, drivers need to account for the maximum
>>>> job size possible in order to not overflow the ring buffer.
>>>>
>>>> However, there are drivers, such as Nouveau, where the job size has a
>>>> rather large range. For such drivers it can easily happen that job
>>>> submissions not even filling the ring by 1% can block subsequent
>>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>>
>>>> In order to overcome this issue, allow for tracking the actual job size
>>>> instead of the number of jobs. Therefore, add a field to track a job's
>>>> credit count, which represents the number of credits a job contributes
>>>> to the scheduler's credit limit.
>>>>
>>>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>>>> ---
>>>> Changes in V2:
>>>> ==============
>>>>   - fixed up influence on scheduling fairness due to consideration of a job's
>>>>     size
>>>>     - If we reach a ready entity in drm_sched_select_entity() but can't actually
>>>>       queue a job from it due to size limitations, just give up and go to sleep
>>>>       until woken up due to a pending job finishing, rather than continue to try
>>>>       other entities.
>>>>   - added a callback to dynamically update a job's credits (Boris)  
>>>
>>> This callback seems controversial. I'd suggest dropping it, so the
>>> patch can be merged.  
>>
>> Sorry, why is it controversial? (I did read back-and-forth above, but it wasn't clear
>> why it is /controversial/.)
> 
> That's a question for Christian, I guess. I didn't quite get what he
> was worried about, other than this hook introducing a new way for
> drivers to screw things up by returning funky/invalid credits (which we

It's up to the driver--they can test, test, test, and fix their code and so on.
We can only do so much and shouldn't be baby-sitting drivers ad nauseam. Drivers
can also simply choose not to define this callback. :-)

> can report with WARN_ON()s). But let's be honest, there's probably a
> hundred different ways (if not more) drivers can shoot themselves in the
> foot with drm_sched already...

Yes, that's true. So there's no worries with this hook.

> 
>>
>> I believe only drivers are privy to changes in the credit availability as their
>> firmware and hardware executes new jobs and finishes others, and so this "update"
>> here is essential--leaving it only to prepare_job() wouldn't quite fulfill the vision
>> of why the credit mechanism was introduced by this patch in the first place.
> 
> I kinda agree with you, even if I wouldn't be so pessimistic as to how
> useful this patch would be without the ->update_job_credits() hook
> (it already makes the situation a lot better for Nouveau and probably
> other drivers too).

Sure, and that's a good thing.

The heart of the dynamic credit scheme this patch is introducing *is* the update_job_credits()
callback. Without it, it's just about like the current job flow-control scheme we have with
varied job weights (credits). Remember, it is an optional callback and a driver can choose NOT
to define it--simple. :-)

So, I'm very excited about this, and see a wide range of applications and tricks drivers
can do with the credit scheme (albeit had it been an "int" bwha-ha-ha-ha ]:-> ).

Have a good weekend everyone!
  
Christian König Oct. 30, 2023, 7:38 a.m. UTC | #24
Hi Boris,

On 27.10.23 at 12:17, Boris Brezillon wrote:
> Hi Christian,
>
> On Fri, 27 Oct 2023 11:06:44 +0200
> Christian König <christian.koenig@amd.com> wrote:
>
>> On 27.10.23 at 10:22, Boris Brezillon wrote:
>>> On Fri, 27 Oct 2023 09:44:13 +0200
>>> Christian König<christian.koenig@amd.com>  wrote:
>>>   
>>>> On 27.10.23 at 09:39, Boris Brezillon wrote:
>>>>> On Fri, 27 Oct 2023 09:35:01 +0200
>>>>> Christian König<christian.koenig@amd.com>   wrote:
>>>>>      
>>>>>> On 27.10.23 at 09:32, Boris Brezillon wrote:
>>>>>>> On Fri, 27 Oct 2023 09:22:12 +0200
>>>>>>> Christian König<christian.koenig@amd.com>   wrote:
>>>>>>>         
>>>>>>>>> +
>>>>>>>>> +	/**
>>>>>>>>> +	 * @update_job_credits: Called once the scheduler is considering this
>>>>>>>>> +	 * job for execution.
>>>>>>>>> +	 *
>>>>>>>>> +	 * Drivers may use this to update the job's submission credits, which is
>>>>>>>>> +	 * useful to e.g. deduct the number of native fences which have been
>>>>>>>>> +	 * signaled meanwhile.
>>>>>>>>> +	 *
>>>>>>>>> +	 * The callback must either return the new number of submission credits
>>>>>>>>> +	 * for the given job, or zero if no update is required.
>>>>>>>>> +	 *
>>>>>>>>> +	 * This callback is optional.
>>>>>>>>> +	 */
>>>>>>>>> +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
>>>>>>>> Why do we need an extra callback for this?
>>>>>>>>
>>>>>>>> Just document that prepare_job() is allowed to reduce the number of
>>>>>>>> credits the job might need.
>>>>>>> ->prepare_job() is called only once if the returned fence is NULL, but
>>>>>>> we need this credit-update to happen every time a job is considered for
>>>>>>> execution by the scheduler.
>>>>>> But the job is only considered for execution once. How do you see that
>>>>>> this is called multiple times?
>>>>> Nope, it's not. If drm_sched_can_queue() returns false, the scheduler
>>>>> will go look for another entity that has a job ready for execution, and
>>>>> get back to this entity later, and test drm_sched_can_queue() again.
>>>>> Basically, any time drm_sched_can_queue() is called, the job credits
>>>>> update should happen, so we have an accurate view of how many credits
>>>>> this job needs.
>>>> Well, that is the handling which I already rejected because it creates
>>>> unfairness between processes. When you consider the credits needed
>>>> *before* scheduling jobs with a lower credit count are always preferred
>>>> over jobs with a higher credit count.
>>> My bad, it doesn't pick another entity when an entity with a
>>> ready job that doesn't fit the queue is found, it just bails out from
>>> drm_sched_rq_select_entity_rr() and returns NULL (AKA: no ready entity
>>> found). But we still want to update the job credits before checking if
>>> the job fits or not (next time this entity is tested).
>> Why? I only see a few possibilities here:
>>
>> 1. You need to wait for submissions to the same scheduler to finish
>> before the one you want to push to the ring.
>>       This case can be avoided by trying to cast the dependency fences to
>> drm_sched_fences and looking if they are already scheduled.
>>
>> 2. You need to wait for submissions to a different scheduler instance
>> and in this case you should probably return that as dependency instead.
> It's already described as a dependency, but native dependency waits are
> deferred to the FW (we wait on scheduled to run the job, not finished).
> The thing is, after ->prepare_job() returned NULL (no non-native deps
> remaining), and before ->run_job() gets called, there might be several
> of these native deps that get signaled, and we're still considering
> job->submission_credits as unchanged, when each of these signalled
> fence could have reduced the job credits, potentially allowing it to be
> submitted sooner.

Ah, ok that at least clears up your intentions here.

Question is if that is really that important for you? I mean you just 
seem to fill up more of the ring buffer.

>
>> So to me it looks like when prepare_job() is called because it is
>> selected as next job for submission you should already know how many
>> credits it needs.
> You know how many credits it needs when ->prepare_job() is called, but
> if this number is too high, the entity will not be picked, and next
> time it's checked, you'll still consider the job credits at the time
> ->prepare_job() was called, which might differ from the current job
> credits (signalled native fences might have been signalled in the
> meantime, and they could be evicted).
>
>>>> What you can do is to look at the credits of a job *after* it was picked
>>>> up for scheduling so that you can schedule more jobs.
>>> Sure, but then you might further delay your job if something made it
>>> smaller (ie. native fences got signaled) between ->prepare_job() and
>>> drm_sched_can_queue(). And any new drm_sched_can_queue() test would
>>> just see the old credits value.
>>>
>>> Out of curiosity, what are you worried about with this optional
>>> ->update_job_credits() call in the drm_sched_can_queue() path? Is the
>>> if (sched->update_job_credits) overhead considered too high for drivers
>>> that don't need it?
>> Since the dma_fences are also used for resource management the scheduler
>> is vital for correct system operation.
>>
>> We had massive problems because people tried to over-optimize the
>> dma_fence handling, which led to very hard to narrow down memory
>> corruptions.
>>
>> So for every change you come up with here you need to have a very, very good
>> justification. And just saving a bit of your ring buffer size is
>> certainly not one of them.
> I fail to see how calling ->update_job_credits() changes the scheduler
> behavior or how it relates to the sort of memory corruption you're
> referring to.

Yeah, you are right that's not even remotely related.

> And yes, we can live with the overhead of making jobs
> slightly bigger than they actually are, thus potentially delaying their
> execution. It's just that I don't quite understand the rationale behind
> this conservatism, as I don't really see what negative impact this extra
> ->update_job_credits() call in the credit checking path has, other than
> the slight overhead of an if-check for drivers that don't need it.

Experience has shown that we should not make the scheduler more 
complicated than necessary. And I still think that the ring buffers only 
need to be filled enough to keep the hardware busy.

If this here has some measurable positive effect then yeah we should 
probably do it, but as long as it's only nice to have I have some 
objections to that.

Regards,
Christian.

>
> Regards,
>
> Boris
  
Danilo Krummrich Oct. 30, 2023, 5:56 p.m. UTC | #25
Hi Christian,

On Mon, Oct 30, 2023 at 08:38:45AM +0100, Christian König wrote:
> Hi Boris,
> 
> On 27.10.23 at 12:17, Boris Brezillon wrote:
> > Hi Christian,
> > 
> > On Fri, 27 Oct 2023 11:06:44 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> > 
> > > On 27.10.23 at 10:22, Boris Brezillon wrote:
> > > > On Fri, 27 Oct 2023 09:44:13 +0200
> > > > Christian König<christian.koenig@amd.com>  wrote:
> > > > > On 27.10.23 at 09:39, Boris Brezillon wrote:
> > > > > > On Fri, 27 Oct 2023 09:35:01 +0200
> > > > > > Christian König<christian.koenig@amd.com>   wrote:
> > > > > > > On 27.10.23 at 09:32, Boris Brezillon wrote:
> > > > > > > > On Fri, 27 Oct 2023 09:22:12 +0200
> > > > > > > > Christian König<christian.koenig@amd.com>   wrote:
> > > > > > > > > > +
> > > > > > > > > > +	/**
> > > > > > > > > > +	 * @update_job_credits: Called once the scheduler is considering this
> > > > > > > > > > +	 * job for execution.
> > > > > > > > > > +	 *
> > > > > > > > > > +	 * Drivers may use this to update the job's submission credits, which is
> > > > > > > > > > +	 * useful to e.g. deduct the number of native fences which have been
> > > > > > > > > > +	 * signaled meanwhile.
> > > > > > > > > > +	 *
> > > > > > > > > > +	 * The callback must either return the new number of submission credits
> > > > > > > > > > +	 * for the given job, or zero if no update is required.
> > > > > > > > > > +	 *
> > > > > > > > > > +	 * This callback is optional.
> > > > > > > > > > +	 */
> > > > > > > > > > +	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
> > > > > > > > > Why do we need an extra callback for this?
> > > > > > > > > 
> > > > > > > > > Just document that prepare_job() is allowed to reduce the number of
> > > > > > > > > credits the job might need.
> > > > > > > > ->prepare_job() is called only once if the returned fence is NULL, but
> > > > > > > > we need this credit-update to happen every time a job is considered for
> > > > > > > > execution by the scheduler.
> > > > > > > But the job is only considered for execution once. How do you see that
> > > > > > > this is called multiple times?
> > > > > > Nope, it's not. If drm_sched_can_queue() returns false, the scheduler
> > > > > > will go look for another entity that has a job ready for execution, and
> > > > > > get back to this entity later, and test drm_sched_can_queue() again.
> > > > > > Basically, any time drm_sched_can_queue() is called, the job credits
> > > > > > update should happen, so we have an accurate view of how many credits
> > > > > > this job needs.
> > > > > Well, that is the handling which I already rejected because it creates
> > > > > unfairness between processes. When you consider the credits needed
> > > > > *before* scheduling, jobs with a lower credit count are always preferred
> > > > > over jobs with a higher credit count.
> > > > My bad, it doesn't pick another entity when an entity with a
> > > > ready job that doesn't fit the queue is found, it just bails out from
> > > > drm_sched_rq_select_entity_rr() and returns NULL (AKA: no ready entity
> > > > found). But we still want to update the job credits before checking if
> > > > the job fits or not (next time this entity is tested).
> > > Why? I only see a few possibilities here:
> > > 
> > > 1. You need to wait for submissions to the same scheduler to finish
> > > before the one you want to push to the ring.
> > >       This case can be avoided by trying to cast the dependency fences to
> > > drm_sched_fences and looking if they are already scheduled.
> > > 
> > > 2. You need to wait for submissions to a different scheduler instance
> > > and in this case you should probably return that as dependency instead.
> > It's already described as a dependency, but native dependency waits are
> > deferred to the FW (we wait on scheduled to run the job, not finished).
> > The thing is, after ->prepare_job() returned NULL (no non-native deps
> > remaining), and before ->run_job() gets called, there might be several
> > of these native deps that get signaled, and we're still considering
> > job->submission_credits as unchanged, when each of these signalled
> > fences could have reduced the job credits, potentially allowing it to be
> > submitted sooner.
> 
> Ah, ok that at least clears up your intentions here.
> 
> Question is if that is really that important for you? I mean you just seem
> to fill up more of the ring buffer.
> 
> > 
> > > So to me it looks like when prepare_job() is called because it is
> > > selected as next job for submission you should already know how many
> > > credits it needs.
> > You know how many credits it needs when ->prepare_job() is called, but
> > if this number is too high, the entity will not be picked, and next
> > time it's checked, you'll still consider the job credits at the time
> > ->prepare_job() was called, which might differ from the current job
> > credits (signalled native fences might have been signalled in the
> > meantime, and they could be evicted).
> > 
> > > > > What you can do is to look at the credits of a job *after* it was picked
> > > > > up for scheduling so that you can schedule more jobs.
> > > > Sure, but then you might further delay your job if something made it
> > > > smaller (ie. native fences got signaled) between ->prepare_job() and
> > > > drm_sched_can_queue(). And any new drm_sched_can_queue() test would
> > > > just see the old credits value.
> > > > 
> > > > Out of curiosity, what are you worried about with this optional
> > > > ->update_job_credits() call in the drm_sched_can_queue() path? Is the
> > > > if (sched->update_job_credits) overhead considered too high for drivers
> > > > that don't need it?
> > > Since the dma_fences are also used for resource management the scheduler
> > > is vital for correct system operation.
> > > 
> > > We had massive problems because people tried to over-optimize the
> > > dma_fence handling, which led to very hard to narrow down memory
> > > corruptions.
> > > 
> > > So for every change you come up with here you need to have a very, very good
> > > justification. And just saving a bit of your ring buffer size is
> > > certainly not one of them.
> > I fail to see how calling ->update_job_credits() changes the scheduler
> > behavior or how it relates to the sort of memory corruption you're
> > referring to.
> 
> Yeah, you are right that's not even remotely related.
> 
> > And yes, we can live with the overhead of making jobs
> > slightly bigger than they actually are, thus potentially delaying their
> > execution. It's just that I don't quite understand the rationale behind
> > this conservatism, as I don't really see what negative impact this extra
> > ->update_job_credits() call in the credit checking path has, other than
> > the slight overhead of an if-check for drivers that don't need it.
> 
> From experience it showed that we should not make the scheduler more
> complicated than necessary. And I still think that the ring buffers only
> need to be filled enough to keep the hardware busy.

Right, and this callback contributes to exactly that.

I don't really think there is much to worry about in terms of introducing more
complexity. The implementation behind this callback is fairly trivial - it's
simply called right before we check whether a job fits on the ring, to fetch
the job's actual size.

I would agree if the implementation of that were bulky, tricky and asking
for a compromise. But, since it actually is simple and straightforward, I really
think that if we implement some kind of dynamic job-flow control it should be
implemented as accurately as possible rather than doing it half-baked.

> 
> If this here has some measurable positive effect then yeah we should
> probably do it, but as long as it's only nice to have I have some objections
> to that.

Can't answer this, since Nouveau doesn't support native fence waits. However, I
guess it depends on how many native fences a job carries. So, if we'd have two
jobs with each of them carrying a lot of native fences, but not a lot of actual
work, I can very well imagine that over-accounting can have a measurable
impact.
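
To make that concrete, the kind of driver-side hook being discussed could look
roughly like the sketch below. Everything prefixed with my_ is made up purely
for illustration, it is not taken from Nouveau or any other driver:

	static u32 my_update_job_credits(struct drm_sched_job *sched_job)
	{
		/* Hypothetical driver state: one ring slot was accounted per
		 * native fence wait when the job was initialized.
		 */
		struct my_job *job = container_of(sched_job, struct my_job, base);
		u32 credits = sched_job->credits;
		unsigned int i;

		for (i = 0; i < job->native_fence_count; i++) {
			/* A native wait whose fence already signaled no longer
			 * occupies a slot in the ring, so give its credit back.
			 */
			if (!job->native_fence_done[i] &&
			    dma_fence_is_signaled(job->native_fences[i])) {
				job->native_fence_done[i] = true;
				if (credits > 1)	/* never drop to 0 */
					credits--;
			}
		}

		return credits;
	}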

I just wonder if we really want to ask for real measurements given that the
optimization is rather trivial and cheap and we already have enough evidence
that it makes sense conceptually.

- Danilo

> 
> Regards,
> Christian.
> 
> > 
> > Regards,
> > 
> > Boris
>
  
Danilo Krummrich Oct. 31, 2023, 3:01 p.m. UTC | #26
On Tue, Oct 31, 2023 at 02:20:50PM +0100, Christian König wrote:
> Hi Danilo,
> 
> sorry for splitting up the mail thread. I had to fetch this mail from my
> spam folder for some reason.
> 
> On 30.10.23 at 18:56, Danilo Krummrich wrote:
> > Hi Christian,
> > 
> > [SNIP]
> > > > And yes, we can live with the overhead of making jobs
> > > > slightly bigger than they actually are, thus potentially delaying their
> > > > execution. It's just that I don't quite understand the rational behind
> > > > this conservatism, as I don't really see what negative impact this extra
> > > > ->update_job_credits() call in the credit checking path has, other than
> > > > the slight overhead of an if-check for drivers that don't need it.
> > >  From experience it showed that we should not make the scheduler more
> > > complicated than necessary. And I still think that the ring buffers only
> > > need to be filled enough to keep the hardware busy.
> > Right, and this callback contributes to exactly that.
> > 
> > I don't really think there is much to worry about in terms of introducing more
> > complexity. The implementation behind this callback is fairly trivial - it's
> > simply called right before we check whether a job fits on the ring, to fetch
> > the job's actual size.
> > 
> > I would agree if the implementation of that would be bulky, tricky and asking
> > for a compromise. But, since it actually is simple and straight forward I really
> > think that if we implement some kind of dynamic job-flow control it should be
> > implemented as accurately as possible rather than doing it half-baked.
> 
> Yeah, I see the intention here and can perfectly relate to it, it's just that
> I have to prioritize other things.

I don't see any work being required from your side for this.

> 
> Adding this callback allows for the driver to influence the job submission
> and while this might seem useful now I'm just too much of a burned child to
> do stuff like this without having a really good reason for it.

It does influence the job submission in the exact same way as the initial credit
count set through drm_sched_job_init() does. There is absolutely nothing with
this callback a driver couldn't mess up in the exact same way with the initial
credit count set through drm_sched_job_init(). Following this logic we'd need to
abandon the whole patch.

Hence, I don't really understand why you're so focused on this callback.
Especially, since it's an optional one.

> 
> > > If this here has some measurable positive effect then yeah we should
> > > probably do it, but as long as it's only nice to have I have some objections
> > > to that.
> > Can't answer this, since Nouveau doesn't support native fence waits. However, I
> > guess it depends on how many native fences a job carries. So, if we'd have two
> > jobs with each of them carrying a lot of native fences, but not a lot of actual
> > work, I can very well imagine that over-accounting can have a measureable
> > impact.
> 
> What I can imagine as well is things like the hardware or firmware is
> looking at the fullness of the ring buffer to predict how much pending work
> there is.
> 
> > I just wonder if we really want to ask for real measurements given that the
> > optimization is rather trivial and cheap and we already have enough evidence
> > that it makes sense conceptually.
> 
> Well that's the point I disagree on, this callback isn't trivial. If (for
> example) the driver returns a value larger than initially estimated for the
> job we can run into an endless loop.

I agree it doesn't make sense to increase, but it wouldn't break anything,
unless the job size starts exceeding the capacity of the ring. And this case is
handled anyway.

> 
> It's just one more thing which can go boom. At bare minimum we should check
> that the value is always decreasing.

Considering the above I still agree, such a check makes sense - gonna add one.
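
A sketch of what such a check could look like where the scheduler consumes the
callback's return value (illustration only, assuming the optional callback is
set):

	u32 credits = sched->ops->update_job_credits(sched_job);

	/* A job may shrink while it sits in the queue, but it must never
	 * grow again, and zero stays out of spec.
	 */
	if (!WARN_ON(!credits || credits > sched_job->credits))
		sched_job->credits = credits;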

- Danilo

> 
> Christian.
> 
> > 
> > - Danilo
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > Regards,
> > > > 
> > > > Boris
  
Danilo Krummrich Nov. 6, 2023, 4:46 p.m. UTC | #27
On Fri, Nov 03, 2023 at 09:15:25AM +0100, Christian König wrote:
> On 02.11.23 at 19:03, Danilo Krummrich wrote:
> > On Thu, Nov 02, 2023 at 11:07:32AM +0100, Christian König wrote:
> > > Hi Danilo,
> > > 
> > > On 31.10.23 at 16:01, Danilo Krummrich wrote:
> > > > On Tue, Oct 31, 2023 at 02:20:50PM +0100, Christian König wrote:
> > > > > [SNIP]
> > > > > > Yeah, I see the intention here and can perfectly relate to it, it's just that
> > > > > > I have to prioritize other things.
> > > > I don't see any work being required from your side for this.
> > > What I wanted to say is that I understand your intentions and can relate to
> > > that, but other aspects have higher priority in this discussion.
> > What aspects would that be?
> 
> Be defensive. As far as I can see this callback is only nice to have and not
> functionally mandatory.

I gave you quite some reasons why this is desired.

And generally speaking, even if something is not functionally mandatory,
rejecting it *only* because of that isn't a good idea.

It is better to deal with it in terms of content and consider its pros and cons.

> 
> And in such cases I have to weigh between complexity by adding something
> which might go boom and being conservative and using well known and working
> code paths.
> 
> > Not breaking other drivers? - The callback is optional, and besides that, as
> > already mentioned, the callback doesn't do anything that can't already go wrong
> > with the initial credit value from drm_sched_job_init().
> 
> During drm_sched_job_init() time the fence of this job isn't published yet.
> So when the driver specified something invalid we can easily return an error
> code and abort.
> 
> Inside the scheduler we can't do anything like this. E.g. what happens if
> the driver suddenly returns a value which is too large? We can't reject that.

This is all correct and I recognize that. But again, the callback is optional.
I don't see how drivers would be affected not opting in for this feature.

And for drivers which do, they'd have the same problem (if you actually want to
call it one) doing some driver specific workaround as well. And because they'd
not have the correct hook, likely many more of them.

> 
> > Keeping things simple? - The workaround PowerVR is considering instead
> > (returning a dma-fence in ->prepare_job()) doesn't seem to contribute to this
> > goal.
> 
> I don't see this as a workaround, this is essentially the desired solution.
> All dependencies of a job should be represented as a dma-fence if possible.

So, you're saying that given we have an inaccurate way of tracking credits
(because you're rejecting the callback making it accurate), the desired solution
would be to use an existing callback with a whole different, more complicated
and hence more error prone approach to overcome that?

> 
> The background is that dma-fences have a very well defined semantics which
> through a rather elaborate hack is validated to the extent that lockdep can
> check if drivers behave correctly or not: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L194

I fail to see what dma-fences have to do with job-flow control of the ring
buffer. You mention it to be a plus that there is a hack to be able to validate
dma-fences with lockdep, but you miss the fact that the only reason we'd need
this in the first place is that you think it is desired to do job-flow control
with dma-fences.

> 
> With this here you are actually breaking this because now drivers have
> influence again on when stuff is scheduled.

I'm not breaking anything. This flow control mechanism is already in the
scheduler, it's just that it's inaccurate because drivers need to calculate
every job with the worst case credit amount. I'm just making it accurate, such
that it becomes useful for some drivers.

I also do not agree that this breaks anything dma-fence related in general. It
doesn't give drivers control of *when* stuff is scheduled. It gives the
scheduler an idea of *if* it can schedule a job without overflowing the ring
buffer. The influence of when comes through the hardware finishing a previous
job and the corresponding dma-fence callback creating the corresponding space
in terms of credit capacity of the scheduler.
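
Put into code terms, the "if" part is nothing more than a comparison against
the scheduler's remaining capacity. The following is a deliberately simplified
sketch (the patch's drm_sched_can_queue() takes the scheduler and entity rather
than the job, but the check is the same in spirit):

	static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
	{
		u32 used = atomic_read(&sched->credit_count);

		return used >= sched->credit_limit ? 0 : sched->credit_limit - used;
	}

	/* "If": a job is only picked when it currently fits into the ring.
	 * "When": credits are taken via atomic_add() when the job is pushed to
	 * the hardware and handed back once its done fence signals, which is
	 * also what wakes the scheduler up to re-evaluate the next job.
	 */
	static bool job_fits(struct drm_sched_job *sched_job)
	{
		return sched_job->credits <=
			drm_sched_available_credits(sched_job->sched);
	}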

> 
> > > > > Adding this callback allows for the driver to influence the job submission
> > > > > and while this might seems useful now I'm just to much of a burned child to
> > > > > do stuff like this without having a really good reason for it.
> > > > It does influence the job submission in the exact same way as the initial credit
> > > > count set through drm_sched_job_init() does. There is absolutely nothing with
> > > > this callback a driver couldn't mess up in the exact same way with the initial
> > > > credit count set through drm_sched_job_init(). Following this logic we'd need to
> > > > abandon the whole patch.
> > > Well what I thought you would be doing is to replace the fixed one credit
> > > per job with N credits per job.
> > That's exactly what the patch does. Plus, with the help of the
> > update_job_credits() callback, keep this job credit count accurate and up to
> > date.
> > 
> > My point is that you're concerned about the logic of a callback that just
> > repeats the exact same logic that's applied on initialization already.
> 
> As I wrote above the difference is that during initialization time we can
> just return an error code.
> 
> As soon as we have entered the scheduler we are doomed to execute it. When
> the driver now does nonsense we can't handle that situation gracefully any
> more.
> 
> Daniel was already rather concerned when I came up with the prepare callback
> and that concern was proven to be correct. This here adds another callback
> drivers can stumble over.
> 
> > > > Hence, I don't really understand why you're so focused on this callback.
> > > > Especially, since it's an optional one.
> > > It seems unnecessary to me. As long as it's just to fill up the ring
> > > buffer more than necessary it is pretty much just a perfect example of
> > > over-engineering
> > Boris explained multiple times why it is needed for PowerVR. There might be
> > cases where it doesn't matter to over-account credits for a job. And clearly
> > it's even most of them, that's what I agree on.
> 
> I probably need to re-read what Boris wrote, but my impression is that this
> is only optional for him.
> 
> > However, I don't see why we want to be less accurate than we could be and hence
> > risk issues in drivers like PowerVR that are hard to debug.
> > 
> > AFAICT, whether we could potentially see a ring run dry because of this depends
> > on the workload pattern submitted by applications. Breaking down such observed
> > performance issues to the scheduler not being accurate at this point in a few
> > months or years would be an absurd effort. So why not just rule this out in
> > advance?
> 
> Because correct dma-fence signaling is more important than driver
> performance.

You mean because drivers could return more credits than the scheduler has
capacity and then we WARN() but are still stuck?

The easy fix would be to still WARN() but set the job's credits to the maximum
credit size the scheduler can handle.
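
A sketch of that "easy fix" (illustration only, where exactly it would live in
the scheduler is left open here):

	u32 credits = sched->ops->update_job_credits(sched_job);

	/* Warn about an out-of-range value, but clamp it so the job can still
	 * be queued eventually instead of stalling the scheduler forever.
	 */
	if (WARN_ON(credits > sched->credit_limit))
		credits = sched->credit_limit;

	sched_job->credits = credits;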

> 
> Those signaling bugs are even *much* more problematic than anything the
> driver can come up with.
> 
> Please see here for an example why this is so problematic: https://patches.linaro.org/project/linux-media/patch/20200612070623.1778466-1-daniel.vetter@ffwll.ch/
> 
> We basically said that all dependencies the job submission is based on are
> expressed as dma-fences, because dma-fences can be proven to be correct.

I'm pretty sure that it's also possible to prove three lines of code updating
the job's credit count to be correct...

Maybe we're better off using our energy for that instead?

> 
> When we add something like this callback we circumvent this whole checking.
> 
> Maybe we can approach this from a completely different side. Basically we
> have static and dynamic dependencies for a job.
> 
> The static ones are expressed in the dependencies xarray while the dynamic
> ones are returned by the prepare callback.
> 
> What if we completely remove the credits or ring buffer handling from the
> scheduler and put it into a separate component which drivers can use from a
> prepare callback?

While I honestly appreciate you making a proposal to move forward, I really
think that this makes everything you're concerned about just worse and even
starts getting me concerned about it.

1) We'd trade the complexity of three lines of code to update the credit count
for a whole new component.

2) You mentioned that the concerns about the prepare callback were proven to be
correct, and now you propose to extend it even further.

The update_job_credits() callback has a very limited scope and hence is way easier to
validate for correctness. The only thing we really need to do is to guarantee
forward progress, meaning we just don't allow the job's credits to exceed the
scheduler's credit capacity. And that should be it.

3) You seem to imply that you want to use dma-fences to do job-flow control,
which as I mentioned above has way more complexity (separately allocated object,
locking restrictions, etc.) and hence is more error prone.

4) Why would we remove something from the scheduler which can be considered to
be one of its core responsibilities?

> 
> Regards,
> Christian.
> 
> > 
> > Again, if this callback would introduce any complexity or tradeoffs for existing
> > drivers, I'd vote to wait until we actually see issues as well. But this simply
> > isn't the case. And if you think otherwise, please show me the complexity it
> > introduces that is concerning and the tradeoffs for existing drivers.
> > 
> > > > > > > If this here has some measurable positive effect then yeah we should
> > > > > > > probably do it, but as long as it's only nice to have I have some objections
> > > > > > > to that.
> > > > > > Can't answer this, since Nouveau doesn't support native fence waits. However, I
> > > > > > guess it depends on how many native fences a job carries. So, if we'd have two
> > > > > > jobs with each of them carrying a lot of native fences, but not a lot of actual
> > > > > > work, I can very well imagine that over-accounting can have a measureable
> > > > > > impact.
> > > > > What I can imagine as well is things like the hardware or firmware is
> > > > > looking at the fullness of the ring buffer to predict how much pending work
> > > > > there is.
> > > > > 
> > > > > > I just wonder if we really want to ask for real measurements given that the
> > > > > > optimization is rather trivial and cheap and we already have enough evidence
> > > > > > that it makes sense conceptually.
> > > > > Well that's the point I disagree on, this callback isn't trivial. If (for
> > > > > example) the driver returns a value larger than initially estimated for the
> > > > > job we can run into an endless loop.
> > > > I agree it doesn't make sense to increase, but it wouldn't break anything,
> > > > unless the job size starts exceeding the capacity of the ring. And this case is
> > > > handled anyway.
> > > > 
> > > > > It's just one more thing which can go boom. At bare minimum we should check
> > > > > that the value is always decreasing.
> > > > Considering the above I still agree, such a check makes sense - gonna add one.
> > > Please don't send anything out until we have solved this.
> > I did, but don't worry about this, it helped me to finalize the rest of the
> > patch. We can keep discussing this in this thread either way.
> > 
> > > So far I haven't seen any argument which would not let me reject this and I
> > > don't want to waste your time.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > - Danilo
> > > > 
> > > > > Christian.
> > > > > 
> > > > > > - Danilo
> > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > Regards,
> > > > > > > > 
> > > > > > > > Boris
  
Christian König Nov. 7, 2023, 9:13 a.m. UTC | #28
On 06.11.23 at 17:46, Danilo Krummrich wrote:
> On Fri, Nov 03, 2023 at 09:15:25AM +0100, Christian König wrote:
>> On 02.11.23 at 19:03, Danilo Krummrich wrote:
>> > On Thu, Nov 02, 2023 at 11:07:32AM +0100, Christian König wrote:
>> > > Hi Danilo,
>> > > > > On 31.10.23 at 16:01, Danilo Krummrich wrote:
>> > > > On Tue, Oct 31, 2023 at 02:20:50PM +0100, Christian König wrote:
>> > > > > [SNIP]
>> > > > > Yeah, I see the intention here and can perfectly relate to it 
>> it's just that
>> > > > > I have prioritize other things.
>> > > > I don't see any work being required from your side for this.
>> > > What I wanted to say is that I understand your intentions and can 
>> relate to
>> > > that, but other aspects have higher priority in this discussion.
>> > What aspects would that be?
>>
>> Be defensive. As far as I can see this callback is only nice to have 
>> and not
>> functionally mandatory.
>
> I gave you quite some reasons why this is desired.

Yeah, but I need something to make it necessary. Desired is just not 
enough in this case.

> And generally speaking, even if something is not functionally mandatory,
> rejecting it *only* because of that isn't a good idea.

Completely agree. But what makes this at least questionable is the 
combination of not mandatory and giving drivers the opportunity to mess 
with submissions again.

Especially in the scheduler and dma_fence handling we had tons of 
patches which added nice-to-have and seemingly harmless features that 
later turned into a complete nightmare to maintain.

The takeaway is that we need more big red warning signs in the form of 
documentation, and that we should try not to change things just because 
it makes them look better.

> It is better to deal with it in terms of content and consider its pros 
> and cons.
>
>>
>> And in such cases I have to weight between complexity by adding 
>> something
>> which might go boom and being conservative and using well known and 
>> working
>> code paths.
>>
>> > Not breaking other drivers? - The callback is optional, and besides 
>> that, as
>> > already mentioned, the callback doesn't do anything that can't 
>> already go wrong
>> > with the inital credit value from drm_sched_job_init().
>>
>> During drm_sched_job_init() time the fence of this job isn't 
>> published yet.
>> So when the driver specified something invalid we can easily return 
>> an error
>> code and abort.
>>
>> Inside the scheduler we can't do anything like this. E.g. what 
>> happens if
>> the driver suddenly returns a value which is to large? We can't 
>> reject that.
>
> This is all correct and I recognize that. But again, the callback is 
> optional.
> I don't see how drivers would be affected not opting in for this feature.

I actually see this as an argument against the change: if many drivers 
used it, the code path would be much better tested.

>
> And for drivers which do, they'd have the same problem (if you 
> actually want to
> call it one) doing some driver specific workaround as well. And 
> because they'd
> not have the correct hook likely many more.
>
>>
>> > Keeping things simple? - The workaround PowerVR is considering instead
>> > (returning a dma-fence in ->prepare_job()) doesn't seem to 
>> contribute to this
>> > goal.
>>
>> I don't see this as a workaround, this is essentially the desired 
>> solution.
>> All dependencies of a job should be represented as a dma-fence if 
>> possible.
>
> So, you're saying that given we have an inaccurate way of tracking 
> credits
> (because you're rejecting the callback making it accurate), the 
> desired solution
> would be to use an existing callback with a whole different, more 
> complicated
> and hence more error prone approach to overcome that?
>
>>
>> The background is that dma-fences have a very well defined semantics 
>> which
>> through a rather elaborated hack is validated to the extend that 
>> lockdep can
>> check if drivers behave correctly or not: 
>> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L194
>
> I fail to see what dma-fences have to do with job-flow control of the 
> ring
> buffer. You mention it to be a plus that there is a hack to be able to 
> validate
> dma-fences with lockdep, but you miss the fact that the only reason 
> we'd need
> this in the first place is that you think it is desired to do job-flow 
> control
> with dma-fences.

Well I consider the hack better than this callback.

The hack Daniel came up with allows us to validate driver behavior. In 
other words, even if things seem to work, we get a warning if drivers do 
something they are not supposed to do.

>> With this here you are actually breaking this because now drivers have
>> influence again on when stuff is scheduled.
>
> I'm not breaking anything. This flow control mechanism is already in the
> scheduler, it's just that it's inaccurate because drivers need to 
> calculate
> every job with the worst case credit amount. I'm just making it 
> accurate, such
> that it becomes useful for some drivers.

Well, what's wrong with assuming the worst?

> I also do not agree that this breaks anything dma-fence related in 
> general. It
> doesn't give drivers control of *when* stuff is scheduled. It gives the
> scheduler an idea of *if* it can schedule a job without overflowing 
> the ring
> buffer. The influence of when comes through the hardware finishing a 
> previous
> job and the corresponding dma-fence callback creating the 
> corresponding space
> in terms of credit capacity of the scheduler.

That's a really good point, but what if the driver keeps telling the 
scheduler that it can't execute the job?

>> [SNIP]
>>
>> > However, I don't see why we want to be less accurate as we could be 
>> and hence
>> > risk issues in drivers like PowerVR that are hard to debug.
>> > > AFAICT, whether we could potentially see a ring run dry because 
>> of this depends
>> > on the workload pattern submitted by applications. Breaking down 
>> such observed
>> > performance issues to the scheduler not being accurate at this 
>> point in a few
>> > month or years would be an absurd effort. So why not just rule this 
>> out in
>> > advance?
>>
>> Because correct dma-fence signaling is more important than driver
>> performance.
>
> You mean because drivers could return more credits than the scheduler has
> capacity and then we WARN() but are still stuck?

Yes, exactly that's my concern here.

> The easy fix would be to still WARN() but set the job's credits to the 
> maximum
> credit size the scheduler can handle.

But that then has the potential of overflowing the ring buffer.

As far as I can see when the driver does something wrong there is no way 
for the scheduler to handle that gracefully. Either we don't fulfill the 
driver's wish to have at least X credits available before executing the 
job or we ignore the fact that we need to guarantee forward progress.

>>
>> Those signaling bugs are even *much* more problematic than anything the
>> driver can come up with.
>>
>> Please see here for an example why this is so problematic: 
>> https://patches.linaro.org/project/linux-media/patch/20200612070623.1778466-1-daniel.vetter@ffwll.ch/
>>
>> We basically said that all dependencies the job submission is based 
>> on is
>> expressed as a dma-fence, because dma-fences can be proven to be 
>> correct.
>
> I'm pretty sure that it's also possible to prove three lines of code 
> updating
> the job's credit count to be correct...
>
> Maybe we're better off using our energy for that instead?

Yeah, those three lines inside the scheduler are indeed correct but I 
can't validate what the driver is doing.

As soon as we have this callback the calculation to come up with the 
credits is outside of the scheduler.

>> When we add something like this callback we circumvent this whole 
>> checking.
>>
>> Maybe we can approach this from a completely different side. 
>> Basically we
>> have static and dynamic dependencies for a job.
>>
>> The static ones are expressed in the dependencies xarray while the 
>> dynamic
>> ones are returned by the prepare callback.
>>
>> What if we completely remove the credits or ring buffer handling from 
>> the
>> scheduler and put it into a separate component which drivers can use a
>> prepare callback?
>
> While I honestly appreciate you making a proposal to move forward, I 
> really
> think that this makes everything you're concerned about just worse and 
> even
> starts me getting concerned about it.
>
> 1) We'd trade the complxity of three lines of code to update the 
> credit count
> with a whole new component.
>
> 2) You mentioned that the concerns about the prepare callback were 
> proven to be
> correct, now you propose to extend it even.

The difference is that Daniel has found a way to validate the prepare 
callback functionality.

When the driver gives us a fence which never signals we get a warning; 
when the driver allocates a new fence we get a warning, etc.

We have found tons of bugs around this inside the drivers, ranging from 
allocating memory to taking locks in the incorrect order. And we 
haven't even fixed all of them yet; I can still point you to code where 
amdgpu, for example, is still broken. It's just that this code path 
isn't exercised that often, so we can live with it for now.

I really don't want to repeat that stunt with this callback again. At 
least not for something which is only nice to have.

> The update_job_credits() has a very limited scope and hence is way 
> easier to
> validate for correctness. The only thing we really need to do is to 
> guarantee
> forward progress, meaning we just don't allow the job's credits to 
> exceed the
> schedulers credit capacity. And that should be it.
>
> 3) You seem to imply that you want to use dma-fences to do job-flow 
> control,
> which as I mentioned above has way more complexity (separately 
> allocated object,
> locking restrictions, etc.) and hence is more error prone.
>
> 4) Why would we remove something from the scheduler which can be 
> considered to
> be one of its core responsibilities?

Yeah, good points. I wanted to at least throw out some ideas on how to 
solve this instead of just bluntly rejecting it.

Regards,
Christian.

>
>
>>
>> Regards,
>> Christian.
>>
>> > > Again, if this callback would introduce any complexity or 
>> tradeoffs for existing
>> > drivers, I'd vote to wait until we actually see issues as well. But 
>> this simply
>> > isn't the case. And if you think otherwise, please show me the 
>> complexity it
>> > introduces that is concerning and the tradeoffs for existing drivers.
>> > > > > > > > If this here has some measurable positive effect then 
>> yeah we should
>> > > > > > > probably do it, but as long as it's only nice to have I 
>> have some objections
>> > > > > > > to that.
>> > > > > > Can't answer this, since Nouveau doesn't support native 
>> fence waits. However, I
>> > > > > > guess it depends on how many native fences a job carries. 
>> So, if we'd have two
>> > > > > > jobs with each of them carrying a lot of native fences, but 
>> not a lot of actual
>> > > > > > work, I can very well imagine that over-accounting can have 
>> a measureable
>> > > > > > impact.
>> > > > > What I can imagine as well is things like the hardware or 
>> firmware is
>> > > > > looking at the fullness of the ring buffer to predict how 
>> much pending work
>> > > > > there is.
>> > > > > > > > > > I just wonder if we really want to ask for real 
>> measurements given that the
>> > > > > > optimization is rather trivial and cheap and we already 
>> have enough evidence
>> > > > > > that it makes sense conceptually.
>> > > > > Well that's the point I disagree on, this callback isn't 
>> trivial. If (for
>> > > > > example) the driver returns a value larger than initially 
>> estimated for the
>> > > > > job we can run into an endless loop.
>> > > > I agree it doesn't make sense to increase, but it wouldn't 
>> break anything,
>> > > > unless the job size starts exceeding the capacity of the ring. 
>> And this case is
>> > > > handled anyway.
>> > > > > > > > It's just one more thing which can go boom. At bare 
>> minimum we should check
>> > > > > that the value is always decreasing.
>> > > > Considering the above I still agree, such a check makes sense - 
>> gonna add one.
>> > > Please don't send anything out until we have solved this.
>> > I did, but don't worry about this, it helped me to finalize the 
>> rest of the
>> > patch. We can keep discussing this in this thread either way.
>> > > > So far I haven't seen any argument which would not let me 
>> reject this and I
>> > > don't want to waste your time.
>> > > > > Regards,
>> > > Christian.
>> > > > > > - Danilo
>> > > > > > > > Christian.
>> > > > > > > > > > - Danilo
>> > > > > > > > > > > > Regards,
>> > > > > > > Christian.
>> > > > > > > > > > > > > > Regards,
>> > > > > > > > > > > > > > > Boris
>
  
Danilo Krummrich Nov. 7, 2023, 1:03 p.m. UTC | #29
On Tue, Nov 07, 2023 at 10:13:37AM +0100, Christian König wrote:
> Am 06.11.23 um 17:46 schrieb Danilo Krummrich:
> > On Fri, Nov 03, 2023 at 09:15:25AM +0100, Christian König wrote:
> > > Am 02.11.23 um 19:03 schrieb Danilo Krummrich:
> > > > On Thu, Nov 02, 2023 at 11:07:32AM +0100, Christian König wrote:
> > > > > Hi Danilo,
> > > > > > > Am 31.10.23 um 16:01 schrieb Danilo Krummrich:
> > > > > > On Tue, Oct 31, 2023 at 02:20:50PM +0100, Christian König wrote:
> > > > > > > [SNIP]
> > > > > > > Yeah, I see the intention here and can perfectly relate to
> > > it it's just that
> > > > > > > I have prioritize other things.
> > > > > > I don't see any work being required from your side for this.
> > > > > What I wanted to say is that I understand your intentions and
> > > can relate to
> > > > > that, but other aspects have higher priority in this discussion.
> > > > What aspects would that be?
> > > 
> > > Be defensive. As far as I can see this callback is only nice to have
> > > and not
> > > functionally mandatory.
> > 
> > I gave you quite some reasons why this is desired.
> 
> Yeah, but I need something to make it necessary. Desired is just not enough
> in this case.

You can't really say that *and*...

> 
> > And generally speaking, even if something is not functionally mandatory,
> > rejecting it *only* because of that isn't a good idea.
> 
> Completely agree.

...agree here. That is entirely contradictory.

> But what makes this at least questionable is the
> combination of not mandatory and giving drivers the opportunity to mess with
> submissions again.

We already covered this below: I explained why it doesn't, and you agreed.

> 
> Especially in the scheduler and dma_fence handling we had tons of patches
> which added a nice to have and seemingly harmless features which later
> turned into a complete nightmare to maintain.

That's a pseudo argument; you could use it to randomly reject anything.

The fact is that we need to get a single integer check right; I fail to see how
that will turn into a nightmare to maintain.

> 
> The takeaway is that we need more big red warning signs in the form of
> documentation and try to not change things just because it makes them look
> better.

Documentation would be great indeed. You might be happy to hear that this patch
actually comes with documentation on how the job-flow control of the scheduler
works and what users can expect. Once merged, this will be the best documented
part of the scheduler...

> 
> > It is better to deal with it in terms of content and consider its pros
> > and cons.
> > 
> > > 
> > > And in such cases I have to weight between complexity by adding
> > > something
> > > which might go boom and being conservative and using well known and
> > > working
> > > code paths.
> > > 
> > > > Not breaking other drivers? - The callback is optional, and
> > > besides that, as
> > > > already mentioned, the callback doesn't do anything that can't
> > > already go wrong
> > > > with the inital credit value from drm_sched_job_init().
> > > 
> > > During drm_sched_job_init() time the fence of this job isn't
> > > published yet.
> > > So when the driver specified something invalid we can easily return
> > > an error
> > > code and abort.
> > > 
> > > Inside the scheduler we can't do anything like this. E.g. what
> > > happens if
> > > the driver suddenly returns a value which is to large? We can't
> > > reject that.
> > 
> > This is all correct and I recognize that. But again, the callback is
> > optional.
> > I don't see how drivers would be affected not opting in for this feature.
> 
> I see this actually as cons argument for the change. If many drivers would
> use this then the code path would be much better tested.

It's a single integer check. Again, I'm pretty sure we can get this right. We can
even add a few test cases to validate this code path and the credit handling in
general - I'd offer that.

> 
> > 
> > And for drivers which do, they'd have the same problem (if you actually
> > want to
> > call it one) doing some driver specific workaround as well. And because
> > they'd
> > not have the correct hook likely many more.
> > 
> > > 
> > > > Keeping things simple? - The workaround PowerVR is considering instead
> > > > (returning a dma-fence in ->prepare_job()) doesn't seem to
> > > contribute to this
> > > > goal.
> > > 
> > > I don't see this as a workaround, this is essentially the desired
> > > solution.
> > > All dependencies of a job should be represented as a dma-fence if
> > > possible.
> > 
> > So, you're saying that given we have an inaccurate way of tracking
> > credits
> > (because you're rejecting the callback making it accurate), the desired
> > solution
> > would be to use an existing callback with a whole different, more
> > complicated
> > and hence more error prone approach to overcome that?
> > 
> > > 
> > > The background is that dma-fences have a very well defined semantics
> > > which
> > > through a rather elaborated hack is validated to the extend that
> > > lockdep can
> > > check if drivers behave correctly or not: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L194
> > 
> > I fail to see what dma-fences have to do with job-flow control of the
> > ring
> > buffer. You mention it to be a plus that there is a hack to be able to
> > validate
> > dma-fences with lockdep, but you miss the fact that the only reason we'd
> > need
> > this in the first place is that you think it is desired to do job-flow
> > control
> > with dma-fences.
> 
> Well I consider the hack better than this callback.
> 
> The hack Daniel came up with allows us to validate driver behavior. In other
> words even if things seems to work we get a warning if drivers do something
> they are not supposed to do.

You insist on this validation argument, but you ignore that we can easily
validate the credit count as well. There is absolutely no difference.

> 
> > > With this here you are actually breaking this because now drivers have
> > > influence again on when stuff is scheduled.
> > 
> > I'm not breaking anything. This flow control mechanism is already in the
> > scheduler, it's just that it's inaccurate because drivers need to
> > calculate
> > every job with the worst case credit amount. I'm just making it
> > accurate, such
> > that it becomes useful for some drivers.
> 
> Well, what's wrong with assuming the worst?

Please don't ask this question; I suspect you already know the answer. We had a
long discussion about it before I sent this patch.

Otherwise, it's in the commit message.
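
For illustration (numbers invented): with a 1024-entry ring and a worst-case job
size of 512 entries, worst-case accounting allows at most two jobs in flight; if
the jobs actually submitted take only 8 entries each, the ring sits at roughly
1.5% utilization while further submissions block - exactly the "ring runs dry"
scenario from the commit message.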

> 
> > I also do not agree that this breaks anything dma-fence related in
> > general. It
> > doesn't give drivers control of *when* stuff is scheduled. It gives the
> > scheduler an idea of *if* it can schedule a job without overflowing the
> > ring
> > buffer. The influence of when comes through the hardware finishing a
> > previous
> > job and the corresponding dma-fence callback creating the corresponding
> > space
> > in terms of credit capacity of the scheduler.
> 
> That's a really good point, but what if the driver keeps telling the
> scheduler that it can't execute the job?

Well, the driver can't tell the scheduler to simply not execute a job.

> 
> > > [SNIP]
> > > 
> > > > However, I don't see why we want to be less accurate as we could
> > > be and hence
> > > > risk issues in drivers like PowerVR that are hard to debug.
> > > > > AFAICT, whether we could potentially see a ring run dry because
> > > of this depends
> > > > on the workload pattern submitted by applications. Breaking down
> > > such observed
> > > > performance issues to the scheduler not being accurate at this
> > > point in a few
> > > > month or years would be an absurd effort. So why not just rule
> > > this out in
> > > > advance?
> > > 
> > > Because correct dma-fence signaling is more important than driver
> > > performance.
> > 
> > You mean because drivers could return more credits than the scheduler has
> > capacity and then we WARN() but are still stuck?
> 
> Yes, exactly that's my concern here.

Cool, we can easily rule this one out then. See below.

> 
> > The easy fix would be to still WARN() but set the job's credits to the
> > maximum
> > credit size the scheduler can handle.
> 
> But that has then the potential for overflowing the ring buffer.
> 
> As far as I can see when the driver does something wrong there is no way for
> the scheduler to handle that gracefully. Either we don't fulfill the
> driver's wish to have at least X credits available before executing the job
> or we ignore the fact that we need to guarantee forward progress.

Sure, but this will *always* be the case, no matter what primitive (dma-fences
or whatnot) you're going to use for handling / tracking job credits.

If the driver doesn't stick to its own bounds in this case, that is and always
will be a driver bug we simply can't prevent.

We can only decide how we handle it, and the above strategy seems reasonable:
WARN about it and guarantee forward progress. If it leads to the driver
overflowing its ring buffer, that's not critical for the kernel itself; the
driver is already buggy, and thanks to the warning we know exactly that and why.
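
Concretely, a minimal sketch of what that strategy could look like in
drm_sched_can_queue(), using the names this patch introduces; the clamping is
the proposed addition and not what v3 currently does:

        if (sched->ops->update_job_credits) {
                u32 credits = sched->ops->update_job_credits(s_job);

                if (credits)
                        s_job->credits = credits;
        }

        /* Driver bug: make it visible, but clamp so the scheduler can still
         * guarantee forward progress.
         */
        if (WARN_ON(s_job->credits > sched->credit_limit))
                s_job->credits = sched->credit_limit;

        return drm_sched_available_credits(sched) >= s_job->credits;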

> 
> > > 
> > > Those signaling bugs are even *much* more problematic than anything the
> > > driver can come up with.
> > > 
> > > Please see here for an example why this is so problematic: https://patches.linaro.org/project/linux-media/patch/20200612070623.1778466-1-daniel.vetter@ffwll.ch/
> > > 
> > > We basically said that all dependencies the job submission is based
> > > on is
> > > expressed as a dma-fence, because dma-fences can be proven to be
> > > correct.
> > 
> > I'm pretty sure that it's also possible to prove three lines of code
> > updating
> > the job's credit count to be correct...
> > 
> > Maybe we're better off using our energy for that instead?
> 
> Yeah, those three lines inside the scheduler are indeed correct but I can't
> validate what the driver is doing.

Surely we can validate it: we can check whether the value is within its bounds
and, if it isn't, WARN about it and handle it.

We just can't control what the driver is doing. But that's always true, also
with dma-fences.

> 
> As soon as we have this callback the calculation to come up with the credits
> is outside of the scheduler.

Sure, but that's simply the nature of this value. Again, you can use whatever
primitive you want; nothing will change this fact.
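
For illustration, the driver-side part of that calculation could look roughly
like the sketch below. This is hypothetical: struct my_job (embedding struct
drm_sched_job as "base"), initial_credits, nr_signaled_waits and
MY_WAIT_CMD_DWORDS are invented for the example and are not part of this patch.

static u32 my_update_job_credits(struct drm_sched_job *sched_job)
{
        struct my_job *job = container_of(sched_job, struct my_job, base);

        /* Native fence waits that signaled since submission no longer need
         * ring space; recompute from the initial worst-case estimate so the
         * value only ever shrinks and repeated calls stay correct.
         */
        return job->initial_credits -
               job->nr_signaled_waits * MY_WAIT_CMD_DWORDS;
}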

> 
> > > When we add something like this callback we circumvent this whole
> > > checking.
> > > 
> > > Maybe we can approach this from a completely different side.
> > > Basically we
> > > have static and dynamic dependencies for a job.
> > > 
> > > The static ones are expressed in the dependencies xarray while the
> > > dynamic
> > > ones are returned by the prepare callback.
> > > 
> > > What if we completely remove the credits or ring buffer handling
> > > from the
> > > scheduler and put it into a separate component which drivers can use a
> > > prepare callback?
> > 
> > While I honestly appreciate you making a proposal to move forward, I
> > really
> > think that this makes everything you're concerned about just worse and
> > even
> > starts me getting concerned about it.
> > 
> > 1) We'd trade the complxity of three lines of code to update the credit
> > count
> > with a whole new component.
> > 
> > 2) You mentioned that the concerns about the prepare callback were
> > proven to be
> > correct, now you propose to extend it even.
> 
> The difference is that Daniel has found a way to validate the prepare
> callback functionality.

That's not a difference; we can validate an integer as well. See above.

> 
> When the driver gives us a fence which never signals we get a warning, when
> the driver allocates a new fence we get a warning etc...

And when the integer is outside of its bounds we get a warning too. Again, where
is the difference?

> 
> We have found tons of bugs around this inside the drivers which ranges from
> allocating memory to taking locks in incorrect order. And we haven't even
> fixed all of them yet, I can still point you to code where amdgpu for
> example is still broken. It's just that this source path isn't exercised
> this often so we can live with it for now.

None of those bugs are relevant for what I'm doing, nor are they even possible
with what I'm doing.

> 
> I really don't want to repeat that stunt with this callback again. At least
> not for something which is only nice to have.

I recognize that. But saying that the type of callback was the problem here is
unreflective and oversimplified.

The truth is that the problem with prepare_job() was the exploding complexity,
mainly caused by combining it with dma-fences, which by their nature carry a
lot of complexity themselves.

If update_job_credits() had existed before prepare_job(), you'd never have had
any of those issues. And even if there had been an issue, it would have been
trivial to fix. As a result, you wouldn't blame the type of callback.

> 
> > The update_job_credits() has a very limited scope and hence is way
> > easier to
> > validate for correctness. The only thing we really need to do is to
> > guarantee
> > forward progress, meaning we just don't allow the job's credits to
> > exceed the
> > schedulers credit capacity. And that should be it.
> > 
> > 3) You seem to imply that you want to use dma-fences to do job-flow
> > control,
> > which as I mentioned above has way more complexity (separately allocated
> > object,
> > locking restrictions, etc.) and hence is more error prone.
> > 
> > 4) Why would we remove something from the scheduler which can be
> > considered to
> > be one of its core responsibilities?
> 
> Yeah, good points. I wanted to at least throw out some ideas how to solve
> this instead of just bluntly rejecting it.

And I honestly appreciate that.

Now, since none of the arguments against this seem to hold, can we go ahead?

To summarize:

- it serves a purpose
- it's very simple (not a lot of code and not much complexity)
- we can validate it (detecting and warning about driver bugs)
- we can guarantee forward progress (don't void fencing rules)
- we can even test it (I offer to write the corresponding tests)
- it comes with proper documentation

All of those points got a proper explanation throughout this mail thread;
hence, should you disagree, please answer at the corresponding location in the
thread. Please abstain from re-starting the discussion here.

> 
> Regards,
> Christian.
> 
> > 
> > 
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > > Again, if this callback would introduce any complexity or
> > > tradeoffs for existing
> > > > drivers, I'd vote to wait until we actually see issues as well.
> > > But this simply
> > > > isn't the case. And if you think otherwise, please show me the
> > > complexity it
> > > > introduces that is concerning and the tradeoffs for existing drivers.
> > > > > > > > > > If this here has some measurable positive effect then
> > > yeah we should
> > > > > > > > > probably do it, but as long as it's only nice to have I
> > > have some objections
> > > > > > > > > to that.
> > > > > > > > Can't answer this, since Nouveau doesn't support native
> > > fence waits. However, I
> > > > > > > > guess it depends on how many native fences a job carries.
> > > So, if we'd have two
> > > > > > > > jobs with each of them carrying a lot of native fences,
> > > but not a lot of actual
> > > > > > > > work, I can very well imagine that over-accounting can
> > > have a measureable
> > > > > > > > impact.
> > > > > > > What I can imagine as well is things like the hardware or
> > > firmware is
> > > > > > > looking at the fullness of the ring buffer to predict how
> > > much pending work
> > > > > > > there is.
> > > > > > > > > > > > I just wonder if we really want to ask for real
> > > measurements given that the
> > > > > > > > optimization is rather trivial and cheap and we already
> > > have enough evidence
> > > > > > > > that it makes sense conceptually.
> > > > > > > Well that's the point I disagree on, this callback isn't
> > > trivial. If (for
> > > > > > > example) the driver returns a value larger than initially
> > > estimated for the
> > > > > > > job we can run into an endless loop.
> > > > > > I agree it doesn't make sense to increase, but it wouldn't
> > > break anything,
> > > > > > unless the job size starts exceeding the capacity of the ring.
> > > And this case is
> > > > > > handled anyway.
> > > > > > > > > > It's just one more thing which can go boom. At bare
> > > minimum we should check
> > > > > > > that the value is always decreasing.
> > > > > > Considering the above I still agree, such a check makes sense
> > > - gonna add one.
> > > > > Please don't send anything out until we have solved this.
> > > > I did, but don't worry about this, it helped me to finalize the
> > > rest of the
> > > > patch. We can keep discussing this in this thread either way.
> > > > > > So far I haven't seen any argument which would not let me
> > > reject this and I
> > > > > don't want to waste your time.
> > > > > > > Regards,
> > > > > Christian.
> > > > > > > > - Danilo
> > > > > > > > > > Christian.
> > > > > > > > > > > > - Danilo
> > > > > > > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > > Boris
> > 
>
  

Patch

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index 602010cb6894..acc5901ac840 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,6 +552,12 @@  Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Overview
 
+Flow Control
+------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Flow Control
+
 Scheduler Function References
 -----------------------------
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 1f357198533f..62bb7fc7448a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -115,7 +115,7 @@  int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	if (!entity)
 		return 0;
 
-	return drm_sched_job_init(&(*job)->base, entity, owner);
+	return drm_sched_job_init(&(*job)->base, entity, 1, owner);
 }
 
 int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
index 2416c526f9b0..3d0f8d182506 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
@@ -535,7 +535,7 @@  int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
 
 	ret = drm_sched_job_init(&submit->sched_job,
 				 &ctx->sched_entity[args->pipe],
-				 submit->ctx);
+				 1, submit->ctx);
 	if (ret)
 		goto err_submit_put;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 23a6276f1332..cdcb37ff62c3 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -123,7 +123,7 @@  int lima_sched_task_init(struct lima_sched_task *task,
 	for (i = 0; i < num_bos; i++)
 		drm_gem_object_get(&bos[i]->base.base);
 
-	err = drm_sched_job_init(&task->base, &context->base, vm);
+	err = drm_sched_job_init(&task->base, &context->base, 1, vm);
 	if (err) {
 		kfree(task->bos);
 		return err;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 99744de6c05a..c002cabe7b9c 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -48,7 +48,7 @@  static struct msm_gem_submit *submit_create(struct drm_device *dev,
 		return ERR_PTR(ret);
 	}
 
-	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
+	ret = drm_sched_job_init(&submit->base, queue->entity, 1, queue);
 	if (ret) {
 		kfree(submit->hw_fence);
 		kfree(submit);
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 7e64b5ef90fb..1b2cc3f2e1c7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -89,7 +89,7 @@  nouveau_job_init(struct nouveau_job *job,
 
 	}
 
-	ret = drm_sched_job_init(&job->base, &entity->base, NULL);
+	ret = drm_sched_job_init(&job->base, &entity->base, 1, NULL);
 	if (ret)
 		goto err_free_chains;
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index b834777b409b..54d1c19bea84 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -274,7 +274,7 @@  static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
 
 	ret = drm_sched_job_init(&job->base,
 				 &file_priv->sched_entity[slot],
-				 NULL);
+				 1, NULL);
 	if (ret)
 		goto out_put_job;
 
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
index 3143ecaaff86..f8ed093b7356 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler_trace.h
@@ -51,7 +51,7 @@  DECLARE_EVENT_CLASS(drm_sched_job,
 			   __assign_str(name, sched_job->sched->name);
 			   __entry->job_count = spsc_queue_count(&entity->job_queue);
 			   __entry->hw_job_count = atomic_read(
-				   &sched_job->sched->hw_rq_count);
+				   &sched_job->sched->credit_count);
 			   ),
 	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
 		      __entry->entity, __entry->id,
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 409e4256f6e7..b79e0672b94f 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -370,7 +370,7 @@  static void drm_sched_entity_wakeup(struct dma_fence *f,
 		container_of(cb, struct drm_sched_entity, cb);
 
 	drm_sched_entity_clear_dep(f, cb);
-	drm_sched_wakeup_if_can_queue(entity->rq->sched);
+	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
 }
 
 /**
@@ -602,7 +602,7 @@  void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
 			drm_sched_rq_update_fifo(entity, submit_ts);
 
-		drm_sched_wakeup_if_can_queue(entity->rq->sched);
+		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
 	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 246213963928..3cc3dc0091fc 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -48,6 +48,30 @@ 
  * through the jobs entity pointer.
  */
 
+/**
+ * DOC: Flow Control
+ *
+ * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
+ * in which the jobs fetched from scheduler entities are executed.
+ *
+ * In this context the &drm_gpu_scheduler keeps track of a driver specified
+ * credit limit representing the capacity of this scheduler and a credit count;
+ * every &drm_sched_job carries a driver specified number of credits.
+ *
+ * Once a job is executed (but not yet finished), the job's credits contribute
+ * to the scheduler's credit count until the job is finished. If by executing
+ * one more job the scheduler's credit count would exceed the scheduler's
+ * credit limit, the job won't be executed. Instead, the scheduler will wait
+ * until the credit count has decreased enough to not overflow its credit limit.
+ * This implies waiting for previously executed jobs.
+ *
+ * Optionally, drivers can register a callback (update_job_credits) provided by
+ * struct drm_sched_backend_ops to update the job's credits dynamically. The
+ * scheduler will execute this callback every time the scheduler considers a job
+ * for execution and subsequently checks whether the job fits the scheduler's
+ * credit limit.
+ */
+
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/completion.h>
@@ -75,6 +99,46 @@  int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
 MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
 module_param_named(sched_policy, drm_sched_policy, int, 0444);
 
+static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
+{
+	u32 credits;
+
+	WARN_ON(check_sub_overflow(sched->credit_limit,
+				   atomic_read(&sched->credit_count),
+				   &credits));
+
+	return credits;
+}
+
+/**
+ * drm_sched_can_queue -- Can we queue more to the hardware?
+ * @sched: scheduler instance
+ * @entity: the scheduler entity
+ *
+ * Return true if we can push at least one more job from @entity, false
+ * otherwise.
+ */
+static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
+				struct drm_sched_entity *entity)
+{
+	struct drm_sched_job *s_job;
+
+	s_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
+	if (!s_job)
+		return false;
+
+	if (sched->ops->update_job_credits) {
+		u32 credits = sched->ops->update_job_credits(s_job);
+
+		if (credits)
+			s_job->credits = credits;
+	}
+
+	WARN_ON(s_job->credits > sched->credit_limit);
+
+	return drm_sched_available_credits(sched) >= s_job->credits;
+}
+
 static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
 							    const struct rb_node *b)
 {
@@ -186,12 +250,14 @@  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 /**
  * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
  *
+ * @sched: the gpu scheduler
  * @rq: scheduler run queue to check.
  *
  * Try to find a ready entity, returns NULL if none found.
  */
 static struct drm_sched_entity *
-drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
+drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
+			      struct drm_sched_rq *rq)
 {
 	struct drm_sched_entity *entity;
 
@@ -201,6 +267,12 @@  drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 	if (entity) {
 		list_for_each_entry_continue(entity, &rq->entities, list) {
 			if (drm_sched_entity_is_ready(entity)) {
+				/* If we can't queue yet, preserve the current
+				 * entity in terms of fairness.
+				 */
+				if (!drm_sched_can_queue(sched, entity))
+					goto out;
+
 				rq->current_entity = entity;
 				reinit_completion(&entity->entity_idle);
 				spin_unlock(&rq->lock);
@@ -210,8 +282,13 @@  drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 	}
 
 	list_for_each_entry(entity, &rq->entities, list) {
-
 		if (drm_sched_entity_is_ready(entity)) {
+			/* If we can't queue yet, preserve the current entity in
+			 * terms of fairness.
+			 */
+			if (!drm_sched_can_queue(sched, entity))
+				goto out;
+
 			rq->current_entity = entity;
 			reinit_completion(&entity->entity_idle);
 			spin_unlock(&rq->lock);
@@ -222,20 +299,22 @@  drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 			break;
 	}
 
+out:
 	spin_unlock(&rq->lock);
-
 	return NULL;
 }
 
 /**
  * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
  *
+ * @sched: the gpu scheduler
  * @rq: scheduler run queue to check.
  *
  * Find oldest waiting ready entity, returns NULL if none found.
  */
 static struct drm_sched_entity *
-drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
+drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
+				struct drm_sched_rq *rq)
 {
 	struct rb_node *rb;
 
@@ -245,6 +324,15 @@  drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 
 		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
 		if (drm_sched_entity_is_ready(entity)) {
+			/* If we can't queue yet, don't pick another entity
+			 * which's job might fit and wait until we got space for
+			 * this one in terms of fairness.
+			 */
+			if (!drm_sched_can_queue(sched, entity)) {
+				rb = NULL;
+				break;
+			}
+
 			rq->current_entity = entity;
 			reinit_completion(&entity->entity_idle);
 			break;
@@ -266,18 +354,6 @@  static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
 		queue_work(sched->submit_wq, &sched->work_run_job);
 }
 
-/**
- * drm_sched_can_queue -- Can we queue more to the hardware?
- * @sched: scheduler instance
- *
- * Return true if we can push more jobs to the hw, otherwise false.
- */
-static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
-{
-	return atomic_read(&sched->hw_rq_count) <
-		sched->hw_submission_limit;
-}
-
 /**
  * drm_sched_select_entity - Select next entity to process
  *
@@ -291,14 +367,11 @@  drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 	struct drm_sched_entity *entity;
 	int i;
 
-	if (!drm_sched_can_queue(sched))
-		return NULL;
-
 	/* Kernel run queue has higher priority than normal run queue*/
 	for (i = sched->num_rqs - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
-			drm_sched_rq_select_entity_fifo(sched->sched_rq[i]) :
-			drm_sched_rq_select_entity_rr(sched->sched_rq[i]);
+			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
+			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
 		if (entity)
 			break;
 	}
@@ -353,7 +426,7 @@  static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 	struct drm_sched_fence *s_fence = s_job->s_fence;
 	struct drm_gpu_scheduler *sched = s_fence->sched;
 
-	atomic_dec(&sched->hw_rq_count);
+	atomic_sub(s_job->credits, &sched->credit_count);
 	atomic_dec(sched->score);
 
 	trace_drm_sched_process_job(s_fence);
@@ -576,7 +649,7 @@  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 					      &s_job->cb)) {
 			dma_fence_put(s_job->s_fence->parent);
 			s_job->s_fence->parent = NULL;
-			atomic_dec(&sched->hw_rq_count);
+			atomic_sub(s_job->credits, &sched->credit_count);
 		} else {
 			/*
 			 * remove job from pending_list.
@@ -637,7 +710,7 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
 		struct dma_fence *fence = s_job->s_fence->parent;
 
-		atomic_inc(&sched->hw_rq_count);
+		atomic_add(s_job->credits, &sched->credit_count);
 
 		if (!full_recovery)
 			continue;
@@ -718,6 +791,8 @@  EXPORT_SYMBOL(drm_sched_resubmit_jobs);
  * drm_sched_job_init - init a scheduler job
  * @job: scheduler job to init
  * @entity: scheduler entity to use
+ * @credits: the number of credits this job contributes to the schedulers
+ * credit limit
  * @owner: job owner for debugging
  *
  * Refer to drm_sched_entity_push_job() documentation
@@ -735,7 +810,7 @@  EXPORT_SYMBOL(drm_sched_resubmit_jobs);
  */
 int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
-		       void *owner)
+		       u32 credits, void *owner)
 {
 	if (!entity->rq) {
 		/* This will most likely be followed by missing frames
@@ -752,6 +827,7 @@  int drm_sched_job_init(struct drm_sched_job *job,
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&job->list);
+	job->credits = credits ? credits : 1;
 
 	xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
 
@@ -961,12 +1037,14 @@  EXPORT_SYMBOL(drm_sched_job_cleanup);
 /**
  * drm_sched_wakeup_if_can_queue - Wake up the scheduler
  * @sched: scheduler instance
+ * @entity: the scheduler entity
  *
  * Wake up the scheduler if we can queue jobs.
  */
-void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
+void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
+				   struct drm_sched_entity *entity)
 {
-	if (drm_sched_can_queue(sched))
+	if (drm_sched_can_queue(sched, entity))
 		drm_sched_run_job_queue(sched);
 }
 
@@ -1104,7 +1182,7 @@  static void drm_sched_run_job_work(struct work_struct *w)
 
 	s_fence = sched_job->s_fence;
 
-	atomic_inc(&sched->hw_rq_count);
+	atomic_add(sched_job->credits, &sched->credit_count);
 	drm_sched_job_begin(sched_job);
 
 	trace_drm_run_job(sched_job, entity);
@@ -1139,7 +1217,7 @@  static void drm_sched_run_job_work(struct work_struct *w)
  * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
  *	       allocated and used
  * @num_rqs: number of runqueues, one for each priority, up to DRM_SCHED_PRIORITY_COUNT
- * @hw_submission: number of hw submissions that can be in flight
+ * @credit_limit: the number of credits this scheduler can hold from all jobs
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
  * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
@@ -1153,14 +1231,14 @@  static void drm_sched_run_job_work(struct work_struct *w)
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
 		   struct workqueue_struct *submit_wq,
-		   u32 num_rqs, unsigned hw_submission, unsigned hang_limit,
+		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev)
 {
 	int i, ret;
 
 	sched->ops = ops;
-	sched->hw_submission_limit = hw_submission;
+	sched->credit_limit = credit_limit;
 	sched->name = name;
 	if (submit_wq) {
 		sched->submit_wq = submit_wq;
@@ -1211,7 +1289,7 @@  int drm_sched_init(struct drm_gpu_scheduler *sched,
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
-	atomic_set(&sched->hw_rq_count, 0);
+	atomic_set(&sched->credit_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
 	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
index 2e94ce788c71..8479e5302f7b 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -417,7 +417,7 @@  v3d_job_init(struct v3d_dev *v3d, struct drm_file *file_priv,
 	job->free = free;
 
 	ret = drm_sched_job_init(&job->base, &v3d_priv->sched_entity[queue],
-				 v3d_priv);
+				 1, v3d_priv);
 	if (ret)
 		goto fail;
 
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e5a6166eb152..a911b3f5917e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -321,6 +321,7 @@  struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * @sched: the scheduler instance on which this job is scheduled.
  * @s_fence: contains the fences for the scheduling of job.
  * @finish_cb: the callback for the finished fence.
+ * @credits: the number of credits this job contributes to the scheduler
  * @work: Helper to reschdeule job kill to different context.
  * @id: a unique id assigned to each job scheduled on the scheduler.
  * @karma: increment on every hang caused by this job. If this exceeds the hang
@@ -340,6 +341,8 @@  struct drm_sched_job {
 	struct drm_gpu_scheduler	*sched;
 	struct drm_sched_fence		*s_fence;
 
+	u32				credits;
+
 	/*
 	 * work is used only after finish_cb has been used and will not be
 	 * accessed anymore.
@@ -463,13 +466,29 @@  struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @update_job_credits: Called once the scheduler is considering this
+	 * job for execution.
+	 *
+	 * Drivers may use this to update the job's submission credits, which is
+	 * useful to e.g. deduct the number of native fences which have been
+	 * signaled meanwhile.
+	 *
+	 * The callback must either return the new number of submission credits
+	 * for the given job, or zero if no update is required.
+	 *
+	 * This callback is optional.
+	 */
+	u32 (*update_job_credits)(struct drm_sched_job *sched_job);
 };
 
 /**
  * struct drm_gpu_scheduler - scheduler instance-specific data
  *
  * @ops: backend operations provided by the driver.
- * @hw_submission_limit: the max size of the hardware queue.
+ * @credit_limit: the credit limit of this scheduler
+ * @credit_count: the current credit count of this scheduler
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
@@ -478,7 +497,6 @@  struct drm_sched_backend_ops {
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
- * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
  * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
  * @timeout_wq: workqueue used to queue @work_tdr
@@ -502,13 +520,13 @@  struct drm_sched_backend_ops {
  */
 struct drm_gpu_scheduler {
 	const struct drm_sched_backend_ops	*ops;
-	uint32_t			hw_submission_limit;
+	u32				credit_limit;
+	atomic_t			credit_count;
 	long				timeout;
 	const char			*name;
 	u32                             num_rqs;
 	struct drm_sched_rq             **sched_rq;
 	wait_queue_head_t		job_scheduled;
-	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
 	struct workqueue_struct		*submit_wq;
 	struct workqueue_struct		*timeout_wq;
@@ -530,14 +548,14 @@  struct drm_gpu_scheduler {
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
 		   struct workqueue_struct *submit_wq,
-		   u32 num_rqs, uint32_t hw_submission, unsigned hang_limit,
+		   u32 num_rqs, u32 credit_limit, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev);
 
 void drm_sched_fini(struct drm_gpu_scheduler *sched);
 int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
-		       void *owner);
+		       u32 credits, void *owner);
 void drm_sched_job_arm(struct drm_sched_job *job);
 int drm_sched_job_add_dependency(struct drm_sched_job *job,
 				 struct dma_fence *fence);
@@ -559,7 +577,8 @@  void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 
 void drm_sched_tdr_queue_imm(struct drm_gpu_scheduler *sched);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
-void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
+void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched,
+				   struct drm_sched_entity *entity);
 bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
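
For reference, a driver opting in to per-job accounting would use the new
parameters roughly as sketched below. This is a hypothetical driver: all my_*
names, size_dw, num_dw, MY_HANG_LIMIT and MY_TIMEOUT_MS are invented; only the
drm_sched_* signatures and the update_job_credits hook are the ones introduced
by this patch.

static const struct drm_sched_backend_ops my_sched_ops = {
        .run_job                = my_run_job,
        .timedout_job           = my_timedout_job,
        .free_job               = my_free_job,
        /* Optional: refine the job's credits right before execution. */
        .update_job_credits     = my_update_job_credits,
};

/* Size the scheduler by what the ring can actually hold ... */
ret = drm_sched_init(&ring->sched, &my_sched_ops, NULL,
                     DRM_SCHED_PRIORITY_COUNT, ring->size_dw,
                     MY_HANG_LIMIT, msecs_to_jiffies(MY_TIMEOUT_MS),
                     NULL, NULL, ring->name, dev);

/* ... and account each job with the ring space it will actually consume. */
ret = drm_sched_job_init(&job->base, entity, job->num_dw, file_priv);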