[net,v1] net/sched: taprio: fix cycle time extension logic

Message ID 20230530082541.495-1-muhammad.husaini.zulkifli@intel.com
State New
Headers
Series [net,v1] net/sched: taprio: fix cycle time extension logic |

Commit Message

Zulkifli, Muhammad Husaini May 30, 2023, 8:25 a.m. UTC
  From: Tan Tee Min <tee.min.tan@linux.intel.com>

According to IEEE Std. 802.1Q-2018 section Q.5 CycleTimeExtension,
the Cycle Time Extension variable allows this extension of the last old
cycle to be done in a defined way. If the last complete old cycle would
normally end less than OperCycleTimeExtension nanoseconds before the new
base time, then the last complete cycle before AdminBaseTime is reached
is extended so that it ends at AdminBaseTime.

This patch extends the last entry of last complete cycle to AdminBaseTime.

Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com>
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
---
 net/sched/sch_taprio.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)
  

Comments

Vladimir Oltean May 30, 2023, 7:47 p.m. UTC | #1
On Tue, May 30, 2023 at 04:25:41PM +0800, Muhammad Husaini Zulkifli wrote:
> From: Tan Tee Min <tee.min.tan@linux.intel.com>
> 
> According to IEEE Std. 802.1Q-2018 section Q.5 CycleTimeExtension,
> the Cycle Time Extension variable allows this extension of the last old
> cycle to be done in a defined way. If the last complete old cycle would
> normally end less than OperCycleTimeExtension nanoseconds before the new
> base time, then the last complete cycle before AdminBaseTime is reached
> is extended so that it ends at AdminBaseTime.
> 
> This patch extends the last entry of last complete cycle to AdminBaseTime.
> 
> Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
> Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com>
> Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
> ---

Thanks for the patch.

I think that the commit message insufficiently describes the problems
with the existing code. Also, the incorrect Fixes: tag suggests it may
even have been incompletely characterized.

Here are some additional thoughts of mine (also insufficient, since they
are based on static code analysis, done now) which may nuance things a
bit more, to change the Fixes tag and the shape of the proposed solution.

Of the problems is that after my commit a1e6ad30fa19 ("net/sched: taprio:
calculate guard band against actual TC gate close time"), the last entry
of the old admin schedule stops being valid from taprio_dequeue_from_txq()'s
perspective.

Before that patch, taprio_dequeue_from_txq() would look at entry->close_time
to decide whether packets are eligible to be sent or not. The old logic
would take a cycle length correction into account like this:

	if (should_change_schedules(admin, oper, close_time)) {
		/* Set things so the next time this runs, the new
		 * schedule runs.
		 */
		close_time = sched_base_time(admin);
		switch_schedules(q, &admin, &oper);
	}

	next->close_time = close_time; // contains the cycle length correction

After that patch, taprio_dequeue_from_txq() -> taprio_entry_allows_tx()
simply does not have logic to take a cycle time correction into consideration;
it just looks at entry->gate_close_time[tc] which is determined from the
previous entry->end_time plus the next->gate_duration[tc]. All
entry->gate_duration[tc] values are calculated only once, during
taprio_calculate_gate_durations(). Nowhere is a dynamic schedule change
taken into account.

Now, taprio_dequeue_from_txq() uses the following "entry" fields:
gate_close_time[tc] and budget[tc]. They are both recalculated
incorrectly by advance_sched(), which your change addresses. That is one
issue which should be described properly, and a patch limited to that.

Sure, you're modifying entry->gate_duration[] to account for the correction,
but now it no longer matches this kind of checks:

	/* Traffic classes which never close have infinite budget */
	if (entry->gate_duration[tc] == sched->cycle_time)
		budget = INT_MAX;

so this is why your choice is to also update the cycle_time. An unfortunate
consequence of that choice is that this might also become transiently visible
to taprio_dump(), which I'm not totally convinced is desirable - because
effectively, the cycle time shouldn't have changed. Although, true, the
old oper schedule is going away soon, we don't really know how soon. The
cycle times can be arbitrarily large. It would be great if we could save
the correction into a dedicated "s64 correction" variable (ranging from
-cycle_time+1 to +cycle_time_extension), and leave the cycle_time alone.

So my patch is partly to blame for today's mishaps, which is something
that this patch fails to identify properly.

But taprio_enqueue() is also a problem, and that isn't addressed. It can be,
that during a corrected cycle, frames which were oversized due to the
queueMaxSDU restriction are transiently not oversized anymore, and
should be allowed to pass - or the other way around (and this is worse):
a frame which would have passed through a full-size window will not pass
through a truncated last cycle (negative correction), and taprio_enqueue()
will not detect this and will not drop the skb.

taprio_update_queue_max_sdu() would need to be called, and that depends
on an up-to-date sched->max_open_gate_duration[tc] which isn't currently
recalculated.

So, one way or another, taprio_calculate_gate_durations() also needs to
be called again after a change to the schedule's correction.

>  net/sched/sch_taprio.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 76db9a10ef504..ef487fef83fce 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -99,6 +99,7 @@ struct taprio_sched {
>  	u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
>  	u32 fp[TC_QOPT_MAX_QUEUE]; /* only for dump and offloading */
>  	u32 txtime_delay;
> +	bool sched_changed;

nitpick: sched_change_pending?

>  };
>  
>  struct __tc_taprio_qopt_offload {
> @@ -934,8 +935,10 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  	admin = rcu_dereference_protected(q->admin_sched,
>  					  lockdep_is_held(&q->current_entry_lock));
>  
> -	if (!oper)
> +	if (!oper || q->sched_changed) {
> +		q->sched_changed = false;
>  		switch_schedules(q, &admin, &oper);
> +	}
>  
>  	/* This can happen in two cases: 1. this is the very first run
>  	 * of this function (i.e. we weren't running any schedule
> @@ -962,20 +965,27 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  	end_time = ktime_add_ns(entry->end_time, next->interval);
>  	end_time = min_t(ktime_t, end_time, oper->cycle_end_time);
>  
> +	if (should_change_schedules(admin, oper, oper->cycle_end_time) &&
> +	    list_is_last(&next->list, &oper->entries)) {
> +		u32 ori_interval = next->interval;

The choice of operations seems unintuitive, when you could have simply
done:

	ktime_t correction = ktime_sub(sched_base_time(admin), end_time);

	next->interval += correction;
	oper->cycle_time += correction;

	(or possibly save the correction instead, see the feedback above)

> +
> +		next->interval += ktime_sub(sched_base_time(admin), end_time);
> +		oper->cycle_time += next->interval - ori_interval;
> +		end_time = sched_base_time(admin);
> +		oper->cycle_end_time = end_time;
> +		q->sched_changed = true;

I see an issue here: you need to set q->sched_changed = true whenever
should_change_schedules() returned true and not just for the last entry,
correct? Because there could be a schedule change which needs to happens
during entry 2 out of 4 of the current one (truncation case - negative
correction). In that case, I believe that should_change_schedules()
keeps shouting at you "change! change! change!" but you only call
switch_schedules() when you've reached entry 4 with the "next" pointer,
which is not what the standard suggests should be done.

IIUC, the standard says that when an admin and an oper sched meet, the
decision of what to do near the AdminBaseTime - whether to elongate the
next-to-last cycle of the oper sched or to let the last cycle run but to
cut it short - depends on the OperCycleTimeExtension. In a nutshell,
that variable says what is the maximum positive correction applicable to
the last sched entry and to the cycle time. If a positive correction
larger than that would be necessary (relative to the next-to-last cycle),
the decision is to just let the last cycle run for how long it can.

> +	}
> +
>  	for (tc = 0; tc < num_tc; tc++) {
> -		if (next->gate_duration[tc] == oper->cycle_time)
> +		if (next->gate_duration[tc] == oper->cycle_time) {
>  			next->gate_close_time[tc] = KTIME_MAX;
> -		else
> +		} else if (q->sched_changed && next->gate_duration[tc]) {
> +			next->gate_close_time[tc] = end_time;
> +			next->gate_duration[tc] = next->interval;

This deserves a comment because, although I understand what it intends
to do, I fail to understand if it will work or not :)

> +		} else {
>  			next->gate_close_time[tc] = ktime_add_ns(entry->end_time,
>  								 next->gate_duration[tc]);
> -	}
> -
> -	if (should_change_schedules(admin, oper, end_time)) {
> -		/* Set things so the next time this runs, the new
> -		 * schedule runs.
> -		 */
> -		end_time = sched_base_time(admin);
> -		switch_schedules(q, &admin, &oper);

I guess this is also a separate issue with the code. switch_schedules()
changes q->oper_sched too early, and taprio_skb_exceeds_queue_max_sdu()
looks at q->oper_sched.  So during an extension period, the queueMaxSDU
of the new schedule will be applied instead of the (extended) old one.

> +		}
>  	}
>  
>  	next->end_time = end_time;
> -- 
> 2.17.1
>

I guess at some point we should also fix up this comment?

	/* FIXME: the IEEE 802.1Q-2018 Specification isn't clear about
	 * how precisely the extension should be made. So after
	 * conformance testing, this logic may change.
	 */
	if (ktime_compare(next_base_time, extension_time) <= 0)
		return true;
  
Tan Tee Min June 2, 2023, 7:14 a.m. UTC | #2
On Tue, May 30, 2023 at 10:47:08PM +0300, Vladimir Oltean wrote:
> On Tue, May 30, 2023 at 04:25:41PM +0800, Muhammad Husaini Zulkifli wrote:
> > From: Tan Tee Min <tee.min.tan@linux.intel.com>
> > 
> I think that the commit message insufficiently describes the problems
> with the existing code. Also, the incorrect Fixes: tag suggests it may
> even have been incompletely characterized.
> 
> Here are some additional thoughts of mine (also insufficient, since they
> are based on static code analysis, done now) which may nuance things a
> bit more, to change the Fixes tag and the shape of the proposed solution.

Thanks a lot for the reviews!

> 
> Now, taprio_dequeue_from_txq() uses the following "entry" fields:
> gate_close_time[tc] and budget[tc]. They are both recalculated
> incorrectly by advance_sched(), which your change addresses. That is one
> issue which should be described properly, and a patch limited to that.

Do you mean a separate patch to fix the recalculation issue for
gate_close_time[tc] and budget[tc] in advance_sched()?

> 
> so this is why your choice is to also update the cycle_time. An unfortunate
> consequence of that choice is that this might also become transiently visible
> to taprio_dump(), which I'm not totally convinced is desirable - because
> effectively, the cycle time shouldn't have changed. Although, true, the
> old oper schedule is going away soon, we don't really know how soon. The
> cycle times can be arbitrarily large. It would be great if we could save
> the correction into a dedicated "s64 correction" variable (ranging from
> -cycle_time+1 to +cycle_time_extension), and leave the cycle_time alone.

That makes sense. I will save the correction into a dedicated "correction"
variable, and make it use with the cycle time at other places, and not update
the oper cycle_time.

So that the taprio_dump() still shows the original oper cycle_time during the
schedule change pending period.

> 
> But taprio_enqueue() is also a problem, and that isn't addressed. It can be,
> that during a corrected cycle, frames which were oversized due to the
> queueMaxSDU restriction are transiently not oversized anymore, and
> should be allowed to pass - or the other way around (and this is worse):
> a frame which would have passed through a full-size window will not pass
> through a truncated last cycle (negative correction), and taprio_enqueue()
> will not detect this and will not drop the skb.
> 
> taprio_update_queue_max_sdu() would need to be called, and that depends
> on an up-to-date sched->max_open_gate_duration[tc] which isn't currently
> recalculated.
> 
> So, one way or another, taprio_calculate_gate_durations() also needs to
> be called again after a change to the schedule's correction.

Yape, the queueMaxSDU and max_open_gate_durations() both should need to be
updated too. I will correct it in v2 patch by adding in the
taprio_update_queue_max_sdu() and taprio_calculate_gate_durations()
after the schedule change.

> 
> >  net/sched/sch_taprio.c | 32 +++++++++++++++++++++-----------
> >  1 file changed, 21 insertions(+), 11 deletions(-)
> > 
> > diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> > index 76db9a10ef504..ef487fef83fce 100644
> > --- a/net/sched/sch_taprio.c
> > +++ b/net/sched/sch_taprio.c
> > @@ -99,6 +99,7 @@ struct taprio_sched {
> >  	u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
> >  	u32 fp[TC_QOPT_MAX_QUEUE]; /* only for dump and offloading */
> >  	u32 txtime_delay;
> > +	bool sched_changed;
> 
> nitpick: sched_change_pending?

Noted. Will change it to sched_change_pending.

>
> The choice of operations seems unintuitive, when you could have simply
> done:
> 
> 	ktime_t correction = ktime_sub(sched_base_time(admin), end_time);
> 
> 	next->interval += correction;
> 	oper->cycle_time += correction;
> 
> 	(or possibly save the correction instead, see the feedback above)

Thanks. Feedback taken.

> I see an issue here: you need to set q->sched_changed = true whenever
> should_change_schedules() returned true and not just for the last entry,
> correct? Because there could be a schedule change which needs to happens
> during entry 2 out of 4 of the current one (truncation case - negative
> correction). In that case, I believe that should_change_schedules()
> keeps shouting at you "change! change! change!" but you only call
> switch_schedules() when you've reached entry 4 with the "next" pointer,
> which is not what the standard suggests should be done.
> 
> IIUC, the standard says that when an admin and an oper sched meet, the
> decision of what to do near the AdminBaseTime - whether to elongate the
> next-to-last cycle of the oper sched or to let the last cycle run but to
> cut it short - depends on the OperCycleTimeExtension. In a nutshell,
> that variable says what is the maximum positive correction applicable to
> the last sched entry and to the cycle time. If a positive correction
> larger than that would be necessary (relative to the next-to-last cycle),
> the decision is to just let the last cycle run for how long it can.

Based on my understanding, here are the two cases when the Oper and Admin
cycle boundaries don’t line up perfectly:-
1/ The final Oper cycle before first Admin cycle is smaller than
   OperCycleTimeExtension:
   - Then extend the final oper cycle rather than restart a very short final
     oper cycle.
2/ The final Oper cycle before first Admin cycle is greater than
   OperCycleTimeExtension:
   - Then it won't extend the final Oper cycle and restart the final Oper
     cycle instead, then it will be truncated at Admin base time.

I think you are saying the scenario 2/ above, right?
Let me rework the solution and come back with the proper fixes.

> I guess at some point we should also fix up this comment?
> 
> 	/* FIXME: the IEEE 802.1Q-2018 Specification isn't clear about
> 	 * how precisely the extension should be made. So after
> 	 * conformance testing, this logic may change.
> 	 */
> 	if (ktime_compare(next_base_time, extension_time) <= 0)
> 		return true;

Agree. Let me fix up this comment in next patch.

Thanks,
Tee Min
  
Vladimir Oltean June 2, 2023, 4:14 p.m. UTC | #3
On Fri, Jun 02, 2023 at 03:14:06PM +0800, Tan Tee Min wrote:
> Do you mean a separate patch to fix the recalculation issue for
> gate_close_time[tc] and budget[tc] in advance_sched()?

Yes. You might be asking: "separate from what"?

I notice one other unrelated change in your patch is to delay the
advance_sched() -> switch_schedules() call via this new sched_change_pending
variable, and that is probably a good change. But it needs its own patch
with its own context, problem description and solution description.

> > I see an issue here: you need to set q->sched_changed = true whenever
> > should_change_schedules() returned true and not just for the last entry,
> > correct? Because there could be a schedule change which needs to happens
> > during entry 2 out of 4 of the current one (truncation case - negative
> > correction). In that case, I believe that should_change_schedules()
> > keeps shouting at you "change! change! change!" but you only call
> > switch_schedules() when you've reached entry 4 with the "next" pointer,
> > which is not what the standard suggests should be done.
> > 
> > IIUC, the standard says that when an admin and an oper sched meet, the
> > decision of what to do near the AdminBaseTime - whether to elongate the
> > next-to-last cycle of the oper sched or to let the last cycle run but to
> > cut it short - depends on the OperCycleTimeExtension. In a nutshell,
> > that variable says what is the maximum positive correction applicable to
> > the last sched entry and to the cycle time. If a positive correction
> > larger than that would be necessary (relative to the next-to-last cycle),
> > the decision is to just let the last cycle run for how long it can.
> 
> Based on my understanding, here are the two cases when the Oper and Admin
> cycle boundaries don’t line up perfectly:-
> 1/ The final Oper cycle before first Admin cycle is smaller than
>    OperCycleTimeExtension:
>    - Then extend the final oper cycle rather than restart a very short final
>      oper cycle.

Yes, this should result in a positive correction - a new cycle is not
begun, but the last entry of the next-to-last cycle just lasts longer.

> 2/ The final Oper cycle before first Admin cycle is greater than
>    OperCycleTimeExtension:
>    - Then it won't extend the final Oper cycle and restart the final Oper
>      cycle instead, then it will be truncated at Admin base time.

Yes, this should result in a negative correction - a new cycle is begun,
but whose effective length will be shorter because it will be truncated
as you say.

> I think you are saying the scenario 2/ above, right?
> Let me rework the solution and come back with the proper fixes.

Yes, I'm saying that in situation 2, you're not allowing the schedule to
be truncated where it should be, I believe.
  

Patch

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 76db9a10ef504..ef487fef83fce 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -99,6 +99,7 @@  struct taprio_sched {
 	u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
 	u32 fp[TC_QOPT_MAX_QUEUE]; /* only for dump and offloading */
 	u32 txtime_delay;
+	bool sched_changed;
 };
 
 struct __tc_taprio_qopt_offload {
@@ -934,8 +935,10 @@  static enum hrtimer_restart advance_sched(struct hrtimer *timer)
 	admin = rcu_dereference_protected(q->admin_sched,
 					  lockdep_is_held(&q->current_entry_lock));
 
-	if (!oper)
+	if (!oper || q->sched_changed) {
+		q->sched_changed = false;
 		switch_schedules(q, &admin, &oper);
+	}
 
 	/* This can happen in two cases: 1. this is the very first run
 	 * of this function (i.e. we weren't running any schedule
@@ -962,20 +965,27 @@  static enum hrtimer_restart advance_sched(struct hrtimer *timer)
 	end_time = ktime_add_ns(entry->end_time, next->interval);
 	end_time = min_t(ktime_t, end_time, oper->cycle_end_time);
 
+	if (should_change_schedules(admin, oper, oper->cycle_end_time) &&
+	    list_is_last(&next->list, &oper->entries)) {
+		u32 ori_interval = next->interval;
+
+		next->interval += ktime_sub(sched_base_time(admin), end_time);
+		oper->cycle_time += next->interval - ori_interval;
+		end_time = sched_base_time(admin);
+		oper->cycle_end_time = end_time;
+		q->sched_changed = true;
+	}
+
 	for (tc = 0; tc < num_tc; tc++) {
-		if (next->gate_duration[tc] == oper->cycle_time)
+		if (next->gate_duration[tc] == oper->cycle_time) {
 			next->gate_close_time[tc] = KTIME_MAX;
-		else
+		} else if (q->sched_changed && next->gate_duration[tc]) {
+			next->gate_close_time[tc] = end_time;
+			next->gate_duration[tc] = next->interval;
+		} else {
 			next->gate_close_time[tc] = ktime_add_ns(entry->end_time,
 								 next->gate_duration[tc]);
-	}
-
-	if (should_change_schedules(admin, oper, end_time)) {
-		/* Set things so the next time this runs, the new
-		 * schedule runs.
-		 */
-		end_time = sched_base_time(admin);
-		switch_schedules(q, &admin, &oper);
+		}
 	}
 
 	next->end_time = end_time;