[v5,0/7] SCHED_DEADLINE server infrastructure

Message ID: cover.1699095159.git.bristot@kernel.org
Series: SCHED_DEADLINE server infrastructure

Message

Daniel Bristot de Oliveira Nov. 4, 2023, 10:59 a.m. UTC
  This is v5 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low-priority
tasks (e.g., SCHED_OTHER) when higher-priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve on it.

In v1 there was discussion about the consequences of using deadline-based
servers on fixed-priority workloads. As a demonstration, here is the
baseline timerlat scheduling latency as-is, with a kernel build as
background workload:

 # rtla timerlat top -u -d 10m

  --------------------- %< ------------------------
                                     Timer Latency
  0 01:42:24   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #6143559   |        0         0         0        92 |        2         1         3        98 |        4         1         5       100
  1 #6143559   |        1         0         0        97 |        7         1         5       101 |        9         1         7       103
  2 #6143559   |        0         0         0        88 |        3         1         5        95 |        5         1         7        99
  3 #6143559   |        0         0         0        90 |        6         1         5       103 |       10         1         7       126
  4 #6143558   |        1         0         0        81 |        7         1         4        86 |        9         1         7        90
  5 #6143558   |        0         0         0        74 |        3         1         5        79 |        4         1         7        83
  6 #6143558   |        0         0         0        83 |        2         1         5        89 |        3         0         7       108
  7 #6143558   |        0         0         0        85 |        3         1         4       126 |        5         1         6       137
  --------------------- >% ------------------------

And this is the same test with the DL server activating without any delay:
  --------------------- %< ------------------------
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #579147    |        0         0         0        54 |        2         1        52     61095 |        2         2        56     61102
  1 #578766    |        0         0         0        83 |        2         1        49     55824 |        3         2        53     55831
  2 #578559    |        0         0         1        59 |        2         1        50     55760 |        3         2        54     55770
  3 #578318    |        0         0         0        76 |        2         1        49     55751 |        3         2        54     55760
  4 #578611    |        0         0         0        64 |        2         1        49     55811 |        3         2        53     55820
  5 #578347    |        0         0         1        40 |        2         1        50     56121 |        3         2        55     56133
  6 #578938    |        0         0         1        75 |        2         1        49     55755 |        3         2        53     55764
  7 #578631    |        0         0         1        36 |        3         1        51     55528 |        4         2        55     55541
  --------------------- >% ------------------------

The problem with the DL-server-only implementation is that FIFO tasks might
suffer preemption from NORMAL tasks even when spare CPU cycles are available.
In fact, the fair deadline server is enqueued right away when NORMAL tasks
wake up and they are first scheduled by the server, thus potentially
preempting a well-behaving FIFO task. This is of course not ideal.

We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this, but IMHO that is
overkill.

Juri and I discussed this and thought about delaying the server
activation to the 0-lag time, thus enabling the server only if the
fair scheduler is about to starve.

Patch 6/7 adds the possibility of deferring the server start to the
(absolute deadline - runtime) point in time, which is the latest instant at
which the server can still provide its full runtime before the deadline.
This is achieved by enqueuing the dl server throttled, with the next
replenishment time set to activate the server at (absolute deadline - runtime).

Differently from v4, the server is now enqueued with its runtime
replenished. As the fair scheduler runs without boost, its runtime is
consumed. If the fair server consumes its runtime before the 0-laxity
time, a new period is set, and the timer is armed for the new
(deadline - runtime).
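
To make the mechanics concrete, here is a toy userspace sketch of the
bookkeeping described above (all names and the 50ms/1s values are invented
for the example; this is only an illustration, not the sched/deadline.c code):

#include <stdint.h>
#include <stdio.h>

struct toy_server {
	uint64_t dl_runtime;	/* configured runtime, in ns */
	uint64_t dl_period;	/* configured period, in ns */
	uint64_t runtime;	/* runtime left in the current period */
	uint64_t deadline;	/* absolute deadline of the current period */
	uint64_t timer;		/* when the deferred server would fire */
};

/*
 * Fair tasks wake up: the server is enqueued "throttled", with the
 * replenishment timer armed at (deadline - runtime), the latest point
 * at which it can still provide its full runtime before the deadline.
 */
static void toy_defer_enqueue(struct toy_server *s, uint64_t now)
{
	s->runtime  = s->dl_runtime;
	s->deadline = now + s->dl_period;
	s->timer    = s->deadline - s->runtime;
}

/*
 * Fair tasks running unboosted still consume the server's runtime.  If
 * it is used up before the timer fires, a new period is set and the
 * timer is re-armed at the new (deadline - runtime).
 */
static void toy_fair_ran(struct toy_server *s, uint64_t now, uint64_t delta)
{
	if (s->runtime > delta)
		s->runtime -= delta;
	else
		toy_defer_enqueue(s, now);
}

int main(void)
{
	struct toy_server s = {
		.dl_runtime = 50000000ULL,	/* 50 ms, illustrative only */
		.dl_period  = 1000000000ULL,	/*  1 s, illustrative only */
	};

	toy_defer_enqueue(&s, 0);
	printf("server deferred to +%llu ns\n", (unsigned long long)s.timer);

	/* fair tasks ran unboosted for 60 ms and exhausted the runtime */
	toy_fair_ran(&s, 10000000ULL, 60000000ULL);
	printf("new period, timer re-armed at +%llu ns\n",
	       (unsigned long long)s.timer);
	return 0;
}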

Patch 7/7 adds a per-rq interface for the knobs:
	fair_server_runtime (950 ms)
	fair_server_period  (1s)
	fair_server_defer   (enabled)

With defer enabled on CPUs [0:3], the results get better, showing behavior
similar to what we have with RT throttling.

  --------------------- %< ------------------------
                                     Timer Latency                                                                                       
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #599979    |        0         0         0        64 |        4         1         4        67 |        6         1         5        69
  1 #599979    |        0         0         1        17 |        6         1         5        50 |       10         2         7        71
  2 #599984    |        1         0         1        22 |        4         1         5        78 |        5         2         7       107
  3 #599986    |        0         0         1        72 |        7         1         5        79 |       10         2         7        82
  4 #581580    |        1         0         1        37 |        6         1        38     52797 |       10         2        41     52805
  5 #583270    |        1         0         1        41 |        9         1        36     52617 |       12         2        38     52623
  6 #581240    |        0         0         1        25 |        7         1        39     52870 |       11         2        41     52876
  7 #581208    |        0         0         1        69 |        6         1        39     52917 |        9         2        41     52923
  --------------------- >% ------------------------

Here are some osnoise measurements, with osnoise threads running as FIFO:1
under different setups (defer enabled):
 - CPU 2 isolated
 - CPU 3 isolated shared with a CFS busy loop task
 - CPU 8 non-isolated
 - CPU 9 non-isolated shared with a CFS busy loop task

  --------------------- %< ------------------------
 ~# pgrep ktimer | while read pid; do chrt -p -f 2 $pid; done # for RT kernel
 ~# sysctl kernel.sched_rt_runtime_us=-1
 ~# tuna  isolate -c 2
 ~# tuna  isolate -c 3
 ~# taskset -c 3 ./f &
 ~# taskset -c 9 ./f &
 ~# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1
                                          Operating System Noise
duration:   0 00:10:00 | time is in us
CPU Period       Runtime        Noise  % CPU Aval   Max Noise   Max Single          HW          NMI          IRQ      Softirq       Thread
  2 #599       599000000          178    99.99997          18            2           0            0          270            0            0
  3 #598       598054434     31351553    94.75774      104442       104442           0            0      2837523            0         1794
  8 #599       599000001       567456    99.90526        3260         2375           2           89       620490            0        13539
  9 #598       598021196     31742537    94.69207       71707        53357           0           90      3411023            0         1762
   --------------------- >% ------------------------

The system runs fine!
	- no crashes (famous last words)
	- the FIFO property is kept
	- per-CPU interface because it is more flexible - and to detach this from
	  the throttling concept.

Global is broken, but it will > /dev/null.

TODO:
  - Move the rt throttling code to RT_GROUP_SCHED for now (then send it to the
    same place as global).

Changes from V4:
  - Enable the server when nr fair tasks is > 0 (peter)
  - Consume runtime if the zerolax server is not boosted (peterz)
  - Adjust interface to deal with admission control (peterz)
  - Rebased to 6.6
Changes from V3:
  - Add the defer server (Daniel)
  - Add a per-rq interface (Daniel with Peter's feedback)
  - Add an option to not defer the server (for Joel)
  - Typos and 1-liner fixes (Valentin, Luca, Peter)
  - Fair scheduler running on the dl server does not account as an RT task (Daniel)
  - Changed the condition to enable the server (RT & fair tasks) (Daniel)
Changes from v2:
  - Refactor/rephrase/typos changes
  - Deferrable server using throttling
  - The server starts when RT && Fair tasks are enqueued
  - Interface with runtime/period/defer option
Changes from v1:
  - rebased on 6.4-rc1 tip/sched/core

Daniel Bristot de Oliveira (2):
  sched/deadline: Deferrable dl server
  sched/fair: Fair server interface

Peter Zijlstra (5):
  sched: Unify runtime accounting across classes
  sched/deadline: Collect sched_dl_entity initialization
  sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  sched/deadline: Introduce deadline servers
  sched/fair: Add trivial fair server

 include/linux/sched.h    |  26 +-
 kernel/sched/core.c      |  23 +-
 kernel/sched/deadline.c  | 671 ++++++++++++++++++++++++++++-----------
 kernel/sched/debug.c     | 202 ++++++++++++
 kernel/sched/fair.c      |  87 ++++-
 kernel/sched/rt.c        |  15 +-
 kernel/sched/sched.h     |  56 +++-
 kernel/sched/stop_task.c |  13 +-
 8 files changed, 847 insertions(+), 246 deletions(-)
  

Comments

Joel Fernandes Dec. 8, 2023, 9:47 p.m. UTC | #1
On Sat, Nov 04, 2023 at 11:59:17AM +0100, Daniel Bristot de Oliveira wrote:
> This is v5 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].
> 
> SCHED_DEADLINE servers can help fixing starvation issues of low priority
> tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
> cycles. Today we have RT Throttling; DEADLINE servers should be able to
> replace and improve that.

Hello!
Just wanted to provide some ChromeOS data on these patches. There is
great improvement when using the DL-server along with RT for foreground Chrome's
display, audio and main threads. Once they are in the background, we set them
back to CFS (except audio). I think these patches are ready to move forward
as the data looks good to me. I see Peter picked up some of them already,
which is nice.

One of the key metrics for us is event latency. We have a test that measures
various latency metrics while typing happens in a Google Doc in one window
and a 16-person Google Meet call happens in the other. This is a very
complex test, but it gets us close to what the user experiences (as is typical,
meeting attendees in a Google Meet call take notes in a Google Doc). As a
result, getting stable numbers requires a lot of care, which is why I used the
P-value to measure the statistical significance of the results. The P-values
for some metrics show lower significance, so we can ignore those, but I still
provided them in the table.

The test is run on a Chromebook with 4 cores (Intel(R) Celeron(R) N4100 CPU @
1.10GHz) and 16GB of RAM. No Hyperthreading.

All units are microseconds. The averages are calculated over 20 runs each,
with and without "Chrome using RT + DL-server". The 5% every 1 second
default does not work for us, so I changed the DL server parameters to 5ms
every 30ms. This allows CFS to run more often.

This test runs for 6 hours. Total test time for both before and after is 12 hours:

---------------------------------------------------------------------------------------------------------
| MetricName                                      | Average Before | Average After | Change % | P-value |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.Core.TotalLatency              | 90.19          | 78.22         | 13.27%   | 0.03    |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.KeyReleased.TotalLatency       | 90733.76       | 78602.72      | 13.37%   | 0.03    |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.TotalLatency                   | 90.19          | 78.22         | 13.27%   | 0.03    |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.KeyPressed.TotalLatency       | 68269.21       | 63310.99      | 7.26%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.MousePressed.TotalLatency     | 192080.44      | 179264.31     | 6.67%    | 0.26    |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.TotalLatency                  | 68795.99       | 63860.04      | 7.17%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| EventLatency.GestureScrollUpdt.Wheel.TotalLat   | 63420.88       | 59394.18      | 6.35%    | 0.02    |
---------------------------------------------------------------------------------------------------------
| EventLatency.KeyPressed.TotalLatency            | 68269.21       | 63310.99      | 7.26%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseDragged.TotalLatency          | 106393.09      | 104152.50     | 2.11%    | 0.57    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseMoved.TotalLatency            | 129225.65      | 113268.48     | 12.35%   | 0.01    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MousePressed.TotalLatency          | 192080.44      | 179264.31     | 6.67%    | 0.26    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseReleased.TotalLatency         | 152366.33      | 140309.50     | 7.91%    | 0.44    |
---------------------------------------------------------------------------------------------------------
| EventLatency.TotalLatency                       | 68795.99       | 63862.45      | 7.17%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| EventLatency.TotalLatency_ash-Chrome            | 68795.99       | 63862.45      | 7.17%    | 0.00    |
---------------------------------------------------------------------------------------------------------

I also did another test where I measure the CFS maximum latency (using perf
sched) while a YouTube video is playing, and the CFS max latency looks great
too. In fact, with the vanilla RT throttling, our CFS tasks are doing really
badly (perhaps because they depend on RT tasks due to locks or such). So we
definitely need the DL-server to use RT properly!

We are testing the DL server with 5ms/50ms and 5ms/100ms as well to see the
impact, but at the moment, 5ms/30ms is looking good.

Thanks for all of your work, here's to better Linux and better Chromebooks ;)

 - Joel
  
Huang, Ying Feb. 19, 2024, 7:33 a.m. UTC | #2
Hi, Daniel,

Thanks a lot for your great patchset!

We have a similar starvation issue in the mm subsystem too.  Details are in
the patch description of the commit below.  In short, task A is busy
looping on some event, while task B will signal the event after some
work.  If the priority of task A is higher than that of task B, task B
may be starved.

IIUC, if task A is an RT task while task B is a fair task, then your patchset
will solve the issue.  If both task A and task B are RT tasks, is there
some way to solve the issue?

Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
resolve the issue; is there something better?

Best Regards,
Huang, Ying

--------------------------8<---------------------------------------
commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
Author: Guo Ziliang <guo.ziliang@zte.com.cn>
Date:   Wed Mar 16 16:15:03 2022 -0700

    mm: swap: get rid of livelock in swapin readahead
    
    In our testing, a livelock task was found.  Through sysrq printing, same
    stack was found every time, as follows:
    
      __swap_duplicate+0x58/0x1a0
      swapcache_prepare+0x24/0x30
      __read_swap_cache_async+0xac/0x220
      read_swap_cache_async+0x58/0xa0
      swapin_readahead+0x24c/0x628
      do_swap_page+0x374/0x8a0
      __handle_mm_fault+0x598/0xd60
      handle_mm_fault+0x114/0x200
      do_page_fault+0x148/0x4d0
      do_translation_fault+0xb0/0xd4
      do_mem_abort+0x50/0xb0
    
    The reason for the livelock is that swapcache_prepare() always returns
    EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
    cannot jump out of the loop.  We suspect that the task that clears the
    SWAP_HAS_CACHE flag never gets a chance to run.  We try to lower the
    priority of the task stuck in a livelock so that the task that clears
    the SWAP_HAS_CACHE flag will run.  The results show that the system
    returns to normal after the priority is lowered.
    
    In our testing, multiple real-time tasks are bound to the same core, and
    the task in the livelock is the highest priority task of the core, so
    the livelocked task cannot be preempted.
    
    Although cond_resched() is used by __read_swap_cache_async, it is an
    empty function in the preemptive system and cannot achieve the purpose
    of releasing the CPU.  A high-priority task cannot release the CPU
    unless preempted by a higher-priority task.  But when this task is
    already the highest priority task on this core, other tasks will not be
    able to be scheduled.  So we think we should replace cond_resched() with
    schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
    call set_current_state first to set the task state, so the task will be
    removed from the running queue, so as to achieve the purpose of giving
    up the CPU and prevent it from running in kernel mode for too long.
    
    (akpm: ugly hack becomes uglier.  But it fixes the issue in a
    backportable-to-stable fashion while we hopefully work on something
    better)
    
    Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
    Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
    Reported-by: Zeal Robot <zealci@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roger Quadros <rogerq@kernel.org>
    Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..ee67164531c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its page to swap cache.
 		 */
-		cond_resched();
+		schedule_timeout_uninterruptible(1);
 	}
 
 	/*
  
Daniel Bristot de Oliveira Feb. 19, 2024, 10:23 a.m. UTC | #3
Hi

On 2/19/24 08:33, Huang, Ying wrote:
> Hi, Daniel,
> 
> Thanks a lot for your great patchset!
> 
> We have a similar starvation issue in mm subsystem too.  Details are in
> the patch description of the below commit.  In short, task A is busy
> looping on some event, while task B will signal the event after some
> work.  If the priority of task A is higher than that of task B, task B
> may be starved.

ok...

> 
> IIUC, if task A is RT task while task B is fair task, then your patchset
> will solve the issue.

This patch set will not solve the issue. It will mitigate the effect of the
problem. Still the system will perform very poorly...

> If both task A and task B is RT tasks, is there
> some way to solve the issue?

I would say reworking the swap algorithm, as it is not meant to be used when
real-time tasks are in place.

As an exercise, let's say that we add a server per priority for FIFO, with a
default 50ms/1s runtime/period. Your "real-time" workload would suffer a 950ms
latency (the 1s period minus the 50ms runtime), busy looping in vain.

Then one would say, let's lower the parameters, so the granularity of
the server would provide lower latencies. The same problem would still
exist, as it exists with sched fair....

So, the solution is not in the scheduler. Busy-loop waiting is not right when
you have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
spin_locks that do busy waiting. spin_locks do not have this problem you
show because they disable preemption... but disabling preemption is not
the best solution either.

So, a first duct-tape attempt would be using (local_?) locks like in
PREEMPT_RT to make things sleepable...

AFAICS, this was already discussed in the previous link, right?

> 
> Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
> resolve the issue, is there something better?

I am not a swap/mm expert... my guesses would all be on sleepable locking.
But I know there are many smart people on the mm side with better guesses...

It is just that the DL server or any type of starvation avoidance does not
seem to be a solution for your problem.

-- Daniel


Huang, Ying Feb. 20, 2024, 3:28 a.m. UTC | #4
Daniel Bristot de Oliveira <bristot@kernel.org> writes:

> Hi
>
> On 2/19/24 08:33, Huang, Ying wrote:
>> Hi, Daniel,
>> 
>> Thanks a lot for your great patchset!
>> 
>> We have a similar starvation issue in mm subsystem too.  Details are in
>> the patch description of the below commit.  In short, task A is busy
>> looping on some event, while task B will signal the event after some
>> work.  If the priority of task A is higher than that of task B, task B
>> may be starved.
>
> ok...
>
>> 
>> IIUC, if task A is RT task while task B is fair task, then your patchset
>> will solve the issue.
>
> This patch set will not solve the issue. It will mitigate the effect of the
> problem. Still the system will perform very poorly...

I don't think that it's common (or even reasonable) for real-time tasks
to use swap.  So, IMHO, performance isn't very important here.  But, we
need to avoid live-lock anyway.  I think that your patchset solves the
live-lock issue.

>> If both task A and task B is RT tasks, is there
>> some way to solve the issue?
>
> I would say reworking the swap algorithm, as it is not meant to be used when
> real-time tasks are in place.
>
> As an exercise, let's say that we add a server per priority on FIFO, with a default
> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
> busy loop in vain.

If the target is only live-lock avoidance, is it possible to run
lower-priority runnable tasks for a short while if we have run long enough in
the busy loop?

> Then one would say, let's lower the parameters, so the granularity of
> the server would provide lower latencies. The same problem would still
> exist, as it exists with sched fair....
>
> So, the solution is not on schedule. Busy loop waiting is not right when you
> have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
> spin_locks that do busy waiting. spin_locks do not have this problem you
> show because they disable preemption... but disabling preemption is not
> the best solution either.
>
> So, a first try of duct tape would using (local_?) locks like in
> preempt rt to make things sleepable...
>
> AFAICS, this was already discussed in the previous link, right?
>
>> 
>> Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
>> resolve the issue, is there something better?
>
> I am not a swap/mm expert.. my guesses would be all on sleepable locking.
> But I know there are many smart people on the mm side with better guesses...
>
> It is just that the DL server or any type of starvation avoidance does not
> seem to be a solution for your problem.

Yes.  To improve the performance, we need something else.

--
Best Regards,
Huang, Ying

  
Daniel Bristot de Oliveira Feb. 20, 2024, 8:31 a.m. UTC | #5
On 2/20/24 04:28, Huang, Ying wrote:
> Daniel Bristot de Oliveira <bristot@kernel.org> writes:
> 
>> Hi
>>
>> On 2/19/24 08:33, Huang, Ying wrote:
>>> Hi, Daniel,
>>>
>>> Thanks a lot for your great patchset!
>>>
>>> We have a similar starvation issue in mm subsystem too.  Details are in
>>> the patch description of the below commit.  In short, task A is busy
>>> looping on some event, while task B will signal the event after some
>>> work.  If the priority of task A is higher than that of task B, task B
>>> may be starved.
>>
>> ok...
>>
>>>
>>> IIUC, if task A is RT task while task B is fair task, then your patchset
>>> will solve the issue.
>>
>> This patch set will not solve the issue. It will mitigate the effect of the
>> problem. Still the system will perform very poorly...
> 
> I don't think that it's common (or even reasonable) for real-time tasks
> to use swap.  So, IMHO, performance isn't very important here.  But, we
> need to avoid live-lock anyway.  I think that your patchset solves the
> live-lock issue.

I mean, if for you this is solving your user problem, be happy :-) Play with parameters...
find a way to tune your system as a user... use it :)

But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE (the
default since... two years ago, I think). So there is not much news here.

IMHO, it is not a solution. As a developer, there is a synchronization problem
in swap code, and pushing a workaround to the scheduling side is not the way to go...

> 
>>> If both task A and task B is RT tasks, is there
>>> some way to solve the issue?
>>
>> I would say reworking the swap algorithm, as it is not meant to be used when
>> real-time tasks are in place.
>>
>> As an exercise, let's say that we add a server per priority on FIFO, with a default
>> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
>> busy loop in vain.
> 
> If the target is only the live-lock avoidance, is it possible to run
> lower priority runnable tasks for a short while if we run long enough in
> the busy loop?

If you do it on the algorithm side (instead of relying on scheduling), it could
be a thing.

I think NAPI still uses something like this: busy-loop for two jiffies in the
softirq context (a priority higher than all threads on the !rt kernel), then
move to the thread context to avoid starvation. In the swap case, it could run for two jiffies
and then go to sleep for a while. How well will swap people receive this as a solution...
I do not know :) I would first try something better than this using synchronization
primitives.
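
Just to make that idea concrete, a rough sketch of such a bounded poll
(purely illustrative, not a proposed patch; event_pending() stands in for
whatever condition the loop is waiting on):

	unsigned long end = jiffies + 2;	/* poll for ~two jiffies */

	while (!event_pending()) {
		if (time_after(jiffies, end)) {
			/* let lower-priority tasks run for a bit */
			schedule_timeout_uninterruptible(1);
			end = jiffies + 2;
		} else {
			cpu_relax();
		}
	}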

This patch set is for things outside of kernel control: for example, people
running poll-mode DPDK in user space with FIFO priority, or FIFO tasks running
in user space for too long... with a better design than rt throttling.

Will this patch help with misbehaving kernel activities? Yes. Is it a reason
not to fix kernel problems? I do not think so, and I bet many other people do
not think so either.

-- Daniel
  
Huang, Ying Feb. 20, 2024, 8:41 a.m. UTC | #6
Daniel Bristot de Oliveira <bristot@kernel.org> writes:

> On 2/20/24 04:28, Huang, Ying wrote:
>> Daniel Bristot de Oliveira <bristot@kernel.org> writes:
>> 
>>> Hi
>>>
>>> On 2/19/24 08:33, Huang, Ying wrote:
>>>> Hi, Daniel,
>>>>
>>>> Thanks a lot for your great patchset!
>>>>
>>>> We have a similar starvation issue in mm subsystem too.  Details are in
>>>> the patch description of the below commit.  In short, task A is busy
>>>> looping on some event, while task B will signal the event after some
>>>> work.  If the priority of task A is higher than that of task B, task B
>>>> may be starved.
>>>
>>> ok...
>>>
>>>>
>>>> IIUC, if task A is RT task while task B is fair task, then your patchset
>>>> will solve the issue.
>>>
>>> This patch set will not solve the issue. It will mitigate the effect of the
>>> problem. Still the system will perform very poorly...
>> 
>> I don't think that it's common (or even reasonable) for real-time tasks
>> to use swap.  So, IMHO, performance isn't very important here.  But, we
>> need to avoid live-lock anyway.  I think that your patchset solves the
>> live-lock issue.
>
> I mean, if for you this is solving your user problem, be happy :-) Play with parameters...
> find a way to tune your system as a user... use it :)
>
> But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE (the
> default since... two years ago, I think). So there is not much news here.
>
> IMHO, it is not a solution. As a developer, there is a synchronization problem
> in swap code, and pushing a workaround to the scheduling side is not the way to go...
>
>> 
>>>> If both task A and task B is RT tasks, is there
>>>> some way to solve the issue?
>>>
>>> I would say reworking the swap algorithm, as it is not meant to be used when
>>> real-time tasks are in place.
>>>
>>> As an exercise, let's say that we add a server per priority on FIFO, with a default
>>> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
>>> busy loop in vain.
>> 
>> If the target is only the live-lock avoidance, is it possible to run
>> lower priority runnable tasks for a short while if we run long enough in
>> the busy loop?
>
> If you do it in the algorithm side (instead of relying on scheduling), it could be a
> thing.
>
> I think NAPI still uses something like this: Busy-loop for two jiffies in the softirq
> context (a priority higher than all threads on the !rt kernel), then move to thread
> the thread context to avoid starvation. In the swap case, it could run for two jiffies
> and then go to sleep for a while. How well will swap people receive this as a solution...
> I do not know :) I would first try something better than this using synchronization
> primitives.
>
> This patch set is for things outside of kernel control. For example, people running
> poll mode DPDK in user-space with FIFO priority; FIFO tasks in user-space for too long...
> with a better design than rt throttling.
>
> Will this patch help in misbehaving kernel activities: yes. Is it a reason not to
> fix kernel problems? I do not think so, and I bet many other people do not believe as
> well.

I totally agree with you that we need to fix the kernel problems.  And,
thanks for your information!

--
Best Regards,
Huang, Ying