Message ID: cover.1699095159.git.bristot@kernel.org
Headers:
From: Daniel Bristot de Oliveira <bristot@kernel.org>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, linux-kernel@vger.kernel.org, Luca Abeni <luca.abeni@santannapisa.it>, Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>, Thomas Gleixner <tglx@linutronix.de>, Joel Fernandes <joel@joelfernandes.org>, Vineeth Pillai <vineeth@bitbyteword.org>, Shuah Khan <skhan@linuxfoundation.org>, bristot@kernel.org, Phil Auld <pauld@redhat.com>
Subject: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
Date: Sat, 4 Nov 2023 11:59:17 +0100
Message-Id: <cover.1699095159.git.bristot@kernel.org>
Series: SCHED_DEADLINE server infrastructure
Message
Daniel Bristot de Oliveira
Nov. 4, 2023, 10:59 a.m. UTC
This is v5 of Peter's SCHED_DEADLINE server infrastructure implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low-priority tasks (e.g., SCHED_OTHER) when higher-priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that.

In v1 there was discussion about the consequences of using deadline-based servers on fixed-priority workloads. As a demonstration, here is the baseline of timerlat scheduling latency as-is, with a kernel build as background workload:

# rtla timerlat top -u -d 10m
--------------------- %< ------------------------
                                     Timer Latency
  0 01:42:24   | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT      | cur min avg max        | cur min avg max           | cur min avg max
  0 #6143559   |   0   0   0  92        |   2   1   3  98           |   4   1   5 100
  1 #6143559   |   1   0   0  97        |   7   1   5 101           |   9   1   7 103
  2 #6143559   |   0   0   0  88        |   3   1   5  95           |   5   1   7  99
  3 #6143559   |   0   0   0  90        |   6   1   5 103           |  10   1   7 126
  4 #6143558   |   1   0   0  81        |   7   1   4  86           |   9   1   7  90
  5 #6143558   |   0   0   0  74        |   3   1   5  79           |   4   1   7  83
  6 #6143558   |   0   0   0  83        |   2   1   5  89           |   3   0   7 108
  7 #6143558   |   0   0   0  85        |   3   1   4 126           |   5   1   6 137
--------------------- >% ------------------------

And this is the same test with the DL server activating without any delay:

--------------------- %< ------------------------
  0 00:10:01   | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT      | cur min avg   max      | cur min avg   max         | cur min avg   max
  0 #579147    |   0   0   0    54      |   2   1  52 61095         |   2   2  56 61102
  1 #578766    |   0   0   0    83      |   2   1  49 55824         |   3   2  53 55831
  2 #578559    |   0   0   1    59      |   2   1  50 55760         |   3   2  54 55770
  3 #578318    |   0   0   0    76      |   2   1  49 55751         |   3   2  54 55760
  4 #578611    |   0   0   0    64      |   2   1  49 55811         |   3   2  53 55820
  5 #578347    |   0   0   1    40      |   2   1  50 56121         |   3   2  55 56133
  6 #578938    |   0   0   1    75      |   2   1  49 55755         |   3   2  53 55764
  7 #578631    |   0   0   1    36      |   3   1  51 55528         |   4   2  55 55541
--------------------- >% ------------------------

The problem with a DL-server-only
implementation is that FIFO tasks might suffer preemption from NORMAL even when spare CPU cycles are available. In fact, the fair deadline server is enqueued right away when NORMAL tasks wake up, and they are first scheduled by the server, thus potentially preempting a well-behaved FIFO task. This is of course not ideal.

We had discussions about it, and one of the possibilities would be using a different scheduling algorithm for this. But IMHO that is overkill.

Juri and I discussed this and thought about delaying the server activation to the 0-lag time, thus enabling the server only if the fair scheduler is about to starve.

Patch 6/7 adds the possibility to defer the server start to the (absolute deadline - runtime) point in time. This is achieved by enqueuing the dl server throttled, with the next replenishment time set to activate the server at (absolute deadline - runtime). Differently from v4, the server is now enqueued with its runtime replenished. As the fair scheduler runs without boost, its runtime is consumed. If the fair server consumes its runtime before the 0-laxity time, a new period is set, and the timer is armed for the new (deadline - runtime).

Patch 7/7 adds a per-rq interface for the knobs:

  fair_server_runtime (950 ms)
  fair_server_period  (1 s)
  fair_server_defer   (enabled)

With defer enabled on CPUs [0:3], the results get better, showing a behavior similar to the one we have with RT throttling.
--------------------- %< ------------------------
                                     Timer Latency
  0 00:10:01   | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT      | cur min avg   max      | cur min avg   max         | cur min avg   max
  0 #599979    |   0   0   0    64      |   4   1   4    67         |   6   1   5    69
  1 #599979    |   0   0   1    17      |   6   1   5    50         |  10   2   7    71
  2 #599984    |   1   0   1    22      |   4   1   5    78         |   5   2   7   107
  3 #599986    |   0   0   1    72      |   7   1   5    79         |  10   2   7    82
  4 #581580    |   1   0   1    37      |   6   1  38 52797         |  10   2  41 52805
  5 #583270    |   1   0   1    41      |   9   1  36 52617         |  12   2  38 52623
  6 #581240    |   0   0   1    25      |   7   1  39 52870         |  11   2  41 52876
  7 #581208    |   0   0   1    69      |   6   1  39 52917         |   9   2  41 52923
--------------------- >% ------------------------

Here are some osnoise measurements, with osnoise threads running as FIFO:1 in different setups (defer enabled):

 - CPU 2: isolated
 - CPU 3: isolated, shared with a CFS busy-loop task
 - CPU 8: non-isolated
 - CPU 9: non-isolated, shared with a CFS busy-loop task

--------------------- %< ------------------------
~# pgrep ktimer | while read pid; do chrt -p -f 2 $pid; done  # for RT kernel
~# sysctl kernel.sched_rt_runtime_us=-1
~# tuna isolate -c 2
~# tuna isolate -c 3
~# taskset -c 3 ./f &
~# taskset -c 9 ./f &
~# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1
                                          Operating System Noise
duration:   0 00:10:00 | time is in us
CPU Period       Runtime        Noise  % CPU Aval   Max Noise   Max Single    HW   NMI      IRQ  Softirq   Thread
  2 #599     599000000          178    99.99997         18           2         0     0      270        0        0
  3 #598     598054434     31351553    94.75774     104442      104442        0     0  2837523        0     1794
  8 #599     599000001       567456    99.90526       3260        2375        2    89   620490        0    13539
  9 #598     598021196     31742537    94.69207      71707       53357        0    90  3411023        0     1762
--------------------- >% ------------------------

The system runs fine!
 - no crashes (famous last words)
 - the FIFO property is kept
 - the interface is per-cpu because it is more flexible - and to detach this from the throttling concept. Global is broken, but it will > /dev/null.

TODO:
 - Move the RT throttling code under RT_GROUP_SCHED for now (and then send it to the same place as the global throttling).
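A minimal sketch of the 0-laxity deferral described for patch 6/7 above, in plain C. This is a toy model with made-up names, not the kernel's dl_server code; it assumes an implicit deadline equal to the period, with all times in nanoseconds:

```c
#include <stdint.h>

/*
 * Toy model of the deferred fair-server activation described above.
 * Not kernel code: the names and the implicit-deadline simplification
 * (relative deadline == period) are assumptions for illustration.
 */
struct toy_dl_server {
	uint64_t runtime;	/* guaranteed runtime per period, in ns */
	uint64_t period;	/* period == relative deadline, in ns */
};

/*
 * A deferred server enqueued at 'now' is kept throttled, with its
 * replenishment timer armed at (absolute deadline - runtime): the last
 * instant at which the full runtime still fits before the deadline.
 * Until that point, fair tasks run unboosted and consume the runtime;
 * the server only boosts them if they are about to starve.
 */
static uint64_t defer_activation_time(const struct toy_dl_server *s,
				      uint64_t now)
{
	uint64_t abs_deadline = now + s->period;

	return abs_deadline - s->runtime;
}
```

With the cover letter's defaults (950 ms runtime in a 1 s period for the fair server itself the gap is small; for a hypothetical 50 ms/1 s server, a period starting at t=0 would arm the timer at t=950 ms).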
Changes from v4:
 - Enable the server when the number of fair tasks is > 0 (Peter)
 - Consume runtime if the zerolax server is not boosted (Peter)
 - Adjust the interface to deal with admission control (Peter)
 - Rebased to 6.6

Changes from v3:
 - Add the defer server (Daniel)
 - Add a per-rq interface (Daniel, with Peter's feedback)
 - Add an option not to defer the server (for Joel)
 - Typos and one-liner fixes (Valentin, Luca, Peter)
 - Fair scheduler running on the dl server does not account as an RT task (Daniel)
 - Changed the condition to enable the server (RT & fair tasks) (Daniel)

Changes from v2:
 - Refactor/rephrase/typo changes
 - Deferrable server using throttling
 - The server starts when RT && fair tasks are enqueued
 - Interface with runtime/period/defer options

Changes from v1:
 - Rebased on 6.4-rc1 tip/sched/core

Daniel Bristot de Oliveira (2):
  sched/deadline: Deferrable dl server
  sched/fair: Fair server interface

Peter Zijlstra (5):
  sched: Unify runtime accounting across classes
  sched/deadline: Collect sched_dl_entity initialization
  sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  sched/deadline: Introduce deadline servers
  sched/fair: Add trivial fair server

 include/linux/sched.h    |  26 +-
 kernel/sched/core.c      |  23 +-
 kernel/sched/deadline.c  | 671 ++++++++++++++++++++++++++++-----------
 kernel/sched/debug.c     | 202 ++++++++++++
 kernel/sched/fair.c      |  87 ++++-
 kernel/sched/rt.c        |  15 +-
 kernel/sched/sched.h     |  56 +++-
 kernel/sched/stop_task.c |  13 +-
 8 files changed, 847 insertions(+), 246 deletions(-)
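As a sketch of how the per-rq knobs from patch 7/7 might be driven: the knob names come from the cover letter, but the debugfs path and the nanosecond units below are my assumptions and may differ in this version of the series.

```shell
# Assumed location and units -- only the knob names are taken from the
# cover letter; adjust the path to wherever patch 7/7 exposes them.
cd /sys/kernel/debug/sched/fair_server/cpu0
echo  950000000 > fair_server_runtime   # 950 ms (assuming values in ns)
echo 1000000000 > fair_server_period    # 1 s    (assuming values in ns)
echo 1 > fair_server_defer              # defer activation to the 0-laxity point
```

Each CPU's runqueue gets its own directory, which is what makes the per-rq interface more flexible than a single global knob.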
Comments
On Sat, Nov 04, 2023 at 11:59:17AM +0100, Daniel Bristot de Oliveira wrote:
> This is v5 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].
>
> SCHED_DEADLINE servers can help fix starvation issues of low-priority
> tasks (e.g., SCHED_OTHER) when higher-priority tasks monopolize CPU
> cycles. Today we have RT Throttling; DEADLINE servers should be able to
> replace and improve that.

Hello! Just wanted to provide some ChromeOS data on these patches. There is great improvement when using the DL server along with RT for foreground Chrome's display, audio and main threads. Once they are in the background, we set them back to CFS (except audio). I think these patches are ready to move forward as the data looks good to me. I see Peter picked up some of them already, which is nice.

One of the key metrics for us is event latency. We have a test that measures various latency metrics with typing happening in a Google doc in one window and a 16-person Google Meet call happening in the other. This is a very complex test, but it gets us close to what the user experiences (as is typical, meeting attendees in a Google Meet call take notes in a Google doc). As a result, getting stable numbers requires a lot of care, which is why I used the P-value to measure the statistical significance of the results. The P-value for some metrics shows lower significance, so we can ignore those, but I still provided them in the table.

The test is run on a Chromebook with 4 cores (Intel(R) Celeron(R) N4100 CPU @ 1.10GHz) and 16GB of RAM. No hyperthreading. All units are microseconds. The average is calculated as the average of 20 runs with and without "Chrome using RT + DL-server". The 5% every 1 second default does not work for us, so I changed the DL server parameters to 5ms every 30ms. This allows CFS to run more often. This test runs for 6 hours.
Total test time for both before and after is 12 hours:

 | MetricName                                    | Average Before | Average After | Change % | P-value |
 |-----------------------------------------------|----------------|---------------|----------|---------|
 | Ash.EventLatency.Core.TotalLatency            |          90.19 |         78.22 |   13.27% |    0.03 |
 | Ash.EventLatency.KeyReleased.TotalLatency     |       90733.76 |      78602.72 |   13.37% |    0.03 |
 | Ash.EventLatency.TotalLatency                 |          90.19 |         78.22 |   13.27% |    0.03 |
 | Docs.EventLatency.KeyPressed.TotalLatency     |       68269.21 |      63310.99 |    7.26% |    0.00 |
 | Docs.EventLatency.MousePressed.TotalLatency   |      192080.44 |     179264.31 |    6.67% |    0.26 |
 | Docs.EventLatency.TotalLatency                |       68795.99 |      63860.04 |    7.17% |    0.00 |
 | EventLatency.GestureScrollUpdt.Wheel.TotalLat |       63420.88 |      59394.18 |    6.35% |    0.02 |
 | EventLatency.KeyPressed.TotalLatency          |       68269.21 |      63310.99 |    7.26% |    0.00 |
 | EventLatency.MouseDragged.TotalLatency        |      106393.09 |     104152.50 |    2.11% |    0.57 |
 | EventLatency.MouseMoved.TotalLatency          |      129225.65 |     113268.48 |   12.35% |    0.01 |
 | EventLatency.MousePressed.TotalLatency        |      192080.44 |     179264.31 |    6.67% |    0.26 |
 | EventLatency.MouseReleased.TotalLatency       |      152366.33 |     140309.50 |    7.91% |    0.44 |
 | EventLatency.TotalLatency                     |       68795.99 |      63862.45 |    7.17% |    0.00 |
 | EventLatency.TotalLatency_ash-Chrome          |       68795.99 |      63862.45 |    7.17% |    0.00 |

I also did another test where I measure the CFS maximum latency (using perf sched) while a YouTube video is playing, and the CFS max latency looks great too. In fact, with the vanilla RT throttling, our CFS tasks are doing really badly (perhaps because they depend on RT tasks due to locks or such). So we definitely need the DL server to use RT properly!

We are testing the DL server with 5ms/50ms and 5ms/100ms as well to see the impact. But at the moment, 5ms/30ms is looking good.

Thanks for all of your work, here's to better Linux and better Chromebooks ;)

 - Joel
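The "Change %" column above is consistent with a plain relative-improvement computation over the "before" average. A one-line C sketch of my reading of the numbers (not Joel's actual tooling):

```c
/*
 * Relative improvement of "after" over "before", in percent -- how the
 * "Change %" column in the table above appears to be computed. This is
 * my reading of the published numbers, not the test harness itself.
 */
static double change_pct(double before, double after)
{
	return 100.0 * (before - after) / before;
}
```

For example, Ash.EventLatency.Core.TotalLatency going from 90.19 us to 78.22 us is a 13.27% improvement, matching the table.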
Hi, Daniel,

Thanks a lot for your great patchset!

We have a similar starvation issue in the mm subsystem too. Details are in the patch description of the commit below. In short, task A is busy looping on some event, while task B will signal the event after some work. If the priority of task A is higher than that of task B, task B may be starved.

IIUC, if task A is an RT task while task B is a fair task, then your patchset will solve the issue. If both task A and task B are RT tasks, is there some way to solve the issue?

Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to resolve the issue; is there something better?

Best Regards,
Huang, Ying

--------------------------8<---------------------------------------
commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
Author: Guo Ziliang <guo.ziliang@zte.com.cn>
Date:   Wed Mar 16 16:15:03 2022 -0700

    mm: swap: get rid of livelock in swapin readahead

    In our testing, a livelock task was found. Through sysrq printing, same
    stack was found every time, as follows:

      __swap_duplicate+0x58/0x1a0
      swapcache_prepare+0x24/0x30
      __read_swap_cache_async+0xac/0x220
      read_swap_cache_async+0x58/0xa0
      swapin_readahead+0x24c/0x628
      do_swap_page+0x374/0x8a0
      __handle_mm_fault+0x598/0xd60
      handle_mm_fault+0x114/0x200
      do_page_fault+0x148/0x4d0
      do_translation_fault+0xb0/0xd4
      do_mem_abort+0x50/0xb0

    The reason for the livelock is that swapcache_prepare() always returns
    EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
    cannot jump out of the loop. We suspect that the task that clears the
    SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
    priority of the task stuck in a livelock so that the task that clears
    the SWAP_HAS_CACHE flag will run. The results show that the system
    returns to normal after the priority is lowered.

    In our testing, multiple real-time tasks are bound to the same core, and
    the task in the livelock is the highest priority task of the core, so
    the livelocked task cannot be preempted.
    Although cond_resched() is used by __read_swap_cache_async, it is an
    empty function in the preemptive system and cannot achieve the purpose
    of releasing the CPU. A high-priority task cannot release the CPU
    unless preempted by a higher-priority task. But when this task is
    already the highest priority task on this core, other tasks will not be
    able to be scheduled.

    So we think we should replace cond_resched() with
    schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
    call set_current_state first to set the task state, so the task will be
    removed from the running queue, so as to achieve the purpose of giving
    up the CPU and prevent it from running in kernel mode for too long.

    (akpm: ugly hack becomes uglier. But it fixes the issue in a
    backportable-to-stable fashion while we hopefully work on something
    better)

    Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
    Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
    Reported-by: Zeal Robot <zealci@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roger Quadros <rogerq@kernel.org>
    Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..ee67164531c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its page to swap cache.
 		 */
-		cond_resched();
+		schedule_timeout_uninterruptible(1);
 	}

 /*
Hi

On 2/19/24 08:33, Huang, Ying wrote:
> Hi, Daniel,
>
> Thanks a lot for your great patchset!
>
> We have a similar starvation issue in mm subsystem too. Details are in
> the patch description of the below commit. In short, task A is busy
> looping on some event, while task B will signal the event after some
> work. If the priority of task A is higher than that of task B, task B
> may be starved.

ok...

> IIUC, if task A is RT task while task B is fair task, then your patchset
> will solve the issue.

This patch set will not solve the issue. It will mitigate the effect of the problem. Still, the system will perform very poorly...

> If both task A and task B are RT tasks, is there
> some way to solve the issue?

I would say reworking the swap algorithm, as it is not meant to be used when real-time tasks are in place.

As an exercise, let's say that we add a server per priority on FIFO, with a default 50ms/1s runtime/period. Your "real-time" workload would suffer a 950ms latency, busy looping in vain. Then one would say, let's lower the parameters, so the granularity of the server would provide lower latencies. The same problem would still exist, as it exists with sched fair...

So, the solution is not in the scheduler. Busy-loop waiting is not right when you have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove spin_locks that do busy waiting. spin_locks do not have this problem you show because they disable preemption... but disabling preemption is not the best solution either.

So, a first try of duct tape would be using (local_?) locks like in PREEMPT_RT to make things sleepable...

AFAICS, this was already discussed in the previous link, right?

> Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
> resolve the issue, is there something better?

I am not a swap/mm expert... my guesses would all be on sleepable locking. But I know there are many smart people on the mm side with better guesses...
It is just that the DL server, or any type of starvation avoidance, does not seem to be a solution for your problem.

-- Daniel

> Best Regards,
> Huang, Ying
>
> [... quoted commit message snipped ...]
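The 950ms figure in Daniel's exercise above is simply the longest stretch, within one server period, that the deferred server can leave the waiting side without CPU: the whole period minus the server's guaranteed runtime. An illustrative C sketch (not kernel code):

```c
#include <stdint.h>

/*
 * Worst-case time, within one server period, during which tasks relying
 * on the server may get no CPU at all: the period minus the server's
 * guaranteed runtime. With a 50ms runtime in a 1s period, this is the
 * 950ms latency quoted in the exercise above. Illustrative only.
 */
static uint64_t worst_case_wait_ns(uint64_t runtime_ns, uint64_t period_ns)
{
	return period_ns - runtime_ns;
}
```

This is also why shrinking the parameters only shrinks the window rather than removing it: any runtime < period leaves a nonzero gap, so a busy-waiting higher-priority task can still stall its signaler for the rest of the period.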
Daniel Bristot de Oliveira <bristot@kernel.org> writes: > Hi > > On 2/19/24 08:33, Huang, Ying wrote: >> Hi, Daniel, >> >> Thanks a lot for your great patchset! >> >> We have a similar starvation issue in mm subsystem too. Details are in >> the patch description of the below commit. In short, task A is busy >> looping on some event, while task B will signal the event after some >> work. If the priority of task A is higher than that of task B, task B >> may be starved. > > ok... > >> >> IIUC, if task A is RT task while task B is fair task, then your patchset >> will solve the issue. > > This patch set will not solve the issue. It will mitigate the effect of the > problem. Still the system will perform very poorly... I don't think that it's common (or even reasonable) for real-time tasks to use swap. So, IMHO, performance isn't very important here. But, we need to avoid live-lock anyway. I think that your patchset solves the live-lock issue. >> If both task A and task B is RT tasks, is there >> some way to solve the issue? > > I would say reworking the swap algorithm, as it is not meant to be used when > real-time tasks are in place. > > As an exercise, let's say that we add a server per priority on FIFO, with a default > 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency, > busy loop in vain. If the target is only the live-lock avoidance, is it possible to run lower priority runnable tasks for a short while if we run long enough in the busy loop? > Then one would say, let's lower the parameters, so the granularity of > the server would provide lower latencies. The same problem would still > exist, as it exists with sched fair.... > > So, the solution is not on schedule. Busy loop waiting is not right when you > have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove > spin_locks that do busy waiting. spin_locks do not have this problem you > show because they disable preemption... 
> but disabling preemption is not the best solution either.
>
> So, a first try of duct tape would be using (local_?) locks like in
> preempt rt to make things sleepable...
>
> AFAICS, this was already discussed in the previous link, right?
>
>> Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
>> resolve the issue, is there something better?
>
> I am not a swap/mm expert... my guesses would all be on sleepable locking.
> But I know there are many smart people on the mm side with better guesses...
>
> It is just that the DL server, or any other type of starvation avoidance,
> does not seem to be a solution for your problem.

Yes. To improve the performance, we need something else.

--
Best Regards,
Huang, Ying

> -- Daniel
>
>> Best Regards,
>> Huang, Ying
>>
>> --------------------------8<---------------------------------------
>> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
>> Author: Guo Ziliang <guo.ziliang@zte.com.cn>
>> Date:   Wed Mar 16 16:15:03 2022 -0700
>>
>>     mm: swap: get rid of livelock in swapin readahead
>>
>>     In our testing, a livelock task was found.  Through sysrq printing, the
>>     same stack was found every time, as follows:
>>
>>       __swap_duplicate+0x58/0x1a0
>>       swapcache_prepare+0x24/0x30
>>       __read_swap_cache_async+0xac/0x220
>>       read_swap_cache_async+0x58/0xa0
>>       swapin_readahead+0x24c/0x628
>>       do_swap_page+0x374/0x8a0
>>       __handle_mm_fault+0x598/0xd60
>>       handle_mm_fault+0x114/0x200
>>       do_page_fault+0x148/0x4d0
>>       do_translation_fault+0xb0/0xd4
>>       do_mem_abort+0x50/0xb0
>>
>>     The reason for the livelock is that swapcache_prepare() always returns
>>     EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
>>     cannot jump out of the loop.  We suspect that the task that clears the
>>     SWAP_HAS_CACHE flag never gets a chance to run.  We try to lower the
>>     priority of the task stuck in the livelock so that the task that clears
>>     the SWAP_HAS_CACHE flag will run.  The results show that the system
>>     returns to normal after the priority is lowered.
>>
>>     In our testing, multiple real-time tasks are bound to the same core, and
>>     the task in the livelock is the highest-priority task of the core, so
>>     the livelocked task cannot be preempted.
>>
>>     Although cond_resched() is used by __read_swap_cache_async, it is an
>>     empty function in the preemptive system and cannot achieve the purpose
>>     of releasing the CPU.  A high-priority task cannot release the CPU
>>     unless preempted by a higher-priority task.  But when this task is
>>     already the highest-priority task on this core, other tasks will not be
>>     able to be scheduled.  So we think we should replace cond_resched() with
>>     schedule_timeout_uninterruptible(1); schedule_timeout_uninterruptible()
>>     will call set_current_state() first to set the task state, so the task
>>     will be removed from the running queue, so as to achieve the purpose of
>>     giving up the CPU and preventing it from running in kernel mode for too
>>     long.
>>
>>     (akpm: ugly hack becomes uglier.  But it fixes the issue in a
>>     backportable-to-stable fashion while we hopefully work on something
>>     better)
>>
>>     Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
>>     Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
>>     Reported-by: Zeal Robot <zealci@zte.com.cn>
>>     Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
>>     Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
>>     Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
>>     Acked-by: Hugh Dickins <hughd@google.com>
>>     Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>     Cc: Michal Hocko <mhocko@kernel.org>
>>     Cc: Minchan Kim <minchan@kernel.org>
>>     Cc: Johannes Weiner <hannes@cmpxchg.org>
>>     Cc: Roger Quadros <rogerq@kernel.org>
>>     Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
>>     Cc: <stable@vger.kernel.org>
>>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 8d4104242100..ee67164531c0 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>>  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>>  		 * in swap_map, but not yet added its page to swap cache.
>>  		 */
>> -		cond_resched();
>> +		schedule_timeout_uninterruptible(1);
>>  	}
>>
>>  	/*
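The livelock the commit describes can be modeled in user space. The sketch below is only an illustration (the names `swapcache_prepare_model` and `read_swap_cache_retry`, and the "budget" of attempts, are made up, not kernel code): the retry loop can only exit if the task holding SWAP_HAS_CACHE gets to run, which is exactly what never happens when the spinner is the highest-priority FIFO task on the core.

```c
#include <errno.h>
#include <stdbool.h>

/* One bit of swap_map state: set while another task owns the cache slot. */
static bool swap_has_cache;

/* Stand-in for swapcache_prepare(): fails while SWAP_HAS_CACHE is set. */
static int swapcache_prepare_model(void)
{
	return swap_has_cache ? -EEXIST : 0;
}

/*
 * Stand-in for the retry loop in __read_swap_cache_async().  Each failed
 * attempt "waits" once; only while the spinner actually leaves the CPU
 * can the flag owner run, so the flag clears after 'clears_after' waits.
 * Returns the attempt number that succeeded, or -1 when the budget runs
 * out: the livelock case, where the flag owner never gets scheduled.
 */
static int read_swap_cache_retry(int budget, int clears_after)
{
	for (int attempt = 1; attempt <= budget; attempt++) {
		if (swapcache_prepare_model() != -EEXIST)
			return attempt;
		if (attempt >= clears_after)
			swap_has_cache = false;	/* the other task ran */
	}
	return -1;
}
```

With cond_resched() compiling to a no-op on a preemptive kernel, the "other task ran" step never happens for the highest-priority spinner, which corresponds to the clears_after > budget case in this model.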
On 2/20/24 04:28, Huang, Ying wrote:
> Daniel Bristot de Oliveira <bristot@kernel.org> writes:
>
>> Hi
>>
>> On 2/19/24 08:33, Huang, Ying wrote:
>>> Hi, Daniel,
>>>
>>> Thanks a lot for your great patchset!
>>>
>>> We have a similar starvation issue in the mm subsystem too.  Details are
>>> in the patch description of the commit below.  In short, task A is busy
>>> looping on some event, while task B will signal the event after some
>>> work.  If the priority of task A is higher than that of task B, task B
>>> may be starved.
>>
>> ok...
>>
>>> IIUC, if task A is an RT task while task B is a fair task, then your
>>> patchset will solve the issue.
>>
>> This patch set will not solve the issue.  It will mitigate the effect of
>> the problem.  Still, the system will perform very poorly...
>
> I don't think that it's common (or even reasonable) for real-time tasks
> to use swap.  So, IMHO, performance isn't very important here.  But we
> need to avoid the live-lock anyway.  I think that your patchset solves
> the live-lock issue.

I mean, if for you this is solving your user problem, be happy :-) Play with
parameters... find a way to tune your system as a user... use it :)

But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE
(the default since... two years ago, I think). So there is not much news here.

IMHO, it is not a solution. As a developer, there is a synchronization problem
in the swap code, and pushing a workaround to the scheduling side is not the
way to go...

>>> If both task A and task B are RT tasks, is there
>>> some way to solve the issue?
>>
>> I would say reworking the swap algorithm, as it is not meant to be used
>> when real-time tasks are in place.
>>
>> As an exercise, let's say that we add a server per priority on FIFO, with
>> a default 50ms/1s runtime period.  Your "real-time" workload would suffer
>> a 950ms latency, busy looping in vain.
>
> If the target is only live-lock avoidance, is it possible to run
> lower-priority runnable tasks for a short while if we have run long enough
> in the busy loop?

If you do it on the algorithm side (instead of relying on scheduling), it
could be a thing.

I think NAPI still uses something like this: busy-loop for two jiffies in the
softirq context (a priority higher than all threads on the !rt kernel), then
move to the thread context to avoid starvation. In the swap case, it could run
for two jiffies and then go to sleep for a while. How well swap people will
receive this as a solution... I do not know :) I would first try something
better than this, using synchronization primitives.

This patch set is for things outside of kernel control. For example, people
running poll-mode DPDK in user space with FIFO priority; FIFO tasks in user
space for too long... with a better design than rt throttling.

Will this patch set help with misbehaving kernel activities? Yes. Is that a
reason not to fix kernel problems? I do not think so, and I bet many other
people do not believe so as well.

-- Daniel
Daniel Bristot de Oliveira <bristot@kernel.org> writes:

> On 2/20/24 04:28, Huang, Ying wrote:
>> Daniel Bristot de Oliveira <bristot@kernel.org> writes:
>>
>>> Hi
>>>
>>> On 2/19/24 08:33, Huang, Ying wrote:
>>>> Hi, Daniel,
>>>>
>>>> Thanks a lot for your great patchset!
>>>>
>>>> We have a similar starvation issue in the mm subsystem too.  Details are
>>>> in the patch description of the commit below.  In short, task A is busy
>>>> looping on some event, while task B will signal the event after some
>>>> work.  If the priority of task A is higher than that of task B, task B
>>>> may be starved.
>>>
>>> ok...
>>>
>>>> IIUC, if task A is an RT task while task B is a fair task, then your
>>>> patchset will solve the issue.
>>>
>>> This patch set will not solve the issue.  It will mitigate the effect of
>>> the problem.  Still, the system will perform very poorly...
>>
>> I don't think that it's common (or even reasonable) for real-time tasks
>> to use swap.  So, IMHO, performance isn't very important here.  But we
>> need to avoid the live-lock anyway.  I think that your patchset solves
>> the live-lock issue.
>
> I mean, if for you this is solving your user problem, be happy :-) Play with
> parameters... find a way to tune your system as a user... use it :)
>
> But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE
> (the default since... two years ago, I think). So there is not much news here.
>
> IMHO, it is not a solution. As a developer, there is a synchronization problem
> in the swap code, and pushing a workaround to the scheduling side is not the
> way to go...
>
>>>> If both task A and task B are RT tasks, is there
>>>> some way to solve the issue?
>>>
>>> I would say reworking the swap algorithm, as it is not meant to be used
>>> when real-time tasks are in place.
>>>
>>> As an exercise, let's say that we add a server per priority on FIFO, with
>>> a default 50ms/1s runtime period.  Your "real-time" workload would suffer
>>> a 950ms latency, busy looping in vain.
>>
>> If the target is only live-lock avoidance, is it possible to run
>> lower-priority runnable tasks for a short while if we have run long enough
>> in the busy loop?
>
> If you do it on the algorithm side (instead of relying on scheduling), it
> could be a thing.
>
> I think NAPI still uses something like this: busy-loop for two jiffies in the
> softirq context (a priority higher than all threads on the !rt kernel), then
> move to the thread context to avoid starvation. In the swap case, it could run
> for two jiffies and then go to sleep for a while. How well swap people will
> receive this as a solution... I do not know :) I would first try something
> better than this, using synchronization primitives.
>
> This patch set is for things outside of kernel control. For example, people
> running poll-mode DPDK in user space with FIFO priority; FIFO tasks in user
> space for too long... with a better design than rt throttling.
>
> Will this patch set help with misbehaving kernel activities? Yes. Is that a
> reason not to fix kernel problems? I do not think so, and I bet many other
> people do not believe so as well.

I totally agree with you that we need to fix the kernel problems. And,
thanks for your information!

--
Best Regards,
Huang, Ying
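Fixing it with synchronization primitives, the direction preferred in the thread, would mean the waiter sleeps until signaled instead of polling. A user-space POSIX analogue is sketched below; the names are invented for illustration, and an in-kernel version would use a wait queue or completion rather than pthreads:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t slot_cond = PTHREAD_COND_INITIALIZER;
static bool cache_slot_free;	/* models "SWAP_HAS_CACHE was cleared" */

/* Task A: sleeps (off the run queue) until the slot is released. */
static void wait_for_cache_slot(void)
{
	pthread_mutex_lock(&lock);
	while (!cache_slot_free)
		pthread_cond_wait(&slot_cond, &lock);
	pthread_mutex_unlock(&lock);
}

/* Task B: clears the flag and wakes any waiters, regardless of priority. */
static void release_cache_slot(void)
{
	pthread_mutex_lock(&lock);
	cache_slot_free = true;
	pthread_cond_broadcast(&slot_cond);
	pthread_mutex_unlock(&lock);
}
```

Because task A blocks in pthread_cond_wait() rather than spinning, its priority no longer matters: the scheduler is free to run task B, which then wakes A. That is exactly the property the busy loop in __read_swap_cache_async() lacks.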