Message ID | 20230320180332.102837832@redhat.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:604a:0:0:0:0:0 with SMTP id j10csp1372340wrt; Mon, 20 Mar 2023 11:37:07 -0700 (PDT) X-Google-Smtp-Source: AK7set8697cNAxe9/U8tzyQubx1RIdhhpjjtXs+NxZaNgZMb2lkt+eVs98GgSahrDqM2AQyb/xdR X-Received: by 2002:a17:902:f2cd:b0:19c:f096:bbef with SMTP id h13-20020a170902f2cd00b0019cf096bbefmr14729650plc.49.1679337427172; Mon, 20 Mar 2023 11:37:07 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1679337427; cv=none; d=google.com; s=arc-20160816; b=Sll2HfO2AS5NLj3NDkAs5WgsqmZXONyYQ1u7HC4brdX55h8Rwxn5ogcfPXJoD2QrKI f3cIo/AGxPkl3PzST8bczqqxWhYiXGSMLZNRXx94q6Zk2NwM4XO/Lw4DO4x9nDx38cSY D3OJZn8bcaDiQmsaLydznXKGd9PSUNpYWz3PjW6KCz1AROUmMNVEpeVi1KBsZJ48zGSg 9npZtiUuhd9PIs9GuBy15TqCoUb3xA8oebvtMtcjSrXbXCe+M+m3L02kfQYjG0t7tZFM TwbKFCXsmSM9MU5eXFGIPA8q8GM1EVZnSEqvw/yeXvtazNfgpISsRpO3SRj6FkM8/J5J sWzg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:subject:cc:to:from:date:user-agent:message-id :dkim-signature; bh=2/4zOYZZtH2WSGJk64M37M/sTAQLT68CCaLnUdPqKgU=; b=j3qwa1cr9DSHTSd0RbrlebCsxw5QtgM1LzFaMsdTMkbjS7cwKuTvKDQXgb+Mci9wIC HLdCjyiom2Gtqk2u2PPMS9Yev5FwCzFWYJ0fPtkdVURJL2HS0XeKbU76YbJBwHH/G6ps N4FvJ4MJSSwHmUD2G0i+3F++Iw8pSFwlkExGQtHpp4oQi6w0MOpjTzAn+tMN4+udYEJD 2KGwE8mE8LkYC+OcaUzVOZf7kPilqS+pugK7MEh+ZuHVczOTkOJDioW8ZF1GOQ2HRd30 CICJONs2+8nDOlvytEPe+02QsN6rfGDPrBBu6rwZpxNXopF8SYK5OhMF1xgv6YpjeMnT JkbQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Tn64OC0u; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id w8-20020a170902e88800b001a1c690e190si6118197plg.44.2023.03.20.11.36.52; Mon, 20 Mar 2023 11:37:07 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Tn64OC0u; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230225AbjCTSVb (ORCPT <rfc822;pusanteemu@gmail.com> + 99 others); Mon, 20 Mar 2023 14:21:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59260 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230012AbjCTSUY (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 20 Mar 2023 14:20:24 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 296EA3B223 for <linux-kernel@vger.kernel.org>; Mon, 20 Mar 2023 11:13:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1679335929; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc; bh=2/4zOYZZtH2WSGJk64M37M/sTAQLT68CCaLnUdPqKgU=; b=Tn64OC0ukcDCnDHYZsdi2e2CZM+rAxYIpRnPWNZ6ZnHiF4uXZmQBPrIJkpdiNW9EZGHqL9 dpKHBWd40D+E4uO0S5s88w/Oyw64XR5PJFStnbWxFEuphZ9iXjwV7Uvm6f16NAm2cAo/VQ rDVj8txia6Q8h69/GW6cmG3glha3z28= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-384-zhto1ZHtMbmUSZaKIzWElw-1; Mon, 20 Mar 2023 14:12:04 -0400 X-MC-Unique: zhto1ZHtMbmUSZaKIzWElw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 9500A38173D9; Mon, 20 Mar 2023 18:12:03 +0000 (UTC) Received: from tpad.localdomain (ovpn-112-2.gru2.redhat.com [10.97.112.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 20A08C15BA0; Mon, 20 Mar 2023 18:12:03 +0000 (UTC) Received: by tpad.localdomain (Postfix, from userid 1000) id CF185403A3E4B; Mon, 20 Mar 2023 15:08:02 -0300 (-03) Message-ID: <20230320180332.102837832@redhat.com> User-Agent: quilt/0.67 Date: Mon, 20 Mar 2023 15:03:32 -0300 From: Marcelo Tosatti <mtosatti@redhat.com> To: Christoph Lameter <cl@linux.com> Cc: Aaron Tomlin <atomlin@atomlin.com>, Frederic Weisbecker <frederic@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Russell King <linux@armlinux.org.uk>, Huacai Chen <chenhuacai@kernel.org>, Heiko Carstens <hca@linux.ibm.com>, x86@kernel.org, Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com> Subject: [PATCH v7 00/13] fold per-CPU vmstats remotely X-Scanned-By: MIMEDefang 3.1 on 10.11.54.8 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760912922121998027?= X-GMAIL-MSGID: =?utf-8?q?1760912922121998027?= |
Series |
fold per-CPU vmstats remotely
|
|
Message
Marcelo Tosatti
March 20, 2023, 6:03 p.m. UTC
This patch series addresses the following two problems: 1. A customer provided evidence indicating that a process was stalled in direct reclaim: - The process was trapped in throttle_direct_reclaim(). The function wait_event_killable() was called to wait condition allow_direct_reclaim(pgdat) for current node to be true. The allow_direct_reclaim(pgdat) examined the number of free pages on the node by zone_page_state() which just returns value in zone->vm_stat[NR_FREE_PAGES]. - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0. However, the freelist on this node was not empty. - This inconsistent of vmstat value was caused by percpu vmstat on nohz_full cpus. Every increment/decrement of vmstat is performed on percpu vmstat counter at first, then pooled diffs are cumulated to the zone's vmstat counter in timely manner. However, on nohz_full cpus (in case of this customer's system, 48 of 52 cpus) these pooled diffs were not cumulated once the cpu had no event on it so that the cpu started sleeping infinitely. I checked percpu vmstat and found there were total 69 counts not cumulated to the zone's vmstat counter yet. - In this situation, kswapd did not help the trapped process. In pgdat_balanced(), zone_wakermark_ok_safe() examined the number of free pages on the node by zone_page_state_snapshot() which checks pending counts on percpu vmstat. Therefore kswapd could know there were 69 free pages correctly. Since zone->_watermark = {8, 20, 32}, kswapd did not work because 69 was greater than 32 as high watermark. 2. With a task that busy loops on a given CPU, the kworker interruption to execute vmstat_update is undesired and may exceed latency thresholds for certain applications. By having vmstat_shepherd flush the per-CPU counters to the global counters from remote CPUs. This is done using cmpxchg to manipulate the counters, both CPU locally (via the account functions), and remotely (via cpu_vm_stats_fold). Thanks to Aaron Tomlin for diagnosing issue 1 and writing the initial patch series. Performance details for the kworker interruption: oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... The example above shows an additional 7us for the oslat -> kworker -> oslat switches. In the case of a virtualized CPU, and the vmstat_update interruption in the host (of a qemu-kvm vcpu), the latency penalty observed in the guest is higher than 50us, violating the acceptable latency threshold for certain applications. v7: - Fix allow_direct_reclaim issue by using zone_page_state_snapshot (Michal Hocko) v6: - Add more information on throttle_direct_reclaim problem to commit logs (Michal Hocko) v5: - Drop "mm/vmstat: remove remote node draining" (Vlastimil Babka) - Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka) v4: - Switch per-CPU vmstat counters to s32, required by RISC-V, ARC architectures v3: - Removed unused drain_zone_pages and changes variable (David Hildenbrand) - Use xchg instead of cmpxchg in refresh_cpu_vm_stats (Peter Xu) - Add drain_all_pages to vmstat_refresh to make stats more accurate (Peter Xu) - Improve changelog of "mm/vmstat: switch counter modification to cmpxchg" (Peter Xu / David) - Improve changelog of "mm/vmstat: remove remote node draining" (David Hildenbrand) v2: - actually use LOCK CMPXCHG on counter mod/inc/dec functions (Christoph Lameter) - use try_cmpxchg for cmpxchg loops (Uros Bizjak / Matthew Wilcox) arch/arm64/include/asm/percpu.h | 16 ++ arch/loongarch/include/asm/percpu.h | 23 +++- arch/s390/include/asm/percpu.h | 5 arch/x86/include/asm/percpu.h | 39 +++--- include/asm-generic/percpu.h | 17 ++ include/linux/mmzone.h | 8 - include/linux/percpu-defs.h | 2 include/linux/vmstat.h | 2 kernel/fork.c | 2 kernel/scs.c | 2 mm/page_alloc.c | 5 mm/vmscan.c | 2 mm/vmstat.c | 440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------ 13 files changed, 361 insertions(+), 202 deletions(-)
Comments
On Mon 20-03-23 15:03:33, Marcelo Tosatti wrote: > A customer provided evidence indicating that a process > was stalled in direct reclaim: > > - The process was trapped in throttle_direct_reclaim(). > The function wait_event_killable() was called to wait condition > allow_direct_reclaim(pgdat) for current node to be true. > The allow_direct_reclaim(pgdat) examined the number of free pages > on the node by zone_page_state() which just returns value in > zone->vm_stat[NR_FREE_PAGES]. > > - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0. > However, the freelist on this node was not empty. > > - This inconsistent of vmstat value was caused by percpu vmstat on > nohz_full cpus. Every increment/decrement of vmstat is performed > on percpu vmstat counter at first, then pooled diffs are cumulated > to the zone's vmstat counter in timely manner. However, on nohz_full > cpus (in case of this customer's system, 48 of 52 cpus) these pooled > diffs were not cumulated once the cpu had no event on it so that > the cpu started sleeping infinitely. > I checked percpu vmstat and found there were total 69 counts not > cumulated to the zone's vmstat counter yet. > > - In this situation, kswapd did not help the trapped process. > In pgdat_balanced(), zone_wakermark_ok_safe() examined the number > of free pages on the node by zone_page_state_snapshot() which > checks pending counts on percpu vmstat. > Therefore kswapd could know there were 69 free pages correctly. > Since zone->_watermark = {8, 20, 32}, kswapd did not work because > 69 was greater than 32 as high watermark. > > Change allow_direct_reclaim to use zone_page_state_snapshot, which > allows a more precise version of the vmstat counters to be used. > > allow_direct_reclaim will only be called from try_to_free_pages, > which is not a hot path. Have you managed to test this patch to confirm it addresses the above issue? It should but better double check that. > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> The patch makes sense regardless but a note about testing should be added. Acked-by: Michal Hocko <mhocko@suse.com> > > --- > > Index: linux-vmstat-remote/mm/vmscan.c > =================================================================== > --- linux-vmstat-remote.orig/mm/vmscan.c > +++ linux-vmstat-remote/mm/vmscan.c > @@ -6861,7 +6861,7 @@ static bool allow_direct_reclaim(pg_data > continue; > > pfmemalloc_reserve += min_wmark_pages(zone); > - free_pages += zone_page_state(zone, NR_FREE_PAGES); > + free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES); > } > > /* If there are no reserves (unexpected config) then do not throttle */ >
On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote: > This patch series addresses the following two problems: > > 1. A customer provided evidence indicating that a process > was stalled in direct reclaim: > This is addressed by the trivial patch 1. [...] > 2. With a task that busy loops on a given CPU, > the kworker interruption to execute vmstat_update > is undesired and may exceed latency thresholds > for certain applications. Yes it can but why does that matter? > By having vmstat_shepherd flush the per-CPU counters to the > global counters from remote CPUs. > > This is done using cmpxchg to manipulate the counters, > both CPU locally (via the account functions), > and remotely (via cpu_vm_stats_fold). > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing > the initial patch series. > > > Performance details for the kworker interruption: > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > The example above shows an additional 7us for the > > oslat -> kworker -> oslat > > switches. In the case of a virtualized CPU, and the vmstat_update > interruption in the host (of a qemu-kvm vcpu), the latency penalty > observed in the guest is higher than 50us, violating the acceptable > latency threshold for certain applications. I do not think we have ever promissed any specific latency guarantees for vmstat. These are statistics have been mostly used for debugging purposes AFAIK. I am not aware of any specific user space use case that would be latency sensitive. Your changelog doesn't go into details there either. [...] > mm/vmstat.c | 440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------ This requires much more detailed story why we really need that.
On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote: > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote: > > This patch series addresses the following two problems: > > > > 1. A customer provided evidence indicating that a process > > was stalled in direct reclaim: > > > This is addressed by the trivial patch 1. > > [...] > > 2. With a task that busy loops on a given CPU, > > the kworker interruption to execute vmstat_update > > is undesired and may exceed latency thresholds > > for certain applications. > > Yes it can but why does that matter? It matters for the application that is executing and expects not to be interrupted. > > By having vmstat_shepherd flush the per-CPU counters to the > > global counters from remote CPUs. > > > > This is done using cmpxchg to manipulate the counters, > > both CPU locally (via the account functions), > > and remotely (via cpu_vm_stats_fold). > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing > > the initial patch series. > > > > > > Performance details for the kworker interruption: > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > The example above shows an additional 7us for the > > > > oslat -> kworker -> oslat > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > observed in the guest is higher than 50us, violating the acceptable > > latency threshold for certain applications. > > I do not think we have ever promissed any specific latency guarantees > for vmstat. These are statistics have been mostly used for debugging > purposes AFAIK. I am not aware of any specific user space use case that > would be latency sensitive. Your changelog doesn't go into details there > either. There is a class of workloads for which response time can be of interest. MAC scheduler is an example: https://par.nsf.gov/servlets/purl/10090368 Thanks!
On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote: > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote: > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote: > > > This patch series addresses the following two problems: > > > > > > 1. A customer provided evidence indicating that a process > > > was stalled in direct reclaim: > > > > > This is addressed by the trivial patch 1. > > > > [...] > > > 2. With a task that busy loops on a given CPU, > > > the kworker interruption to execute vmstat_update > > > is undesired and may exceed latency thresholds > > > for certain applications. > > > > Yes it can but why does that matter? > > It matters for the application that is executing and expects > not to be interrupted. Those workloads shouldn't enter the kernel in the first place, no? Otherwise the in kernel execution with all the direct or indirect dependencies (e.g. via locks) can throw any latency expectations off the window. > > > By having vmstat_shepherd flush the per-CPU counters to the > > > global counters from remote CPUs. > > > > > > This is done using cmpxchg to manipulate the counters, > > > both CPU locally (via the account functions), > > > and remotely (via cpu_vm_stats_fold). > > > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing > > > the initial patch series. > > > > > > > > > Performance details for the kworker interruption: > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > The example above shows an additional 7us for the > > > > > > oslat -> kworker -> oslat > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > observed in the guest is higher than 50us, violating the acceptable > > > latency threshold for certain applications. > > > > I do not think we have ever promissed any specific latency guarantees > > for vmstat. These are statistics have been mostly used for debugging > > purposes AFAIK. I am not aware of any specific user space use case that > > would be latency sensitive. Your changelog doesn't go into details there > > either. > > There is a class of workloads for which response time can be > of interest. MAC scheduler is an example: > > https://par.nsf.gov/servlets/purl/10090368 Yes, I am not disputing low latency workloads in general. I am just saying that you haven't really established a very sound justification here. Of course there are workloads which do not want to conflict with any in kernel house keeping. Those have to be configured and implemented very carefully though. Vmstat as such should not collide with those workloads as long as they do not interact with the kernel in a way counters are updated. Is this hard or impossible to avoid? I can imagine that those workloads have an start up sequence where the kernel is involved and counters updated so that deferred flushing could interfere with the later and latency sensitive phase. Is that a real problem in practice? Please tell us much more why we need to make the vmstat code more complex. Thanks!
On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote: > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote: > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote: > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote: > > > > This patch series addresses the following two problems: > > > > > > > > 1. A customer provided evidence indicating that a process > > > > was stalled in direct reclaim: > > > > > > > This is addressed by the trivial patch 1. > > > > > > [...] > > > > 2. With a task that busy loops on a given CPU, > > > > the kworker interruption to execute vmstat_update > > > > is undesired and may exceed latency thresholds > > > > for certain applications. > > > > > > Yes it can but why does that matter? > > > > It matters for the application that is executing and expects > > not to be interrupted. > > Those workloads shouldn't enter the kernel in the first place, no? It depends on the latency requirements and individual system calls. > Otherwise the in kernel execution with all the direct or indirect > dependencies (e.g. via locks) can throw any latency expectations off the > window. > > > > > By having vmstat_shepherd flush the per-CPU counters to the > > > > global counters from remote CPUs. > > > > > > > > This is done using cmpxchg to manipulate the counters, > > > > both CPU locally (via the account functions), > > > > and remotely (via cpu_vm_stats_fold). > > > > > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing > > > > the initial patch series. > > > > > > > > > > > > Performance details for the kworker interruption: > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > The example above shows an additional 7us for the > > > > > > > > oslat -> kworker -> oslat > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > observed in the guest is higher than 50us, violating the acceptable > > > > latency threshold for certain applications. > > > > > > I do not think we have ever promissed any specific latency guarantees > > > for vmstat. These are statistics have been mostly used for debugging > > > purposes AFAIK. I am not aware of any specific user space use case that > > > would be latency sensitive. Your changelog doesn't go into details there > > > either. > > > > There is a class of workloads for which response time can be > > of interest. MAC scheduler is an example: > > > > https://par.nsf.gov/servlets/purl/10090368 > > Yes, I am not disputing low latency workloads in general. I am just > saying that you haven't really established a very sound justification > here. The -v7 cover letter was updated with additional details, as you requested (perhaps you missed it): "Performance details for the kworker interruption: oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... The example above shows an additional 7us for the oslat -> kworker -> oslat switches. In the case of a virtualized CPU, and the vmstat_update interruption in the host (of a qemu-kvm vcpu), the latency penalty observed in the guest is higher than 50us, violating the acceptable latency threshold for certain applications." > Of course there are workloads which do not want to conflict with > any in kernel house keeping. Those have to be configured and implemented > very carefully though. Vmstat as such should not collide with those > workloads as long as they do not interact with the kernel in a way > counters are updated. Is this hard or impossible to avoid? The practical problem we have been seeing is -RT app initialization. For example: 1) mlock(); 2) enter loop without system calls > I can imagine that those workloads have an start up sequence where the > kernel is involved and counters updated so that deferred flushing could > interfere with the later and latency sensitive phase. Is that a real > problem in practice? Please tell us much more why we need to make the > vmstat code more complex. Yes, it is. I have attached traces and performance numbers above.
On Wed 22-03-23 08:23:21, Marcelo Tosatti wrote: > On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote: > > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote: > > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote: > > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote: > > > > > This patch series addresses the following two problems: > > > > > > > > > > 1. A customer provided evidence indicating that a process > > > > > was stalled in direct reclaim: > > > > > > > > > This is addressed by the trivial patch 1. > > > > > > > > [...] > > > > > 2. With a task that busy loops on a given CPU, > > > > > the kworker interruption to execute vmstat_update > > > > > is undesired and may exceed latency thresholds > > > > > for certain applications. > > > > > > > > Yes it can but why does that matter? > > > > > > It matters for the application that is executing and expects > > > not to be interrupted. > > > > Those workloads shouldn't enter the kernel in the first place, no? > > It depends on the latency requirements and individual system calls. > > > Otherwise the in kernel execution with all the direct or indirect > > dependencies (e.g. via locks) can throw any latency expectations off the > > window. > > > > > > > By having vmstat_shepherd flush the per-CPU counters to the > > > > > global counters from remote CPUs. > > > > > > > > > > This is done using cmpxchg to manipulate the counters, > > > > > both CPU locally (via the account functions), > > > > > and remotely (via cpu_vm_stats_fold). > > > > > > > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing > > > > > the initial patch series. > > > > > > > > > > > > > > > Performance details for the kworker interruption: > > > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > > > The example above shows an additional 7us for the > > > > > > > > > > oslat -> kworker -> oslat > > > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > > observed in the guest is higher than 50us, violating the acceptable > > > > > latency threshold for certain applications. > > > > > > > > I do not think we have ever promissed any specific latency guarantees > > > > for vmstat. These are statistics have been mostly used for debugging > > > > purposes AFAIK. I am not aware of any specific user space use case that > > > > would be latency sensitive. Your changelog doesn't go into details there > > > > either. > > > > > > There is a class of workloads for which response time can be > > > of interest. MAC scheduler is an example: > > > > > > https://par.nsf.gov/servlets/purl/10090368 > > > > Yes, I am not disputing low latency workloads in general. I am just > > saying that you haven't really established a very sound justification > > here. > > The -v7 cover letter was updated with additional details, > as you requested (perhaps you missed it): > > "Performance details for the kworker interruption: > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > The example above shows an additional 7us for the > > oslat -> kworker -> oslat > > switches. In the case of a virtualized CPU, and the vmstat_update > interruption in the host (of a qemu-kvm vcpu), the latency penalty > observed in the guest is higher than 50us, violating the acceptable > latency threshold for certain applications." Yes, I have seen that but it doesn't really give a wider context to understand why those numbers matter. > > Of course there are workloads which do not want to conflict with > > any in kernel house keeping. Those have to be configured and implemented > > very carefully though. Vmstat as such should not collide with those > > workloads as long as they do not interact with the kernel in a way > > counters are updated. Is this hard or impossible to avoid? > > The practical problem we have been seeing is -RT app initialization. > For example: > > 1) mlock(); > 2) enter loop without system calls OK, that is what I have kinda expected. Would have been better to mention it explicitly. I expect this to be a very common pattern and vmstat might not be the only subsystem that could interfere later on. Would it make more sense to address this by a more generic solution? E.g. a syscall to flush all per-cpu caches so they won't interfere later unless userspace hits the kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)? The above pattern could then be implemented as do_initial_setup() sched_setaffinity(getpid(), cpumask); flush_cpu_caches(cpumask, 0); do_userspace_loop()
On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote: > On Wed 22-03-23 08:23:21, Marcelo Tosatti wrote: > > On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote: > > > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote: > > > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote: > > > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote: > > > > > > This patch series addresses the following two problems: > > > > > > > > > > > > 1. A customer provided evidence indicating that a process > > > > > > was stalled in direct reclaim: > > > > > > > > > > > This is addressed by the trivial patch 1. > > > > > > > > > > [...] > > > > > > 2. With a task that busy loops on a given CPU, > > > > > > the kworker interruption to execute vmstat_update > > > > > > is undesired and may exceed latency thresholds > > > > > > for certain applications. > > > > > > > > > > Yes it can but why does that matter? > > > > > > > > It matters for the application that is executing and expects > > > > not to be interrupted. > > > > > > Those workloads shouldn't enter the kernel in the first place, no? > > > > It depends on the latency requirements and individual system calls. > > > > > Otherwise the in kernel execution with all the direct or indirect > > > dependencies (e.g. via locks) can throw any latency expectations off the > > > window. > > > > > > > > > By having vmstat_shepherd flush the per-CPU counters to the > > > > > > global counters from remote CPUs. > > > > > > > > > > > > This is done using cmpxchg to manipulate the counters, > > > > > > both CPU locally (via the account functions), > > > > > > and remotely (via cpu_vm_stats_fold). > > > > > > > > > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing > > > > > > the initial patch series. > > > > > > > > > > > > > > > > > > Performance details for the kworker interruption: > > > > > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > > > > > The example above shows an additional 7us for the > > > > > > > > > > > > oslat -> kworker -> oslat > > > > > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > > > observed in the guest is higher than 50us, violating the acceptable > > > > > > latency threshold for certain applications. > > > > > > > > > > I do not think we have ever promissed any specific latency guarantees > > > > > for vmstat. These are statistics have been mostly used for debugging > > > > > purposes AFAIK. I am not aware of any specific user space use case that > > > > > would be latency sensitive. Your changelog doesn't go into details there > > > > > either. > > > > > > > > There is a class of workloads for which response time can be > > > > of interest. MAC scheduler is an example: > > > > > > > > https://par.nsf.gov/servlets/purl/10090368 > > > > > > Yes, I am not disputing low latency workloads in general. I am just > > > saying that you haven't really established a very sound justification > > > here. > > > > The -v7 cover letter was updated with additional details, > > as you requested (perhaps you missed it): > > > > "Performance details for the kworker interruption: > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > The example above shows an additional 7us for the > > > > oslat -> kworker -> oslat > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > observed in the guest is higher than 50us, violating the acceptable > > latency threshold for certain applications." > > Yes, I have seen that but it doesn't really give a wider context to > understand why those numbers matter. OK. "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us interruption observed in a guest (which is above the safety threshold for this application)." Is that OK? > > > Of course there are workloads which do not want to conflict with > > > any in kernel house keeping. Those have to be configured and implemented > > > very carefully though. Vmstat as such should not collide with those > > > workloads as long as they do not interact with the kernel in a way > > > counters are updated. Is this hard or impossible to avoid? > > > > The practical problem we have been seeing is -RT app initialization. > > For example: > > > > 1) mlock(); > > 2) enter loop without system calls > > OK, that is what I have kinda expected. Would have been better to > mention it explicitly. > > I expect this to be a very common pattern and vmstat might not be the > only subsystem that could interfere later on. Would it make more sense > to address this by a more generic solution? E.g. a syscall to flush all > per-cpu caches so they won't interfere later unless userspace hits the > kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)? > The above pattern could then be implemented as > > do_initial_setup() > sched_setaffinity(getpid(), cpumask); > flush_cpu_caches(cpumask, 0); > do_userspace_loop() I would argue that fixing this without introducing a userspace tunable is more generic as all programs (modified to use a syscall or not) benefit from the improvement. HPC workloads, for example. But it might be necessary to do what you suggest for other reasons (where you'd want a behaviour to be enabled which is undesired for other application types).
On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote: > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote: [...] > > > "Performance details for the kworker interruption: > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > The example above shows an additional 7us for the > > > > > > oslat -> kworker -> oslat > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > observed in the guest is higher than 50us, violating the acceptable > > > latency threshold for certain applications." > > > > Yes, I have seen that but it doesn't really give a wider context to > > understand why those numbers matter. > > OK. > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us > interruption observed in a guest (which is above the safety > threshold for this application)." > > Is that OK? This might be a sufficient information for somebody familiar with the matter (not me). So no, not enough. We need to hear a more complete story.
On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote: > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote: > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote: > [...] > > > > "Performance details for the kworker interruption: > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > The example above shows an additional 7us for the > > > > > > > > oslat -> kworker -> oslat > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > observed in the guest is higher than 50us, violating the acceptable > > > > latency threshold for certain applications." > > > > > > Yes, I have seen that but it doesn't really give a wider context to > > > understand why those numbers matter. > > > > OK. > > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us > > interruption observed in a guest (which is above the safety > > threshold for this application)." > > > > Is that OK? > > This might be a sufficient information for somebody familiar with the > matter (not me). So no, not enough. We need to hear a more complete > story. Michal, Please refer to https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf 2.3 Channel Dependent Scheduling The purpose of scheduling is to decide which terminal will transmit data on which set of resource blocks with what transport format to use. The objective is to assign resources to the terminal such that the quality of service (QoS) requirement is fulfilled. Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the same length of Transmission Time Interval (TTI) in LTE system. In general: https://en.wikipedia.org/wiki/Real-time_computing Real-time computing (RTC) is the computer science term for hardware and software systems subject to a "real-time constraint", for example from event to system response.[1] Real-time programs must guarantee response within specified time constraints, often referred to as "deadlines".[2] Real-time responses are often understood to be in the order of milliseconds, and sometimes microseconds. A system not specified as operating in real time cannot usually guarantee a response within any timeframe, although typical or expected response times may be given. Real-time processing fails if not completed within a specified deadline relative to an event; deadlines must always be met, regardless of system load. For example, for the MAC scheduler processing must occur every 1ms, and a certain amount of computation takes place (and must finish before the next 1ms timeframe). A > 50us latency spike as observed by cyclictest is considered a "failure".
On Thu, Mar 23, 2023 at 07:52:22AM -0300, Marcelo Tosatti wrote: > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote: > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote: > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote: > > [...] > > > > > "Performance details for the kworker interruption: > > > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > > > The example above shows an additional 7us for the > > > > > > > > > > oslat -> kworker -> oslat > > > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > > observed in the guest is higher than 50us, violating the acceptable > > > > > latency threshold for certain applications." > > > > > > > > Yes, I have seen that but it doesn't really give a wider context to > > > > understand why those numbers matter. > > > > > > OK. > > > > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us > > > interruption observed in a guest (which is above the safety > > > threshold for this application)." > > > > > > Is that OK? > > > > This might be a sufficient information for somebody familiar with the > > matter (not me). So no, not enough. We need to hear a more complete > > story. > > Michal, > > Please refer to > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf > > 2.3 Channel Dependent Scheduling > The purpose of scheduling is to decide which terminal will transmit data on which set > of resource blocks with what transport format to use. The objective is to assign > resources to the terminal such that the quality of service (QoS) requirement is fulfilled. > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the > same length of Transmission Time Interval (TTI) in LTE system. > > In general: > > https://en.wikipedia.org/wiki/Real-time_computing > > Real-time computing (RTC) is the computer science term for hardware and > software systems subject to a "real-time constraint", for example from > event to system response.[1] Real-time programs must guarantee response > within specified time constraints, often referred to as "deadlines".[2] > > Real-time responses are often understood to be in the order of > milliseconds, and sometimes microseconds. A system not specified as > operating in real time cannot usually guarantee a response within any > timeframe, although typical or expected response times may be given. > Real-time processing fails if not completed within a specified deadline > relative to an event; deadlines must always be met, regardless of system > load. > > For example, for the MAC scheduler processing must occur every 1ms, > and a certain amount of computation takes place (and must finish before > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest > is considered a "failure". If you need more detail, will have to ask someone else, because that is all I know.
On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote: > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote: > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote: > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote: > > [...] > > > > > "Performance details for the kworker interruption: > > > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > > > The example above shows an additional 7us for the > > > > > > > > > > oslat -> kworker -> oslat > > > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > > observed in the guest is higher than 50us, violating the acceptable > > > > > latency threshold for certain applications." > > > > > > > > Yes, I have seen that but it doesn't really give a wider context to > > > > understand why those numbers matter. > > > > > > OK. > > > > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us > > > interruption observed in a guest (which is above the safety > > > threshold for this application)." > > > > > > Is that OK? > > > > This might be a sufficient information for somebody familiar with the > > matter (not me). So no, not enough. We need to hear a more complete > > story. > > Michal, > > Please refer to > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf > > 2.3 Channel Dependent Scheduling > The purpose of scheduling is to decide which terminal will transmit data on which set > of resource blocks with what transport format to use. The objective is to assign > resources to the terminal such that the quality of service (QoS) requirement is fulfilled. > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the > same length of Transmission Time Interval (TTI) in LTE system. > > In general: > > https://en.wikipedia.org/wiki/Real-time_computing Thank you, but not something I was really asking for (repeatedly). I am pretty aware of what RT computing is about. I am not really interested in a generic fluff. I am asking about specific usecases you have in mind when pushing these changes. > For example, for the MAC scheduler processing must occur every 1ms, > and a certain amount of computation takes place (and must finish before > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest > is considered a "failure". OK, you are claiming that much but you are not really filling up other holes in your story. Let me just outline few questions I have. Your measurements talk about 7us overhead the vmstat processing might add. This is really far from > 50us above. You suggest that this is an effect of the workload running in a guest without more details. I am quite surprised to hear about RT expectations inside a guest system TBH. All that being said, it would be really helpful if you were more specific about the workload and why there is no other way but making vmstat infrastructure more complex (it is quite complex on its own). Thanks!
On Thu, Mar 23, 2023 at 01:17:32PM +0100, Michal Hocko wrote: > On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote: > > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote: > > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote: > > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote: > > > [...] > > > > > > "Performance details for the kworker interruption: > > > > > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > > > > > The example above shows an additional 7us for the > > > > > > > > > > > > oslat -> kworker -> oslat > > > > > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > > > observed in the guest is higher than 50us, violating the acceptable > > > > > > latency threshold for certain applications." > > > > > > > > > > Yes, I have seen that but it doesn't really give a wider context to > > > > > understand why those numbers matter. > > > > > > > > OK. > > > > > > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us > > > > interruption observed in a guest (which is above the safety > > > > threshold for this application)." > > > > > > > > Is that OK? > > > > > > This might be a sufficient information for somebody familiar with the > > > matter (not me). So no, not enough. We need to hear a more complete > > > story. > > > > Michal, > > > > Please refer to > > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf > > > > 2.3 Channel Dependent Scheduling > > The purpose of scheduling is to decide which terminal will transmit data on which set > > of resource blocks with what transport format to use. The objective is to assign > > resources to the terminal such that the quality of service (QoS) requirement is fulfilled. > > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the > > same length of Transmission Time Interval (TTI) in LTE system. > > > > In general: > > > > https://en.wikipedia.org/wiki/Real-time_computing > > Thank you, but not something I was really asking for (repeatedly). I am > pretty aware of what RT computing is about. I am not really interested > in a generic fluff. I am asking about specific usecases you have in mind > when pushing these changes. > > > For example, for the MAC scheduler processing must occur every 1ms, > > and a certain amount of computation takes place (and must finish before > > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest > > is considered a "failure". > > OK, you are claiming that much but you are not really filling up other > holes in your story. Let me just outline few questions I have. Your > measurements talk about 7us overhead the vmstat processing might add. > This is really far from > 50us above. 7us in the host, for the following sched_switch events: oslat -> kworker kworker -> oslat However, if the impact is for a virtualized application: oslat, executing via qemu-vcpu process in the host. oslat executing qemu-vcpu VM-EXIT qemu-vcpu -> kworker kworker -> qemu-vcpu qemu-vcpu VM-ENTRY is much higher than the 7us (can be above 100us). > You suggest that this is an effect > of the workload running in a guest without more details. I am quite > surprised to hear about RT expectations inside a guest system TBH. https://www.youtube.com/watch?v=zIDwc6uDszY > All that being said, it would be really helpful if you were more > specific about the workload and why there is no other way but making > vmstat infrastructure more complex (it is quite complex on its own). The patchset is just changing vmstat_shepherd from happening locally to happening remotely. There are a number of algorithms in the kernel that deal with concurrent access already. What you think this particular patchset makes things complicated and what can be done to make it simpler?
On Thu, Mar 23, 2023 at 10:30:13AM -0300, Marcelo Tosatti wrote: > On Thu, Mar 23, 2023 at 01:17:32PM +0100, Michal Hocko wrote: > > On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote: > > > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote: > > > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote: > > > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote: > > > > [...] > > > > > > > "Performance details for the kworker interruption: > > > > > > > > > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > > > > > > > > > > > The example above shows an additional 7us for the > > > > > > > > > > > > > > oslat -> kworker -> oslat > > > > > > > > > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > > > > > > observed in the guest is higher than 50us, violating the acceptable > > > > > > > latency threshold for certain applications." > > > > > > > > > > > > Yes, I have seen that but it doesn't really give a wider context to > > > > > > understand why those numbers matter. > > > > > > > > > > OK. > > > > > > > > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us > > > > > interruption observed in a guest (which is above the safety > > > > > threshold for this application)." > > > > > > > > > > Is that OK? > > > > > > > > This might be a sufficient information for somebody familiar with the > > > > matter (not me). So no, not enough. We need to hear a more complete > > > > story. > > > > > > Michal, > > > > > > Please refer to > > > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf > > > > > > 2.3 Channel Dependent Scheduling > > > The purpose of scheduling is to decide which terminal will transmit data on which set > > > of resource blocks with what transport format to use. The objective is to assign > > > resources to the terminal such that the quality of service (QoS) requirement is fulfilled. > > > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the > > > same length of Transmission Time Interval (TTI) in LTE system. > > > > > > In general: > > > > > > https://en.wikipedia.org/wiki/Real-time_computing > > > > Thank you, but not something I was really asking for (repeatedly). I am > > pretty aware of what RT computing is about. I am not really interested > > in a generic fluff. I am asking about specific usecases you have in mind > > when pushing these changes. > > > > > For example, for the MAC scheduler processing must occur every 1ms, > > > and a certain amount of computation takes place (and must finish before > > > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest > > > is considered a "failure". > > > > OK, you are claiming that much but you are not really filling up other > > holes in your story. Let me just outline few questions I have. Your > > measurements talk about 7us overhead the vmstat processing might add. > > This is really far from > 50us above. > > 7us in the host, for the following sched_switch events: > > oslat -> kworker > kworker -> oslat > > However, if the impact is for a virtualized application: > > oslat, executing via qemu-vcpu process in the host. > > oslat executing > qemu-vcpu VM-EXIT > qemu-vcpu -> kworker > kworker -> qemu-vcpu > qemu-vcpu VM-ENTRY > > is much higher than the 7us (can be above 100us). And nothing prevents this from happening: oslat executing qemu-vcpu VM-EXIT qemu-vcpu -> kworker (in the host, to handle vmstat_update) kworker -> qemu-vcpu qemu-vcpu VM-ENTRY oslat -> kworker (in the guest, to handle vmstat_update) kworker -> oslat
On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote: > This patch series addresses the following two problems: > > 1. A customer provided evidence indicating that a process > was stalled in direct reclaim: > > ... > > 2. With a task that busy loops on a given CPU, > the kworker interruption to execute vmstat_update > is undesired and may exceed latency thresholds > for certain applications. > I don't think I'll be sending this upstream in the next merge window. Because it isn't clear that the added complexity in vmstat handling is justified. - Michal's request for more clarity on the end-user requirements seems reasonable. - You have indicated that additional changelog material is forthcoming. - The alternative idea of adding a syscall which tells the kernel "I'm about to go realtime, so please clear away all the pending crap which might later interrupt me" sounds pretty good. Partly because there are surely other places where we can use this. Partly because it moves all the crap-clearing into special crap-clearing code paths while adding less burden to the commonly-executed code. And I don't think this alternative has been fully investigated and discussed.
On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote: > On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > This patch series addresses the following two problems: > > > > 1. A customer provided evidence indicating that a process > > was stalled in direct reclaim: > > > > ... > > > > 2. With a task that busy loops on a given CPU, > > the kworker interruption to execute vmstat_update > > is undesired and may exceed latency thresholds > > for certain applications. > > > > I don't think I'll be sending this upstream in the next merge window. > Because it isn't clear that the added complexity in vmstat handling is > justified. From my POV this is an incorrect statement (that the complexity in vmstat handling is not justified). Andrew, this is the 3rd attempt to fix this problem: First try: https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ Second try: https://patchew.org/linux/20230105125218.031928326@redhat.com/ Third try: syncing vmstats remotely from vmstat_shepherd (this patchset). And also, can you please explain: what is so complicated about the vmstat handling? cmpxchg has been around and is used all over the kernel, and nobody considers "excessively complicated". > - Michal's request for more clarity on the end-user requirements > seems reasonable. And i explained to Michal in great detail where the end-user requirements come from. For virtualized workloads, there are two types of use-cases: 1) For example, for the MAC scheduler processing must occur every 1ms, and a certain amount of computation takes place (and must finish before the next 1ms timeframe). A > 50us latency spike as observed by cyclictest is considered a "failure". I showed him a 7us trace caused by, and explained that will extend to > 50us in the case of virtualized vCPU. 2) PLCs. These workloads will also suffer > 50us latency spikes which is undesirable. Can you please explain what additional clarity is required? RH's performance team, for example, has been performing packet latency tests and waiting for this issue to be fixed for about 2 years now. Andrew Theurer, can you please explain what problem is the vmstat_work interruption causing in your testing? > - You have indicated that additional changelog material is forthcoming. Not really. Do you think additional information on the changelog is necessary? > - The alternative idea of adding a syscall which tells the kernel > "I'm about to go realtime, so please clear away all the pending crap > which might later interrupt me" sounds pretty good. > > Partly because there are surely other places where we can use this. > > Partly because it moves all the crap-clearing into special > crap-clearing code paths while adding less burden to the > commonly-executed code. > > And I don't think this alternative has been fully investigated and > discussed. This was tried before: https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ My conclusion from that discussion (and work) is that a special system call: 1) Does not allow the benefits to be widely applied (only modified applications will benefit). Is not portable across different operating systems. Removing the vmstat_work interruption is a benefit for HPC workloads, for example (in fact, it is a benefit for any kind of application, since the interruption causes cache misses). 2) Increases the system call cost for applications which would use the interface. So avoiding the vmstat_update update interruption, without userspace knowledge and modifications, is a better than solution than a modified userspace.
On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: > On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote: > > On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote: > > > > > This patch series addresses the following two problems: > > > > > > 1. A customer provided evidence indicating that a process > > > was stalled in direct reclaim: > > > > > > ... > > > > > > 2. With a task that busy loops on a given CPU, > > > the kworker interruption to execute vmstat_update > > > is undesired and may exceed latency thresholds > > > for certain applications. > > > > > > > I don't think I'll be sending this upstream in the next merge window. > > Because it isn't clear that the added complexity in vmstat handling is > > justified. > > From my POV this is an incorrect statement (that the complexity in > vmstat handling is not justified). > > Andrew, this is the 3rd attempt to fix this problem: > > First try: https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ > > Second try: https://patchew.org/linux/20230105125218.031928326@redhat.com/ > > Third try: syncing vmstats remotely from vmstat_shepherd (this > patchset). > > And also, can you please explain: what is so complicated about the > vmstat handling? cmpxchg has been around and is used all over the > kernel, and nobody considers "excessively complicated". > > > - Michal's request for more clarity on the end-user requirements > > seems reasonable. > > And i explained to Michal in great detail where the end-user > requirements come from. For virtualized workloads, there are two > types of use-cases: > > 1) For example, for the MAC scheduler processing must occur every 1ms, > and a certain amount of computation takes place (and must finish before > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest > is considered a "failure". > > I showed him a 7us trace caused by, and explained that will extend to > > 50us in the case of virtualized vCPU. > > 2) PLCs. These workloads will also suffer > 50us latency spikes > which is undesirable. > > Can you please explain what additional clarity is required? > > RH's performance team, for example, has been performing packet > latency tests and waiting for this issue to be fixed for about 2 > years now. > > Andrew Theurer, can you please explain what problem is the vmstat_work > interruption causing in your testing? +CC Andrew.
On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: > This was tried before: > https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ > > My conclusion from that discussion (and work) is that a special system > call: > > 1) Does not allow the benefits to be widely applied (only modified > applications will benefit). Is not portable across different operating systems. > > Removing the vmstat_work interruption is a benefit for HPC workloads, > for example (in fact, it is a benefit for any kind of application, > since the interruption causes cache misses). > > 2) Increases the system call cost for applications which would use > the interface. > > So avoiding the vmstat_update update interruption, without userspace > knowledge and modifications, is a better than solution than a modified > userspace. Another important point is this: if an application dirties its own per-CPU vmstat cache, while performing a system call, and a vmstat sync event is triggered on a different CPU, you'd have to: 1) Wait for that CPU to return to userspace and sync its stats (unfeasible). 2) Queue work to execute on that CPU (undesirable, as that causes an interruption). 3) Remotely sync the vmstat for that CPU.
On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote: > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: > > This was tried before: > > https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ > > > > My conclusion from that discussion (and work) is that a special system > > call: > > > > 1) Does not allow the benefits to be widely applied (only modified > > applications will benefit). Is not portable across different operating systems. > > > > Removing the vmstat_work interruption is a benefit for HPC workloads, > > for example (in fact, it is a benefit for any kind of application, > > since the interruption causes cache misses). > > > > 2) Increases the system call cost for applications which would use > > the interface. > > > > So avoiding the vmstat_update update interruption, without userspace > > knowledge and modifications, is a better than solution than a modified > > userspace. > > Another important point is this: if an application dirties > its own per-CPU vmstat cache, while performing a system call, Or while handling a VM-exit from a vCPU. This are, in my mind, sufficient reasons to discard the "flush per-cpu caches" idea. This is also why i chose to abandon the prctrl interface patchset. > and a vmstat sync event is triggered on a different CPU, you'd have to: > > 1) Wait for that CPU to return to userspace and sync its stats > (unfeasible). > > 2) Queue work to execute on that CPU (undesirable, as that causes > an interruption). > > 3) Remotely sync the vmstat for that CPU.
Le Wed, Apr 19, 2023 at 08:59:28AM -0300, Marcelo Tosatti a écrit : > On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote: > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: > > > This was tried before: > > > https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ > > > > > > My conclusion from that discussion (and work) is that a special system > > > call: > > > > > > 1) Does not allow the benefits to be widely applied (only modified > > > applications will benefit). Is not portable across different operating systems. > > > > > > Removing the vmstat_work interruption is a benefit for HPC workloads, > > > for example (in fact, it is a benefit for any kind of application, > > > since the interruption causes cache misses). > > > > > > 2) Increases the system call cost for applications which would use > > > the interface. > > > > > > So avoiding the vmstat_update update interruption, without userspace > > > knowledge and modifications, is a better than solution than a modified > > > userspace. > > > > Another important point is this: if an application dirties > > its own per-CPU vmstat cache, while performing a system call, > > Or while handling a VM-exit from a vCPU. > > This are, in my mind, sufficient reasons to discard the "flush per-cpu > caches" idea. This is also why i chose to abandon the prctrl interface > patchset. If you're running your isolated workloads on guests, which sounds quite challenging but I guess you guys managed, I'd expect that VMEXITs are absolutely out of question while the task runs critical code, so I'm not sure why you would care. I guess not only your guests but also your hosts run nohz_full, right? I can't tell if the prctl solution which quiesces everything is the solution for you, I don't know well enough your workloads, but I would expect that the pattern is as follows: 1) Arrange for full isolation (no more interrupts/exceptions/VMEXITs) 2) Run critical code 3) Optionally do something once you're done If vmstat is going to be the only thing to wait for on 1), then the remote solution looks good enough (although I leave that to -mm guys as I'm too clueless about those matters), if there is more to be expected, I guess the quiescing prctl (or whatever syscall) is something to consider. Thanks.
> On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: >> On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote: >>> On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote: >>> >>>> This patch series addresses the following two problems: >>>> >>>> 1. A customer provided evidence indicating that a process >>>> was stalled in direct reclaim: >>>> >>>> ... >>>> >>>> 2. With a task that busy loops on a given CPU, >>>> the kworker interruption to execute vmstat_update >>>> is undesired and may exceed latency thresholds >>>> for certain applications. >>>> >>> >>> I don't think I'll be sending this upstream in the next merge window. >>> Because it isn't clear that the added complexity in vmstat handling is >>> justified. >> >> From my POV this is an incorrect statement (that the complexity in >> vmstat handling is not justified). >> >> Andrew, this is the 3rd attempt to fix this problem: >> >> First try: https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ >> >> Second try: https://patchew.org/linux/20230105125218.031928326@redhat.com/ >> >> Third try: syncing vmstats remotely from vmstat_shepherd (this >> patchset). >> >> And also, can you please explain: what is so complicated about the >> vmstat handling? cmpxchg has been around and is used all over the >> kernel, and nobody considers "excessively complicated". >> >>> - Michal's request for more clarity on the end-user requirements >>> seems reasonable. >> >> And i explained to Michal in great detail where the end-user >> requirements come from. For virtualized workloads, there are two >> types of use-cases: >> >> 1) For example, for the MAC scheduler processing must occur every 1ms, >> and a certain amount of computation takes place (and must finish before >> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest >> is considered a "failure". >> >> I showed him a 7us trace caused by, and explained that will extend to > >> 50us in the case of virtualized vCPU. >> >> 2) PLCs. These workloads will also suffer > 50us latency spikes >> which is undesirable. >> >> Can you please explain what additional clarity is required? >> >> RH's performance team, for example, has been performing packet >> latency tests and waiting for this issue to be fixed for about 2 >> years now. >> >> Andrew Theurer, can you please explain what problem is the vmstat_work >> interruption causing in your testing? > > +CC Andrew. Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency as measured by cyclictest & oslat. We cannot achieve under 20 usec with the vmstats interruption. -Andrew
On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote: > Le Wed, Apr 19, 2023 at 08:59:28AM -0300, Marcelo Tosatti a écrit : > > On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote: > > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: > > > > This was tried before: > > > > https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ > > > > > > > > My conclusion from that discussion (and work) is that a special system > > > > call: > > > > > > > > 1) Does not allow the benefits to be widely applied (only modified > > > > applications will benefit). Is not portable across different operating systems. > > > > > > > > Removing the vmstat_work interruption is a benefit for HPC workloads, > > > > for example (in fact, it is a benefit for any kind of application, > > > > since the interruption causes cache misses). > > > > > > > > 2) Increases the system call cost for applications which would use > > > > the interface. > > > > > > > > So avoiding the vmstat_update update interruption, without userspace > > > > knowledge and modifications, is a better than solution than a modified > > > > userspace. > > > > > > Another important point is this: if an application dirties > > > its own per-CPU vmstat cache, while performing a system call, > > > > Or while handling a VM-exit from a vCPU. > > > > This are, in my mind, sufficient reasons to discard the "flush per-cpu > > caches" idea. This is also why i chose to abandon the prctrl interface > > patchset. > > If you're running your isolated workloads on guests, which sounds quite > challenging but I guess you guys managed, I'd expect that VMEXITs are > absolutely out of question while the task runs critical code, so I'm not > sure why you would care. I guess not only your guests but also your hosts > run nohz_full, right? The answer is: there are VM-exits. For example to write MSRs to program LAPIC timer. Yes both host and guest are nohz_full (but for example, cyclictest or a PLC program can call nanosleep in the guest which translate to MSR writes to program LAPIC timer which is a VM-exit). > I can't tell if the prctl solution which quiesces everything is the solution > for you, I don't know well enough your workloads, but I would expect that > the pattern is as follows: > > 1) Arrange for full isolation (no more interrupts/exceptions/VMEXITs) Yes, this in the general scheme. Full isolation is automated by tuned (realtime-virtual-host/realtime-virtual-guest profiles). There are VM-exits in our use-case. There might be use-cases where interrupts are desired. For more details: https://www.youtube.com/watch?v=SyhfctYqjc8 > 2) Run critical code > 3) Optionally do something once you're done > > If vmstat is going to be the only thing to wait for on 1), then the remote > solution looks good enough (although I leave that to -mm guys as I'm too > clueless about those matters), I am mostly clueless too, but i don't see a problem with the proposed patch (and no one has pointed any problem either). > if there is more to be expected, I guess the > quiescing prctl (or whatever syscall) is something to consider. > > Thanks. I don't know of anything else to consider ATM, and for all cases we have analyzed so far there has always been the possibility to do the work remotely, via RCU or some other locking scheme, rather than requiring the application to be modified (which decreases the number of userspace applications that can benefit).
On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote: > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote: [...] > > 2) Run critical code > > 3) Optionally do something once you're done > > > > If vmstat is going to be the only thing to wait for on 1), then the remote > > solution looks good enough (although I leave that to -mm guys as I'm too > > clueless about those matters), > > I am mostly clueless too, but i don't see a problem with the proposed > patch (and no one has pointed any problem either). I really hate to repeat myself again. The biggest pushback has been on a) justification and b) single purpose solution which is very likely incomplete. For a) we are getting the story piece by piece which doesn't speed up the process. You are proposing a non-trivial change to an already convoluted code so having a solid justification is something that shouldn't be all that surprising. b) is what concerns me more though. There are other per-cpu specific things going on that require some regular flushing. Just to mention another one that your group has been brought up was the memcg pcp caches. Again with a non-trivial proposal to deal with that problem [1]. It has turned out that we can do a simpler thing [2]. I do not think it is a stretch to expect that similar things will pop out every now and then and rather than dealing with each one in its own way it kinda makes sense to come up with a more general concept so that all those cases can be handled at a single place at least. All I hear about that is that the code of those special applications would need to be changed to use that. Well, true but is that bar so impractical that we are going to grow kernel complexity and therefore a maintenance burden? Everything for a very specialized workloads? [1] http://lkml.kernel.org/r/20221102020243.522358-1-leobras@redhat.com [2] http://lkml.kernel.org/r/20230317134448.11082-1-mhocko@kernel.org
Hi Michal, On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote: > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote: > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote: > [...] > > > 2) Run critical code > > > 3) Optionally do something once you're done > > > > > > If vmstat is going to be the only thing to wait for on 1), then the remote > > > solution looks good enough (although I leave that to -mm guys as I'm too > > > clueless about those matters), > > > > I am mostly clueless too, but i don't see a problem with the proposed > > patch (and no one has pointed any problem either). > > I really hate to repeat myself again. The biggest pushback has been on > a) justification and b) single purpose solution which is very likely > incomplete. For a) we are getting the story piece by piece which doesn't > speed up the process. You are proposing a non-trivial change to an > already convoluted code so having a solid justification is something > that shouldn't be all that surprising. The justification is simple and concise: 2. With a task that busy loops on a given CPU, the kworker interruption to execute vmstat_update is undesired and may exceed latency thresholds for certain applications. Performance details for the kworker interruption: oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... The example above shows an additional 7us for the oslat -> kworker -> oslat switches. In the case of a virtualized CPU, and the vmstat_update interruption in the host (of a qemu-kvm vcpu), the latency penalty observed in the guest is higher than 50us, violating the acceptable latency threshold for certain applications. --- An additional use-case is what has been noted by Andrew Theurer: Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency as measured by cyclictest & oslat. We cannot achieve under 20 usec with the vmstats interruption. --- It seems to me this is solid justification (it seems you want particular microsecond values, but those depend on application and the CPU). The point is there are several applications which do not want to be interrupted (we can ignore the specifics and focus on that fact). Moreover, unrelated interruptions might occur close in time (for example, random function call IPIs generated by other kernel subsystems), which renders the "lets just consider this one application, running on this CPU, to decide what to do" short sighted. > b) is what concerns me more though. There are other per-cpu specific > things going on that require some regular flushing. Just to mention > another one that your group has been brought up was the memcg pcp > caches. Again with a non-trivial proposal to deal with that problem > [1]. Yes. > It has turned out that we can do a simpler thing [2]. For the particular memcg case, there was a simpler fix. For the vmstat_update case, i don't see a simpler fix. > I do not think it is a stretch to expect that similar things will pop > out every now and then Agree. > and rather than dealing with each one in its own way it > kinda makes sense to come up with a more general concept so that all > those cases can be handled at a single place at least. I can understand where you are coming from. Unfortunately, for some cases it is appropriate to perform the work from a remote CPU (and i think this is one such case). > All I hear about > that is that the code of those special applications would need to be > changed to use that. This is a burden for application writers and for system configuration. Or it could be done automatically (from outside of the application). Which is what is described and implemented here: https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/ "Task isolation is divided in two main steps: configuration and activation. Each step can be performed by an external tool or the latency sensitive application itself. util-linux contains the "chisol" tool for this purpose." But not only that, the second thing is: "> Another important point is this: if an application dirties > its own per-CPU vmstat cache, while performing a system call, Or while handling a VM-exit from a vCPU. This are, in my mind, sufficient reasons to discard the "flush per-cpu caches" idea. This is also why i chose to abandon the prctrl interface patchset. > and a vmstat sync event is triggered on a different CPU, you'd have to: > > 1) Wait for that CPU to return to userspace and sync its stats > (unfeasible). > > 2) Queue work to execute on that CPU (undesirable, as that causes > an interruption). > > 3) Remotely sync the vmstat for that CPU." So the only option is to remotely sync vmstat for the CPU (unless you have a better suggestion). > Well, true but is that bar so impractical that we > are going to grow kernel complexity and therefore a maintenance burden? Honestly, this patchset is just using cmpxchg to transfer data from per-CPU counters to global counters. I don't see why its that complicated. If you have a suggestion on how to reduce the apparent complexity, that would be great. > Everything for a very specialized workloads? Well the kernel has been increasing in complexity, and the maintenance burden has increased as a side-effect, to accomodate more workloads than it was initially designed for. > [1] http://lkml.kernel.org/r/20221102020243.522358-1-leobras@redhat.com > [2] http://lkml.kernel.org/r/20230317134448.11082-1-mhocko@kernel.org > -- > Michal Hocko > SUSE Labs > >
On 4/19/23 13:29, Marcelo Tosatti wrote: > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: >> This was tried before: >> https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ >> >> My conclusion from that discussion (and work) is that a special system >> call: >> >> 1) Does not allow the benefits to be widely applied (only modified >> applications will benefit). Is not portable across different operating systems. >> >> Removing the vmstat_work interruption is a benefit for HPC workloads, >> for example (in fact, it is a benefit for any kind of application, >> since the interruption causes cache misses). >> >> 2) Increases the system call cost for applications which would use >> the interface. >> >> So avoiding the vmstat_update update interruption, without userspace >> knowledge and modifications, is a better than solution than a modified >> userspace. > > Another important point is this: if an application dirties > its own per-CPU vmstat cache, while performing a system call, > and a vmstat sync event is triggered on a different CPU, you'd have to: > > 1) Wait for that CPU to return to userspace and sync its stats > (unfeasible). > > 2) Queue work to execute on that CPU (undesirable, as that causes > an interruption). So you're saying the application might do a syscall from the isolcpu, so IIUC it cannot expect any latency guarantees at that very moment, but then it immediately starts expecting them again after returning to userspace, and a single interruption for a one-time flush after the syscall would be too intrusive? (elsewhere in the thread you described an RT app initialization that may generate vmstats to flush and then entry userspace loop, again, would a single interruption soon after entering the loop be so critical?) > 3) Remotely sync the vmstat for that CPU. > > >
On Wed, Apr 19, 2023 at 06:47:30PM +0200, Vlastimil Babka wrote: > On 4/19/23 13:29, Marcelo Tosatti wrote: > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: > >> This was tried before: > >> https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ > >> > >> My conclusion from that discussion (and work) is that a special system > >> call: > >> > >> 1) Does not allow the benefits to be widely applied (only modified > >> applications will benefit). Is not portable across different operating systems. > >> > >> Removing the vmstat_work interruption is a benefit for HPC workloads, > >> for example (in fact, it is a benefit for any kind of application, > >> since the interruption causes cache misses). > >> > >> 2) Increases the system call cost for applications which would use > >> the interface. > >> > >> So avoiding the vmstat_update update interruption, without userspace > >> knowledge and modifications, is a better than solution than a modified > >> userspace. > > > > Another important point is this: if an application dirties > > its own per-CPU vmstat cache, while performing a system call, > > and a vmstat sync event is triggered on a different CPU, you'd have to: > > > > 1) Wait for that CPU to return to userspace and sync its stats > > (unfeasible). > > > > 2) Queue work to execute on that CPU (undesirable, as that causes > > an interruption). > > So you're saying the application might do a syscall from the isolcpu, so > IIUC it cannot expect any latency guarantees at that very moment, Why not? cyclictest uses nanosleep and its the main tool for measuring latency. > but then > it immediately starts expecting them again after returning to userspace, No, the expectation more generally is this: For certain types of applications (for example PLC software or RAN processing), upon occurrence of an event, it is necessary to complete a certain task in a maximum amount of time (deadline). One way to express this requirement is with a pair of numbers, deadline time and execution time, where: * deadline time: length of time between event and deadline. * execution time: length of time it takes for processing of event to occur on a particular hardware platform (uninterrupted). The particular values depend on use-case. For the case where the realtime application executes in a virtualized guest, an interruption which must be serviced in the host will cause the following sequence of events: 1) VM-exit 2) execution of IPI (and function call) (or switch to kwork thread to execute some work item). 3) VM-entry Which causes an excess of 50us latency as observed by cyclictest (this violates the latency requirement of vRAN application with 1ms TTI, for example). > and > a single interruption for a one-time flush after the syscall would be too > intrusive? Generally, if you can't complete the task (which involves executing a number of instructions) before the deadline, then its a problem. One-time flush? You mean to switch between: rt-app -> kworker (to execute vmstat_update flush) -> rt-app My measurement, which probably had vmstat_update code/data in cache, took 7us. It might be the case that the code to execute must be brought in from memory, which takes even longer. > (elsewhere in the thread you described an RT app initialization that may > generate vmstats to flush and then entry userspace loop, again, would a > single interruption soon after entering the loop be so critical?) 1) It depends on the application. For the use-case above, where < 50us interruption is desired, yes it is critical. 2) The interruptions can come from different sources. Time 0 rt-app executing instruction 1 1 rt-app executing instruction 2 2 scheduler switches between rt-app and kworker 3 kworker runs vmstat_work 4 scheduler switches between kworker and rt-app 5 rt-app executing instruction 3 6 ipi to handle a KVM request IPI 7 fill in your preferred IPI handler So the argument "a single interruption might not cause your deadline to be exceeded" fails (because the time to handle the different interruptions might sum). Does that make sense? > > 3) Remotely sync the vmstat for that CPU. > > > > > > > >
On Wed 19-04-23 08:44:23, Andrew Theurer wrote: > > On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > >> Andrew Theurer, can you please explain what problem is the vmstat_work > >> interruption causing in your testing? > > > > +CC Andrew. > > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU > latency as measured by cyclictest & oslat. We cannot achieve under 20 > usec with the vmstats interruption. Are you able to get those latency requirements with PREEMPT_RT?
On Wed 19-04-23 13:35:12, Marcelo Tosatti wrote: [...] > This is a burden for application writers and for system configuration. Yes. And I find it reasonable to expect that burden put there as there are non-trivial requirements for those workloads anyway. It is not out-of-the-box thing, right? > Or it could be done automatically (from outside of the application). > Which is what is described and implemented here: > > https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/ > > "Task isolation is divided in two main steps: configuration and > activation. > > Each step can be performed by an external tool or the latency > sensitive application itself. util-linux contains the "chisol" tool > for this purpose." I cannot say I would be a fan of prctl interfaces in general but I do agree with the overal idea to forcing a quiescent state on a set of CPUs. > But not only that, the second thing is: > > "> Another important point is this: if an application dirties > > its own per-CPU vmstat cache, while performing a system call, > > Or while handling a VM-exit from a vCPU. Do you have any specific examples on this? > This are, in my mind, sufficient reasons to discard the "flush per-cpu > caches" idea. This is also why i chose to abandon the prctrl interface > patchset. > > > and a vmstat sync event is triggered on a different CPU, you'd have to: > > > > 1) Wait for that CPU to return to userspace and sync its stats > > (unfeasible). > > > > 2) Queue work to execute on that CPU (undesirable, as that causes > > an interruption). > > > > 3) Remotely sync the vmstat for that CPU." > > So the only option is to remotely sync vmstat for the CPU > (unless you have a better suggestion). `echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same without any kernel changes. But let me repeat, this is not just about vmstats. Just have a look at other queue_work_on users. You do not want to handy pick each and every one and do so in the future as well.
On Wed, Apr 19, 2023 at 01:35:12PM -0300, Marcelo Tosatti wrote: > Hi Michal, > > On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote: > > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote: > > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote: > > [...] > > > > 2) Run critical code > > > > 3) Optionally do something once you're done > > > > > > > > If vmstat is going to be the only thing to wait for on 1), then the remote > > > > solution looks good enough (although I leave that to -mm guys as I'm too > > > > clueless about those matters), > > > > > > I am mostly clueless too, but i don't see a problem with the proposed > > > patch (and no one has pointed any problem either). > > > > I really hate to repeat myself again. The biggest pushback has been on > > a) justification and b) single purpose solution which is very likely > > incomplete. For a) we are getting the story piece by piece which doesn't > > speed up the process. You are proposing a non-trivial change to an > > already convoluted code so having a solid justification is something > > that shouldn't be all that surprising. > > The justification is simple and concise: > > 2. With a task that busy loops on a given CPU, > the kworker interruption to execute vmstat_update > is undesired and may exceed latency thresholds > for certain applications. > > Performance details for the kworker interruption: > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > The example above shows an additional 7us for the > > oslat -> kworker -> oslat > > switches. In the case of a virtualized CPU, and the vmstat_update > interruption in the host (of a qemu-kvm vcpu), the latency penalty > observed in the guest is higher than 50us, violating the acceptable > latency threshold for certain applications. > > --- > > An additional use-case is what has been noted by Andrew Theurer: > > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency > as measured by cyclictest & oslat. We cannot achieve under 20 usec with > the vmstats interruption. > > --- > > It seems to me this is solid justification (it seems you want > particular microsecond values, but those depend on application > and the CPU). The point is there are several applications which do not > want to be interrupted (we can ignore the specifics and focus on > that fact). > > Moreover, unrelated interruptions might occur close in time > (for example, random function call IPIs generated by other > kernel subsystems), which renders the "lets just consider this > one application, running on this CPU, to decide what to do" > short sighted. > > > b) is what concerns me more though. There are other per-cpu specific > > things going on that require some regular flushing. Just to mention > > another one that your group has been brought up was the memcg pcp > > caches. Again with a non-trivial proposal to deal with that problem > > [1]. > > Yes. > > > It has turned out that we can do a simpler thing [2]. > > For the particular memcg case, there was a simpler fix. > > For the vmstat_update case, i don't see a simpler fix. > > > I do not think it is a stretch to expect that similar things will pop > > out every now and then > > Agree. > > > and rather than dealing with each one in its own way it > > kinda makes sense to come up with a more general concept so that all > > those cases can be handled at a single place at least. > > I can understand where you are coming from. Unfortunately, > for some cases it is appropriate to perform the work from a > remote CPU (and i think this is one such case). > > > All I hear about > > that is that the code of those special applications would need to be > > changed to use that. > > This is a burden for application writers and for system configuration. > > Or it could be done automatically (from outside of the application). > Which is what is described and implemented here: > > https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/ > > "Task isolation is divided in two main steps: configuration and > activation. > > Each step can be performed by an external tool or the latency > sensitive application itself. util-linux contains the "chisol" tool > for this purpose." > > But not only that, the second thing is: > > "> Another important point is this: if an application dirties > > its own per-CPU vmstat cache, while performing a system call, > > Or while handling a VM-exit from a vCPU. > > This are, in my mind, sufficient reasons to discard the "flush per-cpu > caches" idea. This is also why i chose to abandon the prctrl interface > patchset. There are additional details that were not mentioned. When we think of flushing caches, or disabling per-CPU caches, this means that the isolated application loses the benefit of those caches (which means you are turning a "general purpose" programming environment into potentially slower environment for applications to execute). (yes, of course, one has to be mindful of which system calls can be used, for example the execution time of system calls which take locks will depend on whether, and how many, users of those locks there are at a given moment). For example, if we flush the vm-stats at every system call return, and the application happens to switch between different phases of isolated, userspace only behaviour (computation) system call intensive behaviour (which is an HPC example Thomas Gleixner mentioned in a different thread), then you see the disadvantage of the "special" environment where flushing is performed on return to system calls. So it seems to me (unless there are points that show otherwise, which would indicate that explicit userspace interfaces are preferred) _not_ requiring userspace changes is a superior solution. Perhaps the complexity should be judged for individual cases of interruptions, and if a given interruption-free conversion is seen as too complex, then a "disable feature which makes use of per-CPU caches" style solution can be made (and then userspace has to explicitly request for that per-CPU feature to be disabled). But i don't see that this patchset introduces unmanageable complexity, neither: 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
On Thu, Apr 20, 2023 at 10:40:25AM +0200, Michal Hocko wrote: > On Wed 19-04-23 13:35:12, Marcelo Tosatti wrote: > [...] > > This is a burden for application writers and for system configuration. > > Yes. And I find it reasonable to expect that burden put there as there > are non-trivial requirements for those workloads anyway. It is not > out-of-the-box thing, right? See below. > > Or it could be done automatically (from outside of the application). > > Which is what is described and implemented here: > > > > https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/ > > > > "Task isolation is divided in two main steps: configuration and > > activation. > > > > Each step can be performed by an external tool or the latency > > sensitive application itself. util-linux contains the "chisol" tool > > for this purpose." > > I cannot say I would be a fan of prctl interfaces in general but I do > agree with the overal idea to forcing a quiescent state on a set of > CPUs. This has been avoided with success so far. > > But not only that, the second thing is: > > > > "> Another important point is this: if an application dirties > > > its own per-CPU vmstat cache, while performing a system call, > > > > Or while handling a VM-exit from a vCPU. > > Do you have any specific examples on this? Interrupt handling freeing a page. handle_access_fault (ARM64) -> handle_changed_spte_acc_track (x86) -> kvm_set_pfn_accessed -> kvm_set_page_accessed -> mark_page_accessed -> folio_mark_accessed -> folio_activate -> folio_activate_fn -> lruvec_add_folio -> update_lru_size -> __update_lru_size -> __mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LRU_BASE + lru, nr_pages); The other option would be to _FORBID_ use of __mod_zone_page_state in certain code sections. > > This are, in my mind, sufficient reasons to discard the "flush per-cpu > > caches" idea. This is also why i chose to abandon the prctrl interface > > patchset. > > > > > and a vmstat sync event is triggered on a different CPU, you'd have to: > > > > > > 1) Wait for that CPU to return to userspace and sync its stats > > > (unfeasible). > > > > > > 2) Queue work to execute on that CPU (undesirable, as that causes > > > an interruption). > > > > > > 3) Remotely sync the vmstat for that CPU." > > > > So the only option is to remotely sync vmstat for the CPU > > (unless you have a better suggestion). > > `echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same > without any kernel changes. It is unsuitable. You'd have to guarantee that, by the time you return from the write() system call to that file, there has been no other mod_zone_page_state call. For example, no interrupt or exception that frees or allocates a page through rmqueue (NR_FREE_PAGES counter), or that bounce_end_io cannot be called (since it calls dec_zone_page_state). It has been used internally as a workaround, but it is not reliable. > But let me repeat, this is not just about vmstats. Just have a look at > other queue_work_on users. You do not want to handy pick each and every > one and do so in the future as well. The ones that are problematic are being fixed for sometime now. For example: commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047 Author: Paolo Abeni <pabeni@redhat.com> Date: Thu Sep 10 23:33:18 2020 +0200 net: try to avoid unneeded backlog flush flush_all_backlogs() may cause deadlock on systems running processes with FIFO scheduling policy. The above is critical in -RT scenarios, where user-space specifically ensure no network activity is scheduled on the CPU running the mentioned FIFO process, but still get stuck. This commit tries to address the problem checking the backlog status on the remote CPUs before scheduling the flush operation. If the backlog is empty, we can skip it. v1 -> v2: - explicitly clear flushed cpu mask - Eric Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> And it has been a normal process so far. I think what needs to be done is to avoid new queue_work_on() users from being introduced in the tree (the number of existing ones is finite and can therefore be fixed).
On Thu, Apr 20, 2023 at 09:55:40AM +0200, Michal Hocko wrote: > On Wed 19-04-23 08:44:23, Andrew Theurer wrote: > > > On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > > >> Andrew Theurer, can you please explain what problem is the vmstat_work > > >> interruption causing in your testing? > > > > > > +CC Andrew. > > > > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU > > latency as measured by cyclictest & oslat. We cannot achieve under 20 > > usec with the vmstats interruption. > > Are you able to get those latency requirements with PREEMPT_RT? What do you mean, exactly? PREEMPT_RT allows for the preemption of tasks in kernel context (so that higher priority tasks can interrupt lower priority tasks). It also enables IRQ handling to happen in thread context (so that a given thread might be given higher priority than executing a particular IRQ handler). If the question is: "Are you able to achieve <20 usec latency while allowing switching between different tasks ?" The answer with current processor and memory speeds is probably: no. But with more performant processors, you might. However, with a fully isolated processor which does not require switching between tasks, yes you can achieve < 20 usec latency.
On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote: > On Wed, Apr 19, 2023 at 01:35:12PM -0300, Marcelo Tosatti wrote: > > Hi Michal, > > > > On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote: > > > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote: > > > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote: > > > [...] > > > > > 2) Run critical code > > > > > 3) Optionally do something once you're done > > > > > > > > > > If vmstat is going to be the only thing to wait for on 1), then the remote > > > > > solution looks good enough (although I leave that to -mm guys as I'm too > > > > > clueless about those matters), > > > > > > > > I am mostly clueless too, but i don't see a problem with the proposed > > > > patch (and no one has pointed any problem either). > > > > > > I really hate to repeat myself again. The biggest pushback has been on > > > a) justification and b) single purpose solution which is very likely > > > incomplete. For a) we are getting the story piece by piece which doesn't > > > speed up the process. You are proposing a non-trivial change to an > > > already convoluted code so having a solid justification is something > > > that shouldn't be all that surprising. > > > > The justification is simple and concise: > > > > 2. With a task that busy loops on a given CPU, > > the kworker interruption to execute vmstat_update > > is undesired and may exceed latency thresholds > > for certain applications. > > > > Performance details for the kworker interruption: > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000) > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ... > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ... > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ... > > > > The example above shows an additional 7us for the > > > > oslat -> kworker -> oslat > > > > switches. In the case of a virtualized CPU, and the vmstat_update > > interruption in the host (of a qemu-kvm vcpu), the latency penalty > > observed in the guest is higher than 50us, violating the acceptable > > latency threshold for certain applications. > > > > --- > > > > An additional use-case is what has been noted by Andrew Theurer: > > > > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency > > as measured by cyclictest & oslat. We cannot achieve under 20 usec with > > the vmstats interruption. > > > > --- > > > > It seems to me this is solid justification (it seems you want > > particular microsecond values, but those depend on application > > and the CPU). The point is there are several applications which do not > > want to be interrupted (we can ignore the specifics and focus on > > that fact). > > > > Moreover, unrelated interruptions might occur close in time > > (for example, random function call IPIs generated by other > > kernel subsystems), which renders the "lets just consider this > > one application, running on this CPU, to decide what to do" > > short sighted. > > > > > b) is what concerns me more though. There are other per-cpu specific > > > things going on that require some regular flushing. Just to mention > > > another one that your group has been brought up was the memcg pcp > > > caches. Again with a non-trivial proposal to deal with that problem > > > [1]. > > > > Yes. > > > > > It has turned out that we can do a simpler thing [2]. > > > > For the particular memcg case, there was a simpler fix. > > > > For the vmstat_update case, i don't see a simpler fix. > > > > > I do not think it is a stretch to expect that similar things will pop > > > out every now and then > > > > Agree. > > > > > and rather than dealing with each one in its own way it > > > kinda makes sense to come up with a more general concept so that all > > > those cases can be handled at a single place at least. > > > > I can understand where you are coming from. Unfortunately, > > for some cases it is appropriate to perform the work from a > > remote CPU (and i think this is one such case). > > > > > All I hear about > > > that is that the code of those special applications would need to be > > > changed to use that. > > > > This is a burden for application writers and for system configuration. > > > > Or it could be done automatically (from outside of the application). > > Which is what is described and implemented here: > > > > https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/ > > > > "Task isolation is divided in two main steps: configuration and > > activation. > > > > Each step can be performed by an external tool or the latency > > sensitive application itself. util-linux contains the "chisol" tool > > for this purpose." > > > > But not only that, the second thing is: > > > > "> Another important point is this: if an application dirties > > > its own per-CPU vmstat cache, while performing a system call, > > > > Or while handling a VM-exit from a vCPU. > > > > This are, in my mind, sufficient reasons to discard the "flush per-cpu > > caches" idea. This is also why i chose to abandon the prctrl interface > > patchset. > > There are additional details that were not mentioned. When we think > of flushing caches, or disabling per-CPU caches, this means that the > isolated application loses the benefit of those caches (which means you > are turning a "general purpose" programming environment into > potentially slower environment for applications to execute). https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html > (yes, of course, one has to be mindful of which system calls can be > used, for example the execution time of system calls which take locks will > depend on whether, and how many, users of those locks there are at a > given moment). > > For example, if we flush the vm-stats at every system call return, > and the application happens to switch between different phases of > > isolated, userspace only behaviour (computation) > system call intensive behaviour > > (which is an HPC example Thomas Gleixner mentioned in a different > thread), then you see the disadvantage of the "special" environment > where flushing is performed on return to system calls. https://lore.kernel.org/linux-mm/87im9d4ezq.fsf@nanos.tec.linutronix.de/ "In a real-world usecase we had the situation of compute bursts and an unfortunate hw enforced requirement to go into the kernel between them for synchronization between the compute threads and hardware (A quick hardware assisted save/load). Unmodified NOHZ full accumulated to more than 6% loss compared to a fully undisturbed run. Most of it was caused by cache effects and not by the actually used CPU time. A full enforced quiescing upfront gained ~2-3%, but a lazy approach of accepting that some stuff might happen once and does not happen again gained almost 5%. In that particular scenario 5% _is_ a huge win." > > So it seems to me (unless there are points that show otherwise, which > would indicate that explicit userspace interfaces are preferred) _not_ > requiring userspace changes is a superior solution. > > Perhaps the complexity should be judged for individual cases > of interruptions, and if a given interruption-free conversion > is seen as too complex, then a "disable feature which makes use of per-CPU > caches" style solution can be made (and then userspace has to > explicitly request for that per-CPU feature to be disabled). > > But i don't see that this patchset introduces unmanageable complexity, > neither: > > 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock > 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock > >
On 4/20/23 15:45, Marcelo Tosatti wrote: > Perhaps the complexity should be judged for individual cases > of interruptions, and if a given interruption-free conversion > is seen as too complex, then a "disable feature which makes use of per-CPU > caches" style solution can be made (and then userspace has to > explicitly request for that per-CPU feature to be disabled). > > But i don't see that this patchset introduces unmanageable complexity, > neither: > > 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock > 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock Well that one is a bit different, as there was one kind of lock replaced with other kind of lock, the lock is uncontended unless there's remote flushes happening so it's not causing extra overhead for the fast paths, and later even the irq disabling was removed, which should even improve things. But this patchset is turning all vmstat counter increments a cmpxchg.
On Wed, Apr 26, 2023 at 05:04:49PM +0200, Vlastimil Babka wrote: > On 4/20/23 15:45, Marcelo Tosatti wrote: > > Perhaps the complexity should be judged for individual cases > > of interruptions, and if a given interruption-free conversion > > is seen as too complex, then a "disable feature which makes use of per-CPU > > caches" style solution can be made (and then userspace has to > > explicitly request for that per-CPU feature to be disabled). > > > > But i don't see that this patchset introduces unmanageable complexity, > > neither: > > > > 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock > > 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock > > Well that one is a bit different, as there was one kind of lock replaced > with other kind of lock, local_lock is defined to NULL if CONFIG_PREEMPT_RT is not set. So for the !CONFIG_PREEMPT_RT case, it introduced a lock. > the lock is uncontended unless there's remote > flushes happening so it's not causing extra overhead for the fast paths, > and later even the irq disabling was removed, which should even improve > things. But this patchset is turning all vmstat counter increments a > cmpxchg. Yes, and we have a similar situation in this case: 1) CMPXCHG is already used to protect many vmstat counter increments. 2) The patchset adds "LOCK CMPXCHG" to existing CMPXCHG user. 3) The performance decrease is negligible, because cache locking is effective. "To test the performance difference, a page allocator microbenchmark: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz. For the single_page_alloc_free test, which does /** Loop to measure **/ for (i = 0; i < rec->loops; i++) { my_page = alloc_page(gfp_mask); if (unlikely(my_page == NULL)) return 0; __free_page(my_page); } Unit is cycles. Vanilla Patched Diff 115.25 117 1.4%" To be honest, that 1.4% difference was not stable but fluctuated between positive and negative percentages (so the performance difference was in the noise). So performance is not a decisive factor in this case.
On Wed 26-04-23 11:34:00, Marcelo Tosatti wrote: > On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote: [...] > > There are additional details that were not mentioned. When we think > > of flushing caches, or disabling per-CPU caches, this means that the > > isolated application loses the benefit of those caches (which means you > > are turning a "general purpose" programming environment into > > potentially slower environment for applications to execute). I do not really buy this argument! Nothing is really free and somebody has to pay for the overhead. You want highly specialized workload to enjoy all the performance while having high demand on latency yet the overhead has to pay everybody else. > https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html This is just talking about who benefits from isolation and I do not think there is any dispute in that regard. I haven't questioned that. My main argument was that those really need to be special and careful to achieve their goal and Thomas says a very similar thing. I do not see any objection to an explicit programming model to achieve that goal. > > (yes, of course, one has to be mindful of which system calls can be > > used, for example the execution time of system calls which take locks will > > depend on whether, and how many, users of those locks there are at a > > given moment). This is simply not maintainble state. Once you enter the kernel you cannot really expect your _ultra low_ latency expectations. [...] > > So it seems to me (unless there are points that show otherwise, which > > would indicate that explicit userspace interfaces are preferred) _not_ > > requiring userspace changes is a superior solution. > > > > Perhaps the complexity should be judged for individual cases > > of interruptions, and if a given interruption-free conversion > > is seen as too complex, then a "disable feature which makes use of per-CPU > > caches" style solution can be made (and then userspace has to > > explicitly request for that per-CPU feature to be disabled). > > > > But i don't see that this patchset introduces unmanageable complexity, > > neither: As I've tried to explain, I disagree about the approach you are taking. You are fixing your problem at a wrong layer. You really need to address the fundamental issue and that is that you do not want housekeeping done on isolated cpu(s) while your workload is running there. vmstat updates are just one of schedule_on_cpu users who could disturb your workload. We do not want to chase every single one and keep doing that for ever as new callers of that API are added. See the point? "Fixing" vmstat will not make your isolated workload more reliable. You really need a more generic solution rather than a quick hack. Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC this is used by NOHZ. I do not remember any details but if anything this is something I would have a look into. There is close to 0 benefit to teaching remote stat flushing. As I've said stats are only for debugging purposes and imprecise values shouldn't matter. So this just adds a complexity without any actual real benefit.
On Wed 26-04-23 13:10:54, Marcelo Tosatti wrote: [...] > "To test the performance difference, a page allocator microbenchmark: > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c > with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz. > > For the single_page_alloc_free test, which does > > /** Loop to measure **/ > for (i = 0; i < rec->loops; i++) { > my_page = alloc_page(gfp_mask); > if (unlikely(my_page == NULL)) > return 0; > __free_page(my_page); > } > > Unit is cycles. > > Vanilla Patched Diff > 115.25 117 1.4%" > > To be honest, that 1.4% difference was not stable but fluctuated between > positive and negative percentages (so the performance difference was in > the noise). > > So performance is not a decisive factor in this case. It is not neglible considering that majority worklods will not benefit from this change. You are clearly ignoring that vmstat code has been highly optimized for local per-cpu access exactly to avoid locked operations and cache line bouncing.
On Thu, Apr 27, 2023 at 10:31:21AM +0200, Michal Hocko wrote: > On Wed 26-04-23 11:34:00, Marcelo Tosatti wrote: > > On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote: > [...] > > > There are additional details that were not mentioned. When we think > > > of flushing caches, or disabling per-CPU caches, this means that the > > > isolated application loses the benefit of those caches (which means you > > > are turning a "general purpose" programming environment into > > > potentially slower environment for applications to execute). > > I do not really buy this argument! Nothing is really free and somebody > has to pay for the overhead. About the overhead, modern processors perform "cache locking": https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-261.html Which means that as long as memory is completly contained in a cacheline (for write-back memory), then it is not necessary to perform the LOCK operation on the bus. Multiple experiments have confirmed this is the case. This is the case with per-CPU vmstat memory as well. > You want highly specialized workload to > enjoy all the performance while having high demand on latency yet the > overhead has to pay everybody else. Yes, the overhead is that code should avoid interrupting CPUs. Your argument is that: "Avoiding the interruptions adds unsurmountable complexity therefore those workloads should not be supported" ? It seems to me this argument can be used against any new piece of code of functionality that is added to the kernel, isnt it ? The same argument could be used to reject (at the time) new additions such as RCU (because systems with large number of processors are a a highly specialized workload), memory hotplug (same thing), PCI hotplug (same thing). > > https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html > > This is just talking about who benefits from isolation and I do not > think there is any dispute in that regard. I haven't questioned that. My > main argument was that those really need to be special and careful to > achieve their goal I see. > and Thomas says a very similar thing. I do not see > any objection to an explicit programming model to achieve that goal. Yes, but it seems to me that the best possible (and most widely applicable) solution is to avoid any explicit programming if possible. > > > (yes, of course, one has to be mindful of which system calls can be > > > used, for example the execution time of system calls which take locks will > > > depend on whether, and how many, users of those locks there are at a > > > given moment). > > This is simply not maintainble state. Once you enter the kernel you > cannot really expect your _ultra low_ latency expectations. Whether or not its OK to perform system calls is up to the application: What matters is the latency expectation from the outside world [1] VS how long it takes to execute a set of instructions. I can give two concrete example: 1) Cyclictest use sys_nanosleep(). It makes sense to abstract the details of HLT'ing to the operating system. A whole class of programs (which must handle periodic tasks) will sleep via the kernel. 2) The HPC example from Thomas, where: " 1 read_data_set() <- involving syscalls/OS obviously 2 compute_set() <- let me alone 3 save_data_set() <- involving syscalls/OS obviously repeat the above... then it's at his discretion to decide to inflict a particular isolation set on the task which is obviously ineffective while doing #1 and #3 but might provide the so desired 0.9% boost for compute_set() which dominates the judgement." It seems the operating system is capable of providing an isolation free environment for the application without explicit knowledge from it (other than taskset). So all the above are good reasons to try and avoid an explicit programming interface (again, i did write an explicit programming interface, and seen in practice its downsides). I will assume you now understand and agree that the additional complexity added to the kernel is worthwhile (since its not that huge complexity to perform particular work remotely, with appropriate locking and/or usage of lockless algorithms, these things have been around and used in the kernel for a while now). As for the next topic... > [...] > > > So it seems to me (unless there are points that show otherwise, which > > > would indicate that explicit userspace interfaces are preferred) _not_ > > > requiring userspace changes is a superior solution. > > > > > > Perhaps the complexity should be judged for individual cases > > > of interruptions, and if a given interruption-free conversion > > > is seen as too complex, then a "disable feature which makes use of per-CPU > > > caches" style solution can be made (and then userspace has to > > > explicitly request for that per-CPU feature to be disabled). > > > > > > But i don't see that this patchset introduces unmanageable complexity, > > > neither: > > As I've tried to explain, I disagree about the approach you are taking. > You are fixing your problem at a wrong layer. You really need to address > the fundamental issue and that is that you do not want housekeeping done > on isolated cpu(s) while your workload is running there. OK, that is a problem. But the fact is that there are interfaces to request work to be performed on remote CPUs (usually on per-CPU data), and those must be addressed one-by-one. We (as in the community) are looking into ways to address multiple classes of interruptions at once, for example: https://lpc.events/event/16/contributions/1218/ https://www.spinics.net/lists/linux-s390/msg57118.html "The current CPU isolation is a best effort approach and I agree that for more strict isolation modes we need to be able to enforce that and hunt down offenders and think about them one by one." But we can't block IPIs or requests for work to be executed remotely. > vmstat updates are just one of schedule_on_cpu users who could disturb > your workload. Yes. But its one case which we see right now. So our approach so far has been to monitor the workload and remove individual interruptions that are observed. But as long as you have code that is doing: queue_work_on(CPU, &this_work); flush_work(&this_work); There is nothing one can do about it other than: 1) Return an error to avoid the interruption (which is what we are trying here: https://lpc.events/event/16/contributions/1218/). However this might not be suitable for all cases (because you rely on the functionality). 2) Convert it to avoid the remote work somehow. 3) Don't queue the work (which might result in incorrect system operation). Again, we are trying to widen the number of callsites that can be handled with certain approaches (see the above URLs). > We do not want to chase every single one and keep > doing that for ever as new callers of that API are added. See the > point? Yes, agree with this point. Possible solutions: 1) Change the APIs so that any new users that attempt to use the APIs are encouraged to avoid executing code on isolated CPUs (or have to handle the errors). /** * queue_work_on - queue work on specific cpu * @cpu: CPU number to execute work on * @wq: workqueue to use * @work: work to queue * * We queue the work to a specific CPU, the caller must ensure it * can't go away. Callers that fail to ensure that the specified * CPU cannot go away will execute on a randomly chosen CPU. * * Return: %false if @work was already on a queue, %true otherwise. */ bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work) { Perhaps _fail variants of APIs to queue remote work (https://lore.kernel.org/lkml/20220908192859.546633738@redhat.com/T/#mc25ddea62ff095dba61d244fbfdca1f61221c915) plus a checkpatch.pl error in case of addition of queue_work_on would be helpful? 2) Change the patch acceptance criteria to avoid introducing queue_work_on's to isolated CPUs. If you have additional suggestions or ideas, they are welcome. > "Fixing" vmstat will not make your isolated workload more > reliable. You really need a more generic solution rather than a quick > hack. It does as long as no one else executes queue_work_on() :-) Yes, i understand and agree this is weak and fragile. > Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC > this is used by NOHZ. I do not remember any details but if anything this > is something I would have a look into. Its not sufficient since what it does is only flushing the per-CPU vmstats when entering idle. There has been work on a patchset (https://lkml.org/lkml/2022/9/24/306) to improve the infrastructure, to call quiet_vmstat on return to userspace when nohz_full is used, but honestly the remote flushing is superior: it potentially avoids many unecessary flushes (which can happen if the application performs a lot of system calls). > There is close to 0 benefit to teaching remote stat flushing. As I've > said stats are only for debugging purposes and imprecise values > shouldn't matter. So this just adds a complexity without any actual real > benefit. Well there is the need to request a synchronization of the events from all CPUs, right? For example: int vmstat_refresh(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { long val; int err; int i; /* * The regular update, every sysctl_stat_interval, may come later * than expected: leaving a significant amount in per_cpu buckets. * This is particularly misleading when checking a quantity of HUGE * pages, immediately after running a test. /proc/sys/vm/stat_refresh, * which can equally be echo'ed to or cat'ted from (by root), * can be used to update the stats just before reading them. * * Oh, and since global_zone_page_state() etc. are so careful to hide * transiently negative values, report an error here if any of * the stats is negative, so we know to go looking for imbalance. */ err = schedule_on_each_cpu(refresh_vm_stats); So if you let a remote CPU dirty its own cache, then at somepoint you need to flush back that cache. Are you suggesting to disable synchronization of the per-CPU vmstats for a given CPU (if isolated, for example), and then any callsites which require proper values must loop over all CPUs when requesting uptodate statistics? Works for me and the addresses the requests from users (since it removes the interruption to isolated CPUs).
On Thu, Apr 27, 2023 at 10:39:29AM +0200, Michal Hocko wrote: > On Wed 26-04-23 13:10:54, Marcelo Tosatti wrote: > [...] > > "To test the performance difference, a page allocator microbenchmark: > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c > > with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz. > > > > For the single_page_alloc_free test, which does > > > > /** Loop to measure **/ > > for (i = 0; i < rec->loops; i++) { > > my_page = alloc_page(gfp_mask); > > if (unlikely(my_page == NULL)) > > return 0; > > __free_page(my_page); > > } > > > > Unit is cycles. > > > > Vanilla Patched Diff > > 115.25 117 1.4%" > > > > To be honest, that 1.4% difference was not stable but fluctuated between > > positive and negative percentages (so the performance difference was in > > the noise). > > > > So performance is not a decisive factor in this case. > > It is not neglible considering that majority worklods will not benefit > from this change. You are clearly ignoring that vmstat code has been > highly optimized for local per-cpu access exactly to avoid locked > operations and cache line bouncing. > -- > Michal Hocko > SUSE Labs Again, the values fluctuate between positive and negative performance difference (i happen to have copied a positive value). So the performance difference is in the noise (its not stable at 1.4%), but rather close to 0%. So the data is showing that there is no negative performance impact.
On Wed, Apr 19, 2023 at 04:15:50PM -0300, Marcelo Tosatti wrote: > On Wed, Apr 19, 2023 at 06:47:30PM +0200, Vlastimil Babka wrote: > > On 4/19/23 13:29, Marcelo Tosatti wrote: > > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote: > > >> This was tried before: > > >> https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/ > > >> > > >> My conclusion from that discussion (and work) is that a special system > > >> call: > > >> > > >> 1) Does not allow the benefits to be widely applied (only modified > > >> applications will benefit). Is not portable across different operating systems. > > >> > > >> Removing the vmstat_work interruption is a benefit for HPC workloads, > > >> for example (in fact, it is a benefit for any kind of application, > > >> since the interruption causes cache misses). > > >> > > >> 2) Increases the system call cost for applications which would use > > >> the interface. > > >> > > >> So avoiding the vmstat_update update interruption, without userspace > > >> knowledge and modifications, is a better than solution than a modified > > >> userspace. > > > > > > Another important point is this: if an application dirties > > > its own per-CPU vmstat cache, while performing a system call, > > > and a vmstat sync event is triggered on a different CPU, you'd have to: > > > > > > 1) Wait for that CPU to return to userspace and sync its stats > > > (unfeasible). > > > > > > 2) Queue work to execute on that CPU (undesirable, as that causes > > > an interruption). > > > > So you're saying the application might do a syscall from the isolcpu, so > > IIUC it cannot expect any latency guarantees at that very moment, > > Why not? cyclictest uses nanosleep and its the main tool for measuring > latency. > > > but then > > it immediately starts expecting them again after returning to userspace, > > No, the expectation more generally is this: > > For certain types of applications (for example PLC software or > RAN processing), upon occurrence of an event, it is necessary to > complete a certain task in a maximum amount of time (deadline). > > One way to express this requirement is with a pair of numbers, > deadline time and execution time, where: > > * deadline time: length of time between event and deadline. > * execution time: length of time it takes for processing of event > to occur on a particular hardware platform > (uninterrupted). > > The particular values depend on use-case. For the case > where the realtime application executes in a virtualized > guest, an interruption which must be serviced in the host will cause > the following sequence of events: > > 1) VM-exit > 2) execution of IPI (and function call) (or switch to kwork > thread to execute some work item). > 3) VM-entry > > Which causes an excess of 50us latency as observed by cyclictest > (this violates the latency requirement of vRAN application with 1ms TTI, > for example). > > > and > > a single interruption for a one-time flush after the syscall would be too > > intrusive? > > Generally, if you can't complete the task (which involves executing a > number of instructions) before the deadline, then its a problem. > > One-time flush? You mean to switch between: > > rt-app -> kworker (to execute vmstat_update flush) -> rt-app > > My measurement, which probably had vmstat_update code/data in cache, took 7us. > It might be the case that the code to execute must be brought in from > memory, which takes even longer. > > > (elsewhere in the thread you described an RT app initialization that may > > generate vmstats to flush and then entry userspace loop, again, would a > > single interruption soon after entering the loop be so critical?) > > 1) It depends on the application. For the use-case above, where < 50us > interruption is desired, yes it is critical. > > 2) The interruptions can come from different sources. > > Time > 0 rt-app executing instruction 1 > 1 rt-app executing instruction 2 > 2 scheduler switches between rt-app and kworker > 3 kworker runs vmstat_work > 4 scheduler switches between kworker and rt-app > 5 rt-app executing instruction 3 > 6 ipi to handle a KVM request IPI > 7 fill in your preferred IPI handler > > So the argument "a single interruption might not cause your deadline > to be exceeded" fails (because the time to handle the > different interruptions might sum). > > Does that make sense? Ping ? (just want to double check the reasoning above makes sense).