Message ID | 20230125073502.743446-1-leobras@redhat.com |
---|---|
Headers |
From: Leonardo Bras <leobras@redhat.com>
To: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Muchun Song <muchun.song@linux.dev>, Andrew Morton <akpm@linux-foundation.org>, Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining
Date: Wed, 25 Jan 2023 04:34:57 -0300
Message-Id: <20230125073502.743446-1-leobras@redhat.com>
X-Mailer: git-send-email 2.39.1
Series | Introduce memcg_stock_pcp remote draining |
Message
Leonardo Bras Soares Passos
Jan. 25, 2023, 7:34 a.m. UTC
Disclaimer:
a - The cover letter got bigger than expected, so I had to split it in
    sections to better organize myself. I am not very comfortable with it.
b - Performance numbers below did not include patch 5/5 (Remove flags
    from memcg_stock_pcp), which could further improve performance for
    drain_all_stock(), but I could only notice the optimization at the
    last minute.

0 - Motivation:
On the current codebase, when drain_all_stock() is run, it schedules a
drain_local_stock() for each cpu that has a percpu stock associated with a
descendant of a given root_memcg.

This happens even on 'isolated cpus', a feature commonly used on workloads that
are sensitive to interruption and context switching such as vRAN and Industrial
Control Systems.

Since this scheduling behavior is a problem for those workloads, the proposal is
to replace the current local_lock + schedule_work_on() solution with a per-cpu
spinlock.

1 - Possible performance losses:
With a low number of schedule_work_on() calls, local_locks are expected to
perform better than spinlocks, given the cacheline is always accessed by a
single CPU and there is no contention. The impact on those areas is analyzed
below:

1.1 - Cacheline usage
In the current implementation, drain_all_stock() remotely reads the percpu
memcg_stock cacheline of every online CPU, and remotely writes to all cpus that
pass the mem_cgroup_is_descendant() test (stock->flags, stock->work).

With spinlocks, drain_all_stock() still remotely reads the percpu memcg_stock
cacheline of every online CPU, and remotely writes to all cpus that pass the
mem_cgroup_is_descendant() test (stock->stock_lock, on top of the above).

While the spinlock may require extra acquire/release writes on some archs, they
all happen on an exclusive cacheline, so there is not much overhead.

In both cases, the next local cpu read requires a fetch from memory (as the
line was invalidated by the remote write) and cacheline exclusivity before the
local write. So regarding cacheline usage, there should not be much impact.

1.2 - Contention
We can safely assume that drain_all_stock() does not run often. If it did,
there would be a lot of scheduled tasks, which would kill cpu performance.

Since it does not run often, and it is the only function that accesses the
remote percpu memcg_stock, contention should not be significant, and should
cause less impact than scheduling (remote) and running the scheduled work
(local).

2 - Performance test results:

2.1 - Test System:
- Non-virtualized AMD EPYC 7601 32-Core Processor, 128 CPUs distributed in
  8 NUMA nodes: 0-7,64-71 on node0, 8-15,72-79 on node1, and so on.
- 256GB RAM, 2x32GB per NUMA node
- For cpu isolation: kernel cmdline: isolcpus=8-15,72-79 nohz_full=8-15,72-79
- Memcg group created with:
	[Slice]
	MemoryAccounting=true
	MemoryLimit=1G
	MemoryMax=1G
	MemorySwapMax=1M
- For pinning runs on a given CPU, I used 'taskset -c $cpunum command'
- For restarting the memcg, the following was used:
	restart_memcg(){
		systemctl stop test.slice
		systemctl daemon-reload
		systemctl start test.slice
	}

2.2 - Impact on functions that use memcg_stock:

2.2.1 - Approach:
Using perf or tracepoints gives a very coarse result, so it was preferred to
count the total cpu clocks between entering and exiting the functions that use
the memcg_stock, on an isolated cpu. Something like this was used on x86:

	fun_name(){
		u64 clk = rdtsc_ordered();
		...
		<function does its job>
		clk = rdtsc_ordered() - clk;
		<percpu struct dbg>
		dbg->fun_name.cycles += clk;
		dbg->fun_name.count++;
	}

The percpu statistics were then acquired via a char device after the test
finished, and an average function clock usage was calculated.

For the stress test, run "cat /proc/interrupts > /dev/null" in a loop of
1000000 iterations inside the memcg for each cpu tested:

	for each cpu in $cpuset; do
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
			for k in {1..100000} ; do
				cat /proc/interrupts > /dev/null;
			done"

For the drain_all_stock() test, it was necessary to restart the memcg (or cause
an OOM) to call the function.

2.2.2 - Results
For 1 isolated CPU, pinned on cpu 8, with no drain_all_stock() calls, STDDEV
being the standard deviation between the averages of 6 runs, and Call Count
the sum of calls on the 6 runs:

	Patched			Average clk	STDDEV		Call Count
	consume_stock:		63.75983475	0.1610502136	72167768
	refill_stock:		67.45708322	0.09732816852	23401756
	mod_objcg_state:	98.03841384	1.491628532	181292961
	consume_obj_stock:	63.2543456	0.04624513799	94846454
	refill_obj_stock:	78.56390025	0.3160306174	91732973

	Upstream		Average clk	STDDEV		Call Count
	consume_stock:		53.51201046	0.05200824438	71866536
	refill_stock:		65.46991584	0.1178078417	23401752
	mod_objcg_state:	84.95365055	1.371464414	181292707
	consume_obj_stock:	60.03619438	0.05944582207	94846327
	refill_obj_stock:	73.23757912	1.161933856	91732787

	Patched - Upstream	Diff (cycles)	Diff %
	consume_stock:		10.24782429	19.15051258
	refill_stock:		1.987167378	3.035237411
	mod_objcg_state:	13.08476328	15.40223781
	consume_obj_stock:	3.218151217	5.360351785
	refill_obj_stock:	5.326321123	7.272661368

So on average the above patched functions are 2~13 clock cycles slower than
upstream. On the other hand, drain_all_stock() is faster on the patched
version, even considering it does all the draining instead of scheduling the
work to other CPUs:

	drain_all_stock
	cpus	Upstream	Patched		Diff (cycles)	Diff(%)
	1	44331.10831	38978.03581	-5353.072507	-12.07520567
	8	43992.96512	39026.76654	-4966.198572	-11.2886198
	128	156274.6634	58053.87421	-98220.78915	-62.85138425

	(8 cpus being in the same NUMA node)

2.3 - Contention numbers

2.3.1 - Approach
On top of the patched version, I replaced the spin_lock_irqsave() in the
functions that use the memcg_stock with spin_lock_irqsave_cc(), which is
defined as:

	#define spin_lock_irqsave_cc(l, flags)				\
		if (!spin_trylock_irqsave(l, flags)) {			\
			u64 clk = rdtsc_ordered();			\
			spin_lock_irqsave(l, flags);			\
			clk = rdtsc_ordered() - clk;			\
			pr_err("mydebug: cpu %d hit contention :"	\
			       " fun = %s, clk = %lld\n",		\
			       smp_processor_id(), __func__, clk);	\
		}

So in case of contention (try_lock failing) it records an approximate clock
usage before getting the lock, and prints it to dmesg.

For the stress test, run "cat /proc/interrupts > /dev/null" in a loop of
1000000 iterations for each of the 128 cpus inside the memcg (limit set to
20G):

	for each cpu in {1..128}; do
		restart_memcg()
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
			for k in {1..100000} ; do
				cat /proc/interrupts > /dev/null;
			done"

This loop was repeated for over 8 hours.

2.3.2 - Results

	Function		# calls
	consume_stock:		15078323802
	refill_stock:		2495995683
	mod_objcg_state:	39765559905
	consume_obj_stock:	19882888224
	refill_obj_stock:	21025241793
	drain_all_stock:	592

	Contentions hit:	0

(Other more aggressive synthetic tests were run, and even in that case
contention was hit just a couple of times over some hours.)
2.4 - Syscall time measurement

2.4.1 - Approach
To measure the patchset's effect on syscall time, the following code was used
(copied/adapted from
https://lore.kernel.org/linux-mm/20220924152227.819815-1-atomlin@redhat.com/ ):

################
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;
typedef unsigned long long u64;

#ifdef __x86_64__
#define DECLARE_ARGS(val, low, high)	unsigned long low, high
#define EAX_EDX_VAL(val, low, high)	((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)	"a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)	"=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high)	unsigned long long val
#define EAX_EDX_VAL(val, low, high)	(val)
#define EAX_EDX_ARGS(val, low, high)	"A" (val)
#define EAX_EDX_RET(val, low, high)	"=A" (val)
#endif

static inline unsigned long long __rdtscll(void)
{
	DECLARE_ARGS(val, low, high);

	asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));

	return EAX_EDX_VAL(val, low, high);
}

#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

#define NRSYSCALLS 30000000
#define NRSLEEPS   100000

#define page_mmap()	mmap(NULL, 4096, PROT_READ|PROT_WRITE, \
			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
#define page_munmap(x)	munmap(x, 4096)

void main(int argc, char *argv[])
{
	unsigned long a, b, cycles;
	int i, syscall = 0;
	int *page;

	page = page_mmap();
	if (page == MAP_FAILED)
		perror("mmap");

	if (page_munmap(page))
		perror("munmap");

	if (argc != 2) {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	rdtscll(a);

	if (strncmp("idle", argv[1], 4) == 0)
		syscall = 0;
	else if (strncmp("syscall", argv[1], 7) == 0)
		syscall = 1;
	else {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	if (syscall == 1) {
		for (i = 0; i < NRSYSCALLS; i++) {
			page = page_mmap();
			if (page == MAP_FAILED)
				perror("mmap");
#ifdef MY_WRITE
			page[3] = i;
#endif
			if (page_munmap(page))
				perror("munmap");
		}
	} else {
		for (i = 0; i < NRSLEEPS; i++)
			usleep(10);
	}

	rdtscll(b);

	cycles = b - a;

	if (syscall == 1)
		printf("cycles / syscall: %d\n", (b - a) / (NRSYSCALLS * 2));
	else
		printf("cycles / idle loop: %d\n", (b - a) / NRSLEEPS);
}
################

Running with "./my_test syscall" makes it print the average clock cycle usage
of the syscall pair (page_mmap() and page_munmap()).

It was compiled in two versions: with -DMY_WRITE and without it. The difference
is writing to the allocated page, causing it to fault.

Each version was run 200 times, pinned to an isolated cpu. An average and
standard deviation were then calculated from those results.

2.4.2 - Results

	Patched:	no_write	write
	AVG		2991.195	5746.495
	STDEV		27.77488427	40.55878512
	STDEV %		0.9285547838	0.7058004073

	Upstream:	no_write	write
	AVG		3012.22		5749.605
	STDEV		25.1186451	37.26206223
	STDEV %		0.8338914522	0.6480803851

	Pat - Up:	no_write	write
	Diff		-21.025		-3.11
	Diff %		-0.6979901866	-0.05409067232

Meaning the pair page_mmap() + page_munmap() + pagefault runs a tiny bit faster
on the patched version, compared to upstream.

3 - Discussion of results
In 2.2 we see that every function that uses memcg_stock on the local cpu gets
slower by some cycles on the patched version, while the function that accesses
it remotely (drain_all_stock()) gets faster. The difference is more accentuated
as we raise the cpu count, and consequently start dealing with sharing memory
across NUMA nodes.

In 2.3 we see contention is not a big issue, as expected in 1.2.
This probably happens because drain_all_stock() runs quite rarely in normal
operation, and the other functions are quite fast.

In 2.4 we can see that page_mmap() + page_munmap() + pagefault ran a tiny bit
faster. This is probably because drain_all_stock() does not schedule work on
the running cpu, so the cpu does not get interrupted, possibly making up for
the increased time in the local functions.

4 - Conclusion
Scheduling work on isolated cpus can be an issue for some workloads.
Reducing the issue by replacing the local_lock in memcg_stock with a spinlock
should not cause much impact on performance.

Leonardo Bras (5):
  mm/memcontrol: Align percpu memcg_stock to cache
  mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
  mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes
  mm/memcontrol: Perform all stock drain in current CPU
  mm/memcontrol: Remove flags from memcg_stock_pcp

 mm/memcontrol.c | 75 ++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 44 deletions(-)
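To make the direction of the series easier to picture before the discussion
below, here is a heavily simplified, hypothetical sketch of what "change
stock_lock to a spinlock and drain remote stocks from the current CPU" could
look like. It is not the actual patch; the struct layout, the drain_stock()
stub, and the locking details are assumptions made for illustration only.

	struct memcg_stock_pcp {
		spinlock_t stock_lock;		/* was local_lock_t before the series */
		struct mem_cgroup *cached;	/* memcg the cached charge belongs to */
		unsigned int nr_pages;		/* pre-charged pages cached on this cpu */
	};
	static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

	/* Stub: return stock->nr_pages to the cached memcg's page counters. */
	static void drain_stock(struct memcg_stock_pcp *stock);

	static void drain_all_stock(struct mem_cgroup *root_memcg)
	{
		unsigned long flags;
		int cpu;

		for_each_online_cpu(cpu) {
			struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

			/*
			 * Instead of schedule_work_on(cpu, &stock->work), take
			 * the remote cpu's spinlock and drain its stock from
			 * here, so isolated cpus never get work queued on them.
			 */
			spin_lock_irqsave(&stock->stock_lock, flags);
			if (stock->cached &&
			    mem_cgroup_is_descendant(stock->cached, root_memcg))
				drain_stock(stock);
			spin_unlock_irqrestore(&stock->stock_lock, flags);
		}
	}

The trade-off measured in section 2.2 above is exactly this: every local
fast-path user now takes stock_lock (a few extra cycles), while
drain_all_stock() no longer schedules and waits for per-cpu work items.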
Comments
On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> Disclaimer:
> a - The cover letter got bigger than expected, so I had to split it in
> sections to better organize myself. I am not very confortable with it.
> b - Performance numbers below did not include patch 5/5 (Remove flags
> from memcg_stock_pcp), which could further improve performance for
> drain_all_stock(), but I could only notice the optimization at the
> last minute.
>
>
> 0 - Motivation:
> On current codebase, when drain_all_stock() is ran, it will schedule a
> drain_local_stock() for each cpu that has a percpu stock associated with a
> descendant of a given root_memcg.
>
> This happens even on 'isolated cpus', a feature commonly used on workloads that
> are sensitive to interruption and context switching such as vRAN and Industrial
> Control Systems.
>
> Since this scheduling behavior is a problem to those workloads, the proposal is
> to replace the current local_lock + schedule_work_on() solution with a per-cpu
> spinlock.

If IIRC we have also discussed that isolated CPUs can simply opt out
from the pcp caching and therefore the problem would be avoided
altogether without changes to the locking scheme. I do not see anything
regarding that in this submission. Could you elaborate why you have
abandoned this option?
On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > Disclaimer: > > a - The cover letter got bigger than expected, so I had to split it in > > sections to better organize myself. I am not very confortable with it. > > b - Performance numbers below did not include patch 5/5 (Remove flags > > from memcg_stock_pcp), which could further improve performance for > > drain_all_stock(), but I could only notice the optimization at the > > last minute. > > > > > > 0 - Motivation: > > On current codebase, when drain_all_stock() is ran, it will schedule a > > drain_local_stock() for each cpu that has a percpu stock associated with a > > descendant of a given root_memcg. > > > > This happens even on 'isolated cpus', a feature commonly used on workloads that > > are sensitive to interruption and context switching such as vRAN and Industrial > > Control Systems. > > > > Since this scheduling behavior is a problem to those workloads, the proposal is > > to replace the current local_lock + schedule_work_on() solution with a per-cpu > > spinlock. > > If IIRC we have also discussed that isolated CPUs can simply opt out > from the pcp caching and therefore the problem would be avoided > altogether without changes to the locking scheme. I do not see anything > regarding that in this submission. Could you elaborate why you have > abandoned this option? Hello Michal, I understand pcp caching is a nice to have. So while I kept the idea of disabling pcp caching in mind as an option, I first tried to understand what kind of impacts we would be seeing when trying to change the locking scheme. After I raised the data in the cover letter, I found that the performance impact appears not be that big. So in order to try keeping the pcp cache on isolated cpus active, I decided to focus effort on the locking scheme change. I mean, my rationale is: if is there a non-expensive way of keeping the feature, why should we abandon it? Best regards, Leo
On Wed 25-01-23 08:06:46, Leonardo Brás wrote: > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > Disclaimer: > > > a - The cover letter got bigger than expected, so I had to split it in > > > sections to better organize myself. I am not very confortable with it. > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > from memcg_stock_pcp), which could further improve performance for > > > drain_all_stock(), but I could only notice the optimization at the > > > last minute. > > > > > > > > > 0 - Motivation: > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > descendant of a given root_memcg. > > > > > > This happens even on 'isolated cpus', a feature commonly used on workloads that > > > are sensitive to interruption and context switching such as vRAN and Industrial > > > Control Systems. > > > > > > Since this scheduling behavior is a problem to those workloads, the proposal is > > > to replace the current local_lock + schedule_work_on() solution with a per-cpu > > > spinlock. > > > > If IIRC we have also discussed that isolated CPUs can simply opt out > > from the pcp caching and therefore the problem would be avoided > > altogether without changes to the locking scheme. I do not see anything > > regarding that in this submission. Could you elaborate why you have > > abandoned this option? > > Hello Michal, > > I understand pcp caching is a nice to have. > So while I kept the idea of disabling pcp caching in mind as an option, I first > tried to understand what kind of impacts we would be seeing when trying to > change the locking scheme. > > After I raised the data in the cover letter, I found that the performance impact > appears not be that big. So in order to try keeping the pcp cache on isolated > cpus active, I decided to focus effort on the locking scheme change. > > I mean, my rationale is: if is there a non-expensive way of keeping the feature, > why should we abandon it? Because any locking to pcp adds a potential contention. You haven't been able to trigger that contention by your testing but that doesn't really mean it is non-existent. Besided that opt-out for isolated cpus should be a much smaller change with a much more predictable behavior. The overall performance for charging on isolated cpus will be slightly worse but it hasn't been seen this is anything really noticeable. Most workloads which do care about isolcpus tend to not enter kernel much AFAIK so this should have relatively small impact. All that being said I would prefer a simpler solution which seems to be to simply opt out from pcp caching and if the performance for real world workloads shows regressions then we can start thinking about a more complex solution.
On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > Disclaimer: > > > a - The cover letter got bigger than expected, so I had to split it in > > > sections to better organize myself. I am not very confortable with it. > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > from memcg_stock_pcp), which could further improve performance for > > > drain_all_stock(), but I could only notice the optimization at the > > > last minute. > > > > > > > > > 0 - Motivation: > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > descendant of a given root_memcg. > > > > > > This happens even on 'isolated cpus', a feature commonly used on workloads that > > > are sensitive to interruption and context switching such as vRAN and Industrial > > > Control Systems. > > > > > > Since this scheduling behavior is a problem to those workloads, the proposal is > > > to replace the current local_lock + schedule_work_on() solution with a per-cpu > > > spinlock. > > > > If IIRC we have also discussed that isolated CPUs can simply opt out > > from the pcp caching and therefore the problem would be avoided > > altogether without changes to the locking scheme. I do not see anything > > regarding that in this submission. Could you elaborate why you have > > abandoned this option? > > Hello Michal, > > I understand pcp caching is a nice to have. > So while I kept the idea of disabling pcp caching in mind as an option, I first > tried to understand what kind of impacts we would be seeing when trying to > change the locking scheme. Remote draining reduces interruptions whether CPU is marked as isolated or not: - Allows isolated CPUs from benefiting of pcp caching. - Removes the interruption to non isolated CPUs. See for example https://lkml.org/lkml/2022/6/13/2769 "Minchan Kim tested this independently and reported; My workload is not NOHZ CPUs but run apps under heavy memory pressure so they goes to direct reclaim and be stuck on drain_all_pages until work on workqueue run. unit: nanosecond max(dur) avg(dur) count(dur) 166713013 487511.77786438033 1283 From traces, system encountered the drain_all_pages 1283 times and worst case was 166ms and avg was 487us. The other problem was alloc_contig_range in CMA. The PCP draining takes several hundred millisecond sometimes though there is no memory pressure or a few of pages to be migrated out but CPU were fully booked. Your patch perfectly removed those wasted time." > After I raised the data in the cover letter, I found that the performance impact > appears not be that big. So in order to try keeping the pcp cache on isolated > cpus active, I decided to focus effort on the locking scheme change. > > I mean, my rationale is: if is there a non-expensive way of keeping the feature, > why should we abandon it? > > Best regards, > Leo > > > > > > > >
On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > Disclaimer: > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > sections to better organize myself. I am not very confortable with it. > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > from memcg_stock_pcp), which could further improve performance for > > > > drain_all_stock(), but I could only notice the optimization at the > > > > last minute. > > > > > > > > > > > > 0 - Motivation: > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > descendant of a given root_memcg. Do you know what caused those drain_all_stock() calls? I wonder if we should look into why we have many of them and whether we really need them? It's either some user's actions (e.g. reducing memory.max), either some memcg is entering pre-oom conditions. In the latter case a lot of drain calls can be scheduled without a good reason (assuming the cgroup contain multiple tasks running on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota and move it locally. I wonder in such circumstances if we need to disable the pcp-caching on per-cgroup basis. Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some charges/memcg references (it might be not completely idle, but running some very special workload which is not doing any kernel allocations or a process belonging to the root memcg). In all other cases pcpu stock will be either drained naturally by an allocation from another memcg or an allocation from the same memcg will "restore" it, making draining useless. We also can into drain_all_pages() opportunistically, without waiting for the result. On a busy system it's most likely useless, we might oom before scheduled works will be executed. I admit I planned to do some work around and even started, but then never had enough time to finish it. Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower for an improvement of stock draining. It's not a strong objection, but IMO we should avoid doing this without a really strong reason. Thanks!
On Wed 25-01-23 15:14:48, Roman Gushchin wrote: > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > Disclaimer: > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > sections to better organize myself. I am not very confortable with it. > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > last minute. > > > > > > > > > > > > > > > 0 - Motivation: > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > descendant of a given root_memcg. > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > into why we have many of them and whether we really need them? > > It's either some user's actions (e.g. reducing memory.max), either some memcg > is entering pre-oom conditions. In the latter case a lot of drain calls can be > scheduled without a good reason (assuming the cgroup contain multiple tasks running > on multiple cpus). I believe I've never got a specific answer to that. We have discussed that in the previous version submission (20221102020243.522358-1-leobras@redhat.com and specifically Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and isolcpus. I was wondering about using memcgs in RT workloads because that just sounds weird but let's say this is the case indeed. Then an RT task or whatever task that is running on an isolated cpu can have pcp charges. > Essentially each cpu will try to grab the remains of the memory quota > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > on per-cgroup basis. I think it would be more than sufficient to disable pcp charging on an isolated cpu. This is not a per memcg property. I can imagine that different tasks running in the same memcg can run on a mix of CPUs (e.g. only part of it on isolated CPUs). It is a recipe for all sorts of priority inversions but well, memcg and RT is there already. > Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some > charges/memcg references (it might be not completely idle, but running some very special > workload which is not doing any kernel allocations or a process belonging to the root memcg). > In all other cases pcpu stock will be either drained naturally by an allocation from another > memcg or an allocation from the same memcg will "restore" it, making draining useless. > > We also can into drain_all_pages() opportunistically, without waiting for the result. > On a busy system it's most likely useless, we might oom before scheduled works will be executed. I think the primary objective is that no userspace unintended execution happens on isolated cpus. > I admit I planned to do some work around and even started, but then never had enough time to > finish it. > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid > doing this without a really strong reason. Are you OK with a simple opt out on isolated CPUs? 
That would make charges slightly slower (atomic on the hierarchy counters vs. a single pcp adjustment) but it would guarantee that the isolated workload is predictable which is the primary objective AFAICS.
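For reference, the opt-out Michal asks about could look roughly like the check
sketched below in the charge fast path. This is only an illustration under
assumptions: the helper used (housekeeping_cpu() with HK_TYPE_DOMAIN), its
placement, and the memcg_stock_bypass() name are not taken from this thread or
from any posted patch.

	#include <linux/sched/isolation.h>

	/*
	 * Hypothetical sketch: decide whether the current cpu should bypass
	 * the memcg percpu stock entirely. Isolated (non-housekeeping) cpus
	 * would charge the page_counter hierarchy directly, so they never
	 * build up a stock that later needs draining, and no drain work item
	 * would ever need to be scheduled on them.
	 */
	static bool memcg_stock_bypass(void)
	{
		return !housekeeping_cpu(raw_smp_processor_id(), HK_TYPE_DOMAIN);
	}

	static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
	{
		if (memcg_stock_bypass())
			return false;	/* fall back to page_counter charging */

		/* ... existing percpu stock fast path ... */
		return false;
	}

A matching check in refill_stock() (and the objcg equivalents) would keep
isolated cpus from ever caching charges in the first place.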
On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
[...]
> Remote draining reduces interruptions whether CPU
> is marked as isolated or not:
>
> - Allows isolated CPUs from benefiting of pcp caching.
> - Removes the interruption to non isolated CPUs. See for example
>
> https://lkml.org/lkml/2022/6/13/2769

This is talking about page allocator per cpu caches, right? In this patch
we are talking about memcg pcp caches. Are you sure the same applies here?
On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > On Wed 25-01-23 15:14:48, Roman Gushchin wrote: > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > > Disclaimer: > > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > > sections to better organize myself. I am not very confortable with it. > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > > last minute. > > > > > > > > > > > > > > > > > > 0 - Motivation: > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > > descendant of a given root_memcg. > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > > into why we have many of them and whether we really need them? > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg > > is entering pre-oom conditions. In the latter case a lot of drain calls can be > > scheduled without a good reason (assuming the cgroup contain multiple tasks running > > on multiple cpus). > > I believe I've never got a specific answer to that. We > have discussed that in the previous version submission > (20221102020243.522358-1-leobras@redhat.com and specifically > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and > isolcpus. I was wondering about using memcgs in RT workloads because > that just sounds weird but let's say this is the case indeed. This could be the case. You can consider an "edge device" where it is necessary to run a RT workload. It might also be useful to run non realtime applications on the same system. > Then an RT task or whatever task that is running on an isolated > cpu can have pcp charges. Usually the RT task (or more specifically the realtime sensitive loop of the application) runs entirely on userspace. But i suppose there could be charges on application startup. > > Essentially each cpu will try to grab the remains of the memory quota > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > on per-cgroup basis. > > I think it would be more than sufficient to disable pcp charging on an > isolated cpu. This is not a per memcg property. I can imagine that > different tasks running in the same memcg can run on a mix of CPUs (e.g. > only part of it on isolated CPUs). It is a recipe for all sorts of > priority inversions but well, memcg and RT is there already. I suppose the more general the solution, the better. > > Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some > > charges/memcg references (it might be not completely idle, but running some very special > > workload which is not doing any kernel allocations or a process belonging to the root memcg). > > In all other cases pcpu stock will be either drained naturally by an allocation from another > > memcg or an allocation from the same memcg will "restore" it, making draining useless. > > > > We also can into drain_all_pages() opportunistically, without waiting for the result. 
> > On a busy system it's most likely useless, we might oom before scheduled works will be executed. > > I think the primary objective is that no userspace unintended execution > happens on isolated cpus. No interruptions to the userspace code (time sensitive code) running on isolated CPUs: no IPIs, no task switches. > > I admit I planned to do some work around and even started, but then never had enough time to > > finish it. > > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid > > doing this without a really strong reason. > > Are you OK with a simple opt out on isolated CPUs? That would make > charges slightly slower (atomic on the hierarchy counters vs. a single > pcp adjustment) but it would guarantee that the isolated workload is > predictable which is the primary objective AFAICS. This would make isolated CPUs "second class citizens": it would be nice to be able to execute non realtime apps on isolated CPUs as well (think of different periods of time during a day, one where more realtime apps are required, another where less realtime apps are required). Concrete example: think of a computer handling vRAN traffic near a cell tower. The traffic (therefore amount of processing required by realtime applications) might vary during the day. User might want to run containers that depend on good memcg charging performance on isolated CPUs.
On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote: > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote: > [...] > > Remote draining reduces interruptions whether CPU > > is marked as isolated or not: > > > > - Allows isolated CPUs from benefiting of pcp caching. > > - Removes the interruption to non isolated CPUs. See for example > > > > https://lkml.org/lkml/2022/6/13/2769 > > This is talking about page allocato per cpu caches, right? In this patch > we are talking about memcg pcp caches. Are you sure the same applies > here? Both can stall the users of the drain operation. "Minchan Kim tested this independently and reported; My workload is not NOHZ CPUs but run apps under heavy memory pressure so they goes to direct reclaim and be stuck on drain_all_pages until work on workqueue run." Therefore using a workqueue to drain memcg pcps also depends on the remote CPU executing that work item in time (which can stall the following). No? === 7 3141 mm/memory.c <<wp_page_copy>> if (mem_cgroup_charge(page_folio(new_page), mm, GFP_KERNEL)) 8 4118 mm/memory.c <<do_anonymous_page>> if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) 9 4577 mm/memory.c <<do_cow_fault>> if (mem_cgroup_charge(page_folio(vmf->cow_page), vma->vm_mm, 10 621 mm/migrate_device.c <<migrate_vma_insert_page>> if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) 11 710 mm/shmem.c <<shmem_add_to_page_cache>> error = mem_cgroup_charge(folio, charge_mm, gfp);
On Wed, Jan 25, 2023 at 03:14:48PM -0800, Roman Gushchin wrote: > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > Disclaimer: > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > sections to better organize myself. I am not very confortable with it. > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > last minute. > > > > > > > > > > > > > > > 0 - Motivation: > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > descendant of a given root_memcg. > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > into why we have many of them and whether we really need them? > > It's either some user's actions (e.g. reducing memory.max), either some memcg > is entering pre-oom conditions. In the latter case a lot of drain calls can be > scheduled without a good reason (assuming the cgroup contain multiple tasks running > on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > on per-cgroup basis. > > Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some > charges/memcg references (it might be not completely idle, but running some very special > workload which is not doing any kernel allocations or a process belonging to the root memcg). > In all other cases pcpu stock will be either drained naturally by an allocation from another > memcg or an allocation from the same memcg will "restore" it, making draining useless. > > We also can into drain_all_pages() opportunistically, without waiting for the result. > On a busy system it's most likely useless, we might oom before scheduled works will be executed. > > I admit I planned to do some work around and even started, but then never had enough time to > finish it. > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid > doing this without a really strong reason. The expectation would be that cache locking should not cause slowdown of the allocation and free paths: https://manualsbrain.com/en/manuals/1246877/?page=313 For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK oper- ation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to insure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that ...
On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote: > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote: > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote: > > [...] > > > Remote draining reduces interruptions whether CPU > > > is marked as isolated or not: > > > > > > - Allows isolated CPUs from benefiting of pcp caching. > > > - Removes the interruption to non isolated CPUs. See for example > > > > > > https://lkml.org/lkml/2022/6/13/2769 > > > > This is talking about page allocato per cpu caches, right? In this patch > > we are talking about memcg pcp caches. Are you sure the same applies > > here? > > Both can stall the users of the drain operation. Yes. But it is important to consider who those users are. We are draining when - we are charging and the limit is hit so that memory reclaim has to be triggered. - hard, high limits are set and require memory reclaim. - force_empty - full memory reclaim for a memcg - memcg offlining - cgroup removel - quite a heavy operation as well. all those could be really costly kernel operations and they affect isolated cpu only if the same memcg is used by both isolated and non-isolated cpus. In other words those costly operations would have to be triggered from non-isolated cpus and those are to be expected to be stalled. It is the side effect of the local cpu draining that is scheduled that affects the isolated cpu as well. Is that more clear?
On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote: > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote: > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > > > Disclaimer: > > > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > > > sections to better organize myself. I am not very confortable with it. > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > > > last minute. > > > > > > > > > > > > > > > > > > > > > 0 - Motivation: > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > > > descendant of a given root_memcg. > > > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > > > into why we have many of them and whether we really need them? > > > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running > > > on multiple cpus). > > > > I believe I've never got a specific answer to that. We > > have discussed that in the previous version submission > > (20221102020243.522358-1-leobras@redhat.com and specifically > > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and > > isolcpus. I was wondering about using memcgs in RT workloads because > > that just sounds weird but let's say this is the case indeed. > > This could be the case. You can consider an "edge device" where it is > necessary to run a RT workload. It might also be useful to run > non realtime applications on the same system. > > > Then an RT task or whatever task that is running on an isolated > > cpu can have pcp charges. > > Usually the RT task (or more specifically the realtime sensitive loop > of the application) runs entirely on userspace. But i suppose there > could be charges on application startup. What is the role of memcg then? If the memory limit is in place and the workload doesn't fit in then it will get reclaimed during start up and memory would need to be refaulted if not mlocked. If it is mlocked then the limit cannot be enforced and the start up would likely fail as a result of the memcg oom killer. [...] > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower > > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid > > > doing this without a really strong reason. > > > > Are you OK with a simple opt out on isolated CPUs? That would make > > charges slightly slower (atomic on the hierarchy counters vs. a single > > pcp adjustment) but it would guarantee that the isolated workload is > > predictable which is the primary objective AFAICS. 
> > This would make isolated CPUs "second class citizens": it would be nice > to be able to execute non realtime apps on isolated CPUs as well > (think of different periods of time during a day, one where > more realtime apps are required, another where less > realtime apps are required). An alternative requires to make the current implementation that is lockless to use locks and introduce potential lock contention. This could be harmful to regular workloads. Not using pcp caching would make a fast path using few atomics rather than local pcp update. That is not a terrible cost to pay for special cased workloads which use isolcpus. Really we are not talking about a massive cost to be payed. At least nobody has shown that in any numbers. > Concrete example: think of a computer handling vRAN traffic near a > cell tower. The traffic (therefore amount of processing required > by realtime applications) might vary during the day. > > User might want to run containers that depend on good memcg charging > performance on isolated CPUs. What kind of role would memcg play here? Do you want to memory constrain that workload?
On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > On Wed 25-01-23 15:14:48, Roman Gushchin wrote: > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > > Disclaimer: > > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > > sections to better organize myself. I am not very confortable with it. > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > > last minute. > > > > > > > > > > > > > > > > > > 0 - Motivation: > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > > descendant of a given root_memcg. > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > > into why we have many of them and whether we really need them? > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg > > is entering pre-oom conditions. In the latter case a lot of drain calls can be > > scheduled without a good reason (assuming the cgroup contain multiple tasks running > > on multiple cpus). > > I believe I've never got a specific answer to that. We > have discussed that in the previous version submission > (20221102020243.522358-1-leobras@redhat.com and specifically > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and > isolcpus. I was wondering about using memcgs in RT workloads because > that just sounds weird but let's say this is the case indeed. Then an RT > task or whatever task that is running on an isolated cpu can have pcp > charges. > > > Essentially each cpu will try to grab the remains of the memory quota > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > on per-cgroup basis. > > I think it would be more than sufficient to disable pcp charging on an > isolated cpu. It might have significant performance consequences. I'd rather opt out of stock draining for isolated cpus: it might slightly reduce the accuracy of memory limits and slightly increase the memory footprint (all those dying memcgs...), but the impact will be limited. Actually it is limited by the number of cpus. > This is not a per memcg property. Sure, my point was that in pre-oom condition several cpus might try to consolidate the remains of the memory quota, actually working against each other. Separate issue, which might be a reason why there are many flush attempts in the case we discuss. Thanks!
On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > > Disclaimer:
> > > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > > sections to better organize myself. I am not very confortable with it.
> > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > > from memcg_stock_pcp), which could further improve performance for
> > > > > > > > drain_all_stock(), but I could only notice the optimization at the
> > > > > > > > last minute.
> > > > > > > >
> > > > > > > >
> > > > > > > > 0 - Motivation:
> > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > > descendant of a given root_memcg.
> > > >
> > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > > into why we have many of them and whether we really need them?
> > > >
> > > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > > on multiple cpus).
> > >
> > > I believe I've never got a specific answer to that. We
> > > have discussed that in the previous version submission
> > > (20221102020243.522358-1-leobras@redhat.com and specifically
> > > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and
> > > isolcpus. I was wondering about using memcgs in RT workloads because
> > > that just sounds weird but let's say this is the case indeed.
> >
> > This could be the case. You can consider an "edge device" where it is
> > necessary to run a RT workload. It might also be useful to run
> > non realtime applications on the same system.
> >
> > > Then an RT task or whatever task that is running on an isolated
> > > cpu can have pcp charges.
> >
> > Usually the RT task (or more specifically the realtime sensitive loop
> > of the application) runs entirely on userspace. But i suppose there
> > could be charges on application startup.
>
> What is the role of memcg then? If the memory limit is in place and the
> workload doesn't fit in then it will get reclaimed during start up and
> memory would need to be refaulted if not mlocked. If it is mlocked then
> the limit cannot be enforced and the start up would likely fail as a
> result of the memcg oom killer.

1) Application which is not time sensitive executes on isolated CPU,
   with memcg control enabled. Per-CPU stock is created.

2) App with memcg control enabled exits, per-CPU stock is not drained.

3) Latency sensitive application starts, isolated per-CPU has stock
   to be drained, and:

/*
 * Drains all per-CPU charge caches for given root_memcg resp. subtree
 * of the hierarchy under it.
 */
static void drain_all_stock(struct mem_cgroup *root_memcg)
{
	int cpu, curcpu;

	/* If someone's already draining, avoid adding running more workers. */
	if (!mutex_trylock(&percpu_charge_mutex))
		return;
	/*
	 * Notify other cpus that system-wide "drain" is running
	 * We do not care about races with the cpu hotplug because cpu down
	 * as well as workers from this path always operate on the local
	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
	 */
	migrate_disable();
	curcpu = smp_processor_id();
	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct mem_cgroup *memcg;
		bool flush = false;

		rcu_read_lock();
		memcg = stock->cached;
		if (memcg && stock->nr_pages &&
		    mem_cgroup_is_descendant(memcg, root_memcg))
			flush = true;
		else if (obj_stock_flush_required(stock, root_memcg))
			flush = true;
		rcu_read_unlock();

		if (flush &&
		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
			if (cpu == curcpu)
				drain_local_stock(&stock->work);
			else
				schedule_work_on(cpu, &stock->work);
		}
	}
	migrate_enable();
	mutex_unlock(&percpu_charge_mutex);
}

> [...]
> > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > > > doing this without a really strong reason.
> > >
> > > Are you OK with a simple opt out on isolated CPUs? That would make
> > > charges slightly slower (atomic on the hierarchy counters vs. a single
> > > pcp adjustment) but it would guarantee that the isolated workload is
> > > predictable which is the primary objective AFAICS.
> >
> > This would make isolated CPUs "second class citizens": it would be nice
> > to be able to execute non realtime apps on isolated CPUs as well
> > (think of different periods of time during a day, one where
> > more realtime apps are required, another where less
> > realtime apps are required).
>
> An alternative requires to make the current implementation that is
> lockless to use locks and introduce potential lock contention. This
> could be harmful to regular workloads. Not using pcp caching would make
> a fast path using few atomics rather than local pcp update. That is not
> a terrible cost to pay for special cased workloads which use isolcpus.
> Really we are not talking about a massive cost to be payed. At least
> nobody has shown that in any numbers.
>
> > Concrete example: think of a computer handling vRAN traffic near a
> > cell tower. The traffic (therefore amount of processing required
> > by realtime applications) might vary during the day.
> >
> > User might want to run containers that depend on good memcg charging
> > performance on isolated CPUs.
>
> What kind of role would memcg play here? Do you want to memory constrain
> that workload?

See example above.

> --
> Michal Hocko
> SUSE Labs
On Thu, 2023-01-26 at 15:19 -0300, Marcelo Tosatti wrote: > On Wed, Jan 25, 2023 at 03:14:48PM -0800, Roman Gushchin wrote: > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > > Disclaimer: > > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > > sections to better organize myself. I am not very confortable with it. > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > > last minute. > > > > > > > > > > > > > > > > > > 0 - Motivation: > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > > descendant of a given root_memcg. > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > > into why we have many of them and whether we really need them? > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg > > is entering pre-oom conditions. In the latter case a lot of drain calls can be > > scheduled without a good reason (assuming the cgroup contain multiple tasks running > > on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > on per-cgroup basis. > > > > Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some > > charges/memcg references (it might be not completely idle, but running some very special > > workload which is not doing any kernel allocations or a process belonging to the root memcg). > > In all other cases pcpu stock will be either drained naturally by an allocation from another > > memcg or an allocation from the same memcg will "restore" it, making draining useless. > > > > We also can into drain_all_pages() opportunistically, without waiting for the result. > > On a busy system it's most likely useless, we might oom before scheduled works will be executed. > > > > I admit I planned to do some work around and even started, but then never had enough time to > > finish it. > > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid > > doing this without a really strong reason. > > The expectation would be that cache locking should not cause slowdown of > the allocation and free paths: > > https://manualsbrain.com/en/manuals/1246877/?page=313 > > For the P6 and more recent processor families, if the area of memory being locked > during a LOCK operation is cached in the processor that is performing the LOCK oper- > ation as write-back memory and is completely contained in a cache line, the > processor may not assert the LOCK# signal on the bus. Instead, it will modify the > memory location internally and allow it’s cache coherency mechanism to insure that > the operation is carried out atomically. This operation is called “cache locking.” The > cache coherency mechanism automatically prevents two or more processors that ... 
> > 

Just to keep the info easily available: the protected structure (struct memcg_stock_pcp) fits in 48 bytes, which is less than the usual 64B cacheline.

struct memcg_stock_pcp {
	spinlock_t              stock_lock;              /*     0     4 */
	unsigned int            nr_pages;                /*     4     4 */
	struct mem_cgroup *     cached;                  /*     8     8 */
	struct obj_cgroup *     cached_objcg;            /*    16     8 */
	struct pglist_data *    cached_pgdat;            /*    24     8 */
	unsigned int            nr_bytes;                /*    32     4 */
	int                     nr_slab_reclaimable_b;   /*    36     4 */
	int                     nr_slab_unreclaimable_b; /*    40     4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};

(It got smaller after patches 3/5, 4/5 and 5/5, which remove holes, work_struct and flags respectively.)

On top of that, patch 1/5 makes sure the percpu allocation is aligned to cacheline size.
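As a side note for readers, the alignment goal of patch 1/5 can be illustrated with a one-line sketch (illustrative only; the exact percpu definition used by the series may differ):

/*
 * Illustrative sketch, not necessarily the exact change in patch 1/5:
 * making each per-cpu copy of memcg_stock start on its own cacheline means
 * a remote drainer touching one cpu's stock does not false-share a line
 * with neighbouring per-cpu data.
 */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct memcg_stock_pcp, memcg_stock);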
On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote: > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote: > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote: > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote: > > > [...] > > > > Remote draining reduces interruptions whether CPU > > > > is marked as isolated or not: > > > > > > > > - Allows isolated CPUs from benefiting of pcp caching. > > > > - Removes the interruption to non isolated CPUs. See for example > > > > > > > > https://lkml.org/lkml/2022/6/13/2769 > > > > > > This is talking about page allocato per cpu caches, right? In this patch > > > we are talking about memcg pcp caches. Are you sure the same applies > > > here? > > > > Both can stall the users of the drain operation. > > Yes. But it is important to consider who those users are. We are > draining when > - we are charging and the limit is hit so that memory reclaim > has to be triggered. > - hard, high limits are set and require memory reclaim. > - force_empty - full memory reclaim for a memcg > - memcg offlining - cgroup removel - quite a heavy operation as > well. > all those could be really costly kernel operations and they affect > isolated cpu only if the same memcg is used by both isolated and non-isolated > cpus. In other words those costly operations would have to be triggered > from non-isolated cpus and those are to be expected to be stalled. It is > the side effect of the local cpu draining that is scheduled that affects > the isolated cpu as well. > > Is that more clear? I think so, please help me check: IIUC, we can approach this by dividing the problem in two working modes: 1 - Normal, meaning no drain_all_stock() running. 2 - Draining, grouping together pre-OOM and userspace 'config' : changing, destroying, reconfiguring a memcg. For (1), we will have (ideally) only local cpu working on the percpu struct. This mode will not have any kind of contention, because each CPU will hold it's own spinlock only. For (2), we will have a lot of drain_all_stock() running. This will mean a lot of schedule_work_on() running (on upstream) or possibly causing contention, i.e. local cpus having to wait for a lock to get their cache, on the patch proposal. Ok, given the above is correct: # Some arguments point that (1) becomes slower with this patch. This is partially true: while test 2.2 pointed that local cpu functions running time had became slower by a few cycles, test 2.4 points that the userspace perception of it was that the syscalls and pagefaulting actually became faster: During some debugging tests before getting the performance on test 2.4, I noticed that the 'syscall + write' test would call all those functions that became slower on test 2.2. Those functions were called multiple millions of times during a single test, and still the patched version performance test returned faster for test 2.4 than upstream version. Maybe the functions became slower, but overall the usage of them in the usual context became faster. Is not that a small improvement? # Regarding (2), I notice that we fear contention While this seems to be the harder part of the discussion, I think we have enough data to deal with it. In which case contention would be a big problem here? IIUC it would be when a lot of drain_all_stock() get running because the memory limit is getting near. I mean, having the user to create / modify a memcg multiple times a second for a while is not something that is expected, IMHO. 
Now, assuming the above is correct and the case where contention could be a problem is a memcg under high memory pressure, we have the argument that Marcelo Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page allocation brought better results than local_locks + schedule_work_on().

I mean, while contention would cause a cpu to wait for a while before getting the lock to allocate a page from the cache, something similar would happen with schedule_work_on(), which forces the current task to wait while the draining happens locally.

What I am able to see is that, for each drain_all_stock(), each cpu getting drained has the option to (a) (sometimes) wait for a lock to be freed, or (b) wait for a whole context switch to happen. And IIUC, (b) is much slower than (a) on average, and this is what causes the improved performance seen in [P1].

(I mean, waiting while drain_local_stock() runs on the local CPU vs waiting for it to run on the remote CPU may not be that different, since the cacheline is already written to by the remote cpu on upstream.)

Also, according to test 2.2, in the patched version drain_local_stock() has gotten faster (much faster for 128 cpus), even though it does all the draining instead of just scheduling it on the other cpus. Adding that to the brief nature of the local cpu functions, we may not hit contention as much as expected.

##

Sorry for the long text. I may be missing some point, please let me know if that's the case.

Thanks a lot for reviewing!
Leo

[P1]: https://lkml.org/lkml/2022/6/13/2769
On Thu 26-01-23 21:32:18, Marcelo Tosatti wrote: > On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote: > > On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote: > > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > > > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote: > > > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > > > > > Disclaimer: > > > > > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > > > > > sections to better organize myself. I am not very confortable with it. > > > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > > > > > last minute. > > > > > > > > > > > > > > > > > > > > > > > > > > > 0 - Motivation: > > > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > > > > > descendant of a given root_memcg. > > > > > > > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > > > > > into why we have many of them and whether we really need them? > > > > > > > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg > > > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be > > > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running > > > > > on multiple cpus). > > > > > > > > I believe I've never got a specific answer to that. We > > > > have discussed that in the previous version submission > > > > (20221102020243.522358-1-leobras@redhat.com and specifically > > > > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and > > > > isolcpus. I was wondering about using memcgs in RT workloads because > > > > that just sounds weird but let's say this is the case indeed. > > > > > > This could be the case. You can consider an "edge device" where it is > > > necessary to run a RT workload. It might also be useful to run > > > non realtime applications on the same system. > > > > > > > Then an RT task or whatever task that is running on an isolated > > > > cpu can have pcp charges. > > > > > > Usually the RT task (or more specifically the realtime sensitive loop > > > of the application) runs entirely on userspace. But i suppose there > > > could be charges on application startup. > > > > What is the role of memcg then? If the memory limit is in place and the > > workload doesn't fit in then it will get reclaimed during start up and > > memory would need to be refaulted if not mlocked. If it is mlocked then > > the limit cannot be enforced and the start up would likely fail as a > > result of the memcg oom killer. > > 1) Application which is not time sensitive executes on isolated CPU, > with memcg control enabled. Per-CPU stock is created. > > 2) App with memcg control enabled exits, per-CPU stock is not drained. > > 3) Latency sensitive application starts, isolated per-CPU has stock to > be drained, and: > > /* > * Drains all per-CPU charge caches for given root_memcg resp. subtree > * of the hierarchy under it. 
> */
> static void drain_all_stock(struct mem_cgroup *root_memcg)

No, this is not really answering my question. See Y9LQ615H13RmG7wL@dhcp22.suse.cz, which already explains how the draining would be triggered; it does not happen on just any operation. I am really asking for specific workloads that run multiple processes on a mix of isolated and non-isolated cpus yet share a memcg, so that they can interfere. The consequences of such a shared memcg are described above.
[Cc Frederic]

On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
[...]
> > > Essentially each cpu will try to grab the remains of the memory quota
> > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > on per-cgroup basis.
> > 
> > I think it would be more than sufficient to disable pcp charging on an
> > isolated cpu.
> 
> It might have significant performance consequences.

Is it really significant?

> I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> the accuracy of memory limits and slightly increase the memory footprint (all
> those dying memcgs...), but the impact will be limited. Actually it is limited
> by the number of cpus.

Hmm, OK, I have misunderstood your proposal. Yes, the overall pcp charges potentially left behind should be small, and that shouldn't really be a concern for memcg oom situations (unless the limit is very small, and workloads on isolated cpus using small hard limits are way beyond my imagination).

My first thought was that those charges could be left behind without any upper bound, but in reality sooner or later something should be running on those cpus, and if the memcg is gone the pcp cache would get refilled and the old charges released.

So yes, this is actually a better and even simpler solution. All we need is something like this:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..13b84bbd70ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		struct mem_cgroup *memcg;
 		bool flush = false;
 
+		if (cpu_is_isolated(cpu))
+			continue;
+
 		rcu_read_lock();
 		memcg = stock->cached;
 		if (memcg && stock->nr_pages &&

There is no such cpu_is_isolated() AFAICS, so we would need help from the NOHZ and cpuisol people to create one for us. Frederic, would such an abstraction make any sense from your POV?
On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote: > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote: > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > > > Disclaimer: > > > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > > > sections to better organize myself. I am not very confortable with it. > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > > > last minute. > > > > > > > > > > > > > > > > > > > > > 0 - Motivation: > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > > > descendant of a given root_memcg. > > > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > > > into why we have many of them and whether we really need them? > > > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running > > > on multiple cpus). > > > > I believe I've never got a specific answer to that. We > > have discussed that in the previous version submission > > (20221102020243.522358-1-leobras@redhat.com and specifically > > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and > > isolcpus. I was wondering about using memcgs in RT workloads because > > that just sounds weird but let's say this is the case indeed. Then an RT > > task or whatever task that is running on an isolated cpu can have pcp > > charges. > > > > > Essentially each cpu will try to grab the remains of the memory quota > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > > on per-cgroup basis. > > > > I think it would be more than sufficient to disable pcp charging on an > > isolated cpu. > > It might have significant performance consequences. > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > the accuracy of memory limits and slightly increase the memory footprint (all > those dying memcgs...), but the impact will be limited. Actually it is limited > by the number of cpus. I was discussing this same idea with Marcelo yesterday morning. The questions had in the topic were: a - About how many pages the pcp cache will hold before draining them itself? b - Would it cache any kind of bigger page, or huge page in this same aspect? Please let me know if I got anything wrong, but IIUC from a previous debug (a)'s answer is 4 pages. Meaning even on bigger-page archs such as powerpc, with 64k pages, the max pcp cache 'wasted' on each processor would be 256k (very small on today's standard). Please let me know if you have any info on (b), or any correcting on (a). 
The thing is: skipping drain_local_stock() only for isolated cpus would not bring the same benefits to non-isolated cpus under high memory pressure that, as I understand it, this patchset brings.

OTOH not running drain_local_stock() at all on any cpu may bring performance gains (no remote CPU access), but it can be a problem if I got the 'wasted pages' count in (a) wrong. I mean, drain_local_stock() was introduced for a reason.

> 
> > This is not a per memcg property.
> 
> Sure, my point was that in pre-oom condition several cpus might try to consolidate
> the remains of the memory quota, actually working against each other. Separate issue,
> which might be a reason why there are many flush attempts in the case we discuss.
> 
> Thanks!
> 

Thank you for reviewing!
Leo
On Fri 27-01-23 04:14:19, Leonardo Brás wrote: > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote: [...] > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > the accuracy of memory limits and slightly increase the memory footprint (all > > those dying memcgs...), but the impact will be limited. Actually it is limited > > by the number of cpus. > > I was discussing this same idea with Marcelo yesterday morning. > > The questions had in the topic were: > a - About how many pages the pcp cache will hold before draining them itself? MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache doesn't really hold any pages. It is a mere counter of how many charges have been accounted for the memcg page counter. So it is not really consuming proportional amount of resources. It just pins the corresponding memcg. Have a look at consume_stock and refill_stock > b - Would it cache any kind of bigger page, or huge page in this same aspect? The above should answer this as well as those following up I hope. If not let me know.
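For readers following the thread, here is a heavily simplified sketch of that counter behaviour, loosely modelled on consume_stock()/refill_stock() (the real mm/memcontrol.c versions also take the local lock / disable irqs and handle the objcg byte cache; the _sketch names below are illustrative, not upstream functions):

/* Simplified illustration only; not the upstream code. */
static bool consume_stock_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	/* Fast path: pay for the charge out of the per-cpu counter,
	 * no page_counter atomics at all. */
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		return true;
	}
	return false;
}

static void refill_stock_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

	/* The stock only pins the cached memcg and counts pre-accounted
	 * charges; it never holds actual pages. */
	if (stock->cached != memcg) {
		drain_stock(stock);	/* return old charges to the old memcg */
		stock->cached = memcg;
	}
	stock->nr_pages += nr_pages;

	/* Cap the cache so at most ~MEMCG_CHARGE_BATCH charges stay behind. */
	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
		drain_stock(stock);
}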
On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote: > [Cc Frederic] > > On Thu 26-01-23 15:12:35, Roman Gushchin wrote: > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > [...] > > > > Essentially each cpu will try to grab the remains of the memory quota > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > > > on per-cgroup basis. > > > > > > I think it would be more than sufficient to disable pcp charging on an > > > isolated cpu. > > > > It might have significant performance consequences. > > Is it really significant? > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > the accuracy of memory limits and slightly increase the memory footprint (all > > those dying memcgs...), but the impact will be limited. Actually it is limited > > by the number of cpus. > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges > potentially left behind should be small and that shouldn't really be a > concern for memcg oom situations (unless the limit is very small and > workloads on isolated cpus using small hard limits is way beyond my > imagination). > > My first thought was that those charges could be left behind without any > upper bound but in reality sooner or later something should be running > on those cpus and if the memcg is gone the pcp cache would get refilled > and old charges gone. > > So yes, this is actually a better and even simpler solution. All we need > is something like this > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ab457f0394ab..13b84bbd70ba 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > struct mem_cgroup *memcg; > bool flush = false; > > + if (cpu_is_isolated(cpu)) > + continue; > + > rcu_read_lock(); > memcg = stock->cached; > if (memcg && stock->nr_pages && > > There is no such cpu_is_isolated() AFAICS so we would need a help from > NOHZ and cpuisol people to create one for us. Frederic, would such an > abstraction make any sense from your POV? IIUC, 'if (cpu_is_isolated())' would be instead: if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) || !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)
On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote: > On Fri 27-01-23 04:14:19, Leonardo Brás wrote: > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote: > [...] > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > > the accuracy of memory limits and slightly increase the memory footprint (all > > > those dying memcgs...), but the impact will be limited. Actually it is limited > > > by the number of cpus. > > > > I was discussing this same idea with Marcelo yesterday morning. > > > > The questions had in the topic were: > > a - About how many pages the pcp cache will hold before draining them itself? > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache > doesn't really hold any pages. It is a mere counter of how many charges > have been accounted for the memcg page counter. So it is not really > consuming proportional amount of resources. It just pins the > corresponding memcg. Have a look at consume_stock and refill_stock I see. Thanks for pointing that out! So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1) that are not getting used, and may cause an 'earlier' OOM if this amount is needed but can't be freed. In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources). It's starting to get too big, but still ok for a machine this size. The thing is that it can present an odd behavior: You have a cgroup created before, now empty, and try to run given application, and hits OOM. You then restart the cgroup, run the same application without an issue. Even though it looks a good possibility, this can be perceived by user as instability. > > > b - Would it cache any kind of bigger page, or huge page in this same aspect? > > The above should answer this as well as those following up I hope. If > not let me know. IIUC we are talking normal pages, is that it? Best regards, Leo
On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote: > On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote: > > [Cc Frederic] > > > > On Thu 26-01-23 15:12:35, Roman Gushchin wrote: > > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > > [...] > > > > > Essentially each cpu will try to grab the remains of the memory quota > > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > > > > on per-cgroup basis. > > > > > > > > I think it would be more than sufficient to disable pcp charging on an > > > > isolated cpu. > > > > > > It might have significant performance consequences. > > > > Is it really significant? > > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > > the accuracy of memory limits and slightly increase the memory footprint (all > > > those dying memcgs...), but the impact will be limited. Actually it is limited > > > by the number of cpus. > > > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges > > potentially left behind should be small and that shouldn't really be a > > concern for memcg oom situations (unless the limit is very small and > > workloads on isolated cpus using small hard limits is way beyond my > > imagination). > > > > My first thought was that those charges could be left behind without any > > upper bound but in reality sooner or later something should be running > > on those cpus and if the memcg is gone the pcp cache would get refilled > > and old charges gone. > > > > So yes, this is actually a better and even simpler solution. All we need > > is something like this > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index ab457f0394ab..13b84bbd70ba 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > > struct mem_cgroup *memcg; > > bool flush = false; > > > > + if (cpu_is_isolated(cpu)) > > + continue; > > + > > rcu_read_lock(); > > memcg = stock->cached; > > if (memcg && stock->nr_pages && > > > > There is no such cpu_is_isolated() AFAICS so we would need a help from > > NOHZ and cpuisol people to create one for us. Frederic, would such an > > abstraction make any sense from your POV? > > > IIUC, 'if (cpu_is_isolated())' would be instead: > > if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) || > !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ) oh, sorry 's/smp_processor_id()/cpu/' here: if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ)) Not sure why I added smp_processor_id() instead of cpu. I think I need some sleep. :)
On Fri 27-01-23 05:12:13, Leonardo Brás wrote: > On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote: > > On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote: > > > [Cc Frederic] > > > > > > On Thu 26-01-23 15:12:35, Roman Gushchin wrote: > > > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > > > [...] > > > > > > Essentially each cpu will try to grab the remains of the memory quota > > > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > > > > > on per-cgroup basis. > > > > > > > > > > I think it would be more than sufficient to disable pcp charging on an > > > > > isolated cpu. > > > > > > > > It might have significant performance consequences. > > > > > > Is it really significant? > > > > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > > > the accuracy of memory limits and slightly increase the memory footprint (all > > > > those dying memcgs...), but the impact will be limited. Actually it is limited > > > > by the number of cpus. > > > > > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges > > > potentially left behind should be small and that shouldn't really be a > > > concern for memcg oom situations (unless the limit is very small and > > > workloads on isolated cpus using small hard limits is way beyond my > > > imagination). > > > > > > My first thought was that those charges could be left behind without any > > > upper bound but in reality sooner or later something should be running > > > on those cpus and if the memcg is gone the pcp cache would get refilled > > > and old charges gone. > > > > > > So yes, this is actually a better and even simpler solution. All we need > > > is something like this > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index ab457f0394ab..13b84bbd70ba 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > > > struct mem_cgroup *memcg; > > > bool flush = false; > > > > > > + if (cpu_is_isolated(cpu)) > > > + continue; > > > + > > > rcu_read_lock(); > > > memcg = stock->cached; > > > if (memcg && stock->nr_pages && > > > > > > There is no such cpu_is_isolated() AFAICS so we would need a help from > > > NOHZ and cpuisol people to create one for us. Frederic, would such an > > > abstraction make any sense from your POV? > > > > > > IIUC, 'if (cpu_is_isolated())' would be instead: > > > > if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) || > > !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ) > > oh, sorry 's/smp_processor_id()/cpu/' here: > > if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ)) Or maybe we can get a nice abstract API so that we do not have to really care about those low level details. I do not really know what those really mean and hopefully I shouldn't really need to know.
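For illustration, the abstraction being asked for could be as small as a helper along these lines (hypothetical sketch; cpu_is_isolated() does not exist at this point, and the exact set of HK_TYPE_* checks is precisely the detail being deferred to the cpuisol folks):

#include <linux/sched/isolation.h>

/*
 * Hypothetical sketch of the requested abstraction, built on the existing
 * housekeeping_cpu() API and mirroring the check Leonardo spelled out above.
 * Which HK_TYPE_* flags really belong here is an open question.
 */
static inline bool cpu_is_isolated(int cpu)
{
	return !housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
	       !housekeeping_cpu(cpu, HK_TYPE_WQ);
}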
On Fri 27-01-23 04:35:22, Leonardo Brás wrote: > On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote: > > On Fri 27-01-23 04:14:19, Leonardo Brás wrote: > > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote: > > [...] > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > > > the accuracy of memory limits and slightly increase the memory footprint (all > > > > those dying memcgs...), but the impact will be limited. Actually it is limited > > > > by the number of cpus. > > > > > > I was discussing this same idea with Marcelo yesterday morning. > > > > > > The questions had in the topic were: > > > a - About how many pages the pcp cache will hold before draining them itself? > > > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache > > doesn't really hold any pages. It is a mere counter of how many charges > > have been accounted for the memcg page counter. So it is not really > > consuming proportional amount of resources. It just pins the > > corresponding memcg. Have a look at consume_stock and refill_stock > > I see. Thanks for pointing that out! > > So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1) s@numcpus@num_isolated_cpus@ > that are not getting used, and may cause an 'earlier' OOM if this amount is > needed but can't be freed. s@OOM@memcg OOM@ > In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each > holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources). > It's starting to get too big, but still ok for a machine this size. It is more about the memcg limit rather than the size of the machine. Again, let's focus on actual usacase. What is the usual memcg setup with those isolcpus > The thing is that it can present an odd behavior: > You have a cgroup created before, now empty, and try to run given application, > and hits OOM. The application would either consume those cached charges or flush them if it is running in a different memcg. Or what do you have in mind? > You then restart the cgroup, run the same application without an issue. > > Even though it looks a good possibility, this can be perceived by user as > instability. > > > > > > b - Would it cache any kind of bigger page, or huge page in this same aspect? > > > > The above should answer this as well as those following up I hope. If > > not let me know. > > IIUC we are talking normal pages, is that it? We are talking about memcg charges and those have page granularity.
On Fri, Jan 27, 2023 at 05:12:13AM -0300, Leonardo Brás wrote: > On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote: > > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges > > > potentially left behind should be small and that shouldn't really be a > > > concern for memcg oom situations (unless the limit is very small and > > > workloads on isolated cpus using small hard limits is way beyond my > > > imagination). > > > > > > My first thought was that those charges could be left behind without any > > > upper bound but in reality sooner or later something should be running > > > on those cpus and if the memcg is gone the pcp cache would get refilled > > > and old charges gone. > > > > > > So yes, this is actually a better and even simpler solution. All we need > > > is something like this > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index ab457f0394ab..13b84bbd70ba 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > > > struct mem_cgroup *memcg; > > > bool flush = false; > > > > > > + if (cpu_is_isolated(cpu)) > > > + continue; > > > + > > > rcu_read_lock(); > > > memcg = stock->cached; > > > if (memcg && stock->nr_pages && > > > > > > There is no such cpu_is_isolated() AFAICS so we would need a help from > > > NOHZ and cpuisol people to create one for us. Frederic, would such an > > > abstraction make any sense from your POV? > > > > > > IIUC, 'if (cpu_is_isolated())' would be instead: > > > > if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) || > > !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ) > > oh, sorry 's/smp_processor_id()/cpu/' here: > > if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ)) Do you also need to handle cpuset.sched_load_balance=0 (aka. cpuset v2 "isolated" partition type). It has the same effect as isolcpus=, but it can be changed at runtime. And then on_null_domain() look like what you need. You'd have to make that API more generally available though, and rename it to something like "bool cpu_has_null_domain(int cpu)". But then you also need to handle concurrent cpuset changes. If you can tolerate it to be racy, then RCU alone is fine. Thanks.
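To make that suggestion concrete, a rough sketch of what such a helper might look like (hypothetical name per Frederic's proposal; it assumes the code lives in kernel/sched/ where struct rq and cpu_rq() are visible, and it is deliberately racy against concurrent cpuset changes, as discussed):

/*
 * Hypothetical sketch only: a NULL sched domain means the cpu is not part
 * of any load-balancing domain (isolcpus= or an "isolated" cpuset
 * partition).  RCU only guarantees the domain pointer stays valid while we
 * look at it; the answer itself may be stale if cpusets change concurrently.
 */
bool cpu_has_null_domain(int cpu)
{
	bool ret;

	rcu_read_lock();
	ret = rcu_dereference(cpu_rq(cpu)->sd) == NULL;
	rcu_read_unlock();

	return ret;
}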
On Fri 27-01-23 08:11:04, Michal Hocko wrote: > [Cc Frederic] > > On Thu 26-01-23 15:12:35, Roman Gushchin wrote: > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > [...] > > > > Essentially each cpu will try to grab the remains of the memory quota > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > > > on per-cgroup basis. > > > > > > I think it would be more than sufficient to disable pcp charging on an > > > isolated cpu. > > > > It might have significant performance consequences. > > Is it really significant? > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > the accuracy of memory limits and slightly increase the memory footprint (all > > those dying memcgs...), but the impact will be limited. Actually it is limited > > by the number of cpus. > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges > potentially left behind should be small and that shouldn't really be a > concern for memcg oom situations (unless the limit is very small and > workloads on isolated cpus using small hard limits is way beyond my > imagination). > > My first thought was that those charges could be left behind without any > upper bound but in reality sooner or later something should be running > on those cpus and if the memcg is gone the pcp cache would get refilled > and old charges gone. > > So yes, this is actually a better and even simpler solution. All we need > is something like this > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ab457f0394ab..13b84bbd70ba 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > struct mem_cgroup *memcg; > bool flush = false; > > + if (cpu_is_isolated(cpu)) > + continue; > + > rcu_read_lock(); > memcg = stock->cached; > if (memcg && stock->nr_pages && Btw. this would be over pessimistic. The following should make more sense: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ab457f0394ab..55e440e54504 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { if (cpu == curcpu) drain_local_stock(&stock->work); - else + else if (!cpu_is_isolated(cpu)) schedule_work_on(cpu, &stock->work); } }
On Fri, Jan 27, 2023 at 02:58:19PM +0100, Michal Hocko wrote: > On Fri 27-01-23 08:11:04, Michal Hocko wrote: > > [Cc Frederic] > > > > On Thu 26-01-23 15:12:35, Roman Gushchin wrote: > > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > > [...] > > > > > Essentially each cpu will try to grab the remains of the memory quota > > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching > > > > > on per-cgroup basis. > > > > > > > > I think it would be more than sufficient to disable pcp charging on an > > > > isolated cpu. > > > > > > It might have significant performance consequences. > > > > Is it really significant? > > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > > the accuracy of memory limits and slightly increase the memory footprint (all > > > those dying memcgs...), but the impact will be limited. Actually it is limited > > > by the number of cpus. > > > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges > > potentially left behind should be small and that shouldn't really be a > > concern for memcg oom situations (unless the limit is very small and > > workloads on isolated cpus using small hard limits is way beyond my > > imagination). > > > > My first thought was that those charges could be left behind without any > > upper bound but in reality sooner or later something should be running > > on those cpus and if the memcg is gone the pcp cache would get refilled > > and old charges gone. > > > > So yes, this is actually a better and even simpler solution. All we need > > is something like this > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index ab457f0394ab..13b84bbd70ba 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > > struct mem_cgroup *memcg; > > bool flush = false; > > > > + if (cpu_is_isolated(cpu)) > > + continue; > > + > > rcu_read_lock(); > > memcg = stock->cached; > > if (memcg && stock->nr_pages && > > Btw. this would be over pessimistic. The following should make more > sense: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ab457f0394ab..55e440e54504 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) > !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { > if (cpu == curcpu) > drain_local_stock(&stock->work); > - else > + else if (!cpu_is_isolated(cpu)) > schedule_work_on(cpu, &stock->work); > } > } Yes, this is exactly what I was thinking of. It should solve the problem for isolated cpus well enough without introducing an overhead for everybody else. If you'll make a proper patch, please add my Acked-by: Roman Gushchin <roman.gushchin@linux.dev> I understand the concerns regarding spurious OOMs on 256-cores machine, but I guess they are somewhat theoretical and also possible with the current code (e.g. one ooming cgroup can effectively block draining for everybody else). Thanks!
On Fri, 2023-01-27 at 10:29 +0100, Michal Hocko wrote: > On Fri 27-01-23 04:35:22, Leonardo Brás wrote: > > On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote: > > > On Fri 27-01-23 04:14:19, Leonardo Brás wrote: > > > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote: > > > [...] > > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > > > > the accuracy of memory limits and slightly increase the memory footprint (all > > > > > those dying memcgs...), but the impact will be limited. Actually it is limited > > > > > by the number of cpus. > > > > > > > > I was discussing this same idea with Marcelo yesterday morning. > > > > > > > > The questions had in the topic were: > > > > a - About how many pages the pcp cache will hold before draining them itself? > > > > > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache > > > doesn't really hold any pages. It is a mere counter of how many charges > > > have been accounted for the memcg page counter. So it is not really > > > consuming proportional amount of resources. It just pins the > > > corresponding memcg. Have a look at consume_stock and refill_stock > > > > I see. Thanks for pointing that out! > > > > So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1) > > s@numcpus@num_isolated_cpus@ I was thinking worst case scenario being (ncpus - 1) being isolated. > > > that are not getting used, and may cause an 'earlier' OOM if this amount is > > needed but can't be freed. > > s@OOM@memcg OOM@ > > In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each > > holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources). > > It's starting to get too big, but still ok for a machine this size. > > It is more about the memcg limit rather than the size of the machine. > Again, let's focus on actual usacase. What is the usual memcg setup with > those isolcpus I understand it's about the limit, not actually allocated memory. When I point the machine size, I mean what is expected to be acceptable from a user in that machine. > > > The thing is that it can present an odd behavior: > > You have a cgroup created before, now empty, and try to run given application, > > and hits OOM. > > The application would either consume those cached charges or flush them > if it is running in a different memcg. Or what do you have in mind? 1 - Create a memcg with a VM inside, multiple vcpus pinned to isolated cpus. 2 - Run multi-cpu task inside the VM, it allocates memory for every CPU and keep the pcp cache 3 - Try to run a single-cpu task (pinned?) inside the VM, which uses almost all the available memory. 4 - memcg OOM. Does it make sense? > > > You then restart the cgroup, run the same application without an issue. > > > > Even though it looks a good possibility, this can be perceived by user as > > instability. > > > > > > > > > b - Would it cache any kind of bigger page, or huge page in this same aspect? > > > > > > The above should answer this as well as those following up I hope. If > > > not let me know. > > > > IIUC we are talking normal pages, is that it? > > We are talking about memcg charges and those have page granularity. > Thanks for the info! Also, thanks for the feedback! Leo
On Fri, Jan 27, 2023 at 04:29:37PM -0300, Leonardo Brás wrote: > On Fri, 2023-01-27 at 10:29 +0100, Michal Hocko wrote: > > On Fri 27-01-23 04:35:22, Leonardo Brás wrote: > > > On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote: > > > > On Fri 27-01-23 04:14:19, Leonardo Brás wrote: > > > > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote: > > > > [...] > > > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce > > > > > > the accuracy of memory limits and slightly increase the memory footprint (all > > > > > > those dying memcgs...), but the impact will be limited. Actually it is limited > > > > > > by the number of cpus. > > > > > > > > > > I was discussing this same idea with Marcelo yesterday morning. > > > > > > > > > > The questions had in the topic were: > > > > > a - About how many pages the pcp cache will hold before draining them itself? > > > > > > > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache > > > > doesn't really hold any pages. It is a mere counter of how many charges > > > > have been accounted for the memcg page counter. So it is not really > > > > consuming proportional amount of resources. It just pins the > > > > corresponding memcg. Have a look at consume_stock and refill_stock > > > > > > I see. Thanks for pointing that out! > > > > > > So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1) > > > > s@numcpus@num_isolated_cpus@ > > I was thinking worst case scenario being (ncpus - 1) being isolated. > > > > > > that are not getting used, and may cause an 'earlier' OOM if this amount is > > > needed but can't be freed. > > > > s@OOM@memcg OOM@ > > > > In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each > > > holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources). > > > It's starting to get too big, but still ok for a machine this size. > > > > It is more about the memcg limit rather than the size of the machine. > > Again, let's focus on actual usacase. What is the usual memcg setup with > > those isolcpus > > I understand it's about the limit, not actually allocated memory. When I point > the machine size, I mean what is expected to be acceptable from a user in that > machine. > > > > > > The thing is that it can present an odd behavior: > > > You have a cgroup created before, now empty, and try to run given application, > > > and hits OOM. > > > > The application would either consume those cached charges or flush them > > if it is running in a different memcg. Or what do you have in mind? > > 1 - Create a memcg with a VM inside, multiple vcpus pinned to isolated cpus. > 2 - Run multi-cpu task inside the VM, it allocates memory for every CPU and keep > the pcp cache > 3 - Try to run a single-cpu task (pinned?) inside the VM, which uses almost all > the available memory. > 4 - memcg OOM. > > Does it make sense? It can happen now as well, you just need a competing drain request. Honestly, I feel the probability of this scenario to be a real problem is fairly low. I don't recall any complains on spurious OOMs because of races in the draining code. Usually machines which are tight on memory are rarely have so many idle cpus. Thanks!
On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote: > On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote: > > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote: > > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote: > > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote: > > > > [...] > > > > > Remote draining reduces interruptions whether CPU > > > > > is marked as isolated or not: > > > > > > > > > > - Allows isolated CPUs from benefiting of pcp caching. > > > > > - Removes the interruption to non isolated CPUs. See for example > > > > > > > > > > https://lkml.org/lkml/2022/6/13/2769 > > > > > > > > This is talking about page allocato per cpu caches, right? In this patch > > > > we are talking about memcg pcp caches. Are you sure the same applies > > > > here? > > > > > > Both can stall the users of the drain operation. > > > > Yes. But it is important to consider who those users are. We are > > draining when > > - we are charging and the limit is hit so that memory reclaim > > has to be triggered. > > - hard, high limits are set and require memory reclaim. > > - force_empty - full memory reclaim for a memcg > > - memcg offlining - cgroup removel - quite a heavy operation as > > well. > > all those could be really costly kernel operations and they affect > > isolated cpu only if the same memcg is used by both isolated and non-isolated > > cpus. In other words those costly operations would have to be triggered > > from non-isolated cpus and those are to be expected to be stalled. It is > > the side effect of the local cpu draining that is scheduled that affects > > the isolated cpu as well. > > > > Is that more clear? > > I think so, please help me check: > > IIUC, we can approach this by dividing the problem in two working modes: > 1 - Normal, meaning no drain_all_stock() running. > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing, > destroying, reconfiguring a memcg. > > For (1), we will have (ideally) only local cpu working on the percpu struct. > This mode will not have any kind of contention, because each CPU will hold it's > own spinlock only. > > For (2), we will have a lot of drain_all_stock() running. This will mean a lot > of schedule_work_on() running (on upstream) or possibly causing contention, i.e. > local cpus having to wait for a lock to get their cache, on the patch proposal. > > Ok, given the above is correct: > > # Some arguments point that (1) becomes slower with this patch. > > This is partially true: while test 2.2 pointed that local cpu functions running > time had became slower by a few cycles, test 2.4 points that the userspace > perception of it was that the syscalls and pagefaulting actually became faster: > > During some debugging tests before getting the performance on test 2.4, I > noticed that the 'syscall + write' test would call all those functions that > became slower on test 2.2. Those functions were called multiple millions of > times during a single test, and still the patched version performance test > returned faster for test 2.4 than upstream version. Maybe the functions became > slower, but overall the usage of them in the usual context became faster. > > Is not that a small improvement? > > # Regarding (2), I notice that we fear contention > > While this seems to be the harder part of the discussion, I think we have enough > data to deal with it. > > In which case contention would be a big problem here? 
> IIUC it would be when a lot of drain_all_stock() get running because the memory > limit is getting near. I mean, having the user to create / modify a memcg > multiple times a second for a while is not something that is expected, IMHO. Considering that the use of spinlocks with remote draining is the more general solution, what would be a test-case to demonstrate a contention problem? > Now, if I assumed correctly and the case where contention could be a problem is > on a memcg with high memory pressure, then we have the argument that Marcelo > Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page > allocation brought better results than local_locks + schedule_work_on(). > > I mean, while contention would cause the cpu to wait for a while before getting > the lock for allocating a page from cache, something similar would happen with > schedule_work_on(), which would force the current task to wait while the > draining happens locally. > > What I am able to see is that, for each drain_all_stock(), for each cpu getting > drained we have the option to (a) (sometimes) wait for a lock to be freed, or > (b) wait for a whole context switch to happen. > And IIUC, (b) is much slower than (a) on average, and this is what causes the > improved performance seen in [P1]. Moreover, there is a delay for the remote CPU to execute a work item (there could be high priority tasks, IRQ handling on the remote CPU, which delays execution of the work item even further). Also, the other point is that the spinlock already exists for PREEMPT_RT (which means that any potential contention issue with the spinlock today is limited to PREEMPT_RT users). So it would be good to point out a specific problematic testcase/scenario with using the spinlock in this particular case? > (I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for > it to run on the remote CPU may not be that different, since the cacheline is > already writen to by the remote cpu on Upstream) > > Also according to test 2.2, for the patched version, drain_local_stock() have > gotten faster (much faster for 128 cpus), even though it does all the draining > instead of just scheduling it on the other cpus. > I mean, summing that to the brief nature of local cpu functions, we may not hit > contention as much as we are expected. > > ## > > Sorry for the long text. > I may be missing some point, please let me know if that's the case. > > Thanks a lot for reviewing! > Leo > > [P1]: https://lkml.org/lkml/2022/6/13/2769 > >
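For context on that PREEMPT_RT point, a condensed sketch of why it holds (simplified from the local_lock machinery in include/linux/local_lock_internal.h; lockdep bookkeeping and migrate_disable() details elided):

/*
 * Condensed illustration, not the verbatim kernel code: on PREEMPT_RT the
 * local_lock_t that guards memcg_stock is backed by a real spinlock_t, so
 * the "local lock" fast path already takes a per-cpu lock there today;
 * only !PREEMPT_RT gets away with merely disabling preemption/irqs.
 */
#ifdef CONFIG_PREEMPT_RT
typedef spinlock_t local_lock_t;

#define local_lock(lock)	spin_lock(this_cpu_ptr(lock))
#define local_unlock(lock)	spin_unlock(this_cpu_ptr(lock))
#else
typedef struct { } local_lock_t;

#define local_lock(lock)	preempt_disable()
#define local_unlock(lock)	preempt_enable()
#endif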
On Tue, 2023-01-31 at 08:35 -0300, Marcelo Tosatti wrote: > On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote: > > On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote: > > > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote: > > > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote: > > > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote: > > > > > [...] > > > > > > Remote draining reduces interruptions whether CPU > > > > > > is marked as isolated or not: > > > > > > > > > > > > - Allows isolated CPUs from benefiting of pcp caching. > > > > > > - Removes the interruption to non isolated CPUs. See for example > > > > > > > > > > > > https://lkml.org/lkml/2022/6/13/2769 > > > > > > > > > > This is talking about page allocato per cpu caches, right? In this patch > > > > > we are talking about memcg pcp caches. Are you sure the same applies > > > > > here? > > > > > > > > Both can stall the users of the drain operation. > > > > > > Yes. But it is important to consider who those users are. We are > > > draining when > > > - we are charging and the limit is hit so that memory reclaim > > > has to be triggered. > > > - hard, high limits are set and require memory reclaim. > > > - force_empty - full memory reclaim for a memcg > > > - memcg offlining - cgroup removel - quite a heavy operation as > > > well. > > > all those could be really costly kernel operations and they affect > > > isolated cpu only if the same memcg is used by both isolated and non-isolated > > > cpus. In other words those costly operations would have to be triggered > > > from non-isolated cpus and those are to be expected to be stalled. It is > > > the side effect of the local cpu draining that is scheduled that affects > > > the isolated cpu as well. > > > > > > Is that more clear? > > > > I think so, please help me check: Michal, Roman: Could you please review my argumentation below, so I can understand what exactly is wrong ? > > > > IIUC, we can approach this by dividing the problem in two working modes: > > 1 - Normal, meaning no drain_all_stock() running. > > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing, > > destroying, reconfiguring a memcg. > > > > For (1), we will have (ideally) only local cpu working on the percpu struct. > > This mode will not have any kind of contention, because each CPU will hold it's > > own spinlock only. > > > > For (2), we will have a lot of drain_all_stock() running. This will mean a lot > > of schedule_work_on() running (on upstream) or possibly causing contention, i.e. > > local cpus having to wait for a lock to get their cache, on the patch proposal. > > > > Ok, given the above is correct: > > > > # Some arguments point that (1) becomes slower with this patch. > > > > This is partially true: while test 2.2 pointed that local cpu functions running > > time had became slower by a few cycles, test 2.4 points that the userspace > > perception of it was that the syscalls and pagefaulting actually became faster: > > > > During some debugging tests before getting the performance on test 2.4, I > > noticed that the 'syscall + write' test would call all those functions that > > became slower on test 2.2. Those functions were called multiple millions of > > times during a single test, and still the patched version performance test > > returned faster for test 2.4 than upstream version. Maybe the functions became > > slower, but overall the usage of them in the usual context became faster. > > > > Is not that a small improvement? 
> > > > # Regarding (2), I notice that we fear contention > > > > While this seems to be the harder part of the discussion, I think we have enough > > data to deal with it. > > > > In which case contention would be a big problem here? > > IIUC it would be when a lot of drain_all_stock() get running because the memory > > limit is getting near. I mean, having the user to create / modify a memcg > > multiple times a second for a while is not something that is expected, IMHO. > > Considering that the use of spinlocks with remote draining is the more general solution, > what would be a test-case to demonstrate a contention problem? IIUC we could try to reproduce a memory tight workload that keeps allocating / freeing from different cpus (without hitting OOM). Michal, Roman: Is that correct? You have any workload like that so we can test? > > > Now, if I assumed correctly and the case where contention could be a problem is > > on a memcg with high memory pressure, then we have the argument that Marcelo > > Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page > > allocation brought better results than local_locks + schedule_work_on(). > > > > I mean, while contention would cause the cpu to wait for a while before getting > > the lock for allocating a page from cache, something similar would happen with > > schedule_work_on(), which would force the current task to wait while the > > draining happens locally. > > > > What I am able to see is that, for each drain_all_stock(), for each cpu getting > > drained we have the option to (a) (sometimes) wait for a lock to be freed, or > > (b) wait for a whole context switch to happen. > > And IIUC, (b) is much slower than (a) on average, and this is what causes the > > improved performance seen in [P1]. > > Moreover, there is a delay for the remote CPU to execute a work item > (there could be high priority tasks, IRQ handling on the remote CPU, > which delays execution of the work item even further). > > Also, the other point is that the spinlock already exists for > PREEMPT_RT (which means that any potential contention issue > with the spinlock today is limited to PREEMPT_RT users). > > So it would be good to point out a specific problematic > testcase/scenario with using the spinlock in this particular case? Oh, that's a good point, but IIUC spin_lock() replaces local_lock() in memcg, meaning it will always run in the same CPU, and there should only be any contention if the memcg local cpu functions get preempted by something that calls a memcg local cpu function. > > > (I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for > > it to run on the remote CPU may not be that different, since the cacheline is > > already writen to by the remote cpu on Upstream) > > > > Also according to test 2.2, for the patched version, drain_local_stock() have > > gotten faster (much faster for 128 cpus), even though it does all the draining > > instead of just scheduling it on the other cpus. > > I mean, summing that to the brief nature of local cpu functions, we may not hit > > contention as much as we are expected. > > > > ## > > > > Sorry for the long text. > > I may be missing some point, please let me know if that's the case. > > > > Thanks a lot for reviewing! > > Leo > > > > [P1]: https://lkml.org/lkml/2022/6/13/2769 > > > > > Thanks! Leo
On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
[...]
> So it would be good to point out a specific problematic
> testcase/scenario with using the spinlock in this particular case?

Please think about it some more. The sole purpose of the pcp charge caching is to avoid atomics because the normal fast path (i.e. no limit hit) is a page counter which is an atomic counter. If you go with a spin lock for the pcp cache you are just losing some of the advantage of the caching. Sure that would be a smaller atomics use than a deeper memcg hierarchy but still.

Not to mention a potential contention which is hard to predict and will depend very much on the specific runtime. So it is not something that would be easy to test for other than with artificial testcases. Full cpu pcp draining is not a very common operation and one could argue that the problem will be limited, but so far I haven't really heard any strong arguments against the proposal to avoid scheduling the work on isolated cpus, which is a much simpler solution and achieves the same effect.
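For readers following the thread, the fast path being discussed has roughly the shape below. This is a simplified sketch of consume_stock() as it looks upstream in mm/memcontrol.c, with statistics and corner cases trimmed, not a verbatim copy; under local_lock the only cost is disabling preemption/IRQs locally, and no atomic read-modify-write is needed because only the local cpu ever touches its own stock:

/*
 * Simplified sketch of the upstream pcp charge fast path.
 * Error handling and statistics are omitted.
 */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	unsigned long flags;
	bool ret = false;

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return ret;

	/* local_lock: preempt/irq protection only, no atomic instruction */
	local_lock_irqsave(&memcg_stock.stock_lock, flags);

	stock = this_cpu_ptr(&memcg_stock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}

	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
	return ret;
}

The spinlock proposal would replace the local_lock_irqsave()/local_unlock_irqrestore() pair with spin_lock_irqsave(&stock->stock_lock, flags) and its unlock counterpart, which turns every charge and uncharge into at least one atomic read-modify-write even when the lock is uncontended; that is the cost being pointed at here.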
On Wed 01-02-23 01:36:07, Leonardo Brás wrote: [...] > Michal, Roman: Could you please review my argumentation below, so I can > understand what exactly is wrong ? > > > > > > > IIUC, we can approach this by dividing the problem in two working modes: > > > 1 - Normal, meaning no drain_all_stock() running. > > > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing, > > > destroying, reconfiguring a memcg. > > > > > > For (1), we will have (ideally) only local cpu working on the percpu struct. > > > This mode will not have any kind of contention, because each CPU will hold it's > > > own spinlock only. You are still hitting locked atomic operations which is not the case for a simple pcp access. As already mentioned page counter (which would be taken in case of no pcp cache) is also about atomics (elbeit more of them with a deeper cgroup hierarchy). > > > For (2), we will have a lot of drain_all_stock() running. This will mean a lot > > > of schedule_work_on() running (on upstream) or possibly causing contention, i.e. > > > local cpus having to wait for a lock to get their cache, on the patch proposal. Yes. > > > Ok, given the above is correct: > > > > > > # Some arguments point that (1) becomes slower with this patch. > > > > > > This is partially true: while test 2.2 pointed that local cpu functions running > > > time had became slower by a few cycles, test 2.4 points that the userspace > > > perception of it was that the syscalls and pagefaulting actually became faster: > > > > > > During some debugging tests before getting the performance on test 2.4, I > > > noticed that the 'syscall + write' test would call all those functions that > > > became slower on test 2.2. Those functions were called multiple millions of > > > times during a single test, and still the patched version performance test > > > returned faster for test 2.4 than upstream version. Maybe the functions became > > > slower, but overall the usage of them in the usual context became faster. It is hard to tell anything without proper analysis of those numbers. > > > Is not that a small improvement? > > > > > > # Regarding (2), I notice that we fear contention > > > > > > While this seems to be the harder part of the discussion, I think we have enough > > > data to deal with it. > > > > > > In which case contention would be a big problem here? > > > IIUC it would be when a lot of drain_all_stock() get running because the memory > > > limit is getting near. I mean, having the user to create / modify a memcg > > > multiple times a second for a while is not something that is expected, IMHO. We have already explained when that happens. You are right this is not happening all that much in most workloads. But this is a user triggered operation so you cannot really many assumptions. It will really depend on the workload. > > Considering that the use of spinlocks with remote draining is the more general solution, > > what would be a test-case to demonstrate a contention problem? > > IIUC we could try to reproduce a memory tight workload that keeps allocating / > freeing from different cpus (without hitting OOM). > > Michal, Roman: Is that correct? You have any workload like that so we can test? It would be easy to create artificial workload that would hit this. All you need is to trigger any of the code paths that have already been mentioned elsewhere from the userspace. Let me be clear here. 
Unless there are some real concerns about not flushing remotely on isolated cpus, the spin lock approach is just much less preferable. So I would much rather focus on the trivial patch and investigate whether it creates any new corner cases.
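For context on the atomics mentioned above: when a charge misses the pcp cache it falls through to the page counter, which costs one atomic read-modify-write per level of the memcg hierarchy. Below is a condensed sketch of page_counter_try_charge() from mm/page_counter.c, with the protection and watermark bookkeeping omitted, so it is illustrative rather than verbatim:

/*
 * Condensed sketch: one atomic add per ancestor, each typically on a
 * cacheline shared by all cpus charging into that memcg.
 */
bool page_counter_try_charge(struct page_counter *counter,
			     unsigned long nr_pages,
			     struct page_counter **fail)
{
	struct page_counter *c;

	for (c = counter; c; c = c->parent) {
		long new;

		new = atomic_long_add_return(nr_pages, &c->usage);
		if (new > c->max) {
			/* back out this level and report the failing one */
			atomic_long_sub(nr_pages, &c->usage);
			*fail = c;
			goto failed;
		}
	}
	return true;

failed:
	/* unwind the ancestors that were already charged */
	for (c = counter; c != *fail; c = c->parent)
		atomic_long_sub(nr_pages, &c->usage);
	return false;
}

This is what the MEMCG_CHARGE_BATCH pcp cache amortizes: with the cache, one such hierarchy walk covers a whole batch of charges instead of being paid on every charge.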
On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote: > On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote: > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote: > > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote: > > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote: > > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote: > > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote: > > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote: > > > > > > > > Disclaimer: > > > > > > > > a - The cover letter got bigger than expected, so I had to split it in > > > > > > > > sections to better organize myself. I am not very confortable with it. > > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags > > > > > > > > from memcg_stock_pcp), which could further improve performance for > > > > > > > > drain_all_stock(), but I could only notice the optimization at the > > > > > > > > last minute. > > > > > > > > > > > > > > > > > > > > > > > > 0 - Motivation: > > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a > > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a > > > > > > > > descendant of a given root_memcg. > > > > > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look > > > > into why we have many of them and whether we really need them? > > > > > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg > > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be > > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running > > > > on multiple cpus). > > > > > > I believe I've never got a specific answer to that. We > > > have discussed that in the previous version submission > > > (20221102020243.522358-1-leobras@redhat.com and specifically > > > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and > > > isolcpus. I was wondering about using memcgs in RT workloads because > > > that just sounds weird but let's say this is the case indeed. > > > > This could be the case. You can consider an "edge device" where it is > > necessary to run a RT workload. It might also be useful to run > > non realtime applications on the same system. > > > > > Then an RT task or whatever task that is running on an isolated > > > cpu can have pcp charges. > > > > Usually the RT task (or more specifically the realtime sensitive loop > > of the application) runs entirely on userspace. But i suppose there > > could be charges on application startup. > > What is the role of memcg then? If the memory limit is in place and the > workload doesn't fit in then it will get reclaimed during start up and > memory would need to be refaulted if not mlocked. If it is mlocked then > the limit cannot be enforced and the start up would likely fail as a > result of the memcg oom killer. > > [...] > > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower > > > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid > > > > doing this without a really strong reason. > > > > > > Are you OK with a simple opt out on isolated CPUs? That would make > > > charges slightly slower (atomic on the hierarchy counters vs. a single > > > pcp adjustment) but it would guarantee that the isolated workload is > > > predictable which is the primary objective AFAICS. 
> > > > This would make isolated CPUs "second class citizens": it would be nice > > to be able to execute non realtime apps on isolated CPUs as well > > (think of different periods of time during a day, one where > > more realtime apps are required, another where less > > realtime apps are required). > > An alternative requires to make the current implementation that is > lockless to use locks and introduce potential lock contention. This > could be harmful to regular workloads. Not using pcp caching would make > a fast path using few atomics rather than local pcp update. That is not > a terrible cost to pay for special cased workloads which use isolcpus. > Really we are not talking about a massive cost to be payed. At least > nobody has shown that in any numbers. Can't agree more. I also agree that the whole pcpu stock draining code can be enhanced, but I believe we should go into the direction almost directly opposite to what's being proposed here. Can we please return to the original problem which the patchset aims to solve? Is it the latency introduced by execution of draining works on isolated cpus? Maybe schedule these works with a delay and cancel them if the draining occurred naturally during the delay? Thanks!
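Roman's "schedule with a delay and cancel" idea is not spelled out in code anywhere in the thread. Purely as an illustration, and assuming memcg_stock_pcp's work member were converted from work_struct to delayed_work, it could look something like the sketch below; the 10ms delay and the helper names are arbitrary:

#include <linux/workqueue.h>

/* Illustrative delay only; a real value would need measurement. */
#define MEMCG_DRAIN_DELAY	msecs_to_jiffies(10)

/* Would be called by drain_all_stock() instead of schedule_work_on(). */
static void schedule_deferred_drain(int cpu, struct memcg_stock_pcp *stock)
{
	schedule_delayed_work_on(cpu, &stock->work, MEMCG_DRAIN_DELAY);
}

/* Would be called on the local cpu when its stock empties naturally. */
static void stock_emptied(struct memcg_stock_pcp *stock)
{
	/* nothing left to flush, so the deferred drain is pointless */
	cancel_delayed_work(&stock->work);
}

Whether this would actually help isolated cpus is unclear, since a deferred work item still eventually interrupts them, which is presumably part of why the discussion below converges on skipping isolated cpus instead.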
OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I have talked to Frederic off-list and he is working on an implementation. I will be offline next whole week (will be back Feb 13th) so I am sending this early but this patch cannot be merged without his one of course.

I have tried to summarize the reasoning behind both approach should we ever need to revisit this approach. For now I strongly believe a simpler solution should be preferred.

Roman, I have added your ack as you indicated but if you disagree with the reasoning please let me know.
---
From 6d99c4d7a238809ff2eb7c362b45d002ca212c70 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 3 Feb 2023 15:54:38 +0100
Subject: [PATCH] memcg: do not drain charge pcp caches on remote isolated cpus

Leonardo Bras has noticed that pcp charge cache draining might be disruptive on workloads relying on 'isolated cpus', a feature commonly used by workloads that are sensitive to interruption and context switching, such as vRAN and Industrial Control Systems.

There are essentially two ways to approach the issue. We can either allow the pcp cache to be drained on a cpu other than the local one, or avoid remote flushing on isolated cpus.

The current pcp charge cache is heavily optimized for performance and relies on being accessed only from its own cpu. That means it only requires a local_lock (preempt_disable on !RT) and draining is handed over to the pcp WQ so that it still runs locally. The former solution (remote draining) would require additional locking to prevent local charges from racing with the draining. This adds an atomic operation to an otherwise simple arithmetic fast path in try_charge. Another concern is that remote draining could cause lock contention for the isolated workloads and therefore interfere with them indirectly via user space interfaces.

Another option is to avoid scheduling the draining on isolated cpus altogether. That means those remote cpus would keep their charges even after drain_all_stock returns. This is certainly not optimal either, but it shouldn't really cause any major problems. In the worst case (many isolated cpus with charges, each of them holding up to MEMCG_CHARGE_BATCH, i.e. 64 pages) the memory consumption of a memcg would be artificially higher than what can be immediately used from other cpus. Theoretically a memcg OOM killer could be triggered prematurely, but currently it is not clear whether this is a practical problem. A tight memcg limit would be counterproductive for cpu-isolated workloads pretty much by definition, because any memory reclaim induced by the memcg limit could break user space timing expectations, as those workloads usually expect to execute in userspace most of the time. Also, charges could be left behind on memcg removal, but any future charge on those isolated cpus will drain that pcp cache, so this won't be a permanent leak.

Considering the pros and cons of both approaches, this patch implements the second option and simply does not schedule remote draining if the target cpu is isolated. This solution is much simpler. It doesn't add any new locking and it is more predictable from the user space POV. Should the premature memcg OOM become a real-life problem, we can revisit this decision.
Reported-by: Leonardo Bras <leobras@redhat.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..55e440e54504 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
 				drain_local_stock(&stock->work);
-			else
+			else if (!cpu_is_isolated(cpu))
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
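As noted in the cover text, cpu_is_isolated() does not exist at this point and is expected from Frederic's housekeeping work. One plausible shape for it, assuming it is built on the existing housekeeping_test_cpu() helper (which housekeeping types it should check is an assumption here, not something stated in the thread):

#include <linux/sched/isolation.h>

/*
 * Assumed shape only -- the real helper belongs to a separate series.
 * Treat a cpu as isolated if it is excluded from either the scheduler
 * domain or the tick housekeeping masks (isolcpus= / nohz_full=).
 */
static inline bool cpu_is_isolated(int cpu)
{
	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
}

With such a helper, the hunk above simply never queues drain work on those cpus, leaving their per-cpu stock to be flushed by their own future charges.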
On Fri, Feb 03, 2023 at 04:21:09PM +0100, Michal Hocko wrote: > OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I > have talked to Frederic off-list and he is working on an implementation. > I will be offline next whole week (will be back Feb 13th) so I am > sending this early but this patch cannot be merged without his one of > course. > > I have tried to summarize the reasoning behind both approach should we > ever need to revisit this approach. For now I strongly believe a simpler > solution should be preferred. > > Roman, I have added your ack as you indicated but if you disagree with > the reasoning please let me know. Great, thank you for preparing it up! Really nice summary. My ack definitely applies. If you want, feel free to add a "Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>" tag to blame me later :) Thank you!
Michal, Roman,

I understand you have far more experience in this codebase than myself, so please help me understand what I am missing in my argument for the spinlock approach. I honestly want to improve, and your help is really appreciated.

On Wed, 2023-02-01 at 13:41 +0100, Michal Hocko wrote:
> On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
> [...]
> > So it would be good to point out a specific problematic
> > testcase/scenario with using the spinlock in this particular case?
>
> Please think about it some more. The sole purpose of the pcp charge
> caching is to avoid atomics because the normal fast path (i.e. no limit
> hit) is a page counter which is an atomic counter. If you go with a spin
> lock for the pcp cache you are just losing some of the advantage of the
> caching. Sure that would be a smaller atomics use than a deeper memcg
> hierarchy but still.

I read the code that calls consume_stock() and noticed that you are probably referring to the atomic operations on memcg->memory->usage (and others) used in page_counter_try_charge(). It is a single atomic variable per memcg that is potentially accessed by every cpu. Is that correct?

I understand the cost of an atomic operation such as this is high because of the inherent cost of bouncing the variable's cacheline between those cpus.

The difference between the 'usage' atomic variable and the spinlock we are proposing is the scope of the variable: the spinlock is defined in a per-cpu structure, and most of the accesses will come from the local cpu.

According to the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3A, page 9-5:

###
9.1.4 Effects of a LOCK Operation on Internal Processor Caches
[...]
For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
###

So locking a percpu spinlock which is cached in the current cpu (which happens most of the time) is as cheap as modifying the local cacheline. Since the cachelines are already modified by the local cpu functions, the write to memory is batched.

For reference, this is the pahole output for memcg_stock_pcp after my patch. The spinlock fits in the same cacheline as the changed variables.

struct memcg_stock_pcp {
	spinlock_t                 stock_lock;              /*     0     4 */
	unsigned int               nr_pages;                /*     4     4 */
	struct mem_cgroup *        cached;                  /*     8     8 */
	struct obj_cgroup *        cached_objcg;            /*    16     8 */
	struct pglist_data *       cached_pgdat;            /*    24     8 */
	unsigned int               nr_bytes;                /*    32     4 */
	int                        nr_slab_reclaimable_b;   /*    36     4 */
	int                        nr_slab_unreclaimable_b; /*    40     4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};

The only accesses that will come from a remote cpu, and thus cause the cacheline to bounce and the lock to be more expensive, are the ones from drain_all_stock(). OTOH, on upstream, those accesses already modify the remote cacheline with an atomic variable (memcg_stock_pcp.flags), so we are only dealing with potential contention costs.
> > Not to mention a potential contention which is hard to predict and it > will depend on the specific runtime very much. So not something that > would be easy to test for other than artificial testcases. Full cpu > pcp draining is not a very common operation and one could argue that > the problem will be limited > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of contention, for a "sometimes we hit spinlock contention". For the spinlock proposal, on the local cpu side, the *worst case* contention is: 1 - wait the spin_unlock() for a complete <percpu cache drain process>, 2 - wait a cache hit for local per-cpu cacheline What is current implemented (schedule_work_on() approach), for the local cpu side there is *always* this contention: 1 - wait for a context switch, 2 - wait a cache hit from it's local per-cpu cacheline, 3 - wait a complete <percpu cache drain process>, 4 - then for a new context switch to the current thread. So moving from schedule_work_on() to spinlocks will save 2 context switches per cpu every time drain_all_stock() is called. On the remote cpu side, my tests point that doing the remote draining is faster than scheduling a local draining, so it's also a gain. Also, IIUC the possible contention in the spinlock approach happens only on page-faulting and syscalls, versus the schedule_work_on() approach that can interrupt user workload at any time. In fact, not interrupting the user workload in isolated cpus is just a bonus of using spinlocks. > but so far I haven't really heard any strong > arguments against the proposal to avoid scheduling the work on isolated > cpus which is a much more simpler solution and achieves the same > effect. I understand the 'not scheduling work on isolated cpus' appeal: it's a much simpler change, and will only affect cases with isolated cpus, so it is safer than changing the whole locking structure. I am not against it. I already have a patch implementing it for testing, and I could gladly send it if you see fit. But if nothing else, it introduces another specific case, and now it have to deal differently with local cpu, remote cpus, and isolated cpus. With spinlocks, there is a much simpler solution: - All cpus are dealt with the same way, which is faster in the local cpu (as in upstream implementation). - We don't have to avoid draining on isolated cpus. - We get rid of the "work_struct", and scheduling costs - We get rid of "flag", as we don't need to take care of multi-scheduling local draining. - When drain_all_stock() returns, all stock have already been drained, so retrying consume_stock() may have immediate results on pre-OOM cases. On Wed, 2023-02-01 at 13:52 +0100, Michal Hocko wrote: [...] > Let me be clear here. Unless there are some real concerns about not > flushing remotely on isolated cpus then the spin lock approach is just > much less preferable. So I would much rather focus on the trivial patch > and investigates whether there are no new corner cases to be created by > that. I understand 'not scheduling work on isolated cpus' approach should work with low impact on current behavior, and I also understand that as Maintainers you have to consider many more factors than a regular developer like me, and also that the spinlock approach is a big change on how pcp memcg caches work. On that topic, if there is any comparison test you find important running, or any topic you think could be better discussed, please let me know. Thank you for reviewing, Leo
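For readers who have not seen Leonardo's series, the remote-draining alternative argued for above has roughly the shape below. This is a sketch of the idea only, not his actual patch; stock_should_drain() is a placeholder for the existing "is the cached memcg a descendant of root_memcg" check, and the objcg/slab side of the cache is ignored:

/*
 * Sketch of spinlock-based remote draining: instead of queueing
 * drain_local_stock() on every target cpu, the caller takes the
 * per-cpu stock_lock and drains the remote cache directly.
 */
static void drain_all_stock(struct mem_cgroup *root_memcg)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		unsigned long flags;

		/* placeholder for the descendant-of-root_memcg test */
		if (!stock_should_drain(stock, root_memcg))
			continue;

		spin_lock_irqsave(&stock->stock_lock, flags);
		drain_stock(stock);	/* returns nr_pages to the page counters */
		spin_unlock_irqrestore(&stock->stock_lock, flags);
	}
}

The trade-off debated in this subthread is visible here: the caller pays for possibly contended remote locks and cacheline bounces, while the target cpus pay nothing unless they happen to charge at that very moment, instead of always paying for a scheduled work item.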
Hi Leonardo! > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of > contention, for a "sometimes we hit spinlock contention". > > For the spinlock proposal, on the local cpu side, the *worst case* contention > is: > 1 - wait the spin_unlock() for a complete <percpu cache drain process>, > 2 - wait a cache hit for local per-cpu cacheline > > What is current implemented (schedule_work_on() approach), for the local > cpu side there is *always* this contention: > 1 - wait for a context switch, > 2 - wait a cache hit from it's local per-cpu cacheline, > 3 - wait a complete <percpu cache drain process>, > 4 - then for a new context switch to the current thread. I think both Michal and me are thinking of a more generic case in which the cpu is not exclusively consumed by 1 special process, so that the draining work can be executed during an idle time. In this case the work is basically free. And the introduction of a spin_lock() on the hot path is what we're are concerned about. I agree, that on some hardware platforms it won't be that expensive, but in general not having any spinlocks is so much better. > > So moving from schedule_work_on() to spinlocks will save 2 context switches per > cpu every time drain_all_stock() is called. > > On the remote cpu side, my tests point that doing the remote draining is faster > than scheduling a local draining, so it's also a gain. > > Also, IIUC the possible contention in the spinlock approach happens only on > page-faulting and syscalls, versus the schedule_work_on() approach that can > interrupt user workload at any time. > > In fact, not interrupting the user workload in isolated cpus is just a bonus of > using spinlocks. I believe it significantly depends on the preemption model: you're right regarding fully preemptive kernels, but with voluntary/none preemption it's exactly opposite: the draining work will be executed at some point later (probably with 0 cost), while the remote access from another cpu will potentially cause delays on the spin lock as well as a need to refill the stock. Overall I'd expect a noticeable performance regression from an introduction of spin locks and remote draining. Maybe not on all platforms, but at least on some. That's my main concern. And I don't think the problem we're aiming to solve here justifies this potential regression. Thanks!
On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote: > Hi Leonardo! Hello Roman, Thanks a lot for replying! > > > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of > > contention, for a "sometimes we hit spinlock contention". > > > > For the spinlock proposal, on the local cpu side, the *worst case* contention > > is: > > 1 - wait the spin_unlock() for a complete <percpu cache drain process>, > > 2 - wait a cache hit for local per-cpu cacheline > > > > What is current implemented (schedule_work_on() approach), for the local > > cpu side there is *always* this contention: > > 1 - wait for a context switch, > > 2 - wait a cache hit from it's local per-cpu cacheline, > > 3 - wait a complete <percpu cache drain process>, > > 4 - then for a new context switch to the current thread. > > I think both Michal and me are thinking of a more generic case in which the cpu > is not exclusively consumed by 1 special process, so that the draining work can > be executed during an idle time. In this case the work is basically free. Oh, it makes sense. But in such scenarios, wouldn't the same happens to spinlocks? I mean, most of the contention with spinlocks only happens if the remote cpu is trying to drain the cache while the local cpu happens to be draining/charging, which is quite rare due to how fast the local cpu operations are. Also, if the cpu has some idle time, using a little more on a possible spinlock contention should not be a problem. Right? > > And the introduction of a spin_lock() on the hot path is what we're are concerned > about. I agree, that on some hardware platforms it won't be that expensive, > IIRC most hardware platforms with multicore supported by the kernel should have the same behavior, since it's better to rely on cache coherence than locking the memory bus. For instance, the other popular architectures supported by Linux use the LR/SC strategy for atomic operations (tested on ARM, POWER, RISCV) and IIRC the LoadReserve slow part waits for the cacheline exclusivity, which is already already exclusive in this perCPU structure. > but in general not having any spinlocks is so much better. I agree that spinlocks may bring contention, which is not ideal in many cases. In this case, though, it may not be a big issue, due to very rare remote access in the structure, for the usual case (non-pre-OOMCG) > > > > > So moving from schedule_work_on() to spinlocks will save 2 context switches per > > cpu every time drain_all_stock() is called. > > > > On the remote cpu side, my tests point that doing the remote draining is faster > > than scheduling a local draining, so it's also a gain. > > > > Also, IIUC the possible contention in the spinlock approach happens only on > > page-faulting and syscalls, versus the schedule_work_on() approach that can > > interrupt user workload at any time. > > > > In fact, not interrupting the user workload in isolated cpus is just a bonus of > > using spinlocks. > > I believe it significantly depends on the preemption model: you're right regarding > fully preemptive kernels, but with voluntary/none preemption it's exactly opposite: > the draining work will be executed at some point later (probably with 0 cost), So, in case of voluntary/none preemption with some free cpu time. > while the remote access from another cpu will potentially cause delays on the > spin lock as well as a need to refill the stock. But if there is some free CPU time, what is the issue of some (potential) delays due to spinlock contention? 
I am probably missing the whole picture, but when I think of performance improvement, I think on doing more with the same cputime. If we can use free cputime to do stuff later, it's only fair to also use it in case of contention, right? I know there are some cases that may need to be more previsible (mostly RT), but when I think of memory allocation, I don't expect it to always take the same time (as there are caches, pre-OOM, and so) Also, as previously discussed, in case of a busy cpu, the spinlock approach will probably allow more work to be done. > > Overall I'd expect a noticeable performance regression from an introduction of > spin locks and remote draining. Maybe not on all platforms, but at least on some. > That's my main concern. > I see. For the platform I have tested (x86) I noticed better overall performance on spinlocks than upstream solution. For other popular platforms, I have briefly read some documentation on locking/atomicity and I think we may keep the performance gains. But to be sure, I could retake the tests on other platforms, such as ARM, POWER, RISCV, and so. Or even perform extra suggested tests. With that info, would you feel less concerned about a possible change in memcg pcp cache locking scheme? > And I don't think the problem we're aiming to solve here > justifies this potential regression. > To be strict, the isolated cpu scheduling problem is already fixed by the housekeeping patch (with some limitations). At this point, I am trying to bring focus to a (possible) performance improvement on the memcg pcp cache locking system. > Thanks! > Thank you for helping me better understand your arguments and concerns. I really appreciate it! Best regards, Leo
On Tue, Feb 07, 2023 at 12:18:01AM -0300, Leonardo Brás wrote: > On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote: > > Hi Leonardo! > > Hello Roman, > Thanks a lot for replying! > > > > > > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of > > > contention, for a "sometimes we hit spinlock contention". > > > > > > For the spinlock proposal, on the local cpu side, the *worst case* contention > > > is: > > > 1 - wait the spin_unlock() for a complete <percpu cache drain process>, > > > 2 - wait a cache hit for local per-cpu cacheline > > > > > > What is current implemented (schedule_work_on() approach), for the local > > > cpu side there is *always* this contention: > > > 1 - wait for a context switch, > > > 2 - wait a cache hit from it's local per-cpu cacheline, > > > 3 - wait a complete <percpu cache drain process>, > > > 4 - then for a new context switch to the current thread. > > > > I think both Michal and me are thinking of a more generic case in which the cpu > > is not exclusively consumed by 1 special process, so that the draining work can > > be executed during an idle time. In this case the work is basically free. > > Oh, it makes sense. > But in such scenarios, wouldn't the same happens to spinlocks? > > I mean, most of the contention with spinlocks only happens if the remote cpu is > trying to drain the cache while the local cpu happens to be draining/charging, > which is quite rare due to how fast the local cpu operations are. > > Also, if the cpu has some idle time, using a little more on a possible spinlock > contention should not be a problem. Right? > > > > > And the introduction of a spin_lock() on the hot path is what we're are concerned > > about. I agree, that on some hardware platforms it won't be that expensive, > > > > IIRC most hardware platforms with multicore supported by the kernel should have > the same behavior, since it's better to rely on cache coherence than locking the > memory bus. > > For instance, the other popular architectures supported by Linux use the LR/SC > strategy for atomic operations (tested on ARM, POWER, RISCV) and IIRC the > LoadReserve slow part waits for the cacheline exclusivity, which is already > already exclusive in this perCPU structure. > > > > but in general not having any spinlocks is so much better. > > I agree that spinlocks may bring contention, which is not ideal in many cases. > In this case, though, it may not be a big issue, due to very rare remote access > in the structure, for the usual case (non-pre-OOMCG) Hi Leonardo! I made a very simple test: replaced pcp local_lock with a spinlock and ran a very simple kernel memory accounting benchmark (attached below) on my desktop pc (Intel Core i9-7940X). Original (3 attempts): 81341 us 80973 us 81258 us Patched (3 attempts): 99918 us 100220 us 100291 us This is without any contention and without CONFIG_LOCKDEP. Thanks! 
--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 49f67176a1a2..bafd3cde4507 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6563,6 +6563,37 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_alloc_test(struct seq_file *m, void *v)
+{
+	unsigned long i, j;
+	void **ptrs;
+	ktime_t start, end;
+	s64 delta, min_delta = LLONG_MAX;
+
+	ptrs = kvmalloc(sizeof(void *) * 1024 * 1024, GFP_KERNEL);
+	if (!ptrs)
+		return -ENOMEM;
+
+	for (j = 0; j < 100; j++) {
+		start = ktime_get();
+		for (i = 0; i < 1024 * 1024; i++)
+			ptrs[i] = kmalloc(8, GFP_KERNEL_ACCOUNT);
+		end = ktime_get();
+
+		delta = ktime_us_delta(end, start);
+		if (delta < min_delta)
+			min_delta = delta;
+
+		for (i = 0; i < 1024 * 1024; i++)
+			kfree(ptrs[i]);
+	}
+
+	kvfree(ptrs);
+	seq_printf(m, "%lld us\n", min_delta);
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
@@ -6758,6 +6789,10 @@ static struct cftype memory_files[] = {
 		.name = "stat",
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "test",
+		.seq_show = memory_alloc_test,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
On Fri 03-02-23 11:25:16, Roman Gushchin wrote: > On Fri, Feb 03, 2023 at 04:21:09PM +0100, Michal Hocko wrote: > > OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I > > have talked to Frederic off-list and he is working on an implementation. > > I will be offline next whole week (will be back Feb 13th) so I am > > sending this early but this patch cannot be merged without his one of > > course. > > > > I have tried to summarize the reasoning behind both approach should we > > ever need to revisit this approach. For now I strongly believe a simpler > > solution should be preferred. > > > > Roman, I have added your ack as you indicated but if you disagree with > > the reasoning please let me know. > > Great, thank you for preparing it up! Really nice summary. > My ack definitely applies. > > If you want, feel free to add a > "Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>" > tag to blame me later :) Sure, I have updated the changelog. I will repost after the patchset by Frederic[1] is merged. [1] http://lkml.kernel.org/r/20230203232409.163847-1-frederic@kernel.org