From patchwork Wed Jan 25 07:34:57 2023
X-Patchwork-Submitter: Leonardo Bras Soares Passos
X-Patchwork-Id: 4495
From: Leonardo Bras <leobras@redhat.com>
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
    Muchun Song, Andrew Morton, Marcelo Tosatti
Cc: Leonardo Bras, cgroups@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining
Date: Wed, 25 Jan 2023 04:34:57 -0300
Message-Id: <20230125073502.743446-1-leobras@redhat.com>

Disclaimer:
a - The cover letter got bigger than expected, so I had to split it in
sections to better organize myself. I am not very comfortable with it.
b - The performance numbers below do not include patch 5/5 ("Remove flags
from memcg_stock_pcp"), which could further improve drain_all_stock()
performance, but I only noticed that optimization at the last minute.

0 - Motivation:
In the current codebase, when drain_all_stock() runs, it schedules a
drain_local_stock() on every cpu whose percpu stock is associated with a
descendant of the given root_memcg. This happens even on 'isolated cpus',
a feature commonly used by workloads that are sensitive to interruption
and context switching, such as vRAN and Industrial Control Systems.

Since this scheduling behavior is a problem for those workloads, the
proposal is to replace the current local_lock + schedule_work_on()
solution with a per-cpu spinlock.

1 - Possible performance losses:
With a low number of schedule_work_on() calls, local_locks are expected
to perform better than spinlocks, since the cacheline is always accessed
by a single CPU and there is no contention. The impact on those areas is
analyzed below:

1.1 - Cacheline usage
In the current implementation, drain_all_stock() remote-reads the percpu
memcg_stock cacheline of every online CPU, and remote-writes to the
stocks of all cpus that pass the mem_cgroup_is_descendant() test
(stock->flags, stock->work).

With spinlocks, drain_all_stock() still remote-reads the percpu
memcg_stock cacheline of every online CPU, and remote-writes to the
stocks of all cpus that pass the mem_cgroup_is_descendant() test
(stock->stock_lock, on top of the above).

While the spinlock may require extra acquire/release writes on some
archs, they all happen on an exclusive cacheline, so there is not much
overhead.

In both cases, the next local read will require a fetch from memory
(since the line was invalidated by the remote write), and the next local
write will require gaining cacheline exclusivity first. So as far as
cacheline usage goes, there should not be much impact.

1.2 - Contention
We can safely assume that drain_all_stock() does not run often. If it
did, the large number of scheduled tasks would kill cpu performance.
Since it runs rarely, and it is the only function that accesses the
remote percpu memcg_stock, contention should not be significant, and
should cost less than scheduling the work (remote) plus running the
scheduled work (local).
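To make the proposal concrete, here is a minimal sketch of the two
draining strategies in simplified kernel-style C. It follows the function
and field names mentioned in this letter, but it is not the actual patch;
stock_is_descendant_of() is a made-up helper standing in for the
mem_cgroup_is_descendant() test on the stock's cached memcg:

    /* Upstream (simplified): remote cpus are drained via scheduled
     * work, interrupting whatever runs there, including isolated cpus. */
    static void drain_all_stock(struct mem_cgroup *root_memcg)
    {
        int cpu;

        for_each_online_cpu(cpu) {
            struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

            if (stock_is_descendant_of(stock, root_memcg))
                /* runs drain_local_stock() on that cpu */
                schedule_work_on(cpu, &stock->work);
        }
    }

    /* This series (simplified): the calling cpu drains every stock
     * itself, serialized by a per-cpu spinlock, so no remote work. */
    static void drain_all_stock(struct mem_cgroup *root_memcg)
    {
        unsigned long flags;
        int cpu;

        for_each_online_cpu(cpu) {
            struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

            if (!stock_is_descendant_of(stock, root_memcg))
                continue;

            spin_lock_irqsave(&stock->stock_lock, flags);
            drain_stock(stock);
            spin_unlock_irqrestore(&stock->stock_lock, flags);
        }
    }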
2 - Performance test results:

2.1 - Test System:
- Non-virtualized AMD EPYC 7601 32-Core Processor, 128 CPUs distributed
  over 8 NUMA nodes: 0-7,64-71 on node0, 8-15,72-79 on node1, and so on.
- 256GB RAM, 2x32GB per NUMA node.
- For cpu isolation, the kernel cmdline included:
  isolcpus=8-15,72-79 nohz_full=8-15,72-79
- Memcg group created with:
      [Slice]
      MemoryAccounting=true
      MemoryLimit=1G
      MemoryMax=1G
      MemorySwapMax=1M
- For pinning runs to a given CPU: 'taskset -c $cpunum command'
- For restarting the memcg:
      restart_memcg() {
          systemctl stop test.slice
          systemctl daemon-reload
          systemctl start test.slice
      }

2.2 - Impact on functions that use memcg_stock:

2.2.1 - Approach:
Using perf or tracepoints gives a very coarse result, so it was preferred
to count the total cpu clocks between entering and exiting the functions
that use memcg_stock, on an isolated cpu. Something like this was used on
x86:

    fun_name()
    {
        u64 clk = rdtsc_ordered();
        ...
        clk = rdtsc_ordered() - clk;
        dbg->fun_name.cycles += clk;
        dbg->fun_name.count++;
    }

The percpu statistics were then read via a char device after the test
finished, and an average clock usage per call was calculated for each
function.

For the stress test, "cat /proc/interrupts > /dev/null" was run in a loop
of 100000 iterations inside the memcg, for each cpu tested:

    for cpu in $cpuset; do
        systemd-run --wait --slice=test.slice taskset -c $cpu bash -c \
            "for k in {1..100000}; do cat /proc/interrupts > /dev/null; done"
    done

For the drain_all_stock() test, it was necessary to restart the memcg (or
cause an OOM) to get the function called.

2.2.2 - Results
For 1 isolated CPU, pinned on cpu 8, with no drain_all_stock() calls,
where STDDEV is the standard deviation over the averages of 6 runs and
Call Count is the sum of calls over the 6 runs:

Patched              Average clk    STDDEV           Call Count
consume_stock:       63.75983475    0.1610502136     72167768
refill_stock:        67.45708322    0.09732816852    23401756
mod_objcg_state:     98.03841384    1.491628532      181292961
consume_obj_stock:   63.2543456     0.04624513799    94846454
refill_obj_stock:    78.56390025    0.3160306174     91732973

Upstream             Average clk    STDDEV           Call Count
consume_stock:       53.51201046    0.05200824438    71866536
refill_stock:        65.46991584    0.1178078417     23401752
mod_objcg_state:     84.95365055    1.371464414      181292707
consume_obj_stock:   60.03619438    0.05944582207    94846327
refill_obj_stock:    73.23757912    1.161933856      91732787

Patched - Upstream   Diff (cycles)  Diff %
consume_stock:       10.24782429    19.15051258
refill_stock:        1.987167378    3.035237411
mod_objcg_state:     13.08476328    15.40223781
consume_obj_stock:   3.218151217    5.360351785
refill_obj_stock:    5.326321123    7.272661368

So on average the patched versions of the above functions are 2~13 clock
cycles slower than upstream.

On the other hand, drain_all_stock() is faster in the patched version,
even though it does all the draining itself instead of scheduling the
work to other CPUs:

drain_all_stock
cpus   Upstream       Patched        Diff (cycles)   Diff (%)
1      44331.10831    38978.03581    -5353.072507    -12.07520567
8      43992.96512    39026.76654    -4966.198572    -11.2886198
128    156274.6634    58053.87421    -98220.78915    -62.85138425

(The 8 cpus are in the same NUMA node.)
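For reference, this is roughly what the percpu bookkeeping behind the
numbers in 2.2.1/2.2.2 could look like. All names here (struct memcg_dbg,
__consume_stock(), ...) are made up for illustration; the actual debug
patch was not posted with this series:

    /* hypothetical percpu counters, one set per instrumented function */
    struct fun_stats {
        u64 cycles;
        u64 count;
    };

    struct memcg_dbg {
        struct fun_stats consume_stock;
        struct fun_stats refill_stock;
        /* ... one entry per function in the tables above ... */
    };
    static DEFINE_PER_CPU(struct memcg_dbg, memcg_dbg);

    /* wrapping a measured function, as in the fun_name() example above */
    static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
    {
        struct memcg_dbg *dbg = this_cpu_ptr(&memcg_dbg);
        u64 clk = rdtsc_ordered();
        bool ret;

        ret = __consume_stock(memcg, nr_pages); /* original function body */

        clk = rdtsc_ordered() - clk;
        dbg->consume_stock.cycles += clk;
        dbg->consume_stock.count++;
        return ret;
    }

The per-cpu copies would then be read out through a char device once the
run finishes, and averaged.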
2.3 - Contention numbers

2.3.1 - Approach
On top of the patched version, the spin_lock_irqsave() calls in the
functions that use memcg_stock were replaced with spin_lock_irqsave_cc(),
defined as:

    #define spin_lock_irqsave_cc(l, flags)                  \
        if (!spin_trylock_irqsave(l, flags)) {              \
            u64 clk = rdtsc_ordered();                      \
            spin_lock_irqsave(l, flags);                    \
            clk = rdtsc_ordered() - clk;                    \
            pr_err("mydebug: cpu %d hit contention :"       \
                   " fun = %s, clk = %llu\n",               \
                   smp_processor_id(), __func__, clk);      \
        }

So in case of contention (the trylock failing), it records an approximate
clock count spent waiting for the lock, and prints it to dmesg.

For the stress test, "cat /proc/interrupts > /dev/null" was run in a loop
of 100000 iterations for each of the 128 cpus inside the memcg (limit set
to 20G):

    for cpu in {1..128}; do
        restart_memcg
        systemd-run --wait --slice=test.slice taskset -c $cpu bash -c \
            "for k in {1..100000}; do cat /proc/interrupts > /dev/null; done"
    done

This loop was repeated for over 8 hours.

2.3.2 - Results

Function             # calls
consume_stock:       15078323802
refill_stock:        2495995683
mod_objcg_state:     39765559905
consume_obj_stock:   19882888224
refill_obj_stock:    21025241793
drain_all_stock:     592

Contentions hit: 0

(Other, more aggressive synthetic tests were run, and even then
contention was hit only a couple of times over several hours.)

2.4 - Syscall time measure

2.4.1 - Approach
To measure the patchset's effect on syscall time, the following code was
used (copied/adapted from
https://lore.kernel.org/linux-mm/20220924152227.819815-1-atomlin@redhat.com/):

################
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;
typedef unsigned long long u64;

#ifdef __x86_64__
#define DECLARE_ARGS(val, low, high)    unsigned long low, high
#define EAX_EDX_VAL(val, low, high)     ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)    "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)     "=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high)    unsigned long long val
#define EAX_EDX_VAL(val, low, high)     (val)
#define EAX_EDX_ARGS(val, low, high)    "A" (val)
#define EAX_EDX_RET(val, low, high)     "=A" (val)
#endif

static inline unsigned long long __rdtscll(void)
{
    DECLARE_ARGS(val, low, high);

    /* cpuid serializes before rdtsc; it also clobbers ebx/ecx */
    asm volatile("cpuid; rdtsc"
                 : EAX_EDX_RET(val, low, high) : : "ebx", "ecx");

    return EAX_EDX_VAL(val, low, high);
}

#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

#define NRSYSCALLS 30000000
#define NRSLEEPS   100000

#define page_mmap()    mmap(NULL, 4096, PROT_READ|PROT_WRITE, \
                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
#define page_munmap(x) munmap(x, 4096)

int main(int argc, char *argv[])
{
    u64 a, b;
    int i, syscall = 0;
    int *page;

    page = page_mmap();
    if (page == MAP_FAILED)
        perror("mmap");
    if (page_munmap(page))
        perror("munmap");

    if (argc != 2) {
        printf("usage: %s {idle,syscall}\n", argv[0]);
        exit(1);
    }

    rdtscll(a);

    if (strncmp("idle", argv[1], 4) == 0)
        syscall = 0;
    else if (strncmp("syscall", argv[1], 7) == 0)
        syscall = 1;
    else {
        printf("usage: %s {idle,syscall}\n", argv[0]);
        exit(1);
    }

    if (syscall == 1) {
        for (i = 0; i < NRSYSCALLS; i++) {
            page = page_mmap();
            if (page == MAP_FAILED)
                perror("mmap");
#ifdef MY_WRITE
            page[3] = i;    /* fault the page in by writing to it */
#endif
            if (page_munmap(page))
                perror("munmap");
        }
    } else {
        for (i = 0; i < NRSLEEPS; i++)
            usleep(10);
    }

    rdtscll(b);

    if (syscall == 1)
        printf("cycles / syscall: %llu\n", (b - a) / (NRSYSCALLS * 2));
    else
        printf("cycles / idle loop: %llu\n", (b - a) / NRSLEEPS);

    return 0;
}
################
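A plausible way to build and run the two variants described below (the
source file name and compiler flags are assumptions, not taken from the
original mail):

    # hypothetical build; MY_WRITE adds the write that faults the page
    gcc -O2 -o my_test my_test.c
    gcc -O2 -DMY_WRITE -o my_test_write my_test.c
    # pin to an isolated cpu (cpu 8 on the test system above)
    taskset -c 8 ./my_test syscall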
Running it with "./my_test syscall" prints the average clock cycle usage
of the syscall pair (page_mmap() and page_munmap()).

It was compiled in two versions: with -DMY_WRITE and without it. The
difference is that the former writes to the allocated page, causing it to
fault. Each version was run 200 times, pinned to an isolated cpu, and an
average and standard deviation were calculated over those results.

2.4.2 - Results

Patched      no_write        write
AVG          2991.195        5746.495
STDEV        27.77488427     40.55878512
STDEV %      0.9285547838    0.7058004073

Upstream     no_write        write
AVG          3012.22         5749.605
STDEV        25.1186451      37.26206223
STDEV %      0.8338914522    0.6480803851

Pat - Up     no_write        write
Diff         -21.025         -3.11
Diff %       -0.6979901866   -0.05409067232

Meaning the page_mmap() + page_munmap() + pagefault combination runs a
tiny bit faster on the patched version than on upstream.

3 - Discussion on results

In 2.2 we see that every function that uses memcg_stock on the local cpu
gets slower by a few cycles in the patched version, while the function
that accesses it remotely (drain_all_stock()) gets faster. The difference
becomes more pronounced as the cpu count grows and we consequently start
sharing memory across NUMA nodes.

In 2.3 we see that contention is not a big issue, as expected in 1.2.
This is probably because drain_all_stock() runs quite rarely in normal
operation, and the other functions are quite fast.

In 2.4 we see that page_mmap() + page_munmap() + pagefault ran a tiny bit
faster. This is probably because drain_all_stock() no longer schedules
work on the running cpu, so the cpu does not get interrupted, which may
make up for the increased time in the local functions.

4 - Conclusion

Scheduling work on isolated cpus can be an issue for some workloads.
Mitigating it by replacing the local_lock in memcg_stock with a spinlock
should not cause much impact on performance.

Leonardo Bras (5):
  mm/memcontrol: Align percpu memcg_stock to cache
  mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
  mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes
  mm/memcontrol: Perform all stock drain in current CPU
  mm/memcontrol: Remove flags from memcg_stock_pcp

 mm/memcontrol.c | 75 ++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 44 deletions(-)

Acked-by: Roman Gushchin
Reported-by: Leonardo Bras
Signed-off-by: Michal Hocko