Message ID | 20231030072540.38631-1-byungchul@sk.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org>
From: Byungchul Park <byungchul@sk.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kernel_team@skhynix.com, akpm@linux-foundation.org, ying.huang@intel.com, namit@vmware.com, xhao@linux.alibaba.com, mgorman@techsingularity.net, hughd@google.com, willy@infradead.org, david@redhat.com, peterz@infradead.org, luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Subject: [v3 0/3] Reduce TLB flushes under some specific conditions
Date: Mon, 30 Oct 2023 16:25:37 +0900
Message-Id: <20231030072540.38631-1-byungchul@sk.com>
X-Mailer: git-send-email 2.17.1
List-ID: <linux-kernel.vger.kernel.org> |
Series | Reduce TLB flushes under some specific conditions |
Message
Byungchul Park
Oct. 30, 2023, 7:25 a.m. UTC
To Huang Ying,

I tried to apply migrc to swap. Fortunately I couldn't see any regression, but no performance improvement either. I thought it's meaningless to make the code bigger without observing any benefit, so I won't include that part. Thoughts?

To Nadav Amit,

I tried to split this patch set into as many pieces as possible for better review. However, it was very hard to make each patch work meaningfully and stably because they are very tightly coupled to one another. So I ended up combining those patches into one again. Instead, I tried my best to add sufficient comments in the code. Any opinion would be appreciated.

Hi everyone,

While working with CXL memory, I have been facing migration overhead, especially TLB shootdown on promotion or demotion between different tiers. Most TLB shootdowns on migration through hinting fault can be avoided thanks to Huang Ying's work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE is inaccessible"). However, that only covers migrations that go through hinting fault. I thought it'd be much better to have a general mechanism that reduces the number of TLB flushes and TLB misses and can be applied to any type of migration, though I tried it only for tiering migration for now.

I'm suggesting a mechanism that reduces TLB flushes by keeping both the source and destination folios participating in a migration until all required TLB flushes are done, but only if those folios are not mapped with any write-permission PTE entries at all. I worked based on v6.6-rc5.

Can you believe it? With the workload I tested, XSBench, I saw the number of TLB full flushes reduced by about 80%, iTLB misses reduced by about 50%, and the performance improved a little bit. I believe it would help more with other or real workloads. It'd be appreciated to let me know if I'm missing something.

	Byungchul

---

Changes from RFC v2:

1. Remove the additional occupation in struct page. To do that, union
   migrc's list with the lru field and add a page flag. I know a page
   flag is something we don't like to add, but there is no choice
   because migrc has to distinguish folios under its control from
   others. Instead, I restrict migrc to 64-bit systems to mitigate
   the impact.
2. Remove the meaningless internal object allocator that I introduced
   to minimize impact on the system; a ton of tests showed there was
   no difference.
3. Stop migrc from working when the system is under high memory
   pressure, e.g. about to perform direct reclaim. Under conditions
   where the swap mechanism is heavily used, the system suffered from
   regression without this control.
4. Exclude folios with pte_dirty() == true from migrc's interest so
   that migrc can work more simply.
5. Combine several tightly coupled patches into one.
6. Add sufficient comments for better review.
7. Manage migrc's requests per node (instead of globally).
8. Add the TLB miss improvement to the commit message.
9. Test with more CPUs (4 -> 16) to see a bigger improvement.

Changes from RFC:

1. Fix a bug triggered when a destination folio of a previous
   migration becomes a source folio of the next migration before the
   folio gets handled properly, so that the folio can take part in
   another migration. There was an inconsistency in the folio's
   state; fixed it.
2. Split the patch set into more pieces so that folks can review it
   better. (Feedback from Nadav Amit)
3. Fix a wrong usage of a barrier, e.g. smp_mb__after_atomic().
   (Feedback from Nadav Amit)
4. Try to add sufficient comments to explain the patch set better.
   (Feedback from Nadav Amit)

Byungchul Park (3):
  mm/rmap: Recognize non-writable TLB entries during TLB batch flush
  mm: Defer TLB flush by keeping both src and dst folios at migration
  mm, migrc: Add a sysctl knob to enable/disable MIGRC mechanism

 arch/x86/include/asm/tlbflush.h |   9 +
 arch/x86/mm/tlb.c               |  98 ++++++++++
 include/linux/mm.h              |  25 +++
 include/linux/mm_types.h        |  49 +++++
 include/linux/mm_types_task.h   |   4 +-
 include/linux/mmzone.h          |   7 +
 include/linux/page-flags.h      |   7 +
 include/linux/sched.h           |   9 +
 include/trace/events/mmflags.h  |   9 +-
 init/Kconfig                    |  14 ++
 mm/internal.h                   |  14 ++
 mm/memory.c                     |  13 ++
 mm/migrate.c                    | 310 ++++++++++++++++++++++++++++++++
 mm/page_alloc.c                 |  29 ++-
 mm/rmap.c                       | 115 +++++++++++-
 15 files changed, 703 insertions(+), 9 deletions(-)
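[Editorial note] The deferral idea described in the cover letter — park (source, destination) pairs of read-only-mapped folios and let one batched TLB flush cover all of them — can be sketched as a toy userspace model. All names here are illustrative, not from the actual patches; the real implementation works on folios, the rmap, and arch TLB-flush code.

```python
# Toy model of the batching idea: migrations of pages mapped only
# read-only defer their TLB flush, keeping both src and dst pages
# alive until one batched flush covers them all. Writable mappings
# must flush immediately, since a stale writable TLB entry could
# allow a write to a source page that has already been reused.

class MigrcModel:
    def __init__(self):
        self.pending = []   # (src, dst) pairs kept until the flush
        self.flushes = 0    # number of TLB flush operations issued

    def migrate(self, page, writable):
        dst = f"copy-of-{page}"
        if writable:
            # Cannot defer: flush right away, src may be freed now.
            self.flushes += 1
            return dst
        # Read-only mapping: keep both src and dst, defer the flush.
        self.pending.append((page, dst))
        return dst

    def flush_batch(self):
        # One flush retires every deferred migration; only now may
        # the source pages be freed and reused.
        if self.pending:
            self.flushes += 1
            self.pending.clear()

m = MigrcModel()
for i in range(100):
    m.migrate(f"page{i}", writable=False)
m.flush_batch()
print(m.flushes)  # 1 flush instead of 100
```

This only models the bookkeeping, not the hard parts the changelog alludes to (memory-pressure cutoff, pte_dirty() exclusion, and a folio becoming a source of a new migration while still parked as a destination of a previous one).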
Comments
On 10/30/23 00:25, Byungchul Park wrote:
> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> destination of folios participated in the migrations until all TLB
> flushes required are done, only if those folios are not mapped with
> write permission PTE entries at all. I worked Based on v6.6-rc5.

There's a lot of common overhead here, on top of the complexity in general:

 * A new page flag
 * A new cpumask_t in task_struct
 * A new zone list
 * Extra (temporary) memory consumption

and the benefits are ... "performance improved a little bit" on one workload. That doesn't seem like a good overall tradeoff to me.

There will certainly be workloads that, before this patch, would have little or no memory pressure and after this patch would need to do reclaim.

Also, looking with my arch/x86 hat on, there's really nothing arch-specific here. Please try to keep stuff out of arch/x86 unless it's very much arch-specific.

The connection between the arch-generic TLB flushing and __flush_tlb_local() seems quite tenuous. __flush_tlb_local() is, to me, quite deep in the implementation and there are quite a few ways that a random TLB flush might not end up there. In other words, I'm not saying that this is broken, but it's not clear at all to me how it functions reliably.
> On Oct 30, 2023, at 7:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> !! External Email
>
> On 10/30/23 00:25, Byungchul Park wrote:
>> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
>> destination of folios participated in the migrations until all TLB
>> flushes required are done, only if those folios are not mapped with
>> write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload. That doesn't seem like a good overall tradeoff to me.

I almost forgot that I did (and embarrassingly did not follow up on) a TLB flush deferring mechanism before [*], which was relatively generic. I did not look at the migration case, but it could have been added relatively easily, I think.

Feel free to plagiarize if you find it suitable. Note that some of the patch set is not relevant (e.g., 20/20 has already been fixed, 3/20 was merged).

[*] https://lore.kernel.org/linux-mm/20210131001132.3368247-1-namit@vmware.com/
On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
> On 10/30/23 00:25, Byungchul Park wrote:
> > I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> > destination of folios participated in the migrations until all TLB
> > flushes required are done, only if those folios are not mapped with
> > write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload. That doesn't seem like a good overall tradeoff to me.
>
> There will certainly be workloads that, before this patch, would have
> little or no memory pressure and after this patch would need to do reclaim.

Deciding whether '(gain - cost) > 0' is a difficult problem. I think the following are already big benefits in general:

1. A big reduction in the number of IPIs
2. A big reduction in the number of TLB flushes
3. A big reduction in the number of TLB misses

Of course, I (or we) need to keep trying to see a better number in end-to-end performance.

> Also, looking with my arch/x86 hat on, there's really nothing
> arch-specific here. Please try to keep stuff out of arch/x86 unless
> it's very much arch-specific.

Okay. I will try to keep it out of arch code, though I'd have to give up any optimization that can only be achieved by working in arch code.

	Byungchul
On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
> On 10/30/23 00:25, Byungchul Park wrote:
> > I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> > destination of folios participated in the migrations until all TLB
> > flushes required are done, only if those folios are not mapped with
> > write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload. That doesn't seem like a good overall tradeoff to me.

I tested it under limited conditions to get stable results, e.g. not using hyper-threading and dedicating CPU time to the test. However, I'm convinced that this patch set is more worth developing than you think it is. Let me share the results I've just got after changing the number of CPUs participating in the test from 16 to 80, on a system with 80 CPUs. This is just for your information - not that stable tho.

	Byungchul

---

Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (80 CPUs 1GB, no CPUs 8GB)
Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)

mainline kernel
===============

The 1st try)
=====================================
Threads:     64
Runtime:     233.118 seconds
=====================================
numa_pages_migrated            758334
pgmigrate_success             1724964
nr_tlb_remote_flush            305706
nr_tlb_remote_flush_received 18598543
nr_tlb_local_flush_all          19092
nr_tlb_local_flush_one        4518717

The 2nd try)
=====================================
Threads:     64
Runtime:     221.725 seconds
=====================================
numa_pages_migrated            633209
pgmigrate_success             2156509
nr_tlb_remote_flush            261977
nr_tlb_remote_flush_received 14289256
nr_tlb_local_flush_all          11738
nr_tlb_local_flush_one        4520317

mainline kernel + migrc
=======================

The 1st try)
=====================================
Threads:     64
Runtime:     212.522 seconds
=====================================
numa_pages_migrated            901264
pgmigrate_success             1990814
nr_tlb_remote_flush            151280
nr_tlb_remote_flush_received  9031376
nr_tlb_local_flush_all          21208
nr_tlb_local_flush_one        4519595

The 2nd try)
=====================================
Threads:     64
Runtime:     204.410 seconds
=====================================
numa_pages_migrated            929260
pgmigrate_success             2729868
nr_tlb_remote_flush            166722
nr_tlb_remote_flush_received  8238273
nr_tlb_local_flush_all          13717
nr_tlb_local_flush_one        4519582
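[Editorial note] To make the raw counters above easier to compare, the deltas can be summarized by averaging the two runs of each kernel. This is only the editor's arithmetic over the numbers as posted; the author notes the runs are "not that stable", so these are rough figures, not stable measurements.

```python
# Averages of the two posted runs: mainline vs. mainline + migrc.
mainline_runtime = (233.118 + 221.725) / 2
migrc_runtime    = (212.522 + 204.410) / 2

mainline_remote_flush = (305706 + 261977) / 2
migrc_remote_flush    = (151280 + 166722) / 2

mainline_recv = (18598543 + 14289256) / 2   # flush IPIs received
migrc_recv    = (9031376 + 8238273) / 2

runtime_gain = 1 - migrc_runtime / mainline_runtime
flush_cut    = 1 - migrc_remote_flush / mainline_remote_flush
recv_cut     = 1 - migrc_recv / mainline_recv

print(f"runtime improvement:             {runtime_gain:.1%}")  # ~8.3%
print(f"remote flushes reduced:          {flush_cut:.1%}")     # ~44%
print(f"remote flushes received reduced: {recv_cut:.1%}")      # ~47.5%
```

So on this 80-CPU run the remote-flush counters roughly halve while the runtime improves by a high single-digit percentage.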
On 30.10.23 23:55, Byungchul Park wrote:
> On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
>> On 10/30/23 00:25, Byungchul Park wrote:
>>> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
>>> destination of folios participated in the migrations until all TLB
>>> flushes required are done, only if those folios are not mapped with
>>> write permission PTE entries at all. I worked Based on v6.6-rc5.
>>
>> There's a lot of common overhead here, on top of the complexity in general:
>>
>> * A new page flag
>> * A new cpumask_t in task_struct
>> * A new zone list
>> * Extra (temporary) memory consumption
>>
>> and the benefits are ... "performance improved a little bit" on one
>> workload. That doesn't seem like a good overall tradeoff to me.
>>
>> There will certainly be workloads that, before this patch, would have
>> little or no memory pressure and after this patch would need to do reclaim.
>
> 'if (gain - cost) > 0 ?' is a difficult problem. I think the followings
> are already big benefit in general:
>
> 1. big reduction of IPIs #
> 2. big reduction of TLB flushes #
> 3. big reduction of TLB misses #
>
> Of course, I or we need to keep trying to see a better number in
> end-to-end performance.

You'll have to show convincing, real numbers, for use cases people care about, to even motivate why people should consider looking at this in more detail. If you can't measure it and only speculate, nobody cares.

The numbers you provided were so far not convincing, and it's questionable whether the single benchmark you are presenting represents a reasonable real workload that ends up improving *real* workloads. A better description of the whole benchmark and why it represents real workload behavior might help.