From patchwork Mon Mar 13 16:25:14 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Marcelo Tosatti X-Patchwork-Id: 68928 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:5915:0:0:0:0:0 with SMTP id v21csp1291574wrd; Mon, 13 Mar 2023 09:49:34 -0700 (PDT) X-Google-Smtp-Source: AK7set/SEwF0IQriYEKGf+PameOUxUSec4IHv1jihkEaLOvpi64o5JKOgRkk+hNksGs+P/330iY4 X-Received: by 2002:a05:6a20:9390:b0:cc:a62f:1a9d with SMTP id x16-20020a056a20939000b000cca62f1a9dmr41908243pzh.23.1678726174118; Mon, 13 Mar 2023 09:49:34 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1678726174; cv=none; d=google.com; s=arc-20160816; b=S9D0l57sk1JQdtiahxJrgeJFPBHh60+yU4sFp/26wsjj34tLHrYiBa3Ormu3Q/7aS3 UR9Gj6F/8r0qROU25KLr9sYRkBKZpDLiFUGf+4vFpKRUmmE+JIl3lSdXIZ1tzl7cmD9l XN/A/iV/OgdcL8U0VziR2LVLC+uDo6zFD+y0qFKihUK8L77oGayYWQAgVnJ8h/dxRZPj ZqvlqMGkhCY/u5MNw67aulY65qMiSUvOcUG3ReRCaVT3eHVpWcHFYekDb2rHAvQRPqUG ++LaGImpRdCJRjKUv+4D/TyWlxLvrutp9eA1mYOols430trU0lE6xaSwIZdlMUFBr2M5 qHCQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:references:subject:cc:to:from:date :user-agent:message-id:dkim-signature; bh=HRaKPhoc2o/dyfmAGwpM6cYdTDahEMUEDBWaTmQOBWk=; b=jf4x1r9/Q9oaPQgsCyzC52AfyK1uGLcx/IWLzCuCUMZ8x8G4SxYuGgwlsG7H+mts+B ZYrnoGCAsTroUygSCgFtPhD7Ec1WL3SUlFIkJt8uvo/W5LmQt/eeMafLQ4G+ddDwjVjV yvSRPRw/mV9FFHKW56SJ2dOjp0uc8ryPYjO0ruMPCGDMqrIreJfrdTYwpwcjMS/i6fqB +07eCrG4RxkR0kghJA0t0P9EqjBriu2oY/T9QZJLuqq+jgBZRfUZEjd9B4jjvcE4viMb LXyfOz5/SmW1DICIuCZB4f79kUTHxgPXwQbfFAmYrjolpVKBfY2PSFZWyBq/fJ7YGOtL x4IQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=TrQ7L57h; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l69-20020a638848000000b00508bbf1942esi6107502pgd.678.2023.03.13.09.49.21; Mon, 13 Mar 2023 09:49:34 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=TrQ7L57h; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231432AbjCMQ33 (ORCPT + 99 others); Mon, 13 Mar 2023 12:29:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50042 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231256AbjCMQ3C (ORCPT ); Mon, 13 Mar 2023 12:29:02 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6562279B27 for ; Mon, 13 Mar 2023 09:27:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1678724864; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=HRaKPhoc2o/dyfmAGwpM6cYdTDahEMUEDBWaTmQOBWk=; b=TrQ7L57h5rd5rYyegsk9AtlahD31LADJOLq6jCwa5zMkx4KNKOwGNjLFxUQ/JHX92s0d8o S3ftno+11PZzDtLKP72zmk7pSSixGXGj/Ceq1IPsqUoCL5noJ5bV5Kl0+ccPCWcS6LxDTy wNAWGoUhN8N51ZJgoDN6HsuB3++cdPY= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-447-m-vIZTaiPO-3HnJowFEtqw-1; Mon, 13 Mar 2023 12:27:43 -0400 X-MC-Unique: m-vIZTaiPO-3HnJowFEtqw-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 52AC985CBEB; Mon, 13 Mar 2023 16:27:42 +0000 (UTC) Received: from tpad.localdomain (ovpn-112-2.gru2.redhat.com [10.97.112.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id F0662492C13; Mon, 13 Mar 2023 16:27:41 +0000 (UTC) Received: by tpad.localdomain (Postfix, from userid 1000) id 29EC94038A1F7; Mon, 13 Mar 2023 13:26:44 -0300 (-03) Message-ID: <20230313162634.433875205@redhat.com> User-Agent: quilt/0.67 Date: Mon, 13 Mar 2023 13:25:14 -0300 From: Marcelo Tosatti To: Christoph Lameter Cc: Aaron Tomlin , Frederic Weisbecker , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Russell King , Huacai Chen , Heiko Carstens , x86@kernel.org, Vlastimil Babka , Marcelo Tosatti Subject: [PATCH v5 07/12] mm/vmstat: switch counter modification to cmpxchg References: <20230313162507.032200398@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.9 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760271976623542525?= X-GMAIL-MSGID: =?utf-8?q?1760271976623542525?= In preparation for switching vmstat shepherd to flush per-CPU counters remotely, switch the __{mod,inc,dec} functions that modify the counters to use cmpxchg. To facilitate reviewing, functions are ordered in the text file, as: __{mod,inc,dec}_{zone,node}_page_state #ifdef CONFIG_HAVE_CMPXCHG_LOCAL {mod,inc,dec}_{zone,node}_page_state #else {mod,inc,dec}_{zone,node}_page_state #endif This patch defines the __ versions for the CONFIG_HAVE_CMPXCHG_LOCAL case to be their non-"__" counterparts: #ifdef CONFIG_HAVE_CMPXCHG_LOCAL {mod,inc,dec}_{zone,node}_page_state __{mod,inc,dec}_{zone,node}_page_state = {mod,inc,dec}_{zone,node}_page_state #else {mod,inc,dec}_{zone,node}_page_state __{mod,inc,dec}_{zone,node}_page_state #endif To test the performance difference, a page allocator microbenchmark: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz. For the single_page_alloc_free test, which does /** Loop to measure **/ for (i = 0; i < rec->loops; i++) { my_page = alloc_page(gfp_mask); if (unlikely(my_page == NULL)) return 0; __free_page(my_page); } Unit is cycles. Vanilla Patched Diff 115.25 117 1.4% Signed-off-by: Marcelo Tosatti Index: linux-vmstat-remote/mm/vmstat.c =================================================================== --- linux-vmstat-remote.orig/mm/vmstat.c +++ linux-vmstat-remote/mm/vmstat.c @@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_ } } +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL +/* + * If we have cmpxchg_local support then we do not need to incur the overhead + * that comes with local_irq_save/restore if we use this_cpu_cmpxchg. + * + * mod_state() modifies the zone counter state through atomic per cpu + * operations. + * + * Overstep mode specifies how overstep should handled: + * 0 No overstepping + * 1 Overstepping half of threshold + * -1 Overstepping minus half of threshold + */ +static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item, + long delta, int overstep_mode) +{ + struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats; + s8 __percpu *p = pcp->vm_stat_diff + item; + long o, n, t, z; + + do { + z = 0; /* overflow to zone counters */ + + /* + * The fetching of the stat_threshold is racy. We may apply + * a counter threshold to the wrong the cpu if we get + * rescheduled while executing here. However, the next + * counter update will apply the threshold again and + * therefore bring the counter under the threshold again. + * + * Most of the time the thresholds are the same anyways + * for all cpus in a zone. + */ + t = this_cpu_read(pcp->stat_threshold); + + o = this_cpu_read(*p); + n = delta + o; + + if (abs(n) > t) { + int os = overstep_mode * (t >> 1); + + /* Overflow must be added to zone counters */ + z = n + os; + n = -os; + } + } while (this_cpu_cmpxchg(*p, o, n) != o); + + if (z) + zone_page_state_add(z, zone, item); +} + +void mod_zone_page_state(struct zone *zone, enum zone_stat_item item, + long delta) +{ + mod_zone_state(zone, item, delta, 0); +} +EXPORT_SYMBOL(mod_zone_page_state); + +void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, + long delta) +{ + mod_zone_state(zone, item, delta, 0); +} +EXPORT_SYMBOL(__mod_zone_page_state); + +void inc_zone_page_state(struct page *page, enum zone_stat_item item) +{ + mod_zone_state(page_zone(page), item, 1, 1); +} +EXPORT_SYMBOL(inc_zone_page_state); + +void __inc_zone_page_state(struct page *page, enum zone_stat_item item) +{ + mod_zone_state(page_zone(page), item, 1, 1); +} +EXPORT_SYMBOL(__inc_zone_page_state); + +void dec_zone_page_state(struct page *page, enum zone_stat_item item) +{ + mod_zone_state(page_zone(page), item, -1, -1); +} +EXPORT_SYMBOL(dec_zone_page_state); + +void __dec_zone_page_state(struct page *page, enum zone_stat_item item) +{ + mod_zone_state(page_zone(page), item, -1, -1); +} +EXPORT_SYMBOL(__dec_zone_page_state); + +static inline void mod_node_state(struct pglist_data *pgdat, + enum node_stat_item item, + int delta, int overstep_mode) +{ + struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; + s8 __percpu *p = pcp->vm_node_stat_diff + item; + long o, n, t, z; + + if (vmstat_item_in_bytes(item)) { + /* + * Only cgroups use subpage accounting right now; at + * the global level, these items still change in + * multiples of whole pages. Store them as pages + * internally to keep the per-cpu counters compact. + */ + VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1)); + delta >>= PAGE_SHIFT; + } + + do { + z = 0; /* overflow to node counters */ + + /* + * The fetching of the stat_threshold is racy. We may apply + * a counter threshold to the wrong the cpu if we get + * rescheduled while executing here. However, the next + * counter update will apply the threshold again and + * therefore bring the counter under the threshold again. + * + * Most of the time the thresholds are the same anyways + * for all cpus in a node. + */ + t = this_cpu_read(pcp->stat_threshold); + + o = this_cpu_read(*p); + n = delta + o; + + if (abs(n) > t) { + int os = overstep_mode * (t >> 1); + + /* Overflow must be added to node counters */ + z = n + os; + n = -os; + } + } while (this_cpu_cmpxchg(*p, o, n) != o); + + if (z) + node_page_state_add(z, pgdat, item); +} + +void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, + long delta) +{ + mod_node_state(pgdat, item, delta, 0); +} +EXPORT_SYMBOL(mod_node_page_state); + +void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, + long delta) +{ + mod_node_state(pgdat, item, delta, 0); +} +EXPORT_SYMBOL(__mod_node_page_state); + +void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) +{ + mod_node_state(pgdat, item, 1, 1); +} + +void inc_node_page_state(struct page *page, enum node_stat_item item) +{ + mod_node_state(page_pgdat(page), item, 1, 1); +} +EXPORT_SYMBOL(inc_node_page_state); + +void __inc_node_page_state(struct page *page, enum node_stat_item item) +{ + mod_node_state(page_pgdat(page), item, 1, 1); +} +EXPORT_SYMBOL(__inc_node_page_state); + +void dec_node_page_state(struct page *page, enum node_stat_item item) +{ + mod_node_state(page_pgdat(page), item, -1, -1); +} +EXPORT_SYMBOL(dec_node_page_state); + +void __dec_node_page_state(struct page *page, enum node_stat_item item) +{ + mod_node_state(page_pgdat(page), item, -1, -1); +} +EXPORT_SYMBOL(__dec_node_page_state); +#else /* * For use when we know that interrupts are disabled, * or when we know that preemption is disabled and that @@ -541,149 +723,6 @@ void __dec_node_page_state(struct page * } EXPORT_SYMBOL(__dec_node_page_state); -#ifdef CONFIG_HAVE_CMPXCHG_LOCAL -/* - * If we have cmpxchg_local support then we do not need to incur the overhead - * that comes with local_irq_save/restore if we use this_cpu_cmpxchg. - * - * mod_state() modifies the zone counter state through atomic per cpu - * operations. - * - * Overstep mode specifies how overstep should handled: - * 0 No overstepping - * 1 Overstepping half of threshold - * -1 Overstepping minus half of threshold -*/ -static inline void mod_zone_state(struct zone *zone, - enum zone_stat_item item, long delta, int overstep_mode) -{ - struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats; - s8 __percpu *p = pcp->vm_stat_diff + item; - long o, n, t, z; - - do { - z = 0; /* overflow to zone counters */ - - /* - * The fetching of the stat_threshold is racy. We may apply - * a counter threshold to the wrong the cpu if we get - * rescheduled while executing here. However, the next - * counter update will apply the threshold again and - * therefore bring the counter under the threshold again. - * - * Most of the time the thresholds are the same anyways - * for all cpus in a zone. - */ - t = this_cpu_read(pcp->stat_threshold); - - o = this_cpu_read(*p); - n = delta + o; - - if (abs(n) > t) { - int os = overstep_mode * (t >> 1) ; - - /* Overflow must be added to zone counters */ - z = n + os; - n = -os; - } - } while (this_cpu_cmpxchg(*p, o, n) != o); - - if (z) - zone_page_state_add(z, zone, item); -} - -void mod_zone_page_state(struct zone *zone, enum zone_stat_item item, - long delta) -{ - mod_zone_state(zone, item, delta, 0); -} -EXPORT_SYMBOL(mod_zone_page_state); - -void inc_zone_page_state(struct page *page, enum zone_stat_item item) -{ - mod_zone_state(page_zone(page), item, 1, 1); -} -EXPORT_SYMBOL(inc_zone_page_state); - -void dec_zone_page_state(struct page *page, enum zone_stat_item item) -{ - mod_zone_state(page_zone(page), item, -1, -1); -} -EXPORT_SYMBOL(dec_zone_page_state); - -static inline void mod_node_state(struct pglist_data *pgdat, - enum node_stat_item item, int delta, int overstep_mode) -{ - struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; - long o, n, t, z; - - if (vmstat_item_in_bytes(item)) { - /* - * Only cgroups use subpage accounting right now; at - * the global level, these items still change in - * multiples of whole pages. Store them as pages - * internally to keep the per-cpu counters compact. - */ - VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1)); - delta >>= PAGE_SHIFT; - } - - do { - z = 0; /* overflow to node counters */ - - /* - * The fetching of the stat_threshold is racy. We may apply - * a counter threshold to the wrong the cpu if we get - * rescheduled while executing here. However, the next - * counter update will apply the threshold again and - * therefore bring the counter under the threshold again. - * - * Most of the time the thresholds are the same anyways - * for all cpus in a node. - */ - t = this_cpu_read(pcp->stat_threshold); - - o = this_cpu_read(*p); - n = delta + o; - - if (abs(n) > t) { - int os = overstep_mode * (t >> 1) ; - - /* Overflow must be added to node counters */ - z = n + os; - n = -os; - } - } while (this_cpu_cmpxchg(*p, o, n) != o); - - if (z) - node_page_state_add(z, pgdat, item); -} - -void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, - long delta) -{ - mod_node_state(pgdat, item, delta, 0); -} -EXPORT_SYMBOL(mod_node_page_state); - -void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) -{ - mod_node_state(pgdat, item, 1, 1); -} - -void inc_node_page_state(struct page *page, enum node_stat_item item) -{ - mod_node_state(page_pgdat(page), item, 1, 1); -} -EXPORT_SYMBOL(inc_node_page_state); - -void dec_node_page_state(struct page *page, enum node_stat_item item) -{ - mod_node_state(page_pgdat(page), item, -1, -1); -} -EXPORT_SYMBOL(dec_node_page_state); -#else /* * Use interrupt disable to serialize counter updates */ Index: linux-vmstat-remote/mm/page_alloc.c =================================================================== --- linux-vmstat-remote.orig/mm/page_alloc.c +++ linux-vmstat-remote/mm/page_alloc.c @@ -8628,9 +8628,6 @@ static int page_alloc_cpu_dead(unsigned /* * Zero the differential counters of the dead processor * so that the vm statistics are consistent. - * - * This is only okay since the processor is dead and cannot - * race with what we are doing. */ cpu_vm_stats_fold(cpu);