Message ID | 20230126184157.27626-8-tony.luck@intel.com |
---|---|
State | New |
Headers |
From: Tony Luck <tony.luck@intel.com>
To: Fenghua Yu <fenghua.yu@intel.com>, Reinette Chatre <reinette.chatre@intel.com>, Peter Newman <peternewman@google.com>, Jonathan Corbet <corbet@lwn.net>, x86@kernel.org
Cc: Shaopeng Tan <tan.shaopeng@fujitsu.com>, James Morse <james.morse@arm.com>, Jamie Iles <quic_jiles@quicinc.com>, Babu Moger <babu.moger@amd.com>, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, patches@lists.linux.dev, Tony Luck <tony.luck@intel.com>
Subject: [PATCH 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
Date: Thu, 26 Jan 2023 10:41:57 -0800
Message-Id: <20230126184157.27626-8-tony.luck@intel.com>
In-Reply-To: <20230126184157.27626-1-tony.luck@intel.com>
References: <20230126184157.27626-1-tony.luck@intel.com> |
Series |
x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
|
|
Commit Message
Luck, Tony
Jan. 26, 2023, 6:41 p.m. UTC
There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.
Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.
When Sub-NUMA Cluster mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.
E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 of a socket, and RMID counters 200..399 on SNC node 1.
Handle this by initializing a per-cpu RMID offset value. Use this
to calculate the value to write to the RMID field of the IA32_PQR_ASSOC
MSR during context switch, and also to the IA32_QM_EVTSEL MSR when
reading RMID event values.
N.B. This works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span multiple sockets
in a legacy system.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/include/asm/resctrl.h | 4 ++-
arch/x86/kernel/cpu/resctrl/core.c | 43 +++++++++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
3 files changed, 44 insertions(+), 5 deletions(-)
Comments
Hi Tony,

On Thu, Jan 26, 2023 at 7:42 PM Tony Luck <tony.luck@intel.com> wrote:
> +static __init int find_snc_ways(void)
> +{
> +	unsigned long *node_caches;
> +	int cpu, node, ret;
> +
> +	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> +	if (!node_caches)
> +		return 1;
> +
> +	cpus_read_lock();
> +	for_each_node(node) {

Someone tried this patch on a machine with a CPU-less node...

We need to check for this:

+		if (cpumask_empty(cpumask_of_node(node)))
+			continue;

> +		cpu = cpumask_first(cpumask_of_node(node));
> +		set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> +	}
> +	cpus_read_unlock();

Thanks!
-Peter
Hi Tony,

On 1/26/23 12:41, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
[... full commit message and patch quoted ...]
> */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));

I am thinking loud here.

When a new monitor group is created, a new RMID is assigned. This is done by
alloc_rmid. It does not know about the rmid_offset details. It will allocate
one of the free RMIDs.

When CPUs are assigned to the group, the per-cpu pqr_state is updated. At
that point, this RMID becomes default_rmid for that CPU.

But CPUs can be assigned from two different Sub-NUMA nodes.

Considering the same example you mentioned:

E.g. in a 2-way Sub-NUMA cluster with 200 RMID counters there are only
100 counters available to the resctrl code. When running on the first
SNC node RMID values 0..99 are used as before. But when running on the
second node, a task that is assigned resctrl rmid=10 must load 10+100
into IA32_PQR_ASSOC to use RMID counter 110.

 #mount -t resctrl resctrl /sys/fs/resctrl/
 #cd /sys/fs/resctrl/
 #mkdir test		(Let's say RMID 1 is allocated)
 #cd test
 #echo 1 > cpus_list
 #echo 101 > cpus_list

In this case, the following code may run on two different RMIDs even
though it was intended to run on the same RMID.

	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));

Have you thought of this problem?

Thanks
Babu
Babu wrote:
> I am thinking loud here. Have you thought of addressing this problem?
> When a new monitor group is created, new RMID is assigned. This is done by
> alloc_rmid. It does not know about the rmid_offset details. This will
> allocate one of the free RMIDs.
> When CPUs are assigned to the group, then per cpu pqr_state is updated.
> At that point, this RMID becomes default_rmid for that cpu.

Good point. This is a gap. I haven't handled assigning CPUs to resctrl
groups when SNC is enabled. I'm not sure this has a solution :-(

-Tony
On 2/28/2023 2:39 PM, Luck, Tony wrote:
> Babu wrote:
>> I am thinking loud here. Have you thought of addressing this problem?
>> When a new monitor group is created, new RMID is assigned. This is done
>> by alloc_rmid. It does not know about the rmid_offset details. This will
>> allocate one of the free RMIDs.
>> When CPUs are assigned to the group, then per cpu pqr_state is updated.
>> At that point, this RMID becomes default_rmid for that cpu.
> Good point. This is a gap. I haven't handled assigning CPUs to resctrl
> groups when SNC is enabled.

You may need to document it.

Thanks
Babu
On Mon, Feb 27, 2023 at 02:30:38PM +0100, Peter Newman wrote:
> Hi Tony,
>
> Someone tried this patch on a machine with a CPU-less node...
>
> We need to check for this:
>
> +		if (cpumask_empty(cpumask_of_node(node)))
> +			continue;
[...]

Peter,

Tell me more about your CPU-less nodes. Your fix avoids a bad
pointer reference (because cpumask_first() returns cpu >= nr_cpu_ids
for an empty bitmask).

But now I'm worried about whether I have the right values in the
formula:

	nr_node_ids / bitmap_weight(node_caches, nr_node_ids);

This fix avoids counting the L3 from a non-existent CPU, but still
counts the node in the numerator.

Is your CPU-less node a full (non-SNC) node? Like this:

       Socket 0                   Socket 1
 +--------------------+     +--------------------+
 |          .         |     |          .         |
 | SNC 0.0  . SNC 0.1 |     |  zero    .  zero   |
 |          .         |     |  CPUs    .  CPUs   |
 |          .         |     |          .         |
 |          .         |     |          .         |
 +--------------------+     +--------------------+
 |      L3 Cache      |     |      L3 Cache      |
 +--------------------+     +--------------------+

I could fix this case by counting how many CPU-less nodes I find, and
reducing the numerator (the denominator didn't count the L3 cache from
socket 1 because there are no CPUs there):

	(nr_node_ids - n_empty_nodes) / bitmap_weight(node_caches, nr_node_ids);
	=> 2 / 1

But that won't work if your CPU-less node is an SNC node and the other
SNC node in the same socket does have some CPUs:

       Socket 0                   Socket 1
 +--------------------+     +--------------------+
 |          .         |     |          .         |
 | SNC 0.0  . SNC 0.1 |     |  zero    . SNC 1.1 |
 |          .         |     |  CPUs    .         |
 |          .         |     |          .         |
 |          .         |     |          .         |
 +--------------------+     +--------------------+
 |      L3 Cache      |     |      L3 Cache      |
 +--------------------+     +--------------------+

This would get 3 / 2 ... i.e. I should still count the empty node
because its cache was counted by its SNC buddy.

-Tony
Hi Tony,

On Fri, Mar 10, 2023 at 6:30 PM Tony Luck <tony.luck@intel.com> wrote:
> Tell me more about your CPU-less nodes. Your fix avoids a bad
> pointer reference (because cpumask_first() returns cpu >= nr_cpu_ids
> for an empty bitmask).
>
> Is your CPU-less node a full (non-SNC) node?
[...]

In the case I saw, the nodes were AEP DIMMs, so all-memory nodes.
Browsing sysfs, they are listed in has_memory, but not
has_normal_memory or has_cpu.

I imagine CXL.mem would have similar issues.

-Peter
> In the case I saw, the nodes were AEP DIMMs, so all-memory nodes.
Peter,
Thanks. This helps a lot.
Ok. I will add code to count the number of memory only nodes and subtract
that from the numerator of "nodes / L3-caches".
I'll ignore the weird case of a memory-only SNC node when other SNC
nodes on the same socket do have CPUs until such time as someone
convinces me that there is a real-world reason to enable SNC and then
disable the CPUs in one node. It would seem much better to keep SNC
turned off so that the remaining CPUs on the socket get access to all
of the L3.
-Tony
On Tue, Feb 28, 2023 at 01:51:32PM -0600, Moger, Babu wrote:
> I am thinking loud here.
[... RMID allocation / cpus_list example quoted ...]
> Have you thought of this problem?

Now I've thought about this. I don't think it is a problem.

With SNC enabled for two nodes per socket the available RMIDs are
divided between the SNC nodes. For some purposes they are numbered
[0 .. N/2), but in other cases they must be viewed as two separate
sets: [0 .. N/2) on the first node and [N/2 .. N) on the second.

In your example RMID 1 is assigned to the group and you have one CPU
from each node in the group. Processes on CPU1 will load
IA32_PQR_ASSOC.RMID = 1, while processes on CPU101 will set
IA32_PQR_ASSOC.RMID = 101. So counts of memory bandwidth and cache
occupancy will be in two different physical RMID counters.

To read these back the user needs to look up which node each CPU
belongs to and then read from the appropriate
mon_data/mon_L3_$node/{llc_occupancy,mbm_local_bytes,mbm_total_bytes}
file.

$ cat mon_data/mon_L3_00/llc_occupancy	# reads RMID=1
$ cat mon_data/mon_L3_01/llc_occupancy	# reads RMID=101

-Tony
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 52788f79786f..59b8afd8c53c 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 
+DECLARE_PER_CPU(int, rmid_offset);
+
 /*
  * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
  *
@@ -69,7 +71,7 @@ static void __resctrl_sched_in(void)
 	if (static_branch_likely(&rdt_mon_enable_key)) {
 		tmp = READ_ONCE(current->rmid);
 		if (tmp)
-			rmid = tmp;
+			rmid = tmp + this_cpu_read(rmid_offset);
 	}
 
 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 53b2ab37af2f..0ff739375e3b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,6 +16,7 @@
 
 #define pr_fmt(fmt)	"resctrl: " fmt
 
+#include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/cacheinfo.h>
@@ -484,6 +485,13 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
 	return get_cpu_cacheinfo_id(cpu, scope);
 }
 
+DEFINE_PER_CPU(int, rmid_offset);
+
+static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
+{
+	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
+}
+
 /*
  * domain_add_cpu - Add a cpu to a resource's domain list.
  *
@@ -515,6 +523,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		cpumask_set_cpu(cpu, &d->cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
 			rdt_domain_reconfigure_cdp(r);
+		if (r->mon_capable)
+			set_per_cpu_rmid_offset(cpu, r);
 		return;
 	}
 
@@ -533,9 +543,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		domain_free(hw_dom);
-		return;
+	if (r->mon_capable) {
+		if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+			domain_free(hw_dom);
+			return;
+		}
+		set_per_cpu_rmid_offset(cpu, r);
 	}
 
 	list_add_tail(&d->list, add_pos);
@@ -845,11 +858,35 @@ static __init bool get_rdt_resources(void)
 	return (rdt_mon_capable || rdt_alloc_capable);
 }
 
+static __init int find_snc_ways(void)
+{
+	unsigned long *node_caches;
+	int cpu, node, ret;
+
+	node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+	if (!node_caches)
+		return 1;
+
+	cpus_read_lock();
+	for_each_node(node) {
+		cpu = cpumask_first(cpumask_of_node(node));
+		set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+	}
+	cpus_read_unlock();
+
+	ret = nr_node_ids / bitmap_weight(node_caches, nr_node_ids);
+	kfree(node_caches);
+
+	return ret;
+}
+
 static __init void rdt_init_res_defs_intel(void)
 {
 	struct rdt_hw_resource *hw_res;
 	struct rdt_resource *r;
 
+	snc_ways = find_snc_ways();
+
 	for_each_rdt_resource(r) {
 		hw_res = resctrl_to_arch_res(r);
 
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3fc63aa68130..bd5ec348d925 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)