From patchwork Tue Oct  3 21:30:41 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Luck, Tony" <tony.luck@intel.com>
X-Patchwork-Id: 148064
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:2a8e:b0:403:3b70:6f57 with SMTP id
 in14csp2363367vqb;
        Tue, 3 Oct 2023 14:31:30 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IHqFt52T2bmDxoEyyzrom+3+6ZilxLWtXzJjybhEzcQrWnTix+KxcdmMkuBhTBJTit9Dq+G
X-Received: by 2002:a05:6e02:1e02:b0:350:f0bb:6a42 with SMTP id
 g2-20020a056e021e0200b00350f0bb6a42mr722332ila.29.1696368690009;
        Tue, 03 Oct 2023 14:31:30 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1696368689; cv=none;
        d=google.com; s=arc-20160816;
        b=LPi2qjwCj9f4y3+3mP9CROtPEPXnkuv9Ha2vCcCp770VTF0UwsSE1e6kbFUHiekfka
         aMQIAHiZ/EyZztLD5RxpjF1K5mo7ApLUXdl1fjCgLW6Nvvo1GnoQyUo3nIR/Qg+XE2KT
         tI6yDFBdZUzVhQAuVBIsEjS8FdYNiqVi+K5dK3bnhmuqzgaPeT9rmFkvQUGvgWIqOr1s
         KB+fTUfL/yExBiD2N57lKu5bEWqwAUVWNcEPPKJ+5r32WeTec1nHVzZW6Dvh1jPvvJSC
         JJEoDxeontqyO9HiYVyQgsh0wRq7PWUBdb26PDk/YZjW97FbQVnLExpMDpLJviVwAv1g
         km2w==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=nNUpFUWNpLH0gPTiKB7/LBlSjMZQ0bHFrsTyHzfawmo=;
        fh=EIH9XAmicvPIUSP7TBeBhZ/WaoqG49JQ3xV1i3Gl7Co=;
        b=CSKJTzUcUxMB9auuzSLIvgV+YRClslghMzzy9V1HogoIFtX8aWkQzfMoutBdmWUCxE
         2JEUmhcTIDCIBILFC+Zkf2FgnCMObSuQLATHf+1QBLIOMPsDK8qzb9lUukEwIha1PYg9
         4NfcmkZqhe1ZZeUPRqFzsJCQwFrfmSzGq3znwr+hZhSoRJEQIvoDpl4d2rtl/duL8kfL
         /VgI3GFfqX8Dh9CarOXu4BdxhUe0BA7U2RKREXLOqR4z9u5cEjazJfgGM6D5WkjC1/HF
         jbhXj2c0EgFPCVomqyI6GfmN9gIy3XA34mYY5JQfMDmAxBP7Q6L9cysXfOz/NRFWwJd+
         3lCw==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=QECDAq1b;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.37 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from snail.vger.email (snail.vger.email. [23.128.96.37])
        by mx.google.com with ESMTPS id
 f20-20020a63f114000000b00578da80ac3dsi1844371pgi.80.2023.10.03.14.31.29
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 03 Oct 2023 14:31:29 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=QECDAq1b;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.37 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by snail.vger.email (Postfix) with ESMTP id 2EC1D81B8962;
	Tue,  3 Oct 2023 14:31:29 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at snail.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S241283AbjJCVbN (ORCPT <rfc822;chrisfriedt@gmail.com>
        + 17 others); Tue, 3 Oct 2023 17:31:13 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50854 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S241245AbjJCVbA (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 3 Oct 2023 17:31:00 -0400
Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.9])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B7E299B;
        Tue,  3 Oct 2023 14:30:55 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1696368656; x=1727904656;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=0IrW4YPJzpJ32D8aDB1vndlXCaF4RwOeigEMvowf5nw=;
  b=QECDAq1bE/WVp7dIhARD3svuNIY0votkQMLQRLRlY3ohlLiWi3SeKMPg
   WFDhMKW2GA+ZULkOBTEyyTTb+p8bDOx5NYEQcCfwsNbwHKzyrFYrmZR3l
   2erolSRX+jw+USDmBmCAvq3ZxR6WuCqnV+bQhlA6gM+jYw4BUHZBDn5Bt
   GQA2ZbNYq5zbP90VFrdgBbs7GQq8ODPWHeeFWdYZlrKnmXQem6KjzU33x
   sgTcCSCwcgahMi4WUvmPQIm7Qgql57CaPsPayjbt9ocDUeJnFoOt9ZLi+
   jMpQIm1hRp2WGXPRmoDWdZ5JiW7O03zBxlFrqMyI2JaK4HSfd56g+r+FS
   w==;
X-IronPort-AV: E=McAfee;i="6600,9927,10852"; a="1580433"
X-IronPort-AV: E=Sophos;i="6.03,198,1694761200";
   d="scan'208";a="1580433"
Received: from orsmga005.jf.intel.com ([10.7.209.41])
  by orvoesa101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 03 Oct 2023 14:30:53 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10852"; a="924801997"
X-IronPort-AV: E=Sophos;i="6.03,198,1694761200";
   d="scan'208";a="924801997"
Received: from agluck-desk3.sc.intel.com ([172.25.222.74])
  by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 03 Oct 2023 14:30:52 -0700
From: Tony Luck <tony.luck@intel.com>
To: Fenghua Yu <fenghua.yu@intel.com>,
        Reinette Chatre <reinette.chatre@intel.com>,
        Peter Newman <peternewman@google.com>,
        Jonathan Corbet <corbet@lwn.net>,
        Shuah Khan <skhan@linuxfoundation.org>, x86@kernel.org
Cc: Shaopeng Tan <tan.shaopeng@fujitsu.com>,
        James Morse <james.morse@arm.com>,
        Jamie Iles <quic_jiles@quicinc.com>,
        Babu Moger <babu.moger@amd.com>,
        Randy Dunlap <rdunlap@infradead.org>,
        linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
        patches@lists.linux.dev, Tony Luck <tony.luck@intel.com>
Subject: [PATCH v8 6/8] x86/resctrl: Introduce snc_nodes_per_l3_cache
Date: Tue,  3 Oct 2023 14:30:41 -0700
Message-ID: <20231003213043.13565-7-tony.luck@intel.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20231003213043.13565-1-tony.luck@intel.com>
References: <20230928191350.205703-1-tony.luck@intel.com>
 <20231003213043.13565-1-tony.luck@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,
        DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,
        RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_NONE autolearn=ham
        autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
        lindbergh.monkeyblade.net
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]);
 Tue, 03 Oct 2023 14:31:29 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1778071488064176783
X-GMAIL-MSGID: 1778771495565584135

Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.

This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This cost may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.

Resctrl monitoring on Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.

The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.

The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.

Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.

Add a global integer "snc_nodes_per_l3_cache" that will show how many
SNC nodes share each L3 cache. When this is "1", SNC mode is either
not implemented, or not enabled.

A later patch will detect SNC mode and set snc_nodes_per_l3_cache to
the appropriate value. For now it remains at the default "1" to
indicate SNC mode is not active.

Code that needs to take action when SNC is enabled is:
1) The number of logical RMIDs per L3 cache available for use is the
   number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be adjusted for the number
   of SNC nodes.
3) The RMID renumbering operates when using the value from the
   IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
   counter, code must adjust from the logical RMID used to the physical
   RMID value for the SNC node that it wishes to read and load the
   adjusted value into the IA32_QM_EVTSEL MSR.
4) The L3 cache is divided between the SNC nodes. So the value
   reported in the resctrl "size" file is adjusted.
5) The "-o mba_MBps" mount option must be disabled in SNC mode
   because the monitoring is being done per SNC node, while the
   bandwidth allocation is still done at the L3 cache scope.
   Trying to use this feedback loop might result in contradictory
   changes to the throttling level coming from each of the SNC
   node bandwidth measurements.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---

Changes since last version:

In commit comment s/redumbering/renumbering/

Move check that SNC is not enabled into supports_mba_mbps().

---

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 3aed8e7b8487..3fddda401b83 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
 extern struct dentry *debugfs_resctrl;
 
+extern int snc_nodes_per_l3_cache;
+
 enum resctrl_res_level {
 	RDT_RESOURCE_L3,
 	RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6b937da36e4c..cd189b7ca6ea 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
  */
 bool rdt_alloc_capable;
 
+/*
+ * Number of SNC nodes that share each L3 cache.  Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
 static void
 mba_wrmsr_intel(struct rdt_ctrl_domain *d, struct msr_param *m,
 		struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 97d2ed829f5d..e6e566921a60 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 
 static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	int rmid_offset = 0;
 	u64 msr_val;
 
+	/*
+	 * When SNC mode is on, need to compute the offset to read the
+	 * physical RMID counter for the node to which this CPU belongs.
+	 */
+	if (snc_nodes_per_l3_cache > 1)
+		rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -783,8 +793,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
-	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index afa7a8dca48d..def203c40d70 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1357,7 +1357,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
 		}
 	}
 
-	return size;
+	return size / snc_nodes_per_l3_cache;
 }
 
 /**
@@ -2207,7 +2207,8 @@ static bool supports_mba_mbps(void)
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear());
+		r->alloc_capable && is_mba_linear() &&
+		snc_nodes_per_l3_cache == 1);
 }
 
 /*