Message ID: 20230126184157.27626-1-tony.luck@intel.com
Series: x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
Message
Luck, Tony
Jan. 26, 2023, 6:41 p.m. UTC
Intel server systems starting with Skylake support a mode that logically
partitions each socket. E.g. when partitioned two ways, half the cores,
L3 cache, and memory controllers are allocated to each of the partitions.
This may reduce average latency to access L3 cache and memory, with the
tradeoff that only half the L3 cache is available for subnode-local memory
access.

The existing Linux resctrl system mishandles RDT monitoring on systems
with SNC mode enabled. But, with some simple changes, this can be fixed.

When SNC mode is enabled, the RDT RMID counters are also partitioned,
with the low numbered counters going to the first partition and the
high numbered counters to the second partition[1]. The key is to adjust
the RMID value written to the IA32_PQR_ASSOC MSR on context switch, the
value written to IA32_QM_EVTSEL when reading out counters, and the
scaling factor that was read from CPUID(0xf,1).EBX.

E.g. in a 2-way Sub-NUMA cluster with 200 RMID counters, only 100
counters are available to the resctrl code. When running on the first
SNC node, RMID values 0..99 are used as before. But when running on the
second node, a task that is assigned resctrl rmid=10 must load 10+100
into IA32_PQR_ASSOC to use RMID counter 110.

There should be no changes to functionality on other architectures, or
on Intel systems with SNC disabled, where snc_ways == 1.

-Tony

[1] Some systems also support a 4-way split. All the above still
applies; just account for cores, cache, memory controllers, and RMID
counters being divided four ways instead of two.

Tony Luck (7):
  x86/resctrl: Refactor in preparation for node-scoped resources
  x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  x86/resctrl: Add a new "snc_ways" file to the monitoring info directory.
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

 Documentation/x86/resctrl.rst             | 15 +++-
 include/linux/resctrl.h                   |  4 +-
 arch/x86/include/asm/resctrl.h            |  4 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |  9 +++
 arch/x86/kernel/cpu/resctrl/core.c        | 83 ++++++++++++++++++++---
 arch/x86/kernel/cpu/resctrl/monitor.c     | 24 ++++---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 22 +++++-
 8 files changed, 136 insertions(+), 27 deletions(-)
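The RMID arithmetic the cover letter describes can be sketched in a few
lines of C. This is an illustration only, not code from the series:
`snc_ways`, `total_rmids`, and the function names are assumptions made
for the example.

```c
#include <stdio.h>

/* Illustrative values: 2-way SNC, 200 hardware RMIDs per socket */
static unsigned int snc_ways = 2;
static unsigned int total_rmids = 200;

/* RMIDs exposed to resctrl: the hardware count divided by SNC ways */
static unsigned int resctrl_num_rmids(void)
{
	return total_rmids / snc_ways;
}

/*
 * Hardware RMID to program into IA32_PQR_ASSOC on context switch
 * (and into IA32_QM_EVTSEL when reading counters): offset the
 * logical RMID by one block of counters per SNC node.
 */
static unsigned int hw_rmid(unsigned int logical_rmid, unsigned int snc_node)
{
	return logical_rmid + snc_node * resctrl_num_rmids();
}

int main(void)
{
	/* Task assigned resctrl rmid=10, running on the second SNC node */
	printf("hardware RMID = %u\n", hw_rmid(10, 1)); /* prints 110 */
	return 0;
}
```

With snc_ways == 1 the offset is always zero, consistent with the claim
that nothing changes on systems with SNC disabled.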
Comments
Hi Tony,

On 26/01/2023 18:41, Tony Luck wrote:
> Intel server systems starting with Skylake support a mode that logically
> partitions each socket. E.g. when partitioned two ways, half the cores,
> L3 cache, and memory controllers are allocated to each of the partitions.
> This may reduce average latency to access L3 cache and memory, with the
> tradeoff that only half the L3 cache is available for subnode-local memory
> access.

I couldn't find a description of what happens to the CAT bitmaps or
counters. Presumably the CAT bitmaps are duplicated, so each cluster has
its own set, and the counters aren't - so software has to co-ordinate
the use of RMID across the CPUs?

How come cacheinfo isn't modified to report the L3 partitions as
separate caches? Otherwise user-space would assume the full size of the
cache is available on any of those CPUs. This would avoid an ABI change
in resctrl (domain is now the numa node), leaving only the RMID range
code.

Thanks,

James
> > Intel server systems starting with Skylake support a mode that logically
> > partitions each socket. E.g. when partitioned two ways, half the cores,
> > L3 cache, and memory controllers are allocated to each of the partitions.
> > This may reduce average latency to access L3 cache and memory, with the
> > tradeoff that only half the L3 cache is available for subnode-local memory
> > access.
>
> I couldn't find a description of what happens to the CAT bitmaps or counters.

No changes to CAT. The cache is partitioned between sub-numa nodes based
on the index, not by dividing the ways. E.g. an 8-way associative 32MB
cache is still 8-way associative in each sub-node, but with 16MB
available to each node. This means users who want a specific amount of
cache will need to allocate more bits in the cache way mask (because
each way is half as big).

> Presumably the CAT bitmaps are duplicated, so each cluster has its own
> set, and the counters aren't - so software has to co-ordinate the use
> of RMID across the CPUs?

Nope. Still one set of CAT bit maps per socket. With "N" RMIDs available
on a system with SNC disabled, there will be N/2 available when there
are 2 SNC nodes per socket. Processes use values [0 .. N/2).

> How come cacheinfo isn't modified to report the L3 partitions as
> separate caches? Otherwise user-space would assume the full size of the
> cache is available on any of those CPUs.

The size of the cache is perhaps poorly defined in the SNC enabled case.
A well-behaved NUMA application that is only accessing memory from its
local node will see an effective cache half the size. But if a process
accesses memory from the other SNC node on the same socket, then it will
get allocations in that SNC node's half share of the cache. Accessing
memory across inter-socket links will end up allocating across the whole
cache.

Moral: SNC mode is intended for applications that have very well-behaved
NUMA characteristics.

-Tony
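Tony's way-mask point is easy to check with the numbers from his
example. A minimal sketch, assuming the 8-way/32MB figures above
(illustrative values only, not from the patches):

```c
#include <stdio.h>

int main(void)
{
	unsigned int cache_mb = 32, ways = 8, snc_nodes = 2;

	/* Index-based partitioning keeps associativity, shrinks each way */
	unsigned int way_mb = cache_mb / ways;                   /* 4 MB */
	unsigned int snc_way_mb = (cache_mb / snc_nodes) / ways; /* 2 MB */

	/* CAT way-mask bits needed for an 8 MB allocation */
	unsigned int target_mb = 8;
	printf("SNC off: %u bits, SNC on: %u bits\n",
	       target_mb / way_mb, target_mb / snc_way_mb);
	return 0;
}
```

The same 8 MB allocation needs twice as many way-mask bits once each way
is half as big.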