Message ID | cover.1705688538.git.babu.moger@amd.com |
---|---|
Headers |
Return-Path: <linux-kernel+bounces-31470-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:2bc4:b0:101:a8e8:374 with SMTP id hx4csp1192869dyb; Fri, 19 Jan 2024 10:22:48 -0800 (PST) X-Google-Smtp-Source: AGHT+IGJ/HCdezBeLfRTRBpT0yCjq+zhcNur4lUuVjQ278PaBFnR6PKE+lHM9d7rZ+Ysk+4dWNwT X-Received: by 2002:a05:6a00:84f:b0:6db:7656:1571 with SMTP id q15-20020a056a00084f00b006db76561571mr290701pfk.22.1705688567812; Fri, 19 Jan 2024 10:22:47 -0800 (PST) Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id cd17-20020a056a00421100b006dbbae74070si1946502pfb.184.2024.01.19.10.22.47 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 19 Jan 2024 10:22:47 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-31470-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@amd.com header.s=selector1 header.b=DgrXMIuE; arc=fail (signature failed); spf=pass (google.com: domain of linux-kernel+bounces-31470-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-31470-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amd.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 3395A2877E1 for <ouuuleilei@gmail.com>; Fri, 19 Jan 2024 18:22:47 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 3C0AD56479; Fri, 19 Jan 2024 18:22:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="DgrXMIuE" Received: from NAM10-MW2-obe.outbound.protection.outlook.com (mail-mw2nam10on2048.outbound.protection.outlook.com [40.107.94.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CD1421F170; Fri, 19 Jan 2024 18:22:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=40.107.94.48 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1705688551; cv=fail; b=iPUZoKLg+c1mzg9wryyAL16m0N4rkGWQ1yh2rsmrFJnXJPfB5aRccArNvBSiYKQM2vhMCLobjQys41wTkZyYm64DrIFLTnPmiXEdbA54EmF2Xv90kBe7v65N43BAw7FZtKr1nl5cuujhkXTyGRzEjgCunb+TgN1s3i67F3UOInE= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1705688551; c=relaxed/simple; bh=hYFvbkiCkn6JY+LHEm/O2JlTPF+xBB1f4BjAQiUnsE0=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=BAQEhoDrXz6WymKCgAipCMpUIj+R+aP/WCifSnB0X14KzI1F2RTcdtVNkWmNiHA9bzoojab19Yv0fZKChiSe2WyoYj0TdqgxTLP+8nnYhnQyOGCjzidZoZgFN4Jp0d299hfMvAxWbXXpzhe8RHjUaD1+fiXK3PzV4tb94A2qbD0= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=DgrXMIuE; arc=fail smtp.client-ip=40.107.94.48 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=S2rhsNANnp9AzVGxkL3rdstksHE41l7utXGG1oSYUa8CUMp8I8Wep0Pt+ToTrSUiZiT7K2idcVAZMk2Q5x2dySei4jRf4bIZ4ZN1yvp8lCTI94Hiq5h63zTYO3L+ItBC1LVO5RMjUpnI7tYGLVhM/pLXikMLWJ9z9s27wmaIKjJby97q85NdSjAU3Sq9FpWY1/qc7YVXdh+RPIqnJCuOyaPKWWUaSz7deBd8a5clap5m+WX+rt/L3hcCnhv5ys2JMSsvWDhyFiiZ58Sl7cb9t4sTI778ahSaW4YE8VN7opgnq+fA3b9tAGf238KRugKAU+5zcn0U8aPFiA8KtmpSYA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=eBTWO4HKg5kXdHE4fN/DwYJmPTPBWOrfFZIqDaXQcMY=; b=bFBxiU2gCKNu3EOpyMV3+nqCwbIzOHQLB6h7T6OFVeg0qcqNph4gMbxYb9ppy3Mq1u9UWdFGipy59wlydHeG50Qjrip65mv/KL9A131x7nuSiq1BsVWQUZhJSZcrvCvUBgRPVqLfhgH0bge23YLSARzw4Tb2XxIKXSQT/iC1g2IdfG9HsHyy6u5i9KUcINwaxmzqqrixkbvYu+vvEyLU99phWRlTSEURg62OiyznegXQyJci4pwiaa1keyCV2/2HElpYT8a6JHssaiApxBKOMDZZ5pSYfgindIWkbIdrWl35qxMw3zrUTAFM0CZbyNXRWrFvXnsE6eL8y33Zte1Mqw== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=lwn.net smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=eBTWO4HKg5kXdHE4fN/DwYJmPTPBWOrfFZIqDaXQcMY=; b=DgrXMIuEIq1+qD7cNV3IL+4oM/f1lu4bo5fryBIkdN0cHIqjk30FXXht6aKQBOAW10dguPp9Rw/Wd/Ujr+KlxpzwRP0GkQAUjT7olcS2qmdk8wJzIPBvGn915pwSeYGrdg0BsisUin4Fxu6UcMYfzoxoAg0lNj40y6PDdi6fJGU= Received: from BYAPR11CA0046.namprd11.prod.outlook.com (2603:10b6:a03:80::23) by DS0PR12MB7852.namprd12.prod.outlook.com (2603:10b6:8:147::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.26; Fri, 19 Jan 2024 18:22:26 +0000 Received: from SJ1PEPF00001CE3.namprd05.prod.outlook.com (2603:10b6:a03:80:cafe::a3) by BYAPR11CA0046.outlook.office365.com (2603:10b6:a03:80::23) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.26 via Frontend Transport; Fri, 19 Jan 2024 18:22:26 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C Received: from SATLEXMB04.amd.com (165.204.84.17) by SJ1PEPF00001CE3.mail.protection.outlook.com (10.167.242.11) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.20.7202.16 via Frontend Transport; Fri, 19 Jan 2024 18:22:26 +0000 Received: from bmoger-ubuntu.amd.com (10.180.168.240) by SATLEXMB04.amd.com (10.181.40.145) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.34; Fri, 19 Jan 2024 12:22:24 -0600 From: Babu Moger <babu.moger@amd.com> To: <corbet@lwn.net>, <fenghua.yu@intel.com>, <reinette.chatre@intel.com>, <tglx@linutronix.de>, <mingo@redhat.com>, <bp@alien8.de>, <dave.hansen@linux.intel.com> CC: <x86@kernel.org>, <hpa@zytor.com>, <paulmck@kernel.org>, <rdunlap@infradead.org>, <tj@kernel.org>, <peterz@infradead.org>, <yanjiewtw@gmail.com>, <babu.moger@amd.com>, <kim.phillips@amd.com>, <lukas.bulwahn@gmail.com>, <seanjc@google.com>, <jmattson@google.com>, <leitao@debian.org>, <jpoimboe@kernel.org>, <rick.p.edgecombe@intel.com>, <kirill.shutemov@linux.intel.com>, <jithu.joseph@intel.com>, <kai.huang@intel.com>, <kan.liang@linux.intel.com>, <daniel.sneddon@linux.intel.com>, <pbonzini@redhat.com>, <sandipan.das@amd.com>, <ilpo.jarvinen@linux.intel.com>, <peternewman@google.com>, <maciej.wieczor-retman@intel.com>, <linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <eranian@google.com> Subject: [PATCH v2 00/17] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC) Date: Fri, 19 Jan 2024 12:22:02 -0600 Message-ID: <cover.1705688538.git.babu.moger@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231201005720.235639-1-babu.moger@amd.com> References: <20231201005720.235639-1-babu.moger@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit X-ClientProxiedBy: SATLEXMB04.amd.com (10.181.40.145) To SATLEXMB04.amd.com (10.181.40.145) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: SJ1PEPF00001CE3:EE_|DS0PR12MB7852:EE_ X-MS-Office365-Filtering-Correlation-Id: 9b39f223-8a12-4762-8dfe-08dc191b96f2 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: hsbHMqMegj3CQr2lrO58tYi6Da2cIo2VVs51eeokx6Dg1KLC4s0FcoBUw3aQ0jJsqOcRmkJsnDsqyo/8LF+a40JidD4yTJ3iw7gO1S10y6DKDnMsXqFzfqe0+61KFvADP6b3N7fOup/WMJeIEpVdW1LQfw39NH71O65dZKbuYuopImarQN0CkuMDCGJWuDvmwthD9684jvHuS2//rk02K04kqhWwGu2FjwxuWrYRNdB4wlm9Q3EB4AHeVrxcHoswnRgxs6+vIYHMFi+oKErWYMgfOi38PdUM/Hv9ceDhstp23UvoajC4KA6lYBOj7QXnxi2QB0qPD6LL8rOZGrqjpcMio5ozsEJNpyVVhJEP7t6K+sAbafTqtQq09eCks+y+FTnZHxMIi96equ7pMe3EbqDcVYwUKh3zLN/UXVn9Sw8Xl4lDi9Z83mx2eQ2/6Iu/80Ndo54BOEe3KuODOg7I8Kf/pyI0WTbC9Rpt56v5prZY8Vzf25TBM+CqbFdRbwGXFKD8P/Dp9AWuUiB4jIv405TyjpK2p4GWAC+xxbC2qT284CjAbIC62/4C6jq8St4J9wYO+e3wABfaEKPc47Mtx7wGJGYALksV3QVUuoJV1acte5YoJSV/SXKCRh97J4o7ozQhteFD9fTRbdosppDMEFWNVmol8r/8PyigeuYNK7qFQvof3RTPnIrx+Qe/i72QowWuaQfL2WXqRZdbPTA6plHxhOdGbJvQsZpIjtx/mX5MEBWNioryFp3Z2c5CT1o5rPSH1zf5+Gzux0d2m9tvnoplyeHC327Ni350LmorOWk= X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:SATLEXMB04.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230031)(4636009)(376002)(39860400002)(136003)(346002)(396003)(230922051799003)(451199024)(186009)(82310400011)(64100799003)(1800799012)(36840700001)(46966006)(40470700004)(86362001)(40460700003)(40480700001)(70586007)(36756003)(2616005)(426003)(83380400001)(81166007)(356005)(16526019)(336012)(47076005)(26005)(70206006)(7416002)(7406005)(316002)(2906002)(5660300002)(110136005)(54906003)(966005)(478600001)(7696005)(4326008)(6666004)(44832011)(82740400003)(41300700001)(8676002)(36860700001)(8936002)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 19 Jan 2024 18:22:26.5807 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 9b39f223-8a12-4762-8dfe-08dc191b96f2 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com] X-MS-Exchange-CrossTenant-AuthSource: SJ1PEPF00001CE3.namprd05.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: DS0PR12MB7852 X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1788544095590380206 X-GMAIL-MSGID: 1788544095590380206 |
Series |
x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)
|
|
Message
Moger, Babu
Jan. 19, 2024, 6:22 p.m. UTC
These series adds the support for Assignable Bandwidth Monitoring Counters (ABMC). It is also called QoS RMID Pinning feature The feature details are documented in the APM listed below [1]. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth Monitoring (ABMC). The documentation is available at Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 The patches are based on top of commit 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) # Introduction AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring feature only guarantees that RMIDs currently assigned to a processor will be tracked by hardware. The counters of any other RMIDs which are no longer being tracked will be reset to zero. The MBM event counters return "Unavailable" for the RMIDs that are not active. Users can create 256 or more monitor groups. But there can be only limited number of groups that can be give guaranteed monitoring numbers. With ever changing configurations there is no way to definitely know which of these groups will be active for certain point of time. Users do not have the option to monitor a group or set of groups for certain period of time without worrying about RMID being reset in between. The ABMC feature provides an option to the user to assign an RMID to the hardware counter and monitor the bandwidth for a longer duration. The assigned RMID will be active until the user unassigns it manually. There is no need to worry about counters being reset during this period. Additionally, the user can specify a bitmask identifying the specific bandwidth types from the given source to track with the counter. Without ABMC enabled, monitoring will work in current mode without assignment option. # Linux Implementation Linux resctrl subsystem provides the interface to count maximum of two memory bandwidth events per group, from a combination of available total and local events. Keeping the current interface, users can assign a maximum of 2 ABMC counters per group. User will also have the option to assign only one counter to the group. If the system runs out of assignable ABMC counters, kernel will display an error. Users need to unassign an already assigned counter to make space for new assignments. # Examples a. Check if ABMC support is available #mount -t resctrl resctrl /sys/fs/resctrl/ #cat /sys/fs/resctrl/info/L3_MON/mon_features llc_occupancy mbm_total_bytes mbm_total_bytes_config mbm_local_bytes mbm_local_bytes_config mbm_assign_capable ← Linux kernel detected ABMC feature b. Check if ABMC is enabled. By default, ABMC feature is disabled. Monitoring works in legacy monitor mode when ABMC is not enabled. #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable 0 c. There will be new file "monitor_state" for each monitor group when ABMC feature is supported. However, monitor_state is not available if ABMC is disabled. #cat /sys/fs/resctrl/monitor_state Unsupported d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC enabled, monitoring will work in current mode without assignment option. # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes 779247936 # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 765207488 e. Enable ABMC mode. #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable 1 f. Read the monitor states. By default, both total and local MBM events are in "unassign" state. #cat /sys/fs/resctrl/monitor_state total=unassign;local=unassign g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, the MBA events are not available until the user assigns the events explicitly. #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes Unsupported #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes Unsupported h. The event llc_occupancy is not affected by ABMC mode. Users can still read the llc_occupancy. #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy 557056 i. Now assign the total event and read the monitor_state. #echo total=assign > /sys/fs/resctrl/monitor_state #cat /sys/fs/resctrl/monitor_state total=assign;local=unassign j. Now that the total event is assigned. Read the total event. #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes 6136000 k. Now assign the local event and read the monitor_state. #echo local=assign > /sys/fs/resctrl/monitor_state #cat /sys/fs/resctrl/monitor_state total=assign;local=assign Users can also assign both total and local events in one single command. l. Now that both total and local events are assigned, read the events. #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes 6136000 #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 58694 m. Check the bandwidth configuration for the group. Note that bandwidth configuration has a domain scope. Total event defaults to 0x7F (to count all the events) and local event defaults to 0x15 (to count all the local numa events). The event bitmap decoding is available at https://www.kernel.org/doc/Documentation/x86/resctrl.rst in section "mbm_total_bytes_config", "mbm_local_bytes_config": #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 0=0x7f;1=0x7f #cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 0=0x15;1=0xi15 n. Change the bandwidth source for domain 0 for the total event to count only reads. Note that this change effects lotal events on the domain 0. #echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 0=0x33;1=0x7F o. Now read the total event again. The mbm_total_bytes should display only the read events. #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes 314101 p. Unmount the resctrl #umount /sys/fs/resctrl/ NOTE: For simplicity these examples are run on a default resctrl group. Similar experiments are can be run non-defaults groups. --- v2: a. Major change is the way ABMC is enabled. Earlier, user needed to remount with -o abmc to enable ABMC feature. Removed that option now. Now users can enable ABMC by "$echo 1 to /sys/fs/resctrl/info/L3_MON/mbm_assign_enable". b. Added new word 21 to x86/cpufeatures.h. c. Display unsupported if user attempts to read the events when ABMC is enabled and event is not assigned. d. Display monitor_state as "Unsupported" when ABMC is disabled. e. Text updates and rebase to latest tip tree (as of Jan 18). f. This series is still work in progress. I am yet to hear from ARM developers. v1 : https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/ Babu Moger (17): x86/cpufeatures: Add word 21 for scattered CPUID features x86/resctrl: Add support for Assignable Bandwidth Monitoring Counters (ABMC) x86/resctrl: Add ABMC feature in the command line options x86/resctrl: Detect Assignable Bandwidth Monitoring feature details x86/resctrl: Introduce resctrl_file_fflags_init x86/resctrl: Introduce interface to display number of ABMC counters x86/resctrl: Add support to enable/disable ABMC feature x86/resctrl: Introduce the interface to display ABMC state x86/resctrl: Introdruce rdtgroup_assign_enable_write x86/resctrl: Add interface to display monitor state of the group x86/resctrl: Report Unsupported when MBM events are read x86/resctrl: Initialize assignable counters bitmap x86/resctrl: Add data structures for ABMC assignment x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg x86/resctrl: Add the interface to assign the RMID x86/resctrl: Add the interface unassign the RMID x86/resctrl: Update RMID assignments on event configuration changes .../admin-guide/kernel-parameters.txt | 2 +- Documentation/arch/x86/resctrl.rst | 62 +++ arch/x86/include/asm/cpufeature.h | 6 +- arch/x86/include/asm/cpufeatures.h | 7 +- arch/x86/include/asm/disabled-features.h | 3 +- arch/x86/include/asm/msr-index.h | 2 + arch/x86/include/asm/required-features.h | 3 +- arch/x86/kernel/cpu/cpuid-deps.c | 3 + arch/x86/kernel/cpu/resctrl/core.c | 25 +- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 15 + arch/x86/kernel/cpu/resctrl/internal.h | 54 +- arch/x86/kernel/cpu/resctrl/monitor.c | 26 +- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 466 +++++++++++++++++- arch/x86/kernel/cpu/scattered.c | 1 + include/linux/resctrl.h | 2 + 15 files changed, 651 insertions(+), 26 deletions(-)
Comments
+James On 1/19/2024 10:22 AM, Babu Moger wrote: > > f. This series is still work in progress. I am yet to hear from ARM developers. Please at least include James in your submissions to make Arm aware of this work. Reinette
On 1/19/2024 12:32 PM, Reinette Chatre wrote: > +James > > On 1/19/2024 10:22 AM, Babu Moger wrote: >> >> f. This series is still work in progress. I am yet to hear from ARM developers. > Please at least include James in your submissions to make Arm aware of this work. Thank you Babu Moger
Hi Babu, On 1/19/2024 10:22 AM, Babu Moger wrote: > > These series adds the support for Assignable Bandwidth Monitoring Counters Not a good start ([1]). > (ABMC). It is also called QoS RMID Pinning feature > > The feature details are documented in the APM listed below [1]. > [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming > Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth > Monitoring (ABMC). The documentation is available at > Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 > > The patches are based on top of commit > 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) > > # Introduction > > AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring > feature only guarantees that RMIDs currently assigned to a processor will > be tracked by hardware. The counters of any other RMIDs which are no longer > being tracked will be reset to zero. The MBM event counters return > "Unavailable" for the RMIDs that are not active. > > Users can create 256 or more monitor groups. But there can be only limited > number of groups that can be give guaranteed monitoring numbers. With ever "can be given"? > changing configurations there is no way to definitely know which of these > groups will be active for certain point of time. Users do not have the > option to monitor a group or set of groups for certain period of time > without worrying about RMID being reset in between. > > The ABMC feature provides an option to the user to assign an RMID to the > hardware counter and monitor the bandwidth for a longer duration. > The assigned RMID will be active until the user unassigns it manually. > There is no need to worry about counters being reset during this period. > Additionally, the user can specify a bitmask identifying the specific > bandwidth types from the given source to track with the counter. > > Without ABMC enabled, monitoring will work in current mode without > assignment option. > > # Linux Implementation > > Linux resctrl subsystem provides the interface to count maximum of two > memory bandwidth events per group, from a combination of available total > and local events. Keeping the current interface, users can assign a maximum > of 2 ABMC counters per group. User will also have the option to assign only > one counter to the group. If the system runs out of assignable ABMC > counters, kernel will display an error. Users need to unassign an already > assigned counter to make space for new assignments. > > > # Examples > > a. Check if ABMC support is available > #mount -t resctrl resctrl /sys/fs/resctrl/ > > #cat /sys/fs/resctrl/info/L3_MON/mon_features > llc_occupancy > mbm_total_bytes > mbm_total_bytes_config > mbm_local_bytes > mbm_local_bytes_config > mbm_assign_capable ← Linux kernel detected ABMC feature > > b. Check if ABMC is enabled. By default, ABMC feature is disabled. > Monitoring works in legacy monitor mode when ABMC is not enabled. > > #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable > 0 > With the introduction of "mbm_assign_enable" the entry in mon_features seems to provide duplicate information. > c. There will be new file "monitor_state" for each monitor group when ABMC > feature is supported. However, monitor_state is not available if ABMC is > disabled. > > #cat /sys/fs/resctrl/monitor_state > Unsupported This sounds potentially confusing since users will still be able to monitor the groups ... > > d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC > enabled, monitoring will work in current mode without assignment option. > > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes > 779247936 > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > 765207488 > > e. Enable ABMC mode. > > #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable > #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable > 1 > > f. Read the monitor states. By default, both total and local MBM > events are in "unassign" state. > > #cat /sys/fs/resctrl/monitor_state > total=unassign;local=unassign This interface does not seem to take into account that hardware can support assignment per domain. I understand that this is not something you want to implement at this time but the user interface has to accommodate such an enhancement. This was already mentioned, and you did acknowledge the point [3] to this new version that does not reflect this is unexpected. My previous suggestions do seem to still stand and and I also am not able to see how Peter's requests [2] were considered. This same interface needs to accommodate usages apart from ABMC. For example, how to use this interface to address the same counter issue on AMD hardware without ABMC, and MPAM (pending James's feedback). I understand that until we hear from Arm we do not know all the requirements that this interface needs to support, but I do expect this interface to at least consider requirements and usage scenarios that are already known. > g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, > the MBA events are not available until the user assigns the events > explicitly. > > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes > Unsupported > > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > Unsupported > This needs some more thought to accommodate Peter's scenario where the counter can be expected to return the final count after the counter is disabled. > h. The event llc_occupancy is not affected by ABMC mode. Users can still > read the llc_occupancy. > > #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy > 557056 > > i. Now assign the total event and read the monitor_state. > > #echo total=assign > /sys/fs/resctrl/monitor_state > #cat /sys/fs/resctrl/monitor_state > total=assign;local=unassign > I do not see the "global assign/unassign" scenario addressed. This version seems to ignore (without discussion) a lot of earlier feedback. Reinette [1] https://lore.kernel.org/lkml/5ce67d8f-e207-4029-8fb3-0bc7deab1e9f@amd.com/ [2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/ [3] https://lore.kernel.org/lkml/38421428-84cb-b67e-f3ce-b7a0233e016b@amd.com/
+James On 2/1/2024 8:09 PM, Reinette Chatre wrote: > Hi Babu, > > On 1/19/2024 10:22 AM, Babu Moger wrote: >> >> These series adds the support for Assignable Bandwidth Monitoring Counters > > Not a good start ([1]). > >> (ABMC). It is also called QoS RMID Pinning feature >> >> The feature details are documented in the APM listed below [1]. >> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming >> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth >> Monitoring (ABMC). The documentation is available at >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 >> >> The patches are based on top of commit >> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) >> >> # Introduction >> >> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring >> feature only guarantees that RMIDs currently assigned to a processor will >> be tracked by hardware. The counters of any other RMIDs which are no longer >> being tracked will be reset to zero. The MBM event counters return >> "Unavailable" for the RMIDs that are not active. >> >> Users can create 256 or more monitor groups. But there can be only limited >> number of groups that can be give guaranteed monitoring numbers. With ever > > "can be given"? > >> changing configurations there is no way to definitely know which of these >> groups will be active for certain point of time. Users do not have the >> option to monitor a group or set of groups for certain period of time >> without worrying about RMID being reset in between. >> >> The ABMC feature provides an option to the user to assign an RMID to the >> hardware counter and monitor the bandwidth for a longer duration. >> The assigned RMID will be active until the user unassigns it manually. >> There is no need to worry about counters being reset during this period. >> Additionally, the user can specify a bitmask identifying the specific >> bandwidth types from the given source to track with the counter. >> >> Without ABMC enabled, monitoring will work in current mode without >> assignment option. >> >> # Linux Implementation >> >> Linux resctrl subsystem provides the interface to count maximum of two >> memory bandwidth events per group, from a combination of available total >> and local events. Keeping the current interface, users can assign a maximum >> of 2 ABMC counters per group. User will also have the option to assign only >> one counter to the group. If the system runs out of assignable ABMC >> counters, kernel will display an error. Users need to unassign an already >> assigned counter to make space for new assignments. >> >> >> # Examples >> >> a. Check if ABMC support is available >> #mount -t resctrl resctrl /sys/fs/resctrl/ >> >> #cat /sys/fs/resctrl/info/L3_MON/mon_features >> llc_occupancy >> mbm_total_bytes >> mbm_total_bytes_config >> mbm_local_bytes >> mbm_local_bytes_config >> mbm_assign_capable ← Linux kernel detected ABMC feature >> >> b. Check if ABMC is enabled. By default, ABMC feature is disabled. >> Monitoring works in legacy monitor mode when ABMC is not enabled. >> >> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> 0 >> > > With the introduction of "mbm_assign_enable" the entry in mon_features seems > to provide duplicate information. > >> c. There will be new file "monitor_state" for each monitor group when ABMC >> feature is supported. However, monitor_state is not available if ABMC is >> disabled. >> >> #cat /sys/fs/resctrl/monitor_state >> Unsupported > > This sounds potentially confusing since users will still be able to monitor > the groups ... > >> >> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC >> enabled, monitoring will work in current mode without assignment option. >> >> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> 779247936 >> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >> 765207488 >> >> e. Enable ABMC mode. >> >> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> 1 >> >> f. Read the monitor states. By default, both total and local MBM >> events are in "unassign" state. >> >> #cat /sys/fs/resctrl/monitor_state >> total=unassign;local=unassign > > This interface does not seem to take into account that hardware > can support assignment per domain. I understand that this is > not something you want to implement at this time but the user interface > has to accommodate such an enhancement. This was already mentioned, and > you did acknowledge the point [3] to this new version that does not > reflect this is unexpected. > > My previous suggestions do seem to still stand and and I also am not able to > see how Peter's requests [2] were considered. This same interface needs to > accommodate usages apart from ABMC. For example, how to use this interface > to address the same counter issue on AMD hardware without ABMC, and MPAM > (pending James's feedback). > > I understand that until we hear from Arm we do not know all the requirements > that this interface needs to support, but I do expect this interface to > at least consider requirements and usage scenarios that are already known. > >> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, >> the MBA events are not available until the user assigns the events >> explicitly. >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> Unsupported >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >> Unsupported >> > > This needs some more thought to accommodate Peter's scenario where the counter > can be expected to return the final count after the counter is disabled. > >> h. The event llc_occupancy is not affected by ABMC mode. Users can still >> read the llc_occupancy. >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy >> 557056 >> >> i. Now assign the total event and read the monitor_state. >> >> #echo total=assign > /sys/fs/resctrl/monitor_state >> #cat /sys/fs/resctrl/monitor_state >> total=assign;local=unassign >> > > I do not see the "global assign/unassign" scenario addressed. > > This version seems to ignore (without discussion) a lot of earlier > feedback. > > Reinette > > [1] https://lore.kernel.org/lkml/5ce67d8f-e207-4029-8fb3-0bc7deab1e9f@amd.com/ > [2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/ > [3] https://lore.kernel.org/lkml/38421428-84cb-b67e-f3ce-b7a0233e016b@amd.com/
Hi Reinette, On 2/1/2024 10:09 PM, Reinette Chatre wrote: > Hi Babu, > > On 1/19/2024 10:22 AM, Babu Moger wrote: >> These series adds the support for Assignable Bandwidth Monitoring Counters > Not a good start ([1]). Yea. My bad. > >> (ABMC). It is also called QoS RMID Pinning feature >> >> The feature details are documented in the APM listed below [1]. >> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming >> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth >> Monitoring (ABMC). The documentation is available at >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 >> >> The patches are based on top of commit >> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) >> >> # Introduction >> >> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring >> feature only guarantees that RMIDs currently assigned to a processor will >> be tracked by hardware. The counters of any other RMIDs which are no longer >> being tracked will be reset to zero. The MBM event counters return >> "Unavailable" for the RMIDs that are not active. >> >> Users can create 256 or more monitor groups. But there can be only limited >> number of groups that can be give guaranteed monitoring numbers. With ever > "can be given"? "can give guaranteed monitoring numbers." I feel this looks better. > >> changing configurations there is no way to definitely know which of these >> groups will be active for certain point of time. Users do not have the >> option to monitor a group or set of groups for certain period of time >> without worrying about RMID being reset in between. >> >> The ABMC feature provides an option to the user to assign an RMID to the >> hardware counter and monitor the bandwidth for a longer duration. >> The assigned RMID will be active until the user unassigns it manually. >> There is no need to worry about counters being reset during this period. >> Additionally, the user can specify a bitmask identifying the specific >> bandwidth types from the given source to track with the counter. >> >> Without ABMC enabled, monitoring will work in current mode without >> assignment option. >> >> # Linux Implementation >> >> Linux resctrl subsystem provides the interface to count maximum of two >> memory bandwidth events per group, from a combination of available total >> and local events. Keeping the current interface, users can assign a maximum >> of 2 ABMC counters per group. User will also have the option to assign only >> one counter to the group. If the system runs out of assignable ABMC >> counters, kernel will display an error. Users need to unassign an already >> assigned counter to make space for new assignments. >> >> >> # Examples >> >> a. Check if ABMC support is available >> #mount -t resctrl resctrl /sys/fs/resctrl/ >> >> #cat /sys/fs/resctrl/info/L3_MON/mon_features >> llc_occupancy >> mbm_total_bytes >> mbm_total_bytes_config >> mbm_local_bytes >> mbm_local_bytes_config >> mbm_assign_capable ← Linux kernel detected ABMC feature >> >> b. Check if ABMC is enabled. By default, ABMC feature is disabled. >> Monitoring works in legacy monitor mode when ABMC is not enabled. >> >> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> 0 >> > With the introduction of "mbm_assign_enable" the entry in mon_features seems > to provide duplicate information. ok. We can remove the text in mon_features and keep mbm_assign_enable. We need this to enable and disable the feature. > >> c. There will be new file "monitor_state" for each monitor group when ABMC >> feature is supported. However, monitor_state is not available if ABMC is >> disabled. >> >> #cat /sys/fs/resctrl/monitor_state >> Unsupported > This sounds potentially confusing since users will still be able to monitor > the groups ... How about "Assignment-Unsupported"? > >> >> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC >> enabled, monitoring will work in current mode without assignment option. >> >> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> 779247936 >> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >> 765207488 >> >> e. Enable ABMC mode. >> >> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> 1 >> >> f. Read the monitor states. By default, both total and local MBM >> events are in "unassign" state. >> >> #cat /sys/fs/resctrl/monitor_state >> total=unassign;local=unassign > This interface does not seem to take into account that hardware > can support assignment per domain. I understand that this is > not something you want to implement at this time but the user interface > has to accommodate such an enhancement. This was already mentioned, and > you did acknowledge the point [3] to this new version that does not > reflect this is unexpected. Yea. Domain level assignment is not supported at this point. Do you want me to explicitly mention here? Please elaborate what you meant here. > > My previous suggestions do seem to still stand and and I also am not able to > see how Peter's requests [2] were considered. This same interface needs to > accommodate usages apart from ABMC. For example, how to use this interface > to address the same counter issue on AMD hardware without ABMC, and MPAM > (pending James's feedback). Yea. Agree. Peter's comments are not addressed. I am not all clear about details of Peters and James requirement. With respect to ABMC here are my requirements. a. Assignment needs to be done at group level. b. User should be able to assign each event individually. Multiple events assignment(in one command) should be supported. c. I have no plans to implement domain level assignment. It is done at system level. d. We need only couple of states. Assigned and unassigned. e. monitor_state is name of file for user interface. We can change that based on comments. Peter, James, Please comment on what you want achieve in "assignment" based on the features you are working on. Do you want to add new states? > > I understand that until we hear from Arm we do not know all the requirements > that this interface needs to support, but I do expect this interface to > at least consider requirements and usage scenarios that are already known. Sure. Will try that in the next version. Lets continue the discussion. >> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, >> the MBA events are not available until the user assigns the events >> explicitly. >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> Unsupported >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >> Unsupported >> > This needs some more thought to accommodate Peter's scenario where the counter > can be expected to return the final count after the counter is disabled. I am not sure how to achieve this with ABMC. This may be applicable to soft rmid only. In case of "soft rmid", previous readings are saved in the soft rmid state. > >> h. The event llc_occupancy is not affected by ABMC mode. Users can still >> read the llc_occupancy. >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy >> 557056 >> >> i. Now assign the total event and read the monitor_state. >> >> #echo total=assign > /sys/fs/resctrl/monitor_state >> #cat /sys/fs/resctrl/monitor_state >> total=assign;local=unassign >> > I do not see the "global assign/unassign" scenario addressed. I am not all clear on meaning of "global assign/unassign". Does it mean looping thru all the groups and assign the RMIDs? It may not work in many cases. In case of ABMC, we have only limited number of hw counters. It will fail after hardware runs out of counters. It is better done selectively based on which group user is interested in. But it can be done later if we find a use case for that. > > This version seems to ignore (without discussion) a lot of earlier > feedback. Please feel free comment. There are various threads of discussion. I may have missed. Thanks Babu > > Reinette > > [1] https://lore.kernel.org/lkml/5ce67d8f-e207-4029-8fb3-0bc7deab1e9f@amd.com/ > [2] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/ > [3] https://lore.kernel.org/lkml/38421428-84cb-b67e-f3ce-b7a0233e016b@amd.com/
Hi Babu, On 2/2/2024 1:57 PM, Moger, Babu wrote: > On 2/1/2024 10:09 PM, Reinette Chatre wrote: >> On 1/19/2024 10:22 AM, Babu Moger wrote: >>> These series adds the support for Assignable Bandwidth Monitoring Counters >> Not a good start ([1]). > > Yea. My bad. > >> >>> (ABMC). It is also called QoS RMID Pinning feature >>> >>> The feature details are documented in the APM listed below [1]. >>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming >>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth >>> Monitoring (ABMC). The documentation is available at >>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 >>> >>> The patches are based on top of commit >>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) >>> >>> # Introduction >>> >>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring >>> feature only guarantees that RMIDs currently assigned to a processor will >>> be tracked by hardware. The counters of any other RMIDs which are no longer >>> being tracked will be reset to zero. The MBM event counters return >>> "Unavailable" for the RMIDs that are not active. >>> Users can create 256 or more monitor groups. But there can be only limited >>> number of groups that can be give guaranteed monitoring numbers. With ever >> "can be given"? > > "can give guaranteed monitoring numbers." > > I feel this looks better. Sounds good. Thank you. > >> >>> changing configurations there is no way to definitely know which of these >>> groups will be active for certain point of time. Users do not have the >>> option to monitor a group or set of groups for certain period of time >>> without worrying about RMID being reset in between. >>> The ABMC feature provides an option to the user to assign an RMID to the >>> hardware counter and monitor the bandwidth for a longer duration. >>> The assigned RMID will be active until the user unassigns it manually. >>> There is no need to worry about counters being reset during this period. >>> Additionally, the user can specify a bitmask identifying the specific >>> bandwidth types from the given source to track with the counter. >>> >>> Without ABMC enabled, monitoring will work in current mode without >>> assignment option. >>> >>> # Linux Implementation >>> >>> Linux resctrl subsystem provides the interface to count maximum of two >>> memory bandwidth events per group, from a combination of available total >>> and local events. Keeping the current interface, users can assign a maximum >>> of 2 ABMC counters per group. User will also have the option to assign only >>> one counter to the group. If the system runs out of assignable ABMC >>> counters, kernel will display an error. Users need to unassign an already >>> assigned counter to make space for new assignments. >>> >>> >>> # Examples >>> >>> a. Check if ABMC support is available >>> #mount -t resctrl resctrl /sys/fs/resctrl/ >>> >>> #cat /sys/fs/resctrl/info/L3_MON/mon_features >>> llc_occupancy >>> mbm_total_bytes >>> mbm_total_bytes_config >>> mbm_local_bytes >>> mbm_local_bytes_config >>> mbm_assign_capable ← Linux kernel detected ABMC feature >>> >>> b. Check if ABMC is enabled. By default, ABMC feature is disabled. >>> Monitoring works in legacy monitor mode when ABMC is not enabled. >>> >>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>> 0 >>> >> With the introduction of "mbm_assign_enable" the entry in mon_features seems >> to provide duplicate information. > > ok. We can remove the text in mon_features and keep mbm_assign_enable. We need this to enable and disable the feature. This could be improved beyond a binary "enable"/"disable" interface to user space. For example, the hardware can discover which "mbm counter assign" related feature (I'm counting the "soft RMID" here as one of the "mbm counter assign" related features) is supported on the platform and it can be presented to the user like: # cat /sys/fs/resctrl/info/L3_MON/mbm_assign [feature_1] feature_2 feature_3 The output indicates which features are supported by the platform and the brackets indicate which feature is enabled. >>> c. There will be new file "monitor_state" for each monitor group when ABMC >>> feature is supported. However, monitor_state is not available if ABMC is >>> disabled. >>> >>> #cat /sys/fs/resctrl/monitor_state >>> Unsupported >> This sounds potentially confusing since users will still be able to monitor >> the groups ... > How about "Assignment-Unsupported"? (please see later) >> >>> >>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC >>> enabled, monitoring will work in current mode without assignment option. >>> >>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >>> 779247936 >>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>> 765207488 >>> >>> e. Enable ABMC mode. >>> >>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>> 1 >>> >>> f. Read the monitor states. By default, both total and local MBM >>> events are in "unassign" state. >>> >>> #cat /sys/fs/resctrl/monitor_state >>> total=unassign;local=unassign >> This interface does not seem to take into account that hardware >> can support assignment per domain. I understand that this is >> not something you want to implement at this time but the user interface >> has to accommodate such an enhancement. This was already mentioned, and >> you did acknowledge the point [3] to this new version that does not >> reflect this is unexpected. > > Yea. Domain level assignment is not supported at this point. Do you want me to explicitly mention here? > > Please elaborate what you meant here. You have made it clear on several occasions that you do not intend to support domain level assignment. That may be ok but the interface you create should not prevent future support of domain level assignment. If my point is not clear, could you please share how this interface is able to support domain level assignment in the future? I am starting to think that we need a file similar to the schemata file for group and domain level monitor configurations. >> My previous suggestions do seem to still stand and and I also am not able to >> see how Peter's requests [2] were considered. This same interface needs to >> accommodate usages apart from ABMC. For example, how to use this interface >> to address the same counter issue on AMD hardware without ABMC, and MPAM >> (pending James's feedback). > > Yea. Agree. Peter's comments are not addressed. I am not all clear > about details of Peters and James requirement. Peter listed his requirements in [1]. That email thread is a worthwhile read for the use cases. I believe that James is aware of this work and do hope to hear from him. > > With respect to ABMC here are my requirements. > > a. Assignment needs to be done at group level. > > b. User should be able to assign each event individually. Multiple events assignment(in one command) should be supported. > > c. I have no plans to implement domain level assignment. It is done at system level. > > d. We need only couple of states. Assigned and unassigned. > > e. monitor_state is name of file for user interface. We can change that based on comments. > > Peter, James, > > Please comment on what you want achieve in "assignment" based on the features you are working on. > > Do you want to add new states? > >> >> I understand that until we hear from Arm we do not know all the requirements >> that this interface needs to support, but I do expect this interface to >> at least consider requirements and usage scenarios that are already known. > > Sure. Will try that in the next version. Lets continue the discussion. > > >>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, >>> the MBA events are not available until the user assigns the events >>> explicitly. >>> >>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >>> Unsupported >>> >>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>> Unsupported >>> >> This needs some more thought to accommodate Peter's scenario where the counter >> can be expected to return the final count after the counter is disabled. > > I am not sure how to achieve this with ABMC. This may be applicable > to soft rmid only. In case of "soft rmid", previous readings are > saved in the soft rmid state. Right. Please consider this work in two parts, first, there is a generic interface that aims to support ABMC, "soft RMID", and MPAM. Second, there is using this interface to support ABMC. >>> h. The event llc_occupancy is not affected by ABMC mode. Users can still >>> read the llc_occupancy. >>> >>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy >>> 557056 >>> >>> i. Now assign the total event and read the monitor_state. >>> >>> #echo total=assign > /sys/fs/resctrl/monitor_state >>> #cat /sys/fs/resctrl/monitor_state >>> total=assign;local=unassign >>> >> I do not see the "global assign/unassign" scenario addressed. > > I am not all clear on meaning of "global assign/unassign". Does it > mean looping thru all the groups and assign the RMIDs? Please see [1]. > It may not work in many cases. In case of ABMC, we have only limited > number of hw counters. It will fail after hardware runs out of > counters. It is better done selectively based on which group user is > interested in. Right. This is one more item where the generic interface needs to accommodate different hardware implementations. Perhaps this could be one of the "features" exposed by (global) mbm_assign that the user can "enable"/"disable" on demand? > But it can be done later if we find a use case for that. There already exists a use case as presented by Peter in support of AMD hardware without ABMC, no? >> This version seems to ignore (without discussion) a lot of earlier >> feedback. > > Please feel free comment. There are various threads of discussion. I may have missed. > Reinette [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
Hi Reinette, I am trying to propose few things here to move forward based on my assumptions. Please point me if I missed something. On 2/5/24 16:38, Reinette Chatre wrote: > Hi Babu, > > On 2/2/2024 1:57 PM, Moger, Babu wrote: >> On 2/1/2024 10:09 PM, Reinette Chatre wrote: >>> On 1/19/2024 10:22 AM, Babu Moger wrote: >>>> These series adds the support for Assignable Bandwidth Monitoring Counters >>> Not a good start ([1]). >> >> Yea. My bad. >> >>> >>>> (ABMC). It is also called QoS RMID Pinning feature >>>> >>>> The feature details are documented in the APM listed below [1]. >>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming >>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth >>>> Monitoring (ABMC). The documentation is available at >>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 >>>> >>>> The patches are based on top of commit >>>> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) >>>> >>>> # Introduction >>>> >>>> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring >>>> feature only guarantees that RMIDs currently assigned to a processor will >>>> be tracked by hardware. The counters of any other RMIDs which are no longer >>>> being tracked will be reset to zero. The MBM event counters return >>>> "Unavailable" for the RMIDs that are not active. >>>> Users can create 256 or more monitor groups. But there can be only limited >>>> number of groups that can be give guaranteed monitoring numbers. With ever >>> "can be given"? >> >> "can give guaranteed monitoring numbers." >> >> I feel this looks better. > > Sounds good. Thank you. > >> >>> >>>> changing configurations there is no way to definitely know which of these >>>> groups will be active for certain point of time. Users do not have the >>>> option to monitor a group or set of groups for certain period of time >>>> without worrying about RMID being reset in between. >>>> The ABMC feature provides an option to the user to assign an RMID to the >>>> hardware counter and monitor the bandwidth for a longer duration. >>>> The assigned RMID will be active until the user unassigns it manually. >>>> There is no need to worry about counters being reset during this period. >>>> Additionally, the user can specify a bitmask identifying the specific >>>> bandwidth types from the given source to track with the counter. >>>> >>>> Without ABMC enabled, monitoring will work in current mode without >>>> assignment option. >>>> >>>> # Linux Implementation >>>> >>>> Linux resctrl subsystem provides the interface to count maximum of two >>>> memory bandwidth events per group, from a combination of available total >>>> and local events. Keeping the current interface, users can assign a maximum >>>> of 2 ABMC counters per group. User will also have the option to assign only >>>> one counter to the group. If the system runs out of assignable ABMC >>>> counters, kernel will display an error. Users need to unassign an already >>>> assigned counter to make space for new assignments. >>>> >>>> >>>> # Examples >>>> >>>> a. Check if ABMC support is available >>>> #mount -t resctrl resctrl /sys/fs/resctrl/ >>>> >>>> #cat /sys/fs/resctrl/info/L3_MON/mon_features >>>> llc_occupancy >>>> mbm_total_bytes >>>> mbm_total_bytes_config >>>> mbm_local_bytes >>>> mbm_local_bytes_config >>>> mbm_assign_capable ← Linux kernel detected ABMC feature >>>> >>>> b. Check if ABMC is enabled. By default, ABMC feature is disabled. >>>> Monitoring works in legacy monitor mode when ABMC is not enabled. >>>> >>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>> 0 >>>> >>> With the introduction of "mbm_assign_enable" the entry in mon_features seems >>> to provide duplicate information. >> >> ok. We can remove the text in mon_features and keep mbm_assign_enable. We need this to enable and disable the feature. > > This could be improved beyond a binary "enable"/"disable" interface to user space. > For example, the hardware can discover which "mbm counter assign" related feature > (I'm counting the "soft RMID" here as one of the "mbm counter assign" related > features) is supported on the platform and it can be presented to the user like: > > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign > [feature_1] feature_2 feature_3 How about this? # cat /sys/fs/resctrl/info/L3_MON/mbm_assign ABMC:Capable SOFT-RMID:Capable To enable ABMC # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign When ABMC is enabled: # cat /sys/fs/resctrl/info/L3_MON/mbm_assign ABMC:Enable SOFT-RMID:Capable > The output indicates which features are supported by the platform and the brackets indicate > which feature is enabled. > > >>>> c. There will be new file "monitor_state" for each monitor group when ABMC >>>> feature is supported. However, monitor_state is not available if ABMC is >>>> disabled. >>>> >>>> #cat /sys/fs/resctrl/monitor_state >>>> Unsupported >>> This sounds potentially confusing since users will still be able to monitor >>> the groups ... >> How about "Assignment-Unsupported"? > > (please see later) > >>> >>>> >>>> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC >>>> enabled, monitoring will work in current mode without assignment option. >>>> >>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >>>> 779247936 >>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>> 765207488 >>>> >>>> e. Enable ABMC mode. >>>> >>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>> 1 >>>> >>>> f. Read the monitor states. By default, both total and local MBM >>>> events are in "unassign" state. >>>> >>>> #cat /sys/fs/resctrl/monitor_state >>>> total=unassign;local=unassign >>> This interface does not seem to take into account that hardware >>> can support assignment per domain. I understand that this is >>> not something you want to implement at this time but the user interface >>> has to accommodate such an enhancement. This was already mentioned, and >>> you did acknowledge the point [3] to this new version that does not >>> reflect this is unexpected. >> >> Yea. Domain level assignment is not supported at this point. Do you want me to explicitly mention here? >> >> Please elaborate what you meant here. > > You have made it clear on several occasions that you do not intend to support > domain level assignment. That may be ok but the interface you create should > not prevent future support of domain level assignment. > > If my point is not clear, could you please share how this interface is able to > support domain level assignment in the future? > > I am starting to think that we need a file similar to the schemata file > for group and domain level monitor configurations. Something like this? By default #cat /sys/fs/resctrl/monitor_state default:0=total=assign,local=assign;1=total=assign,local=assign With ABMC, #cat /sys/fs/resctrl/monitor_state ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign > >>> My previous suggestions do seem to still stand and and I also am not able to >>> see how Peter's requests [2] were considered. This same interface needs to >>> accommodate usages apart from ABMC. For example, how to use this interface >>> to address the same counter issue on AMD hardware without ABMC, and MPAM >>> (pending James's feedback). >> >> Yea. Agree. Peter's comments are not addressed. I am not all clear >> about details of Peters and James requirement. > > Peter listed his requirements in [1]. That email thread is a worthwhile read > for the use cases. > > I believe that James is aware of this work and do hope to hear from him. > >> >> With respect to ABMC here are my requirements. >> >> a. Assignment needs to be done at group level. >> >> b. User should be able to assign each event individually. Multiple events assignment(in one command) should be supported. >> >> c. I have no plans to implement domain level assignment. It is done at system level. >> >> d. We need only couple of states. Assigned and unassigned. >> >> e. monitor_state is name of file for user interface. We can change that based on comments. >> >> Peter, James, >> >> Please comment on what you want achieve in "assignment" based on the features you are working on. >> >> Do you want to add new states? >> >>> >>> I understand that until we hear from Arm we do not know all the requirements >>> that this interface needs to support, but I do expect this interface to >>> at least consider requirements and usage scenarios that are already known. >> >> Sure. Will try that in the next version. Lets continue the discussion. >> >> >>>> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, >>>> the MBA events are not available until the user assigns the events >>>> explicitly. >>>> >>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >>>> Unsupported >>>> >>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>> Unsupported >>>> >>> This needs some more thought to accommodate Peter's scenario where the counter >>> can be expected to return the final count after the counter is disabled. >> >> I am not sure how to achieve this with ABMC. This may be applicable >> to soft rmid only. In case of "soft rmid", previous readings are >> saved in the soft rmid state. > > Right. Please consider this work in two parts, first, there is a generic > interface that aims to support ABMC, "soft RMID", and MPAM. Second, there > is using this interface to support ABMC. Yea. But it is tough without knowing all the details of the other features. How about? #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes ABMC:Unassigned #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes ABMC:Unassigned > >>>> h. The event llc_occupancy is not affected by ABMC mode. Users can still >>>> read the llc_occupancy. >>>> >>>> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy >>>> 557056 >>>> >>>> i. Now assign the total event and read the monitor_state. >>>> >>>> #echo total=assign > /sys/fs/resctrl/monitor_state >>>> #cat /sys/fs/resctrl/monitor_state >>>> total=assign;local=unassign >>>> >>> I do not see the "global assign/unassign" scenario addressed. >> >> I am not all clear on meaning of "global assign/unassign". Does it >> mean looping thru all the groups and assign the RMIDs? > > Please see [1]. > > >> It may not work in many cases. In case of ABMC, we have only limited >> number of hw counters. It will fail after hardware runs out of >> counters. It is better done selectively based on which group user is >> interested in. > > Right. This is one more item where the generic interface needs to > accommodate different hardware implementations. Perhaps this could > be one of the "features" exposed by (global) mbm_assign that the > user can "enable"/"disable" on demand? > >> But it can be done later if we find a use case for that. > > There already exists a use case as presented by Peter in support > of AMD hardware without ABMC, no? Yes. There is a use case. But seems like the use case is mostly applicable to soft-rmid feature. We can tie the global assign only to soft-rmid. # echo SOFT-RMID:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign Because this is soft-rmid, call global assign method. # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign Because this is ABMC, do the steps required just to enable ABMC. Don't do individual assignment > >>> This version seems to ignore (without discussion) a lot of earlier >>> feedback. >> >> Please feel free comment. There are various threads of discussion. I may have missed. >> > > > Reinette > > [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
Hi Babu, On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote: > On 2/5/24 16:38, Reinette Chatre wrote: > > This could be improved beyond a binary "enable"/"disable" interface to user space. > > For example, the hardware can discover which "mbm counter assign" related feature > > (I'm counting the "soft RMID" here as one of the "mbm counter assign" related > > features) is supported on the platform and it can be presented to the user like: > > > > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign > > [feature_1] feature_2 feature_3 > > How about this? > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign > ABMC:Capable > SOFT-RMID:Capable > > To enable ABMC > # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign > > When ABMC is enabled: > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign > ABMC:Enable > SOFT-RMID:Capable There would be no need to use soft RMIDs on a system that supports ABMC, so I can't think of a reason why the underlying implementation would matter to our users. The user should only have to request the interface where monitors must be assigned manually. The mount would succeed if the system has a way to support the interface. > > You have made it clear on several occasions that you do not intend to support > > domain level assignment. That may be ok but the interface you create should > > not prevent future support of domain level assignment. > > > > If my point is not clear, could you please share how this interface is able to > > support domain level assignment in the future? > > > > I am starting to think that we need a file similar to the schemata file > > for group and domain level monitor configurations. > > Something like this? > > By default > #cat /sys/fs/resctrl/monitor_state > default:0=total=assign,local=assign;1=total=assign,local=assign > > With ABMC, > #cat /sys/fs/resctrl/monitor_state > ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign The benefit from all the string parsing in this interface is only halving the number of monitor_state sysfs writes we'd need compared to creating a separate file for mbm_local and mbm_total. Given that our use case is to assign the 32 assignable counters to read the bandwidth of ~256 monitoring groups, this isn't a substantial gain to help us. I think you should just focus on providing the necessary control granularity without trying to consolidate writes in this interface. I will propose an additional interface below to optimize our use case. Whether mbm_total and mbm_local are combined in the group directories or not, I don't see why you wouldn't just repeat the same file interface in the domain directories for a user needing finer-grained controls. > >> Peter, James, > >> > >> Please comment on what you want achieve in "assignment" based on the features you are working on. I prototyped and tested the following additional interface for the large-scale, batch use case that we're primarily concerned about: info/L3_MON/mbm_{local,total}_bytes_assigned Writing a whitespace-delimited list of mongroup directory paths does the following: 1. unassign all monitors for the given counter 2. assigns a monitor to each mongroup referenced in the write 3. batches per-domain register updates resulting from the assignments into a single IPI for each domain This interface allows us to do less sysfs writes and IPIs on systems with more assignable monitoring resources, rather than doing more. The reference to a mongroup when reading/writing the above node is the resctrl-root-relative path to the monitoring group. There is probably a more concise way to refer to the groups, but my prototype used kernfs_walk_and_get() to locate each rdtgroup struct. I would also like to add that in the software-ABMC prototype I made, because it's based on assignment of a small number of RMIDs, assignment results in all counters being assigned at once. On implementations where per-counter assignments aren't possible, assignment through such a resource would be allowed to assign more resources than explicitly requested. This would allow an implementation only capable of global assignment to assign resources to all groups when a non-empty string is written to the proposed file nodes, and all resources to be unassigned when an empty string is written. Reading back from the file nodes would tell the user how much was actually assigned. Thanks! -Peter
Hi Peter, On 2/16/24 14:18, Peter Newman wrote: > Hi Babu, > > On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote: >> On 2/5/24 16:38, Reinette Chatre wrote: >>> This could be improved beyond a binary "enable"/"disable" interface to user space. >>> For example, the hardware can discover which "mbm counter assign" related feature >>> (I'm counting the "soft RMID" here as one of the "mbm counter assign" related >>> features) is supported on the platform and it can be presented to the user like: >>> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign >>> [feature_1] feature_2 feature_3 >> >> How about this? >> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign >> ABMC:Capable >> SOFT-RMID:Capable >> >> To enable ABMC >> # echo ABMC:enable > /sys/fs/resctrl/info/L3_MON/mbm_assign >> >> When ABMC is enabled: >> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign >> ABMC:Enable >> SOFT-RMID:Capable > > There would be no need to use soft RMIDs on a system that supports > ABMC, so I can't think of a reason why the underlying implementation > would matter to our users. The user should only have to request the > interface where monitors must be assigned manually. The mount would > succeed if the system has a way to support the interface. Ok Sure. I will exclude Soft-rmid for this interface. For now, lets keep this only for ABMC. # cat /sys/fs/resctrl/info/L3_MON/mbm_assign ABMC:Capable Or # cat /sys/fs/resctrl/info/L3_MON/mbm_assign ABMC:Enable > > >>> You have made it clear on several occasions that you do not intend to support >>> domain level assignment. That may be ok but the interface you create should >>> not prevent future support of domain level assignment. >>> >>> If my point is not clear, could you please share how this interface is able to >>> support domain level assignment in the future? >>> >>> I am starting to think that we need a file similar to the schemata file >>> for group and domain level monitor configurations. >> >> Something like this? >> >> By default >> #cat /sys/fs/resctrl/monitor_state >> default:0=total=assign,local=assign;1=total=assign,local=assign >> >> With ABMC, >> #cat /sys/fs/resctrl/monitor_state >> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign > > The benefit from all the string parsing in this interface is only > halving the number of monitor_state sysfs writes we'd need compared to > creating a separate file for mbm_local and mbm_total. Given that our > use case is to assign the 32 assignable counters to read the bandwidth > of ~256 monitoring groups, this isn't a substantial gain to help us. I > think you should just focus on providing the necessary control > granularity without trying to consolidate writes in this interface. I Ok. Looks like we need to provide the interface to assign the RMIDs to individual domains in this interface. I wasn't planning that now. But, it can be done without much changes. Something like this(corrected typos: replaced '=' with '-'). #cat /sys/fs/resctrl/monitor_state ABMC:0=total-unassign,local-unassign;1=total-unassign,local-unassign To assign: #echo "ABMC:0=total-assign,local-assign" > /sys/fs/resctrl/monitor_state > will propose an additional interface below to optimize our use case. > > Whether mbm_total and mbm_local are combined in the group directories > or not, I don't see why you wouldn't just repeat the same file > interface in the domain directories for a user needing finer-grained > controls. I don't see the need for the same file inside each domain directory in the group level when the above command can assign the RMIDs per domain. > > >>>> Peter, James, >>>> >>>> Please comment on what you want achieve in "assignment" based on the features you are working on. > > I prototyped and tested the following additional interface for the > large-scale, batch use case that we're primarily concerned about: > > info/L3_MON/mbm_{local,total}_bytes_assigned > > Writing a whitespace-delimited list of mongroup directory paths does > the following: > 1. unassign all monitors for the given counter > 2. assigns a monitor to each mongroup referenced in the write > 3. batches per-domain register updates resulting from the assignments > into a single IPI for each domain > > This interface allows us to do less sysfs writes and IPIs on systems > with more assignable monitoring resources, rather than doing more. > > The reference to a mongroup when reading/writing the above node is the > resctrl-root-relative path to the monitoring group. There is probably > a more concise way to refer to the groups, but my prototype used > kernfs_walk_and_get() to locate each rdtgroup struct. > > I would also like to add that in the software-ABMC prototype I made, > because it's based on assignment of a small number of RMIDs, > assignment results in all counters being assigned at once. On > implementations where per-counter assignments aren't possible, > assignment through such a resource would be allowed to assign more > resources than explicitly requested. > > This would allow an implementation only capable of global assignment > to assign resources to all groups when a non-empty string is written > to the proposed file nodes, and all resources to be unassigned when an > empty string is written. Reading back from the file nodes would tell > the user how much was actually assigned. Yes. This interface can be extended to ABMC as a global assignment option. If you have your patches ready I can add your patches on top of my ABMC feature. Or if you want to add the support later then I will go ahead with current base ABMC support. Let me know.
Hi Babu, On 19/01/2024 18:22, Babu Moger wrote: > These series adds the support for Assignable Bandwidth Monitoring Counters > (ABMC). It is also called QoS RMID Pinning feature > > The feature details are documented in the APM listed below [1]. > [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming > Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth > Monitoring (ABMC). The documentation is available at > Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 > > The patches are based on top of commit > 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) > > # Introduction > > AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring > feature only guarantees that RMIDs currently assigned to a processor will > be tracked by hardware. The counters of any other RMIDs which are no longer > being tracked will be reset to zero. The MBM event counters return > "Unavailable" for the RMIDs that are not active. > > Users can create 256 or more monitor groups. But there can be only limited > number of groups that can be give guaranteed monitoring numbers. With ever > changing configurations there is no way to definitely know which of these > groups will be active for certain point of time. Users do not have the > option to monitor a group or set of groups for certain period of time > without worrying about RMID being reset in between. > > The ABMC feature provides an option to the user to assign an RMID to the > hardware counter and monitor the bandwidth for a longer duration. > The assigned RMID will be active until the user unassigns it manually. > There is no need to worry about counters being reset during this period. > Additionally, the user can specify a bitmask identifying the specific > bandwidth types from the given source to track with the counter. At a high level, if existing software can't use the counters, I'd prefer we move them into perf. We're currently re-inventing the perf wheel. (this argument doesn't hold for the llc_occupancy, which is a state not counter!) But if this lets someone 'pin' the counters for the groups they monitor, then use existing tools, that seems a good enough argument for doing this. > Without ABMC enabled, monitoring will work in current mode without > assignment option. To check I understand: the counters will get spuriously reset a the whim of the hardware? > # Linux Implementation > > Linux resctrl subsystem provides the interface to count maximum of two > memory bandwidth events per group, from a combination of available total > and local events. Keeping the current interface, users can assign a maximum > of 2 ABMC counters per group. User will also have the option to assign only > one counter to the group. If the system runs out of assignable ABMC > counters, kernel will display an error. Users need to unassign an already > assigned counter to make space for new assignments. > > > # Examples > > a. Check if ABMC support is available > #mount -t resctrl resctrl /sys/fs/resctrl/ > > #cat /sys/fs/resctrl/info/L3_MON/mon_features > llc_occupancy > mbm_total_bytes > mbm_total_bytes_config > mbm_local_bytes > mbm_local_bytes_config > mbm_assign_capable ← Linux kernel detected ABMC feature > > b. Check if ABMC is enabled. By default, ABMC feature is disabled. > Monitoring works in legacy monitor mode when ABMC is not enabled. > > #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable > 0 > > c. There will be new file "monitor_state" for each monitor group when ABMC > feature is supported. However, monitor_state is not available if ABMC is > disabled. > #cat /sys/fs/resctrl/monitor_state > Unsupported > > d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC > enabled, monitoring will work in current mode without assignment option. > > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes > 779247936 > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > 765207488 > > e. Enable ABMC mode. > > #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable > #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable > 1 Why does this mode need enabling? Can't it be enabled automatically on hardware that supports it, or enabled implicitly when the first assignment attempt arrives? I guess this is really needed for a reset - could we implement that instead? This way there isn't an extra step user-space has to do to make the assignments work. > f. Read the monitor states. By default, both total and local MBM > events are in "unassign" state. > > #cat /sys/fs/resctrl/monitor_state > total=unassign;local=unassign > g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, > the MBA events are not available until the user assigns the events > explicitly. How does this fit with "monitoring will work in current mode without assignment option.". You mentioned the hardware resets the counters when this mode is enabled, does it also refuse to count until the MSR is programmed? If so - is there any mileage in auto-assigning the first N RMID to counters when the groups are created? This way existing user-space tools work until they exceed the limits of hardware. From that point a counter needs to be unassigned from another group. (we'd need to make it easy to find which groups have a counter assigned) > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes > Unsupported > > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > Unsupported > > h. The event llc_occupancy is not affected by ABMC mode. Users can still > read the llc_occupancy. > > #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy > 557056 { MPAM would be the same - because llc_occupancy isn't a counter its a view of the state, its possible to multiplex a single llc_occupancy counter behind the scenes to provide the value for as many groups as needed. I suspect any other architecture would have the same property. } > i. Now assign the total event and read the monitor_state. > > #echo total=assign > /sys/fs/resctrl/monitor_state > #cat /sys/fs/resctrl/monitor_state > total=assign;local=unassign > > j. Now that the total event is assigned. Read the total event. > > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes > 6136000 > > k. Now assign the local event and read the monitor_state. > > #echo local=assign > /sys/fs/resctrl/monitor_state > #cat /sys/fs/resctrl/monitor_state > total=assign;local=assign > > Users can also assign both total and local events in one single > command. > > l. Now that both total and local events are assigned, read the events. > > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes > 6136000 > #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > 58694 (the bandwidth configuration stuff is the existing BMEC support right?) From user-space's perspective MPAM could be made to look the same. There ought to be some indication to user-space of how many counters it can assign, this number might be different for different resources. This won't be a problem today, but if we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different. MPAM platforms are unlikely to support both 'mbm_total' and 'mbm_local', I think this is just a documentation problem to say that mbm_local can't be configured if its not supported - user-space can't blindly assign both. If the configuration is changed over time - I bet user-space needs a quick way to find where the counters are currently assigned - walking the tree to find out is a bit rubbish. A file that lists the "control_group_name[/mon_group_name]" would help. Thanks, James
Hi Peter, Babu, On 16/02/2024 20:18, Peter Newman wrote: > On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote: >> On 2/5/24 16:38, Reinette Chatre wrote: >>> You have made it clear on several occasions that you do not intend to support >>> domain level assignment. That may be ok but the interface you create should >>> not prevent future support of domain level assignment. >>> >>> If my point is not clear, could you please share how this interface is able to >>> support domain level assignment in the future? >>> >>> I am starting to think that we need a file similar to the schemata file >>> for group and domain level monitor configurations. >> >> Something like this? >> >> By default >> #cat /sys/fs/resctrl/monitor_state >> default:0=total=assign,local=assign;1=total=assign,local=assign >> >> With ABMC, >> #cat /sys/fs/resctrl/monitor_state >> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign > > The benefit from all the string parsing in this interface is only > halving the number of monitor_state sysfs writes we'd need compared to > creating a separate file for mbm_local and mbm_total. Given that our > use case is to assign the 32 assignable counters to read the bandwidth > of ~256 monitoring groups, this isn't a substantial gain to help us. I > think you should just focus on providing the necessary control > granularity without trying to consolidate writes in this interface. I > will propose an additional interface below to optimize our use case. > > Whether mbm_total and mbm_local are combined in the group directories > or not, I don't see why you wouldn't just repeat the same file > interface in the domain directories for a user needing finer-grained > controls. I don't follow why this has to be done globally. resctrl allows CLOSID to have different configurations for different purposes between different domains (as long as tasks are pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too. MPAM can equally allocate monitors/counters per-domain. If we are ever going to have per-domain assignment, I think its worth the extra work to do that now and avoid the extra user-space interface baggage from the global version. >>>> Peter, James, >>>> >>>> Please comment on what you want achieve in "assignment" based on the features you are working on. > > I prototyped and tested the following additional interface for the > large-scale, batch use case that we're primarily concerned about: > > info/L3_MON/mbm_{local,total}_bytes_assigned > > Writing a whitespace-delimited list of mongroup directory paths does | mkdir /sys/fs/resctrl/my\ group string parsing in the kernel is rarely fun! > the following: > 1. unassign all monitors for the given counter > 2. assigns a monitor to each mongroup referenced in the write > 3. batches per-domain register updates resulting from the assignments > into a single IPI for each domain > > This interface allows us to do less sysfs writes and IPIs on systems > with more assignable monitoring resources, rather than doing more. > > The reference to a mongroup when reading/writing the above node is the > resctrl-root-relative path to the monitoring group. There is probably > a more concise way to refer to the groups, but my prototype used > kernfs_walk_and_get() to locate each rdtgroup struct. If this file were re-used for finding where the monitors were currently allocated, using the name would be a natural fit for building a path to un-assign one group. > I would also like to add that in the software-ABMC prototype I made, > because it's based on assignment of a small number of RMIDs, > assignment results in all counters being assigned at once. On > implementations where per-counter assignments aren't possible, > assignment through such a resource would be allowed to assign more > resources than explicitly requested. > > This would allow an implementation only capable of global assignment Do we know if this exists? Given the configurations have to be different for a domain, I'd be surprised if counter configuration is somehow distributed between domains. > to assign resources to all groups when a non-empty string is written > to the proposed file nodes, and all resources to be unassigned when an > empty string is written. Reading back from the file nodes would tell > the user how much was actually assigned. What do you mean by 'how much', is this allow to fail early? That feels a bit counter-intuitive. As this starts with a reset, if the number of counters is known - it should be easy for user-space to know it can only write X tokens into that file. Thanks, James
Hi James, On Tue, Feb 20, 2024 at 7:21 AM James Morse <james.morse@arm.com> wrote: > On 16/02/2024 20:18, Peter Newman wrote: > > On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote: > >> On 2/5/24 16:38, Reinette Chatre wrote: > >>> You have made it clear on several occasions that you do not intend to support > >>> domain level assignment. That may be ok but the interface you create should > >>> not prevent future support of domain level assignment. > >>> > >>> If my point is not clear, could you please share how this interface is able to > >>> support domain level assignment in the future? > >>> > >>> I am starting to think that we need a file similar to the schemata file > >>> for group and domain level monitor configurations. > >> > >> Something like this? > >> > >> By default > >> #cat /sys/fs/resctrl/monitor_state > >> default:0=total=assign,local=assign;1=total=assign,local=assign > >> > >> With ABMC, > >> #cat /sys/fs/resctrl/monitor_state > >> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign > > > > The benefit from all the string parsing in this interface is only > > halving the number of monitor_state sysfs writes we'd need compared to > > creating a separate file for mbm_local and mbm_total. Given that our > > use case is to assign the 32 assignable counters to read the bandwidth > > of ~256 monitoring groups, this isn't a substantial gain to help us. I > > think you should just focus on providing the necessary control > > granularity without trying to consolidate writes in this interface. I > > will propose an additional interface below to optimize our use case. > > > > Whether mbm_total and mbm_local are combined in the group directories > > or not, I don't see why you wouldn't just repeat the same file > > interface in the domain directories for a user needing finer-grained > > controls. > > I don't follow why this has to be done globally. resctrl allows CLOSID to have different > configurations for different purposes between different domains (as long as tasks are > pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too. Assigning to all domains at once would allow us to better parallelize the resulting IPIs when we do need to iterate a small set of monitors over a large list of groups. > > I prototyped and tested the following additional interface for the > > large-scale, batch use case that we're primarily concerned about: > > > > info/L3_MON/mbm_{local,total}_bytes_assigned > > > > Writing a whitespace-delimited list of mongroup directory paths does > > | mkdir /sys/fs/resctrl/my\ group > > string parsing in the kernel is rarely fun! Hopefully restricting to a newline-delimited list will keep this fun and easy then. Otherwise if referring to many groups in a single write isn't a viable path forward, I'll still need to find a way to address the fs/syscall/IPI overhead of measuring the bandwidth of a large number of groups. > > > > the following: > > 1. unassign all monitors for the given counter > > 2. assigns a monitor to each mongroup referenced in the write > > 3. batches per-domain register updates resulting from the assignments > > into a single IPI for each domain > > > > This interface allows us to do less sysfs writes and IPIs on systems > > with more assignable monitoring resources, rather than doing more. > > > > The reference to a mongroup when reading/writing the above node is the > > resctrl-root-relative path to the monitoring group. There is probably > > a more concise way to refer to the groups, but my prototype used > > kernfs_walk_and_get() to locate each rdtgroup struct. > > If this file were re-used for finding where the monitors were currently allocated, using > the name would be a natural fit for building a path to un-assign one group. > > > > I would also like to add that in the software-ABMC prototype I made, > > because it's based on assignment of a small number of RMIDs, > > assignment results in all counters being assigned at once. On > > implementations where per-counter assignments aren't possible, > > assignment through such a resource would be allowed to assign more > > resources than explicitly requested. > > > > This would allow an implementation only capable of global assignment > > Do we know if this exists? Given the configurations have to be different for a domain, I'd > be surprised if counter configuration is somehow distributed between domains. It's currently only a proposal[1] for mitigating the context switch overhead cost of soft RMIDs. I'm looking at the other alternative first, though. > > to assign resources to all groups when a non-empty string is written > > to the proposed file nodes, and all resources to be unassigned when an > > empty string is written. Reading back from the file nodes would tell > > the user how much was actually assigned. > > What do you mean by 'how much', is this allow to fail early? That feels a bit > counter-intuitive. As this starts with a reset, if the number of counters is known - it > should be easy for user-space to know it can only write X tokens into that file. I was referring to the operation assigning more groups than requested if the implementation is only capable of a master enable/disable for all monitoring: reading back would indicate that all monitoring groups are in the assigned list. There would otherwise be an interface telling the user how many monitors can be assigned, so there's no reason to expect this operation to fail, short of the user doing something silly like deleting a group while it's concurrently being assigned. -Peter [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
Hi Babu, On 20/02/2024 15:21, James Morse wrote: > There ought to be some indication to user-space of how many counters it can assign, this > number might be different for different resources. This won't be a problem today, but if > we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different. I found this in patch 6 - sorry for the noise! Thanks, James
Hi James, On 2/20/24 09:21, James Morse wrote: > Hi Babu, > > On 19/01/2024 18:22, Babu Moger wrote: >> These series adds the support for Assignable Bandwidth Monitoring Counters >> (ABMC). It is also called QoS RMID Pinning feature >> >> The feature details are documented in the APM listed below [1]. >> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming >> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth >> Monitoring (ABMC). The documentation is available at >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 >> >> The patches are based on top of commit >> 1ac6b49423e83af2abed9be7fbdf2e491686c66b (tip/master) >> >> # Introduction >> >> AMD hardware can support 256 or more RMIDs. However, bandwidth monitoring >> feature only guarantees that RMIDs currently assigned to a processor will >> be tracked by hardware. The counters of any other RMIDs which are no longer >> being tracked will be reset to zero. The MBM event counters return >> "Unavailable" for the RMIDs that are not active. >> >> Users can create 256 or more monitor groups. But there can be only limited >> number of groups that can be give guaranteed monitoring numbers. With ever >> changing configurations there is no way to definitely know which of these >> groups will be active for certain point of time. Users do not have the >> option to monitor a group or set of groups for certain period of time >> without worrying about RMID being reset in between. >> >> The ABMC feature provides an option to the user to assign an RMID to the >> hardware counter and monitor the bandwidth for a longer duration. >> The assigned RMID will be active until the user unassigns it manually. >> There is no need to worry about counters being reset during this period. >> Additionally, the user can specify a bitmask identifying the specific >> bandwidth types from the given source to track with the counter. > > At a high level, if existing software can't use the counters, I'd prefer we move them into > perf. We're currently re-inventing the perf wheel. (this argument doesn't hold for the > llc_occupancy, which is a state not counter!) > > But if this lets someone 'pin' the counters for the groups they monitor, then use existing > tools, that seems a good enough argument for doing this. Not sure if I understand this. Yes. This feature provides the option to pin the counters to the monitor group. > > >> Without ABMC enabled, monitoring will work in current mode without >> assignment option. > > To check I understand: the counters will get spuriously reset a the whim of the hardware? [1] Not spuriously. Hardware can keep track of certain number of counters active simultaneously (active counters). If there are more monitor groups than the hardware can track, then only most recent associations are kept active. The active set can change based on user actions(RMID association changes from user). This feature can help to pin a counter so it does not reset. > > >> # Linux Implementation >> >> Linux resctrl subsystem provides the interface to count maximum of two >> memory bandwidth events per group, from a combination of available total >> and local events. Keeping the current interface, users can assign a maximum >> of 2 ABMC counters per group. User will also have the option to assign only >> one counter to the group. If the system runs out of assignable ABMC >> counters, kernel will display an error. Users need to unassign an already >> assigned counter to make space for new assignments. >> >> >> # Examples >> >> a. Check if ABMC support is available >> #mount -t resctrl resctrl /sys/fs/resctrl/ >> >> #cat /sys/fs/resctrl/info/L3_MON/mon_features >> llc_occupancy >> mbm_total_bytes >> mbm_total_bytes_config >> mbm_local_bytes >> mbm_local_bytes_config >> mbm_assign_capable ← Linux kernel detected ABMC feature >> >> b. Check if ABMC is enabled. By default, ABMC feature is disabled. >> Monitoring works in legacy monitor mode when ABMC is not enabled. >> >> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> 0 >> >> c. There will be new file "monitor_state" for each monitor group when ABMC >> feature is supported. However, monitor_state is not available if ABMC is >> disabled. >> #cat /sys/fs/resctrl/monitor_state >> Unsupported >> >> d. Read the event mbm_total_bytes and mbm_local_bytes. Without ABMC >> enabled, monitoring will work in current mode without assignment option. >> >> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> 779247936 >> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >> 765207488 >> >> e. Enable ABMC mode. >> >> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >> 1 > > Why does this mode need enabling? Can't it be enabled automatically on hardware that > supports it, or enabled implicitly when the first assignment attempt arrives? > > I guess this is really needed for a reset - could we implement that instead? This way > there isn't an extra step user-space has to do to make the assignments work. Mostly the new features are added as an opt-in method. So, kept it that way. If we enable this feature automatically, then we have provide an option to disable it. > > >> f. Read the monitor states. By default, both total and local MBM >> events are in "unassign" state. >> >> #cat /sys/fs/resctrl/monitor_state >> total=unassign;local=unassign > > >> g. Read the event mbm_total_bytes and mbm_local_bytes. In ABMC mode, >> the MBA events are not available until the user assigns the events >> explicitly. > > How does this fit with "monitoring will work in current mode without assignment option.". See my response above. [1] > You mentioned the hardware resets the counters when this mode is enabled, does it also > refuse to count until the MSR is programmed? Yes. That is correct. We need to program the MSRs to start counting again. > > If so - is there any mileage in auto-assigning the first N RMID to counters when the > groups are created? This way existing user-space tools work until they exceed the limits > of hardware. From that point a counter needs to be unassigned from another group. (we'd > need to make it easy to find which groups have a counter assigned) Yes. That is correct. To see the state of assignment, I have added a monitor_state in each group to see if the counters are assigned to that group. > > >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> Unsupported >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >> Unsupported >> >> h. The event llc_occupancy is not affected by ABMC mode. Users can still >> read the llc_occupancy. >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy >> 557056 > > { > MPAM would be the same - because llc_occupancy isn't a counter its a view of the > state, its possible to multiplex a single llc_occupancy counter behind the scenes > to provide the value for as many groups as needed. I suspect any other > architecture would have the same property. ok. Good to know. > } > >> i. Now assign the total event and read the monitor_state. >> >> #echo total=assign > /sys/fs/resctrl/monitor_state >> #cat /sys/fs/resctrl/monitor_state >> total=assign;local=unassign >> >> j. Now that the total event is assigned. Read the total event. >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> 6136000 >> >> k. Now assign the local event and read the monitor_state. >> >> #echo local=assign > /sys/fs/resctrl/monitor_state >> #cat /sys/fs/resctrl/monitor_state >> total=assign;local=assign >> >> Users can also assign both total and local events in one single >> command. >> >> l. Now that both total and local events are assigned, read the events. >> >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes >> 6136000 >> #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >> 58694 > > (the bandwidth configuration stuff is the existing BMEC support right?) Yes. correct. > > From user-space's perspective MPAM could be made to look the same. > > There ought to be some indication to user-space of how many counters it can assign, this > number might be different for different resources. This won't be a problem today, but if > we had 'mbm_total_bytes' on the L2 cache, the number of counters may be different. > > MPAM platforms are unlikely to support both 'mbm_total' and 'mbm_local', I think this is Ok. Good to know. > just a documentation problem to say that mbm_local can't be configured if its not > supported - user-space can't blindly assign both. > > If the configuration is changed over time - I bet user-space needs a quick way to find > where the counters are currently assigned - walking the tree to find out is a bit rubbish. > A file that lists the "control_group_name[/mon_group_name]" would help. Looks like you already found in here. https://lore.kernel.org/lkml/c16cac16c813a203390229d77d5ab37ebc923d95.1705688539.git.babu.moger@amd.com/ > > > Thanks, > > James
On 2/20/2024 12:48 PM, Moger, Babu wrote: > On 2/20/24 09:21, James Morse wrote: >> On 19/01/2024 18:22, Babu Moger wrote: >>> e. Enable ABMC mode. >>> >>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>> 1 >> >> Why does this mode need enabling? Can't it be enabled automatically on hardware that >> supports it, or enabled implicitly when the first assignment attempt arrives? >> >> I guess this is really needed for a reset - could we implement that instead? This way >> there isn't an extra step user-space has to do to make the assignments work. > > Mostly the new features are added as an opt-in method. So, kept it that > way. If we enable this feature automatically, then we have provide an > option to disable it. > At the same time it sounds to me like ABMC can improve current users' experience without requiring them to do anything. This sounds appealing. For example, if I understand correctly, it may be possible to start resctrl with ABMC enabled by default and the number of monitoring groups (currently exposed to user space via "num_rmids") limited to the number of counters supported by ABMC. Existing users would then by default obtain better behavior of counters not resetting. The "new feature" could then be viewed as adding support for more monitoring groups than what hardware can support concurrently. Reinette
Hi Reinette, On 2/23/24 11:17, Reinette Chatre wrote: > > > On 2/20/2024 12:48 PM, Moger, Babu wrote: >> On 2/20/24 09:21, James Morse wrote: >>> On 19/01/2024 18:22, Babu Moger wrote: > >>>> e. Enable ABMC mode. >>>> >>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>> 1 >>> >>> Why does this mode need enabling? Can't it be enabled automatically on hardware that >>> supports it, or enabled implicitly when the first assignment attempt arrives? >>> >>> I guess this is really needed for a reset - could we implement that instead? This way >>> there isn't an extra step user-space has to do to make the assignments work. >> >> Mostly the new features are added as an opt-in method. So, kept it that >> way. If we enable this feature automatically, then we have provide an >> option to disable it. >> > > At the same time it sounds to me like ABMC can improve current users' > experience without requiring them to do anything. This sounds appealing. > For example, if I understand correctly, it may be possible to start resctrl > with ABMC enabled by default and the number of monitoring groups (currently > exposed to user space via "num_rmids") limited to the number of counters > supported by ABMC. Existing users would then by default obtain better behavior > of counters not resetting. Yes, I like the idea. But i will break compatibility with pqos tool(intel_cmt_cat utility). pqos tool monitoring will not work without supporting ABMC enablement in the tool. ABMC feature requires an extra step to assign the counters for monitor to work. > > The "new feature" could then be viewed as adding support for more monitoring > groups than what hardware can support concurrently. > > Reinette
On 2/20/2024 12:11 PM, Peter Newman wrote: > Hi James, > > On Tue, Feb 20, 2024 at 7:21 AM James Morse <james.morse@arm.com> wrote: >> On 16/02/2024 20:18, Peter Newman wrote: >>> On Thu, Feb 8, 2024 at 9:29 AM Moger, Babu <babu.moger@amd.com> wrote: >>>> On 2/5/24 16:38, Reinette Chatre wrote: >>>>> You have made it clear on several occasions that you do not intend to support >>>>> domain level assignment. That may be ok but the interface you create should >>>>> not prevent future support of domain level assignment. >>>>> >>>>> If my point is not clear, could you please share how this interface is able to >>>>> support domain level assignment in the future? >>>>> >>>>> I am starting to think that we need a file similar to the schemata file >>>>> for group and domain level monitor configurations. >>>> Something like this? >>>> >>>> By default >>>> #cat /sys/fs/resctrl/monitor_state >>>> default:0=total=assign,local=assign;1=total=assign,local=assign >>>> >>>> With ABMC, >>>> #cat /sys/fs/resctrl/monitor_state >>>> ABMC:0=total=unassign,local=unassign;1=total=unassign,local=unassign >>> The benefit from all the string parsing in this interface is only >>> halving the number of monitor_state sysfs writes we'd need compared to >>> creating a separate file for mbm_local and mbm_total. Given that our >>> use case is to assign the 32 assignable counters to read the bandwidth >>> of ~256 monitoring groups, this isn't a substantial gain to help us. I >>> think you should just focus on providing the necessary control >>> granularity without trying to consolidate writes in this interface. I >>> will propose an additional interface below to optimize our use case. >>> >>> Whether mbm_total and mbm_local are combined in the group directories >>> or not, I don't see why you wouldn't just repeat the same file >>> interface in the domain directories for a user needing finer-grained >>> controls. >> I don't follow why this has to be done globally. resctrl allows CLOSID to have different >> configurations for different purposes between different domains (as long as tasks are >> pinned to CPUs). It feels a bit odd that these counters can't be considered as per-domain too. > Assigning to all domains at once would allow us to better parallelize > the resulting IPIs when we do need to iterate a small set of monitors > over a large list of groups. Planning to work on v3 of this series. For now, I will exclude the global assignment option from this series. We can add the global assignment support when we get time to optimize assignments at later point. Thanks Babu Moger
Hi Babu, On 2/23/2024 12:11 PM, Moger, Babu wrote: > On 2/23/24 11:17, Reinette Chatre wrote: >> >> >> On 2/20/2024 12:48 PM, Moger, Babu wrote: >>> On 2/20/24 09:21, James Morse wrote: >>>> On 19/01/2024 18:22, Babu Moger wrote: >> >>>>> e. Enable ABMC mode. >>>>> >>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>> 1 >>>> >>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that >>>> supports it, or enabled implicitly when the first assignment attempt arrives? >>>> >>>> I guess this is really needed for a reset - could we implement that instead? This way >>>> there isn't an extra step user-space has to do to make the assignments work. >>> >>> Mostly the new features are added as an opt-in method. So, kept it that >>> way. If we enable this feature automatically, then we have provide an >>> option to disable it. >>> >> >> At the same time it sounds to me like ABMC can improve current users' >> experience without requiring them to do anything. This sounds appealing. >> For example, if I understand correctly, it may be possible to start resctrl >> with ABMC enabled by default and the number of monitoring groups (currently >> exposed to user space via "num_rmids") limited to the number of counters >> supported by ABMC. Existing users would then by default obtain better behavior >> of counters not resetting. > > Yes, I like the idea. But i will break compatibility with pqos > tool(intel_cmt_cat utility). pqos tool monitoring will not work without > supporting ABMC enablement in the tool. ABMC feature requires an extra > step to assign the counters for monitor to work. I am considering two scenarios, the "default behavior" is what a user will experience when booting resctrl on an ABMC system and the "new feature behavior" where a user can take full advantage of all that ABMC (and soft RMID, and MPAM) can offer. So, first, on an ABMC system in the "default behavior" scenario I expect that resctrl can do required ABMC counter configuration automatically at the time a monitor group is created. In this "default behavior" scenario resctrl would expose "num_rmids" to be half of the number of assignable counters. When a user then creates a monitor group two counters will be used and configured to count the local and total bytes respectively. If two counters are not available then ENOSPC returned, just like when system is out of closid/rmid. With this "default behavior" user space thus gets improved behavior without making any changes on its part. I do not have insight into how many counters ABMC could be expected to expose though ... so some users may be surprised at how few monitor groups can be created with new hardware? This may not be an issue since that would accurately reflect how many _reliable_ monitor groups can be created and if user needs more monitor groups then that would be a time to explore the "new feature" that requires changes in how user interacts with resctrl. Apart from the "default behavior" there are two options to consider ... (a) the "original" behavior(? I do not know what to call it) - this would be where user space wants(?) to have the current non-ABMC behavior on an ABMC system, where the previous "num_rmids" monitor groups can be created but the counters are reset unpredictably ... should this still be supported on ABMC systems though? (b) the "new feature" behavior where user space gets full benefit of ABMC that allows user space to create any number of monitor groups but then user space needs to let hardware (via resctrl) know which events should be counted. I expect that only (b) above would require user space change. Considering that per documentation, "num_rmids" means "This is the upper bound for how many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids" becomes undefined when "new feature" is enabled. When this new feature is enabled then user space is no longer limited by number of RMIDs on how many monitor groups can be created and this is the point that the user interface that you and Peter have ideas about comes into play. Specifically, user space needing a way to specify: (a) "let me create more monitor groups that the hardware can support"/"let me control which events/monitor groups are counted" (like the "mbm_assign" file in your proposal) (b) "here are the events that need to be counted" (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals) Reinette
Hi Reinette, On 2/23/24 16:21, Reinette Chatre wrote: > Hi Babu, > > On 2/23/2024 12:11 PM, Moger, Babu wrote: >> On 2/23/24 11:17, Reinette Chatre wrote: >>> >>> >>> On 2/20/2024 12:48 PM, Moger, Babu wrote: >>>> On 2/20/24 09:21, James Morse wrote: >>>>> On 19/01/2024 18:22, Babu Moger wrote: >>> >>>>>> e. Enable ABMC mode. >>>>>> >>>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>>> 1 >>>>> >>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that >>>>> supports it, or enabled implicitly when the first assignment attempt arrives? >>>>> >>>>> I guess this is really needed for a reset - could we implement that instead? This way >>>>> there isn't an extra step user-space has to do to make the assignments work. >>>> >>>> Mostly the new features are added as an opt-in method. So, kept it that >>>> way. If we enable this feature automatically, then we have provide an >>>> option to disable it. >>>> >>> >>> At the same time it sounds to me like ABMC can improve current users' >>> experience without requiring them to do anything. This sounds appealing. >>> For example, if I understand correctly, it may be possible to start resctrl >>> with ABMC enabled by default and the number of monitoring groups (currently >>> exposed to user space via "num_rmids") limited to the number of counters >>> supported by ABMC. Existing users would then by default obtain better behavior >>> of counters not resetting. >> >> Yes, I like the idea. But i will break compatibility with pqos >> tool(intel_cmt_cat utility). pqos tool monitoring will not work without >> supporting ABMC enablement in the tool. ABMC feature requires an extra >> step to assign the counters for monitor to work. > > I am considering two scenarios, the "default behavior" is what a user will > experience when booting resctrl on an ABMC system and the "new feature > behavior" where a user can take full advantage of all that ABMC (and soft > RMID, and MPAM) can offer. > > So, first, on an ABMC system in the "default behavior" scenario I expect > that resctrl can do required ABMC counter configuration automatically at > the time a monitor group is created. In this "default behavior" scenario > resctrl would expose "num_rmids" to be half of the number of assignable > counters. When a user then creates a monitor group two counters will be > used and configured to count the local and total bytes respectively. If > two counters are not available then ENOSPC returned, just like when system > is out of closid/rmid. With this "default behavior" user space thus gets > improved behavior without making any changes on its part. I do not have We can automatically assign the h/w counter when monitor group is created until we run out of h/w counters. That is good idea. By default user will not notice any difference in ABMC mode. > insight into how many counters ABMC could be expected to expose though ... > so some users may be surprised at how few monitor groups can be created > with new hardware? This may not be an issue since that would accurately > reflect how many _reliable_ monitor groups can be created and if user needs > more monitor groups then that would be a time to explore the "new feature" > that requires changes in how user interacts with resctrl. Currently, 32 h/w counters are available to configure. With two counters for each group, we can create 16 groups(15 new groups plus the default group). That should be fine as pqos tool creates only 16 groups when it is started. > > Apart from the "default behavior" there are two options to consider ... > (a) the "original" behavior(? I do not know what to call it) - this would be > where user space wants(?) to have the current non-ABMC behavior on an ABMC > system, where the previous "num_rmids" monitor groups can be created but > the counters are reset unpredictably ... should this still be supported > on ABMC systems though? I would say yes. For some reason user(hardware or software issues) is not able to use ABMC mode, they have an option to go back to legacy mode. > (b) the "new feature" behavior where user space gets full benefit of ABMC > that allows user space to create any number of monitor groups but then > user space needs to let hardware (via resctrl) know which > events should be counted. Is this "new feature" is enabled by default when ABMC is available? Or we need to provide an interface to enable this feature? > > I expect that only (b) above would require user space change. Considering > that per documentation, "num_rmids" means "This is the upper bound for how > many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids" > becomes undefined when "new feature" is enabled. When this new feature is enabled > then user space is no longer limited by number of RMIDs on how many monitor With ABMC, we will have a new field "mbm_assignable_counters". We don't have to change the definition of "num_rmids". > groups can be created and this is the point that the user interface that you > and Peter have ideas about comes into play. Specifically, user space needing > a way to specify: > (a) "let me create more monitor groups that the hardware can support"/"let me > control which events/monitor groups are counted" > (like the "mbm_assign" file in your proposal) > (b) "here are the events that need to be counted" > (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals) With global assignment option out of way for now(may be introduced later), we can provide two interfaces. 1. /sys/fs/resctrl/info/L3_MON/mbm_assign This will be enabled by default when ABMC is available. Users can disable this option to go back to legacy mode. 2. /sys/fs/resctrl/monitor_state. This can used to individually assign or unassign the counters in each group. When assigned: #cat /sys/fs/resctrl/monitor_state 0=total-assign,local-assign;1=total-assign,local-assign When unassigned: #cat /sys/fs/resctrl/monitor_state 0=total-unassign,local-unassign;1=total-unassign,local-unassign Thoughts?
Hi Babu, On 2/26/2024 9:59 AM, Moger, Babu wrote: > On 2/23/24 16:21, Reinette Chatre wrote: >> On 2/23/2024 12:11 PM, Moger, Babu wrote: >>> On 2/23/24 11:17, Reinette Chatre wrote: >>>> >>>> >>>> On 2/20/2024 12:48 PM, Moger, Babu wrote: >>>>> On 2/20/24 09:21, James Morse wrote: >>>>>> On 19/01/2024 18:22, Babu Moger wrote: >>>> >>>>>>> e. Enable ABMC mode. >>>>>>> >>>>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>>>> 1 >>>>>> >>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that >>>>>> supports it, or enabled implicitly when the first assignment attempt arrives? >>>>>> >>>>>> I guess this is really needed for a reset - could we implement that instead? This way >>>>>> there isn't an extra step user-space has to do to make the assignments work. >>>>> >>>>> Mostly the new features are added as an opt-in method. So, kept it that >>>>> way. If we enable this feature automatically, then we have provide an >>>>> option to disable it. >>>>> >>>> >>>> At the same time it sounds to me like ABMC can improve current users' >>>> experience without requiring them to do anything. This sounds appealing. >>>> For example, if I understand correctly, it may be possible to start resctrl >>>> with ABMC enabled by default and the number of monitoring groups (currently >>>> exposed to user space via "num_rmids") limited to the number of counters >>>> supported by ABMC. Existing users would then by default obtain better behavior >>>> of counters not resetting. >>> >>> Yes, I like the idea. But i will break compatibility with pqos >>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without >>> supporting ABMC enablement in the tool. ABMC feature requires an extra >>> step to assign the counters for monitor to work. >> >> I am considering two scenarios, the "default behavior" is what a user will >> experience when booting resctrl on an ABMC system and the "new feature >> behavior" where a user can take full advantage of all that ABMC (and soft >> RMID, and MPAM) can offer. >> >> So, first, on an ABMC system in the "default behavior" scenario I expect >> that resctrl can do required ABMC counter configuration automatically at >> the time a monitor group is created. In this "default behavior" scenario >> resctrl would expose "num_rmids" to be half of the number of assignable >> counters. When a user then creates a monitor group two counters will be >> used and configured to count the local and total bytes respectively. If >> two counters are not available then ENOSPC returned, just like when system >> is out of closid/rmid. With this "default behavior" user space thus gets >> improved behavior without making any changes on its part. I do not have > > We can automatically assign the h/w counter when monitor group is created > until we run out of h/w counters. That is good idea. By default user will > not notice any difference in ABMC mode. > >> insight into how many counters ABMC could be expected to expose though ... >> so some users may be surprised at how few monitor groups can be created >> with new hardware? This may not be an issue since that would accurately >> reflect how many _reliable_ monitor groups can be created and if user needs >> more monitor groups then that would be a time to explore the "new feature" >> that requires changes in how user interacts with resctrl. > > Currently, 32 h/w counters are available to configure. With two counters > for each group, we can create 16 groups(15 new groups plus the default > group). That should be fine as pqos tool creates only 16 groups when it is > started. user space can never assume that a certain number of groups can be created. >> Apart from the "default behavior" there are two options to consider ... >> (a) the "original" behavior(? I do not know what to call it) - this would be >> where user space wants(?) to have the current non-ABMC behavior on an ABMC >> system, where the previous "num_rmids" monitor groups can be created but >> the counters are reset unpredictably ... should this still be supported >> on ABMC systems though? > > I would say yes. For some reason user(hardware or software issues) is not > able to use ABMC mode, they have an option to go back to legacy mode. I see. Should this perhaps be protected behind the resctrl "debug" mount option? >> (b) the "new feature" behavior where user space gets full benefit of ABMC >> that allows user space to create any number of monitor groups but then >> user space needs to let hardware (via resctrl) know which >> events should be counted. > > Is this "new feature" is enabled by default when ABMC is available? Not in this design, no. In these scenarios ABMC will be available and enabled in both the "default" and "new feature" behavior. The difference is no user space changes are needed in "default" scenario and resctrl limits the number of monitor groups to support all monitor groups to be backed by hardware counters. When "new feature" is enabled when ABMC is available and enabled then user space is able to create more monitor groups than available hardware counters and new user interface is required to manage associating counters with monitor events. > > Or we need to provide an interface to enable this feature? Yes, an interface will be needed to enable this feature. > > >> >> I expect that only (b) above would require user space change. Considering >> that per documentation, "num_rmids" means "This is the upper bound for how >> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids" >> becomes undefined when "new feature" is enabled. When this new feature is enabled >> then user space is no longer limited by number of RMIDs on how many monitor > > With ABMC, we will have a new field "mbm_assignable_counters". We don't > have to change the definition of "num_rmids". The problem here is that "num_rmids" is (as per Documentation/arch/x86/resctrl.rst) documented to be an upper bound for how many monitor groups can be created. As I understand, when ABMC is enabled and its full capability exposed to user space then there is no limit to how many monitor groups can be created, no? For example, if I understand correctly, theoretically, when ABMC is enabled then "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid is not unsigned, tbd if number of directories may also be limited by kernfs). User space could theoretically create more monitor groups than the number of rmids that a resource claims to support using current upstream enumeration. Instead, it is the "mbm_assignable_counters" that is of interest, that is what user space uses to determine how many of the (potentially very large number of) monitor groups/monitor events can be counted at any particular time. >> groups can be created and this is the point that the user interface that you >> and Peter have ideas about comes into play. Specifically, user space needing >> a way to specify: >> (a) "let me create more monitor groups that the hardware can support"/"let me >> control which events/monitor groups are counted" >> (like the "mbm_assign" file in your proposal) >> (b) "here are the events that need to be counted" >> (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals) > > With global assignment option out of way for now(may be introduced later), > we can provide two interfaces. > > 1. /sys/fs/resctrl/info/L3_MON/mbm_assign > This will be enabled by default when ABMC is available. Users can disable > this option to go back to legacy mode. Potentially (all naming placeholders that will only be visible on systems that actually supports particular mode): legacy [default] new_feature soft_rmid > > 2. /sys/fs/resctrl/monitor_state. > This can used to individually assign or unassign the counters in each group. > > When assigned: > #cat /sys/fs/resctrl/monitor_state > 0=total-assign,local-assign;1=total-assign,local-assign > > When unassigned: > #cat /sys/fs/resctrl/monitor_state > 0=total-unassign,local-unassign;1=total-unassign,local-unassign > > > Thoughts? How do you expect this interface to be used? I understand the mechanics of this interface but on a higher level, do you expect user space to once in a while assign a new counter to a single event or monitor group (for which a fine grained interface works) or do you expect user space to shift multiple counters across several monitor events at intervals? Across resctrl's lifetime we have seen examples of user space wanting to accomplish more with a single resctrl interaction. For example moving multiple tasks to a group that you added support for and moving a monitor group feature from Peter. I thus think that it would be valuable to consider more efficient interfaces from the beginning. I do not think that this is the type of work that is an optimization to be delayed until an unspecified later time, but instead multiple usage of interface can be considered from the start with a most optimal interface created from the beginning. Specifically, why does resctrl need to be "extended" to support a global assignment as proposed by Peter at a later time, why can it not be done as the original and (ideally) only mechanism? Reinette
Hi Reinette, On 2/26/24 15:20, Reinette Chatre wrote: > Hi Babu, > > On 2/26/2024 9:59 AM, Moger, Babu wrote: >> On 2/23/24 16:21, Reinette Chatre wrote: >>> On 2/23/2024 12:11 PM, Moger, Babu wrote: >>>> On 2/23/24 11:17, Reinette Chatre wrote: >>>>> >>>>> >>>>> On 2/20/2024 12:48 PM, Moger, Babu wrote: >>>>>> On 2/20/24 09:21, James Morse wrote: >>>>>>> On 19/01/2024 18:22, Babu Moger wrote: >>>>> >>>>>>>> e. Enable ABMC mode. >>>>>>>> >>>>>>>> #echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_enable >>>>>>>> 1 >>>>>>> >>>>>>> Why does this mode need enabling? Can't it be enabled automatically on hardware that >>>>>>> supports it, or enabled implicitly when the first assignment attempt arrives? >>>>>>> >>>>>>> I guess this is really needed for a reset - could we implement that instead? This way >>>>>>> there isn't an extra step user-space has to do to make the assignments work. >>>>>> >>>>>> Mostly the new features are added as an opt-in method. So, kept it that >>>>>> way. If we enable this feature automatically, then we have provide an >>>>>> option to disable it. >>>>>> >>>>> >>>>> At the same time it sounds to me like ABMC can improve current users' >>>>> experience without requiring them to do anything. This sounds appealing. >>>>> For example, if I understand correctly, it may be possible to start resctrl >>>>> with ABMC enabled by default and the number of monitoring groups (currently >>>>> exposed to user space via "num_rmids") limited to the number of counters >>>>> supported by ABMC. Existing users would then by default obtain better behavior >>>>> of counters not resetting. >>>> >>>> Yes, I like the idea. But i will break compatibility with pqos >>>> tool(intel_cmt_cat utility). pqos tool monitoring will not work without >>>> supporting ABMC enablement in the tool. ABMC feature requires an extra >>>> step to assign the counters for monitor to work. >>> >>> I am considering two scenarios, the "default behavior" is what a user will >>> experience when booting resctrl on an ABMC system and the "new feature >>> behavior" where a user can take full advantage of all that ABMC (and soft >>> RMID, and MPAM) can offer. >>> >>> So, first, on an ABMC system in the "default behavior" scenario I expect >>> that resctrl can do required ABMC counter configuration automatically at >>> the time a monitor group is created. In this "default behavior" scenario >>> resctrl would expose "num_rmids" to be half of the number of assignable >>> counters. When a user then creates a monitor group two counters will be >>> used and configured to count the local and total bytes respectively. If >>> two counters are not available then ENOSPC returned, just like when system >>> is out of closid/rmid. With this "default behavior" user space thus gets >>> improved behavior without making any changes on its part. I do not have >> >> We can automatically assign the h/w counter when monitor group is created >> until we run out of h/w counters. That is good idea. By default user will >> not notice any difference in ABMC mode. >> >>> insight into how many counters ABMC could be expected to expose though ... >>> so some users may be surprised at how few monitor groups can be created >>> with new hardware? This may not be an issue since that would accurately >>> reflect how many _reliable_ monitor groups can be created and if user needs >>> more monitor groups then that would be a time to explore the "new feature" >>> that requires changes in how user interacts with resctrl. >> >> Currently, 32 h/w counters are available to configure. With two counters >> for each group, we can create 16 groups(15 new groups plus the default >> group). That should be fine as pqos tool creates only 16 groups when it is >> started. > > user space can never assume that a certain number of groups can > be created. > >>> Apart from the "default behavior" there are two options to consider ... >>> (a) the "original" behavior(? I do not know what to call it) - this would be >>> where user space wants(?) to have the current non-ABMC behavior on an ABMC >>> system, where the previous "num_rmids" monitor groups can be created but >>> the counters are reset unpredictably ... should this still be supported >>> on ABMC systems though? >> >> I would say yes. For some reason user(hardware or software issues) is not >> able to use ABMC mode, they have an option to go back to legacy mode. > > I see. Should this perhaps be protected behind the resctrl "debug" mount option? The debug option gives wrong impression. It is better to keep the option open to enable the feature in normal mode. > >>> (b) the "new feature" behavior where user space gets full benefit of ABMC >>> that allows user space to create any number of monitor groups but then >>> user space needs to let hardware (via resctrl) know which >>> events should be counted. >> >> Is this "new feature" is enabled by default when ABMC is available? > > Not in this design, no. In these scenarios ABMC will be available and enabled > in both the "default" and "new feature" behavior. The difference is no user > space changes are needed in "default" scenario and resctrl limits the number > of monitor groups to support all monitor groups to be backed by hardware > counters. > When "new feature" is enabled when ABMC is available and enabled then > user space is able to create more monitor groups than available hardware > counters and new user interface is required to manage associating counters > with monitor events. ok. That sounds good. > >> >> Or we need to provide an interface to enable this feature? > > Yes, an interface will be needed to enable this feature. ok. > >> >> >>> >>> I expect that only (b) above would require user space change. Considering >>> that per documentation, "num_rmids" means "This is the upper bound for how >>> many "CTRL_MON" + "MON" groups can be created" I expect that "num_rmids" >>> becomes undefined when "new feature" is enabled. When this new feature is enabled >>> then user space is no longer limited by number of RMIDs on how many monitor >> >> With ABMC, we will have a new field "mbm_assignable_counters". We don't >> have to change the definition of "num_rmids". > > The problem here is that "num_rmids" is (as per Documentation/arch/x86/resctrl.rst) > documented to be an upper bound for how many monitor groups can be created. > As I understand, when ABMC is enabled and its full capability exposed to user > space then there is no limit to how many monitor groups can be created, no? No. That is not correct. The number of monitor groups is still limited by num_rmids. But assignment is limited by mbm_assignable_counters. More below. > > For example, if I understand correctly, theoretically, when ABMC is enabled then > "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid > is not unsigned, tbd if number of directories may also be limited by kernfs). > User space could theoretically create more monitor groups than the number of > rmids that a resource claims to support using current upstream enumeration. CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 bits(depends on specific h/w) to represent RMIDs. So, we cannot create more than this limit(r->num_rmid). In case of ABMC, h/w uses another counter(mbm_assignable_counters) with RMID to assign the monitoring. So, assignment limit is mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. > Instead, it is the "mbm_assignable_counters" that is of interest, that is what > user space uses to determine how many of the (potentially very large number of) > monitor groups/monitor events can be counted at any particular time. > >>> groups can be created and this is the point that the user interface that you >>> and Peter have ideas about comes into play. Specifically, user space needing >>> a way to specify: >>> (a) "let me create more monitor groups that the hardware can support"/"let me >>> control which events/monitor groups are counted" >>> (like the "mbm_assign" file in your proposal) >>> (b) "here are the events that need to be counted" >>> (like the "monitor_state" and "mbm_{local,total}_bytes_assigned" proposals) >> >> With global assignment option out of way for now(may be introduced later), >> we can provide two interfaces. >> >> 1. /sys/fs/resctrl/info/L3_MON/mbm_assign >> This will be enabled by default when ABMC is available. Users can disable >> this option to go back to legacy mode. > > Potentially (all naming placeholders that will only be visible on systems that > actually supports particular mode): > legacy [default] new_feature soft_rmid ok > >> >> 2. /sys/fs/resctrl/monitor_state. >> This can used to individually assign or unassign the counters in each group. >> >> When assigned: >> #cat /sys/fs/resctrl/monitor_state >> 0=total-assign,local-assign;1=total-assign,local-assign >> >> When unassigned: >> #cat /sys/fs/resctrl/monitor_state >> 0=total-unassign,local-unassign;1=total-unassign,local-unassign >> >> >> Thoughts? > > How do you expect this interface to be used? I understand the mechanics > of this interface but on a higher level, do you expect user space to > once in a while assign a new counter to a single event or monitor group > (for which a fine grained interface works) or do you expect user space to > shift multiple counters across several monitor events at intervals? I think we should provide both the options. I was thinking of providing fine grained interface first. Few use cases: 1. User wants to assign only one event (total or local) per group. In this case, he can assign 32 events in 32 different groups. #echo 0=total-assign > /sys/fs/resctrl/monitor_state or #echo 0=local-assign > /sys/fs/resctrl/monitor_state When done: #echo 0=total-unassign > /sys/fs/resctrl/monitor_state or #echo 0=local-unassign > /sys/fs/resctrl/monitor_state Note: 0 the domain here. 2. User wants to assign both "local" and "total" events per group. In this case, he can assign 32 events in 16 different groups. #echo 0=local-assign,total-assign > /sys/fs/resctrl/monitor_state When done: #echo 0=local-unassign,total-unassign > /sys/fs/resctrl/monitor_state 3. combination of 1 and 2. 4. Assign multiple group assignment at once. I consider this as global assignment. This can be achieved by 1 and 2 from user space looping thru all the interested groups. Peter is worried about system call latency here. He wants to optimize this. I was thinking this can done later. > > Across resctrl's lifetime we have seen examples of user space wanting > to accomplish more with a single resctrl interaction. For example moving > multiple tasks to a group that you added support for and moving a monitor > group feature from Peter. > > I thus think that it would be valuable to consider more efficient > interfaces from the beginning. I do not think that this is the type > of work that is an optimization to be delayed until an unspecified later > time, but instead multiple usage of interface can be considered from the > start with a most optimal interface created from the beginning. Specifically, > why does resctrl need to be "extended" to support a global assignment as proposed > by Peter at a later time, why can it not be done as the original and (ideally) > only mechanism? > > Reinette
Hi Babu, On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote: > > On 2/26/24 15:20, Reinette Chatre wrote: > > > > For example, if I understand correctly, theoretically, when ABMC is enabled then > > "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid > > is not unsigned, tbd if number of directories may also be limited by kernfs). > > User space could theoretically create more monitor groups than the number of > > rmids that a resource claims to support using current upstream enumeration. > > CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 > bits(depends on specific h/w) to represent RMIDs. So, we cannot create > more than this limit(r->num_rmid). > > In case of ABMC, h/w uses another counter(mbm_assignable_counters) with > RMID to assign the monitoring. So, assignment limit is > mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. That is not entirely true. As long as you don't need to maintain bandwidth counts for unassigned monitoring groups, there's no need to allocate a HW RMID to a monitoring group. In my soft-ABMC prototype, where a limited number of HW RMIDs are allocated to assigned monitoring groups, I was forced to replace the HW RMID value stored in the task_struct to a pointer to the struct mongroup, since the RMID value assigned to the mongroup would frequently change, resulting in excessive walks down the tasklist to find all of the tasks using the previous value. However, the number of hardware monitor group identifiers supported (i.e., RMID, PARTID:PMG) is usually high enough that I don't think there's much motivation to support unlimited monitoring groups. In both soft-RMID and soft-ABMC, I didn't bother supporting more groups than num_rmids, because the number was large enough. -Peter
Hi Peter, On 2/27/24 12:26, Peter Newman wrote: > Hi Babu, > > On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote: >> >> On 2/26/24 15:20, Reinette Chatre wrote: >>> >>> For example, if I understand correctly, theoretically, when ABMC is enabled then >>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >>> is not unsigned, tbd if number of directories may also be limited by kernfs). >>> User space could theoretically create more monitor groups than the number of >>> rmids that a resource claims to support using current upstream enumeration. >> >> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 >> bits(depends on specific h/w) to represent RMIDs. So, we cannot create >> more than this limit(r->num_rmid). >> >> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with >> RMID to assign the monitoring. So, assignment limit is >> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. > > That is not entirely true. As long as you don't need to maintain > bandwidth counts for unassigned monitoring groups, there's no need to > allocate a HW RMID to a monitoring group. We don't need to allocate a h/w counter for unassigned group. My proposal is to allocate h/w counter only if user requests a assignment. The limit for assigned events at time is mbm_assignable_counters(32 right now). > > In my soft-ABMC prototype, where a limited number of HW RMIDs are > allocated to assigned monitoring groups, I was forced to replace the > HW RMID value stored in the task_struct to a pointer to the struct > mongroup, since the RMID value assigned to the mongroup would > frequently change, resulting in excessive walks down the tasklist to > find all of the tasks using the previous value. > > However, the number of hardware monitor group identifiers supported > (i.e., RMID, PARTID:PMG) is usually high enough that I don't think > there's much motivation to support unlimited monitoring groups. In > both soft-RMID and soft-ABMC, I didn't bother supporting more groups > than num_rmids, because the number was large enough. What is soft-ABMC?
Hi Babu, On Tue, Feb 27, 2024 at 11:37 AM Moger, Babu <babu.moger@amd.com> wrote: > On 2/27/24 12:26, Peter Newman wrote: > > On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote: > >> > >> On 2/26/24 15:20, Reinette Chatre wrote: > >>> > >>> For example, if I understand correctly, theoretically, when ABMC is enabled then > >>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid > >>> is not unsigned, tbd if number of directories may also be limited by kernfs). > >>> User space could theoretically create more monitor groups than the number of > >>> rmids that a resource claims to support using current upstream enumeration. > >> > >> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 > >> bits(depends on specific h/w) to represent RMIDs. So, we cannot create > >> more than this limit(r->num_rmid). > >> > >> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with > >> RMID to assign the monitoring. So, assignment limit is > >> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. > > > > That is not entirely true. As long as you don't need to maintain > > bandwidth counts for unassigned monitoring groups, there's no need to > > allocate a HW RMID to a monitoring group. > > We don't need to allocate a h/w counter for unassigned group. > My proposal is to allocate h/w counter only if user requests a assignment. > The limit for assigned events at time is mbm_assignable_counters(32 right > now). I said "RMID", not "counter". The point is, the main purpose served by the RMID in an unassigned mongroup is providing a unique value to write into the task_struct to indicate group membership. > > > > > In my soft-ABMC prototype, where a limited number of HW RMIDs are > > allocated to assigned monitoring groups, I was forced to replace the > > HW RMID value stored in the task_struct to a pointer to the struct > > mongroup, since the RMID value assigned to the mongroup would > > frequently change, resulting in excessive walks down the tasklist to > > find all of the tasks using the previous value. > > > > However, the number of hardware monitor group identifiers supported > > (i.e., RMID, PARTID:PMG) is usually high enough that I don't think > > there's much motivation to support unlimited monitoring groups. In > > both soft-RMID and soft-ABMC, I didn't bother supporting more groups > > than num_rmids, because the number was large enough. > > What is soft-ABMC? It's the term I'm using to describe[1] the approach of using the monitor assignment interface to allocate a small number of RMIDs to monitoring groups. -Peter [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
Hi Peter, On 2/27/24 14:06, Peter Newman wrote: > Hi Babu, > > On Tue, Feb 27, 2024 at 11:37 AM Moger, Babu <babu.moger@amd.com> wrote: >> On 2/27/24 12:26, Peter Newman wrote: >>> On Tue, Feb 27, 2024 at 10:12 AM Moger, Babu <babu.moger@amd.com> wrote: >>>> >>>> On 2/26/24 15:20, Reinette Chatre wrote: >>>>> >>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then >>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >>>>> is not unsigned, tbd if number of directories may also be limited by kernfs). >>>>> User space could theoretically create more monitor groups than the number of >>>>> rmids that a resource claims to support using current upstream enumeration. >>>> >>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 >>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create >>>> more than this limit(r->num_rmid). >>>> >>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with >>>> RMID to assign the monitoring. So, assignment limit is >>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. >>> >>> That is not entirely true. As long as you don't need to maintain >>> bandwidth counts for unassigned monitoring groups, there's no need to >>> allocate a HW RMID to a monitoring group. >> >> We don't need to allocate a h/w counter for unassigned group. >> My proposal is to allocate h/w counter only if user requests a assignment. >> The limit for assigned events at time is mbm_assignable_counters(32 right >> now). > > I said "RMID", not "counter". The point is, the main purpose served by > the RMID in an unassigned mongroup is providing a unique value to > write into the task_struct to indicate group membership. In case of ABMC, cpu(or task) association still uses RMID value stored in "struct mongroup" data structure. Same value is written to PQR_ASSOC(MSR C8Fh). It needs to be a valid value. Hope that make sense. > >> >>> >>> In my soft-ABMC prototype, where a limited number of HW RMIDs are >>> allocated to assigned monitoring groups, I was forced to replace the >>> HW RMID value stored in the task_struct to a pointer to the struct >>> mongroup, since the RMID value assigned to the mongroup would >>> frequently change, resulting in excessive walks down the tasklist to >>> find all of the tasks using the previous value. You are using this pointer as unique value. This will work as long as you are not writing this value to PQR_ASSOC MSR. >>> >>> However, the number of hardware monitor group identifiers supported >>> (i.e., RMID, PARTID:PMG) is usually high enough that I don't think >>> there's much motivation to support unlimited monitoring groups. In >>> both soft-RMID and soft-ABMC, I didn't bother supporting more groups >>> than num_rmids, because the number was large enough. >> >> What is soft-ABMC? > > It's the term I'm using to describe[1] the approach of using the > monitor assignment interface to allocate a small number of RMIDs to > monitoring groups. > > -Peter > > [1] https://lore.kernel.org/lkml/CALPaoCiRD6j_Rp7ffew+PtGTF4rWDORwbuRQqH2i-cY5SvWQBg@mail.gmail.com/
Hi Babu, On 2/27/2024 10:12 AM, Moger, Babu wrote: > On 2/26/24 15:20, Reinette Chatre wrote: >> On 2/26/2024 9:59 AM, Moger, Babu wrote: >>> On 2/23/24 16:21, Reinette Chatre wrote: >>>> Apart from the "default behavior" there are two options to consider ... >>>> (a) the "original" behavior(? I do not know what to call it) - this would be >>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC >>>> system, where the previous "num_rmids" monitor groups can be created but >>>> the counters are reset unpredictably ... should this still be supported >>>> on ABMC systems though? >>> >>> I would say yes. For some reason user(hardware or software issues) is not >>> able to use ABMC mode, they have an option to go back to legacy mode. >> >> I see. Should this perhaps be protected behind the resctrl "debug" mount option? > > The debug option gives wrong impression. It is better to keep the option > open to enable the feature in normal mode. You mentioned that it would only be needed when there are hardware or software issues ... so debug does sound appropriate. Could you please give an example of how debug option gives wrong impression? Why would you want users to keep using "legacy" mode on an ABMC system? .. >> For example, if I understand correctly, theoretically, when ABMC is enabled then >> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >> is not unsigned, tbd if number of directories may also be limited by kernfs). >> User space could theoretically create more monitor groups than the number of >> rmids that a resource claims to support using current upstream enumeration. > > CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 > bits(depends on specific h/w) to represent RMIDs. So, we cannot create > more than this limit(r->num_rmid). > > In case of ABMC, h/w uses another counter(mbm_assignable_counters) with > RMID to assign the monitoring. So, assignment limit is > mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. I see. Thank you for clarifying. This does make enabling simpler and one less user interface item that needs changing. .. >>> 2. /sys/fs/resctrl/monitor_state. >>> This can used to individually assign or unassign the counters in each group. >>> >>> When assigned: >>> #cat /sys/fs/resctrl/monitor_state >>> 0=total-assign,local-assign;1=total-assign,local-assign >>> >>> When unassigned: >>> #cat /sys/fs/resctrl/monitor_state >>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign >>> >>> >>> Thoughts? >> >> How do you expect this interface to be used? I understand the mechanics >> of this interface but on a higher level, do you expect user space to >> once in a while assign a new counter to a single event or monitor group >> (for which a fine grained interface works) or do you expect user space to >> shift multiple counters across several monitor events at intervals? > > I think we should provide both the options. I was thinking of providing > fine grained interface first. Could you please provide a motivation for why two interfaces, one inefficient and one not, should be created and maintained? Users can still do fine grained assignment with a global assignment interface. Reinette
Hi Reinette, On 2/27/24 17:50, Reinette Chatre wrote: > Hi Babu, > > On 2/27/2024 10:12 AM, Moger, Babu wrote: >> On 2/26/24 15:20, Reinette Chatre wrote: >>> On 2/26/2024 9:59 AM, Moger, Babu wrote: >>>> On 2/23/24 16:21, Reinette Chatre wrote: > >>>>> Apart from the "default behavior" there are two options to consider ... >>>>> (a) the "original" behavior(? I do not know what to call it) - this would be >>>>> where user space wants(?) to have the current non-ABMC behavior on an ABMC >>>>> system, where the previous "num_rmids" monitor groups can be created but >>>>> the counters are reset unpredictably ... should this still be supported >>>>> on ABMC systems though? >>>> >>>> I would say yes. For some reason user(hardware or software issues) is not >>>> able to use ABMC mode, they have an option to go back to legacy mode. >>> >>> I see. Should this perhaps be protected behind the resctrl "debug" mount option? >> >> The debug option gives wrong impression. It is better to keep the option >> open to enable the feature in normal mode. > > You mentioned that it would only be needed when there are hardware or > software issues ... so debug does sound appropriate. Could you please give > an example of how debug option gives wrong impression? Why would you want > users to keep using "legacy" mode on an ABMC system? I don't have a strong argument here. I am fine as long as there is a way to go back to legacy mode if required. We can provide legacy option in debug mode. > > ... > >>> For example, if I understand correctly, theoretically, when ABMC is enabled then >>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >>> is not unsigned, tbd if number of directories may also be limited by kernfs). >>> User space could theoretically create more monitor groups than the number of >>> rmids that a resource claims to support using current upstream enumeration. >> >> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 >> bits(depends on specific h/w) to represent RMIDs. So, we cannot create >> more than this limit(r->num_rmid). >> >> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with >> RMID to assign the monitoring. So, assignment limit is >> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. > > I see. Thank you for clarifying. This does make enabling simpler and one > less user interface item that needs changing. > > ... > >>>> 2. /sys/fs/resctrl/monitor_state. >>>> This can used to individually assign or unassign the counters in each group. >>>> >>>> When assigned: >>>> #cat /sys/fs/resctrl/monitor_state >>>> 0=total-assign,local-assign;1=total-assign,local-assign >>>> >>>> When unassigned: >>>> #cat /sys/fs/resctrl/monitor_state >>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign >>>> >>>> >>>> Thoughts? >>> >>> How do you expect this interface to be used? I understand the mechanics >>> of this interface but on a higher level, do you expect user space to >>> once in a while assign a new counter to a single event or monitor group >>> (for which a fine grained interface works) or do you expect user space to >>> shift multiple counters across several monitor events at intervals? >> >> I think we should provide both the options. I was thinking of providing >> fine grained interface first. > > Could you please provide a motivation for why two interfaces, one inefficient > and one not, should be created and maintained? Users can still do fine grained > assignment with a global assignment interface. Lets consider one by one. 1. Fine grained assignment. It will be part of the mongroup(or control mongroup). User has the access to the group and can query the group's current status before assigning or unassigning. $cd /sys/fs/resctrl/ctrl_mon1 $cat /sys/fs/resctrl/ctrl_mon1/monitor_state 0=total-unassign,local-unassign;1=total-unassign,local-unassign; Assign the total event $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state Assign the local event $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state Assign both events: $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state Check the assignment status. $cat /sys/fs/resctrl/ctrl_mon1/monitor_state 0=total-assign,local-assign;1=total-unassign,local-unassign; -User interface is simple. -Assignment will fail if all the h/w counters are exhausted. User needs to unassign a counter from another group and use that counter here. This can be done just querying the monitor state of another group. -Monitor group's details(cpus, tasks) are part of the group. So, it is better to have assignment state inside the group. Note: Used interface names here just to give example. 2. global assignment: I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/ directory. In case there are 100 mongroups, we need to have a way to list current assignment status for these groups. I am not sure how to list status of these 100 groups. If user is wants to assign the local event(or total) in a specific group in this list of 100 groups, I am not sure how to provide interface for that. Should we pass the name of mongroup? That will involve looping through using the call kernfs_walk_and_get. This may be ok if we are dealing with very small number of groups.
Hi Babu, On 2/28/2024 9:59 AM, Moger, Babu wrote: > On 2/27/24 17:50, Reinette Chatre wrote: >> On 2/27/2024 10:12 AM, Moger, Babu wrote: >>> On 2/26/24 15:20, Reinette Chatre wrote: >>>> On 2/26/2024 9:59 AM, Moger, Babu wrote: >>>>> On 2/23/24 16:21, Reinette Chatre wrote: >> >>>> For example, if I understand correctly, theoretically, when ABMC is enabled then >>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >>>> is not unsigned, tbd if number of directories may also be limited by kernfs). >>>> User space could theoretically create more monitor groups than the number of >>>> rmids that a resource claims to support using current upstream enumeration. >>> >>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 >>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create >>> more than this limit(r->num_rmid). >>> >>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with >>> RMID to assign the monitoring. So, assignment limit is >>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. >> >> I see. Thank you for clarifying. This does make enabling simpler and one >> less user interface item that needs changing. >> >> ... >> >>>>> 2. /sys/fs/resctrl/monitor_state. >>>>> This can used to individually assign or unassign the counters in each group. >>>>> >>>>> When assigned: >>>>> #cat /sys/fs/resctrl/monitor_state >>>>> 0=total-assign,local-assign;1=total-assign,local-assign >>>>> >>>>> When unassigned: >>>>> #cat /sys/fs/resctrl/monitor_state >>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign >>>>> >>>>> >>>>> Thoughts? >>>> >>>> How do you expect this interface to be used? I understand the mechanics >>>> of this interface but on a higher level, do you expect user space to >>>> once in a while assign a new counter to a single event or monitor group >>>> (for which a fine grained interface works) or do you expect user space to >>>> shift multiple counters across several monitor events at intervals? >>> >>> I think we should provide both the options. I was thinking of providing >>> fine grained interface first. >> >> Could you please provide a motivation for why two interfaces, one inefficient >> and one not, should be created and maintained? Users can still do fine grained >> assignment with a global assignment interface. > > Lets consider one by one. > > 1. Fine grained assignment. > > It will be part of the mongroup(or control mongroup). User has the access > to the group and can query the group's current status before assigning or > unassigning. > > $cd /sys/fs/resctrl/ctrl_mon1 > $cat /sys/fs/resctrl/ctrl_mon1/monitor_state > 0=total-unassign,local-unassign;1=total-unassign,local-unassign; > > Assign the total event > > $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state > > Assign the local event > > $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state > > Assign both events: > > $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state > > Check the assignment status. > > $cat /sys/fs/resctrl/ctrl_mon1/monitor_state > 0=total-assign,local-assign;1=total-unassign,local-unassign; > > -User interface is simple. This should not be the only motivation. Please do not sacrifice efficiency and usability just to have a simple interface. One can also argue that this interface can only be considered simple from the kernel implementation perspective, from user space it seems complicated. For example, as James pointed out earlier [1], user space would need to walk the entire resctrl to find out where counters are assigned. Peter also pointed out how the multiple syscalls needed when adjusting hundreds of monitor groups is inefficient. Please take all feedback into account. You consider "simple interface" as a motivation, there seems to be at least two arguments against this interface. Please consider these in your comparison between interfaces. These are things that should be noted and make their way to the cover letter. > > -Assignment will fail if all the h/w counters are exhausted. User needs to > unassign a counter from another group and use that counter here. This can > be done just querying the monitor state of another group. Right ... and as you state there can be hundreds of monitor groups that user space would need to walk and query to get this information. > > -Monitor group's details(cpus, tasks) are part of the group. So, it is > better to have assignment state inside the group. The assignment state should be clear from the event file. > Note: Used interface names here just to give example. > > > 2. global assignment: > > I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/ > directory. > > In case there are 100 mongroups, we need to have a way to list current > assignment status for these groups. I am not sure how to list status of > these 100 groups. The kernel has many examples of interfaces that manages status of a large number of entities. I am thinking, for example, we can learn a lot from how dynamic debug works. On my system I see: $ wc -l /sys/kernel/debug/dynamic_debug/control 5359 /sys/kernel/debug/dynamic_debug/control > > If user is wants to assign the local event(or total) in a specific group > in this list of 100 groups, I am not sure how to provide interface for > that. Should we pass the name of mongroup? That will involve looping > through using the call kernfs_walk_and_get. This may be ok if we are > dealing with very small number of groups. > What is your concern when needing to modify a large number of groups? Are you concerned about the size of the writes needing to be parsed? It looks like kernfs does support writes of larger than PAGE_SIZE, but it is not clear to me that such large sizes will be required. There is also kernfs_find_and_get() that may be more convenient to use. I believe user space needs to provide control group name for a global interface (the same name can be used by monitor groups belonging to different control groups), and that can be used to narrow search. Reading your message I do not find any motivation _against_ a global interface, except that it is not obvious to you how such interface may look or work. That is fair. Peter seems to have ideas and a working implementation that can be used as reference. So far I have only seen one comment [2] from James that was skeptical about the global interface but the reason notes that MPAM allocates counters per domain, which is the same as ABMC so we will need more information from James here on what is required since he did not respond to Peter. Below is a *hypothetical* interface to start a discussion that explores how to support fine grained assignment in an interface that aims to be easy to use by user space. Obviously Peter is also working on something so there are many viewpoints to consider. File info/L3_MON/mbm_assign_control: #control_group/mon_group/flags ctrl_a/mon_a/00=_;01=_ ctrl_a/mon_b/00=l;01=t ctrl_b/mon_c/00=lt;01=lt Above file displays to user: * No counters are assigned to monitor group mon_a within control group ctrl_a * Counter for local MBM is assigned to domain 0 of monitor group mon_b within control group ctrl_a * Counter for total MBM is assigned to domain 1 of monitor group mon_b within control group ctrl_a * Counters for local and total MBM are assigned to both domains of monitor group mon_c within control group ctrl_b With above interface user space can, with a single read, get insight into how counters are assigned across all monitor groups. User space can write to the file to modify the flags. If assigning a new counter when no more counters are available then the write will fail. Potentially, if changes are made in order provided by the user then the user will be able to unassign counters from one group and re-assign to another group with a single write. I provide this purely to generate some ideas and gather more thoughts on a global interface. Reinette [1] https://lore.kernel.org/lkml/2f373abf-f0c0-4f5d-9e22-1039a40a57f0@arm.com/ [2] https://lore.kernel.org/lkml/1a8c1cd6-a1ce-47a2-bc87-d4cccc84519b@arm.com/
Hi Reinette, On 2/28/24 14:04, Reinette Chatre wrote: > Hi Babu, > > On 2/28/2024 9:59 AM, Moger, Babu wrote: >> On 2/27/24 17:50, Reinette Chatre wrote: >>> On 2/27/2024 10:12 AM, Moger, Babu wrote: >>>> On 2/26/24 15:20, Reinette Chatre wrote: >>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote: >>>>>> On 2/23/24 16:21, Reinette Chatre wrote: >>> > >>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then >>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >>>>> is not unsigned, tbd if number of directories may also be limited by kernfs). >>>>> User space could theoretically create more monitor groups than the number of >>>>> rmids that a resource claims to support using current upstream enumeration. >>>> >>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 >>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create >>>> more than this limit(r->num_rmid). >>>> >>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with >>>> RMID to assign the monitoring. So, assignment limit is >>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. >>> >>> I see. Thank you for clarifying. This does make enabling simpler and one >>> less user interface item that needs changing. >>> >>> ... >>> >>>>>> 2. /sys/fs/resctrl/monitor_state. >>>>>> This can used to individually assign or unassign the counters in each group. >>>>>> >>>>>> When assigned: >>>>>> #cat /sys/fs/resctrl/monitor_state >>>>>> 0=total-assign,local-assign;1=total-assign,local-assign >>>>>> >>>>>> When unassigned: >>>>>> #cat /sys/fs/resctrl/monitor_state >>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign >>>>>> >>>>>> >>>>>> Thoughts? >>>>> >>>>> How do you expect this interface to be used? I understand the mechanics >>>>> of this interface but on a higher level, do you expect user space to >>>>> once in a while assign a new counter to a single event or monitor group >>>>> (for which a fine grained interface works) or do you expect user space to >>>>> shift multiple counters across several monitor events at intervals? >>>> >>>> I think we should provide both the options. I was thinking of providing >>>> fine grained interface first. >>> >>> Could you please provide a motivation for why two interfaces, one inefficient >>> and one not, should be created and maintained? Users can still do fine grained >>> assignment with a global assignment interface. >> >> Lets consider one by one. >> >> 1. Fine grained assignment. >> >> It will be part of the mongroup(or control mongroup). User has the access >> to the group and can query the group's current status before assigning or >> unassigning. >> >> $cd /sys/fs/resctrl/ctrl_mon1 >> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state >> 0=total-unassign,local-unassign;1=total-unassign,local-unassign; >> >> Assign the total event >> >> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >> >> Assign the local event >> >> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >> >> Assign both events: >> >> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >> >> Check the assignment status. >> >> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state >> 0=total-assign,local-assign;1=total-unassign,local-unassign; >> >> -User interface is simple. > > This should not be the only motivation. Please do not sacrifice efficiency > and usability just to have a simple interface. One can also argue that this > interface can only be considered simple from the kernel implementation perspective, > from user space it seems complicated. For example, as James pointed out earlier [1], > user space would need to walk the entire resctrl to find out where counters are > assigned. Peter also pointed out how the multiple syscalls needed when adjusting > hundreds of monitor groups is inefficient. Please take all feedback into account. > > You consider "simple interface" as a motivation, there seems to be at least two > arguments against this interface. Please consider these in your comparison > between interfaces. These are things that should be noted and make their way to > the cover letter. > >> >> -Assignment will fail if all the h/w counters are exhausted. User needs to >> unassign a counter from another group and use that counter here. This can >> be done just querying the monitor state of another group. > > Right ... and as you state there can be hundreds of monitor groups that > user space would need to walk and query to get this information. > >> >> -Monitor group's details(cpus, tasks) are part of the group. So, it is >> better to have assignment state inside the group. > > The assignment state should be clear from the event file. > >> Note: Used interface names here just to give example. >> >> >> 2. global assignment: >> >> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/ >> directory. >> >> In case there are 100 mongroups, we need to have a way to list current >> assignment status for these groups. I am not sure how to list status of >> these 100 groups. > > The kernel has many examples of interfaces that manages status of a large > number of entities. I am thinking, for example, we can learn a lot from > how dynamic debug works. On my system I see: > > $ wc -l /sys/kernel/debug/dynamic_debug/control > 5359 /sys/kernel/debug/dynamic_debug/control > >> >> If user is wants to assign the local event(or total) in a specific group >> in this list of 100 groups, I am not sure how to provide interface for >> that. Should we pass the name of mongroup? That will involve looping >> through using the call kernfs_walk_and_get. This may be ok if we are >> dealing with very small number of groups. >> > > What is your concern when needing to modify a large number of groups? > Are you concerned about the size of the writes needing to be parsed? It looks > like kernfs does support writes of larger than PAGE_SIZE, but it is not clear > to me that such large sizes will be required. > > There is also kernfs_find_and_get() that may be more convenient to use. Will look at this. There is also kernfs_name and kernfs_path. > I believe user space needs to provide control group name for a global > interface (the same name can be used by monitor groups belonging to > different control groups), and that can be used to narrow search. > > Reading your message I do not find any motivation _against_ a global > interface, except that it is not obvious to you how such interface may look > or work. That is fair. Peter seems to have ideas and a working implementation > that can be used as reference. So far I have only seen one comment [2] from James > that was skeptical about the global interface but the reason notes that MPAM > allocates counters per domain, which is the same as ABMC so we will need more > information from James here on what is required since he did not respond to > Peter. > > Below is a *hypothetical* interface to start a discussion that explores how > to support fine grained assignment in an interface that aims to be easy to use > by user space. Obviously Peter is also working on something so there > are many viewpoints to consider. > > File info/L3_MON/mbm_assign_control: > #control_group/mon_group/flags > ctrl_a/mon_a/00=_;01=_ > ctrl_a/mon_b/00=l;01=t > ctrl_b/mon_c/00=lt;01=lt I think you left few things here(Like the default control_mon group). To make more clear, let me list all the groups here based this. When none of the counters assigned: $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control resctrl/00=none,none;01=none,none (#default control_mon group) resctrl/mon_a/00=none,none;01=none,none (#mon group) resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group) resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group) When some counters are assigned: $echo "resctrl/00=total,local" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to default group) $echo "resctrl/mon_a/00=total;01=total" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon group) $echo "resctrl/ctrl_a/00=local;01=local" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control resctrl/00=total,local;01=none,none (#default control_mon group) resctrl/mon_a/00=total,none;01=total,none (#mon group) resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group) resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group) Few comments about this approach: 1.This will involve lots of text processing in the kernel. Will need to figure out calls for these processing. 2.In this approach there is no way to list assignment of a single group(like group resctrl/ctrl_a/mon_ab alone). 3. This is similar to fine grained approach we discussed but in global level. Want to get Pater/James comments about this approach. > > Above file displays to user: > * No counters are assigned to monitor group mon_a within control group ctrl_a > * Counter for local MBM is assigned to domain 0 of monitor group mon_b within > control group ctrl_a > * Counter for total MBM is assigned to domain 1 of monitor group mon_b within > control group ctrl_a > * Counters for local and total MBM are assigned to both domains of monitor > group mon_c within control group ctrl_b > > With above interface user space can, with a single read, get insight into > how counters are assigned across all monitor groups. > User space can write to the file to modify the flags. If assigning a new > counter when no more counters are available then the write will fail. > Potentially, if changes are made in order provided by the user then > the user will be able to unassign counters from one group and re-assign to > another group with a single write. > > I provide this purely to generate some ideas and gather more thoughts on > a global interface. > > Reinette > > [1] https://lore.kernel.org/lkml/2f373abf-f0c0-4f5d-9e22-1039a40a57f0@arm.com/ > [2] https://lore.kernel.org/lkml/1a8c1cd6-a1ce-47a2-bc87-d4cccc84519b@arm.com/ > > > > >
Hi Babu, On 2/29/2024 12:37 PM, Moger, Babu wrote: > On 2/28/24 14:04, Reinette Chatre wrote: >> On 2/28/2024 9:59 AM, Moger, Babu wrote: >>> On 2/27/24 17:50, Reinette Chatre wrote: >>>> On 2/27/2024 10:12 AM, Moger, Babu wrote: >>>>> On 2/26/24 15:20, Reinette Chatre wrote: >>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote: >>>>>>> On 2/23/24 16:21, Reinette Chatre wrote: >>>> >> >>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then >>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs). >>>>>> User space could theoretically create more monitor groups than the number of >>>>>> rmids that a resource claims to support using current upstream enumeration. >>>>> >>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 >>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create >>>>> more than this limit(r->num_rmid). >>>>> >>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with >>>>> RMID to assign the monitoring. So, assignment limit is >>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. >>>> >>>> I see. Thank you for clarifying. This does make enabling simpler and one >>>> less user interface item that needs changing. >>>> >>>> ... >>>> >>>>>>> 2. /sys/fs/resctrl/monitor_state. >>>>>>> This can used to individually assign or unassign the counters in each group. >>>>>>> >>>>>>> When assigned: >>>>>>> #cat /sys/fs/resctrl/monitor_state >>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign >>>>>>> >>>>>>> When unassigned: >>>>>>> #cat /sys/fs/resctrl/monitor_state >>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign >>>>>>> >>>>>>> >>>>>>> Thoughts? >>>>>> >>>>>> How do you expect this interface to be used? I understand the mechanics >>>>>> of this interface but on a higher level, do you expect user space to >>>>>> once in a while assign a new counter to a single event or monitor group >>>>>> (for which a fine grained interface works) or do you expect user space to >>>>>> shift multiple counters across several monitor events at intervals? >>>>> >>>>> I think we should provide both the options. I was thinking of providing >>>>> fine grained interface first. >>>> >>>> Could you please provide a motivation for why two interfaces, one inefficient >>>> and one not, should be created and maintained? Users can still do fine grained >>>> assignment with a global assignment interface. >>> >>> Lets consider one by one. >>> >>> 1. Fine grained assignment. >>> >>> It will be part of the mongroup(or control mongroup). User has the access >>> to the group and can query the group's current status before assigning or >>> unassigning. >>> >>> $cd /sys/fs/resctrl/ctrl_mon1 >>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state >>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign; >>> >>> Assign the total event >>> >>> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >>> >>> Assign the local event >>> >>> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >>> >>> Assign both events: >>> >>> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >>> >>> Check the assignment status. >>> >>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state >>> 0=total-assign,local-assign;1=total-unassign,local-unassign; >>> >>> -User interface is simple. >> >> This should not be the only motivation. Please do not sacrifice efficiency >> and usability just to have a simple interface. One can also argue that this >> interface can only be considered simple from the kernel implementation perspective, >> from user space it seems complicated. For example, as James pointed out earlier [1], >> user space would need to walk the entire resctrl to find out where counters are >> assigned. Peter also pointed out how the multiple syscalls needed when adjusting >> hundreds of monitor groups is inefficient. Please take all feedback into account. >> >> You consider "simple interface" as a motivation, there seems to be at least two >> arguments against this interface. Please consider these in your comparison >> between interfaces. These are things that should be noted and make their way to >> the cover letter. >> >>> >>> -Assignment will fail if all the h/w counters are exhausted. User needs to >>> unassign a counter from another group and use that counter here. This can >>> be done just querying the monitor state of another group. >> >> Right ... and as you state there can be hundreds of monitor groups that >> user space would need to walk and query to get this information. >> >>> >>> -Monitor group's details(cpus, tasks) are part of the group. So, it is >>> better to have assignment state inside the group. >> >> The assignment state should be clear from the event file. >> >>> Note: Used interface names here just to give example. >>> >>> >>> 2. global assignment: >>> >>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/ >>> directory. >>> >>> In case there are 100 mongroups, we need to have a way to list current >>> assignment status for these groups. I am not sure how to list status of >>> these 100 groups. >> >> The kernel has many examples of interfaces that manages status of a large >> number of entities. I am thinking, for example, we can learn a lot from >> how dynamic debug works. On my system I see: >> >> $ wc -l /sys/kernel/debug/dynamic_debug/control >> 5359 /sys/kernel/debug/dynamic_debug/control >> >>> >>> If user is wants to assign the local event(or total) in a specific group >>> in this list of 100 groups, I am not sure how to provide interface for >>> that. Should we pass the name of mongroup? That will involve looping >>> through using the call kernfs_walk_and_get. This may be ok if we are >>> dealing with very small number of groups. >>> >> >> What is your concern when needing to modify a large number of groups? >> Are you concerned about the size of the writes needing to be parsed? It looks >> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear >> to me that such large sizes will be required. >> >> There is also kernfs_find_and_get() that may be more convenient to use. > > Will look at this. There is also kernfs_name and kernfs_path. > >> I believe user space needs to provide control group name for a global >> interface (the same name can be used by monitor groups belonging to >> different control groups), and that can be used to narrow search. >> >> Reading your message I do not find any motivation _against_ a global >> interface, except that it is not obvious to you how such interface may look >> or work. That is fair. Peter seems to have ideas and a working implementation >> that can be used as reference. So far I have only seen one comment [2] from James >> that was skeptical about the global interface but the reason notes that MPAM >> allocates counters per domain, which is the same as ABMC so we will need more >> information from James here on what is required since he did not respond to >> Peter. >> >> Below is a *hypothetical* interface to start a discussion that explores how >> to support fine grained assignment in an interface that aims to be easy to use >> by user space. Obviously Peter is also working on something so there >> are many viewpoints to consider. >> >> File info/L3_MON/mbm_assign_control: >> #control_group/mon_group/flags >> ctrl_a/mon_a/00=_;01=_ >> ctrl_a/mon_b/00=l;01=t >> ctrl_b/mon_c/00=lt;01=lt > > I think you left few things here(Like the default control_mon group). No. Similar to proc_resctrl_show() the fields can be empty for the default group or mon groups belonging to control group. > > To make more clear, let me list all the groups here based this. > > When none of the counters assigned: > > $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control > resctrl/00=none,none;01=none,none (#default control_mon group) > resctrl/mon_a/00=none,none;01=none,none (#mon group) > resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group) > resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group) I am concerned that inconsistent use of "/" will make parsing hard. I find "resctrl" and all the "none" redundant. It is not clear what this improves. Why have: resctrl/00=none,none;01=none,none when this could do: //00=_;01=_ > When some counters are assigned: > > $echo "resctrl/00=total,local" > > /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to > default group) > > $echo "resctrl/mon_a/00=total;01=total" > > /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon > group) > > $echo "resctrl/ctrl_a/00=local;01=local" > > /sys/fs/resctrl/info/L3_MON/mbm_assign_control > > $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" > > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control > We could learn some more lessons from dynamic debug (see Documentation/admin-guide/dynamic-debug-howto.rst). For example, "=" can be used to make an assignment while "+" can be used to add a counter and "-" can be used to remove a counter. "=_" can be used to remove counters from all events in that domain. The interface should also support assign/un-assign to multiple groups with a single write. To start this could use '\n' as separator as is the custom with other resctrl interfaces. > $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control > resctrl/00=total,local;01=none,none (#default control_mon group) > resctrl/mon_a/00=total,none;01=total,none (#mon group) > resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group) > resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group) > > > Few comments about this approach: > 1.This will involve lots of text processing in the kernel. Will need to > figure out calls for these processing. I see that additional parsing will be needed to determine control group and monitor group. For these it sounds like you already have a few options for kernfs API to use. Apart from that the counter assignment will be similar parsing as what was done in your previous versions. I think parsing will be easier if it does not try to use words for the events but just use one letter flags. For example, there is thus no need to look for "," in the parsing of the events, just parse one character at a time where each character has a specific meaning. > > 2.In this approach there is no way to list assignment of a single > group(like group resctrl/ctrl_a/mon_ab alone). Should the kernel be responsible for enabling this? User space can just do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient? > > 3. This is similar to fine grained approach we discussed but in global level. That is what I have been trying to get across. This has full benefit of the original implementation while also addressing all problems raised against it. > > Want to get Pater/James comments about this approach. (Peter) Of course. I look forward to that. Once agreed it may also be worthwhile to approach x86 maintainers with an RFC of the proposed new user interface to learn their guidance. This is where it is important to keep track of all the requirements, as well as pros and cons of different options. Reinette
Hi Reinette, On 2/29/24 15:50, Reinette Chatre wrote: > Hi Babu, > > On 2/29/2024 12:37 PM, Moger, Babu wrote: >> On 2/28/24 14:04, Reinette Chatre wrote: >>> On 2/28/2024 9:59 AM, Moger, Babu wrote: >>>> On 2/27/24 17:50, Reinette Chatre wrote: >>>>> On 2/27/2024 10:12 AM, Moger, Babu wrote: >>>>>> On 2/26/24 15:20, Reinette Chatre wrote: >>>>>>> On 2/26/2024 9:59 AM, Moger, Babu wrote: >>>>>>>> On 2/23/24 16:21, Reinette Chatre wrote: >>>>> >>> >>>>>>> For example, if I understand correctly, theoretically, when ABMC is enabled then >>>>>>> "num_rmids" can be U32_MAX (after a quick look it is not clear to me why r->num_rmid >>>>>>> is not unsigned, tbd if number of directories may also be limited by kernfs). >>>>>>> User space could theoretically create more monitor groups than the number of >>>>>>> rmids that a resource claims to support using current upstream enumeration. >>>>>> >>>>>> CPU or task association still uses PQR_ASSOC(MSR C8Fh). There are only 11 >>>>>> bits(depends on specific h/w) to represent RMIDs. So, we cannot create >>>>>> more than this limit(r->num_rmid). >>>>>> >>>>>> In case of ABMC, h/w uses another counter(mbm_assignable_counters) with >>>>>> RMID to assign the monitoring. So, assignment limit is >>>>>> mbm_assignable_counters. The number of mon groups limit is still r->num_rmid. >>>>> >>>>> I see. Thank you for clarifying. This does make enabling simpler and one >>>>> less user interface item that needs changing. >>>>> >>>>> ... >>>>> >>>>>>>> 2. /sys/fs/resctrl/monitor_state. >>>>>>>> This can used to individually assign or unassign the counters in each group. >>>>>>>> >>>>>>>> When assigned: >>>>>>>> #cat /sys/fs/resctrl/monitor_state >>>>>>>> 0=total-assign,local-assign;1=total-assign,local-assign >>>>>>>> >>>>>>>> When unassigned: >>>>>>>> #cat /sys/fs/resctrl/monitor_state >>>>>>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign >>>>>>>> >>>>>>>> >>>>>>>> Thoughts? >>>>>>> >>>>>>> How do you expect this interface to be used? I understand the mechanics >>>>>>> of this interface but on a higher level, do you expect user space to >>>>>>> once in a while assign a new counter to a single event or monitor group >>>>>>> (for which a fine grained interface works) or do you expect user space to >>>>>>> shift multiple counters across several monitor events at intervals? >>>>>> >>>>>> I think we should provide both the options. I was thinking of providing >>>>>> fine grained interface first. >>>>> >>>>> Could you please provide a motivation for why two interfaces, one inefficient >>>>> and one not, should be created and maintained? Users can still do fine grained >>>>> assignment with a global assignment interface. >>>> >>>> Lets consider one by one. >>>> >>>> 1. Fine grained assignment. >>>> >>>> It will be part of the mongroup(or control mongroup). User has the access >>>> to the group and can query the group's current status before assigning or >>>> unassigning. >>>> >>>> $cd /sys/fs/resctrl/ctrl_mon1 >>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state >>>> 0=total-unassign,local-unassign;1=total-unassign,local-unassign; >>>> >>>> Assign the total event >>>> >>>> $echo 0=total-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >>>> >>>> Assign the local event >>>> >>>> $echo 0=local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >>>> >>>> Assign both events: >>>> >>>> $echo 0=total-assign,local-assign > /sys/fs/resctrl/ctrl_mon1/monitor_state >>>> >>>> Check the assignment status. >>>> >>>> $cat /sys/fs/resctrl/ctrl_mon1/monitor_state >>>> 0=total-assign,local-assign;1=total-unassign,local-unassign; >>>> >>>> -User interface is simple. >>> >>> This should not be the only motivation. Please do not sacrifice efficiency >>> and usability just to have a simple interface. One can also argue that this >>> interface can only be considered simple from the kernel implementation perspective, >>> from user space it seems complicated. For example, as James pointed out earlier [1], >>> user space would need to walk the entire resctrl to find out where counters are >>> assigned. Peter also pointed out how the multiple syscalls needed when adjusting >>> hundreds of monitor groups is inefficient. Please take all feedback into account. >>> >>> You consider "simple interface" as a motivation, there seems to be at least two >>> arguments against this interface. Please consider these in your comparison >>> between interfaces. These are things that should be noted and make their way to >>> the cover letter. >>> >>>> >>>> -Assignment will fail if all the h/w counters are exhausted. User needs to >>>> unassign a counter from another group and use that counter here. This can >>>> be done just querying the monitor state of another group. >>> >>> Right ... and as you state there can be hundreds of monitor groups that >>> user space would need to walk and query to get this information. >>> >>>> >>>> -Monitor group's details(cpus, tasks) are part of the group. So, it is >>>> better to have assignment state inside the group. >>> >>> The assignment state should be clear from the event file. >>> >>>> Note: Used interface names here just to give example. >>>> >>>> >>>> 2. global assignment: >>>> >>>> I would assume the interface file will be in /sys/fs/resctrl/info/L3_MON/ >>>> directory. >>>> >>>> In case there are 100 mongroups, we need to have a way to list current >>>> assignment status for these groups. I am not sure how to list status of >>>> these 100 groups. >>> >>> The kernel has many examples of interfaces that manages status of a large >>> number of entities. I am thinking, for example, we can learn a lot from >>> how dynamic debug works. On my system I see: >>> >>> $ wc -l /sys/kernel/debug/dynamic_debug/control >>> 5359 /sys/kernel/debug/dynamic_debug/control >>> >>>> >>>> If user is wants to assign the local event(or total) in a specific group >>>> in this list of 100 groups, I am not sure how to provide interface for >>>> that. Should we pass the name of mongroup? That will involve looping >>>> through using the call kernfs_walk_and_get. This may be ok if we are >>>> dealing with very small number of groups. >>>> >>> >>> What is your concern when needing to modify a large number of groups? >>> Are you concerned about the size of the writes needing to be parsed? It looks >>> like kernfs does support writes of larger than PAGE_SIZE, but it is not clear >>> to me that such large sizes will be required. >>> >>> There is also kernfs_find_and_get() that may be more convenient to use. >> >> Will look at this. There is also kernfs_name and kernfs_path. >> >>> I believe user space needs to provide control group name for a global >>> interface (the same name can be used by monitor groups belonging to >>> different control groups), and that can be used to narrow search. >>> >>> Reading your message I do not find any motivation _against_ a global >>> interface, except that it is not obvious to you how such interface may look >>> or work. That is fair. Peter seems to have ideas and a working implementation >>> that can be used as reference. So far I have only seen one comment [2] from James >>> that was skeptical about the global interface but the reason notes that MPAM >>> allocates counters per domain, which is the same as ABMC so we will need more >>> information from James here on what is required since he did not respond to >>> Peter. >>> >>> Below is a *hypothetical* interface to start a discussion that explores how >>> to support fine grained assignment in an interface that aims to be easy to use >>> by user space. Obviously Peter is also working on something so there >>> are many viewpoints to consider. >>> >>> File info/L3_MON/mbm_assign_control: >>> #control_group/mon_group/flags >>> ctrl_a/mon_a/00=_;01=_ >>> ctrl_a/mon_b/00=l;01=t >>> ctrl_b/mon_c/00=lt;01=lt >> >> I think you left few things here(Like the default control_mon group). > > No. Similar to proc_resctrl_show() the fields can be empty for > the default group or mon groups belonging to control group. ok. Need to understand this better. Hope I learn while doing this work. > >> >> To make more clear, let me list all the groups here based this. >> >> When none of the counters assigned: >> >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control >> resctrl/00=none,none;01=none,none (#default control_mon group) >> resctrl/mon_a/00=none,none;01=none,none (#mon group) >> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group) >> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group) > > I am concerned that inconsistent use of "/" will make parsing hard. Do you mean, you don't want to see multiple "/"? resctrl/ctrl_a/mon_ab/ Change to mon_ab/ > > I find "resctrl" and all the "none" redundant. It is not clear what > this improves. > Why have: > resctrl/00=none,none;01=none,none > when this could do: > //00=_;01=_ ok. "//" meaning root of resctrl filesystem? > > >> When some counters are assigned: >> >> $echo "resctrl/00=total,local" > >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to >> default group) >> >> $echo "resctrl/mon_a/00=total;01=total" > >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon >> group) >> >> $echo "resctrl/ctrl_a/00=local;01=local" > >> /sys/fs/resctrl/info/L3_MON/mbm_assign_control >> >> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" > >> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control >> > > We could learn some more lessons from dynamic debug (see > Documentation/admin-guide/dynamic-debug-howto.rst). > For example, "=" can be used to make an assignment while "+" > can be used to add a counter and "-" can be used to remove a counter. > "=_" can be used to remove counters from all events in that domain. Yes. Looked at dynamic debug. I am still learning this interface. Some examples below based on my understanding. To assign a counters to default group on domain 0. $echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control To assign a counters to mon group inside the default group. $echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control To assign a counters to control mon group inside the default group. $echo "ctrl_a/00=+l;01=+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control To assign a counters to control mon group inside another control group. $echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro To unassign a counters to control mon group inside another control group. $echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control To unassign all the counters on a specific group. $echo "mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control It does not matter control group or mon group. We just need to name of the group in this interface. Listing will be $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control //00=lt;01=lt /mon_a/00=t;01=t /ctrl_a/00=l;01=l /mon_ab/00=_;01=_ > > The interface should also support assign/un-assign to multiple groups with > a single write. To start this could use '\n' as separator as is the custom > with other resctrl interfaces. Yes. that should be fine. > >> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control >> resctrl/00=total,local;01=none,none (#default control_mon group) >> resctrl/mon_a/00=total,none;01=total,none (#mon group) >> resctrl/ctrl_a/00=none,local;01=none,local (#control_mon group) >> resctrl/ctrl_a/mon_ab/00=total,local;01=total,local (#mon group) >> >> >> Few comments about this approach: >> 1.This will involve lots of text processing in the kernel. Will need to >> figure out calls for these processing. > > I see that additional parsing will be needed to determine control group > and monitor group. For these it sounds like you already have a few options > for kernfs API to use. > Apart from that the counter assignment will be similar parsing as what > was done in your previous versions. I think parsing will be easier if it > does not try to use words for the events but just use one letter flags. > For example, there is thus no need to look for "," in the parsing of the > events, just parse one character at a time where each character has a > specific meaning. ok. > >> >> 2.In this approach there is no way to list assignment of a single >> group(like group resctrl/ctrl_a/mon_ab alone). > > Should the kernel be responsible for enabling this? User space can just > do a "cat mbm_assign_control | grep mon_ab". Is this not sufficient? That may be ok. Peter, Please comment on this. > >> >> 3. This is similar to fine grained approach we discussed but in global level. > > That is what I have been trying to get across. This has full benefit of the > original implementation while also addressing all problems raised against it. > >> >> Want to get Pater/James comments about this approach. > (Peter) > > Of course. I look forward to that. Once agreed it may also be worthwhile to > approach x86 maintainers with an RFC of the proposed new user interface to learn > their guidance. This is where it is important to keep track of all the requirements, > as well as pros and cons of different options. Ok. Sure. I am fine making next version as RFC. > > Reinette
Hi Babu, On 3/1/2024 12:36 PM, Moger, Babu wrote: > On 2/29/24 15:50, Reinette Chatre wrote: >> On 2/29/2024 12:37 PM, Moger, Babu wrote: .. >>> To make more clear, let me list all the groups here based this. >>> >>> When none of the counters assigned: >>> >>> $cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control >>> resctrl/00=none,none;01=none,none (#default control_mon group) >>> resctrl/mon_a/00=none,none;01=none,none (#mon group) >>> resctrl/ctrl_a/00=none,none;01=none,none (#control_mon group) >>> resctrl/ctrl_a/mon_ab/00=none,none;01=none,none (#mon group) >> >> I am concerned that inconsistent use of "/" will make parsing hard. > > Do you mean, you don't want to see multiple "/"? No. I think that having a consistent number of "/" will be easier to parse. In the above example, there are instances of 1, 2, as well as three "/" among the lines. That seems complicated to parse. I was thinking that it will make interpreting and parsing easier if there consistently are just always two "/". (You may find things to be different once you work on the parsing code though.) In summary: * for monitoring of default CTRL_MON group: "//<flags>" * for MON_GROUP inside default CTRL_MON group: "/<MON group>/<flags>" * for monitoring of non-default CTRL_MON group: "<CTRL_MON group>//flags" * for MON_GROUP within CTRL_MON group: "<CTRL_MON group>/<MON group>/<flags>" What do you think? > > resctrl/ctrl_a/mon_ab/ > > Change to > > mon_ab/ rather: ctrl_a/mon_ab/<flags> > >> >> I find "resctrl" and all the "none" redundant. It is not clear what >> this improves. >> Why have: >> resctrl/00=none,none;01=none,none >> when this could do: >> //00=_;01=_ > > ok. > > "//" meaning root of resctrl filesystem? More specifically, monitoring of default control group. It is not intended to specify a pathname. >>> When some counters are assigned: >>> >>> $echo "resctrl/00=total,local" > >>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to >>> default group) >>> >>> $echo "resctrl/mon_a/00=total;01=total" > >>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control (#assigning counter to mon >>> group) >>> >>> $echo "resctrl/ctrl_a/00=local;01=local" > >>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control >>> >>> $echo "resctrl/ctrl_a/mon_ab/00=total,local;01=total,local" > >>> /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control >>> >> >> We could learn some more lessons from dynamic debug (see Documentation/admin-guide/dynamic-debug-howto.rst). For example, "=" can be used to make an assignment while "+" >> can be used to add a counter and "-" can be used to remove a counter. >> "=_" can be used to remove counters from all events in that domain. > > Yes. Looked at dynamic debug. I am still learning this interface. Some examples below based on my understanding. > > To assign a counters to default group on domain 0. > $echo "//00=+lt;01=+lt" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control It should not be necessary to use both "=" and "+" in the same assignment. I think of "=" as "assign" and "+" as append ("-" as remove). An example of this, just focusing on default group. #hypothetical start state of no counters assigned $ cat mbm_assign_control #control_group/monitor_group/flags //00=_;01=_ #assign counter to total MBM of both domains $ echo "//00=t;01=t" > mbm_assign_control $ cat mbm_assign_control #control_group/monitor_group/flags //00=t;01=t #add counter to local MBM of both domains without impacting total MBM counters $echo "//00+l;01+l" > mbm_assign_control $ cat mbm_assign_control #control_group/monitor_group/flags //00=tl;01=tl #remove local MBM counters without impacting total MBM counters $echo "//00-l;01-l" > mbm_assign_control $ cat mbm_assign_control #control_group/monitor_group/flags //00=t;01=t #assign local MBM counters, removing total MBM counters while doing so $echo "//00=l;01=l" > mbm_assign_control $ cat mbm_assign_control #control_group/monitor_group/flags //00=l;01=l #remove all counters $echo "//00=_;01=_" > mbm_assign_control $ cat mbm_assign_control #control_group/monitor_group/flags //00=_;01=_ > > To assign a counters to mon group inside the default group. > $echo "mon_a/00=+t;01=+t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control I think it will simplify parsing if number of "/" is consistent: $echo "/mon_a/00=t;01=t" > ... > > To assign a counters to control mon group inside the default group. It is not clear to me what you mean with this. > $echo "ctrl_a/00=+l;01=+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control this looks similar to previous example, so I think it will be hard for parser to know whether it is dealing with control group or monitor group. I am not sure I understand your example, but this may perhaps be: echo "ctrl_a//00=l;01=l > ... > > To assign a counters to control mon group inside another control group. I do not know what you mean with "another control group" > $echo "mon_ab/00=+lt;01=+lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_contro How will parser know which control group? I was expecting: $ echo "<CTRL_MON group>/<MON group>/<flags>" > > To unassign a counters to control mon group inside another control group. > $echo "mon_ab/00=-lt;01=-lt" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control > > To unassign all the counters on a specific group. > $echo "mon_ab/00=_" > /sys/fs/resctrl/nfo/L3_MON/mbm_assign_control > > It does not matter control group or mon group. We just need to name of the group in this interface. It matters because users can have monitor groups with the same name within different control groups. Reinette