[v10,11/13] x86/resctrl: Add interface to write mbm_total_bytes_config
Commit Message
The event configuration for mbm_total_bytes can be changed by the user by
writing to the file /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config.
The event configuration settings are domain specific and affect all the
CPUs in the domain.
The following are the types of events supported:
==== ===========================================================
Bits Description
==== ===========================================================
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain
==== ===========================================================
For example:
To change mbm_total_bytes to count only reads on domain 0, bits 0, 1, 4
and 5 need to be set, which is 110011b (0x33 in hex). Run the command:
$echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
To change mbm_total_bytes to count all reads to slow memory on domain 1,
bits 4 and 5 need to be set, which is 110000b (0x30 in hex). Run the command:
$echo 1=0x30 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
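For reference, a sketch of how the example masks decompose into the event
bits from the table above. The BIT() names mirror the defines added to
msr-index.h elsewhere in this series; the EXAMPLE_* helpers are illustrative
only:

	#define READS_TO_LOCAL_MEM		BIT(0)
	#define READS_TO_REMOTE_MEM		BIT(1)
	#define READS_TO_LOCAL_S_MEM		BIT(4)
	#define READS_TO_REMOTE_S_MEM		BIT(5)

	/* 0x33: reads to regular and slow memory, local and non-local */
	#define EXAMPLE_ALL_READS	(READS_TO_LOCAL_MEM | \
					 READS_TO_REMOTE_MEM | \
					 READS_TO_LOCAL_S_MEM | \
					 READS_TO_REMOTE_S_MEM)

	/* 0x30: reads to slow memory only */
	#define EXAMPLE_SLOW_READS	(READS_TO_LOCAL_S_MEM | \
					 READS_TO_REMOTE_S_MEM)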
Signed-off-by: Babu Moger <babu.moger@amd.com>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 17 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 126 ++++++++++++++++++++++++-
include/linux/resctrl.h | 11 +++
3 files changed, 153 insertions(+), 1 deletion(-)
Comments
Hi Babu,
On 12/22/2022 3:31 PM, Babu Moger wrote:
...
> +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + struct rdt_resource *r = of->kn->parent->priv;
> + int ret;
> +
> + /* Valid input requires a trailing newline */
> + if (nbytes == 0 || buf[nbytes - 1] != '\n')
> + return -EINVAL;
> +
> + cpus_read_lock();
Could you please elaborate why this lock is needed here as
well as in the following patch?
Reinette
Hi Reinette,
On 1/4/23 18:29, Reinette Chatre wrote:
> Hi Babu,
>
> On 12/22/2022 3:31 PM, Babu Moger wrote:
>
> ...
>
>> +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
>> + char *buf, size_t nbytes,
>> + loff_t off)
>> +{
>> + struct rdt_resource *r = of->kn->parent->priv;
>> + int ret;
>> +
>> + /* Valid input requires a trailing newline */
>> + if (nbytes == 0 || buf[nbytes - 1] != '\n')
>> + return -EINVAL;
>> +
>> + cpus_read_lock();
> Could you please elaborate why this lock is needed here as
> well as in the following patch?
Holding the cpus_read_lock() makes sure that the CPU is online while doing
this operation. This code eventually sends an IPI to write the MSR on one
of the CPUs using the cpumasks. My understanding is that it makes sure the
cpumask is stable while handling this write. The same thing is done in
rdtgroup_schemata_write.
Thanks
Babu
>
> Reinette
>
Hi Babu,
On 1/5/2023 8:04 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 1/4/23 18:29, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 12/22/2022 3:31 PM, Babu Moger wrote:
>>
>> ...
>>
>>> +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
>>> + char *buf, size_t nbytes,
>>> + loff_t off)
>>> +{
>>> + struct rdt_resource *r = of->kn->parent->priv;
>>> + int ret;
>>> +
>>> + /* Valid input requires a trailing newline */
>>> + if (nbytes == 0 || buf[nbytes - 1] != '\n')
>>> + return -EINVAL;
>>> +
>>> + cpus_read_lock();
>> Could you please elaborate why this lock is needed here as
>> well as in the following patch?
>
> Holding the cpus_read_lock() makes sure that the CPU is online while doing
> this operation. This code eventually sends an IPI to write the MSR on one
> of the CPUs using the cpumasks. My understanding is that it makes sure the
> cpumask is stable while handling this write. The same thing is done in
This flow uses smp_call_function_any() to send the IPI and update the MSR.
smp_call_function_any() itself disables preemption to protect against
CPUs going offline while attempting the update.
The domain's cpumask itself cannot change during this flow because rdtgroup_mutex
is held the entire time. This mutex is needed by the resctrl CPU online/offline
callbacks that may update the mask.
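A minimal sketch of the protection this provides, using the helpers from
this patch:

	mutex_lock(&rdtgroup_mutex);
	/*
	 * d->cpu_mask is stable here: the resctrl CPU online/offline
	 * callbacks that modify it also take rdtgroup_mutex.
	 */
	smp_call_function_any(&d->cpu_mask, mon_event_config_write,
			      &mon_info, 1);
	/*
	 * smp_call_function_any() disables preemption internally, so the
	 * CPU it picks from d->cpu_mask cannot go offline while
	 * mon_event_config_write() runs on it.
	 */
	mutex_unlock(&rdtgroup_mutex);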
> rdtgroup_schemata_write.
Yes, rdtgroup_schemata_write uses this but please take a look at _why_ it
is using it. This was something added later as part of the pseudo-locking
code. Please see the commit message for the details that explain the usage:
80b71c340f17 ("x86/intel_rdt: Ensure a CPU remains online for the region's pseudo-locking sequence")
Could you please provide more detail if you still find that this lock is needed?
If you prefer to refer to existing code flows, there are other examples
in resctrl where the domain's CPU mask is used to read/write registers without the
hotplug lock that you can use for reference:
* Even in this patch series itself, reading of the config.
* When creating a new resource group (the mkdir flow) the MSRs are written with an
initial config without the hotplug lock.
* When writing to the tasks file, the CPU on which the task may be running receives an IPI
without the hotplug lock held the entire time.
* See resctrl flow of monitoring data reads.
Alternatively, you may want to take a closer look at where the hotplug lock _is_ held in
resctrl to consider if those usages match this work. Understanding
why the hotplug lock is currently used should be clear from the commits associated
with its introduction, because there have been a few bugs surrounding this.
Reinette
Hi Reinette,
On 1/5/23 11:49, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/5/2023 8:04 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 1/4/23 18:29, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 12/22/2022 3:31 PM, Babu Moger wrote:
>>>
>>> ...
>>>
>>>> +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
>>>> + char *buf, size_t nbytes,
>>>> + loff_t off)
>>>> +{
>>>> + struct rdt_resource *r = of->kn->parent->priv;
>>>> + int ret;
>>>> +
>>>> + /* Valid input requires a trailing newline */
>>>> + if (nbytes == 0 || buf[nbytes - 1] != '\n')
>>>> + return -EINVAL;
>>>> +
>>>> + cpus_read_lock();
>>> Could you please elaborate why this lock is needed here as
>>> well as in the following patch?
>> Holding the cpus_read_lock() makes sure that the CPU is online while doing
>> this operation. This code eventually sends an IPI to write the MSR on one
>> of the CPUs using the cpumasks. My understanding is that it makes sure the
>> cpumask is stable while handling this write. The same thing is done in
> This flow uses smp_call_function_any() to send the IPI and update the MSR.
> smp_call_function_any() itself disables preemption to protect against
> CPUs going offline while attempting the update.
>
> The domain's cpumask itself cannot change during this flow because rdtgroup_mutex
> is held the entire time. This mutex is needed by the resctrl CPU online/offline
> callbacks that may update the mask.
Ok
>
>> rdtgroup_schemata_write.
> Yes, rdtgroup_schemata_write uses this but please take a look at _why_ it
> is using it. This was something added later as part of the pseudo-locking
> code. Please see the commit message for the details that explain the usage:
> 80b71c340f17 ("x86/intel_rdt: Ensure a CPU remains online for the region's pseudo-locking sequence")
Ok. That makes sense.
>
> Could you please provide more detail if you still find that this lock is needed?
> If you prefer to refer to existing code flows, there are other examples
> in resctrl where the domain's CPU mask is used to read/write registers without the
> hotplug lock that you can use for reference:
> * Even in this patch series itself, reading of the config.
> * When creating a new resource group (the mkdir flow) the MSRs are written with an
> initial config without the hotplug lock.
> * When writing to the tasks file, the CPU on which the task may be running receives an IPI
> without the hotplug lock held the entire time.
> * See resctrl flow of monitoring data reads.
>
> Alternatively, you may want to take a closer look at where the hotplug lock _is_ held in
> resctrl to consider if those usages match this work. Understanding
> why the hotplug lock is currently used should be clear from the commits associated
> with its introduction, because there have been a few bugs surrounding this.
This was an oversight on my part from looking at the code in rdtgroup_schemata_write.
I was concerned about the domain's cpu_mask being changed while doing the IPI.
Looks like that is safe with rdtgroup_mutex.
I will remove the cpus_read_lock() from here and from the following patch, run a
few tests, and send the updated patches soon.
Thanks for the explanation.
Thanks
Babu
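For context, a hypothetical user-space sketch of exercising the new write
interface added below. The file path and value format come from the commit
message; error handling is trimmed, and the read-back format assumes the
matching show path elsewhere in this series:

	/* Hypothetical test program, not part of the patch. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path =
			"/sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config";
		char buf[128];
		ssize_t n;
		int fd;

		/* Count only reads (bits 0, 1, 4 and 5) on domain 0 */
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return 1;
		if (write(fd, "0=0x33\n", 7) != 7)
			perror("write");
		close(fd);

		/* Read the per-domain config back, e.g. "0=0x33;1=0x7f" */
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return 1;
		n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			fputs(buf, stdout);
		}
		close(fd);
		return 0;
	}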
@@ -176,6 +176,23 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
memset(am, 0, sizeof(*am));
}
+/*
+ * Assumes that hardware counters are also reset and thus that there is
+ * no need to record initial non-zero counts.
+ */
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+{
+ struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+
+ if (is_mbm_total_enabled())
+ memset(hw_dom->arch_mbm_total, 0,
+ sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
+
+ if (is_mbm_local_enabled())
+ memset(hw_dom->arch_mbm_local, 0,
+ sizeof(*hw_dom->arch_mbm_local) * r->num_rmid);
+}
+
static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
{
u64 shift = 64 - width, chunks;
@@ -1515,6 +1515,129 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of,
return 0;
}
+static void mon_event_config_write(void *info)
+{
+ struct mon_config_info *mon_info = info;
+ unsigned int index;
+
+ index = mon_event_config_index_get(mon_info->evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", mon_info->evtid);
+ return;
+ }
+ wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
+}
+
+static int mbm_config_write_domain(struct rdt_resource *r,
+ struct rdt_domain *d, u32 evtid, u32 val)
+{
+ struct mon_config_info mon_info = {0};
+ int ret = 0;
+
+ /* mon_config cannot be more than the supported set of events */
+ if (val > MAX_EVT_CONFIG_BITS) {
+ rdt_last_cmd_puts("Invalid event configuration\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Read the current config value first. If both are the same then
+ * no need to write it again.
+ */
+ mon_info.evtid = evtid;
+ mondata_config_read(d, &mon_info);
+ if (mon_info.mon_config == val)
+ goto out;
+
+ mon_info.mon_config = val;
+
+ /*
+ * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the
+ * domain. The MSRs offset from MSR_IA32_EVT_CFG_BASE
+ * are scoped at the domain level. Writing any of these MSRs
+ * on one CPU is observed by all the CPUs in the domain.
+ */
+ smp_call_function_any(&d->cpu_mask, mon_event_config_write,
+ &mon_info, 1);
+
+ /*
+ * When an Event Configuration is changed, the bandwidth counters
+ * for all RMIDs and Events will be cleared by the hardware. The
+ * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for
+ * every RMID on the next read to any event.
+ * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62)
+ * cleared while it is tracked by the hardware. Clear the
+ * mbm_local and mbm_total counts for all the RMIDs.
+ */
+ resctrl_arch_reset_rmid_all(r, d);
+
+out:
+ return ret;
+}
+
+static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
+{
+ char *dom_str = NULL, *id_str;
+ unsigned long dom_id, val;
+ struct rdt_domain *d;
+ int ret = 0;
+
+next:
+ if (!tok || tok[0] == '\0')
+ return 0;
+
+ /* Start processing the strings for each domain */
+ dom_str = strim(strsep(&tok, ";"));
+ id_str = strsep(&dom_str, "=");
+
+ if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
+ rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n");
+ return -EINVAL;
+ }
+
+ if (!dom_str || kstrtoul(dom_str, 16, &val)) {
+ rdt_last_cmd_puts("Non-numeric event configuration value\n");
+ return -EINVAL;
+ }
+
+ list_for_each_entry(d, &r->domains, list) {
+ if (d->id == dom_id) {
+ ret = mbm_config_write_domain(r, d, evtid, val);
+ if (ret)
+ return -EINVAL;
+ goto next;
+ }
+ }
+
+ return -EINVAL;
+}
+
+static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ int ret;
+
+ /* Valid input requires a trailing newline */
+ if (nbytes == 0 || buf[nbytes - 1] != '\n')
+ return -EINVAL;
+
+ cpus_read_lock();
+ mutex_lock(&rdtgroup_mutex);
+
+ rdt_last_cmd_clear();
+
+ buf[nbytes - 1] = '\0';
+
+ ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID);
+
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+
+ return ret ?: nbytes;
+}
+
/* rdtgroup information files for one cache resource. */
static struct rftype res_common_files[] = {
{
@@ -1615,9 +1738,10 @@ static struct rftype res_common_files[] = {
},
{
.name = "mbm_total_bytes_config",
- .mode = 0444,
+ .mode = 0644,
.kf_ops = &rdtgroup_kf_single_ops,
.seq_show = mbm_total_bytes_config_show,
+ .write = mbm_total_bytes_config_write,
},
{
.name = "mbm_local_bytes_config",
@@ -250,6 +250,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
u32 rmid, enum resctrl_event_id eventid);
+/**
+ * resctrl_arch_reset_rmid_all() - Reset all private state associated with
+ * all rmids and eventids.
+ * @r: The resctrl resource.
+ * @d: The domain for which all architectural counter state will
+ * be cleared.
+ *
+ * This can be called from any CPU.
+ */
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+
extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;