x86/resctrl: Only show tasks' pids in current pid namespace

Message ID 20230116071246.97717-1-shawnwang@linux.alibaba.com
State New

Commit Message

Shawn Wang Jan. 16, 2023, 7:12 a.m. UTC
  When writing a task id to the "tasks" file in an rdtgroup,
rdtgroup_tasks_write() treats the pid as a number in the current pid
namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows
the list of global pids from the init namespace. If current pid namespace
is not the init namespace, pids in "tasks" will be confusing and incorrect.

To be more robust, let the "tasks" file only show pids in the current pid
namespace.

Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
  

Comments

Reinette Chatre Feb. 15, 2023, 9:43 p.m. UTC | #1
Hi Shawn,

On 1/15/2023 11:12 PM, Shawn Wang wrote:
> When writing a task id to the "tasks" file in an rdtgroup,
> rdtgroup_tasks_write() treats the pid as a number in the current pid
> namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows
> the list of global pids from the init namespace. If current pid namespace
> is not the init namespace, pids in "tasks" will be confusing and incorrect.
> 
> To be more robust, let the "tasks" file only show pids in the current pid
> namespace.
> 

Is it possible to elaborate more on the use case that this is aiming to
address? It is unexpected to me that resource management is approached from
within a container. My expectation is that the resource management and monitoring
is done from the host. 

> Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5993da21d822..9e97ae24c159 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -718,11 +718,15 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
>  static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
>  {
>  	struct task_struct *p, *t;
> +	pid_t pid;
>  
>  	rcu_read_lock();
>  	for_each_process_thread(p, t) {
> -		if (is_closid_match(t, r) || is_rmid_match(t, r))
> -			seq_printf(s, "%d\n", t->pid);
> +		if (is_closid_match(t, r) || is_rmid_match(t, r)) {
> +			pid = task_pid_vnr(t);
> +			if (pid)
> +				seq_printf(s, "%d\n", pid);
> +		}
>  	}
>  	rcu_read_unlock();
>  }

This looks like it would solve the stated problem. Does it slow down
reading a tasks file in a measurable way?

Reinette
  
Shawn Wang March 15, 2023, 3:06 p.m. UTC | #2
Hi Reinette,

On 2/16/23 5:43 AM, Reinette Chatre wrote:
> Hi Shawn,
> 
> On 1/15/2023 11:12 PM, Shawn Wang wrote:
>> When writing a task id to the "tasks" file in an rdtgroup,
>> rdtgroup_tasks_write() treats the pid as a number in the current pid
>> namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows
>> the list of global pids from the init namespace. If current pid namespace
>> is not the init namespace, pids in "tasks" will be confusing and incorrect.
>>
>> To be more robust, let the "tasks" file only show pids in the current pid
>> namespace.
>>
> 
> Is it possible to elaborate more on the use case that this is aiming to
> address? It is unexpected to me that resource management is approached from
> within a container. My expectation is that the resource management and monitoring
> is done from the host.

We have a scenario where we only want to mount the resctrl filesystem under a specific container.
We also found that the pids in the "tasks" file under resctrl are inconsistent with the pids reported by top.

Besides, the current rdtgroup_move_task() uses find_task_by_vpid() to resolve the real pid on the write side. Our modification also keeps the read side symmetric with rdtgroup_move_task().

>> ---
>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 5993da21d822..9e97ae24c159 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -718,11 +718,15 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
>>   static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
>>   {
>>   	struct task_struct *p, *t;
>> +	pid_t pid;
>>   
>>   	rcu_read_lock();
>>   	for_each_process_thread(p, t) {
>> -		if (is_closid_match(t, r) || is_rmid_match(t, r))
>> -			seq_printf(s, "%d\n", t->pid);
>> +		if (is_closid_match(t, r) || is_rmid_match(t, r)) {
>> +			pid = task_pid_vnr(t);
>> +			if (pid)
>> +				seq_printf(s, "%d\n", pid);
>> +		}
>>   	}
>>   	rcu_read_unlock();
>>   }
> 
> This looks like it would solve the stated problem. Does it slow down
> reading a tasks file in a measurable way?

We didn't test it, but the cost is proportional to the number of pids in the group.
In addition, only an if statement is added here, and reading the "tasks" interface
is not a frequent operation, so it will not be a bottleneck.

Thanks,
Shawn
  
Reinette Chatre March 16, 2023, 9:41 p.m. UTC | #3
Hi Shawn,

On 3/15/2023 8:06 AM, Shawn Wang wrote:
> On 2/16/23 5:43 AM, Reinette Chatre wrote:
>> On 1/15/2023 11:12 PM, Shawn Wang wrote:
>>> When writing a task id to the "tasks" file in an rdtgroup,
>>> rdtgroup_tasks_write() treats the pid as a number in the current pid
>>> namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows
>>> the list of global pids from the init namespace. If current pid namespace
>>> is not the init namespace, pids in "tasks" will be confusing and incorrect.
>>>
>>> To be more robust, let the "tasks" file only show pids in the current pid
>>> namespace.
>>>
>>
>> Is it possible to elaborate more on the use case that this is aiming to
>> address? It is unexpected to me that resource management is approached from
>> within a container. My expectation is that the resource management and monitoring
>> is done from the host.
> 
> We have a scenario where we only want to mount the resctrl filesystem under a specific container.

This scenario is interesting to me. My assumption has always been that the resource
management is done from the host and not a container. Especially since a container
can only add its own tasks to resource groups.

> We also found that the pids in the "tasks" file under resctrl are inconsistent with the pids reported by top.

Indeed.

> 
> Besides, the current rdtgroup_move_task() uses find_task_by_vpid() to resolve the real pid on the write side. Our modification also keeps the read side symmetric with rdtgroup_move_task().

I understand, thank you for looking into this.

> 
>>> ---
>>>   arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index 5993da21d822..9e97ae24c159 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -718,11 +718,15 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
>>>   static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
>>>   {
>>>       struct task_struct *p, *t;
>>> +    pid_t pid;
>>>         rcu_read_lock();
>>>       for_each_process_thread(p, t) {
>>> -        if (is_closid_match(t, r) || is_rmid_match(t, r))
>>> -            seq_printf(s, "%d\n", t->pid);
>>> +        if (is_closid_match(t, r) || is_rmid_match(t, r)) {
>>> +            pid = task_pid_vnr(t);
>>> +            if (pid)
>>> +                seq_printf(s, "%d\n", pid);
>>> +        }
>>>       }
>>>       rcu_read_unlock();
>>>   }
>>
>> This looks like it would solve the stated problem. Does it slow down
>> reading a tasks file in a measurable way?
> 
> We didn't test it, but the cost is proportional to the number of pids in the group.
> In addition, only an if statement is added here, and reading the "tasks" interface
> is not a frequent operation, so it will not be a bottleneck.

It adds more than an if statement: for the default root group, task_pid_vnr() will
be called for every task on the host. I am not familiar with namespaces, so my concern
was the additional task_pid_vnr() call. This does seem to be customary, though, and does
what is needed to return the correct data.

I did test this and can confirm that when bind mounting /sys/fs/resctrl into the container,
the container's view of /sys/fs/resctrl/tasks only shows its own tasks, with the pids as seen
by it. Without this patch, both the container and the host show the same data, namely the
pids from the host namespace.

Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>

When you no longer expect any more feedback I'd recommend that you resubmit this
patch with the new tags to make it easier for the next level maintainers to notice
it and pick it up. To ensure accurate references to discussions you can add a
"Link:" to this email.

Thank you very much

Reinette
  
Reinette Chatre March 16, 2023, 10:18 p.m. UTC | #4
Hi Shawn,

Thinking about this more, this could probably also do
with a:
Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") 

Reinette
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5993da21d822..9e97ae24c159 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -718,11 +718,15 @@  static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of,
 static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
 {
 	struct task_struct *p, *t;
+	pid_t pid;
 
 	rcu_read_lock();
 	for_each_process_thread(p, t) {
-		if (is_closid_match(t, r) || is_rmid_match(t, r))
-			seq_printf(s, "%d\n", t->pid);
+		if (is_closid_match(t, r) || is_rmid_match(t, r)) {
+			pid = task_pid_vnr(t);
+			if (pid)
+				seq_printf(s, "%d\n", pid);
+		}
 	}
 	rcu_read_unlock();
 }