proc.5: tell how to parse /proc/*/stat correctly

Message ID Y6SJDbKBk471KE4k@p183
State New
Headers
Series proc.5: tell how to parse /proc/*/stat correctly |

Commit Message

Alexey Dobriyan Dec. 22, 2022, 4:42 p.m. UTC
  /proc/*/stat can't be parsed with split() or split(" ") or split(' ')
or sscanf("%d (%s) ...") or equivalents because "comm" can contain
whitespace and parenthesis and is not escaped by the kernel.

BTW escaping would not help with naive split() anyway.

Mention strrchr(')') so people can at least stop adding new bugs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 man5/proc.5 |    5 +++++
 1 file changed, 5 insertions(+)
  

Comments

Dominique Martinet Dec. 22, 2022, 10:03 p.m. UTC | #1
Alexey Dobriyan wrote on Thu, Dec 22, 2022 at 07:42:53PM +0300:
> --- a/man5/proc.5
> +++ b/man5/proc.5
> @@ -2092,6 +2092,11 @@ Strings longer than
>  .B TASK_COMM_LEN
>  (16) characters (including the terminating null byte) are silently truncated.
>  This is visible whether or not the executable is swapped out.
> +
> +Note that \fIcomm\fP can contain space and closing parenthesis characters. 
> +Parsing /proc/${pid}/stat with split() or equivalent, or scanf(3) isn't
> +reliable. The correct way is to locate closing parenthesis with strrchr(')')
> +from the end of the buffer and parse integers from there.

That's still not enough unless new lines are escaped, which they aren't:

$ echo -n 'test) 0 0 0
' > /proc/$$/comm
$ cat /proc/$$/stat
71076 (test) 0 0 0
) S 71075 71076 71076 34840 71192 4194304 6623 6824 0 0 10 3 2 7 20 0 1 0 36396573 15208448 2888 18446744073709551615 94173281726464 94173282650929 140734972513568 0 0 0 65536 3686404 1266761467 1 0 0 17 1 0 0 0 0 0 94173282892592 94173282940880 94173287231488 140734972522071 140734972522076 140734972522076 140734972526574 0

The silver lining here is that comm length is rather small (16) so we
cannot emulate full lines and a very careful process could notice that
there are not enough fields after the last parenthesis... So just look
for the last closing parenthesis in the next line and try again?

But, really, I just don't see how this can practically be said to be parsable...
  
Solar Designer Dec. 22, 2022, 11:21 p.m. UTC | #2
On Fri, Dec 23, 2022 at 07:03:17AM +0900, Dominique Martinet wrote:
> Alexey Dobriyan wrote on Thu, Dec 22, 2022 at 07:42:53PM +0300:
> > --- a/man5/proc.5
> > +++ b/man5/proc.5
> > @@ -2092,6 +2092,11 @@ Strings longer than
> >  .B TASK_COMM_LEN
> >  (16) characters (including the terminating null byte) are silently truncated.
> >  This is visible whether or not the executable is swapped out.
> > +
> > +Note that \fIcomm\fP can contain space and closing parenthesis characters. 
> > +Parsing /proc/${pid}/stat with split() or equivalent, or scanf(3) isn't
> > +reliable. The correct way is to locate closing parenthesis with strrchr(')')
> > +from the end of the buffer and parse integers from there.
> 
> That's still not enough unless new lines are escaped, which they aren't:
> 
> $ echo -n 'test) 0 0 0
> ' > /proc/$$/comm
> $ cat /proc/$$/stat
> 71076 (test) 0 0 0
> ) S 71075 71076 71076 34840 71192 4194304 6623 6824 0 0 10 3 2 7 20 0 1 0 36396573 15208448 2888 18446744073709551615 94173281726464 94173282650929 140734972513568 0 0 0 65536 3686404 1266761467 1 0 0 17 1 0 0 0 0 0 94173282892592 94173282940880 94173287231488 140734972522071 140734972522076 140734972522076 140734972526574 0
> 
> The silver lining here is that comm length is rather small (16) so we
> cannot emulate full lines and a very careful process could notice that
> there are not enough fields after the last parenthesis... So just look
> for the last closing parenthesis in the next line and try again?

No, just don't treat this file's content as a line (nor as several
lines) - treat it as a string that might contain new line characters.

The ps command from procps-ng seems to manage, e.g. for your test "ps c"
prints:

29394 pts/3    S      0:00 test) 0 0 0?

where the question mark is what it substitutes for the non-printable
character (the new line character).  I didn't check whether the process
name it prints comes from /proc/$$/stat or /proc/$$/status, though (per
strace, it reads both).

> But, really, I just don't see how this can practically be said to be parsable...

This format certainly makes it easier to get a parser wrong than to get
it right.

I agree the above man page edit is not enough, and should also mention
the caveat that this shouldn't be read in nor parsed as a line.

Also, the Linux kernel does have problems with new lines in the comm
field elsewhere, at least in the log messages it produces:

https://github.com/lkrg-org/lkrg/issues/165

Here I looked into this in context of LKRG development, but with the
kernel itself also producing messages with comm in them the point of
only fixing LKRG's messages is moot.

Alexander

P.S. While this thread goes well so far, please note that in general
CC'ing other lists on postings to oss-security (or vice versa) is
discouraged.  With such CC's, possible follow-ups from members of those
other lists can be off-topic for oss-security - e.g., they might focus
on non-security technicalities.  Probably not this time when only a man
page is to be patched, but proposed patches to the Linux kernel often
result in lengthy discussions and multiple versions of the patch.  In
those cases, I think it's better to have separate threads and only post
summary follow-up(s) to oss-security (e.g., one message stating that a
patch was proposed and linking to the thread, and another after the
final version is merged).
  
Dominique Martinet Dec. 23, 2022, 12:15 a.m. UTC | #3
Solar Designer wrote on Fri, Dec 23, 2022 at 12:21:12AM +0100:
> On Fri, Dec 23, 2022 at 07:03:17AM +0900, Dominique Martinet wrote:
> > Alexey Dobriyan wrote on Thu, Dec 22, 2022 at 07:42:53PM +0300:
> > > --- a/man5/proc.5
> > > +++ b/man5/proc.5
> > > @@ -2092,6 +2092,11 @@ Strings longer than
> > >  .B TASK_COMM_LEN
> > >  (16) characters (including the terminating null byte) are silently truncated.
> > >  This is visible whether or not the executable is swapped out.
> > > +
> > > +Note that \fIcomm\fP can contain space and closing parenthesis characters. 
> > > +Parsing /proc/${pid}/stat with split() or equivalent, or scanf(3) isn't
> > > +reliable. The correct way is to locate closing parenthesis with strrchr(')')
> > > +from the end of the buffer and parse integers from there.
> > 
> > That's still not enough unless new lines are escaped, which they aren't:
> > 
> > $ echo -n 'test) 0 0 0
> > ' > /proc/$$/comm
> > $ cat /proc/$$/stat
> > 71076 (test) 0 0 0
> > ) S 71075 71076 71076 34840 71192 4194304 6623 6824 0 0 10 3 2 7 20 0 1 0 36396573 15208448 2888 18446744073709551615 94173281726464 94173282650929 140734972513568 0 0 0 65536 3686404 1266761467 1 0 0 17 1 0 0 0 0 0 94173282892592 94173282940880 94173287231488 140734972522071 140734972522076 140734972522076 140734972526574 0
> > 
> > The silver lining here is that comm length is rather small (16) so we
> > cannot emulate full lines and a very careful process could notice that
> > there are not enough fields after the last parenthesis... So just look
> > for the last closing parenthesis in the next line and try again?
> 
> No, just don't treat this file's content as a line (nor as several
> lines) - treat it as a string that might contain new line characters.

Ah, this came just after the /proc/net/unix discussion in another
thread[1] pointing to [2] with one line per entry, and I was still in
that mode.

For /proc/pid/stat with a single entry I agree treating it as a buffer
and looking for the last closing parenthesis should be correct as per
the man page suggestion -- sorry for the noise.

[1] https://www.openwall.com/lists/oss-security/2022/12/21/8
[2] https://lore.kernel.org/all/8a87957e-4d33-9351-ae74-243441cb03cd@opteya.com/
  
Jan Engelhardt Dec. 23, 2022, 12:21 a.m. UTC | #4
On Thursday 2022-12-22 23:03, Dominique Martinet wrote:
>> +
>> +Note that \fIcomm\fP can contain space and closing parenthesis characters. 
>> +Parsing /proc/${pid}/stat with split() or equivalent, or scanf(3) isn't
>> +reliable. The correct way is to locate closing parenthesis with strrchr(')')
>> +from the end of the buffer and parse integers from there.
>
>That's still not enough unless new lines are escaped, which they aren't:

strrchr does not concern itself with "lines".
If your input buffer contains the complete content of /proc/X/stat (and not
just a "line" thereof), the strrchr approach appears quite workable.
  
Lyndon Nerenberg (VE7TFX/VE6BBM) Dec. 28, 2022, 12:44 a.m. UTC | #5
Dominique Martinet writes:

> But, really, I just don't see how this can practically be said to be parsable...

In its current form it never will be.  The solution is to place
this variable-length field last.  Then you can "cut -d ' ' -f 51-"
to get the command+args part (assuming I counted all those fields
correctly ...)

Of course, this breaks backwards compatability.

--lyndon
  
Tavis Ormandy Dec. 28, 2022, 1:50 a.m. UTC | #6
On 2022-12-28, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> Dominique Martinet writes:
>
>> But, really, I just don't see how this can practically be said to be parsable...
>
> In its current form it never will be.  The solution is to place
> this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> to get the command+args part (assuming I counted all those fields
> correctly ...)
>
> Of course, this breaks backwards compatability.
>

I think that cut command doesn't handle newlines, you probably need cut
-z, which looks a bit hacky to me. There already is 'ps -q $$ -o comm=',
of course, and that works fine - as well as libprocps.

I don't know, I think just adding the strrchr note seems acceptable.

Tavis.
  
Shawn Webb Dec. 28, 2022, 3:24 p.m. UTC | #7
On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> Dominique Martinet writes:
> 
> > But, really, I just don't see how this can practically be said to be parsable...
> 
> In its current form it never will be.  The solution is to place
> this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> to get the command+args part (assuming I counted all those fields
> correctly ...)
> 
> Of course, this breaks backwards compatability.

It would also break forwards compatibility in the case new fields
needed to be added.

The only solution would be a libxo-style feature wherein a
machine-parseable format is exposed by virtue of a file extension.

Examples:

1. /proc/pid/stats.json
2. /proc/pid/stats.xml
3. /proc/pid/stats.yaml_shouldnt_be_a_thing

Thanks,
  
Shawn Webb Dec. 28, 2022, 3:31 p.m. UTC | #8
On Wed, Dec 28, 2022 at 10:24:58AM -0500, Shawn Webb wrote:
> On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > Dominique Martinet writes:
> > 
> > > But, really, I just don't see how this can practically be said to be parsable...
> > 
> > In its current form it never will be.  The solution is to place
> > this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> > to get the command+args part (assuming I counted all those fields
> > correctly ...)
> > 
> > Of course, this breaks backwards compatability.
> 
> It would also break forwards compatibility in the case new fields
> needed to be added.
> 
> The only solution would be a libxo-style feature wherein a
> machine-parseable format is exposed by virtue of a file extension.
> 
> Examples:
> 
> 1. /proc/pid/stats.json
> 2. /proc/pid/stats.xml
> 3. /proc/pid/stats.yaml_shouldnt_be_a_thing

To expand upon this idea, lets define an example json file:

{
	"schemaver": "20221228001",
	"name": "cat",
	"state": {
		"raw": "R",
		"intval": 1,
		"Pretty": "(Running)",
	},
	"tgid": 5452,
	"pid": 5452,
	"ppid": 743,
	"uid": {
		"real": 501,
		"effective": 501,
		"saved_set": 501,
		"fs": 501
	}
}

And so on.
  
Demi Marie Obenour Dec. 28, 2022, 4:47 p.m. UTC | #9
On Wed, Dec 28, 2022 at 10:24:58AM -0500, Shawn Webb wrote:
> On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > Dominique Martinet writes:
> > 
> > > But, really, I just don't see how this can practically be said to be parsable...
> > 
> > In its current form it never will be.  The solution is to place
> > this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> > to get the command+args part (assuming I counted all those fields
> > correctly ...)
> > 
> > Of course, this breaks backwards compatability.
> 
> It would also break forwards compatibility in the case new fields
> needed to be added.
> 
> The only solution would be a libxo-style feature wherein a
> machine-parseable format is exposed by virtue of a file extension.
> 
> Examples:
> 
> 1. /proc/pid/stats.json
> 2. /proc/pid/stats.xml
> 3. /proc/pid/stats.yaml_shouldnt_be_a_thing

A binary format would be even better.  No risk of ambiguity.
  
Jan Engelhardt Dec. 28, 2022, 5:09 p.m. UTC | #10
On Wednesday 2022-12-28 17:47, Demi Marie Obenour wrote:
>> Examples:
>> 
>> 1. /proc/pid/stats.json
>> 2. /proc/pid/stats.xml
>> 3. /proc/pid/stats.yaml_shouldnt_be_a_thing
>
>A binary format would be even better.  No risk of ambiguity.

So like EBML?
  
Shawn Webb Dec. 28, 2022, 5:25 p.m. UTC | #11
On Wed, Dec 28, 2022 at 11:47:25AM -0500, Demi Marie Obenour wrote:
> On Wed, Dec 28, 2022 at 10:24:58AM -0500, Shawn Webb wrote:
> > On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > > Dominique Martinet writes:
> > > 
> > > > But, really, I just don't see how this can practically be said to be parsable...
> > > 
> > > In its current form it never will be.  The solution is to place
> > > this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> > > to get the command+args part (assuming I counted all those fields
> > > correctly ...)
> > > 
> > > Of course, this breaks backwards compatability.
> > 
> > It would also break forwards compatibility in the case new fields
> > needed to be added.
> > 
> > The only solution would be a libxo-style feature wherein a
> > machine-parseable format is exposed by virtue of a file extension.
> > 
> > Examples:
> > 
> > 1. /proc/pid/stats.json
> > 2. /proc/pid/stats.xml
> > 3. /proc/pid/stats.yaml_shouldnt_be_a_thing
> 
> A binary format would be even better.  No risk of ambiguity.

I think the argument I'm trying to make is to be flexible in
implementation, allowing for future needs and wants--that is "future
proofing".
  
Demi Marie Obenour Dec. 28, 2022, 6:02 p.m. UTC | #12
On Wed, Dec 28, 2022 at 12:25:17PM -0500, Shawn Webb wrote:
> On Wed, Dec 28, 2022 at 11:47:25AM -0500, Demi Marie Obenour wrote:
> > On Wed, Dec 28, 2022 at 10:24:58AM -0500, Shawn Webb wrote:
> > > On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > > > Dominique Martinet writes:
> > > > 
> > > > > But, really, I just don't see how this can practically be said to be parsable...
> > > > 
> > > > In its current form it never will be.  The solution is to place
> > > > this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> > > > to get the command+args part (assuming I counted all those fields
> > > > correctly ...)
> > > > 
> > > > Of course, this breaks backwards compatability.
> > > 
> > > It would also break forwards compatibility in the case new fields
> > > needed to be added.
> > > 
> > > The only solution would be a libxo-style feature wherein a
> > > machine-parseable format is exposed by virtue of a file extension.
> > > 
> > > Examples:
> > > 
> > > 1. /proc/pid/stats.json
> > > 2. /proc/pid/stats.xml
> > > 3. /proc/pid/stats.yaml_shouldnt_be_a_thing
> > 
> > A binary format would be even better.  No risk of ambiguity.
> 
> I think the argument I'm trying to make is to be flexible in
> implementation, allowing for future needs and wants--that is "future
> proofing".

Linux should not have an XML, JSON, or YAML serializer.  Linux already
does way too much; let’s not add one more thing to the list.
  
John Helmert III Dec. 28, 2022, 6:36 p.m. UTC | #13
On Wed, Dec 28, 2022 at 01:02:35PM -0500, Demi Marie Obenour wrote:
> On Wed, Dec 28, 2022 at 12:25:17PM -0500, Shawn Webb wrote:
> > On Wed, Dec 28, 2022 at 11:47:25AM -0500, Demi Marie Obenour wrote:
> > > On Wed, Dec 28, 2022 at 10:24:58AM -0500, Shawn Webb wrote:
> > > > On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > > > > Dominique Martinet writes:
> > > > > 
> > > > > > But, really, I just don't see how this can practically be said to be parsable...
> > > > > 
> > > > > In its current form it never will be.  The solution is to place
> > > > > this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> > > > > to get the command+args part (assuming I counted all those fields
> > > > > correctly ...)
> > > > > 
> > > > > Of course, this breaks backwards compatability.
> > > > 
> > > > It would also break forwards compatibility in the case new fields
> > > > needed to be added.
> > > > 
> > > > The only solution would be a libxo-style feature wherein a
> > > > machine-parseable format is exposed by virtue of a file extension.
> > > > 
> > > > Examples:
> > > > 
> > > > 1. /proc/pid/stats.json
> > > > 2. /proc/pid/stats.xml
> > > > 3. /proc/pid/stats.yaml_shouldnt_be_a_thing
> > > 
> > > A binary format would be even better.  No risk of ambiguity.
> > 
> > I think the argument I'm trying to make is to be flexible in
> > implementation, allowing for future needs and wants--that is "future
> > proofing".
> 
> Linux should not have an XML, JSON, or YAML serializer.  Linux already
> does way too much; let’s not add one more thing to the list.

Handling a new binary format is not 'one more thing' added?

> -- 
> Sincerely,
> Demi Marie Obenour (she/her/hers)
> Invisible Things Lab
  
Shawn Webb Dec. 28, 2022, 7:24 p.m. UTC | #14
On Wed, Dec 28, 2022 at 01:02:35PM -0500, Demi Marie Obenour wrote:
> On Wed, Dec 28, 2022 at 12:25:17PM -0500, Shawn Webb wrote:
> > On Wed, Dec 28, 2022 at 11:47:25AM -0500, Demi Marie Obenour wrote:
> > > On Wed, Dec 28, 2022 at 10:24:58AM -0500, Shawn Webb wrote:
> > > > On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > > > > Dominique Martinet writes:
> > > > > 
> > > > > > But, really, I just don't see how this can practically be said to be parsable...
> > > > > 
> > > > > In its current form it never will be.  The solution is to place
> > > > > this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> > > > > to get the command+args part (assuming I counted all those fields
> > > > > correctly ...)
> > > > > 
> > > > > Of course, this breaks backwards compatability.
> > > > 
> > > > It would also break forwards compatibility in the case new fields
> > > > needed to be added.
> > > > 
> > > > The only solution would be a libxo-style feature wherein a
> > > > machine-parseable format is exposed by virtue of a file extension.
> > > > 
> > > > Examples:
> > > > 
> > > > 1. /proc/pid/stats.json
> > > > 2. /proc/pid/stats.xml
> > > > 3. /proc/pid/stats.yaml_shouldnt_be_a_thing
> > > 
> > > A binary format would be even better.  No risk of ambiguity.
> > 
> > I think the argument I'm trying to make is to be flexible in
> > implementation, allowing for future needs and wants--that is "future
> > proofing".
> 
> Linux should not have an XML, JSON, or YAML serializer.  Linux already
> does way too much; let’s not add one more thing to the list.

Somewhat agreed. I think formats like JSON provide a good balance
between machine parseable and human readable.

As I described earlier, though, when it comes to concepts like procfs
and sysfs, I have a bias towards abandoning them in favor of sysctl.
If sysctl nodes were to be used, no new serialization formats would
need to be implemented--and developers would also use a safter method
of system and process inspection and manipulation.
  
Alejandro Colomar Dec. 28, 2022, 7:57 p.m. UTC | #15
Hi all,

On 12/28/22 20:24, Shawn Webb wrote:
> On Wed, Dec 28, 2022 at 01:02:35PM -0500, Demi Marie Obenour wrote:
>> On Wed, Dec 28, 2022 at 12:25:17PM -0500, Shawn Webb wrote:
>>> On Wed, Dec 28, 2022 at 11:47:25AM -0500, Demi Marie Obenour wrote:
>>>> On Wed, Dec 28, 2022 at 10:24:58AM -0500, Shawn Webb wrote:
>>>>> On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
>>>>>> Dominique Martinet writes:
>>>>>>
>>>>>>> But, really, I just don't see how this can practically be said to be parsable...
>>>>>>
>>>>>> In its current form it never will be.  The solution is to place
>>>>>> this variable-length field last.  Then you can "cut -d ' ' -f 51-"
>>>>>> to get the command+args part (assuming I counted all those fields
>>>>>> correctly ...)
>>>>>>
>>>>>> Of course, this breaks backwards compatability.
>>>>>
>>>>> It would also break forwards compatibility in the case new fields
>>>>> needed to be added.
>>>>>
>>>>> The only solution would be a libxo-style feature wherein a
>>>>> machine-parseable format is exposed by virtue of a file extension.
>>>>>
>>>>> Examples:
>>>>>
>>>>> 1. /proc/pid/stats.json
>>>>> 2. /proc/pid/stats.xml
>>>>> 3. /proc/pid/stats.yaml_shouldnt_be_a_thing
>>>>
>>>> A binary format would be even better.  No risk of ambiguity.
>>>
>>> I think the argument I'm trying to make is to be flexible in
>>> implementation, allowing for future needs and wants--that is "future
>>> proofing".
>>
>> Linux should not have an XML, JSON, or YAML serializer.  Linux already
>> does way too much; let’s not add one more thing to the list.
> 
> Somewhat agreed. I think formats like JSON provide a good balance
> between machine parseable and human readable.
> a
> As I described earlier, though, when it comes to concepts like procfs
> and sysfs, I have a bias towards abandoning them in favor of sysctl.
> If sysctl nodes were to be used, no new serialization formats would
> need to be implemented--and developers would also use a safter method
> of system and process inspection and manipulation.
> 

Just a comment as someone who is reading without much understanding of the 
contents of /prod/pid/stat:

If organization of the data in the file is a problem, and the format starts to 
matter, maybe it's a hint that there are too many different contents, and could 
be split into different files, each one with its own formatting rules.  I'll 
suggest that maybe a set of files, maybe contained in a common directory 
stats.d, is what you're looking for?

Binary format is not of my preference, since most user-space tools work with the 
standard interface, that is, text.

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>
  
Theodore Ts'o Dec. 28, 2022, 10:14 p.m. UTC | #16
On Wed, Dec 28, 2022 at 01:02:35PM -0500, Demi Marie Obenour wrote:
> > I think the argument I'm trying to make is to be flexible in
> > implementation, allowing for future needs and wants--that is "future
> > proofing".
> 
> Linux should not have an XML, JSON, or YAML serializer.  Linux already
> does way too much; let’s not add one more thing to the list.

There's always Protobufs[1]!  :-)  And all of these are better than
ASN.1, for which Google already has a limited parser (for x.509
certificates).   :-)   :-)   :-)

						- Ted
  
Demi Marie Obenour Dec. 29, 2022, 12:33 a.m. UTC | #17
On Wed, Dec 28, 2022 at 05:14:42PM -0500, Theodore Ts'o wrote:
> On Wed, Dec 28, 2022 at 01:02:35PM -0500, Demi Marie Obenour wrote:
> > > I think the argument I'm trying to make is to be flexible in
> > > implementation, allowing for future needs and wants--that is "future
> > > proofing".
> > 
> > Linux should not have an XML, JSON, or YAML serializer.  Linux already
> > does way too much; let’s not add one more thing to the list.
> 
> There's always Protobufs[1]!  :-)  And all of these are better than
> ASN.1, for which Google already has a limited parser (for x.509
> certificates).   :-)   :-)   :-)
> 
> 						- Ted

Cap’n Proto is better than Protobufs :-)
  
Jakub Wilk Dec. 30, 2022, 8:15 p.m. UTC | #18
* Tavis Ormandy <taviso@gmail.com>, 2022-12-28 01:50:
>>>But, really, I just don't see how this can practically be said to be 
>>>parsable...
>>
>>In its current form it never will be.  The solution is to place this 
>>variable-length field last.  Then you can "cut -d ' ' -f 51-" to get 
>>the command+args part (assuming I counted all those fields correctly 
>>...)
>>
>>Of course, this breaks backwards compatability.
>
>I think that cut command doesn't handle newlines,

Indeed.

>There already is 'ps -q $$ -o >comm='

FWIW, "ps ... -o comm=" doesn't just print the raw comm value: it 
replaces non-printable chars with punctuation characters, and it may 
append " <defunct>" if the process is a zombie.

The easiest way to get unmangled comm is to read it from 
/proc/$PID/comm, then strip the trailing newline.

(But I suspect most /proc/*/stat parsers don't care about the comm field 
at all; they just want to skip over it to get their hands on the 
following fields.)
  
David Laight Dec. 31, 2022, 4:31 p.m. UTC | #19
From: Shawn Webb
> Sent: 28 December 2022 15:25
> 
> On Tue, Dec 27, 2022 at 04:44:49PM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
> > Dominique Martinet writes:
> >
> > > But, really, I just don't see how this can practically be said to be parsable...
> >
> > In its current form it never will be.  The solution is to place
> > this variable-length field last.  Then you can "cut -d ' ' -f 51-"
> > to get the command+args part (assuming I counted all those fields
> > correctly ...)
> >
> > Of course, this breaks backwards compatability.
> 
> It would also break forwards compatibility in the case new fields
> needed to be added.
> 
> The only solution would be a libxo-style feature wherein a
> machine-parseable format is exposed by virtue of a file extension.
> 
> Examples:
> 
> 1. /proc/pid/stats.json
> 2. /proc/pid/stats.xml
> 3. /proc/pid/stats.yaml_shouldnt_be_a_thing

None of those are of any real use if you are trying to parse the
data in something like a shell script.
Multiple lines formatted as "tag:value" are probably the best bet.
Provided something sane is done with embedded \n (and maybe \r).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
  
Solar Designer Dec. 31, 2022, 5:27 p.m. UTC | #20
Hi all,

Let's wind this oss-security thread down as it relates to brainstorming
and commenting on totally new designs - no more of that, please.

Many things were said, but realistically the interface isn't _that_
broken (this can be parsed correctly, and procps-ng manages to) and is
(hopefully) not going to change much (in my opinion, and I know I'm not
alone in this, most of the proposals would make things worse overall).

Somewhat realistically, one possible change is replacing the most risky
characters, such as braces and anything <= ASCII 32, perhaps with '?'
to match what procps-ng is doing.  Perhaps do this either on all updates
of "comm" or in all places where "comm" is reported to userspace
(including procfs and kernel messages, by calling a common function).
"comm" isn't the full process name anyway - it's often truncated - so it
can reasonably be made safer in other ways as well.  As an option, the
replacing of whitespace (ASCII 32) and braces could be limited to the
"stat" file, but the control characters are (even more) problematic with
other interfaces where "comm" is exposed, so replacing them should
probably be global.

Happy New Year!

Alexander
  

Patch

--- a/man5/proc.5
+++ b/man5/proc.5
@@ -2092,6 +2092,11 @@  Strings longer than
 .B TASK_COMM_LEN
 (16) characters (including the terminating null byte) are silently truncated.
 This is visible whether or not the executable is swapped out.
+
+Note that \fIcomm\fP can contain space and closing parenthesis characters. 
+Parsing /proc/${pid}/stat with split() or equivalent, or scanf(3) isn't
+reliable. The correct way is to locate closing parenthesis with strrchr(')')
+from the end of the buffer and parse integers from there.
 .TP
 (3) \fIstate\fP \ %c
 One of the following characters, indicating process state: