Semantics of blktrace with lockdown (integrity) enabled kernel.

Message ID ZC8Dbux56HbJjpTy@char.us.oracle.com
State New
Headers
Series Semantics of blktrace with lockdown (integrity) enabled kernel. |

Commit Message

Konrad Rzeszutek Wilk April 6, 2023, 5:37 p.m. UTC
  Hey Jens, Paul, James, Nathan,

We are trying to use blktrace with a kernel that has lockdown enabled and find that it cannot run.

Specifically the issue is that we are trying to do is pretty simple:

strace -f blktrace -d /dev/sda -w 60
 
[pid 148882] <... mprotect resumed>)    = 0
[pid 148881] openat(AT_FDCWD, "/sys/kernel/debug/block/sda/trace0", O_RDONLY|O_NONBLOCK <unfinished ...>
[pid 148882] sched_setaffinity(0, 8, [1]) = 0
[pid 148881] <... openat resumed>)      = -1 EPERM (Operation not permitted)

which fails. The analysis from Eric (CCed) is that 

All debugfs entries do not exist until blktrace is run.  It is opening
/sys/kernel/debug/block/sda/trace0 which isn’t there normally. While running the utility, 
to place something in it, it must have the write permission set.  When exiting out of 
blktrace, the entry is gone, both on a machine running with secure boot enabled 
and one with it disabled.  Which also indicates the write permission was set, 
otherwise the entry would still be there.

The fix is simple enough (see attachment) but we are not sure about the semantics of what
lockdown has in mind.

Looking at the include/linux/security.h the LOCKDOWN_TRACEFS exists which would
imply that it is expected that operations with tracefs *should* work with lockdown (integrity mode).

But at the same point, debugfs writable attributes are a nono with lockdown.

So what is the right way forward?

Thank you.
From 20bd7b8c91463191924ec69833bbd6e6a6231f52 Mon Sep 17 00:00:00 2001
From: Junxiao Bi <junxiao.bi@oracle.com>
Date: Tue, 4 Apr 2023 19:13:21 -0700
Subject: [PATCH] debugfs: whitelisted relay file for lockdown

Relay files in debugfs are used for sending data from kernel to userspace,
the permission of these files are 0444, looks safe to skip lockdown.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
---
 fs/debugfs/file.c     | 17 +++++++++++++++++
 fs/debugfs/internal.h |  5 +++++
 2 files changed, 22 insertions(+)
  

Comments

Paul Moore April 6, 2023, 6:39 p.m. UTC | #1
On Thu, Apr 6, 2023 at 1:38 PM Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>
> Hey Jens, Paul, James, Nathan,
>
> We are trying to use blktrace with a kernel that has lockdown enabled and find that it cannot run.
>
> Specifically the issue is that we are trying to do is pretty simple:
>
> strace -f blktrace -d /dev/sda -w 60
>
> [pid 148882] <... mprotect resumed>)    = 0
> [pid 148881] openat(AT_FDCWD, "/sys/kernel/debug/block/sda/trace0", O_RDONLY|O_NONBLOCK <unfinished ...>
> [pid 148882] sched_setaffinity(0, 8, [1]) = 0
> [pid 148881] <... openat resumed>)      = -1 EPERM (Operation not permitted)
>
> which fails. The analysis from Eric (CCed) is that
>
> All debugfs entries do not exist until blktrace is run.  It is opening
> /sys/kernel/debug/block/sda/trace0 which isn’t there normally. While running the utility,
> to place something in it, it must have the write permission set.  When exiting out of
> blktrace, the entry is gone, both on a machine running with secure boot enabled
> and one with it disabled.  Which also indicates the write permission was set,
> otherwise the entry would still be there.
>
> The fix is simple enough (see attachment) but we are not sure about the semantics of what
> lockdown has in mind.
>
> Looking at the include/linux/security.h the LOCKDOWN_TRACEFS exists which would
> imply that it is expected that operations with tracefs *should* work with lockdown (integrity mode).
>
> But at the same point, debugfs writable attributes are a nono with lockdown.
>
> So what is the right way forward?

What did you use as a basis for your changes?  I'm looking at the
patch you sent and it appears to be making a change to a
debugfs_lockdown_whitelisted() function defined in
fs/debugfs/internal.h which does not exist in Linus' tree.  If I
search through all of the archives on lore.kernel.org the only hit I
get is your email, so it seems doubtful it is in a subsystem tree
which hasn't made its way to Linus yet.

Before we go any further, can you please verify that your issue is
reproducible on a supported, upstream tree (preferably Linus')?
  
Junxiao Bi April 6, 2023, 7:30 p.m. UTC | #2
On 4/6/23 11:39 AM, Paul Moore wrote:

> On Thu, Apr 6, 2023 at 1:38 PM Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> Hey Jens, Paul, James, Nathan,
>>
>> We are trying to use blktrace with a kernel that has lockdown enabled and find that it cannot run.
>>
>> Specifically the issue is that we are trying to do is pretty simple:
>>
>> strace -f blktrace -d /dev/sda -w 60
>>
>> [pid 148882] <... mprotect resumed>)    = 0
>> [pid 148881] openat(AT_FDCWD, "/sys/kernel/debug/block/sda/trace0", O_RDONLY|O_NONBLOCK <unfinished ...>
>> [pid 148882] sched_setaffinity(0, 8, [1]) = 0
>> [pid 148881] <... openat resumed>)      = -1 EPERM (Operation not permitted)
>>
>> which fails. The analysis from Eric (CCed) is that
>>
>> All debugfs entries do not exist until blktrace is run.  It is opening
>> /sys/kernel/debug/block/sda/trace0 which isn’t there normally. While running the utility,
>> to place something in it, it must have the write permission set.  When exiting out of
>> blktrace, the entry is gone, both on a machine running with secure boot enabled
>> and one with it disabled.  Which also indicates the write permission was set,
>> otherwise the entry would still be there.
>>
>> The fix is simple enough (see attachment) but we are not sure about the semantics of what
>> lockdown has in mind.
>>
>> Looking at the include/linux/security.h the LOCKDOWN_TRACEFS exists which would
>> imply that it is expected that operations with tracefs *should* work with lockdown (integrity mode).
>>
>> But at the same point, debugfs writable attributes are a nono with lockdown.
>>
>> So what is the right way forward?
> What did you use as a basis for your changes?  I'm looking at the
> patch you sent and it appears to be making a change to a
> debugfs_lockdown_whitelisted() function defined in
> fs/debugfs/internal.h which does not exist in Linus' tree.  If I
> search through all of the archives on lore.kernel.org the only hit I
> get is your email, so it seems doubtful it is in a subsystem tree
> which hasn't made its way to Linus yet.
>
> Before we go any further, can you please verify that your issue is
> reproducible on a supported, upstream tree (preferably Linus')?

The patch attached is applied to oracle kernel which is just used to 
demo the idea of a possible fix.

Upstream will have the same issue because blktrace uses relay files from 
debugfs to transfer trace information from kernel to userspace. Those 
relay files are having permission 0400 which are good, but they support 
mmap (struct file_operations relay_file_operations), which are against 
the rule of lock down. Is there any security concern to whitelist these 
files in lockdown mode? Any idea how to fix this for upstream?

static int debugfs_locked_down(struct inode *inode,
                    struct file *filp,
                    const struct file_operations *real_fops)
{
     if ((inode->i_mode & 07777 & ~0444) == 0 &&
         !(filp->f_mode & FMODE_WRITE) &&
         !real_fops->unlocked_ioctl &&
         !real_fops->compat_ioctl &&
         !real_fops->mmap)
         return 0;

     if (security_locked_down(LOCKDOWN_DEBUGFS))
         return -EPERM;

     return 0;
}

Thanks,

Junxiao.

>
  
Konrad Rzeszutek Wilk April 6, 2023, 7:32 p.m. UTC | #3
On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:
> On Thu, Apr 6, 2023 at 1:38 PM Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> >
> > Hey Jens, Paul, James, Nathan,
> >
> > We are trying to use blktrace with a kernel that has lockdown enabled and find that it cannot run.
> >
> > Specifically the issue is that we are trying to do is pretty simple:
> >
> > strace -f blktrace -d /dev/sda -w 60
> >
> > [pid 148882] <... mprotect resumed>)    = 0
> > [pid 148881] openat(AT_FDCWD, "/sys/kernel/debug/block/sda/trace0", O_RDONLY|O_NONBLOCK <unfinished ...>
> > [pid 148882] sched_setaffinity(0, 8, [1]) = 0
> > [pid 148881] <... openat resumed>)      = -1 EPERM (Operation not permitted)
> >
> > which fails. The analysis from Eric (CCed) is that
> >
> > All debugfs entries do not exist until blktrace is run.  It is opening
> > /sys/kernel/debug/block/sda/trace0 which isn’t there normally. While running the utility,
> > to place something in it, it must have the write permission set.  When exiting out of
> > blktrace, the entry is gone, both on a machine running with secure boot enabled
> > and one with it disabled.  Which also indicates the write permission was set,
> > otherwise the entry would still be there.
> >
> > The fix is simple enough (see attachment) but we are not sure about the semantics of what
> > lockdown has in mind.
> >
> > Looking at the include/linux/security.h the LOCKDOWN_TRACEFS exists which would
> > imply that it is expected that operations with tracefs *should* work with lockdown (integrity mode).
> >
> > But at the same point, debugfs writable attributes are a nono with lockdown.
> >
> > So what is the right way forward?
> 
> What did you use as a basis for your changes?  I'm looking at the
> patch you sent and it appears to be making a change to a
> debugfs_lockdown_whitelisted() function defined in
> fs/debugfs/internal.h which does not exist in Linus' tree.  If I
> search through all of the archives on lore.kernel.org the only hit I
> get is your email, so it seems doubtful it is in a subsystem tree
> which hasn't made its way to Linus yet.

My apologies. We had to add some extra code for flipping IBRS on/off at
some point and that is why see this 'whitelisted' one. A more upstream
appropiate patch not be based on this.
> 
> Before we go any further, can you please verify that your issue is
> reproducible on a supported, upstream tree (preferably Linus')?

Yes. Very much so.
Thank you.
  
Paul Moore April 6, 2023, 9:30 p.m. UTC | #4
On Thu, Apr 6, 2023 at 3:30 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> On 4/6/23 11:39 AM, Paul Moore wrote:
> > On Thu, Apr 6, 2023 at 1:38 PM Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> >> Hey Jens, Paul, James, Nathan,
> >>
> >> We are trying to use blktrace with a kernel that has lockdown enabled and find that it cannot run.
> >>
> >> Specifically the issue is that we are trying to do is pretty simple:
> >>
> >> strace -f blktrace -d /dev/sda -w 60
> >>
> >> [pid 148882] <... mprotect resumed>)    = 0
> >> [pid 148881] openat(AT_FDCWD, "/sys/kernel/debug/block/sda/trace0", O_RDONLY|O_NONBLOCK <unfinished ...>
> >> [pid 148882] sched_setaffinity(0, 8, [1]) = 0
> >> [pid 148881] <... openat resumed>)      = -1 EPERM (Operation not permitted)
> >>
> >> which fails. The analysis from Eric (CCed) is that
> >>
> >> All debugfs entries do not exist until blktrace is run.  It is opening
> >> /sys/kernel/debug/block/sda/trace0 which isn’t there normally. While running the utility,
> >> to place something in it, it must have the write permission set.  When exiting out of
> >> blktrace, the entry is gone, both on a machine running with secure boot enabled
> >> and one with it disabled.  Which also indicates the write permission was set,
> >> otherwise the entry would still be there.
> >>
> >> The fix is simple enough (see attachment) but we are not sure about the semantics of what
> >> lockdown has in mind.
> >>
> >> Looking at the include/linux/security.h the LOCKDOWN_TRACEFS exists which would
> >> imply that it is expected that operations with tracefs *should* work with lockdown (integrity mode).
> >>
> >> But at the same point, debugfs writable attributes are a nono with lockdown.
> >>
> >> So what is the right way forward?
> > What did you use as a basis for your changes?  I'm looking at the
> > patch you sent and it appears to be making a change to a
> > debugfs_lockdown_whitelisted() function defined in
> > fs/debugfs/internal.h which does not exist in Linus' tree.  If I
> > search through all of the archives on lore.kernel.org the only hit I
> > get is your email, so it seems doubtful it is in a subsystem tree
> > which hasn't made its way to Linus yet.
> >
> > Before we go any further, can you please verify that your issue is
> > reproducible on a supported, upstream tree (preferably Linus')?
>
> The patch attached is applied to oracle kernel which is just used to
> demo the idea of a possible fix.
>
> Upstream will have the same issue because blktrace uses relay files from
> debugfs to transfer trace information from kernel to userspace ...

For future reference, the problem with sending patches that aren't
based on an upstream tree is that it both wastes reviewer time trying
to figure out where the basis of the patch, and it makes one question
if the issue is present in an upstream kernel or if there is some
out-of-tree patch in the unknown kernel which is the root cause.

Maybe you've tested everything and know it is a problem with the
upstream code, but when we see a patch that doesn't match up with
anything, how are we supposed to know?
  
Paul Moore April 6, 2023, 9:43 p.m. UTC | #5
On Thu, Apr 6, 2023 at 3:33 PM Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:

...

> > Before we go any further, can you please verify that your issue is
> > reproducible on a supported, upstream tree (preferably Linus')?
>
> Yes. Very much so.

Okay, in that case I suspect the issue is due to the somewhat limited
granularity in the lockdown LSM.  While there are a number of
different lockdown "levels", the reality is that the admin has to
choose from either NONE, INTEGRITY, or CONFIDENTIALITY.  Without
digging to deep into the code path that you would be hitting, we can
see that TRACEFS is blocked by the CONFIDENTIALITY (and therefore
INTEGRITY too) setting and DEBUGFS is blocked by the INTEGRITY
setting.  With DEBUGFS blocked by INTEGRITY, the only lockdown option
that would allow DEBUGFS is NONE.

Without knowing too much about blktrace beyond the manpage, it looks
like it has the ability to trace/snoop on the block device operations
so I don't think this is something we would want to allow in a
"locked" system.

Sorry.
  
Junxiao Bi April 10, 2023, 7:19 p.m. UTC | #6
On 4/6/23 2:43 PM, Paul Moore wrote:

> On Thu, Apr 6, 2023 at 3:33 PM Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:
> ...
>
>>> Before we go any further, can you please verify that your issue is
>>> reproducible on a supported, upstream tree (preferably Linus')?
>> Yes. Very much so.
> Okay, in that case I suspect the issue is due to the somewhat limited
> granularity in the lockdown LSM.  While there are a number of
> different lockdown "levels", the reality is that the admin has to
> choose from either NONE, INTEGRITY, or CONFIDENTIALITY.  Without
> digging to deep into the code path that you would be hitting, we can
> see that TRACEFS is blocked by the CONFIDENTIALITY (and therefore
> INTEGRITY too) setting and DEBUGFS is blocked by the INTEGRITY
> setting.  With DEBUGFS blocked by INTEGRITY, the only lockdown option
> that would allow DEBUGFS is NONE.
>
> Without knowing too much about blktrace beyond the manpage, it looks
> like it has the ability to trace/snoop on the block device operations
> so I don't think this is something we would want to allow in a
> "locked" system.

blktrace depends on tracepoint in block layer to trace io events of 
block devices,

through the test with mainline, those tracepoints were not blocked by 
lockdown.

If snoop block devices operations is a security concern in lock down, these

tracepoints should be disabled?

[root@jubi-ol8 tracecmd]# uname -a
Linux jubi-ol8 6.3.0-rc6.master.20230410.ol8.x86_64 #1 SMP 
PREEMPT_DYNAMIC Mon Apr 10 03:33:56 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@jubi-ol8 tracecmd]# cat /sys/kernel/security/lockdown
none [integrity] confidentiality
[root@jubi-ol8 tracecmd]# trace-cmd record -e block:block_rq_issue -e 
block:block_rq_complete
Hit Ctrl^C to stop recording
^CCPU0 data recorded at offset=0x9fa000
     4096 bytes in size
CPU1 data recorded at offset=0x9fb000
     4096 bytes in size
CPU2 data recorded at offset=0x9fc000
     53248 bytes in size
CPU3 data recorded at offset=0xa09000
     12288 bytes in size

Thanks,

Junxiao.

>
> Sorry.
>
  
Paul Moore April 10, 2023, 8:22 p.m. UTC | #7
On Mon, Apr 10, 2023 at 3:20 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> On 4/6/23 2:43 PM, Paul Moore wrote:
> > On Thu, Apr 6, 2023 at 3:33 PM Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> >> On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:
> > ...
> >
> >>> Before we go any further, can you please verify that your issue is
> >>> reproducible on a supported, upstream tree (preferably Linus')?
> >> Yes. Very much so.
> > Okay, in that case I suspect the issue is due to the somewhat limited
> > granularity in the lockdown LSM.  While there are a number of
> > different lockdown "levels", the reality is that the admin has to
> > choose from either NONE, INTEGRITY, or CONFIDENTIALITY.  Without
> > digging to deep into the code path that you would be hitting, we can
> > see that TRACEFS is blocked by the CONFIDENTIALITY (and therefore
> > INTEGRITY too) setting and DEBUGFS is blocked by the INTEGRITY
> > setting.  With DEBUGFS blocked by INTEGRITY, the only lockdown option
> > that would allow DEBUGFS is NONE.
> >
> > Without knowing too much about blktrace beyond the manpage, it looks
> > like it has the ability to trace/snoop on the block device operations
> > so I don't think this is something we would want to allow in a
> > "locked" system.
>
> blktrace depends on tracepoint in block layer to trace io events of
> block devices,
>
> through the test with mainline, those tracepoints were not blocked by
> lockdown.
>
> If snoop block devices operations is a security concern in lock down, these
>
> tracepoints should be disabled?

Possibly, however, as I said earlier I'm not very familiar with
blktrace and the associated tracepoints.  If it is possible to snoop
on kernel/user data using blktrace then it probably should be
protected by a lockdown control point.

Is this something you could verify and potentially submit a patch to resolve?
  
Junxiao Bi April 10, 2023, 9:28 p.m. UTC | #8
On 4/10/23 1:22 PM, Paul Moore wrote:

> On Mon, Apr 10, 2023 at 3:20 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
>> On 4/6/23 2:43 PM, Paul Moore wrote:
>>> On Thu, Apr 6, 2023 at 3:33 PM Konrad Rzeszutek Wilk
>>> <konrad.wilk@oracle.com> wrote:
>>>> On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:
>>> ...
>>>
>>>>> Before we go any further, can you please verify that your issue is
>>>>> reproducible on a supported, upstream tree (preferably Linus')?
>>>> Yes. Very much so.
>>> Okay, in that case I suspect the issue is due to the somewhat limited
>>> granularity in the lockdown LSM.  While there are a number of
>>> different lockdown "levels", the reality is that the admin has to
>>> choose from either NONE, INTEGRITY, or CONFIDENTIALITY.  Without
>>> digging to deep into the code path that you would be hitting, we can
>>> see that TRACEFS is blocked by the CONFIDENTIALITY (and therefore
>>> INTEGRITY too) setting and DEBUGFS is blocked by the INTEGRITY
>>> setting.  With DEBUGFS blocked by INTEGRITY, the only lockdown option
>>> that would allow DEBUGFS is NONE.
>>>
>>> Without knowing too much about blktrace beyond the manpage, it looks
>>> like it has the ability to trace/snoop on the block device operations
>>> so I don't think this is something we would want to allow in a
>>> "locked" system.
>> blktrace depends on tracepoint in block layer to trace io events of
>> block devices,
>>
>> through the test with mainline, those tracepoints were not blocked by
>> lockdown.
>>
>> If snoop block devices operations is a security concern in lock down, these
>>
>> tracepoints should be disabled?
> Possibly, however, as I said earlier I'm not very familiar with
> blktrace and the associated tracepoints.  If it is possible to snoop
> on kernel/user data using blktrace then it probably should be
> protected by a lockdown control point.
>
> Is this something you could verify and potentially submit a patch to resolve?

blktrace can not snoop kernel/user data. The information it got from 
kernel is kind of "io metadata", basically include which process from 
which cpu, at what time, triggered what kind of IO events(issue, queue, 
complete etc.), to which disk, from which sector offset and how long. 
blktrace has no way to know what's inside that io. I am kind of think 
this is safe for lockdown mode.

The following is sample of blktrace output.

252,0    1        1     0.000000000  2570  Q   W 45779288 + 48 [nstat]
252,0    1        1     0.000000000  2570  Q   W 45779288 + 48 [nstat]
252,0    1        1     0.000000000  2570  Q   W 45779288 + 48 [nstat]
252,0    1        1     0.000000000  2570  Q   W 45779288 + 48 [nstat]
252,0    3        1     0.001038626  2572  C   W 45779288 + 48 [0]
252,0    3        1     0.001038626  2572  C   W 45779288 + 48 [0]
252,0    3        1     0.001038626  2572  C   W 45779288 + 48 [0]
252,0    3        1     0.001038626  2572  C   W 45779288 + 48 [0]
252,0    1        2     1.764157693  1384  Q  WS 24577112 + 8 [auditd]
252,0    1        2     1.764157693  1384  Q  WS 24577112 + 8 [auditd]
252,0    1        2     1.764157693  1384  Q  WS 24577112 + 8 [auditd]
252,0    1        2     1.764157693  1384  Q  WS 24577112 + 8 [auditd]
252,0    1        3     1.764216845  1384  Q  WS 24577120 + 16 [auditd]
252,0    1        3     1.764216845  1384  Q  WS 24577120 + 16 [auditd]
252,0    1        3     1.764216845  1384  Q  WS 24577120 + 16 [auditd]
252,0    1        3     1.764216845  1384  Q  WS 24577120 + 16 [auditd]


Thanks,

Junxiao.

>
  
Paul Moore April 10, 2023, 9:44 p.m. UTC | #9
On Mon, Apr 10, 2023 at 5:28 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> On 4/10/23 1:22 PM, Paul Moore wrote:
> > On Mon, Apr 10, 2023 at 3:20 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> >> On 4/6/23 2:43 PM, Paul Moore wrote:
> >>> On Thu, Apr 6, 2023 at 3:33 PM Konrad Rzeszutek Wilk
> >>> <konrad.wilk@oracle.com> wrote:
> >>>> On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:
> >>> ...
> >>>
> >>>>> Before we go any further, can you please verify that your issue is
> >>>>> reproducible on a supported, upstream tree (preferably Linus')?
> >>>> Yes. Very much so.
> >>> Okay, in that case I suspect the issue is due to the somewhat limited
> >>> granularity in the lockdown LSM.  While there are a number of
> >>> different lockdown "levels", the reality is that the admin has to
> >>> choose from either NONE, INTEGRITY, or CONFIDENTIALITY.  Without
> >>> digging to deep into the code path that you would be hitting, we can
> >>> see that TRACEFS is blocked by the CONFIDENTIALITY (and therefore
> >>> INTEGRITY too) setting and DEBUGFS is blocked by the INTEGRITY
> >>> setting.  With DEBUGFS blocked by INTEGRITY, the only lockdown option
> >>> that would allow DEBUGFS is NONE.
> >>>
> >>> Without knowing too much about blktrace beyond the manpage, it looks
> >>> like it has the ability to trace/snoop on the block device operations
> >>> so I don't think this is something we would want to allow in a
> >>> "locked" system.
> >> blktrace depends on tracepoint in block layer to trace io events of
> >> block devices,
> >>
> >> through the test with mainline, those tracepoints were not blocked by
> >> lockdown.
> >>
> >> If snoop block devices operations is a security concern in lock down, these
> >>
> >> tracepoints should be disabled?
> > Possibly, however, as I said earlier I'm not very familiar with
> > blktrace and the associated tracepoints.  If it is possible to snoop
> > on kernel/user data using blktrace then it probably should be
> > protected by a lockdown control point.
> >
> > Is this something you could verify and potentially submit a patch to resolve?
>
> blktrace can not snoop kernel/user data. The information it got from
> kernel is kind of "io metadata", basically include which process from
> which cpu, at what time, triggered what kind of IO events(issue, queue,
> complete etc.), to which disk, from which sector offset and how long.
> blktrace has no way to know what's inside that io. I am kind of think
> this is safe for lockdown mode.

Well, you could always submit a patch* and we would review it like any
other; that's usually a much better approach.

* Yes, there was a patch submitted, but it was against a distro kernel
that diverged significantly from the upstream kernel in the relevant
areas.
  
Junxiao Bi April 10, 2023, 9:51 p.m. UTC | #10
On 4/10/23 2:44 PM, Paul Moore wrote:
> On Mon, Apr 10, 2023 at 5:28 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
>> On 4/10/23 1:22 PM, Paul Moore wrote:
>>> On Mon, Apr 10, 2023 at 3:20 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
>>>> On 4/6/23 2:43 PM, Paul Moore wrote:
>>>>> On Thu, Apr 6, 2023 at 3:33 PM Konrad Rzeszutek Wilk
>>>>> <konrad.wilk@oracle.com> wrote:
>>>>>> On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:
>>>>> ...
>>>>>
>>>>>>> Before we go any further, can you please verify that your issue is
>>>>>>> reproducible on a supported, upstream tree (preferably Linus')?
>>>>>> Yes. Very much so.
>>>>> Okay, in that case I suspect the issue is due to the somewhat limited
>>>>> granularity in the lockdown LSM.  While there are a number of
>>>>> different lockdown "levels", the reality is that the admin has to
>>>>> choose from either NONE, INTEGRITY, or CONFIDENTIALITY.  Without
>>>>> digging to deep into the code path that you would be hitting, we can
>>>>> see that TRACEFS is blocked by the CONFIDENTIALITY (and therefore
>>>>> INTEGRITY too) setting and DEBUGFS is blocked by the INTEGRITY
>>>>> setting.  With DEBUGFS blocked by INTEGRITY, the only lockdown option
>>>>> that would allow DEBUGFS is NONE.
>>>>>
>>>>> Without knowing too much about blktrace beyond the manpage, it looks
>>>>> like it has the ability to trace/snoop on the block device operations
>>>>> so I don't think this is something we would want to allow in a
>>>>> "locked" system.
>>>> blktrace depends on tracepoint in block layer to trace io events of
>>>> block devices,
>>>>
>>>> through the test with mainline, those tracepoints were not blocked by
>>>> lockdown.
>>>>
>>>> If snoop block devices operations is a security concern in lock down, these
>>>>
>>>> tracepoints should be disabled?
>>> Possibly, however, as I said earlier I'm not very familiar with
>>> blktrace and the associated tracepoints.  If it is possible to snoop
>>> on kernel/user data using blktrace then it probably should be
>>> protected by a lockdown control point.
>>>
>>> Is this something you could verify and potentially submit a patch to resolve?
>> blktrace can not snoop kernel/user data. The information it got from
>> kernel is kind of "io metadata", basically include which process from
>> which cpu, at what time, triggered what kind of IO events(issue, queue,
>> complete etc.), to which disk, from which sector offset and how long.
>> blktrace has no way to know what's inside that io. I am kind of think
>> this is safe for lockdown mode.
> Well, you could always submit a patch* and we would review it like any
> other; that's usually a much better approach.
>
> * Yes, there was a patch submitted, but it was against a distro kernel
> that diverged significantly from the upstream kernel in the relevant
> areas.

Sure, i will submit a new one.

Before that, may i ask this question? It may affect the approach of the 
patch.

Lockdown blocked files with mmap operation even that files are 
read-only, may i know what's the security concern there?

static int debugfs_locked_down(struct inode *inode,
                    struct file *filp,
                    const struct file_operations *real_fops)
{
     if ((inode->i_mode & 07777 & ~0444) == 0 &&
         !(filp->f_mode & FMODE_WRITE) &&
         !real_fops->unlocked_ioctl &&
         !real_fops->compat_ioctl &&
         !real_fops->mmap)
         return 0;

     if (security_locked_down(LOCKDOWN_DEBUGFS))
         return -EPERM;

     return 0;
}

Thanks,

Junxiao.

>
  
Paul Moore April 10, 2023, 10 p.m. UTC | #11
On Mon, Apr 10, 2023 at 5:52 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> On 4/10/23 2:44 PM, Paul Moore wrote:
> > On Mon, Apr 10, 2023 at 5:28 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> >> On 4/10/23 1:22 PM, Paul Moore wrote:
> >>> On Mon, Apr 10, 2023 at 3:20 PM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> >>>> On 4/6/23 2:43 PM, Paul Moore wrote:
> >>>>> On Thu, Apr 6, 2023 at 3:33 PM Konrad Rzeszutek Wilk
> >>>>> <konrad.wilk@oracle.com> wrote:
> >>>>>> On Thu, Apr 06, 2023 at 02:39:57PM -0400, Paul Moore wrote:
> >>>>> ...
> >>>>>
> >>>>>>> Before we go any further, can you please verify that your issue is
> >>>>>>> reproducible on a supported, upstream tree (preferably Linus')?
> >>>>>> Yes. Very much so.
> >>>>> Okay, in that case I suspect the issue is due to the somewhat limited
> >>>>> granularity in the lockdown LSM.  While there are a number of
> >>>>> different lockdown "levels", the reality is that the admin has to
> >>>>> choose from either NONE, INTEGRITY, or CONFIDENTIALITY.  Without
> >>>>> digging to deep into the code path that you would be hitting, we can
> >>>>> see that TRACEFS is blocked by the CONFIDENTIALITY (and therefore
> >>>>> INTEGRITY too) setting and DEBUGFS is blocked by the INTEGRITY
> >>>>> setting.  With DEBUGFS blocked by INTEGRITY, the only lockdown option
> >>>>> that would allow DEBUGFS is NONE.
> >>>>>
> >>>>> Without knowing too much about blktrace beyond the manpage, it looks
> >>>>> like it has the ability to trace/snoop on the block device operations
> >>>>> so I don't think this is something we would want to allow in a
> >>>>> "locked" system.
> >>>> blktrace depends on tracepoint in block layer to trace io events of
> >>>> block devices,
> >>>>
> >>>> through the test with mainline, those tracepoints were not blocked by
> >>>> lockdown.
> >>>>
> >>>> If snoop block devices operations is a security concern in lock down, these
> >>>>
> >>>> tracepoints should be disabled?
> >>> Possibly, however, as I said earlier I'm not very familiar with
> >>> blktrace and the associated tracepoints.  If it is possible to snoop
> >>> on kernel/user data using blktrace then it probably should be
> >>> protected by a lockdown control point.
> >>>
> >>> Is this something you could verify and potentially submit a patch to resolve?
> >> blktrace can not snoop kernel/user data. The information it got from
> >> kernel is kind of "io metadata", basically include which process from
> >> which cpu, at what time, triggered what kind of IO events(issue, queue,
> >> complete etc.), to which disk, from which sector offset and how long.
> >> blktrace has no way to know what's inside that io. I am kind of think
> >> this is safe for lockdown mode.
> > Well, you could always submit a patch* and we would review it like any
> > other; that's usually a much better approach.
> >
> > * Yes, there was a patch submitted, but it was against a distro kernel
> > that diverged significantly from the upstream kernel in the relevant
> > areas.
>
> Sure, i will submit a new one.
>
> Before that, may i ask this question? It may affect the approach of the
> patch.
>
> Lockdown blocked files with mmap operation even that files are
> read-only, may i know what's the security concern there?
>
> static int debugfs_locked_down(struct inode *inode,
>                     struct file *filp,
>                     const struct file_operations *real_fops)
> {
>      if ((inode->i_mode & 07777 & ~0444) == 0 &&
>          !(filp->f_mode & FMODE_WRITE) &&
>          !real_fops->unlocked_ioctl &&
>          !real_fops->compat_ioctl &&
>          !real_fops->mmap)
>          return 0;
>
>      if (security_locked_down(LOCKDOWN_DEBUGFS))
>          return -EPERM;
>
>      return 0;
> }

I think the comment block at the top of that function describes it well:

/*
 * Only permit access to world-readable files when the kernel is locked down.
 * We also need to exclude any file that has ways to write or alter it as root
 * can bypass the permissions check.
 */

--
paul-moore.com
  
Junxiao Bi April 10, 2023, 10:31 p.m. UTC | #12
On 4/10/23 3:00 PM, Paul Moore wrote:

>>> Well, you could always submit a patch* and we would review it like any
>>> other; that's usually a much better approach.
>>>
>>> * Yes, there was a patch submitted, but it was against a distro kernel
>>> that diverged significantly from the upstream kernel in the relevant
>>> areas.
>> Sure, i will submit a new one.
>>
>> Before that, may i ask this question? It may affect the approach of the
>> patch.
>>
>> Lockdown blocked files with mmap operation even that files are
>> read-only, may i know what's the security concern there?
>>
>> static int debugfs_locked_down(struct inode *inode,
>>                      struct file *filp,
>>                      const struct file_operations *real_fops)
>> {
>>       if ((inode->i_mode & 07777 & ~0444) == 0 &&
>>           !(filp->f_mode & FMODE_WRITE) &&
>>           !real_fops->unlocked_ioctl &&
>>           !real_fops->compat_ioctl &&
>>           !real_fops->mmap)
>>           return 0;
>>
>>       if (security_locked_down(LOCKDOWN_DEBUGFS))
>>           return -EPERM;
>>
>>       return 0;
>> }
> I think the comment block at the top of that function describes it well:
>
> /*
>   * Only permit access to world-readable files when the kernel is locked down.
>   * We also need to exclude any file that has ways to write or alter it as root
>   * can bypass the permissions check.
>   */

I may have some misunderstanding of  commit 5496197f9b08("debugfs: 
Restrict debugfs when the kernel is locked down"),  it mentioned chmod 
is disabled for debugfs file, so i thought permission of debugfs file 
can not be changed. Actually I just tested that, the permission can be 
changed! I am not sure whether this is an issue or not. Anyway i 
understand the security concern with mmap, thanks a lot.

Thanks,

Junxiao.

>
> --
> paul-moore.com
  

Patch

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index d574bda24e21..93ab719d8c7b 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -20,6 +20,7 @@ 
 #include <linux/device.h>
 #include <linux/poll.h>
 #include <linux/security.h>
+#include <linux/relay.h>
 
 #include "internal.h"
 
@@ -137,6 +138,22 @@  void debugfs_file_put(struct dentry *dentry)
 }
 EXPORT_SYMBOL_GPL(debugfs_file_put);
 
+bool debugfs_file_is_relay(struct dentry *dentry)
+{
+	struct debugfs_fsdata *fsd;
+	void *d_fsd;
+	void *fops;
+
+	d_fsd = READ_ONCE(dentry->d_fsdata);
+	if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
+		fsd = d_fsd;
+		fops = (void *)fsd->real_fops;
+	} else
+		fops = (void *)((unsigned long)d_fsd &
+					~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
+    return fops == (void *)&relay_file_operations;
+}
+
 /*
  * Only permit access to world-readable files when the kernel is locked down.
  * We also need to exclude any file that has ways to write or alter it as root
diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
index 6bcedb3f90b3..392bb1972226 100644
--- a/fs/debugfs/internal.h
+++ b/fs/debugfs/internal.h
@@ -37,6 +37,7 @@  static const char * const arch_whitelist[] = {
 	"mds_user_clear"
 };
 
+extern bool debugfs_file_is_relay(struct dentry *dentry);
 struct dentry *__attribute__((weak))get_arch_debugfs_dir(void) {return NULL; }
 
 static bool debugfs_lockdown_whitelisted(struct dentry *dentry)
@@ -51,6 +52,10 @@  static bool debugfs_lockdown_whitelisted(struct dentry *dentry)
 		}
 	}
 
+	/* relay file is used for userspace/kernel communicate.*/
+	if (debugfs_file_is_relay(dentry))
+		return true;
+
 	return false;
 }