[v6,1/3] random: add vgetrandom_alloc() syscall
Commit Message
The vDSO getrandom() works over an opaque per-thread state of an
unexported size, which must be marked as MADV_WIPEONFORK and be
mlock()'d for proper operation. Over time, the nuances of these
allocations may change or grow or even differ based on architectural
features.
The syscall has the signature:
void *vgetrandom_alloc([inout] size_t *num, [out] size_t *size_per_each, unsigned int flags);
This takes the desired number of opaque states in `num`, and returns a
pointer to an array of opaque states, the number actually allocated back
in `num`, and the size in bytes of each one in `size_per_each`, enabling
a libc to slice up the returned array into a state per each thread. (The
`flags` argument is always zero for now.) Libc is expected to allocate a
chunk of these on first use, and then dole them out to threads as
they're created, allocating more when needed. The following commit shows
an example of this, being used in conjunction with the getrandom() vDSO
function.
We very intentionally do *not* leave state allocation for vDSO
getrandom() up to userspace itself, but rather provide this new syscall
for such allocations. vDSO getrandom() must not store its state in just
any old memory address, but rather just ones that the kernel specially
allocates for it, leaving the particularities of those allocations up to
the kernel.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
MAINTAINERS | 1 +
arch/x86/Kconfig | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/unistd.h | 1 +
drivers/char/random.c | 59 +++++++++++++++++++++++++
include/uapi/asm-generic/unistd.h | 7 ++-
kernel/sys_ni.c | 3 ++
lib/vdso/getrandom.h | 23 ++++++++++
scripts/checksyscalls.sh | 4 ++
tools/include/uapi/asm-generic/unistd.h | 7 ++-
10 files changed, 105 insertions(+), 2 deletions(-)
create mode 100644 lib/vdso/getrandom.h
Comments
* Jason A. Donenfeld:
> + * The vgetrandom() function in userspace requires an opaque state, which this
> + * function provides to userspace, by mapping a certain number of special pages
> + * into the calling process. It takes a hint as to the number of opaque states
> + * desired, and returns the number of opaque states actually allocated, the
> + * size of each one in bytes, and the address of the first state.
> + */
> +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> + unsigned long __user *, size_per_each, unsigned int, flags)
I think you should make this __u64, so that you get a consistent
userspace interface on all architectures, without the need for compat
system calls.
Thanks,
Florian
Hi Florian,
On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
>
> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > + * function provides to userspace, by mapping a certain number of special pages
> > + * into the calling process. It takes a hint as to the number of opaque states
> > + * desired, and returns the number of opaque states actually allocated, the
> > + * size of each one in bytes, and the address of the first state.
> > + */
> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > + unsigned long __user *, size_per_each, unsigned int, flags)
>
> I think you should make this __u64, so that you get a consistent
> userspace interface on all architectures, without the need for compat
> system calls.
That would be quite unconventional. Most syscalls that take lengths do
so with the native register size (`unsigned long`, `size_t`), rather
than u64. If you can point to a recent trend away from this by
indicating some commits that added new syscalls with u64, I'd be happy
to be shown otherwise. But AFAIK, that's not the way it's done.
Jason
* Jason A. Donenfeld:
> Hi Florian,
>
> On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
>> * Jason A. Donenfeld:
>>
>> > + * The vgetrandom() function in userspace requires an opaque state, which this
>> > + * function provides to userspace, by mapping a certain number of special pages
>> > + * into the calling process. It takes a hint as to the number of opaque states
>> > + * desired, and returns the number of opaque states actually allocated, the
>> > + * size of each one in bytes, and the address of the first state.
>> > + */
>> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
>> > + unsigned long __user *, size_per_each, unsigned int, flags)
>>
>> I think you should make this __u64, so that you get a consistent
>> userspace interface on all architectures, without the need for compat
>> system calls.
>
> That would be quite unconventional. Most syscalls that take lengths do
> so with the native register size (`unsigned long`, `size_t`), rather
> than u64. If you can point to a recent trend away from this by
> indicating some commits that added new syscalls with u64, I'd be happy
> to be shown otherwise. But AFAIK, that's not the way it's done.
See clone3 and struct clone_args. It's more common with pointers, which
are now 64 bits unconditionally: struct futex_waitv, struct rseq_cs and
struct rseq.
If the length or pointer is a system call argument, widening it to 64
bits is not necessary because zero-extension to the full register
eliminates the need for a compat system call. But if you pass the
address to a size or pointer, you'll need compat syscalls if you don't
make the passed data __u64.
Thanks,
Florian
Hi Florian,
On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
>
> > Hi Florian,
> >
> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> >> * Jason A. Donenfeld:
> >>
> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> >> > + * function provides to userspace, by mapping a certain number of special pages
> >> > + * into the calling process. It takes a hint as to the number of opaque states
> >> > + * desired, and returns the number of opaque states actually allocated, the
> >> > + * size of each one in bytes, and the address of the first state.
> >> > + */
> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> >> > + unsigned long __user *, size_per_each, unsigned int, flags)
> >>
> >> I think you should make this __u64, so that you get a consistent
> >> userspace interface on all architectures, without the need for compat
> >> system calls.
> >
> > That would be quite unconventional. Most syscalls that take lengths do
> > so with the native register size (`unsigned long`, `size_t`), rather
> > than u64. If you can point to a recent trend away from this by
> > indicating some commits that added new syscalls with u64, I'd be happy
> > to be shown otherwise. But AFAIK, that's not the way it's done.
>
> See clone3 and struct clone_args.
The struct is one thing. But actually, clone3 takes a `size_t`:
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
I take from this that I too should use `size_t` rather than `unsigned
long.` And it doesn't seem like there's any compat clone3.
Jason
* Jason A. Donenfeld:
> Hi Florian,
>
> On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
>> * Jason A. Donenfeld:
>>
>> > Hi Florian,
>> >
>> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
>> >> * Jason A. Donenfeld:
>> >>
>> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
>> >> > + * function provides to userspace, by mapping a certain number of special pages
>> >> > + * into the calling process. It takes a hint as to the number of opaque states
>> >> > + * desired, and returns the number of opaque states actually allocated, the
>> >> > + * size of each one in bytes, and the address of the first state.
>> >> > + */
>> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
>> >> > + unsigned long __user *, size_per_each, unsigned int, flags)
>> >>
>> >> I think you should make this __u64, so that you get a consistent
>> >> userspace interface on all architectures, without the need for compat
>> >> system calls.
>> >
>> > That would be quite unconventional. Most syscalls that take lengths do
>> > so with the native register size (`unsigned long`, `size_t`), rather
>> > than u64. If you can point to a recent trend away from this by
>> > indicating some commits that added new syscalls with u64, I'd be happy
>> > to be shown otherwise. But AFAIK, that's not the way it's done.
>>
>> See clone3 and struct clone_args.
>
> The struct is one thing. But actually, clone3 takes a `size_t`:
>
> SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
>
> I take from this that I too should use `size_t` rather than `unsigned
> long.` And it doesn't seem like there's any compat clone3.
But vgetrandom_alloc does not use unsigned long, but unsigned long *.
You need to look at the contents for struct clone_args for comparison.
Thanks,
Florian
Hi Florian,
On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
>
> > Hi Florian,
> >
> > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> >> * Jason A. Donenfeld:
> >>
> >> > Hi Florian,
> >> >
> >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> >> >> * Jason A. Donenfeld:
> >> >>
> >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> >> >> > + * function provides to userspace, by mapping a certain number of special pages
> >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> >> >> > + * desired, and returns the number of opaque states actually allocated, the
> >> >> > + * size of each one in bytes, and the address of the first state.
> >> >> > + */
> >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> >> >> > + unsigned long __user *, size_per_each, unsigned int, flags)
> >> >>
> >> >> I think you should make this __u64, so that you get a consistent
> >> >> userspace interface on all architectures, without the need for compat
> >> >> system calls.
> >> >
> >> > That would be quite unconventional. Most syscalls that take lengths do
> >> > so with the native register size (`unsigned long`, `size_t`), rather
> >> > than u64. If you can point to a recent trend away from this by
> >> > indicating some commits that added new syscalls with u64, I'd be happy
> >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> >>
> >> See clone3 and struct clone_args.
> >
> > The struct is one thing. But actually, clone3 takes a `size_t`:
> >
> > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> >
> > I take from this that I too should use `size_t` rather than `unsigned
> > long.` And it doesn't seem like there's any compat clone3.
>
> But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> You need to look at the contents for struct clone_args for comparison.
Ah! I see what you mean; that's a good point. The usual register
clearing thing isn't going to happen because these are addresses.
I still am somewhat hesitant, though, because `size_t` is really the
"proper" type to be used. Maybe the compat syscall thing is just a
necessary evil?
The other direction would be making this a u32, since 640k ought to be
enough for anybody and such, but maybe that'd be a mistake too.
So I'm not sure. Anybody else on the list with experience adding
syscalls have an opinion?
Jason
Hey again,
On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
>
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> >
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >>
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >>
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >>
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >>
> > >> See clone3 and struct clone_args.
> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> >
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
>
> Ah! I see what you mean; that's a good point. The usual register
> clearing thing isn't going to happen because these are addresses.
>
> I still am somewhat hesitant, though, because `size_t` is really the
> "proper" type to be used. Maybe the compat syscall thing is just a
> necessary evil?
>
> The other direction would be making this a u32, since 640k ought to be
> enough for anybody and such, but maybe that'd be a mistake too.
>
> So I'm not sure. Anybody else on the list with experience adding
> syscalls have an opinion?
Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an
unsigned long pointer and I don't see any compat stuff around it:
SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
unsigned long, maxnode)
SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
unsigned long __user *, nmask, unsigned long, maxnode,
unsigned long, addr, unsigned long, flags)
SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
const unsigned long __user *, old_nodes,
const unsigned long __user *, new_nodes)
In contrast sched_setaffinity and get_robust_list take a unsigned long
pointer and does have a compat wrapper:
SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
unsigned long __user *, user_mask_ptr)
SYSCALL_DEFINE3(get_robust_list, int, pid,
struct robust_list_head __user * __user *, head_ptr,
size_t __user *, len_ptr)
Jason
On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
>
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> >
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >>
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >>
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >>
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >>
> > >> See clone3 and struct clone_args.
For system calls that take structs as arguments we use u64 in the struct
for proper alignment so we can extend structs without regressing old
kernels. We have a few of those extensible struct system calls.
But we don't really have a lot system calls that pass u64 as a pointer
outside of a structure so far. Neither as register and nor as pointer
iirc. Passing them as a register arg is problematic because of 32bit
arches. But passing as pointer should be fine but it is indeed uncommon.
> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> >
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
>
> Ah! I see what you mean; that's a good point. The usual register
> clearing thing isn't going to happen because these are addresses.
>
> I still am somewhat hesitant, though, because `size_t` is really the
> "proper" type to be used. Maybe the compat syscall thing is just a
> necessary evil?
We try to avoid adding new compat-requiring syscalls like the plague
usually. (At least for new syscalls that don't need to inherit behavior
from earlier syscalls they are a revisions of.)
>
> The other direction would be making this a u32, since 640k ought to be
> enough for anybody and such, but maybe that'd be a mistake too.
I think making this a size_t is fine. We haven't traditionally used u32
for sizes. All syscalls that pass structs versioned by size use size_t.
So I would recommend to stick with that.
Alternatively, you could also introduce a simple struct versioned by
size for this system call similar to mount_setatt() and clone3() and so
on. This way you don't need to worry about future extensibilty. Just a
thought.
Hi Christian,
Thanks a bunch for chiming in.
On Thu, Nov 24, 2022 at 01:49:27PM +0100, Christian Brauner wrote:
> Alternatively, you could also introduce a simple struct versioned by
> size for this system call similar to mount_setatt() and clone3() and so
> on. This way you don't need to worry about future extensibilty. Just a
> thought.
Briefly considered that, but it seemed a bit heavy for something like
this. I'm not super heavily opposed, but just seemed like a bit much.
> > > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags)
> > > >> >>
> > > >> >> I think you should make this __u64, so that you get a consistent
> > > >> >> userspace interface on all architectures, without the need for compat
> > > >> >> system calls.
> > > >> >
> > > >> > That would be quite unconventional. Most syscalls that take lengths do
> > > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > > >> > than u64. If you can point to a recent trend away from this by
> > > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > > >>
> > > >> See clone3 and struct clone_args.
>
> For system calls that take structs as arguments we use u64 in the struct
> for proper alignment so we can extend structs without regressing old
> kernels. We have a few of those extensible struct system calls.
>
> But we don't really have a lot system calls that pass u64 as a pointer
> outside of a structure so far. Neither as register and nor as pointer
> iirc.
Right, the __u64_aligned business seemed to be mostly about
extensibility.
> > > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > > >
> > > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > > >
> > > > I take from this that I too should use `size_t` rather than `unsigned
> > > > long.` And it doesn't seem like there's any compat clone3.
> > >
> > > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > > You need to look at the contents for struct clone_args for comparison.
> >
> > Ah! I see what you mean; that's a good point. The usual register
> > clearing thing isn't going to happen because these are addresses.
> >
> > I still am somewhat hesitant, though, because `size_t` is really the
> > "proper" type to be used. Maybe the compat syscall thing is just a
> > necessary evil?
>
> I think making this a size_t is fine. We haven't traditionally used u32
> for sizes. All syscalls that pass structs versioned by size use size_t.
> So I would recommend to stick with that.
This isn't quite a struct versioned by size. This is:
void *vgetrandom_alloc([inout] size_t *num, [out] size_t *size_per_each, unsigned int flags);
You give it an input 'num' and some flags (currently flags=0), and it
gives you back an output 'num' size, an output 'size_per_each' size, and
an opaque pointer value mapping as its return value.
I do like the idea of keeping size_t so that the type is "right". But
the other arguments are equally compelling as well, so not sure.
Jason
On Thu, Nov 24, 2022, at 13:48, Jason A. Donenfeld wrote:
> On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an
> unsigned long pointer and I don't see any compat stuff around it:
>
> SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long
> __user *, nmask,
> unsigned long, maxnode)
>
> SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
> unsigned long __user *, nmask, unsigned long, maxnode,
> unsigned long, addr, unsigned long, flags)
>
> SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
> const unsigned long __user *, old_nodes,
> const unsigned long __user *, new_nodes)
Compat handling for these is done all the way down in the
pointer access:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mempolicy.c#n1368
This works here because it's a special bitmap but is not the
best approach if you just have a pointer to a single value.
Arnd
On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
>
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> >
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >>
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >>
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >>
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >>
> > >> See clone3 and struct clone_args.
> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> >
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
>
> The other direction would be making this a u32
I think `unsigned int` is actually a sensible size for what these values
should be. That eliminates the problem and potential bikeshed too. So
I'll go with that for v+1.
Jason
@@ -17287,6 +17287,7 @@ T: git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git
S: Maintained
F: drivers/char/random.c
F: drivers/virt/vmgenid.c
+F: lib/vdso/getrandom.h
RAPIDIO SUBSYSTEM
M: Matt Porter <mporter@kernel.crashing.org>
@@ -59,6 +59,7 @@ config X86
#
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
+ select ADVISE_SYSCALLS if X86_64
select ARCH_32BIT_OFF_T if X86_32
select ARCH_CLOCKSOURCE_INIT
select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common vgetrandom_alloc sys_vgetrandom_alloc
#
# Due to a historical design error, certain syscalls are numbered differently
@@ -27,6 +27,7 @@
# define __ARCH_WANT_COMPAT_SYS_PWRITEV64
# define __ARCH_WANT_COMPAT_SYS_PREADV64V2
# define __ARCH_WANT_COMPAT_SYS_PWRITEV64V2
+# define __ARCH_WANT_VGETRANDOM_ALLOC
# define X32_NR_syscalls (__NR_x32_syscalls)
# define IA32_NR_syscalls (__NR_ia32_syscalls)
@@ -8,6 +8,7 @@
* into roughly six sections, each with a section header:
*
* - Initialization and readiness waiting.
+ * - vDSO support helpers.
* - Fast key erasure RNG, the "crng".
* - Entropy accumulation and extraction routines.
* - Entropy collection routines.
@@ -39,6 +40,7 @@
#include <linux/blkdev.h>
#include <linux/interrupt.h>
#include <linux/mm.h>
+#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <linux/kthread.h>
@@ -59,6 +61,7 @@
#include <asm/irq.h>
#include <asm/irq_regs.h>
#include <asm/io.h>
+#include "../../lib/vdso/getrandom.h"
/*********************************************************************
*
@@ -146,6 +149,62 @@ EXPORT_SYMBOL(wait_for_random_bytes);
__func__, (void *)_RET_IP_, crng_init)
+
+/********************************************************************
+ *
+ * vDSO support helpers.
+ *
+ * The actual vDSO function is defined over in lib/vdso/getrandom.c,
+ * but this section contains the kernel-mode helpers to support that.
+ *
+ ********************************************************************/
+
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+/*
+ * The vgetrandom() function in userspace requires an opaque state, which this
+ * function provides to userspace, by mapping a certain number of special pages
+ * into the calling process. It takes a hint as to the number of opaque states
+ * desired, and returns the number of opaque states actually allocated, the
+ * size of each one in bytes, and the address of the first state.
+ */
+SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
+ unsigned long __user *, size_per_each, unsigned int, flags)
+{
+ unsigned long alloc_size;
+ unsigned long num_states;
+ unsigned long pages_addr;
+ int ret;
+
+ if (flags)
+ return -EINVAL;
+
+ if (get_user(num_states, num))
+ return -EFAULT;
+
+ num_states = clamp(num_states, 1UL, (SIZE_MAX & PAGE_MASK) / sizeof(struct vgetrandom_state));
+ alloc_size = PAGE_ALIGN(num_states * sizeof(struct vgetrandom_state));
+
+ if (put_user(alloc_size / sizeof(struct vgetrandom_state), num) ||
+ put_user(sizeof(struct vgetrandom_state), size_per_each))
+ return -EFAULT;
+
+ pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0);
+ if (IS_ERR_VALUE(pages_addr))
+ return pages_addr;
+
+ ret = do_madvise(current->mm, pages_addr, alloc_size, MADV_WIPEONFORK);
+ if (ret < 0)
+ goto err_unmap;
+
+ return pages_addr;
+
+err_unmap:
+ vm_munmap(pages_addr, alloc_size);
+ return ret;
+}
+#endif
+
/*********************************************************************
*
* Fast key erasure RNG, the "crng".
@@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+#endif
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
/*
* 32 bit systems traditionally used different
@@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
/* memfd_secret */
COND_SYSCALL(memfd_secret);
+/* random */
+COND_SYSCALL(vgetrandom_alloc);
+
/*
* Architecture specific weak syscall entries.
*/
new file mode 100644
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _VDSO_LIB_GETRANDOM_H
+#define _VDSO_LIB_GETRANDOM_H
+
+#include <crypto/chacha.h>
+
+struct vgetrandom_state {
+ union {
+ struct {
+ u8 batch[CHACHA_BLOCK_SIZE * 3 / 2];
+ u32 key[CHACHA_KEY_SIZE / sizeof(u32)];
+ };
+ u8 batch_key[CHACHA_BLOCK_SIZE * 2];
+ };
+ unsigned long generation;
+ u8 pos;
+};
+
+#endif /* _VDSO_LIB_GETRANDOM_H */
@@ -44,6 +44,10 @@ cat << EOF
#define __IGNORE_memfd_secret
#endif
+#ifndef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __IGNORE_vgetrandom_alloc
+#endif
+
/* Missing flags argument */
#define __IGNORE_renameat /* renameat2 */
@@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+#endif
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
/*
* 32 bit systems traditionally used different