[v6,1/3] random: add vgetrandom_alloc() syscall

Message ID 20221121152909.3414096-2-Jason@zx2c4.com
State New
Headers
Series implement getrandom() in vDSO |

Commit Message

Jason A. Donenfeld Nov. 21, 2022, 3:29 p.m. UTC
  The vDSO getrandom() works over an opaque per-thread state of an
unexported size, which must be marked as MADV_WIPEONFORK and be
mlock()'d for proper operation. Over time, the nuances of these
allocations may change or grow or even differ based on architectural
features.

The syscall has the signature:

  void *vgetrandom_alloc([inout] size_t *num, [out] size_t *size_per_each, unsigned int flags);

This takes the desired number of opaque states in `num`, and returns a
pointer to an array of opaque states, the number actually allocated back
in `num`, and the size in bytes of each one in `size_per_each`, enabling
a libc to slice up the returned array into a state per each thread. (The
`flags` argument is always zero for now.) Libc is expected to allocate a
chunk of these on first use, and then dole them out to threads as
they're created, allocating more when needed. The following commit shows
an example of this, being used in conjunction with the getrandom() vDSO
function.

We very intentionally do *not* leave state allocation for vDSO
getrandom() up to userspace itself, but rather provide this new syscall
for such allocations. vDSO getrandom() must not store its state in just
any old memory address, but rather just ones that the kernel specially
allocates for it, leaving the particularities of those allocations up to
the kernel.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 MAINTAINERS                             |  1 +
 arch/x86/Kconfig                        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
 arch/x86/include/asm/unistd.h           |  1 +
 drivers/char/random.c                   | 59 +++++++++++++++++++++++++
 include/uapi/asm-generic/unistd.h       |  7 ++-
 kernel/sys_ni.c                         |  3 ++
 lib/vdso/getrandom.h                    | 23 ++++++++++
 scripts/checksyscalls.sh                |  4 ++
 tools/include/uapi/asm-generic/unistd.h |  7 ++-
 10 files changed, 105 insertions(+), 2 deletions(-)
 create mode 100644 lib/vdso/getrandom.h
  

Comments

Florian Weimer Nov. 23, 2022, 10:46 a.m. UTC | #1
* Jason A. Donenfeld:

> + * The vgetrandom() function in userspace requires an opaque state, which this
> + * function provides to userspace, by mapping a certain number of special pages
> + * into the calling process. It takes a hint as to the number of opaque states
> + * desired, and returns the number of opaque states actually allocated, the
> + * size of each one in bytes, and the address of the first state.
> + */
> +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> +		unsigned long __user *, size_per_each, unsigned int, flags)

I think you should make this __u64, so that you get a consistent
userspace interface on all architectures, without the need for compat
system calls.

Thanks,
Florian
  
Jason A. Donenfeld Nov. 24, 2022, 1:04 a.m. UTC | #2
Hi Florian,

On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
> 
> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > + * function provides to userspace, by mapping a certain number of special pages
> > + * into the calling process. It takes a hint as to the number of opaque states
> > + * desired, and returns the number of opaque states actually allocated, the
> > + * size of each one in bytes, and the address of the first state.
> > + */
> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> 
> I think you should make this __u64, so that you get a consistent
> userspace interface on all architectures, without the need for compat
> system calls.

That would be quite unconventional. Most syscalls that take lengths do
so with the native register size (`unsigned long`, `size_t`), rather
than u64. If you can point to a recent trend away from this by
indicating some commits that added new syscalls with u64, I'd be happy
to be shown otherwise. But AFAIK, that's not the way it's done.

Jason
  
Florian Weimer Nov. 24, 2022, 5:25 a.m. UTC | #3
* Jason A. Donenfeld:

> Hi Florian,
>
> On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
>> * Jason A. Donenfeld:
>> 
>> > + * The vgetrandom() function in userspace requires an opaque state, which this
>> > + * function provides to userspace, by mapping a certain number of special pages
>> > + * into the calling process. It takes a hint as to the number of opaque states
>> > + * desired, and returns the number of opaque states actually allocated, the
>> > + * size of each one in bytes, and the address of the first state.
>> > + */
>> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
>> > +		unsigned long __user *, size_per_each, unsigned int, flags)
>> 
>> I think you should make this __u64, so that you get a consistent
>> userspace interface on all architectures, without the need for compat
>> system calls.
>
> That would be quite unconventional. Most syscalls that take lengths do
> so with the native register size (`unsigned long`, `size_t`), rather
> than u64. If you can point to a recent trend away from this by
> indicating some commits that added new syscalls with u64, I'd be happy
> to be shown otherwise. But AFAIK, that's not the way it's done.

See clone3 and struct clone_args.  It's more common with pointers, which
are now 64 bits unconditionally: struct futex_waitv, struct rseq_cs and
struct rseq.

If the length or pointer is a system call argument, widening it to 64
bits is not necessary because zero-extension to the full register
eliminates the need for a compat system call.  But if you pass the
address to a size or pointer, you'll need compat syscalls if you don't
make the passed data __u64.

Thanks,
Florian
  
Jason A. Donenfeld Nov. 24, 2022, 12:03 p.m. UTC | #4
Hi Florian,

On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
> 
> > Hi Florian,
> >
> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> >> * Jason A. Donenfeld:
> >> 
> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> >> > + * function provides to userspace, by mapping a certain number of special pages
> >> > + * into the calling process. It takes a hint as to the number of opaque states
> >> > + * desired, and returns the number of opaque states actually allocated, the
> >> > + * size of each one in bytes, and the address of the first state.
> >> > + */
> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> >> 
> >> I think you should make this __u64, so that you get a consistent
> >> userspace interface on all architectures, without the need for compat
> >> system calls.
> >
> > That would be quite unconventional. Most syscalls that take lengths do
> > so with the native register size (`unsigned long`, `size_t`), rather
> > than u64. If you can point to a recent trend away from this by
> > indicating some commits that added new syscalls with u64, I'd be happy
> > to be shown otherwise. But AFAIK, that's not the way it's done.
> 
> See clone3 and struct clone_args.

The struct is one thing. But actually, clone3 takes a `size_t`:

    SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)

I take from this that I too should use `size_t` rather than `unsigned
long.` And it doesn't seem like there's any compat clone3.

Jason
  
Florian Weimer Nov. 24, 2022, 12:15 p.m. UTC | #5
* Jason A. Donenfeld:

> Hi Florian,
>
> On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
>> * Jason A. Donenfeld:
>> 
>> > Hi Florian,
>> >
>> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
>> >> * Jason A. Donenfeld:
>> >> 
>> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
>> >> > + * function provides to userspace, by mapping a certain number of special pages
>> >> > + * into the calling process. It takes a hint as to the number of opaque states
>> >> > + * desired, and returns the number of opaque states actually allocated, the
>> >> > + * size of each one in bytes, and the address of the first state.
>> >> > + */
>> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
>> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
>> >> 
>> >> I think you should make this __u64, so that you get a consistent
>> >> userspace interface on all architectures, without the need for compat
>> >> system calls.
>> >
>> > That would be quite unconventional. Most syscalls that take lengths do
>> > so with the native register size (`unsigned long`, `size_t`), rather
>> > than u64. If you can point to a recent trend away from this by
>> > indicating some commits that added new syscalls with u64, I'd be happy
>> > to be shown otherwise. But AFAIK, that's not the way it's done.
>> 
>> See clone3 and struct clone_args.
>
> The struct is one thing. But actually, clone3 takes a `size_t`:
>
>     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
>
> I take from this that I too should use `size_t` rather than `unsigned
> long.` And it doesn't seem like there's any compat clone3.

But vgetrandom_alloc does not use unsigned long, but unsigned long *.
You need to look at the contents for struct clone_args for comparison.

Thanks,
Florian
  
Jason A. Donenfeld Nov. 24, 2022, 12:24 p.m. UTC | #6
Hi Florian,

On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
> 
> > Hi Florian,
> >
> > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> >> * Jason A. Donenfeld:
> >> 
> >> > Hi Florian,
> >> >
> >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> >> >> * Jason A. Donenfeld:
> >> >> 
> >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> >> >> > + * function provides to userspace, by mapping a certain number of special pages
> >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> >> >> > + * desired, and returns the number of opaque states actually allocated, the
> >> >> > + * size of each one in bytes, and the address of the first state.
> >> >> > + */
> >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> >> >> 
> >> >> I think you should make this __u64, so that you get a consistent
> >> >> userspace interface on all architectures, without the need for compat
> >> >> system calls.
> >> >
> >> > That would be quite unconventional. Most syscalls that take lengths do
> >> > so with the native register size (`unsigned long`, `size_t`), rather
> >> > than u64. If you can point to a recent trend away from this by
> >> > indicating some commits that added new syscalls with u64, I'd be happy
> >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> >> 
> >> See clone3 and struct clone_args.
> >
> > The struct is one thing. But actually, clone3 takes a `size_t`:
> >
> >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> >
> > I take from this that I too should use `size_t` rather than `unsigned
> > long.` And it doesn't seem like there's any compat clone3.
> 
> But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> You need to look at the contents for struct clone_args for comparison.

Ah! I see what you mean; that's a good point. The usual register
clearing thing isn't going to happen because these are addresses.

I still am somewhat hesitant, though, because `size_t` is really the
"proper" type to be used. Maybe the compat syscall thing is just a
necessary evil?

The other direction would be making this a u32, since 640k ought to be
enough for anybody and such, but maybe that'd be a mistake too.

So I'm not sure. Anybody else on the list with experience adding
syscalls have an opinion?

Jason
  
Jason A. Donenfeld Nov. 24, 2022, 12:48 p.m. UTC | #7
Hey again,

On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
> 
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> > 
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >> 
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >> 
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > +                unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >> 
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >> 
> > >> See clone3 and struct clone_args.
> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> > 
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
> 
> Ah! I see what you mean; that's a good point. The usual register
> clearing thing isn't going to happen because these are addresses.
> 
> I still am somewhat hesitant, though, because `size_t` is really the
> "proper" type to be used. Maybe the compat syscall thing is just a
> necessary evil?
> 
> The other direction would be making this a u32, since 640k ought to be
> enough for anybody and such, but maybe that'd be a mistake too.
> 
> So I'm not sure. Anybody else on the list with experience adding
> syscalls have an opinion?

Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an
unsigned long pointer and I don't see any compat stuff around it:

    SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
                    unsigned long, maxnode)
    
    SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
                    unsigned long __user *, nmask, unsigned long, maxnode,
                    unsigned long, addr, unsigned long, flags)

    SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
                    const unsigned long __user *, old_nodes,
                    const unsigned long __user *, new_nodes)
     

In contrast sched_setaffinity and get_robust_list take a unsigned long
pointer and does have a compat wrapper:

    SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
                    unsigned long __user *, user_mask_ptr)

    SYSCALL_DEFINE3(get_robust_list, int, pid,
                    struct robust_list_head __user * __user *, head_ptr,
                    size_t __user *, len_ptr)

Jason
  
Christian Brauner Nov. 24, 2022, 12:49 p.m. UTC | #8
On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
> 
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> > 
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >> 
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >> 
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >> 
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >> 
> > >> See clone3 and struct clone_args.

For system calls that take structs as arguments we use u64 in the struct
for proper alignment so we can extend structs without regressing old
kernels. We have a few of those extensible struct system calls.

But we don't really have a lot system calls that pass u64 as a pointer
outside of a structure so far. Neither as register and nor as pointer
iirc. Passing them as a register arg is problematic because of 32bit
arches. But passing as pointer should be fine but it is indeed uncommon.

> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> > 
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
> 
> Ah! I see what you mean; that's a good point. The usual register
> clearing thing isn't going to happen because these are addresses.
> 
> I still am somewhat hesitant, though, because `size_t` is really the
> "proper" type to be used. Maybe the compat syscall thing is just a
> necessary evil?

We try to avoid adding new compat-requiring syscalls like the plague
usually. (At least for new syscalls that don't need to inherit behavior
from earlier syscalls they are a revisions of.)

> 
> The other direction would be making this a u32, since 640k ought to be
> enough for anybody and such, but maybe that'd be a mistake too.

I think making this a size_t is fine. We haven't traditionally used u32
for sizes. All syscalls that pass structs versioned by size use size_t.
So I would recommend to stick with that.

Alternatively, you could also introduce a simple struct versioned by
size for this system call similar to mount_setatt() and clone3() and so
on. This way you don't need to worry about future extensibilty. Just a
thought.
  
Jason A. Donenfeld Nov. 24, 2022, 12:57 p.m. UTC | #9
Hi Christian,

Thanks a bunch for chiming in.

On Thu, Nov 24, 2022 at 01:49:27PM +0100, Christian Brauner wrote:
> Alternatively, you could also introduce a simple struct versioned by
> size for this system call similar to mount_setatt() and clone3() and so
> on. This way you don't need to worry about future extensibilty. Just a
> thought.

Briefly considered that, but it seemed a bit heavy for something like
this. I'm not super heavily opposed, but just seemed like a bit much.

> > > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > > >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > > >> >> 
> > > >> >> I think you should make this __u64, so that you get a consistent
> > > >> >> userspace interface on all architectures, without the need for compat
> > > >> >> system calls.
> > > >> >
> > > >> > That would be quite unconventional. Most syscalls that take lengths do
> > > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > > >> > than u64. If you can point to a recent trend away from this by
> > > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > > >> 
> > > >> See clone3 and struct clone_args.
> 
> For system calls that take structs as arguments we use u64 in the struct
> for proper alignment so we can extend structs without regressing old
> kernels. We have a few of those extensible struct system calls.
> 
> But we don't really have a lot system calls that pass u64 as a pointer
> outside of a structure so far. Neither as register and nor as pointer
> iirc.

Right, the __u64_aligned business seemed to be mostly about
extensibility.

> > > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > > >
> > > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > > >
> > > > I take from this that I too should use `size_t` rather than `unsigned
> > > > long.` And it doesn't seem like there's any compat clone3.
> > > 
> > > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > > You need to look at the contents for struct clone_args for comparison.
> > 
> > Ah! I see what you mean; that's a good point. The usual register
> > clearing thing isn't going to happen because these are addresses.
> > 
> > I still am somewhat hesitant, though, because `size_t` is really the
> > "proper" type to be used. Maybe the compat syscall thing is just a
> > necessary evil?
>
> I think making this a size_t is fine. We haven't traditionally used u32
> for sizes. All syscalls that pass structs versioned by size use size_t.
> So I would recommend to stick with that.

This isn't quite a struct versioned by size. This is:

    void *vgetrandom_alloc([inout] size_t *num, [out] size_t *size_per_each, unsigned int flags);

You give it an input 'num' and some flags (currently flags=0), and it
gives you back an output 'num' size, an output 'size_per_each' size, and
an opaque pointer value mapping as its return value.

I do like the idea of keeping size_t so that the type is "right". But
the other arguments are equally compelling as well, so not sure.

Jason
  
Arnd Bergmann Nov. 24, 2022, 1:18 p.m. UTC | #10
On Thu, Nov 24, 2022, at 13:48, Jason A. Donenfeld wrote:
> On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:

> Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an
> unsigned long pointer and I don't see any compat stuff around it:
>
>     SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long 
> __user *, nmask,
>                     unsigned long, maxnode)
>    
>     SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
>                     unsigned long __user *, nmask, unsigned long, maxnode,
>                     unsigned long, addr, unsigned long, flags)
>
>     SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>                     const unsigned long __user *, old_nodes,
>                     const unsigned long __user *, new_nodes)

Compat handling for these is done all the way down in the
pointer access:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mempolicy.c#n1368

This works here because it's a special bitmap but is not the
best approach if you just have a pointer to a single value.

       Arnd
  
Jason A. Donenfeld Nov. 24, 2022, 4:30 p.m. UTC | #11
On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
> 
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> > 
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >> 
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >> 
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >> 
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >> 
> > >> See clone3 and struct clone_args.
> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> > 
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
> 
> The other direction would be making this a u32

I think `unsigned int` is actually a sensible size for what these values
should be. That eliminates the problem and potential bikeshed too. So
I'll go with that for v+1.

Jason
  

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 256f03904987..843dd6a49538 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17287,6 +17287,7 @@  T:	git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git
 S:	Maintained
 F:	drivers/char/random.c
 F:	drivers/virt/vmgenid.c
+F:	lib/vdso/getrandom.h
 
 RAPIDIO SUBSYSTEM
 M:	Matt Porter <mporter@kernel.crashing.org>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..331e21ba961a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -59,6 +59,7 @@  config X86
 	#
 	select ACPI_LEGACY_TABLES_LOOKUP	if ACPI
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
+	select ADVISE_SYSCALLS			if X86_64
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..0186f173f0e8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@ 
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	vgetrandom_alloc	sys_vgetrandom_alloc
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/asm/unistd.h b/arch/x86/include/asm/unistd.h
index 761173ccc33c..1bf509eaeff1 100644
--- a/arch/x86/include/asm/unistd.h
+++ b/arch/x86/include/asm/unistd.h
@@ -27,6 +27,7 @@ 
 #  define __ARCH_WANT_COMPAT_SYS_PWRITEV64
 #  define __ARCH_WANT_COMPAT_SYS_PREADV64V2
 #  define __ARCH_WANT_COMPAT_SYS_PWRITEV64V2
+#  define __ARCH_WANT_VGETRANDOM_ALLOC
 #  define X32_NR_syscalls (__NR_x32_syscalls)
 #  define IA32_NR_syscalls (__NR_ia32_syscalls)
 
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 65ee69896967..9b64db52849f 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -8,6 +8,7 @@ 
  * into roughly six sections, each with a section header:
  *
  *   - Initialization and readiness waiting.
+ *   - vDSO support helpers.
  *   - Fast key erasure RNG, the "crng".
  *   - Entropy accumulation and extraction routines.
  *   - Entropy collection routines.
@@ -39,6 +40,7 @@ 
 #include <linux/blkdev.h>
 #include <linux/interrupt.h>
 #include <linux/mm.h>
+#include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/spinlock.h>
 #include <linux/kthread.h>
@@ -59,6 +61,7 @@ 
 #include <asm/irq.h>
 #include <asm/irq_regs.h>
 #include <asm/io.h>
+#include "../../lib/vdso/getrandom.h"
 
 /*********************************************************************
  *
@@ -146,6 +149,62 @@  EXPORT_SYMBOL(wait_for_random_bytes);
 				__func__, (void *)_RET_IP_, crng_init)
 
 
+
+/********************************************************************
+ *
+ * vDSO support helpers.
+ *
+ * The actual vDSO function is defined over in lib/vdso/getrandom.c,
+ * but this section contains the kernel-mode helpers to support that.
+ *
+ ********************************************************************/
+
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+/*
+ * The vgetrandom() function in userspace requires an opaque state, which this
+ * function provides to userspace, by mapping a certain number of special pages
+ * into the calling process. It takes a hint as to the number of opaque states
+ * desired, and returns the number of opaque states actually allocated, the
+ * size of each one in bytes, and the address of the first state.
+ */
+SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
+		unsigned long __user *, size_per_each, unsigned int, flags)
+{
+	unsigned long alloc_size;
+	unsigned long num_states;
+	unsigned long pages_addr;
+	int ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (get_user(num_states, num))
+		return -EFAULT;
+
+	num_states = clamp(num_states, 1UL, (SIZE_MAX & PAGE_MASK) / sizeof(struct vgetrandom_state));
+	alloc_size = PAGE_ALIGN(num_states * sizeof(struct vgetrandom_state));
+
+	if (put_user(alloc_size / sizeof(struct vgetrandom_state), num) ||
+	    put_user(sizeof(struct vgetrandom_state), size_per_each))
+		return -EFAULT;
+
+	pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0);
+	if (IS_ERR_VALUE(pages_addr))
+		return pages_addr;
+
+	ret = do_madvise(current->mm, pages_addr, alloc_size, MADV_WIPEONFORK);
+	if (ret < 0)
+		goto err_unmap;
+
+	return pages_addr;
+
+err_unmap:
+	vm_munmap(pages_addr, alloc_size);
+	return ret;
+}
+#endif
+
 /*********************************************************************
  *
  * Fast key erasure RNG, the "crng".
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..77b6debe7e18 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,13 @@  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..f28196cb919b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -360,6 +360,9 @@  COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* random */
+COND_SYSCALL(vgetrandom_alloc);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/lib/vdso/getrandom.h b/lib/vdso/getrandom.h
new file mode 100644
index 000000000000..c7f727db2aaa
--- /dev/null
+++ b/lib/vdso/getrandom.h
@@ -0,0 +1,23 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _VDSO_LIB_GETRANDOM_H
+#define _VDSO_LIB_GETRANDOM_H
+
+#include <crypto/chacha.h>
+
+struct vgetrandom_state {
+	union {
+		struct {
+			u8 batch[CHACHA_BLOCK_SIZE * 3 / 2];
+			u32 key[CHACHA_KEY_SIZE / sizeof(u32)];
+		};
+		u8 batch_key[CHACHA_BLOCK_SIZE * 2];
+	};
+	unsigned long generation;
+	u8 pos;
+};
+
+#endif /* _VDSO_LIB_GETRANDOM_H */
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index f33e61aca93d..7f7928c6487f 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -44,6 +44,10 @@  cat << EOF
 #define __IGNORE_memfd_secret
 #endif
 
+#ifndef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __IGNORE_vgetrandom_alloc
+#endif
+
 /* Missing flags argument */
 #define __IGNORE_renameat	/* renameat2 */
 
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..77b6debe7e18 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,13 @@  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different