regset: use vmalloc() for regset_get_alloc()

Message ID 20240201171159.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid
State New
Headers
Series regset: use vmalloc() for regset_get_alloc() |

Commit Message

Doug Anderson Feb. 2, 2024, 1:12 a.m. UTC
  While browsing through ChromeOS crash reports, I found one with an
allocation failure that looked like this:

  chrome: page allocation failure: order:7,
          mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO),
	  nodemask=(null),cpuset=urgent,mems_allowed=0
  CPU: 7 PID: 3295 Comm: chrome Not tainted
          5.15.133-20574-g8044615ac35c #1 (HASH:1162 1)
  Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
  Call trace:
  dump_backtrace+0x0/0x1ec
  show_stack+0x20/0x2c
  dump_stack_lvl+0x60/0x78
  dump_stack+0x18/0x38
  warn_alloc+0x104/0x174
  __alloc_pages+0x5f0/0x6e4
  kmalloc_order+0x44/0x98
  kmalloc_order_trace+0x34/0x124
  __kmalloc+0x228/0x36c
  __regset_get+0x68/0xcc
  regset_get_alloc+0x1c/0x28
  elf_core_dump+0x3d8/0xd8c
  do_coredump+0xeb8/0x1378
  get_signal+0x14c/0x804
  ...

An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not
a surprise that this allocation failed on a system that's been running
for a while.

In this case we're just generating a core dump and there's no reason
we need contiguous memory. Change the allocation to vmalloc(). We'll
change the free in binfmt_elf to kvfree() which works regardless of
how the memory was allocated.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
I don't know this code at all, but I _think_ I've identified the
places where the memory is freed correctly. Please double-check that I
didn't miss anything obvious, though!

 fs/binfmt_elf.c | 2 +-
 kernel/regset.c | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)
  

Comments

Al Viro Feb. 2, 2024, 1:22 a.m. UTC | #1
On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote:
> While browsing through ChromeOS crash reports, I found one with an
> allocation failure that looked like this:

> An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not
> a surprise that this allocation failed on a system that's been running
> for a while.

>  	if (size > regset->n * regset->size)
>  		size = regset->n * regset->size;
>  	if (!p) {
> -		to_free = p = kzalloc(size, GFP_KERNEL);
> +		to_free = p = vmalloc(size);

	What the hell?  Which regset could have lead to that?
It would need to have the total size of register in excess of
256K.  Seriously, which regset is that about?  Note that we
have just made sure that size is not greater than that product.
size is unsigned int, so it's not as if a negative value passed
to function could get through that test only to be interpreted
as large positive later...

	Details, please.
  
Matthew Wilcox Feb. 2, 2024, 1:24 a.m. UTC | #2
On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote:
> While browsing through ChromeOS crash reports, I found one with an
> allocation failure that looked like this:
> 
>   chrome: page allocation failure: order:7,
>           mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO),
> 	  nodemask=(null),cpuset=urgent,mems_allowed=0

That does seem bad ...

> @@ -16,14 +17,14 @@ static int __regset_get(struct task_struct *target,
>  	if (size > regset->n * regset->size)
>  		size = regset->n * regset->size;
>  	if (!p) {
> -		to_free = p = kzalloc(size, GFP_KERNEL);
> +		to_free = p = vmalloc(size);

It's my impression that sometimes this size might be relatively small?
Perhaps we should make this kvmalloc so that we can satisfy it from the
slab allocator if it is small?

Also, I assume that we don't rely on this memory being physically
contiguous; we don't, for example, do I/O on it?
  
Doug Anderson Feb. 2, 2024, 2:54 a.m. UTC | #3
Hi,

On Thu, Feb 1, 2024 at 5:22 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote:
> > While browsing through ChromeOS crash reports, I found one with an
> > allocation failure that looked like this:
>
> > An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not
> > a surprise that this allocation failed on a system that's been running
> > for a while.
>
> >       if (size > regset->n * regset->size)
> >               size = regset->n * regset->size;
> >       if (!p) {
> > -             to_free = p = kzalloc(size, GFP_KERNEL);
> > +             to_free = p = vmalloc(size);
>
>         What the hell?  Which regset could have lead to that?
> It would need to have the total size of register in excess of
> 256K.  Seriously, which regset is that about?  Note that we
> have just made sure that size is not greater than that product.
> size is unsigned int, so it's not as if a negative value passed
> to function could get through that test only to be interpreted
> as large positive later...
>
>         Details, please.

I can continue to dig more, but it is easy for me to reproduce this.
On the stack is elf_core_dump() and it seems like we're getting a core
dump of the chrome process. So I just arbitrarily look for the chrome
GPU process:

$ ps aux | grep gpu-process
chronos   2075  3.0  1.1 34075552 95372 ?      S<l  18:44   0:01
/opt/google/chrome/chrome --type=gpu-process ...

Then I send it a quit:

$ kill -quit 2075

I added some printouts for this allocation and there are a ton. Here's
all of them, some of which are over 256K:

[   66.677393] DOUG: Allocating 272 bytes
[   66.688994] DOUG: Allocating 272 bytes
[   66.692921] DOUG: Allocating 528 bytes
[   66.696799] DOUG: Allocating 8 bytes
[   66.701058] DOUG: Allocating 272 bytes
[   66.704988] DOUG: Allocating 528 bytes
[   66.708875] DOUG: Allocating 8 bytes
[   66.712929] DOUG: Allocating 272 bytes
[   66.716845] DOUG: Allocating 528 bytes
[   66.720721] DOUG: Allocating 8 bytes
[   66.724752] DOUG: Allocating 272 bytes
[   66.728719] DOUG: Allocating 528 bytes
[   66.732621] DOUG: Allocating 8 bytes
[   66.736615] DOUG: Allocating 272 bytes
[   66.740584] DOUG: Allocating 528 bytes
[   66.744483] DOUG: Allocating 8 bytes
[   66.748507] DOUG: Allocating 272 bytes
[   66.752412] DOUG: Allocating 528 bytes
[   66.756328] DOUG: Allocating 8 bytes
[   66.760382] DOUG: Allocating 272 bytes
[   66.764356] DOUG: Allocating 528 bytes
[   66.768275] DOUG: Allocating 8 bytes
[   66.772236] DOUG: Allocating 272 bytes
[   66.776135] DOUG: Allocating 528 bytes
[   66.780013] DOUG: Allocating 8 bytes
[   66.787244] DOUG: Allocating 272 bytes
[   66.791175] DOUG: Allocating 528 bytes
[   66.795056] DOUG: Allocating 8 bytes
[   66.799101] DOUG: Allocating 272 bytes
[   66.803007] DOUG: Allocating 528 bytes
[   66.806930] DOUG: Allocating 8 bytes
[   66.810775] DOUG: Allocating 272 bytes
[   66.814668] DOUG: Allocating 528 bytes
[   66.818544] DOUG: Allocating 8 bytes
[   66.822409] DOUG: Allocating 272 bytes
[   66.826328] DOUG: Allocating 528 bytes
[   66.830258] DOUG: Allocating 8 bytes
[   66.834331] DOUG: Allocating 272 bytes
[   66.838510] DOUG: Allocating 528 bytes
[   66.842399] DOUG: Allocating 8 bytes
[   66.846301] DOUG: Allocating 272 bytes
[   66.850181] DOUG: Allocating 528 bytes
[   66.854051] DOUG: Allocating 8 bytes
[   66.857864] DOUG: Allocating 272 bytes
[   66.861745] DOUG: Allocating 528 bytes
[   66.865621] DOUG: Allocating 8 bytes
[   66.869495] DOUG: Allocating 272 bytes
[   66.873384] DOUG: Allocating 528 bytes
[   66.877261] DOUG: Allocating 8 bytes
[   66.892077] DOUG: Allocating 528 bytes
[   66.895978] DOUG: Allocating 16 bytes
[   66.899760] DOUG: Allocating 264 bytes
[   66.903624] DOUG: Allocating 264 bytes
[   66.907489] DOUG: Allocating 4 bytes
[   66.911184] DOUG: Allocating 279584 bytes
[   66.915392] DOUG: Allocating 8768 bytes
[   66.919354] DOUG: Allocating 65552 bytes
[   66.923415] DOUG: Allocating 64 bytes
[   66.927190] DOUG: Allocating 16 bytes
[   66.930968] DOUG: Allocating 8 bytes
[   66.934649] DOUG: Allocating 8 bytes
[   66.938332] DOUG: Allocating 528 bytes
[   66.942199] DOUG: Allocating 16 bytes
[   66.945970] DOUG: Allocating 264 bytes
[   66.949832] DOUG: Allocating 264 bytes
[   66.953702] DOUG: Allocating 4 bytes
[   66.957385] DOUG: Allocating 279584 bytes
[   66.961605] DOUG: Allocating 8768 bytes
[   66.965574] DOUG: Allocating 65552 bytes
[   66.969632] DOUG: Allocating 64 bytes
[   66.973405] DOUG: Allocating 16 bytes
[   66.977179] DOUG: Allocating 8 bytes
[   66.980862] DOUG: Allocating 8 bytes
[   66.984553] DOUG: Allocating 528 bytes
[   66.988416] DOUG: Allocating 16 bytes
[   66.992191] DOUG: Allocating 264 bytes
[   66.996046] DOUG: Allocating 264 bytes
[   66.999907] DOUG: Allocating 4 bytes
[   67.003590] DOUG: Allocating 279584 bytes
[   67.007773] DOUG: Allocating 8768 bytes
[   67.011732] DOUG: Allocating 65552 bytes
[   67.015789] DOUG: Allocating 64 bytes
[   67.019576] DOUG: Allocating 16 bytes
[   67.023366] DOUG: Allocating 8 bytes
[   67.027059] DOUG: Allocating 8 bytes
[   67.030753] DOUG: Allocating 528 bytes
[   67.034620] DOUG: Allocating 16 bytes
[   67.038402] DOUG: Allocating 264 bytes
[   67.042266] DOUG: Allocating 264 bytes
[   67.046144] DOUG: Allocating 4 bytes
[   67.049827] DOUG: Allocating 279584 bytes
[   67.054026] DOUG: Allocating 8768 bytes
[   67.057990] DOUG: Allocating 65552 bytes
[   67.062050] DOUG: Allocating 64 bytes
[   67.065826] DOUG: Allocating 16 bytes
[   67.069603] DOUG: Allocating 8 bytes
[   67.073285] DOUG: Allocating 8 bytes
[   67.076977] DOUG: Allocating 528 bytes
[   67.080836] DOUG: Allocating 16 bytes
[   67.084605] DOUG: Allocating 264 bytes
[   67.088461] DOUG: Allocating 264 bytes
[   67.092328] DOUG: Allocating 4 bytes
[   67.096015] DOUG: Allocating 279584 bytes
[   67.100214] DOUG: Allocating 8768 bytes
[   67.104182] DOUG: Allocating 65552 bytes
[   67.108245] DOUG: Allocating 64 bytes
[   67.112028] DOUG: Allocating 16 bytes
[   67.115804] DOUG: Allocating 8 bytes
[   67.119487] DOUG: Allocating 8 bytes
[   67.123168] DOUG: Allocating 528 bytes
[   67.127027] DOUG: Allocating 16 bytes
[   67.130806] DOUG: Allocating 264 bytes
[   67.134662] DOUG: Allocating 264 bytes
[   67.138527] DOUG: Allocating 4 bytes
[   67.142213] DOUG: Allocating 279584 bytes
[   67.146402] DOUG: Allocating 8768 bytes
[   67.150378] DOUG: Allocating 65552 bytes
[   67.154434] DOUG: Allocating 64 bytes
[   67.158209] DOUG: Allocating 16 bytes
[   67.161980] DOUG: Allocating 8 bytes
[   67.165665] DOUG: Allocating 8 bytes
[   67.169355] DOUG: Allocating 528 bytes
[   67.173219] DOUG: Allocating 16 bytes
[   67.176989] DOUG: Allocating 264 bytes
[   67.180847] DOUG: Allocating 264 bytes
[   67.184710] DOUG: Allocating 4 bytes
[   67.188385] DOUG: Allocating 279584 bytes
[   67.192569] DOUG: Allocating 8768 bytes
[   67.196522] DOUG: Allocating 65552 bytes
[   67.200570] DOUG: Allocating 64 bytes
[   67.204340] DOUG: Allocating 16 bytes
[   67.208109] DOUG: Allocating 8 bytes
[   67.211788] DOUG: Allocating 8 bytes
[   67.215468] DOUG: Allocating 528 bytes
[   67.219332] DOUG: Allocating 16 bytes
[   67.223108] DOUG: Allocating 264 bytes
[   67.226968] DOUG: Allocating 264 bytes
[   67.230834] DOUG: Allocating 4 bytes
[   67.234510] DOUG: Allocating 279584 bytes
[   67.238697] DOUG: Allocating 8768 bytes
[   67.242660] DOUG: Allocating 65552 bytes
[   67.246716] DOUG: Allocating 64 bytes
[   67.250487] DOUG: Allocating 16 bytes
[   67.254261] DOUG: Allocating 8 bytes
[   67.257955] DOUG: Allocating 8 bytes
[   67.261640] DOUG: Allocating 528 bytes
[   67.265497] DOUG: Allocating 16 bytes
[   67.269267] DOUG: Allocating 264 bytes
[   67.273131] DOUG: Allocating 264 bytes
[   67.277026] DOUG: Allocating 4 bytes
[   67.280721] DOUG: Allocating 279584 bytes
[   67.284914] DOUG: Allocating 8768 bytes
[   67.288868] DOUG: Allocating 65552 bytes
[   67.292927] DOUG: Allocating 64 bytes
[   67.296699] DOUG: Allocating 16 bytes
[   67.300479] DOUG: Allocating 8 bytes
[   67.304158] DOUG: Allocating 8 bytes
[   67.307848] DOUG: Allocating 528 bytes
[   67.311702] DOUG: Allocating 16 bytes
[   67.315469] DOUG: Allocating 264 bytes
[   67.319331] DOUG: Allocating 264 bytes
[   67.323196] DOUG: Allocating 4 bytes
[   67.326879] DOUG: Allocating 279584 bytes
[   67.331067] DOUG: Allocating 8768 bytes
[   67.335033] DOUG: Allocating 65552 bytes
[   67.339089] DOUG: Allocating 64 bytes
[   67.342866] DOUG: Allocating 16 bytes
[   67.346641] DOUG: Allocating 8 bytes
[   67.350323] DOUG: Allocating 8 bytes
[   67.354005] DOUG: Allocating 528 bytes
[   67.357869] DOUG: Allocating 16 bytes
[   67.361636] DOUG: Allocating 264 bytes
[   67.365492] DOUG: Allocating 264 bytes
[   67.369355] DOUG: Allocating 4 bytes
[   67.373040] DOUG: Allocating 279584 bytes
[   67.377218] DOUG: Allocating 8768 bytes
[   67.381179] DOUG: Allocating 65552 bytes
[   67.385228] DOUG: Allocating 64 bytes
[   67.389005] DOUG: Allocating 16 bytes
[   67.392784] DOUG: Allocating 8 bytes
[   67.396461] DOUG: Allocating 8 bytes
[   67.400150] DOUG: Allocating 528 bytes
[   67.404011] DOUG: Allocating 16 bytes
[   67.407792] DOUG: Allocating 264 bytes
[   67.411649] DOUG: Allocating 264 bytes
[   67.415506] DOUG: Allocating 4 bytes
[   67.419184] DOUG: Allocating 279584 bytes
[   67.423364] DOUG: Allocating 8768 bytes
[   67.427320] DOUG: Allocating 65552 bytes
[   67.431367] DOUG: Allocating 64 bytes
[   67.435146] DOUG: Allocating 16 bytes
[   67.438923] DOUG: Allocating 8 bytes
[   67.442602] DOUG: Allocating 8 bytes
[   67.446286] DOUG: Allocating 528 bytes
[   67.450143] DOUG: Allocating 16 bytes
[   67.453913] DOUG: Allocating 264 bytes
[   67.457775] DOUG: Allocating 264 bytes
[   67.461637] DOUG: Allocating 4 bytes
[   67.465323] DOUG: Allocating 279584 bytes
[   67.469501] DOUG: Allocating 8768 bytes
[   67.473463] DOUG: Allocating 65552 bytes
[   67.477511] DOUG: Allocating 64 bytes
[   67.481283] DOUG: Allocating 16 bytes
[   67.485056] DOUG: Allocating 8 bytes
[   67.488735] DOUG: Allocating 8 bytes
[   67.492428] DOUG: Allocating 528 bytes
[   67.496298] DOUG: Allocating 16 bytes
[   67.500072] DOUG: Allocating 264 bytes
[   67.503932] DOUG: Allocating 264 bytes
[   67.507803] DOUG: Allocating 4 bytes
[   67.511484] DOUG: Allocating 279584 bytes
[   67.515667] DOUG: Allocating 8768 bytes
[   67.519624] DOUG: Allocating 65552 bytes
[   67.523679] DOUG: Allocating 64 bytes
[   67.527447] DOUG: Allocating 16 bytes
[   67.531222] DOUG: Allocating 8 bytes
[   67.534907] DOUG: Allocating 8 bytes
[   67.538593] DOUG: Allocating 528 bytes
[   67.542458] DOUG: Allocating 16 bytes
[   67.546225] DOUG: Allocating 264 bytes
[   67.550090] DOUG: Allocating 264 bytes
[   67.553956] DOUG: Allocating 4 bytes
[   67.557634] DOUG: Allocating 279584 bytes
[   67.561818] DOUG: Allocating 8768 bytes
[   67.565775] DOUG: Allocating 65552 bytes
[   67.569823] DOUG: Allocating 64 bytes
[   67.573602] DOUG: Allocating 16 bytes
[   67.577380] DOUG: Allocating 8 bytes
[   67.581060] DOUG: Allocating 8 bytes
[   67.584748] DOUG: Allocating 528 bytes
[   67.588607] DOUG: Allocating 16 bytes
[   67.592384] DOUG: Allocating 264 bytes
[   67.596240] DOUG: Allocating 264 bytes
[   67.600105] DOUG: Allocating 4 bytes
[   67.603786] DOUG: Allocating 279584 bytes
[   67.607968] DOUG: Allocating 8768 bytes
[   67.611927] DOUG: Allocating 65552 bytes
[   67.615979] DOUG: Allocating 64 bytes
[   67.619757] DOUG: Allocating 16 bytes
[   67.623529] DOUG: Allocating 8 bytes
[   67.627216] DOUG: Allocating 8 bytes

The above printouts were taken on a sc7180-trogdor-lazor device
running mainline (roughly "Linux localhost 6.8.0-rc2") booted w/
ChromeOS userspace.

If you need me to dig more into how coredumps work then I can see if I
can track down exactly what part of the coredump is causing it to need
the big allocation. "chrome" is a bit of a beast of an application,
though. I'd also note that chrome makes extensive use of address space
randomization which uses up huge amounts of virtual address space, so
a shot in the dark is that maybe that has something to do with it?
Looking at the virtual address space of Chrome in "top" shows stuff
like:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+
COMMAND
 2012 chronos   12  -8   32.7g 230520 160504 S   1.0   2.9   0:12.49
chrome
 6044 chronos   12  -8   32.5g  95204  61888 S   1.0   1.2   0:05.90
chrome
 2191 chronos   12  -8  107.0g  72200  51264 S   0.0   0.9   0:00.08
chrome

-Doug
  
Doug Anderson Feb. 2, 2024, 2:58 a.m. UTC | #4
Hi,

On Thu, Feb 1, 2024 at 5:24 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Feb 01, 2024 at 05:12:03PM -0800, Douglas Anderson wrote:
> > While browsing through ChromeOS crash reports, I found one with an
> > allocation failure that looked like this:
> >
> >   chrome: page allocation failure: order:7,
> >           mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO),
> >         nodemask=(null),cpuset=urgent,mems_allowed=0
>
> That does seem bad ...
>
> > @@ -16,14 +17,14 @@ static int __regset_get(struct task_struct *target,
> >       if (size > regset->n * regset->size)
> >               size = regset->n * regset->size;
> >       if (!p) {
> > -             to_free = p = kzalloc(size, GFP_KERNEL);
> > +             to_free = p = vmalloc(size);
>
> It's my impression that sometimes this size might be relatively small?
> Perhaps we should make this kvmalloc so that we can satisfy it from the
> slab allocator if it is small?

Right. Sometimes it's small. It feels sad to me that somehow vmalloc()
of small sizes would be much less efficient than kvmalloc() of small
sizes, but I can change it to that if you want. It feels like we
should use kmalloc() if we need it to be contiguous, kvmalloc() if we
know that there will be big efficiency gains with things being
contiguous but we can get by with non-contiguous, and vmalloc() if we
just don't care. ;-)

..anyway, I'll spin v2 with kvmalloc().


> Also, I assume that we don't rely on this memory being physically
> contiguous; we don't, for example, do I/O on it?

As far as I can tell we don't. I had never looked at or thought about
this code before today and so all I have is ~an hour of code analysis
behind me, so if someone tells me I'm wrong then I'll believe them.

-Doug
  
Al Viro Feb. 2, 2024, 3:04 a.m. UTC | #5
On Thu, Feb 01, 2024 at 06:54:51PM -0800, Doug Anderson wrote:
> >         What the hell?  Which regset could have lead to that?
> > It would need to have the total size of register in excess of
> > 256K.  Seriously, which regset is that about?  Note that we
> > have just made sure that size is not greater than that product.
> > size is unsigned int, so it's not as if a negative value passed
> > to function could get through that test only to be interpreted
> > as large positive later...
> >
> >         Details, please.
> 
> I can continue to dig more, but it is easy for me to reproduce this.
> On the stack is elf_core_dump() and it seems like we're getting a core
> dump of the chrome process. So I just arbitrarily look for the chrome
> GPU process:
> 
> $ ps aux | grep gpu-process
> chronos   2075  3.0  1.1 34075552 95372 ?      S<l  18:44   0:01
> /opt/google/chrome/chrome --type=gpu-process ...
> 
> Then I send it a quit:
> 
> $ kill -quit 2075
> 
> I added some printouts for this allocation and there are a ton. Here's
> all of them, some of which are over 256K:

Well, the next step would be to see which regset it is - if you
see that kind of allocation, print regset->n, regset->size and
regset->core_note_type.
  
Doug Anderson Feb. 2, 2024, 3:15 a.m. UTC | #6
Hi,

On Thu, Feb 1, 2024 at 7:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Feb 01, 2024 at 06:54:51PM -0800, Doug Anderson wrote:
> > >         What the hell?  Which regset could have lead to that?
> > > It would need to have the total size of register in excess of
> > > 256K.  Seriously, which regset is that about?  Note that we
> > > have just made sure that size is not greater than that product.
> > > size is unsigned int, so it's not as if a negative value passed
> > > to function could get through that test only to be interpreted
> > > as large positive later...
> > >
> > >         Details, please.
> >
> > I can continue to dig more, but it is easy for me to reproduce this.
> > On the stack is elf_core_dump() and it seems like we're getting a core
> > dump of the chrome process. So I just arbitrarily look for the chrome
> > GPU process:
> >
> > $ ps aux | grep gpu-process
> > chronos   2075  3.0  1.1 34075552 95372 ?      S<l  18:44   0:01
> > /opt/google/chrome/chrome --type=gpu-process ...
> >
> > Then I send it a quit:
> >
> > $ kill -quit 2075
> >
> > I added some printouts for this allocation and there are a ton. Here's
> > all of them, some of which are over 256K:
>
> Well, the next step would be to see which regset it is - if you
> see that kind of allocation, print regset->n, regset->size and
> regset->core_note_type.

Of course! Here are the big ones:

[   45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16,
core_note_type=1029
[   45.884809] DOUG: Allocating 8768 bytes, n=548, size=16, core_note_type=1035
[   45.893958] DOUG: Allocating 65552 bytes, n=4097, size=16,
core_note_type=1036

-Doug
  
Al Viro Feb. 2, 2024, 3:49 a.m. UTC | #7
On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote:
> >
> > Well, the next step would be to see which regset it is - if you
> > see that kind of allocation, print regset->n, regset->size and
> > regset->core_note_type.
> 
> Of course! Here are the big ones:
> 
> [   45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16,
> core_note_type=1029

0x405, NT_ARM_SVE
        [REGSET_SVE] = { /* Scalable Vector Extension */
                .core_note_type = NT_ARM_SVE,
                .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE),
                                  SVE_VQ_BYTES),
                .size = SVE_VQ_BYTES,

IDGI.  Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max?  Any ARM folks around?
Sure, I understand that it's variable-sized and we want to allocate enough
for the worst case, but can we really get about 280Kb there?  Context switches
would be really unpleasant on such boxen...

> [   45.884809] DOUG: Allocating 8768 bytes, n=548, size=16, core_note_type=1035
> [   45.893958] DOUG: Allocating 65552 bytes, n=4097, size=16,
> core_note_type=1036

0x40c, NT_ARM_ZA.
                /*
                 * ZA is a single register but it's variably sized and
                 * the ptrace core requires that the size of any data
                 * be an exact multiple of the configured register
                 * size so report as though we had SVE_VQ_BYTES
                 * registers. These values aren't exposed to
                 * userspace.
                 */
                .n = DIV_ROUND_UP(ZA_PT_SIZE(SME_VQ_MAX), SVE_VQ_BYTES),
                .size = SVE_VQ_BYTES,
  
Al Viro Feb. 2, 2024, 4:05 a.m. UTC | #8
On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote:
> On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote:
> > >
> > > Well, the next step would be to see which regset it is - if you
> > > see that kind of allocation, print regset->n, regset->size and
> > > regset->core_note_type.
> > 
> > Of course! Here are the big ones:
> > 
> > [   45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16,
> > core_note_type=1029
> 
> 0x405, NT_ARM_SVE
>         [REGSET_SVE] = { /* Scalable Vector Extension */
>                 .core_note_type = NT_ARM_SVE,
>                 .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE),
>                                   SVE_VQ_BYTES),
>                 .size = SVE_VQ_BYTES,
> 
> IDGI.  Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max?  Any ARM folks around?
> Sure, I understand that it's variable-sized and we want to allocate enough
> for the worst case, but can we really get about 280Kb there?  Context switches
> would be really unpleasant on such boxen...

FWIW, this apparently intends to be "variable, up to SVE_PT_SIZE(...) bytes";
no idea if SVE_PT_SIZE is the right thing to use here.
  
Doug Anderson Feb. 2, 2024, 4:24 p.m. UTC | #9
Hi,

On Thu, Feb 1, 2024 at 8:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote:
> > On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote:
> > > >
> > > > Well, the next step would be to see which regset it is - if you
> > > > see that kind of allocation, print regset->n, regset->size and
> > > > regset->core_note_type.
> > >
> > > Of course! Here are the big ones:
> > >
> > > [   45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16,
> > > core_note_type=1029
> >
> > 0x405, NT_ARM_SVE
> >         [REGSET_SVE] = { /* Scalable Vector Extension */
> >                 .core_note_type = NT_ARM_SVE,
> >                 .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE),
> >                                   SVE_VQ_BYTES),
> >                 .size = SVE_VQ_BYTES,
> >
> > IDGI.  Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max?  Any ARM folks around?
> > Sure, I understand that it's variable-sized and we want to allocate enough
> > for the worst case, but can we really get about 280Kb there?  Context switches
> > would be really unpleasant on such boxen...
>
> FWIW, this apparently intends to be "variable, up to SVE_PT_SIZE(...) bytes";
> no idea if SVE_PT_SIZE is the right thing to use here.

+folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c`

Trying to follow the macros to see where "n" comes from is a maze of
twisty little passages, all alike. Hopefully someone from the ARM
world can help tell if the value of 17474 for n here is correct or if
something is wonky.
  
Al Viro Feb. 2, 2024, 4:49 p.m. UTC | #10
On Fri, Feb 02, 2024 at 08:24:17AM -0800, Doug Anderson wrote:
> Hi,
> 
> On Thu, Feb 1, 2024 at 8:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote:
> > > On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote:
> > > > >
> > > > > Well, the next step would be to see which regset it is - if you
> > > > > see that kind of allocation, print regset->n, regset->size and
> > > > > regset->core_note_type.
> > > >
> > > > Of course! Here are the big ones:
> > > >
> > > > [   45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16,
> > > > core_note_type=1029
> > >
> > > 0x405, NT_ARM_SVE
> > >         [REGSET_SVE] = { /* Scalable Vector Extension */
> > >                 .core_note_type = NT_ARM_SVE,
> > >                 .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE),
> > >                                   SVE_VQ_BYTES),
> > >                 .size = SVE_VQ_BYTES,
> > >
> > > IDGI.  Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max?  Any ARM folks around?
> > > Sure, I understand that it's variable-sized and we want to allocate enough
> > > for the worst case, but can we really get about 280Kb there?  Context switches
> > > would be really unpleasant on such boxen...
> >
> > FWIW, this apparently intends to be "variable, up to SVE_PT_SIZE(...) bytes";
> > no idea if SVE_PT_SIZE is the right thing to use here.
> 
> +folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c`
> 
> Trying to follow the macros to see where "n" comes from is a maze of
> twisty little passages, all alike. Hopefully someone from the ARM
> world can help tell if the value of 17474 for n here is correct or if
> something is wonky.

It might be interesting to have it print the return value of __regset_get()
in those cases; if *that* is huge, we really have a problem.  If it ends up
small enough to fit into few pages, OTOH...

SVE_VQ_MAX is defined as 255; is that really in units of 128 bits?  IOW,
do we really expect to support 32Kbit registers?  That would drive the
size into that range, all right, but it would really suck on context
switches.

I could be misreading it, though - the macros in there are not easy to
follow and I've never dealt with SVE before, so take the above with
a cartload of salt.
  
Al Viro Feb. 2, 2024, 4:55 p.m. UTC | #11
On Fri, Feb 02, 2024 at 04:49:47PM +0000, Al Viro wrote:
> > +folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c`
> > 
> > Trying to follow the macros to see where "n" comes from is a maze of
> > twisty little passages, all alike. Hopefully someone from the ARM
> > world can help tell if the value of 17474 for n here is correct or if
> > something is wonky.
> 
> It might be interesting to have it print the return value of __regset_get()
> in those cases; if *that* is huge, we really have a problem.  If it ends up
> small enough to fit into few pages, OTOH...
> 
> SVE_VQ_MAX is defined as 255; is that really in units of 128 bits?  IOW,
> do we really expect to support 32Kbit registers?  That would drive the
> size into that range, all right, but it would really suck on context
> switches.
> 
> I could be misreading it, though - the macros in there are not easy to
> follow and I've never dealt with SVE before, so take the above with
> a cartload of salt.

Worse - it's SVE_VQ_MAX is 512; sorry about the confusion.  OK, that would
certainly explain the size (header + 32 registers, each up to 512 * 16 bytes),
but... ouch.
  
Mark Brown Feb. 2, 2024, 5:48 p.m. UTC | #12
On Fri, Feb 02, 2024 at 03:49:25AM +0000, Al Viro wrote:
> On Thu, Feb 01, 2024 at 07:15:48PM -0800, Doug Anderson wrote:

> > [   45.875574] DOUG: Allocating 279584 bytes, n=17474, size=16,
> > core_note_type=1029

> 0x405, NT_ARM_SVE
>         [REGSET_SVE] = { /* Scalable Vector Extension */
>                 .core_note_type = NT_ARM_SVE,
>                 .n = DIV_ROUND_UP(SVE_PT_SIZE(SVE_VQ_MAX, SVE_PT_REGS_SVE),
>                                   SVE_VQ_BYTES),
>                 .size = SVE_VQ_BYTES,

> IDGI.  Wasn't SVE up to 32 * 2Kbit, i.e. 8Kbyte max?  Any ARM folks around?
> Sure, I understand that it's variable-sized and we want to allocate enough
> for the worst case, but can we really get about 280Kb there?  Context switches
> would be really unpleasant on such boxen...

The architecture itself is limited to 2048 bit vector lengths, and
practical implementations have thus far not exceeded 512 bits with the
overwhelming majority of systems being 128 bit.  2048 is commonly seen
in emulation though.  As well as the 32 Z registers we have 16 P
registers of VQ*2 bytes plus one more register FFR the same size as the
P registers and a header describing the VL and specific format of the
data, all in this regset.

The Linux ABI defines the maximum vector length much larger than the
architecture allows and that define does flow into the kernel code, I
believe this was based on consideration of bits 8:4 of ZCR_ELx[1] which
look like they're earmarked for potential future expansion should 2048
bits ever prove to be insufficient.  We should really do something like
what we did for SME and define down what ptrace uses to the actual
architectural maximum since no system will ever see any more than that,
that'd still result in large allocations but less impressively and
wastefully so.  I'll go and look at doing a patch for that just now.

Unfortunately SVE_VQ_MAX is in the uapi headers, we've already stopped
using it in the test programs due to the overallocation.

[1] https://developer.arm.com/documentation/ddi0601/2023-12/AArch64-Registers/ZCR-EL1--SVE-Control-Register--EL1-?lang=en

> > [   45.884809] DOUG: Allocating 8768 bytes, n=548, size=16, core_note_type=1035
> > [   45.893958] DOUG: Allocating 65552 bytes, n=4097, size=16,
> > core_note_type=1036

> 0x40c, NT_ARM_ZA.
>                 /*
>                  * ZA is a single register but it's variably sized and
>                  * the ptrace core requires that the size of any data
>                  * be an exact multiple of the configured register
>                  * size so report as though we had SVE_VQ_BYTES
>                  * registers. These values aren't exposed to
>                  * userspace.
>                  */
>                 .n = DIV_ROUND_UP(ZA_PT_SIZE(SME_VQ_MAX), SVE_VQ_BYTES),
>                 .size = SVE_VQ_BYTES,

Yup, and SME_VQ_MAX is defined to the actual architectural maximum of
2048 largely due to issues with the size of the allocation for ptrace.
There are not yet any physical implementations of SME so I can't comment
on the actual vector lengths we'll observe in the immediate future.  

I see there's a comment update needed there for s/SVE/SME/ too.
  
Dave Martin Feb. 2, 2024, 6:07 p.m. UTC | #13
On Fri, Feb 02, 2024 at 04:55:24PM +0000, Al Viro wrote:
> On Fri, Feb 02, 2024 at 04:49:47PM +0000, Al Viro wrote:
> > > +folks from `./scripts/get_maintainer.pl -f arch/arm64/kernel/ptrace.c`
> > > 
> > > Trying to follow the macros to see where "n" comes from is a maze of
> > > twisty little passages, all alike. Hopefully someone from the ARM
> > > world can help tell if the value of 17474 for n here is correct or if
> > > something is wonky.

Nope, that's the "correct" answer...

> > 
> > It might be interesting to have it print the return value of __regset_get()
> > in those cases; if *that* is huge, we really have a problem.  If it ends up
> > small enough to fit into few pages, OTOH...
> > 
> > SVE_VQ_MAX is defined as 255; is that really in units of 128 bits?  IOW,
> > do we really expect to support 32Kbit registers?  That would drive the
> > size into that range, all right, but it would really suck on context
> > switches.
> > 
> > I could be misreading it, though - the macros in there are not easy to
> > follow and I've never dealt with SVE before, so take the above with
> > a cartload of salt.
> 
> Worse - it's SVE_VQ_MAX is 512; sorry about the confusion.  OK, that would
> certainly explain the size (header + 32 registers, each up to 512 * 16 bytes),
> but... ouch.

Mark Brown [+ Cc] has been taking care of SVE in my absence, but
from memory:

The SVE architecture has a really big maximum vector size (16 * 128 =
2048 bits), and there is a theoretical possibility of it getting bigger
in the future, though unlikely.

Real platforms to date have a much smaller limit, though Qemu can go up
to 2048 bits IIUC.

My aim when working on the ABI was to future-proof it against
foreseeable expansion on the architecture side, but this does mean that
we cannot statically determine a sane limit for the vector size.


I suppose we could have had a more sane limit built into the kernel or a
Kconfig option for it, but it seemed simpler just to determine the size
dynamically depending on the task's current state.  This is not so
important for coredumps, but for the the gdbstub wire protocol etc. it
seemed undesirable to have the regset larger than needed.

Hence the reason for adding ->get_size() in
27e64b4be4b8 ("regset: Add support for dynamically sized regsets").

What I guess was not so obvious from the commit message is the
expected relationship between the actual and maximum possible size
of the regset: for SVE the actual size is in practice going to be *much*
smaller than the max, while the max is crazy large because of being an
ABI design limit chosen for futureproofing purposes.



So, if the only reason for trying to migrate to vmalloc() is to cope
with an insanely sized regset on arm64, I think somehow or other we can
avoid that.

Options:

 a) bring back ->get_size() so that we can allocate the correct size
before generating the regset data;

 b) make aarch64_regsets[] __ro_after_init and set
aarch64_regsets[REGSET_SVE].n based on the boot-time probed maximum size
(which will be sane); or

 c) allow membufs to grow if needed (sounds fragile though, and may be
hard to justify just for one arch?).


Thoughts?

If people don't want to bring back get_size(), then (b) doesn't look
too bad.

Cheers
---Dave
  
Doug Anderson Feb. 2, 2024, 7:13 p.m. UTC | #14
Hi,

On Fri, Feb 2, 2024 at 10:08 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> So, if the only reason for trying to migrate to vmalloc() is to cope
> with an insanely sized regset on arm64, I think somehow or other we can
> avoid that.

Right. The only reason for the patch to switch to vmalloc() was in
reaction to seeing the order 7 memory allocation. If we can decrease
that to something sensible then I'm happy enough keeping the
allocation as kmalloc().


> Options:
>
>  a) bring back ->get_size() so that we can allocate the correct size
> before generating the regset data;
>
>  b) make aarch64_regsets[] __ro_after_init and set
> aarch64_regsets[REGSET_SVE].n based on the boot-time probed maximum size
> (which will be sane); or
>
>  c) allow membufs to grow if needed (sounds fragile though, and may be
> hard to justify just for one arch?).
>
>
> Thoughts?
>
> If people don't want to bring back get_size(), then (b) doesn't look
> too bad.

Either a) or b) sounds fine to me, but I'm just a visitor to this code
so maybe I'll let the adults in the room chime in with their opinions.
;-) Also: if you think it's fruitful for me to try to write a patch to
do either of those then I can, but I also wouldn't object at all to
someone else writing a patch to fix this and I can just provide a
Tested-by and/or Reviewed-by. Let me know.

-Doug
  
Mark Brown Feb. 2, 2024, 7:42 p.m. UTC | #15
On Fri, Feb 02, 2024 at 06:07:54PM +0000, Dave Martin wrote:

> So, if the only reason for trying to migrate to vmalloc() is to cope
> with an insanely sized regset on arm64, I think somehow or other we can
> avoid that.

With SME we do routinely see the full glory of the 64K regset for ZA in
emulated systems so I think we have to treat it as an issue.

> Options:

>  a) bring back ->get_size() so that we can allocate the correct size
> before generating the regset data;

>  b) make aarch64_regsets[] __ro_after_init and set
> aarch64_regsets[REGSET_SVE].n based on the boot-time probed maximum size
> (which will be sane); or

Either of those seems sensible to me, a function would minimise the size
of allocations based on the process configuration which would be nice
and given that we're doing allocations it's probably reasonable
overhead.

>  c) allow membufs to grow if needed (sounds fragile though, and may be
> hard to justify just for one arch?).

I'm having a hard time getting enthusiastic about that one for the
reasons you mention.

We can also just lower the maximum size we tell the ptrace core to the
actual architectural maximum since AFAICT we don't expose that anywhere
external, I've got a patch in CI for that.  We'd still be allocating
more memory than we need for practical systems but less extravagantly
so.  It seems more suitable for an immediate fix for people to pick up
for production.

It did occur to me at some point in the past that we should avoid
telling the core about regsets that aren't physically supported on the
current system, I didn't get round to looking at that yet.
  
Doug Anderson Feb. 2, 2024, 8:38 p.m. UTC | #16
Hi,

On Fri, Feb 2, 2024 at 11:43 AM Mark Brown <broonie@kernel.org> wrote:
>
> On Fri, Feb 02, 2024 at 06:07:54PM +0000, Dave Martin wrote:
>
> > So, if the only reason for trying to migrate to vmalloc() is to cope
> > with an insanely sized regset on arm64, I think somehow or other we can
> > avoid that.
>
> With SME we do routinely see the full glory of the 64K regset for ZA in
> emulated systems so I think we have to treat it as an issue.

Ah, got it. 64K is much less likely to be as big of a problem (only an
order 4 allocation), but you're right that it's still a size where
kvmalloc() would be an improvement. With that in mind I'll plan to
send out a v2 of my patch where I use kvmalloc() instead of vmalloc()
and update the commit description a bit, including a link to this
thread. Then I will assume that others on this thread will move
forward with actually making the allocations smaller.

Please yell if the above sounds wrong. :-)

-Doug
  

Patch

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 5397b552fbeb..ac178ad38823 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1928,7 +1928,7 @@  static void free_note_info(struct elf_note_info *info)
 		threads = t->next;
 		WARN_ON(t->notes[0].data && t->notes[0].data != &t->prstatus);
 		for (i = 1; i < info->thread_notes; ++i)
-			kfree(t->notes[i].data);
+			kvfree(t->notes[i].data);
 		kfree(t);
 	}
 	kfree(info->psinfo.data);
diff --git a/kernel/regset.c b/kernel/regset.c
index 586823786f39..ed8d8cf3a22c 100644
--- a/kernel/regset.c
+++ b/kernel/regset.c
@@ -2,6 +2,7 @@ 
 #include <linux/export.h>
 #include <linux/slab.h>
 #include <linux/regset.h>
+#include <linux/vmalloc.h>
 
 static int __regset_get(struct task_struct *target,
 			const struct user_regset *regset,
@@ -16,14 +17,14 @@  static int __regset_get(struct task_struct *target,
 	if (size > regset->n * regset->size)
 		size = regset->n * regset->size;
 	if (!p) {
-		to_free = p = kzalloc(size, GFP_KERNEL);
+		to_free = p = vmalloc(size);
 		if (!p)
 			return -ENOMEM;
 	}
 	res = regset->regset_get(target, regset,
 			   (struct membuf){.p = p, .left = size});
 	if (res < 0) {
-		kfree(to_free);
+		vfree(to_free);
 		return res;
 	}
 	*data = p;
@@ -71,6 +72,6 @@  int copy_regset_to_user(struct task_struct *target,
 	ret = regset_get_alloc(target, regset, size, &buf);
 	if (ret > 0)
 		ret = copy_to_user(data, buf, ret) ? -EFAULT : 0;
-	kfree(buf);
+	vfree(buf);
 	return ret;
 }