[v2,3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

Message ID 20230713125706.2884502-4-glider@google.com
State New
Series Implement MTE tag compression for swapped pages

Commit Message

Alexander Potapenko July 13, 2023, 12:57 p.m. UTC
  The config implements the EA0 algorithm suggested by Evgenii Stepanov
to compress the memory tags for ARM MTE during swapping.

The algorithm is based on RLE and specifically targets 128-byte buffers
of tags corresponding to a single page. In the common case a buffer
can be compressed into 63 bits, making it possible to store it without
additional memory allocation.
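
(In the inline case the 64-bit handle packs a 4-bit largest_idx, six 4-bit
tags and five 7-bit range sizes: 4 + 6 * 4 + 5 * 7 = 63 bits, which leaves
bit 63 clear so that the handle fits into an Xarray value.)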

Suggested-by: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v2:
  - as suggested by Yury Norov, switched from struct bitq (which is
    not needed anymore) to <linux/bitmap.h>
  - add missing symbol exports
---
 arch/arm64/Kconfig               |  10 +
 arch/arm64/include/asm/mtecomp.h |  60 +++++
 arch/arm64/mm/Makefile           |   1 +
 arch/arm64/mm/mtecomp.c          | 412 +++++++++++++++++++++++++++++++
 4 files changed, 483 insertions(+)
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c
  

Comments

Alexander Potapenko July 13, 2023, 4:37 p.m. UTC | #1
> +/* Compress the data inline. */
> +static u64 ea0_compress_inline(int len, u8 *tags, short *sizes)
> +{
> +       u64 result;
> +
> +       ea0_compress_to_buf(len, tags, sizes, (u8 *)&result, sizeof(result));
> +       result = be64_to_cpu(result);
This be64_to_cpu() is leftover from the previous compressor
implementation which treated u64 as u8[8] (i.e. big-endian).
The new implementation works with the native u64 representation, so
the conversion is not needed anymore (and actually produces invalid
handles that can't be stored in Xarray).
I will drop this call in v3.
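
For illustration (hypothetical handle value), be64_to_cpu() is a byte swap
on a little-endian host, so:

        u64 handle = 0x00000000000000f1;  /* inline handle, bit 63 clear */

        handle = be64_to_cpu(handle);     /* swab64() on little-endian */
        /* handle == 0xf100000000000000: bit 63 is now set, so the value
         * no longer fits into an Xarray entry. */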


> +bool ea0_decompress(u64 handle, u8 *tags)
> +{
> +       u8 *storage = ea0_storage(handle);
> +       int size = ea0_storage_size(handle);
> +
> +       if (size == 128) {
> +               memcpy(tags, storage, size);
> +               return true;
> +       }
> +       if (size == 8) {
> +               handle = cpu_to_be64(handle);
Ditto.
  
Andy Shevchenko July 13, 2023, 5:23 p.m. UTC | #2
On Thu, Jul 13, 2023 at 02:57:03PM +0200, Alexander Potapenko wrote:
> The config implements the EA0 algorithm suggested by Evgenii Stepanov
> to compress the memory tags for ARM MTE during swapping.
> 
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.

...

> +/*
> + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> + * compression algorithms.
> + *
> + * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
> + * array of tags into a smaller byte sequence that can be stored in a
> + * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
> + * an 8-byte pointer.
> + *
> + * We encapsulate tag storage memory management in this module, because it is
> + * tightly coupled with the pointer representation.
> + *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
> + *     that can be stored in Xarray
> + *   ea0_decompress(handle, *tags) takes the opaque value and loads the tags into
> + *     the provided 128-byte buffer.

> + *
> + *
> + *

Isn't it a bit too many blank lines?

> + * The compression algorithm works as follows.
> + *
> + * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
> + *    @r_tags containing tag values and @r_sizes containing range lengths) by
> + *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
> + *
> + * 2. Depending on the number N of ranges, the following storage class is picked:
> + *            N <= 6:  8 bytes (inline case, no allocation required);
> + *       6 < N <= 11: 16 bytes
> + *      11 < N <= 23: 32 bytes
> + *      23 < N <= 46: 64 bytes
> + *      46 < N:       128 bytes (no compression will be performed)
> + *
> + * 3. The index of the largest element of @r_sizes is stored in @largest_idx.
> + *    The element itself is thrown away from @r_sizes, because it can be
> + *    reconstructed from the sum of the remaining elements. Note that now none
> + *    of the remaining @r_sizes elements is greater than 127.
> + *
> + * 4. For the inline case, the following values are stored in the 8-byte handle:
> + *       largest_idx : i4
> + *      r_tags[0..5] : i4 x 6
> + *     r_sizes[0..4] : i7 x 5
> + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> + *
> + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> + *    distinguished from a kernel pointer.
> + *
> + * 5. For the out-of-line case, the storage is allocated from one of the
> + *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> + *    on 8 bytes, so its bits 2..0 can be used to store the size class:
> + *     - 0 for 128 bytes
> + *     - 1 for 16
> + *     - 2 for 32
> + *     - 3 for 64.
> + *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
> + *
> + * 6. The data layout in the allocated storage is as follows:
> + *         largest_idx : i6
> + *        r_tags[0..N] : i4 x N
> + *     r_sizes[0..N-1] : i7 x (N-1)
> + *
> + *
> + *

Ditto.

> + * The decompression algorithm performs the steps below.
> + *
> + * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
> + *    out-of-line.
> + *
> + * 2. For the inline case, treat the handle itself as the input buffer.
> + *
> + * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
> + *    the input buffer length. To obtain the pointer to the input buffer, unset
> + *    bits 2..0 of the handle and set bit 63.
> + *
> + * 4. If the input buffer is 128 bytes long, copy its contents to the output
> + *    buffer.
> + *
> + * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
> + *    Calculate the removed largest element of @r_sizes:
> + *      largest = 256 - sum(r_sizes)
> + *    and insert it into @r_sizes at position @largest_idx.
> + *
> + * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
> + *    @r_sizes[i] times.
> + */

...

> +#include <linux/bitmap.h>
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <asm/mtecomp.h>
> +#include <linux/slab.h>
> +#include <linux/swab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

Please, keep linux/* and asm/* separated like

linux/*
...blank line...
asm/*

...

> +#define HANDLE_MASK ~(BIT_ULL(63))

GENMASK_ULL(62, 0)

...

> +/* Out-of-line handles have 0b111 in bits 62..60. */
> +#define NOINLINE_MASK (BIT_ULL(62) | BIT_ULL(61) | BIT_ULL(60))

GENMASK_ULL()?

...

> +/* Cache index is stored in the lowest pointer bits. */
> +#define CACHE_ID_MASK (BIT_ULL(2) | BIT_ULL(1) | BIT_ULL(0))

Ditto.

...

> +/* Translate allocation size into mtecomp_caches[] index. */
> +static int ea0_size_to_cache_id(int len)
> +{
> +	switch (len) {
> +	case 16:
> +		return 1;
> +	case 32:
> +		return 2;
> +	case 64:
> +		return 3;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +/* Translate mtecomp_caches[] index into allocation size. */
> +static int ea0_cache_id_to_size(int id)
> +{
> +	switch (id) {
> +	case 1:
> +		return 16;
> +	case 2:
> +		return 32;
> +	case 3:
> +		return 64;
> +	default:
> +		return 128;
> +	}
> +}

Not sure why fls() / BIT() can't be used directly instead of these functions,
but okay, they are not too ugly.

...

> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> +{
> +	u8 prev_tag = 0xff;

GENMASK()? U8_MAX? ((u8)-1)? What is this?

> +	int cur_idx = -1;
> +	u8 cur_tag;
> +	int i;
> +
> +	memset(out_tags, 0, *out_len * sizeof(*out_tags));
> +	memset(out_sizes, 0, *out_len * sizeof(*out_sizes));

array_size() ?
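
I.e., something like:

        memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
        memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));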

> +	for (i = 0; i < MTE_GRANULES_PER_PAGE; i++) {
> +		cur_tag = tags[i / 2];
> +		if (i % 2)
> +			cur_tag = cur_tag % 16;
> +		else
> +			cur_tag = cur_tag / 16;
> +		if (cur_tag == prev_tag) {
> +			out_sizes[cur_idx]++;
> +		} else {
> +			cur_idx++;
> +			prev_tag = cur_tag;
> +			out_tags[cur_idx] = prev_tag;
> +			out_sizes[cur_idx] = 1;
> +		}

Perhaps instead of doing those % 2, / 2 we simply can go twice less outer loops
and introduce an inner loop of 2 iterations?
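
Something like this sketch, keeping the rest of the body as is:

        for (i = 0; i < MTE_GRANULES_PER_PAGE / 2; i++) {
                for (j = 0; j < 2; j++) {
                        cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
                        ...
                }
        }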

> +	}
> +	*out_len = cur_idx + 1;
> +}

> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> +{
> +	int i, j, pos = 0;
> +	u8 prev;
> +
> +	for (i = 0; i < r_len; i++) {
> +		for (j = 0; j < r_sizes[i]; j++) {

> +			if (pos % 2 == 0)

Would be better to keep this aligned with above?

			if (pos % 2)
				...
			else
				...

> +				prev = r_tags[i];
> +			else
> +				tags[pos / 2] = (prev << 4) | r_tags[i];
> +			pos++;
> +		}
> +	}
> +}

...

> +EXPORT_SYMBOL(ea0_storage_size);

Btw, can we go to the namespaced export from day 1?

...

> +	for (i = 0; i < len; i++) {
> +		if (i == len)
> +			break;

Interesting dead code. What was the idea behind this?

> +		if (sizes[i] > largest) {
> +			largest = sizes[i];
> +			largest_idx = i;
> +		}

(alas max_array() can't be used here)

...

> +		bitmap_set_value_unaligned((unsigned long *)buf, largest_idx,
> +					   bit_pos, 4);

> +		bitmap_set_value_unaligned((unsigned long *)buf, largest_idx,
> +					   bit_pos, 6);

> +		bitmap_set_value_unaligned((unsigned long *)buf, tags[i],
> +					   bit_pos, 4);

> +		bitmap_set_value_unaligned((unsigned long *)buf, 0, bit_pos, 4);

> +		bitmap_set_value_unaligned((unsigned long *)buf, sizes[i],
> +					   bit_pos, 7);

> +	largest_idx = bitmap_get_value_unaligned((unsigned long *)buf, bit_pos,
> +						 l_bits);

> +		r_tags[i] = bitmap_get_value_unaligned((unsigned long *)buf,
> +						       bit_pos, 4);

> +		r_sizes[i] = bitmap_get_value_unaligned((unsigned long *)buf,
> +							bit_pos, 7);

These casts are a red flag. The bitmap API shouldn't be used like this.
Something is not okay here.

...

> +void ea0_release_handle(u64 handle)
> +{
> +	void *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +	struct kmem_cache *c;

> +	if (!handle || !storage)
> +		return;

You use handle before this check. Haven't you run static analysers?

...

> +static int mtecomp_init(void)
> +{
> +	char name[16];
> +	int size;
> +	int i;
> +
> +	for (i = 0; i < NUM_CACHES; i++) {
> +		size = ea0_cache_id_to_size(i);
> +		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
> +		mtecomp_caches[i] =
> +			kmem_cache_create(name, size, size, 0, NULL);
> +	}
> +	return 0;
> +}

> +

Unneeded blank line.

> +module_init(mtecomp_init);
  
Yury Norov July 13, 2023, 7:27 p.m. UTC | #3
> > +		bitmap_set_value_unaligned((unsigned long *)buf, largest_idx,
> > +					   bit_pos, 4);
> 
> > +		bitmap_set_value_unaligned((unsigned long *)buf, largest_idx,
> > +					   bit_pos, 6);
> 
> > +		bitmap_set_value_unaligned((unsigned long *)buf, tags[i],
> > +					   bit_pos, 4);
> 
> > +		bitmap_set_value_unaligned((unsigned long *)buf, 0, bit_pos, 4);
> 
> > +		bitmap_set_value_unaligned((unsigned long *)buf, sizes[i],
> > +					   bit_pos, 7);
> 
> > +	largest_idx = bitmap_get_value_unaligned((unsigned long *)buf, bit_pos,
> > +						 l_bits);
> 
> > +		r_tags[i] = bitmap_get_value_unaligned((unsigned long *)buf,
> > +						       bit_pos, 4);
> 
> > +		r_sizes[i] = bitmap_get_value_unaligned((unsigned long *)buf,
> > +							bit_pos, 7);
> 
> These casts are a red flag. The bitmap API shouldn't be used like this.
> Something is not okay here.

Big-endian arches are not OK. Out-of-bounds access is not OK when
the buf is not exactly a multiple of words.

> > +void ea0_release_handle(u64 handle)
> > +{
> > +	void *storage = ea0_storage(handle);
> > +	int size = ea0_storage_size(handle);
> > +	struct kmem_cache *c;
> 
> > +	if (!handle || !storage)
> > +		return;
> 
> You use handle before this check. Haven't you run static analysers?

This approach is called 'defensive programming', as I learned from the
previous iteration. Another interesting thing is that the only caller
of the function in patch #5 explicitly checks the handle for NULL, so
we're surely double-defended here.

        +void _mte_free_saved_tags(void *storage)
        +{
        +       unsigned long handle = xa_to_value(storage);
        +       int size;
        +
        +       if (!handle)
        +               return;
        +       size = ea0_storage_size(handle);
        +       ea0_release_handle(handle);
        +}

_mte_free_saved_tags() calculates size, but doesn't use it in any form,
just to calculate it again in the callee...

Thanks,
Yury
  
Andy Shevchenko July 14, 2023, 8:01 a.m. UTC | #4
On Thu, Jul 13, 2023 at 12:27:15PM -0700, Yury Norov wrote:

...

> > > +	void *storage = ea0_storage(handle);
> > > +	int size = ea0_storage_size(handle);

> > > +	if (!handle || !storage)
> > > +		return;
> > 
> > You use handle before this check. Haven't you run static analysers?
> 
> This approach is called 'defensive programming', as I learned from the
> previous iteration.

No, this approach is called "let the compiler optimize the check away". :-)

When compiler sees something like

	TYPE bar = MACRO(foo); // FUNC(foo);

	if (!foo)
		blalblabla;

the conditional can be eliminated, as the optimizer reasons: "okay, the
developer already _used_ foo in some code, so it can't be NULL; let's
drop the dead code".
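
A minimal sketch of the classic variant of this pattern, where the use is
a dereference:

        struct foo { int x; };

        int bar(struct foo *foo)
        {
                int x = foo->x; /* dereference: compiler assumes foo != NULL */

                if (!foo)       /* dead code, may be deleted by the optimizer */
                        return 0;
                return x;
        }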

(Yes, I know what defensive programming means and we actually quite rarely
 use it in the kernel for the internal APIs)
  
Alexander Potapenko July 14, 2023, 9:25 a.m. UTC | #5
> > + *
> > + *
> > + *
>
> Isn't it a bit too many blank lines?

Will fix in v3, thanks!

> > + *
> > + *
> > + *
>
> Ditto.
Ack

> ...
>
> > +#include <linux/bitmap.h>
> > +#include <linux/gfp.h>
> > +#include <linux/module.h>
> > +#include <asm/mtecomp.h>
> > +#include <linux/slab.h>
> > +#include <linux/swab.h>
> > +#include <linux/string.h>
> > +#include <linux/types.h>
>
> Please, keep linux/* and asm/* separated like
>
> linux/*
> ...blank line...
> asm/*

Done (here and in other files).

> ...
>
> > +#define HANDLE_MASK ~(BIT_ULL(63))
>
> GENMASK_ULL(62, 0)
Done (here and below)

> Not sure why fls() / BIT() can't be used directly instead of these functions,
> but okay, they are not too ugly.

They can't be used directly because 128 maps to 0, but I can sure
simplify them a bit.
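
E.g., an untested sketch (assuming only 16/32/64/128 are ever passed):

        /* 16 -> 1, 32 -> 2, 64 -> 3, 128 -> 0 */
        static int ea0_size_to_cache_id(int len)
        {
                return (len == 128) ? 0 : fls(len) - 4;
        }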


> > +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> > +{
> > +     u8 prev_tag = 0xff;
>
> GENMASK()? U8_MAX? ((u8)-1)? What is this?
This is an invalid tag value, let it be U8_MAX.


> > +     memset(out_tags, 0, *out_len * sizeof(*out_tags));
> > +     memset(out_sizes, 0, *out_len * sizeof(*out_sizes));
>
> array_size() ?
Done

>
> > +     for (i = 0; i < MTE_GRANULES_PER_PAGE; i++) {
> > +             cur_tag = tags[i / 2];
> > +             if (i % 2)
> > +                     cur_tag = cur_tag % 16;
> > +             else
> > +                     cur_tag = cur_tag / 16;
> > +             if (cur_tag == prev_tag) {
> > +                     out_sizes[cur_idx]++;
> > +             } else {
> > +                     cur_idx++;
> > +                     prev_tag = cur_tag;
> > +                     out_tags[cur_idx] = prev_tag;
> > +                     out_sizes[cur_idx] = 1;
> > +             }
>
> Perhaps instead of doing those % 2, / 2 we simply can go twice less outer loops
> and introduce an inner loop of 2 iterations?

Done


> > +                     if (pos % 2 == 0)
>
> Would be better to keep this aligned with above?
>
>                         if (pos % 2)
>                                 ...
>                         else
>                                 ...

It would, but i % 2 above didn't survive the rewrite, so I assume it
is fine to keep pos % 2 == 0 as is.

>
> ...
>
> > +EXPORT_SYMBOL(ea0_storage_size);
>
> Btw, can we go to the namespaced export from day 1?

Am I getting it right that I just need to change EXPORT_SYMBOL to
EXPORT_SYMBOL_NS and import the namespace in
arch/arm64/mm/test_mtecomp.c?
I.e. MODULE_IMPORT_NS is not needed in mteswap_comp.c, because it is
linked into the kernel?
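
I.e., something like this (with a hypothetical namespace name):

        /* arch/arm64/mm/mtecomp.c */
        EXPORT_SYMBOL_NS(ea0_storage_size, MTECOMP);

        /* arch/arm64/mm/test_mtecomp.c */
        MODULE_IMPORT_NS(MTECOMP);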

>
> ...
>
> > +     for (i = 0; i < len; i++) {
> > +             if (i == len)
> > +                     break;
>
> Interesting dead code. What was the idea behind this?

Haha, no idea :)
Removed it.

>
> > +             if (sizes[i] > largest) {
> > +                     largest = sizes[i];
> > +                     largest_idx = i;
> > +             }
>
> (alas max_array() can't be used here)
There's no max_array() in the kernel, am I missing something?


>
> > +             r_sizes[i] = bitmap_get_value_unaligned((unsigned long *)buf,
> > +                                                     bit_pos, 7);
>
> These casts are a red flag. The bitmap API shouldn't be used like this.
> Something is not okay here.

In fact it makes sense to make buf an unsigned long*, we don't treat
it as a byte array anywhere else.
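
E.g., a sketch of the changed prototype:

        static void ea0_compress_to_buf(int len, u8 *tags, short *sizes,
                                        unsigned long *buf, int buflen);

Then the (unsigned long *) casts at the bitmap_*_unaligned() call sites go
away, and ea0_compress_inline() can build the handle in an unsigned long
(which is 64-bit on arm64) instead of a u64.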

> ...
>
> > +void ea0_release_handle(u64 handle)
> > +{
> > +     void *storage = ea0_storage(handle);
> > +     int size = ea0_storage_size(handle);
> > +     struct kmem_cache *c;
>
> > +     if (!handle || !storage)
> > +             return;
>
> You use handle before this check. Haven't you run static analysers?

Sparse doesn't report anything in these files, are there any
alternatives adopted in the kernel?

Note that handle is not dereferenced above, so there's no error per se.
Yet (as pointed out below) these checks are redundant, so I'll remove
some of them.

>
> ...
>
> > +static int mtecomp_init(void)
> > +{
> > +     char name[16];
> > +     int size;
> > +     int i;
> > +
> > +     for (i = 0; i < NUM_CACHES; i++) {
> > +             size = ea0_cache_id_to_size(i);
> > +             snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
> > +             mtecomp_caches[i] =
> > +                     kmem_cache_create(name, size, size, 0, NULL);
> > +     }
> > +     return 0;
> > +}
>
> > +
>
> Unneeded blank line.

I think there's no agreement on this in the kernel code, but my
version is more popular:

$ git grep -B2 '^module_init(' | grep '\-}' -A2 | grep module_init | wc
   2688    2707  164023
$ git grep -B2 '^module_init(' | grep '\-}' -A1 | grep module_init | wc
    505     523   30989
  
Andy Shevchenko July 14, 2023, 10:47 a.m. UTC | #6
On Fri, Jul 14, 2023 at 11:25:41AM +0200, Alexander Potapenko wrote:

...

> > Not sure why fls() / BIT() can't be used directly instead of these functions,
> > but okay, they are not too ugly.
> 
> They can't be used directly because 128 maps to 0, but I can sure
> simplify them a bit.

Right, that's why I'm okay with the current implementation. But
if you want to rewrite, up to you.

...

> > > +                     if (pos % 2 == 0)
> >
> > Would be better to keep this aligned with above?
> >
> >                         if (pos % 2)
> >                                 ...
> >                         else
> >                                 ...
> 
> It would, but i % 2 above didn't survive the rewrite, so I assume it
> is fine to keep pos % 2 == 0 as is.

Not a big deal, but fewer characters are easier on the brain, so

	if (pos % 2)

is kinda quicker to read and understand in my opinion.

...

> > > +EXPORT_SYMBOL(ea0_storage_size);
> >
> > Btw, can we go to the namespaced export from day 1?
> 
> Am I getting it right that I just need to change EXPORT_SYMBOL to
> EXPORT_SYMBOL_NS and import the namespace in
> arch/arm64/mm/test_mtecomp.c?
> I.e. MODULE_IMPORT_NS is not needed in mteswap_comp.c, because it is
> linked into the kernel?

I think you always need to include MODULE_IMPORT_NS for the sake of
robustness of the code.

...

> > > +             if (sizes[i] > largest) {
> > > +                     largest = sizes[i];
> > > +                     largest_idx = i;
> > > +             }
> >
> > (alas max_array() can't be used here)
> There's no max_array() in the kernel, am I missing something?

There will be (via the ASoC tree and maybe the IIO tree later on) in v6.6-rc1,
but I think it can't be used here anyway because you need the index of the
value as well.

...

> > > +void ea0_release_handle(u64 handle)
> > > +{
> > > +     void *storage = ea0_storage(handle);
> > > +     int size = ea0_storage_size(handle);
> > > +     struct kmem_cache *c;
> >
> > > +     if (!handle || !storage)
> > > +             return;
> >
> > You use handle before this check. Haven't you run static analysers?
> 
> Sparse doesn't report anything in these files, are there any
> alternatives adopted in the kernel?
> 
> Note that handle is not dereferenced above, so there's no error per se.

Even if it's simple pointer arithmetic, the storage might (theoretically?)
be a dangling pointer, no?

> Yet (as pointed out below) these checks are redundant, so I'll remove
> some of them.

...

> > > +
> >
> > Unneeded blank line.
> 
> I think there's no agreement on this in the kernel code, but my
> version is more popular:
> 
> $ git grep -B2 '^module_init(' | grep '\-}' -A2 | grep module_init | wc
>    2688    2707  164023
> $ git grep -B2 '^module_init(' | grep '\-}' -A1 | grep module_init | wc
>     505     523   30989

Even so, there is no need for this blank line. And note, for a better
argument, compare this for new code added over, let's say, the past 2
years. I believe the numbers will tend toward my variant.

I.o.w. you need to count on trends and not only on frequencies.
  
Alexander Potapenko July 14, 2023, 11:17 a.m. UTC | #7
On Fri, Jul 14, 2023 at 12:47 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Fri, Jul 14, 2023 at 11:25:41AM +0200, Alexander Potapenko wrote:
>
> ...
>
> > > Not sure why fls() / BIT() can't be used directly instead of these functions,
> > > but okay, they are not too ugly.
> >
> > They can't be used directly because 128 maps to 0, but I can sure
> > simplify them a bit.
>
> Right, that's why I'm okay with the current implementation. But
> if you want to rewrite, up to you.
>
> ...
>
> > > > +                     if (pos % 2 == 0)
> > >
> > > Would be better to keep this aligned with above?
> > >
> > >                         if (pos % 2)
> > >                                 ...
> > >                         else
> > >                                 ...
> >
> > It would, but i % 2 above didn't survive the rewrite, so I assume it
> > is fine to keep pos % 2 == 0 as is.
>
> Not a big deal, but fewer characters are easier on the brain, so
>
>         if (pos % 2)
>
> is kinda quicker to read and understand in my opinion.

Ok, will do.

> ...
>
> > > > +EXPORT_SYMBOL(ea0_storage_size);
> > >
> > > Btw, can we go to the namespaced export from day 1?
> >
> > Am I getting it right that I just need to change EXPORT_SYMBOL to
> > EXPORT_SYMBOL_NS and import the namespace in
> > arch/arm64/mm/test_mtecomp.c?
> > I.e. MODULE_IMPORT_NS is not needed in mteswap_comp.c, because it is
> > linked into the kernel?
>
> I think you always need to include MODULE_IMPORT_NS for the sake of
> robustness of the code.

Thanks! The docs aren't very specific on this.



> ...
>
> > > > +
> > >
> > > Unneeded blank line.
> >
> > I think there's no agreement on this in the kernel code, but my
> > version is more popular:
> >
> > $ git grep -B2 '^module_init(' | grep '\-}' -A2 | grep module_init | wc
> >    2688    2707  164023
> > $ git grep -B2 '^module_init(' | grep '\-}' -A1 | grep module_init | wc
> >     505     523   30989
>
> Even so, there is no need for this blank line. And note, for a better
> argument, compare this for new code added over, let's say, the past 2
> years. I believe the numbers will tend toward my variant.
>
> I.o.w. you need to count on trends and not only on frequencies.

Fair enough, will fix.
  

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7856c3a3e35af..0aa727888e746 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2091,6 +2091,16 @@  config ARM64_EPAN
 	  if the cpu does not implement the feature.
 endmenu # "ARMv8.7 architectural features"
 
+config ARM64_MTE_COMP
+	bool "Tag compression for ARM64 MTE"
+	default y
+	depends on ARM64_MTE
+	help
+	  Enable tag compression support for ARM64 MTE.
+
+	  128-byte tag buffers corresponding to 4K pages can be compressed using
+	  the EA0 algorithm to save heap memory.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..65a3730cc50d9
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,60 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+/*
+ * ea0_compress() - compress the given tag array.
+ * @tags: 128-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by @ea0_release_handle().
+ */
+u64 ea0_compress(u8 *tags);
+
+/*
+ * ea0_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by @ea0_compress()
+ * @tags: 128-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ * Returns true on success, false on error.
+ */
+bool ea0_decompress(u64 handle, u8 *tags);
+
+/*
+ * ea0_release_handle() - release the handle returned by ea0_compress().
+ * @handle: handle returned by ea0_compress().
+ */
+void ea0_release_handle(u64 handle);
+
+/* Functions below are exported for testing purposes. */
+
+/*
+ * ea0_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by ea0_compress.
+ */
+int ea0_storage_size(u64 handle);
+
+/*
+ * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: 128-byte array containing 256 MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: u16 array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (output parameter, initially
+ *           equal to lengths of out_tags[] and out_sizes[]).
+ */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len);
+
+/*
+ * ea0_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[256] containing the tag of every range.
+ * @r_sizes: u16[256] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: 128-byte array to write the tags to.
+ */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@  obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
 obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..e6c1026bd7e89
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,412 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * Proposed by Evgenii Stepanov <eugenis@google.com>
+ */
+
+/*
+ * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
+ * compression algorithms.
+ *
+ * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
+ * array of tags into a smaller byte sequence that can be stored in a
+ * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
+ * an 8-byte pointer.
+ *
+ * We encapsulate tag storage memory management in this module, because it is
+ * tightly coupled with the pointer representation.
+ *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
+ *     that can be stored in Xarray
+ *   ea0_decompress(handle, *tags) takes the opaque value and loads the tags into
+ *     the provided 128-byte buffer.
+ *
+ *
+ *
+ * The compression algorithm works as follows.
+ *
+ * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
+ *    @r_tags containing tag values and @r_sizes containing range lengths) by
+ *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
+ *
+ * 2. Depending on the number N of ranges, the following storage class is picked:
+ *            N <= 6:  8 bytes (inline case, no allocation required);
+ *       6 < N <= 11: 16 bytes
+ *      11 < N <= 23: 32 bytes
+ *      23 < N <= 46: 64 bytes
+ *      46 < N:       128 bytes (no compression will be performed)
+ *
+ * 3. The index of the largest element of @r_sizes is stored in @largest_idx.
+ *    The element itself is thrown away from @r_sizes, because it can be
+ *    reconstructed from the sum of the remaining elements. Note that now none
+ *    of the remaining @r_sizes elements is greater than 127.
+ *
+ * 4. For the inline case, the following values are stored in the 8-byte handle:
+ *       largest_idx : i4
+ *      r_tags[0..5] : i4 x 6
+ *     r_sizes[0..4] : i7 x 5
+ *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
+ *
+ *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
+ *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
+ *    distinguished from a kernel pointer.
+ *
+ * 5. For the out-of-line case, the storage is allocated from one of the
+ *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
+ *    on 8 bytes, so its bits 2..0 can be used to store the size class:
+ *     - 0 for 128 bytes
+ *     - 1 for 16
+ *     - 2 for 32
+ *     - 3 for 64.
+ *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
+ *
+ * 6. The data layout in the allocated storage is as follows:
+ *         largest_idx : i6
+ *        r_tags[0..N] : i4 x N
+ *     r_sizes[0..N-1] : i7 x (N-1)
+ *
+ *
+ *
+ * The decompression algorithm performs the steps below.
+ *
+ * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
+ *    out-of-line.
+ *
+ * 2. For the inline case, treat the handle itself as the input buffer.
+ *
+ * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
+ *    the input buffer length. To obtain the pointer to the input buffer, unset
+ *    bits 2..0 of the handle and set bit 63.
+ *
+ * 4. If the input buffer is 128 bytes long, copy its contents to the output
+ *    buffer.
+ *
+ * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
+ *    Calculate the removed largest element of @r_sizes:
+ *      largest = 256 - sum(r_sizes)
+ *    and insert it into @r_sizes at position @largest_idx.
+ *
+ * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
+ *    @r_sizes[i] times.
+ */
+
+#include <linux/bitmap.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <asm/mtecomp.h>
+#include <linux/slab.h>
+#include <linux/swab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+/* The handle must fit into an Xarray value. */
+#define HANDLE_MASK ~(BIT_ULL(63))
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define NOINLINE_MASK (BIT_ULL(62) | BIT_ULL(61) | BIT_ULL(60))
+
+/* Cache index is stored in the lowest pointer bits. */
+#define CACHE_ID_MASK (BIT_ULL(2) | BIT_ULL(1) | BIT_ULL(0))
+
+/* Four separate caches to store out-of-line data. */
+#define NUM_CACHES 4
+static struct kmem_cache *mtecomp_caches[NUM_CACHES];
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static int ea0_size_to_cache_id(int len)
+{
+	switch (len) {
+	case 16:
+		return 1;
+	case 32:
+		return 2;
+	case 64:
+		return 3;
+	default:
+		return 0;
+	}
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static int ea0_cache_id_to_size(int id)
+{
+	switch (id) {
+	case 1:
+		return 16;
+	case 2:
+		return 32;
+	case 3:
+		return 64;
+	default:
+		return 128;
+	}
+}
+
+/* Transform tags into tag ranges. */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
+{
+	u8 prev_tag = 0xff;
+	int cur_idx = -1;
+	u8 cur_tag;
+	int i;
+
+	memset(out_tags, 0, *out_len * sizeof(*out_tags));
+	memset(out_sizes, 0, *out_len * sizeof(*out_sizes));
+	for (i = 0; i < MTE_GRANULES_PER_PAGE; i++) {
+		cur_tag = tags[i / 2];
+		if (i % 2)
+			cur_tag = cur_tag % 16;
+		else
+			cur_tag = cur_tag / 16;
+		if (cur_tag == prev_tag) {
+			out_sizes[cur_idx]++;
+		} else {
+			cur_idx++;
+			prev_tag = cur_tag;
+			out_tags[cur_idx] = prev_tag;
+			out_sizes[cur_idx] = 1;
+		}
+	}
+	*out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL(ea0_tags_to_ranges);
+
+/* Transform tag ranges back into tags. */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
+{
+	int i, j, pos = 0;
+	u8 prev;
+
+	for (i = 0; i < r_len; i++) {
+		for (j = 0; j < r_sizes[i]; j++) {
+			if (pos % 2 == 0)
+				prev = r_tags[i];
+			else
+				tags[pos / 2] = (prev << 4) | r_tags[i];
+			pos++;
+		}
+	}
+}
+EXPORT_SYMBOL(ea0_ranges_to_tags);
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static int ea0_alloc_size(int num_ranges)
+{
+	if (num_ranges <= 6)
+		return 8;
+	if (num_ranges <= 11)
+		return 16;
+	if (num_ranges <= 23)
+		return 32;
+	if (num_ranges <= 46)
+		return 64;
+	return 128;
+}
+
+/* Translate allocation size into maximum number of ranges that it can hold. */
+static int ea0_size_to_ranges(int size)
+{
+	switch (size) {
+	case 8:
+		return 6;
+	case 16:
+		return 11;
+	case 32:
+		return 23;
+	case 64:
+		return 46;
+	default:
+		return 0;
+	}
+}
+
+/* Is the data stored inline in the handle itself? */
+static bool ea0_is_inline(u64 handle)
+{
+	return (handle & NOINLINE_MASK) != NOINLINE_MASK;
+}
+
+/* Get the size of the buffer backing @handle. */
+int ea0_storage_size(u64 handle)
+{
+	if (ea0_is_inline(handle))
+		return 8;
+	return ea0_cache_id_to_size(handle & CACHE_ID_MASK);
+}
+EXPORT_SYMBOL(ea0_storage_size);
+
+/* Compress ranges into the buffer of the given length. */
+static void ea0_compress_to_buf(int len, u8 *tags, short *sizes, u8 *buf,
+				int buflen)
+{
+	int largest_idx = -1, i;
+	short largest = 0;
+	unsigned long bit_pos = 0;
+
+	for (i = 0; i < len; i++) {
+		if (i == len)
+			break;
+		if (sizes[i] > largest) {
+			largest = sizes[i];
+			largest_idx = i;
+		}
+	}
+	if (len <= 6) {
+		/* Inline case, @buflen <= 8. */
+		bitmap_set_value_unaligned((unsigned long *)buf, largest_idx,
+					   bit_pos, 4);
+		bit_pos += 4;
+	} else {
+		bitmap_set_value_unaligned((unsigned long *)buf, largest_idx,
+					   bit_pos, 6);
+		bit_pos += 6;
+	}
+	for (i = 0; i < len; i++) {
+		bitmap_set_value_unaligned((unsigned long *)buf, tags[i],
+					   bit_pos, 4);
+		bit_pos += 4;
+	}
+	for (i = len; i < ea0_size_to_ranges(buflen); i++) {
+		bitmap_set_value_unaligned((unsigned long *)buf, 0, bit_pos, 4);
+		bit_pos += 4;
+	}
+	for (i = 0; i < len; i++) {
+		if (i == largest_idx)
+			continue;
+		bitmap_set_value_unaligned((unsigned long *)buf, sizes[i],
+					   bit_pos, 7);
+		bit_pos += 7;
+	}
+}
+
+/* Compress the data inline. */
+static u64 ea0_compress_inline(int len, u8 *tags, short *sizes)
+{
+	u64 result;
+
+	ea0_compress_to_buf(len, tags, sizes, (u8 *)&result, sizeof(result));
+	result = be64_to_cpu(result);
+	return result;
+}
+
+/* Compress @tags and return a handle. */
+u64 ea0_compress(u8 *tags)
+{
+	int alloc_size, cache_id;
+	struct kmem_cache *cache;
+	short r_sizes[256];
+	u8 r_tags[256];
+	int r_len = ARRAY_SIZE(r_tags);
+	u8 *storage;
+
+	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	alloc_size = ea0_alloc_size(r_len);
+	if (alloc_size == 8)
+		return ea0_compress_inline(r_len, r_tags, r_sizes);
+	cache_id = ea0_size_to_cache_id(alloc_size);
+	cache = mtecomp_caches[cache_id];
+	storage = kmem_cache_alloc(cache, GFP_KERNEL);
+	if (alloc_size < 128) {
+		ea0_compress_to_buf(r_len, r_tags, r_sizes, storage,
+				    alloc_size);
+		return ((u64)storage | cache_id) & HANDLE_MASK;
+	}
+	memcpy(storage, tags, alloc_size);
+	return (u64)storage & HANDLE_MASK;
+}
+EXPORT_SYMBOL(ea0_compress);
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool ea0_decompress_from_buf(u8 *buf, int buflen, u8 *tags)
+{
+	int largest_idx, i, r_len = ea0_size_to_ranges(buflen);
+	short r_sizes[46], sum = 0;
+	u8 r_tags[46];
+	unsigned long bit_pos = 0, l_bits = (buflen == 8) ? 4 : 6;
+
+	largest_idx = bitmap_get_value_unaligned((unsigned long *)buf, bit_pos,
+						 l_bits);
+	bit_pos += l_bits;
+	for (i = 0; i < r_len; i++) {
+		r_tags[i] = bitmap_get_value_unaligned((unsigned long *)buf,
+						       bit_pos, 4);
+		bit_pos += 4;
+	}
+	for (i = 0; i < r_len; i++) {
+		if (i == largest_idx)
+			continue;
+		r_sizes[i] = bitmap_get_value_unaligned((unsigned long *)buf,
+							bit_pos, 7);
+		bit_pos += 7;
+		if (!r_sizes[i]) {
+			r_len = i;
+			break;
+		}
+		sum += r_sizes[i];
+	}
+	if (sum >= 256)
+		return false;
+	r_sizes[largest_idx] = 256 - sum;
+	ea0_ranges_to_tags(r_tags, r_sizes, r_len, tags);
+	return true;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *ea0_storage(u64 handle)
+{
+	if (ea0_is_inline(handle))
+		return NULL;
+	return (void *)((handle & (~CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/* Decompress tags from the buffer referenced by @handle. */
+bool ea0_decompress(u64 handle, u8 *tags)
+{
+	u8 *storage = ea0_storage(handle);
+	int size = ea0_storage_size(handle);
+
+	if (size == 128) {
+		memcpy(tags, storage, size);
+		return true;
+	}
+	if (size == 8) {
+		handle = cpu_to_be64(handle);
+		return ea0_decompress_from_buf((u8 *)&handle, sizeof(handle),
+					       tags);
+	}
+	return ea0_decompress_from_buf(storage, size, tags);
+}
+EXPORT_SYMBOL(ea0_decompress);
+
+/* Release the memory referenced by @handle. */
+void ea0_release_handle(u64 handle)
+{
+	void *storage = ea0_storage(handle);
+	int size = ea0_storage_size(handle);
+	struct kmem_cache *c;
+
+	if (!handle || !storage)
+		return;
+
+	c = mtecomp_caches[ea0_size_to_cache_id(size)];
+	kmem_cache_free(c, storage);
+}
+EXPORT_SYMBOL(ea0_release_handle);
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+	char name[16];
+	int size;
+	int i;
+
+	for (i = 0; i < NUM_CACHES; i++) {
+		size = ea0_cache_id_to_size(i);
+		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
+		mtecomp_caches[i] =
+			kmem_cache_create(name, size, size, 0, NULL);
+	}
+	return 0;
+}
+
+module_init(mtecomp_init);