[v3,3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

Message ID 20230717113709.328671-4-glider@google.com
State New
Headers
Series Implement MTE tag compression for swapped pages |

Commit Message

Alexander Potapenko July 17, 2023, 11:37 a.m. UTC
  The config implements the EA0 algorithm suggested by Evgenii Stepanov
to compress the memory tags for ARM MTE during swapping.

The algorithm is based on RLE and specifically targets 128-byte buffers
of tags corresponding to a single page. In the common case a buffer
can be compressed into 63 bits, making it possible to store it without
additional memory allocation.

Suggested-by: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v3:
  - Addressed comments by Andy Shevchenko:
   - use bitmap_{set,get}_value() writte by Syed Nayyar Waris
   - switched to unsigned long everywhere (fewer casts)
   - simplified the code, removed redundant checks
   - dropped ea0_compress_inline()
 - added bit size constants and helpers to access the bitmap
 - explicitly initialize all compressed sizes in ea0_compress_to_buf()
 - initialize all handle bits

 v2:
  - as suggested by Yuri Norov, switched from struct bitq (which is
    not needed anymore) to <linux/bitmap.h>
  - add missing symbol exports
---
 arch/arm64/Kconfig               |  10 +
 arch/arm64/include/asm/mtecomp.h |  60 +++++
 arch/arm64/mm/Makefile           |   1 +
 arch/arm64/mm/mtecomp.c          | 406 +++++++++++++++++++++++++++++++
 4 files changed, 477 insertions(+)
 create mode 100644 arch/arm64/include/asm/mtecomp.h
 create mode 100644 arch/arm64/mm/mtecomp.c
  

Comments

Andy Shevchenko July 17, 2023, 1:49 p.m. UTC | #1
On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> The config implements the EA0 algorithm suggested by Evgenii Stepanov
> to compress the memory tags for ARM MTE during swapping.
> 
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.

...

> +config ARM64_MTE_COMP
> +	bool "Tag compression for ARM64 MTE"

At least here, make sure everybody understands what you are talking about. WTF
MTE is?

> +	default y
> +	depends on ARM64_MTE
> +	help
> +	  Enable tag compression support for ARM64 MTE.
> +
> +	  128-byte tag buffers corresponding to 4K pages can be compressed using
> +	  the EA0 algorithm to save heap memory.

>  config ARM64_SVE
>  	bool "ARM Scalable Vector Extension support"

You see the difference?

...

> +/*

Are you deliberately made it NON-kernel-doc? If so, why? And why does it
have too many similarities with above mentioned format?

> + * ea0_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> + */
> +unsigned long ea0_compress(u8 *tags);
> +
> +/*
> + * ea0_decompress() - decompress the tag array addressed by the handle.
> + * @handle: handle returned by @ea0_decompress()
> + * @tags: 128-byte array to write the tags to.
> + *
> + * Reads the compressed data and writes it into the user-supplied tag array.
> + * Returns true on success, false on error.

In case you are going to make them real kernel-doc:s, make sure kernel-doc
validator doesn't warn. Here, for example, return section is missing. The easy
fix is to add : after Returns. Same to the rest of function descriptions. Also
why you put the descriptions in to the header file? It's a bit unusual for the
exported ones.

> + */

...

> +/*
> + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> + * @tags: 128-byte array containing 256 MTE tags.
> + * @out_tags: u8 array to store the tag of every range.
> + * @out_sizes: u16 array to store the size of every range.

u16? I don't see it.

> + * @out_len: length of @out_tags and @out_sizes (output parameter, initially
> + *           equal to lengths of out_tags[] and out_sizes[]).
> + */

> +/*
> + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> + * @r_tags: u8[256] containing the tag of every range.
> + * @r_sizes: u16[256] containing the size of every range.

Ditto.

> + * @r_len: length of @r_tags and @r_sizes.
> + * @tags: 128-byte array to write the tags to.
> + */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);

In both cases signed integer may be promoted with a sign. Is it a problem here?

...

> +/*
> + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> + * compression algorithms.
> + *
> + * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
> + * array of tags into a smaller byte sequence that can be stored in a
> + * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
> + * an 8-byte pointer.
> + *
> + * We encapsulate tag storage memory management in this module, because it is
> + * tightly coupled with the pointer representation.

> + *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value

ea0_compress() is usual way how we refer to the functions. Let tools to make
the necessary references.

> + *     that can be stored in Xarray
> + *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into

Ditto. And so on.

> + *     the provided 128-byte buffer.
> + *
> + * The compression algorithm works as follows.
> + *
> + * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
> + *    @r_tags containing tag values and @r_sizes containing range lengths) by
> + *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
> + *
> + * 2. Depending on the number N of ranges, the following storage class is picked:
> + *            N <= 6:  8 bytes (inline case, no allocation required);
> + *       6 < N <= 11: 16 bytes
> + *      11 < N <= 23: 32 bytes
> + *      23 < N <= 46: 64 bytes
> + *      46 < N:       128 bytes (no compression will be performed)
> + *
> + * 3. The number of the largest element of @r_sizes is stored in @largest_idx.
> + *    The element itself is thrown away from @r_sizes, because it can be
> + *    reconstructed from the sum of the remaining elements. Note that now none
> + *    of the remaining @r_sizes elements is greater than 127.
> + *
> + * 4. For the inline case, the following values are stored in the 8-byte handle:
> + *       largest_idx : i4
> + *      r_tags[0..5] : i4 x 6
> + *     r_sizes[0..4] : i7 x 5
> + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> + *
> + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> + *    distinguished from a kernel pointer.
> + *
> + * 5. For the out-of-line case, the storage is allocated from one of the
> + *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> + *    on 8 bytes, so its bits 2..0 can be used to store the size class:

> + *     - 0 for 128 bytes
> + *     - 1 for 16
> + *     - 2 for 32
> + *     - 4 for 64.

Is this chosen deliberately (for performance?)? Otherwise why not put them in
natural exponential growing?

> + *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
> + *
> + * 6. The data layout in the allocated storage is as follows:
> + *         largest_idx : i6
> + *        r_tags[0..N] : i4 x N
> + *     r_sizes[0..N-1] : i7 x (N-1)
> + *
> + * The decompression algorithm performs the steps below.
> + *
> + * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
> + *    out-of line.
> + *
> + * 2. For the inline case, treat the handle itself as the input buffer.
> + *
> + * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
> + *    the input buffer length. To obtain the pointer to the input buffer, unset
> + *    bits 2..0 of the handle and set bit 63.
> + *
> + * 4. If the input buffer is 128 byte long, copy its contents to the output
> + *    buffer.
> + *
> + * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
> + *    Calculate the removed largest element of @r_sizes:
> + *      largest = 256 - sum(r_sizes)
> + *    and insert it into @r_sizes at position @largest_idx.
> + *
> + * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
> + *    @r_sizes[i] times.
> + */

...

> +#include <linux/bits.h>
> +#include <linux/bitmap.h>

bitmap guarantees that bits.h will be included.

> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/swab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

...

> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> +{
> +	u8 prev_tag = U8_MAX;

> +	int cur_idx = -1;

At which circumstances does this assignment make sense?

> +	u8 cur_tag;
> +	int i, j;
> +
> +	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
> +	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
> +
> +	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> +		for (j = 0; j < 2; j++) {
> +			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> +			if (cur_tag == prev_tag) {
> +				out_sizes[cur_idx]++;

Who guarantees this one is not [-1]?

> +			} else {

> +				cur_idx++;

Aha, above seems a bit prone to out of boundaries errors. Can you make it
unsigned and start from 0?

> +				prev_tag = cur_tag;
> +				out_tags[cur_idx] = prev_tag;
> +				out_sizes[cur_idx] = 1;
> +			}
> +		}
> +	}
> +	*out_len = cur_idx + 1;
> +}

...

> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> +{
> +	int i, j, pos = 0;

Wouldn't be more correct to have this assignment inside the first for-loop?

> +	u8 prev;
> +
> +	for (i = 0; i < r_len; i++) {
> +		for (j = 0; j < r_sizes[i]; j++) {
> +			if (pos % 2)
> +				tags[pos / 2] = (prev << 4) | r_tags[i];
> +			else
> +				prev = r_tags[i];
> +			pos++;
> +		}
> +	}
> +}

...

> +#define RANGES_INLINE ea0_size_to_ranges(8)

Don't forget to undef it when not needed.

...

> +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> +			 unsigned long *pos, unsigned long bits)

Please, don't use reserved namespace. Yours is ea0, use it:
ea0_bitmap_write()! Same to other similarly named functions.

...

> +	unsigned long bit_pos = 0, l_bits;
> +	int largest_idx = -1, i;
> +	short largest = 0;

Here and elsewhere, please, double check the correctness and/or necessity of
signdness and assignments of local variables.

...

> +	for (i = 0; i < len; i++) {
> +		if (sizes[i] > largest) {

Here

		if (largest >= sizes[i])
			continue;
makes sense, but...

> +			largest = sizes[i];
> +			largest_idx = i;
> +		}
> +	}

...

> +	for (i = 0; i < len; i++) {
> +		if (i == largest_idx)
> +			continue;
> +		bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);

...here I would do the opposite since it's one liner.

> +	}

...

> +	u8 r_tags[256];
> +	int r_len = ARRAY_SIZE(r_tags);

sizeof() ?

...

> +	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> +						 BITS_PER_LARGEST_IDX;

Is it a dup? Perhaps a helper for this?

Seems BITS_PER_TAG, BITS_PER_SIZE and the rest should also be namespaced,
EA0_BITS_...

...

> +bool ea0_decompress(unsigned long handle, u8 *tags)
> +{
> +	unsigned long *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +
> +	if (size == 128) {
> +		memcpy(tags, storage, size);
> +		return true;
> +	}
> +	if (size == 8)
> +		return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);

Maybe

	switch (ea0_storage_size(handle)) {
		...
	default:
	}

?

> +	return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
> +}

...

> +void ea0_release_handle(unsigned long handle)
> +{
> +	void *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +	struct kmem_cache *c;

> +	if (!storage)
> +		return;

I find slightly better for maintaining in the form as

	struct kmem_cache *c;
	void *storage;
	int size;

	storage = ea0_storage(handle);
	if (!storage)
		return;

	size = ea0_storage_size(handle);

> +	c = mtecomp_caches[ea0_size_to_cache_id(size)];
> +	kmem_cache_free(c, storage);
> +}

...

> +static int mtecomp_init(void)
> +{
> +	char name[16];
> +	int size;
> +	int i;
> +
> +	BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);

Why not static_assert()?

> +	for (i = 0; i < NUM_CACHES; i++) {
> +		size = ea0_cache_id_to_size(i);
> +		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);

sizeof() will work the same way without need of having kernel.h be included.

> +		mtecomp_caches[i] =
> +			kmem_cache_create(name, size, size, 0, NULL);
> +	}
> +	return 0;
> +}
  
Alexander Potapenko July 18, 2023, 3:33 p.m. UTC | #2
On Mon, Jul 17, 2023 at 3:49 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> > The config implements the EA0 algorithm suggested by Evgenii Stepanov
> > to compress the memory tags for ARM MTE during swapping.
> >
> > The algorithm is based on RLE and specifically targets 128-byte buffers
> > of tags corresponding to a single page. In the common case a buffer
> > can be compressed into 63 bits, making it possible to store it without
> > additional memory allocation.
>
> ...
>
> > +config ARM64_MTE_COMP
> > +     bool "Tag compression for ARM64 MTE"
>
> At least here, make sure everybody understands what you are talking about. WTF
> MTE is?
Replaced with "Memory Tagging Extension"


> ...
>
> > +/*
>
> Are you deliberately made it NON-kernel-doc? If so, why? And why does it
> have too many similarities with above mentioned format?

No, just by lack of habit.

> > + * ea0_compress() - compress the given tag array.
> > + * @tags: 128-byte array to read the tags from.
> > + *
> > + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> > + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> > + */
> > +unsigned long ea0_compress(u8 *tags);
> > +
> > +/*
> > + * ea0_decompress() - decompress the tag array addressed by the handle.
> > + * @handle: handle returned by @ea0_decompress()
> > + * @tags: 128-byte array to write the tags to.
> > + *
> > + * Reads the compressed data and writes it into the user-supplied tag array.
> > + * Returns true on success, false on error.
>
> In case you are going to make them real kernel-doc:s, make sure kernel-doc
> validator doesn't warn.

I'll try to.
However it doesn't seem to be very picky.

  $ scripts/kernel-doc -v  -none arch/arm64/include/asm/mtecomp.h

warns about e.g. parameter name mismatch, but does not care about the
missing return section.

> Here, for example, return section is missing. The easy
> fix is to add : after Returns. Same to the rest of function descriptions.
Done

> Also
> why you put the descriptions in to the header file? It's a bit unusual for the
> exported ones.

https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
this, and I thought anyone wanting to understand how an interface
works would prefer reading the header rather than the implementation.
I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
better place for them.

> > +/*
> > + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> > + * @tags: 128-byte array containing 256 MTE tags.
> > + * @out_tags: u8 array to store the tag of every range.
> > + * @out_sizes: u16 array to store the size of every range.
>
> u16? I don't see it.
Fixed.


> > +/*
> > + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> > + * @r_tags: u8[256] containing the tag of every range.
> > + * @r_sizes: u16[256] containing the size of every range.
>
> Ditto.
Fixed.

>
> > + * @r_len: length of @r_tags and @r_sizes.
> > + * @tags: 128-byte array to write the tags to.
> > + */
> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
>
> In both cases signed integer may be promoted with a sign. Is it a problem here?
Not sure if you mean r_len or r_sizes, but all those values are >= 0
and <= 256, so there should be no problems.
(shorts could have been ints as well, we are just saving some stack
space in ea0_compress()/ea0_decompress()).

> > + *
> > + * We encapsulate tag storage memory management in this module, because it is
> > + * tightly coupled with the pointer representation.
>
> > + *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
>
> ea0_compress() is usual way how we refer to the functions. Let tools to make
> the necessary references.

Done.

>
> > + *     that can be stored in Xarray
> > + *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
>
> Ditto. And so on.
Done.


> > + * 5. For the out-of-line case, the storage is allocated from one of the
> > + *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> > + *    on 8 bytes, so its bits 2..0 can be used to store the size class:
>
> > + *     - 0 for 128 bytes
> > + *     - 1 for 16
> > + *     - 2 for 32
> > + *     - 4 for 64.
>
> Is this chosen deliberately (for performance?)? Otherwise why not put them in
> natural exponential growing?

This way pointers to uncompressed buffers do not have extra data stored in them.
This property does not have any immediate use though.


> ...
>
> > +#include <linux/bits.h>
> > +#include <linux/bitmap.h>
>
> bitmap guarantees that bits.h will be included.

I am following the IWYU principle here, and I believe it's a good thing to do.
I've seen cases where these transitive dependencies rotted after some
refactoring, but the fact was only noticed in certain configurations.
Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
by optimizing headers
(https://lwn.net/ml/linux-kernel/YdIfz+LMewetSaEB@gmail.com/), and it
relies on explicit dependencies which are easier to untangle.

>
> ...
>
> > +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> > +{
> > +     u8 prev_tag = U8_MAX;
>
> > +     int cur_idx = -1;
>
> At which circumstances does this assignment make sense?
This value is never used to index the array, it is incremented on the
first loop iteration.

> > +     for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> > +             for (j = 0; j < 2; j++) {
> > +                     cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> > +                     if (cur_tag == prev_tag) {
> > +                             out_sizes[cur_idx]++;
>
> Who guarantees this one is not [-1]?
The fact that prev_tag is initialized to U8_MAX, whereas cur_tag <= 15.


> > +                     } else {
>
> > +                             cur_idx++;
>
> Aha, above seems a bit prone to out of boundaries errors.
Not really, but since you stumbled upon it, I'd better make it more readable.

> Can you make it
> unsigned and start from 0?

Changed to start with 0, but I am a bit hesitant about making it
unsigned: it is now no more special than a loop variable.

> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > +{
> > +     int i, j, pos = 0;
>
> Wouldn't be more correct to have this assignment inside the first for-loop?

Do you mean setting it back to 0 on every iteration of the outer loop?
We sure don't want that, since pos is the location in the tags[] array
where the next tag is written.
If what you meant is doing "for (i = 0, pos = 0; ...)" this is a
question of preference, but I can do that.


>
> > +#define RANGES_INLINE ea0_size_to_ranges(8)
>
> Don't forget to undef it when not needed.

Ok, will do.
Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
The intuitive answer is "no", but then there should be some difference
between those and RANGES_INLINE?

>
> ...
>
> > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > +                      unsigned long *pos, unsigned long bits)
>
> Please, don't use reserved namespace. Yours is ea0, use it:
> ea0_bitmap_write()! Same to other similarly named functions.

Done.
However there are two parallel namespaces now: "ea0" for the
compression algorithm, and "memcomp" for the module initialization and
data structures.
Dunno if it makes sense to merge them (and rename the .c file accordingly).

> ...
>
> > +     unsigned long bit_pos = 0, l_bits;
> > +     int largest_idx = -1, i;
> > +     short largest = 0;
>
> Here and elsewhere, please, double check the correctness and/or necessity of
> signdness and assignments of local variables.

I replaced a bunch of ints with size_t where it made sense (basically
all array indices remain ints, but all sizes are size_t now).
Also removed the assignment to largest_idx above.


> Here
>
>                 if (largest >= sizes[i])
>                         continue;
> makes sense, but...
>
> > +                     largest = sizes[i];
> > +                     largest_idx = i;
> > +             }
> > +     }
>
> ...
>
> > +     for (i = 0; i < len; i++) {
> > +             if (i == largest_idx)
> > +                     continue;
> > +             bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
>
> ...here I would do the opposite since it's one liner.

Agreed.

>
> > +     }
>
> ...
>
> > +     u8 r_tags[256];
> > +     int r_len = ARRAY_SIZE(r_tags);
>
No, it is the length of the arrays (both r_tags and r_sizes).
Below you make a good point it will spare us a kernel.h dependency,
but for the sake of keeping the code error-prone we probably shouldn't
assume r_tags is a byte array.


> > +     l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> > +                                              BITS_PER_LARGEST_IDX;
>
> Is it a dup? Perhaps a helper for this?
Do you mean it is the same in ea0_compress_to_buf() and
ea0_decompress_from_buf()?
Added a helper.

> Seems BITS_PER_TAG, BITS_PER_SIZE and the rest should also be namespaced,
> EA0_BITS_...

Done.

> ...
>
> > +bool ea0_decompress(unsigned long handle, u8 *tags)
> > +{
> > +     unsigned long *storage = ea0_storage(handle);
> > +     int size = ea0_storage_size(handle);
> > +
> > +     if (size == 128) {
> > +             memcpy(tags, storage, size);
> > +             return true;
> > +     }
> > +     if (size == 8)
> > +             return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
>
> Maybe
>
>         switch (ea0_storage_size(handle)) {
>                 ...
>         default:
>         }
>
> ?

Sounds reasonable, done.

> > +void ea0_release_handle(unsigned long handle)
> > +{
> > +     void *storage = ea0_storage(handle);
> > +     int size = ea0_storage_size(handle);
> > +     struct kmem_cache *c;
>
> > +     if (!storage)
> > +             return;
>
> I find slightly better for maintaining in the form as
>
>         struct kmem_cache *c;
>         void *storage;
>         int size;
>
>         storage = ea0_storage(handle);
>         if (!storage)
>                 return;
>
>         size = ea0_storage_size(handle);

Done.

> > +static int mtecomp_init(void)
> > +{
> > +     char name[16];
> > +     int size;
> > +     int i;
> > +
> > +     BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
>
> Why not static_assert()?
>
> > +     for (i = 0; i < NUM_CACHES; i++) {
> > +             size = ea0_cache_id_to_size(i);
> > +             snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
>
> sizeof() will work the same way without need of having kernel.h be included.

Done.
  
Andy Shevchenko July 18, 2023, 5:17 p.m. UTC | #3
On Tue, Jul 18, 2023 at 05:33:37PM +0200, Alexander Potapenko wrote:
> On Mon, Jul 17, 2023 at 3:49 PM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> > On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:

...

> However it doesn't seem to be very picky.
> 
>   $ scripts/kernel-doc -v  -none arch/arm64/include/asm/mtecomp.h
> 
> warns about e.g. parameter name mismatch, but does not care about the
> missing return section.

-Wreturn is missing

...

> > Also
> > why you put the descriptions in to the header file? It's a bit unusual for the
> > exported ones.
> 
> https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
> this, and I thought anyone wanting to understand how an interface
> works would prefer reading the header rather than the implementation.
> I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
> better place for them.

With the kernel doc in the C file you may also comment the internal ones and
generate documentation only for exported ones. This is not possible with h.

...

> > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
> > In both cases signed integer may be promoted with a sign. Is it a problem here?
> Not sure if you mean r_len or r_sizes,

Mostly about the latter.

> but all those values are >= 0
> and <= 256, so there should be no problems.
> (shorts could have been ints as well, we are just saving some stack
> space in ea0_compress()/ea0_decompress()).

Then why they are signed? Please, justify that.
Signdness prone to subtle and hard to hunt errors due to integer promotions.

...

> > > +#include <linux/bits.h>
> > > +#include <linux/bitmap.h>
> >
> > bitmap guarantees that bits.h will be included.
> 
> I am following the IWYU principle here, and I believe it's a good thing to do.
> I've seen cases where these transitive dependencies rotted after some
> refactoring, but the fact was only noticed in certain configurations.
> Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
> by optimizing headers
> (https://lwn.net/ml/linux-kernel/YdIfz+LMewetSaEB@gmail.com/), and it
> relies on explicit dependencies which are easier to untangle.

Yes, but we have some guarantees. E.g., we don't include compiler*.h
when types.h is included, because of the guarantees.

Otherwise your code misses _a lot_ of headers.

...

> > Can you make it unsigned and start from 0?
> 
> Changed to start with 0, but I am a bit hesitant about making it
> unsigned: it is now no more special than a loop variable.

Signdness is a beast in C, needs always an additional justification.

...

> > > +     int i, j, pos = 0;
> >
> > Wouldn't be more correct to have this assignment inside the first for-loop?
> 
> Do you mean setting it back to 0 on every iteration of the outer loop?

Yes.

> We sure don't want that, since pos is the location in the tags[] array
> where the next tag is written.

OK!

...

> > > +#define RANGES_INLINE ea0_size_to_ranges(8)
> >
> > Don't forget to undef it when not needed.
> 
> Ok, will do.

> Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
> The intuitive answer is "no",

Correct.

> but then there should be some difference between those and RANGES_INLINE?

Yes, one is widely used constant and one is a _localized_ helper.

...

> > > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > > +                      unsigned long *pos, unsigned long bits)
> >
> > Please, don't use reserved namespace. Yours is ea0, use it:
> > ea0_bitmap_write()! Same to other similarly named functions.
> 
> Done.
> However there are two parallel namespaces now: "ea0" for the
> compression algorithm, and "memcomp" for the module initialization and
> data structures.
> Dunno if it makes sense to merge them (and rename the .c file accordingly).

Your choice. Mu point, just do prefix it with something meaningful.

...

> > > +     u8 r_tags[256];
> > > +     int r_len = ARRAY_SIZE(r_tags);
> >
> No, it is the length of the arrays (both r_tags and r_sizes).
> Below you make a good point it will spare us a kernel.h dependency,
> but for the sake of keeping the code error-prone we probably shouldn't
> assume r_tags is a byte array.

It's a common practice even outside of Linux kernel to use sizeof() against
char arrays. I don't see the point to have the ARRAY_SIZE() complication here.

...

> > > +             snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);

You see, if you grep for similar calls I'm pretty sure the order of 2 of power
of 10 will be the difference between sizeof()/ARRAY_SIZE() if the latter even
occurs at least once.
  
Yury Norov July 19, 2023, 6:09 a.m. UTC | #4
On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> The config implements the EA0 algorithm suggested by Evgenii Stepanov
> to compress the memory tags for ARM MTE during swapping.
> 
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.
> 
> Suggested-by: Evgenii Stepanov <eugenis@google.com>
> Signed-off-by: Alexander Potapenko <glider@google.com>
> 
> ---
>  v3:
>   - Addressed comments by Andy Shevchenko:
>    - use bitmap_{set,get}_value() writte by Syed Nayyar Waris
>    - switched to unsigned long everywhere (fewer casts)
>    - simplified the code, removed redundant checks
>    - dropped ea0_compress_inline()
>  - added bit size constants and helpers to access the bitmap
>  - explicitly initialize all compressed sizes in ea0_compress_to_buf()
>  - initialize all handle bits
> 
>  v2:
>   - as suggested by Yuri Norov, switched from struct bitq (which is
>     not needed anymore) to <linux/bitmap.h>
>   - add missing symbol exports
> ---
>  arch/arm64/Kconfig               |  10 +
>  arch/arm64/include/asm/mtecomp.h |  60 +++++
>  arch/arm64/mm/Makefile           |   1 +
>  arch/arm64/mm/mtecomp.c          | 406 +++++++++++++++++++++++++++++++
>  4 files changed, 477 insertions(+)
>  create mode 100644 arch/arm64/include/asm/mtecomp.h
>  create mode 100644 arch/arm64/mm/mtecomp.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2511b30d0f67..52cdc7603cf7c 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2093,6 +2093,16 @@ config ARM64_EPAN
>  	  if the cpu does not implement the feature.
>  endmenu # "ARMv8.7 architectural features"
>  
> +config ARM64_MTE_COMP
> +	bool "Tag compression for ARM64 MTE"
> +	default y
> +	depends on ARM64_MTE
> +	help
> +	  Enable tag compression support for ARM64 MTE.
> +
> +	  128-byte tag buffers corresponding to 4K pages can be compressed using
> +	  the EA0 algorithm to save heap memory.
> +
>  config ARM64_SVE
>  	bool "ARM Scalable Vector Extension support"
>  	default y
> diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
> new file mode 100644
> index 0000000000000..0c444c0d4ac04
> --- /dev/null
> +++ b/arch/arm64/include/asm/mtecomp.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __ASM_MTECOMP_H
> +#define __ASM_MTECOMP_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * ea0_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> + */
> +unsigned long ea0_compress(u8 *tags);
> +
> +/*
> + * ea0_decompress() - decompress the tag array addressed by the handle.
> + * @handle: handle returned by @ea0_decompress()
> + * @tags: 128-byte array to write the tags to.
> + *
> + * Reads the compressed data and writes it into the user-supplied tag array.
> + * Returns true on success, false on error.
> + */
> +bool ea0_decompress(unsigned long handle, u8 *tags);
> +
> +/*
> + * ea0_release_handle() - release the handle returned by ea0_compress().
> + * @handle: handle returned by ea0_compress().
> + */
> +void ea0_release_handle(unsigned long handle);
> +
> +/* Functions below are exported for testing purposes. */

Then declare them in a separate local header or simply in the test, but
please not in a public header.

> +
> +/*
> + * ea0_storage_size() - calculate the memory occupied by compressed tags.
> + * @handle: storage handle returned by ea0_compress.
> + */
> +int ea0_storage_size(unsigned long handle);
> +
> +/*
> + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> + * @tags: 128-byte array containing 256 MTE tags.
> + * @out_tags: u8 array to store the tag of every range.
> + * @out_sizes: u16 array to store the size of every range.
> + * @out_len: length of @out_tags and @out_sizes (output parameter, initially
> + *           equal to lengths of out_tags[] and out_sizes[]).
> + */
> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len);
> +
> +/*
> + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> + * @r_tags: u8[256] containing the tag of every range.
> + * @r_sizes: u16[256] containing the size of every range.
> + * @r_len: length of @r_tags and @r_sizes.
> + * @tags: 128-byte array to write the tags to.
> + */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
> +
> +#endif // __ASM_MTECOMP_H
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d0..46778f6dd83c2 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
>  obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
>  obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
>  obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
> +obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
>  KASAN_SANITIZE_physaddr.o	+= n
>  
>  obj-$(CONFIG_KASAN)		+= kasan_init.o
> diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
> new file mode 100644
> index 0000000000000..50a379c035aee
> --- /dev/null
> +++ b/arch/arm64/mm/mtecomp.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * MTE tag compression algorithm.
> + * Proposed by Evgenii Stepanov <eugenis@google.com>
> + */
> +
> +/*
> + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> + * compression algorithms.

This is the 4th time I see mr. Stepanov's credentials in the patch.
I've no doubts he's a worthy gentleman but please avoid mentioning
people in source code. Suggested-by is enough. IIRC, the rule for
that exists for about decade.

For the purpose of namespacing, the mte_compress/mte_decompress would
sound better.

> + *
> + * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
> + * array of tags into a smaller byte sequence that can be stored in a
> + * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
> + * an 8-byte pointer.
> + *
> + * We encapsulate tag storage memory management in this module, because it is
> + * tightly coupled with the pointer representation.
> + *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
> + *     that can be stored in Xarray
> + *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
> + *     the provided 128-byte buffer.
> + *
> + * The compression algorithm works as follows.
> + *
> + * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
> + *    @r_tags containing tag values and @r_sizes containing range lengths) by
> + *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
> + *
> + * 2. Depending on the number N of ranges, the following storage class is picked:
> + *            N <= 6:  8 bytes (inline case, no allocation required);
> + *       6 < N <= 11: 16 bytes
> + *      11 < N <= 23: 32 bytes
> + *      23 < N <= 46: 64 bytes
> + *      46 < N:       128 bytes (no compression will be performed)
> + *
> + * 3. The number of the largest element of @r_sizes is stored in @largest_idx.
> + *    The element itself is thrown away from @r_sizes, because it can be
> + *    reconstructed from the sum of the remaining elements. Note that now none
> + *    of the remaining @r_sizes elements is greater than 127.
> + *
> + * 4. For the inline case, the following values are stored in the 8-byte handle:
> + *       largest_idx : i4
> + *      r_tags[0..5] : i4 x 6
> + *     r_sizes[0..4] : i7 x 5
> + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> + *
> + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> + *    distinguished from a kernel pointer.

I honestly tried to understand... For example, what the
        'r_sizes[0..4] : i7 x 5'
means? An array of 5 elements, 17 bits each? But it's alone greater
than size of pointer... Oh gosh...

Moreover, MTE tags are all 4-bits.

It seems like text without pictures is above my mental abilities. Can you
please illustrate it? For example, from the #4, I (hopefully correctly)
realized that:

Inline frame format:
   0                                                    60       62  63
 +---------------------------------------------------------------------+
 |              No idea what happens here             |    Lidx    | X |
 +---------------------------------------------------------------------+
 63    : X      :  RAZ : Reserved for Xarray.
 60-62 : Lidx   : 0..5 : Largest element index. 
                     6 : Reserved
                     7 : Invalid handler

> + *
> + * 5. For the out-of-line case, the storage is allocated from one of the
> + *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> + *    on 8 bytes, so its bits 2..0 can be used to store the size class:
> + *     - 0 for 128 bytes
> + *     - 1 for 16
> + *     - 2 for 32
> + *     - 4 for 64.
> + *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
> + *
> + * 6. The data layout in the allocated storage is as follows:
> + *         largest_idx : i6
> + *        r_tags[0..N] : i4 x N
> + *     r_sizes[0..N-1] : i7 x (N-1)
> + *
> + * The decompression algorithm performs the steps below.
> + *
> + * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
> + *    out-of line.
> + *
> + * 2. For the inline case, treat the handle itself as the input buffer.
> + *
> + * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
> + *    the input buffer length. To obtain the pointer to the input buffer, unset
> + *    bits 2..0 of the handle and set bit 63.
> + *
> + * 4. If the input buffer is 128 byte long, copy its contents to the output
> + *    buffer.
> + *
> + * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
> + *    Calculate the removed largest element of @r_sizes:
> + *      largest = 256 - sum(r_sizes)
> + *    and insert it into @r_sizes at position @largest_idx.
> + *
> + * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
> + *    @r_sizes[i] times.
> + */
> +

Because of the size, I believe this comment is worth to put in Docs,
moreover we already have "Documentation/arch/arm64/memory-tagging-extension.rst"
Why don't you add an 'MTE Compression' section in there?

> +#include <linux/bits.h>
> +#include <linux/bitmap.h>
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/swab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

Nit: Alphabetic order?

Andy is right, bitmap.h includes bit.h, no need to include both. And if
your code will get broken one day, it's a bitmap maintainers' work to fix.

> +
> +#include <asm/mtecomp.h>
> +
> +/* The handle must fit into an Xarray value. */
> +#define HANDLE_MASK GENMASK_ULL(62, 0)
> +
> +/* Out-of-line handles have 0b111 in bits 62..60. */
> +#define NOINLINE_MASK GENMASK_ULL(62, 60)
> +
> +/* Cache index is stored in the lowest pointer bits. */
> +#define CACHE_ID_MASK GENMASK_ULL(2, 0)
> +
> +/*
> + * Four separate caches to store out-of-line data:
> + *  0: mte-tags-128
> + *  1: mte-tags-16
> + *  2: mte-tags-32
> + *  3: mte-tags-64
> + */
> +#define NUM_CACHES 4
> +static struct kmem_cache *mtecomp_caches[NUM_CACHES];
> +
> +/*
> + * Sizes of compressed values.
> + */
> +#define BITS_PER_TAG 4
> +#define BITS_PER_SIZE 7
> +#define BITS_PER_LARGEST_IDX_INLINE 4
> +#define BITS_PER_LARGEST_IDX 6

But in the comment you say that largest index in inline frame is 3-bits long.

> +
> +/* Translate allocation size into mtecomp_caches[] index. */
> +static int ea0_size_to_cache_id(int len)

Here and everywhere, do you need signed values? If not, unsigned int.

> +{
> +	if (len < 128)
> +		return fls(len >> 4);
> +	return 0;
> +}
> +
> +/* Translate mtecomp_caches[] index into allocation size. */
> +static int ea0_cache_id_to_size(int id)
> +{
> +	if (id == 0)
> +		return 128;
> +	return 8 << id;
> +}
> +
> +/* Transform tags into tag ranges. */
> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> +{
> +	u8 prev_tag = U8_MAX;
> +	int cur_idx = -1;
> +	u8 cur_tag;
> +	int i, j;
> +
> +	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
> +	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
> +
> +	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> +		for (j = 0; j < 2; j++) {
> +			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> +			if (cur_tag == prev_tag) {
> +				out_sizes[cur_idx]++;
> +			} else {
> +				cur_idx++;
> +				prev_tag = cur_tag;
> +				out_tags[cur_idx] = prev_tag;
> +				out_sizes[cur_idx] = 1;
> +			}
> +		}
> +	}
> +	*out_len = cur_idx + 1;
> +}
> +EXPORT_SYMBOL_NS(ea0_tags_to_ranges, MTECOMP);
> +
> +/* Transform tag ranges back into tags. */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> +{
> +	int i, j, pos = 0;
> +	u8 prev;
> +
> +	for (i = 0; i < r_len; i++) {
> +		for (j = 0; j < r_sizes[i]; j++) {
> +			if (pos % 2)
> +				tags[pos / 2] = (prev << 4) | r_tags[i];
> +			else
> +				prev = r_tags[i];
> +			pos++;
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);

Because I didn't understand the compressed frame format, not sure I
can understand this logic...

> +
> +/* Translate @num_ranges into the allocation size needed to hold them. */
> +static int ea0_alloc_size(int num_ranges)
> +{
> +	if (num_ranges <= 6)
> +		return 8;
> +	if (num_ranges <= 11)
> +		return 16;
> +	if (num_ranges <= 23)
> +		return 32;
> +	if (num_ranges <= 46)
> +		return 64;
> +	return 128;
> +}
> +
> +/* Translate allocation size into maximum number of ranges that it can hold. */
> +static int ea0_size_to_ranges(int size)
> +{
> +	switch (size) {
> +	case 8:
> +		return 6;
> +	case 16:
> +		return 11;
> +	case 32:
> +		return 23;
> +	case 64:
> +		return 46;
> +	default:
> +		return 0;
> +	}
> +}

I wonder if there's a math formula here? Can you explain where from
those numbers come?

> +#define RANGES_INLINE ea0_size_to_ranges(8)

Maybe

 #define RANGES_INLINE (6)

> +
> +/* Is the data stored inline in the handle itself? */
> +static bool ea0_is_inline(unsigned long handle)
> +{
> +	return (handle & NOINLINE_MASK) != NOINLINE_MASK;
> +}
> +
> +/* Get the size of the buffer backing @handle. */
> +int ea0_storage_size(unsigned long handle)
> +{
> +	if (ea0_is_inline(handle))
> +		return 8;
> +	return ea0_cache_id_to_size(handle & CACHE_ID_MASK);
> +}
> +EXPORT_SYMBOL_NS(ea0_storage_size, MTECOMP);
> +
> +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> +			 unsigned long *pos, unsigned long bits)

Please don't steal prefixes. But the idea is good. For the next
iteration, let's rename bitmap_set_value() to bitmap_write()?
So that your function will be an mte_bitmap_write().

Thanks,
Yury

> +{
> +	bitmap_set_value(bitmap, value, *pos, bits);
> +	*pos += bits;
> +}
> +
> +/* Compress ranges into the buffer that can accommodate up to max_ranges. */
> +static void ea0_compress_to_buf(int len, u8 *tags, short *sizes,
> +				unsigned long *bitmap, int max_ranges)
> +{
> +	unsigned long bit_pos = 0, l_bits;
> +	int largest_idx = -1, i;
> +	short largest = 0;
> +
> +	for (i = 0; i < len; i++) {
> +		if (sizes[i] > largest) {
> +			largest = sizes[i];
> +			largest_idx = i;
> +		}
> +	}
> +	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> +						 BITS_PER_LARGEST_IDX;
> +	bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
> +	for (i = 0; i < len; i++)
> +		bitmap_write(bitmap, tags[i], &bit_pos, BITS_PER_TAG);
> +	for (i = len; i < max_ranges; i++)
> +		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_TAG);
> +	for (i = 0; i < len; i++) {
> +		if (i == largest_idx)
> +			continue;
> +		bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
> +	}
> +	for (i = len; i < max_ranges; i++)
> +		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_SIZE);
> +}
> +
> +/* Compress @tags and return a handle. */
> +unsigned long ea0_compress(u8 *tags)
> +{
> +	int alloc_size, cache_id;
> +	struct kmem_cache *cache;
> +	short r_sizes[256];
> +	u8 r_tags[256];
> +	int r_len = ARRAY_SIZE(r_tags);
> +	unsigned long *storage;
> +	/*
> +	 * ea0_compress_to_buf() only initializes the bits that ea0_decompress()
> +	 * will read. But when the tags are stored in the handle itself, it must
> +	 * have all its bits initialized.
> +	 */
> +	unsigned long result = 0;
> +
> +	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
> +	alloc_size = ea0_alloc_size(r_len);
> +	if (alloc_size == 8) {
> +		ea0_compress_to_buf(r_len, r_tags, r_sizes, &result,
> +				    RANGES_INLINE);
> +		return result;
> +	}
> +	cache_id = ea0_size_to_cache_id(alloc_size);
> +	cache = mtecomp_caches[cache_id];
> +	storage = kmem_cache_alloc(cache, GFP_KERNEL);
> +	if (alloc_size < 128) {
> +		/* alloc_size is always a multiple of sizeof(unsigned long). */
> +		ea0_compress_to_buf(r_len, r_tags, r_sizes, storage,
> +				    ea0_size_to_ranges(alloc_size));
> +		return ((unsigned long)storage | cache_id) & HANDLE_MASK;
> +	}
> +	memcpy(storage, tags, alloc_size);
> +	return (unsigned long)storage & HANDLE_MASK;
> +}
> +EXPORT_SYMBOL_NS(ea0_compress, MTECOMP);
> +
> +static unsigned long bitmap_read(const unsigned long *bitmap,
> +				 unsigned long *pos, unsigned long bits)
> +{
> +	unsigned long result;
> +
> +	result = bitmap_get_value(bitmap, *pos, bits);
> +	*pos += bits;
> +	return result;
> +}
> +
> +/* Decompress the contents of the given buffer into @tags. */
> +static bool ea0_decompress_from_buf(const unsigned long *bitmap, int max_ranges,
> +				    u8 *tags)
> +{
> +	int largest_idx, i;
> +	short r_sizes[46], sum = 0;
> +	u8 r_tags[46];
> +	unsigned long bit_pos = 0, l_bits;
> +
> +	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> +						 BITS_PER_LARGEST_IDX;
> +	largest_idx = bitmap_read(bitmap, &bit_pos, l_bits);
> +	for (i = 0; i < max_ranges; i++)
> +		r_tags[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_TAG);
> +	for (i = 0; i < max_ranges; i++) {
> +		if (i == largest_idx)
> +			continue;
> +		r_sizes[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_SIZE);
> +		if (!r_sizes[i]) {
> +			max_ranges = i;
> +			break;
> +		}
> +		sum += r_sizes[i];
> +	}
> +	if (sum >= 256)
> +		return false;
> +	r_sizes[largest_idx] = 256 - sum;
> +	ea0_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
> +	return true;
> +}
> +
> +/* Get pointer to the out-of-line storage from a handle. */
> +static void *ea0_storage(unsigned long handle)
> +{
> +	if (ea0_is_inline(handle))
> +		return NULL;
> +	return (void *)((handle & (~CACHE_ID_MASK)) | BIT_ULL(63));
> +}
> +
> +/* Decompress tags from the buffer referenced by @handle. */
> +bool ea0_decompress(unsigned long handle, u8 *tags)
> +{
> +	unsigned long *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +
> +	if (size == 128) {
> +		memcpy(tags, storage, size);
> +		return true;
> +	}
> +	if (size == 8)
> +		return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
> +	return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
> +}
> +EXPORT_SYMBOL_NS(ea0_decompress, MTECOMP);
> +
> +/* Release the memory referenced by @handle. */
> +void ea0_release_handle(unsigned long handle)
> +{
> +	void *storage = ea0_storage(handle);
> +	int size = ea0_storage_size(handle);
> +	struct kmem_cache *c;
> +
> +	if (!storage)
> +		return;
> +
> +	c = mtecomp_caches[ea0_size_to_cache_id(size)];
> +	kmem_cache_free(c, storage);
> +}
> +EXPORT_SYMBOL_NS(ea0_release_handle, MTECOMP);
> +
> +/* Set up mtecomp_caches[]. */
> +static int mtecomp_init(void)
> +{
> +	char name[16];
> +	int size;
> +	int i;
> +
> +	BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
> +	for (i = 0; i < NUM_CACHES; i++) {
> +		size = ea0_cache_id_to_size(i);
> +		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
> +		mtecomp_caches[i] =
> +			kmem_cache_create(name, size, size, 0, NULL);
> +	}
> +	return 0;
> +}
> +module_init(mtecomp_init);
> -- 
> 2.41.0.255.g8b1d071c50-goog
  
Alexander Potapenko July 19, 2023, 12:16 p.m. UTC | #5
On Tue, Jul 18, 2023 at 7:18 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:

> > However it doesn't seem to be very picky.
> >
> >   $ scripts/kernel-doc -v  -none arch/arm64/include/asm/mtecomp.h
> >
> > warns about e.g. parameter name mismatch, but does not care about the
> > missing return section.
>
> -Wreturn is missing

Nice, adding it (or -Wall) uncovered more problems.

> ...
>
> > > Also
> > > why you put the descriptions in to the header file? It's a bit unusual for the
> > > exported ones.
> >
> > https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
> > this, and I thought anyone wanting to understand how an interface
> > works would prefer reading the header rather than the implementation.
> > I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
> > better place for them.
>
> With the kernel doc in the C file you may also comment the internal ones and
> generate documentation only for exported ones. This is not possible with h.

I see.
After some hesitation and looking at the existing practices in the
kernel, I moved the kernel doc comments to mtecomp.c



> > but all those values are >= 0
> > and <= 256, so there should be no problems.
> > (shorts could have been ints as well, we are just saving some stack
> > space in ea0_compress()/ea0_decompress()).
>
> Then why they are signed? Please, justify that.
> Signdness prone to subtle and hard to hunt errors due to integer promotions.

Switched to unsigned shorts.

> ...
>
> > > > +#include <linux/bits.h>
> > > > +#include <linux/bitmap.h>
> > >
> > > bitmap guarantees that bits.h will be included.
> >
> > I am following the IWYU principle here, and I believe it's a good thing to do.
> > I've seen cases where these transitive dependencies rotted after some
> > refactoring, but the fact was only noticed in certain configurations.
> > Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
> > by optimizing headers
> > (https://lwn.net/ml/linux-kernel/YdIfz+LMewetSaEB@gmail.com/), and it
> > relies on explicit dependencies which are easier to untangle.
>
> Yes, but we have some guarantees. E.g., we don't include compiler*.h
> when types.h is included, because of the guarantees.
Ok, I removed bits.h

> Otherwise your code misses _a lot_ of headers.

... and tried to expand the list of headers where applicable.

>
> ...
>
> > > Can you make it unsigned and start from 0?
> >
> > Changed to start with 0, but I am a bit hesitant about making it
> > unsigned: it is now no more special than a loop variable.
>
> Signdness is a beast in C, needs always an additional justification.

Changed all ints to unsigned, because, well, why not :)


> > Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
> > The intuitive answer is "no",
>
> Correct.
>
> > but then there should be some difference between those and RANGES_INLINE?
>
> Yes, one is widely used constant and one is a _localized_ helper.

Ack

>
> ...
>
> > > > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > > > +                      unsigned long *pos, unsigned long bits)
> > >
> > > Please, don't use reserved namespace. Yours is ea0, use it:
> > > ea0_bitmap_write()! Same to other similarly named functions.
> >
> > Done.
> > However there are two parallel namespaces now: "ea0" for the
> > compression algorithm, and "memcomp" for the module initialization and
> > data structures.
> > Dunno if it makes sense to merge them (and rename the .c file accordingly).
>
> Your choice. Mu point, just do prefix it with something meaningful.
Fine, I'll go with "ea0" for the algorithm and "memcomp" for the
module-specific stuff.
Let's also see what ARM maintainers say.

> ...
>
> > > > +     u8 r_tags[256];
> > > > +     int r_len = ARRAY_SIZE(r_tags);
> > >
> > No, it is the length of the arrays (both r_tags and r_sizes).
> > Below you make a good point it will spare us a kernel.h dependency,
> > but for the sake of keeping the code error-prone we probably shouldn't
> > assume r_tags is a byte array.
>
> It's a common practice even outside of Linux kernel to use sizeof() against
> char arrays. I don't see the point to have the ARRAY_SIZE() complication here.

Fixed.

> ...
>
> > > > +             snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
>
> You see, if you grep for similar calls I'm pretty sure the order of 2 of power
> of 10 will be the difference between sizeof()/ARRAY_SIZE() if the latter even
> occurs at least once.

You are right, it is 3700+ vs. 66 :)
Changed ARRAY_SIZE() to sizeof() here as well.
  
Alexander Potapenko July 19, 2023, 2 p.m. UTC | #6
On Wed, Jul 19, 2023 at 8:09 AM Yury Norov <yury.norov@gmail.com> wrote:
...
> > +
> > +/* Functions below are exported for testing purposes. */
>
> Then declare them in a separate local header or simply in the test, but
> please not in a public header.

Fair enough, moved them to the test.

> > +
> > +/*
> > + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> > + * compression algorithms.
>
> This is the 4th time I see mr. Stepanov's credentials in the patch.
> I've no doubts he's a worthy gentleman but please avoid mentioning
> people in source code. Suggested-by is enough. IIRC, the rule for
> that exists for about decade.
>
> For the purpose of namespacing, the mte_compress/mte_decompress would
> sound better.

This indeed makes sense; "EA0" is not widely recognizable, and we are
unlikely to have more than one MTE compression algorithm anyway.
I changed "ea0" to "mte" in the patch series.



> > + *
> > + * 4. For the inline case, the following values are stored in the 8-byte handle:
> > + *       largest_idx : i4
> > + *      r_tags[0..5] : i4 x 6
> > + *     r_sizes[0..4] : i7 x 5
> > + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> > + *
> > + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> > + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> > + *    distinguished from a kernel pointer.
>
> I honestly tried to understand... For example, what the
>         'r_sizes[0..4] : i7 x 5'
> means? An array of 5 elements, 17 bits each? But it's alone greater
> than size of pointer... Oh gosh...

iN (note that it is a small i, not a 1) is LLVM notation for an N-bit
integer type.
There's no such thing in C syntax, and describing the data layout
using bitfields would be painful.
Would it help if I add a comment explaining that iN is an N-bit
integer? Or do you prefer something like

  r_sizes[0..4] : 5 x 7 bits

?

>
> Moreover, MTE tags are all 4-bits.
>
> It seems like text without pictures is above my mental abilities. Can you
> please illustrate it? For example, from the #4, I (hopefully correctly)
> realized that:
>
> Inline frame format:
>    0                                                    60       62  63
>  +---------------------------------------------------------------------+
I think it's more natural to number bits from 63 to 0

>  |              No idea what happens here             |    Lidx    | X |
>  +---------------------------------------------------------------------+
>  63    : X      :  RAZ : Reserved for Xarray.
>  60-62 : Lidx   : 0..5 : Largest element index.

There's some mismatch now between this picture, where Lidx is i3, and
the implementation that treats it as an i4, merging bit 63 into it.
Maybe we can avoid this by not splitting off bit 63?

>                      6 : Reserved
>                      7 : Invalid handler

No, 7 means "treat this handle as an out-of-line one". It is still
valid, but instead of tag data it contains a pointer.

>
> Because of the size, I believe this comment is worth to put in Docs,
> moreover we already have "Documentation/arch/arm64/memory-tagging-extension.rst"
> Why don't you add an 'MTE Compression' section in there?

Good idea, will do!

>
> > +#include <linux/bits.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/gfp.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/swab.h>
> > +#include <linux/string.h>
> > +#include <linux/types.h>
>
> Nit: Alphabetic order?
Ack.

>
> Andy is right, bitmap.h includes bit.h, no need to include both. And if
> your code will get broken one day, it's a bitmap maintainers' work to fix.
Ack.


> > +
> > +/*
> > + * Sizes of compressed values.
> > + */
> > +#define BITS_PER_TAG 4
> > +#define BITS_PER_SIZE 7
> > +#define BITS_PER_LARGEST_IDX_INLINE 4
> > +#define BITS_PER_LARGEST_IDX 6
>
> But in the comment you say that largest index in inline frame is 3-bits long.

The comment says "i4" (see my comments above)

> > +
> > +/* Translate allocation size into mtecomp_caches[] index. */
> > +static int ea0_size_to_cache_id(int len)
>
> Here and everywhere, do you need signed values? If not, unsigned int.

Done as part of fixing Andy's comments

> > +
> > +/* Transform tag ranges back into tags. */
> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > +{
> > +     int i, j, pos = 0;
> > +     u8 prev;
> > +
> > +     for (i = 0; i < r_len; i++) {
> > +             for (j = 0; j < r_sizes[i]; j++) {
> > +                     if (pos % 2)
> > +                             tags[pos / 2] = (prev << 4) | r_tags[i];
> > +                     else
> > +                             prev = r_tags[i];
> > +                     pos++;
> > +             }
> > +     }
> > +}
> > +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
>
> Because I didn't understand the compressed frame format, not sure I
> can understand this logic...
Hope the above comments will help, if not - please let me know.

> > +
> > +/* Translate @num_ranges into the allocation size needed to hold them. */
> > +static int ea0_alloc_size(int num_ranges)
> > +{
> > +     if (num_ranges <= 6)
> > +             return 8;
> > +     if (num_ranges <= 11)
> > +             return 16;
> > +     if (num_ranges <= 23)
> > +             return 32;
> > +     if (num_ranges <= 46)
> > +             return 64;
> > +     return 128;
> > +}
> > +
> > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > +static int ea0_size_to_ranges(int size)
> > +{
> > +     switch (size) {
> > +     case 8:
> > +             return 6;
> > +     case 16:
> > +             return 11;
> > +     case 32:
> > +             return 23;
> > +     case 64:
> > +             return 46;
> > +     default:
> > +             return 0;
> > +     }
> > +}
>
> I wonder if there's a math formula here? Can you explain where from
> those numbers come?

I'll add a comment there.
Basically, for the inline case it is the biggest N such as 4 + 4*N +
7*(N-1) <= 63
(not 64, because Xarray steals one bit from us)

For the out-of-line case it is the biggest N such as 6+4*N + 7*(N-1)
<= array size in bits (128, 256, or 512).

>
> > +#define RANGES_INLINE ea0_size_to_ranges(8)
>
> Maybe
>
>  #define RANGES_INLINE (6)

Fair enough.


> > +
> > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > +                      unsigned long *pos, unsigned long bits)
>
> Please don't steal prefixes. But the idea is good. For the next
> iteration, let's rename bitmap_set_value() to bitmap_write()?
> So that your function will be an mte_bitmap_write().

I don't object, but it diverges from bitmap_set_value8() now.
Will do.
  
Evgenii Stepanov July 19, 2023, 8:32 p.m. UTC | #7
On Tue, Jul 18, 2023 at 11:09 PM Yury Norov <yury.norov@gmail.com> wrote:
> This is the 4th time I see mr. Stepanov's credentials in the patch.
> I've no doubts he's a worthy gentleman but please avoid mentioning
> people in source code. Suggested-by is enough. IIRC, the rule for
> that exists for about decade.

Thank you for your kind words :)
Indeed, Suggested-by is more than enough.
  
Yury Norov July 19, 2023, 9:06 p.m. UTC | #8
On Wed, Jul 19, 2023 at 04:00:14PM +0200, Alexander Potapenko wrote:
> > > + * 4. For the inline case, the following values are stored in the 8-byte handle:
> > > + *       largest_idx : i4
> > > + *      r_tags[0..5] : i4 x 6
> > > + *     r_sizes[0..4] : i7 x 5
> > > + *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> > > + *
> > > + *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> > > + *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> > > + *    distinguished from a kernel pointer.
> >
> > I honestly tried to understand... For example, what the
> >         'r_sizes[0..4] : i7 x 5'
> > means? An array of 5 elements, 17 bits each? But it's alone greater
> > than size of pointer... Oh gosh...
> 
> iN (note that it is a small i, not a 1) is LLVM notation for an N-bit
> integer type.
> There's no such thing in C syntax, and describing the data layout
> using bitfields would be painful.
> Would it help if I add a comment explaining that iN is an N-bit
> integer? Or do you prefer something like
> 
>   r_sizes[0..4] : 5 x 7 bits
> 
> ?

Yes, that would help.
 
> >
> > Moreover, MTE tags are all 4-bits.
> >
> > It seems like text without pictures is above my mental abilities. Can you
> > please illustrate it? For example, from the #4, I (hopefully correctly)
> > realized that:
> >
> > Inline frame format:
> >    0                                                    60       62  63
> >  +---------------------------------------------------------------------+
> I think it's more natural to number bits from 63 to 0
> 
> >  |              No idea what happens here             |    Lidx    | X |
> >  +---------------------------------------------------------------------+
> >  63    : X      :  RAZ : Reserved for Xarray.
> >  60-62 : Lidx   : 0..5 : Largest element index.
> 
> There's some mismatch now between this picture, where Lidx is i3, and
> the implementation that treats it as an i4, merging bit 63 into it.
> Maybe we can avoid this by not splitting off bit 63?
> 
> >                      6 : Reserved
> >                      7 : Invalid handler
> 
> No, 7 means "treat this handle as an out-of-line one". It is still
> valid, but instead of tag data it contains a pointer.

So, it's invalid combination for _inline_ handler, right? Anyways, I'm
waiting for an updated docs, so it will hopefully bring some light.

> > > +
> > > +/* Transform tag ranges back into tags. */
> > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > > +{
> > > +     int i, j, pos = 0;
> > > +     u8 prev;
> > > +
> > > +     for (i = 0; i < r_len; i++) {
> > > +             for (j = 0; j < r_sizes[i]; j++) {
> > > +                     if (pos % 2)
> > > +                             tags[pos / 2] = (prev << 4) | r_tags[i];
> > > +                     else
> > > +                             prev = r_tags[i];
> > > +                     pos++;

This code flushes tags at every 2nd iteration. Is that true that
there's always an even number of iterations, i.e. rsizes is always
even, assuming r_len can be 1?

If not, it's possible to loose a tag, consider rlen == 1 and
rsizes[0] == 1.

If yes, you can simplify:

     for (i = 0; i < r_len; i++)
             for (j = 0; j < r_sizes[i]; j++)
                     tags[pos++] = (r_tags[i] << 4) | r_tags[i];

Anyways, in the test can you run all possible combinations?

> > > +             }
> > > +     }
> > > +}
> > > +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
> >
> > Because I didn't understand the compressed frame format, not sure I
> > can understand this logic...
> Hope the above comments will help, if not - please let me know.
> 
> > > +
> > > +/* Translate @num_ranges into the allocation size needed to hold them. */
> > > +static int ea0_alloc_size(int num_ranges)
> > > +{
> > > +     if (num_ranges <= 6)
> > > +             return 8;
> > > +     if (num_ranges <= 11)
> > > +             return 16;
> > > +     if (num_ranges <= 23)
> > > +             return 32;
> > > +     if (num_ranges <= 46)
> > > +             return 64;
> > > +     return 128;
> > > +}
> > > +
> > > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > > +static int ea0_size_to_ranges(int size)
> > > +{
> > > +     switch (size) {
> > > +     case 8:
> > > +             return 6;
> > > +     case 16:
> > > +             return 11;
> > > +     case 32:
> > > +             return 23;
> > > +     case 64:
> > > +             return 46;
> > > +     default:
> > > +             return 0;
> > > +     }
> > > +}
> >
> > I wonder if there's a math formula here? Can you explain where from
> > those numbers come?
> 
> I'll add a comment there.
> Basically, for the inline case it is the biggest N such as 4 + 4*N +
> 7*(N-1) <= 63
>
> (not 64, because Xarray steals one bit from us)
> 
> For the out-of-line case it is the biggest N such as 6+4*N + 7*(N-1)
> <= array size in bits (128, 256, or 512).

So it's: N <= (BIT_CAPACITY + NUM4 - NUM7) / (TAGSZ + SZ)

Doesn't look like a rocket science...

Why don't you code the formula instead of results? Are there any
difficulties? Things may change. For example, in next spec they
may bring 3- or 5-bit tags, and your compressor will crack loudly
with hardcoded numbers.
  
Alexander Potapenko July 20, 2023, noon UTC | #9
On Wed, Jul 19, 2023 at 11:06 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> > Would it help if I add a comment explaining that iN is an N-bit
> > integer? Or do you prefer something like
> >
> >   r_sizes[0..4] : 5 x 7 bits
> >
> > ?
>
> Yes, that would help.
Ok.


> > >                      7 : Invalid handler
> >
> > No, 7 means "treat this handle as an out-of-line one". It is still
> > valid, but instead of tag data it contains a pointer.
>
> So, it's invalid combination for _inline_ handler, right?
Yes, that's right.

> Anyways, I'm
> waiting for an updated docs, so it will hopefully bring some light.
It's on the way.

> > > > +
> > > > +/* Transform tag ranges back into tags. */
> > > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > > > +{
> > > > +     int i, j, pos = 0;
> > > > +     u8 prev;
> > > > +
> > > > +     for (i = 0; i < r_len; i++) {
> > > > +             for (j = 0; j < r_sizes[i]; j++) {
> > > > +                     if (pos % 2)
> > > > +                             tags[pos / 2] = (prev << 4) | r_tags[i];
> > > > +                     else
> > > > +                             prev = r_tags[i];
> > > > +                     pos++;
>
> This code flushes tags at every 2nd iteration. Is that true that
> there's always an even number of iterations, i.e. rsizes is always
> even, assuming r_len can be 1?
>
> If not, it's possible to loose a tag, consider rlen == 1 and
> rsizes[0] == 1.

By design, ea0_ranges_to_tags() only accepts ranges that sum up to 256
(as produced by ea0_tags_to_ranges())
We could add a check for that, but given that it is an internal
helper, a comment should suffice.

But r_sizes[i] does not have to be even (otherwise we could've used
this fact for a more efficient compression).
For example, if the array of tags is {0xab, 0x0, 0x0, 0x0...}, then
r_len = 3, r_sizes[] = {1, 1, 254}, r_tags = {0x0a, 0x0b, 0x0}.

>
> If yes, you can simplify:
>
>      for (i = 0; i < r_len; i++)
>              for (j = 0; j < r_sizes[i]; j++)
>                      tags[pos++] = (r_tags[i] << 4) | r_tags[i];
>
> Anyways, in the test can you run all possible combinations?

I will extract code from compress_range_helper() to generate different
numbers of ranges and test against those.


> > > > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > > > +static int ea0_size_to_ranges(int size)
> > > > +{
> > > > +     switch (size) {
> > > > +     case 8:
> > > > +             return 6;
> > > > +     case 16:
> > > > +             return 11;
> > > > +     case 32:
> > > > +             return 23;
> > > > +     case 64:
> > > > +             return 46;
> > > > +     default:
> > > > +             return 0;
> > > > +     }
> > > > +}
> > >
> > > I wonder if there's a math formula here? Can you explain where from
> > > those numbers come?
> >
> > I'll add a comment there.
> > Basically, for the inline case it is the biggest N such as 4 + 4*N +
> > 7*(N-1) <= 63
> >
> > (not 64, because Xarray steals one bit from us)
> >
> > For the out-of-line case it is the biggest N such as 6+4*N + 7*(N-1)
> > <= array size in bits (128, 256, or 512).
>
> So it's: N <= (BIT_CAPACITY + NUM4 - NUM7) / (TAGSZ + SZ)
>
> Doesn't look like a rocket science...

Note that the size of largest_idx is picked based on the largest N for
64-byte buffers.
And the size of r_sizes[] depends on the number of tags that fit into
128 bytes: if we throw away the largest range, each of the remaining
ones will fit into 7 bits.
But this will not be the case for e.g. 3-bit tags (we'll need 8-bit sizes then).

Changing mte_size_to_ranges() to "(size * 8 + MTE_BITS_PER_SIZE -
largest_bits) / (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE)" indeed looks
better than hardcoded numbers, but we should keep in mind that the
parameters depend on each other and cannot be easily changed.

>
> Why don't you code the formula instead of results? Are there any
> difficulties? Things may change. For example, in next spec they
> may bring 3- or 5-bit tags, and your compressor will crack loudly
> with hardcoded numbers.

The main reason for the compressor to crack loudly would be the
hardcoded assumption of the tags consisting of 4 bits.
I don't think it is practical to account on that number to change (and
e.g. use bitmap_read/bitmap_write to convert tags into ranges) - we'd
rather have a static assert for that.
  

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..52cdc7603cf7c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,16 @@  config ARM64_EPAN
 	  if the cpu does not implement the feature.
 endmenu # "ARMv8.7 architectural features"
 
+config ARM64_MTE_COMP
+	bool "Tag compression for ARM64 MTE"
+	default y
+	depends on ARM64_MTE
+	help
+	  Enable tag compression support for ARM64 MTE.
+
+	  128-byte tag buffers corresponding to 4K pages can be compressed using
+	  the EA0 algorithm to save heap memory.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..0c444c0d4ac04
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,60 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+/*
+ * ea0_compress() - compress the given tag array.
+ * @tags: 128-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by @ea0_release_handle().
+ */
+unsigned long ea0_compress(u8 *tags);
+
+/*
+ * ea0_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by @ea0_decompress()
+ * @tags: 128-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ * Returns true on success, false on error.
+ */
+bool ea0_decompress(unsigned long handle, u8 *tags);
+
+/*
+ * ea0_release_handle() - release the handle returned by ea0_compress().
+ * @handle: handle returned by ea0_compress().
+ */
+void ea0_release_handle(unsigned long handle);
+
+/* Functions below are exported for testing purposes. */
+
+/*
+ * ea0_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by ea0_compress.
+ */
+int ea0_storage_size(unsigned long handle);
+
+/*
+ * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: 128-byte array containing 256 MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: u16 array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (output parameter, initially
+ *           equal to lengths of out_tags[] and out_sizes[]).
+ */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len);
+
+/*
+ * ea0_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[256] containing the tag of every range.
+ * @r_sizes: u16[256] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: 128-byte array to write the tags to.
+ */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@  obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd.o
 obj-$(CONFIG_TRANS_TABLE)	+= trans_pgd-asm.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 obj-$(CONFIG_ARM64_MTE)		+= mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP)	+= mtecomp.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..50a379c035aee
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,406 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * Proposed by Evgenii Stepanov <eugenis@google.com>
+ */
+
+/*
+ * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
+ * compression algorithms.
+ *
+ * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
+ * array of tags into a smaller byte sequence that can be stored in a
+ * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
+ * an 8-byte pointer.
+ *
+ * We encapsulate tag storage memory management in this module, because it is
+ * tightly coupled with the pointer representation.
+ *   ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
+ *     that can be stored in Xarray
+ *   ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
+ *     the provided 128-byte buffer.
+ *
+ * The compression algorithm works as follows.
+ *
+ * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
+ *    @r_tags containing tag values and @r_sizes containing range lengths) by
+ *    ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
+ *
+ * 2. Depending on the number N of ranges, the following storage class is picked:
+ *            N <= 6:  8 bytes (inline case, no allocation required);
+ *       6 < N <= 11: 16 bytes
+ *      11 < N <= 23: 32 bytes
+ *      23 < N <= 46: 64 bytes
+ *      46 < N:       128 bytes (no compression will be performed)
+ *
+ * 3. The number of the largest element of @r_sizes is stored in @largest_idx.
+ *    The element itself is thrown away from @r_sizes, because it can be
+ *    reconstructed from the sum of the remaining elements. Note that now none
+ *    of the remaining @r_sizes elements is greater than 127.
+ *
+ * 4. For the inline case, the following values are stored in the 8-byte handle:
+ *       largest_idx : i4
+ *      r_tags[0..5] : i4 x 6
+ *     r_sizes[0..4] : i7 x 5
+ *    (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
+ *
+ *    Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
+ *    be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
+ *    distinguished from a kernel pointer.
+ *
+ * 5. For the out-of-line case, the storage is allocated from one of the
+ *    "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
+ *    on 8 bytes, so its bits 2..0 can be used to store the size class:
+ *     - 0 for 128 bytes
+ *     - 1 for 16
+ *     - 2 for 32
+ *     - 4 for 64.
+ *    Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
+ *
+ * 6. The data layout in the allocated storage is as follows:
+ *         largest_idx : i6
+ *        r_tags[0..N] : i4 x N
+ *     r_sizes[0..N-1] : i7 x (N-1)
+ *
+ * The decompression algorithm performs the steps below.
+ *
+ * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
+ *    out-of line.
+ *
+ * 2. For the inline case, treat the handle itself as the input buffer.
+ *
+ * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
+ *    the input buffer length. To obtain the pointer to the input buffer, unset
+ *    bits 2..0 of the handle and set bit 63.
+ *
+ * 4. If the input buffer is 128 byte long, copy its contents to the output
+ *    buffer.
+ *
+ * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
+ *    Calculate the removed largest element of @r_sizes:
+ *      largest = 256 - sum(r_sizes)
+ *    and insert it into @r_sizes at position @largest_idx.
+ *
+ * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
+ *    @r_sizes[i] times.
+ */
+
+#include <linux/bits.h>
+#include <linux/bitmap.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/swab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* The handle must fit into an Xarray value. */
+#define HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/*
+ * Four separate caches to store out-of-line data:
+ *  0: mte-tags-128
+ *  1: mte-tags-16
+ *  2: mte-tags-32
+ *  3: mte-tags-64
+ */
+#define NUM_CACHES 4
+static struct kmem_cache *mtecomp_caches[NUM_CACHES];
+
+/*
+ * Sizes of compressed values.
+ */
+#define BITS_PER_TAG 4
+#define BITS_PER_SIZE 7
+#define BITS_PER_LARGEST_IDX_INLINE 4
+#define BITS_PER_LARGEST_IDX 6
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static int ea0_size_to_cache_id(int len)
+{
+	if (len < 128)
+		return fls(len >> 4);
+	return 0;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static int ea0_cache_id_to_size(int id)
+{
+	if (id == 0)
+		return 128;
+	return 8 << id;
+}
+
+/* Transform tags into tag ranges. */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
+{
+	u8 prev_tag = U8_MAX;
+	int cur_idx = -1;
+	u8 cur_tag;
+	int i, j;
+
+	memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+	memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+	for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
+		for (j = 0; j < 2; j++) {
+			cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
+			if (cur_tag == prev_tag) {
+				out_sizes[cur_idx]++;
+			} else {
+				cur_idx++;
+				prev_tag = cur_tag;
+				out_tags[cur_idx] = prev_tag;
+				out_sizes[cur_idx] = 1;
+			}
+		}
+	}
+	*out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(ea0_tags_to_ranges, MTECOMP);
+
+/* Transform tag ranges back into tags. */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
+{
+	int i, j, pos = 0;
+	u8 prev;
+
+	for (i = 0; i < r_len; i++) {
+		for (j = 0; j < r_sizes[i]; j++) {
+			if (pos % 2)
+				tags[pos / 2] = (prev << 4) | r_tags[i];
+			else
+				prev = r_tags[i];
+			pos++;
+		}
+	}
+}
+EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static int ea0_alloc_size(int num_ranges)
+{
+	if (num_ranges <= 6)
+		return 8;
+	if (num_ranges <= 11)
+		return 16;
+	if (num_ranges <= 23)
+		return 32;
+	if (num_ranges <= 46)
+		return 64;
+	return 128;
+}
+
+/* Translate allocation size into maximum number of ranges that it can hold. */
+static int ea0_size_to_ranges(int size)
+{
+	switch (size) {
+	case 8:
+		return 6;
+	case 16:
+		return 11;
+	case 32:
+		return 23;
+	case 64:
+		return 46;
+	default:
+		return 0;
+	}
+}
+#define RANGES_INLINE ea0_size_to_ranges(8)
+
+/* Is the data stored inline in the handle itself? */
+static bool ea0_is_inline(unsigned long handle)
+{
+	return (handle & NOINLINE_MASK) != NOINLINE_MASK;
+}
+
+/* Get the size of the buffer backing @handle. */
+int ea0_storage_size(unsigned long handle)
+{
+	if (ea0_is_inline(handle))
+		return 8;
+	return ea0_cache_id_to_size(handle & CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(ea0_storage_size, MTECOMP);
+
+static void bitmap_write(unsigned long *bitmap, unsigned long value,
+			 unsigned long *pos, unsigned long bits)
+{
+	bitmap_set_value(bitmap, value, *pos, bits);
+	*pos += bits;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void ea0_compress_to_buf(int len, u8 *tags, short *sizes,
+				unsigned long *bitmap, int max_ranges)
+{
+	unsigned long bit_pos = 0, l_bits;
+	int largest_idx = -1, i;
+	short largest = 0;
+
+	for (i = 0; i < len; i++) {
+		if (sizes[i] > largest) {
+			largest = sizes[i];
+			largest_idx = i;
+		}
+	}
+	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
+						 BITS_PER_LARGEST_IDX;
+	bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+	for (i = 0; i < len; i++)
+		bitmap_write(bitmap, tags[i], &bit_pos, BITS_PER_TAG);
+	for (i = len; i < max_ranges; i++)
+		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_TAG);
+	for (i = 0; i < len; i++) {
+		if (i == largest_idx)
+			continue;
+		bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
+	}
+	for (i = len; i < max_ranges; i++)
+		bitmap_write(bitmap, 0, &bit_pos, BITS_PER_SIZE);
+}
+
+/* Compress @tags and return a handle. */
+unsigned long ea0_compress(u8 *tags)
+{
+	int alloc_size, cache_id;
+	struct kmem_cache *cache;
+	short r_sizes[256];
+	u8 r_tags[256];
+	int r_len = ARRAY_SIZE(r_tags);
+	unsigned long *storage;
+	/*
+	 * ea0_compress_to_buf() only initializes the bits that ea0_decompress()
+	 * will read. But when the tags are stored in the handle itself, it must
+	 * have all its bits initialized.
+	 */
+	unsigned long result = 0;
+
+	ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+	alloc_size = ea0_alloc_size(r_len);
+	if (alloc_size == 8) {
+		ea0_compress_to_buf(r_len, r_tags, r_sizes, &result,
+				    RANGES_INLINE);
+		return result;
+	}
+	cache_id = ea0_size_to_cache_id(alloc_size);
+	cache = mtecomp_caches[cache_id];
+	storage = kmem_cache_alloc(cache, GFP_KERNEL);
+	if (alloc_size < 128) {
+		/* alloc_size is always a multiple of sizeof(unsigned long). */
+		ea0_compress_to_buf(r_len, r_tags, r_sizes, storage,
+				    ea0_size_to_ranges(alloc_size));
+		return ((unsigned long)storage | cache_id) & HANDLE_MASK;
+	}
+	memcpy(storage, tags, alloc_size);
+	return (unsigned long)storage & HANDLE_MASK;
+}
+EXPORT_SYMBOL_NS(ea0_compress, MTECOMP);
+
+static unsigned long bitmap_read(const unsigned long *bitmap,
+				 unsigned long *pos, unsigned long bits)
+{
+	unsigned long result;
+
+	result = bitmap_get_value(bitmap, *pos, bits);
+	*pos += bits;
+	return result;
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool ea0_decompress_from_buf(const unsigned long *bitmap, int max_ranges,
+				    u8 *tags)
+{
+	int largest_idx, i;
+	short r_sizes[46], sum = 0;
+	u8 r_tags[46];
+	unsigned long bit_pos = 0, l_bits;
+
+	l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
+						 BITS_PER_LARGEST_IDX;
+	largest_idx = bitmap_read(bitmap, &bit_pos, l_bits);
+	for (i = 0; i < max_ranges; i++)
+		r_tags[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_TAG);
+	for (i = 0; i < max_ranges; i++) {
+		if (i == largest_idx)
+			continue;
+		r_sizes[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_SIZE);
+		if (!r_sizes[i]) {
+			max_ranges = i;
+			break;
+		}
+		sum += r_sizes[i];
+	}
+	if (sum >= 256)
+		return false;
+	r_sizes[largest_idx] = 256 - sum;
+	ea0_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+	return true;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *ea0_storage(unsigned long handle)
+{
+	if (ea0_is_inline(handle))
+		return NULL;
+	return (void *)((handle & (~CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/* Decompress tags from the buffer referenced by @handle. */
+bool ea0_decompress(unsigned long handle, u8 *tags)
+{
+	unsigned long *storage = ea0_storage(handle);
+	int size = ea0_storage_size(handle);
+
+	if (size == 128) {
+		memcpy(tags, storage, size);
+		return true;
+	}
+	if (size == 8)
+		return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
+	return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
+}
+EXPORT_SYMBOL_NS(ea0_decompress, MTECOMP);
+
+/* Release the memory referenced by @handle. */
+void ea0_release_handle(unsigned long handle)
+{
+	void *storage = ea0_storage(handle);
+	int size = ea0_storage_size(handle);
+	struct kmem_cache *c;
+
+	if (!storage)
+		return;
+
+	c = mtecomp_caches[ea0_size_to_cache_id(size)];
+	kmem_cache_free(c, storage);
+}
+EXPORT_SYMBOL_NS(ea0_release_handle, MTECOMP);
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+	char name[16];
+	int size;
+	int i;
+
+	BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
+	for (i = 0; i < NUM_CACHES; i++) {
+		size = ea0_cache_id_to_size(i);
+		snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
+		mtecomp_caches[i] =
+			kmem_cache_create(name, size, size, 0, NULL);
+	}
+	return 0;
+}
+module_init(mtecomp_init);