[0/4] Add vector pair builtins to PowerPC

Message ID ZU62hIC0H7pvSwrY@cowardly-lion.the-meissners.org

Michael Meissner Nov. 10, 2023, 11:02 p.m. UTC
This set of patches adds support for using the vector pair load (lxvp, plxvp,
and lxvpx) instructions and the vector pair store (stxvp, pstxvp, and stxvpx)
instructions that were introduced with ISA 3.1 on Power10 systems.

With GCC 13, the only place vector pairs (and vector quads) were used was to
feed into the MMA subsystem.  These patches do not use the MMA subsystem, but
they give users a way to write code that is extremely memory bandwidth
intensive.

There are two main ways to add vector pair support to the GCC compiler:
built-in functions vs. __attribute__((__vector_size__(32))).

The first method is to add a set of built-in functions that use the vector pair
type (__vector_pair), which allows the user to write loops and such using that
type.  Loads are normally done using the load vector pair instructions.  The
operation is then split after reload into two independent vector operations on
the two 128-bit vectors located in the vector pair.  When the type is stored,
a store vector pair instruction is normally used.  By keeping the value within
a vector pair through register allocation, the compiler does not generate
extra move instructions which can slow down the loop.
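
As a rough illustration of the first method, the following portable C sketch
models a vector pair as two 128-bit halves and applies an operation to each
half independently, which is the shape of the code after the post reload split.
The names used here (v2df_model, struct vpair_f64_model, and the
vpair_f64_model_* helpers) are hypothetical and are not the built-in functions
added by these patches.  The real built-ins operate on __vector_pair and use
the vector pair load/store instructions, which this model does not.

```c
#include <string.h>

/* 128-bit vector of two doubles, using the GCC vector extension.  */
typedef double v2df_model __attribute__ ((__vector_size__ (16)));

/* Hypothetical model of a 256-bit vector pair: two 128-bit halves.  */
struct vpair_f64_model
{
  v2df_model half[2];
};

/* Model of a vector pair load: read 32 bytes (4 doubles) from memory.  */
void
vpair_f64_model_load (struct vpair_f64_model *dest, const double *src)
{
  memcpy (dest, src, sizeof (*dest));
}

/* Model of a vector pair store: write 32 bytes (4 doubles) to memory.  */
void
vpair_f64_model_store (double *dest, const struct vpair_f64_model *src)
{
  memcpy (dest, src, sizeof (*src));
}

/* Model of the split: one pair operation becomes two independent
   128-bit vector additions, one per half.  */
void
vpair_f64_model_add (struct vpair_f64_model *result,
		     const struct vpair_f64_model *a,
		     const struct vpair_f64_model *b)
{
  result->half[0] = a->half[0] + b->half[0];
  result->half[1] = a->half[1] + b->half[1];
}
```

The point of the model is only that the pair stays together as a single value
from the load to the store, and the arithmetic is done on each half separately.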

The second method is to add support for the V4DF, V8SF, etc. types.  By doing
so, you can use __attribute__((__vector_size__(32))) to declare variables that
are vector pairs, and the GCC compiler will generate the appropriate code.  I
implemented a limited prototype of this support, but it has some problems that
I haven't addressed.  One potential problem with using the 32-byte vector size
is that it can generate worse code for operations that aren't covered, as the
compiler unpacks things and re-packs them.  The compiler would also generate
these unpacks and packs if you are generating code for a power9 system.  There
are a bunch of test cases that fail with my prototype implementation that I
haven't addressed yet.
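
For reference, this is roughly what user code for the second method looks like
with the generic GCC vector extension.  It is a minimal sketch, not code from
the prototype: the type name v4df_sketch and the function vec32_fma_loop are
made up for illustration.  On a target without native 32-byte vector support
the compiler lowers these operations into smaller pieces, which is where the
unpack/re-pack concern mentioned above comes from.

```c
#include <stddef.h>
#include <string.h>

/* 32-byte vector of four doubles, using the GCC vector extension.
   The type name is hypothetical.  */
typedef double v4df_sketch __attribute__ ((__vector_size__ (32)));

/* Hypothetical example: value[i] += a[i] * b[i], four doubles at a time.
   memcpy is used for the loads and stores so the arrays do not need to
   be 32-byte aligned.  */
void
vec32_fma_loop (double *value, const double *a, const double *b, size_t n)
{
  size_t i;

  for (i = 0; i + 4 <= n; i += 4)
    {
      v4df_sketch va, vb, vv;

      memcpy (&va, a + i, sizeof (va));
      memcpy (&vb, b + i, sizeof (vb));
      memcpy (&vv, value + i, sizeof (vv));

      /* Element-wise multiply and add on all four doubles.  */
      vv += va * vb;

      memcpy (value + i, &vv, sizeof (vv));
    }

  /* Scalar tail for any remaining elements.  */
  for (; i < n; i++)
    value[i] += a[i] * b[i];
}
```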

After discussions within our group, it was decided that using built-in
functions is the way to go at this time, and these patches implement those
functions.

In terms of benchmarks, I wrote two benchmarks:

   1)	One benchmark is a saxpy type loop: value[i] += (a[i] * b[i]).  That is
	a loop with 3 loads and a store per iteration.

   2)	Another benchmark produces a scalar sum of an entire vector.  This is a
	loop that just has a single load and no store.
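
In scalar form, the two kernels look roughly like the following.  This is only
a sketch of the loop shapes described above, not the benchmark source, and the
function names bench_fma_loop_f64 and bench_sum_f64 are made up here.

```c
#include <stddef.h>

/* Saxpy type loop: three loads (value[i], a[i], b[i]) and one store
   per iteration.  */
void
bench_fma_loop_f64 (double *value, const double *a, const double *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    value[i] += a[i] * b[i];
}

/* Scalar sum of an entire vector: one load and no store per iteration.  */
double
bench_sum_f64 (const double *a, size_t n)
{
  double sum = 0.0;

  for (size_t i = 0; i < n; i++)
    sum += a[i];

  return sum;
}
```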

For the saxpy type loop, I get the following general numbers for both float and
double:

   1)	The vector pair built-in functions are roughly 10% faster than using
	normal vector processing.

   2)	The vector pair built-in functions are roughly 19-20% faster than if I
	write the loop using the vector pair loads using the existing
	built-ins, and then manually split the values and do the arithmetic
	and single vector stores.

   3)	The vector pair built-in functions are roughly 35-40% faster than if I
	write the loop using the existing built-ins for both vector pair load
	and vector pair store.  If I apply the patches that Peter Bergner has
	been writing for PR target/109116, then it improves the speed of the
	existing built-ins for assembling and disassembling vector pairs.  In
	this case, the vector pair built-in functions are 20-25% faster,
	instead of 35-40% faster.  This is due to the patch eliminating extra
	vector moves.

Unfortunately, for floating point, doing the sum of the whole vector is slower
using the new vector pair built-in functions in a simple loop (compared to
using the existing built-ins for disassembling vector pairs).  If I write more
complex loops that manually unroll the loop, then the floating point vector
pair built-in functions perform like the integer vector pair built-in
functions.  So there is some amount of tuning that will need to be done.

There are 4 patches within this group of patches.

    1)	The first patch adds vector pair support for 32-bit and 64-bit floating
	point operations.  The operations provided are absolute value,
	addition, fused multiply-add, minimum, maximum, multiplication,
	negation, and subtraction.  I did not add divide or square root because
	these instructions take long enough to compute that you don't get any
	advantage from using the vector pair load/store instructions.

    2)	The second patch adds vector pair support for 8-bit, 16-bit, 32-bit,
	and 64-bit integer operations.  The operations provided include
	addition, bitwise and, bitwise inclusive or, bitwise exclusive or,
	bitwise not, both signed and unsigned minimum/maximum, negation, and
	subtraction.  I did not add multiply because the PowerPC architecture
	does not provide single instructions to do integer vector multiply on
	the whole vector.  I could add shifts and rotates, but I didn't think
	memory intensive code used these operations.

    3)	The third patch adds methods to create vector pair values (zero, splat
	from a scalar value, and combine two 128-bit vectors), as well as a
	convenient method to extract one 128-bit vector from a vector pair.

    4)	The fourth patch adds horizontal addition for 32-bit and 64-bit
	floating point, and 64-bit integers.  I do wonder if there are more
	horizontal reductions that should be done.
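
To make the operations in patches 3 and 4 concrete, here is a portable C model
of creating a pair (zero, splat, combine), extracting one 128-bit half, and
doing a horizontal addition for 64-bit floating point.  Everything named here
(v2df_half, struct vpair_f64_sketch, and the vpair_f64_sketch_* functions) is
hypothetical and is not the interface added by the patches.  It only shows
what each operation is assumed to compute.

```c
/* 128-bit vector of two doubles, using the GCC vector extension.  */
typedef double v2df_half __attribute__ ((__vector_size__ (16)));

/* Hypothetical model of a vector pair holding four doubles.  */
struct vpair_f64_sketch
{
  v2df_half half[2];
};

/* Create a pair with all elements zero.  */
void
vpair_f64_sketch_zero (struct vpair_f64_sketch *p)
{
  v2df_half zero = { 0.0, 0.0 };

  p->half[0] = zero;
  p->half[1] = zero;
}

/* Create a pair with every element set to the scalar X.  */
void
vpair_f64_sketch_splat (struct vpair_f64_sketch *p, double x)
{
  v2df_half v = { x, x };

  p->half[0] = v;
  p->half[1] = v;
}

/* Combine two 128-bit vectors into one pair.  */
void
vpair_f64_sketch_combine (struct vpair_f64_sketch *p,
			  const v2df_half *first, const v2df_half *second)
{
  p->half[0] = *first;
  p->half[1] = *second;
}

/* Extract one 128-bit vector (INDEX 0 or 1) from a pair.  */
void
vpair_f64_sketch_extract (v2df_half *out, const struct vpair_f64_sketch *p,
			  int index)
{
  *out = p->half[index & 1];
}

/* Horizontal addition: add the two halves together, then add the two
   elements of the resulting 128-bit vector to get a scalar.  */
double
vpair_f64_sketch_horizontal_add (const struct vpair_f64_sketch *p)
{
  v2df_half sum = p->half[0] + p->half[1];

  return sum[0] + sum[1];
}
```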

I have built and tested these patches on:

    *	A little endian power10 server using --with-cpu=power10
    *	A little endian power9 server using --with-cpu=power9
    *	A big endian power9 server using --with-cpu=power9.

Can I check these patches into the master branch?