Disable FMADD in chains for Zen4 and generic
Commit Message
Hi,
this patch disables the use of FMA in the matrix multiplication loop for generic
(for x86-64-v3) and zen4. I tested this on Zen4 and a Xeon Gold 6212U.
For Intel this is neutral both on the matrix multiplication microbenchmark
(attached) and spec2k17 where the difference was within noise for Core.
On Core the micro-benchmark runs as follows:
With FMA:
578,500,241 cycles:u # 3.645 GHz ( +- 0.12% )
753,318,477 instructions:u # 1.30 insn per cycle ( +- 0.00% )
125,417,701 branches:u # 790.227 M/sec ( +- 0.00% )
0.159146 +- 0.000363 seconds time elapsed ( +- 0.23% )
No FMA:
577,573,960 cycles:u # 3.514 GHz ( +- 0.15% )
878,318,479 instructions:u # 1.52 insn per cycle ( +- 0.00% )
125,417,702 branches:u # 763.035 M/sec ( +- 0.00% )
0.164734 +- 0.000321 seconds time elapsed ( +- 0.19% )
So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.
While on zen:
With FMA:
484875179 cycles:u # 3.599 GHz ( +- 0.05% ) (82.11%)
752031517 instructions:u # 1.55 insn per cycle
125106525 branches:u # 928.712 M/sec ( +- 0.03% ) (85.09%)
128356 branch-misses:u # 0.10% of all branches ( +- 0.06% ) (83.58%)
No FMA:
375875209 cycles:u # 3.592 GHz ( +- 0.08% ) (80.74%)
875725341 instructions:u # 2.33 insn per cycle
124903825 branches:u # 1.194 G/sec ( +- 0.04% ) (84.59%)
0.105203 +- 0.000188 seconds time elapsed ( +- 0.18% )
The difference is that Intel cores understand that fmadd does not need
all three operands to start executing, while Zen cores do not.
Since this is a noticeable win on Zen and no loss on Core, it seems like a good
default for generic.
I plan to commit the patch next week if there are no complaints.
Honza
#include <stdio.h>
#include <time.h>

#define SIZE 1000

float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

void init(void)
{
  int i, j, k;
  for (i = 0; i < SIZE; ++i)
    {
      for (j = 0; j < SIZE; ++j)
        {
          a[i][j] = (float)i + j;
          b[i][j] = (float)i - j;
          c[i][j] = 0.0f;
        }
    }
}

void mult(void)
{
  int i, j, k;

  for (i = 0; i < SIZE; ++i)
    {
      for (j = 0; j < SIZE; ++j)
        {
          for (k = 0; k < SIZE; ++k)
            {
              c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(void)
{
  clock_t s, e;

  init();
  s = clock();
  mult();
  e = clock();
  printf(" mult took %10d clocks\n", (int)(e - s));

  return 0;
}
* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
Enable for znver4 and generic.
Comments
On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> this patch disables the use of FMA in the matrix multiplication loop for generic
> (for x86-64-v3) and zen4. I tested this on Zen4 and a Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17 where the difference was within noise for Core.
>
> On core the micro-benchmark runs as follows:
>
> With FMA:
>
> 578,500,241 cycles:u # 3.645 GHz ( +- 0.12% )
> 753,318,477 instructions:u # 1.30 insn per cycle ( +- 0.00% )
> 125,417,701 branches:u # 790.227 M/sec ( +- 0.00% )
> 0.159146 +- 0.000363 seconds time elapsed ( +- 0.23% )
>
>
> No FMA:
>
> 577,573,960 cycles:u # 3.514 GHz ( +- 0.15% )
> 878,318,479 instructions:u # 1.52 insn per cycle ( +- 0.00% )
> 125,417,702 branches:u # 763.035 M/sec ( +- 0.00% )
> 0.164734 +- 0.000321 seconds time elapsed ( +- 0.19% )
>
> So the cycle count is unchanged and discrete multiply+add takes same time as FMA.
>
> While on zen:
>
>
> With FMA:
> 484875179 cycles:u # 3.599 GHz ( +- 0.05% ) (82.11%)
> 752031517 instructions:u # 1.55 insn per cycle
> 125106525 branches:u # 928.712 M/sec ( +- 0.03% ) (85.09%)
> 128356 branch-misses:u # 0.10% of all branches ( +- 0.06% ) (83.58%)
>
> No FMA:
> 375875209 cycles:u # 3.592 GHz ( +- 0.08% ) (80.74%)
> 875725341 instructions:u # 2.33 insn per cycle
> 124903825 branches:u # 1.194 G/sec ( +- 0.04% ) (84.59%)
> 0.105203 +- 0.000188 seconds time elapsed ( +- 0.18% )
>
> The difference is that Intel cores understand that fmadd does not need
> all three operands to start executing, while Zen cores do not.
This came up in a separate thread as well, but when doing reassoc of a chain
with multiple dependent FMAs.
I can't understand how this uarch detail can affect performance when, as in
the testcase, the longest input latency is on the multiplication from a memory
load. Do we actually understand _why_ the FMAs are slower here?
Do we know that Cores can start the multiplication part when the add operand
isn't ready yet? I'm curious how you set up a micro benchmark to measure this.
There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
cycle. So in theory we can at most do 2 FMAs per cycle, but with latency (FMA)
== 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to squeeze out a
little bit more throughput when there are many FADD/FMUL ops to execute? That
works independently of whether FMAs have a head start on the multiplication,
as you'd still be bottlenecked on the 2-wide issue for FMA?
On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
latency of four. So you should get worse results there (looking at the numbers
above you do get worse results, slightly so); probably the higher number of
uops is hidden by the latency.
> Since this is a noticeable win on Zen and no loss on Core, it seems like a good
> default for generic.
>
> I plan to commit the patch next week if there are no complaints.
complaint!
Richard.
> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
> int i, j, k;
> for(i=0; i<SIZE; ++i)
> {
> for(j=0; j<SIZE; ++j)
> {
> a[i][j] = (float)i + j;
> b[i][j] = (float)i - j;
> c[i][j] = 0.0f;
> }
> }
> }
>
> void mult(void)
> {
> int i, j, k;
>
> for(i=0; i<SIZE; ++i)
> {
> for(j=0; j<SIZE; ++j)
> {
> for(k=0; k<SIZE; ++k)
> {
> c[i][j] += a[i][k] * b[k][j];
> }
> }
> }
> }
>
> int main(void)
> {
> clock_t s, e;
>
> init();
> s=clock();
> mult();
> e=clock();
> printf(" mult took %10d clocks\n", (int)(e-s));
>
> return 0;
>
> }
>
> * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
> /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> smaller FMA chain. */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> - | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> + | m_YONGFENG | m_GENERIC)
>
> /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> smaller FMA chain. */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
> /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> smaller FMA chain. */
>
> This came up in a separate thread as well, but when doing reassoc of a
> chain with
> multiple dependent FMAs.
>
> I can't understand how this uarch detail can affect performance when
> as in the testcase
> the longest input latency is on the multiplication from a memory load.
> Do we actually
> understand _why_ the FMAs are slower here?
This is my understanding:
The loop is well predictable and the address calculations + loads can happen
in parallel, so the main dependency chain is updating the accumulator
computing c[i][j]. FMADD has latency 4 on Zen4, while ADD has latency 3. So the
loop with FMADD cannot run faster than one iteration per 4 cycles, while with
ADD it can do one iteration per 3. That roughly matches the speedup we see:
484875179*3/4 = 363656384, while the measured count is 375875209 cycles. The
benchmark is quite short and I ran it 100 times in perf to collect the data,
so the overhead probably accounts for the smaller than expected difference.
>
> Do we know that Cores can start the multiplication part when the add
> operand isn't
> ready yet? I'm curious how you set up a micro benchmark to measure this.
Here is a cycle-counting benchmark:
#include <stdio.h>

int
main()
{
  float o = 0;

  for (int i = 0; i < 1000000000; i++)
    {
#ifdef ACCUMULATE
      float p1 = o;
      float p2 = 0;
#else
      float p1 = 0;
      float p2 = o;
#endif
      float p3 = 0;
#ifdef FMA
      asm ("vfmadd231ss %2, %3, %0" : "=x" (o) : "0" (p1), "x" (p2), "x" (p3));
#else
      float t;
      asm ("mulss %2, %0" : "=x" (t) : "0" (p2), "x" (p3));
      asm ("addss %2, %0" : "=x" (o) : "0" (p1), "x" (t));
#endif
    }
  printf ("%f\n", o);
  return 0;
}
It performs FMAs in sequence, all with zeros. If you define ACCUMULATE
you get the pattern from the matrix multiplication. On Zen I get:
jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
4,001,011,489 cycles:u # 4.837 GHz (83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
3,000,335,064 cycles:u # 4.835 GHz (83.08%)
So 4 cycles for the FMA loop and 3 cycles for separate mul and add;
the muls execute in parallel with the adds in the second case.
If the dependency chain goes through the multiplied parameter I get:
jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
4,000,118,069 cycles:u # 4.836 GHz (83.32%)
jh@ryzen3:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:
6,001,947,341 cycles:u # 4.838 GHz (83.32%)
FMA is the same (it is still one FMA instruction per iteration), while
mul+add takes 6 cycles since the dependency chain is longer.
Core gives me:
jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
5,001,515,473 cycles:u # 3.796 GHz
jh@aster:~> gcc -O3 l.c -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
4,000,977,739 cycles:u # 3.819 GHz
jh@aster:~> gcc -O3 l.c -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
5,350,523,047 cycles:u # 3.814 GHz
jh@aster:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:u
10,251,994,240 cycles:u # 3.852 GHz
So FMA seems to take 5 cycles if we accumulate and a bit more (outside of
noise) if we do the long chain. I think some cores have a bigger difference
between these two numbers.
I am a bit surprised by the last number of 10 cycles; I would expect 8.
>
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per cycle.
> So in theory we can at most do 2 FMA per cycle but with latency (FMA)
> == 4 for Zen3/4
> and latency (FADD/FMUL) == 3 we might be able to squeeze out a little bit more
> throughput when there are many FADD/FMUL ops to execute? That works independent
> on whether FMAs have a head-start on multiplication as you'd still be
> bottle-necked
> on the 2-wide issue for FMA?
I am not sure I follow what you say here. The knob only checks for
FMADDs used in accumulation-type loops, so it is latency 4 versus latency 3
per accumulation. Indeed, in other loops fmadd is a win.
>
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a latency
> of four. So you should get worse results there (looking at the
> numbers above you
> do get worse results, slightly so), probably the higher number of uops is hidden
> by the latency.
I think the slower non-FMA result on Core was just noise (it shows in the
overall time but not in the cycle counts).
I changed the benchmark to run the multiplication 100 times.
On Intel I get:
jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
mult took 15146405 clocks
Performance counter stats for './a.out':
15,149.62 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
948 page-faults:u # 62.576 /sec
55,803,919,561 cycles:u # 3.684 GHz
87,615,590,411 instructions:u # 1.57 insn per cycle
12,512,896,307 branches:u # 825.955 M/sec
12,605,403 branch-misses:u # 0.10% of all branches
15.150064855 seconds time elapsed
15.146817000 seconds user
0.003333000 seconds sys
jh@aster:~/gcc/build/gcc> gcc matrix-fma.s ; perf stat ./a.out
mult took 15308879 clocks
Performance counter stats for './a.out':
15,312.27 msec task-clock:u # 1.000 CPUs utilized
1 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
948 page-faults:u # 61.911 /sec
59,449,535,152 cycles:u # 3.882 GHz
75,115,590,460 instructions:u # 1.26 insn per cycle
12,512,896,356 branches:u # 817.181 M/sec
12,605,235 branch-misses:u # 0.10% of all branches
15.312776274 seconds time elapsed
15.309462000 seconds user
0.003333000 seconds sys
The difference seems close to noise.
If I am counting right, with 100*1000*1000*1000 multiplications I would expect
5*100*1000*1000*1000/8 = 62500000000 cycles overall.
Perhaps since the chain is independent for every 125 multiplications it
runs a bit faster.
jh@alberti:~> gcc matrix-nofma.s ; perf stat ./a.out
mult took 10046353 clocks
Performance counter stats for './a.out':
10051.47 msec task-clock:u # 0.999 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
940 page-faults:u # 93.519 /sec
36983540385 cycles:u # 3.679 GHz (83.34%)
3535506 stalled-cycles-frontend:u # 0.01% frontend cycles idle (83.33%)
12252917 stalled-cycles-backend:u # 0.03% backend cycles idle (83.34%)
87650235892 instructions:u # 2.37 insn per cycle
# 0.00 stalled cycles per insn (83.34%)
12504689935 branches:u # 1.244 G/sec (83.33%)
12606975 branch-misses:u # 0.10% of all branches (83.32%)
10.059089949 seconds time elapsed
10.048218000 seconds user
0.003998000 seconds sys
jh@alberti:~> gcc matrix-fma.s ; perf stat ./a.out
mult took 13147631 clocks
Performance counter stats for './a.out':
13152.81 msec task-clock:u # 0.999 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
940 page-faults:u # 71.468 /sec
48394201333 cycles:u # 3.679 GHz (83.32%)
4251637 stalled-cycles-frontend:u # 0.01% frontend cycles idle (83.32%)
13664772 stalled-cycles-backend:u # 0.03% backend cycles idle (83.34%)
75101376364 instructions:u # 1.55 insn per cycle
# 0.00 stalled cycles per insn (83.35%)
12510705466 branches:u # 951.182 M/sec (83.34%)
12612898 branch-misses:u # 0.10% of all branches (83.33%)
13.162186067 seconds time elapsed
13.153354000 seconds user
0.000000000 seconds sys
So here I would expect 3*100*1000*1000*1000/8 = 37500000000 cycles for the
first and 4*100*1000*1000*1000/8 = 50000000000 cycles for the second.
So again a small overstatement, apparently due to parallelism between the
vector multiplications, but overall it seems to match what I would expect
to see.
Honza
>
> > Since this is a noticeable win on Zen and no loss on Core, it seems like a good
> > default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!
>
> Richard.
>
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> > int i, j, k;
> > for(i=0; i<SIZE; ++i)
> > {
> > for(j=0; j<SIZE; ++j)
> > {
> > a[i][j] = (float)i + j;
> > b[i][j] = (float)i - j;
> > c[i][j] = 0.0f;
> > }
> > }
> > }
> >
> > void mult(void)
> > {
> > int i, j, k;
> >
> > for(i=0; i<SIZE; ++i)
> > {
> > for(j=0; j<SIZE; ++j)
> > {
> > for(k=0; k<SIZE; ++k)
> > {
> > c[i][j] += a[i][k] * b[k][j];
> > }
> > }
> > }
> > }
> >
> > int main(void)
> > {
> > clock_t s, e;
> >
> > init();
> > s=clock();
> > mult();
> > e=clock();
> > printf(" mult took %10d clocks\n", (int)(e-s));
> >
> > return 0;
> >
> > }
> >
> > * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> > Enable for znver4 and generic.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> > /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> > smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > - | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > + | m_YONGFENG | m_GENERIC)
> >
> > /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> > smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> > /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > smaller FMA chain. */
On Tue, 12 Dec 2023, Richard Biener wrote:
> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > Hi,
> > this patch disables the use of FMA in the matrix multiplication loop for generic
> > (for x86-64-v3) and zen4. I tested this on Zen4 and a Xeon Gold 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication microbenchmark
> > (attached) and spec2k17 where the difference was within noise for Core.
> >
> > On core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> > 578,500,241 cycles:u # 3.645 GHz ( +- 0.12% )
> > 753,318,477 instructions:u # 1.30 insn per cycle ( +- 0.00% )
> > 125,417,701 branches:u # 790.227 M/sec ( +- 0.00% )
> > 0.159146 +- 0.000363 seconds time elapsed ( +- 0.23% )
> >
> >
> > No FMA:
> >
> > 577,573,960 cycles:u # 3.514 GHz ( +- 0.15% )
> > 878,318,479 instructions:u # 1.52 insn per cycle ( +- 0.00% )
> > 125,417,702 branches:u # 763.035 M/sec ( +- 0.00% )
> > 0.164734 +- 0.000321 seconds time elapsed ( +- 0.19% )
> >
> > So the cycle count is unchanged and discrete multiply+add takes same time as FMA.
> >
> > While on zen:
> >
> >
> > With FMA:
> > 484875179 cycles:u # 3.599 GHz ( +- 0.05% ) (82.11%)
> > 752031517 instructions:u # 1.55 insn per cycle
> > 125106525 branches:u # 928.712 M/sec ( +- 0.03% ) (85.09%)
> > 128356 branch-misses:u # 0.10% of all branches ( +- 0.06% ) (83.58%)
> >
> > No FMA:
> > 375875209 cycles:u # 3.592 GHz ( +- 0.08% ) (80.74%)
> > 875725341 instructions:u # 2.33 insn per cycle
> > 124903825 branches:u # 1.194 G/sec ( +- 0.04% ) (84.59%)
> > 0.105203 +- 0.000188 seconds time elapsed ( +- 0.18% )
> >
> > The difference is that Intel cores understand that fmadd does not need
> > all three operands to start executing, while Zen cores do not.
>
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.
> I can't understand how this uarch detail can affect performance when as in
> the testcase the longest input latency is on the multiplication from a
> memory load.
The latency from the memory operand doesn't matter since it's not part
of the critical path: the memory uop of the FMA starts executing as soon
as the address is ready.
> Do we actually understand _why_ the FMAs are slower here?
It's simple: on Zen4 FMA has latency 4 while add has latency 3, and you
can see it clearly in the quoted numbers. Zen with FMA runs at slightly
below 4 cycles per branch; Zen without FMA at exactly 3 cycles per branch.
Please refer to uops.info for latency data:
https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html
> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet? I'm curious how you set up a micro benchmark to
> measure this.
Unlike some of the Arm cores, no x86 core can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
apparently being the sole exception (see the 6/10/10 latencies on the
aforementioned uops.info FMA page).
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle. So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL ops
> to execute? That works independent on whether FMAs have a head-start on
> multiplication as you'd still be bottle-necked on the 2-wide issue for
> FMA?
It doesn't matter here since all the FMAs/FMULs are dependent on each other,
so the processor can start a new FMA only every 4th (or, for mul+add, every
3rd) cycle, except when starting a new iteration of the outer loop.
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four. So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.
A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency
exceeds FMUL latency (all Zens and Broadwell).
> > Since this is a noticeable win on Zen and no loss on Core, it seems like a good
> > default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!
Thanks for raising this, hopefully my explanation clears it up.
Alexander
On Tue, Dec 12, 2023 at 10:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> this patch disables the use of FMA in the matrix multiplication loop for generic
> (for x86-64-v3) and zen4. I tested this on Zen4 and a Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17 where the difference was within noise for Core.
>
> On core the micro-benchmark runs as follows:
>
> With FMA:
>
> 578,500,241 cycles:u # 3.645 GHz ( +- 0.12% )
> 753,318,477 instructions:u # 1.30 insn per cycle ( +- 0.00% )
> 125,417,701 branches:u # 790.227 M/sec ( +- 0.00% )
> 0.159146 +- 0.000363 seconds time elapsed ( +- 0.23% )
>
>
> No FMA:
>
> 577,573,960 cycles:u # 3.514 GHz ( +- 0.15% )
> 878,318,479 instructions:u # 1.52 insn per cycle ( +- 0.00% )
> 125,417,702 branches:u # 763.035 M/sec ( +- 0.00% )
> 0.164734 +- 0.000321 seconds time elapsed ( +- 0.19% )
>
> So the cycle count is unchanged and discrete multiply+add takes same time as FMA.
>
> While on zen:
>
>
> With FMA:
> 484875179 cycles:u # 3.599 GHz ( +- 0.05% ) (82.11%)
> 752031517 instructions:u # 1.55 insn per cycle
> 125106525 branches:u # 928.712 M/sec ( +- 0.03% ) (85.09%)
> 128356 branch-misses:u # 0.10% of all branches ( +- 0.06% ) (83.58%)
>
> No FMA:
> 375875209 cycles:u # 3.592 GHz ( +- 0.08% ) (80.74%)
> 875725341 instructions:u # 2.33 insn per cycle
> 124903825 branches:u # 1.194 G/sec ( +- 0.04% ) (84.59%)
> 0.105203 +- 0.000188 seconds time elapsed ( +- 0.18% )
>
> The difference is that Intel cores understand that fmadd does not need
> all three operands to start executing, while Zen cores do not.
>
> Since this is a noticeable win on Zen and no loss on Core, it seems like a good
> default for generic.
>
> I plan to commit the patch next week if there are no complaints.
The generic part LGTM. (It's exactly what we proposed in [1].)
[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
>
> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
> int i, j, k;
> for(i=0; i<SIZE; ++i)
> {
> for(j=0; j<SIZE; ++j)
> {
> a[i][j] = (float)i + j;
> b[i][j] = (float)i - j;
> c[i][j] = 0.0f;
> }
> }
> }
>
> void mult(void)
> {
> int i, j, k;
>
> for(i=0; i<SIZE; ++i)
> {
> for(j=0; j<SIZE; ++j)
> {
> for(k=0; k<SIZE; ++k)
> {
> c[i][j] += a[i][k] * b[k][j];
> }
> }
> }
> }
>
> int main(void)
> {
> clock_t s, e;
>
> init();
> s=clock();
> mult();
> e=clock();
> printf(" mult took %10d clocks\n", (int)(e-s));
>
> return 0;
>
> }
>
> * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
> /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> smaller FMA chain. */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> - | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> + | m_YONGFENG | m_GENERIC)
>
> /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> smaller FMA chain. */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
> /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> smaller FMA chain. */
> > The difference is that Intel cores understand that fmadd does not need
> > all three operands to start executing, while Zen cores do not.
> >
> > Since this is a noticeable win on Zen and no loss on Core, it seems like a good
> > default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
> The generic part LGTM. (It's exactly what we proposed in [1].)
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
Thanks. I wonder if we can think of other generic changes that would make
sense to do?
Concerning Zen4 and FMA, it is not really a win with AVX512 enabled
(which is what I was benchmarking for znver4 tuning), but it is indeed a
win with AVX256, where the extra latency is not hidden by the parallelism
exposed by doing everything twice.
I re-benchmarked Zen4 and it behaves similarly to Zen3 with AVX256, so
for x86-64-v3 this makes sense.
Honza
> >
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> > int i, j, k;
> > for(i=0; i<SIZE; ++i)
> > {
> > for(j=0; j<SIZE; ++j)
> > {
> > a[i][j] = (float)i + j;
> > b[i][j] = (float)i - j;
> > c[i][j] = 0.0f;
> > }
> > }
> > }
> >
> > void mult(void)
> > {
> > int i, j, k;
> >
> > for(i=0; i<SIZE; ++i)
> > {
> > for(j=0; j<SIZE; ++j)
> > {
> > for(k=0; k<SIZE; ++k)
> > {
> > c[i][j] += a[i][k] * b[k][j];
> > }
> > }
> > }
> > }
> >
> > int main(void)
> > {
> > clock_t s, e;
> >
> > init();
> > s=clock();
> > mult();
> > e=clock();
> > printf(" mult took %10d clocks\n", (int)(e-s));
> >
> > return 0;
> >
> > }
> >
> > * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> > Enable for znver4 and generic.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> > /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> > smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > - | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > + | m_YONGFENG | m_GENERIC)
> >
> > /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> > smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> > /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > smaller FMA chain. */
>
>
>
> --
> BR,
> Hongtao
On Thu, Dec 14, 2023 at 12:03 AM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > The difference is that Intel cores understand that fmadd does not need
> > > all three operands to start executing, while Zen cores do not.
> > >
> > > Since this is a noticeable win on Zen and no loss on Core, it seems like a good
> > > default for generic.
> > >
> > > I plan to commit the patch next week if there are no complaints.
> > The generic part LGTM. (It's exactly what we proposed in [1].)
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
>
> Thanks. I wonder if we can think of other generic changes that would make
> sense to do?
> Concerning Zen4 and FMA, it is not really a win with AVX512 enabled
> (which is what I was benchmarking for znver4 tuning), but it is indeed a
> win with AVX256, where the extra latency is not hidden by the parallelism
> exposed by doing everything twice.
>
> I re-benchmarked Zen4 and it behaves similarly to Zen3 with AVX256, so
> for x86-64-v3 this makes sense.
>
> Honza
> > >
> > > Honza
> > >
> > > #include <stdio.h>
> > > #include <time.h>
> > >
> > > #define SIZE 1000
> > >
> > > float a[SIZE][SIZE];
> > > float b[SIZE][SIZE];
> > > float c[SIZE][SIZE];
> > >
> > > void init(void)
> > > {
> > > int i, j, k;
> > > for(i=0; i<SIZE; ++i)
> > > {
> > > for(j=0; j<SIZE; ++j)
> > > {
> > > a[i][j] = (float)i + j;
> > > b[i][j] = (float)i - j;
> > > c[i][j] = 0.0f;
> > > }
> > > }
> > > }
> > >
> > > void mult(void)
> > > {
> > > int i, j, k;
> > >
> > > for(i=0; i<SIZE; ++i)
> > > {
> > > for(j=0; j<SIZE; ++j)
> > > {
> > > for(k=0; k<SIZE; ++k)
> > > {
> > > c[i][j] += a[i][k] * b[k][j];
> > > }
> > > }
> > > }
> > > }
> > >
> > > int main(void)
> > > {
> > > clock_t s, e;
> > >
> > > init();
> > > s=clock();
> > > mult();
> > > e=clock();
> > > printf(" mult took %10d clocks\n", (int)(e-s));
> > >
> > > return 0;
> > >
> > > }
> > >
> > > * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, X86_TUNE_AVOID_256FMA_CHAINS):
> > > Enable for znver4 and generic.
> > >
> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 43fa9e8fd6d..74b03cbcc60 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> > >
> > > /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> > > smaller FMA chain. */
> > > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > > - | m_YONGFENG)
> > > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > > + | m_YONGFENG | m_GENERIC)
> > >
> > > /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> > > smaller FMA chain. */
> > > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > > - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > > + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
Can we backport the patch (at least the generic part) to the
GCC 11/12/13 release branches?
> > >
> > > /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > > smaller FMA chain. */
> >
> >
> >
> > --
> > BR,
> > Hongtao
> Can we backport the patch (at least the generic part) to the
> GCC 11/12/13 release branches?
Yes, the periodic testers have picked up the change and, as far as I can tell,
there are no surprises.
Thanks,
Honza
> > > >
> > > > /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > > > smaller FMA chain. */
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
>
>
>
> --
> BR,
> Hongtao
@@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
/* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
smaller FMA chain. */
-DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
- | m_YONGFENG)
+DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
+ | m_YONGFENG | m_GENERIC)
/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
smaller FMA chain. */
-DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
- | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
+DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
+ | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
/* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
smaller FMA chain. */