Reduce floating-point difficulties in timevar.cc

Message ID ded23976-2873-45de-aebd-28839ec0ab82@AZ-NEU-EX03.Arm.com
State Not Applicable
Headers
Series Reduce floating-point difficulties in timevar.cc |

Checks

Context Check Description
snail/gcc-patch-check fail Git am fail log

Commit Message

Matthew Malcomson July 21, 2023, 12:11 p.m. UTC
  On some AArch64 bootstrapped builds, we were getting a flaky test
because the floating point operations in `get_time` were being fused
with the floating point operations in `timevar_accumulate`.

This meant that the rounding behaviour of our multiplication with
`ticks_to_msec` was different when used in `timer::start` and when
performed in `timer::stop`.  These extra inaccuracies led to the
testcase `g++.dg/ext/timevar1.C` being flaky on some hardware.

This change ensures those operations are not fused and hence stops the test
being flaky on that particular machine.  There is no expected change in the
generated code.
Bootstrap & regtest on AArch64 passes with no regressions.

gcc/ChangeLog:

	* timevar.cc (get_time): Make this noinline to avoid fusing
	behaviour and associated test flakyness.


N.b. I didn't know who to include as reviewer -- guessed Richard Biener as the
global reviewer that had the most contributions to this file and Richard
Sandiford since I've asked him for reviews a lot in the past.


###############     Attachment also inlined for ease of reply    ###############
diff --git a/gcc/timevar.cc b/gcc/timevar.cc
index d695297aae7f6b2a6de01a37fe86c2a232338df0..5ea4ec259e114f31f611e7105cd102f4c9552d18 100644
--- a/gcc/timevar.cc
+++ b/gcc/timevar.cc
@@ -212,6 +212,7 @@ timer::named_items::print (FILE *fp, const timevar_time_def *total)
    HAVE_WALL_TIME macros.  */
 
 static void
+__attribute__((noinline))
 get_time (struct timevar_time_def *now)
 {
   now->user = 0;
  

Comments

Richard Biener July 21, 2023, 12:47 p.m. UTC | #1
On Fri, 21 Jul 2023, Matthew Malcomson wrote:

> On some AArch64 bootstrapped builds, we were getting a flaky test
> because the floating point operations in `get_time` were being fused
> with the floating point operations in `timevar_accumulate`.
> 
> This meant that the rounding behaviour of our multiplication with
> `ticks_to_msec` was different when used in `timer::start` and when
> performed in `timer::stop`.  These extra inaccuracies led to the
> testcase `g++.dg/ext/timevar1.C` being flaky on some hardware.
> 
> This change ensures those operations are not fused and hence stops the test
> being flaky on that particular machine.  There is no expected change in the
> generated code.
> Bootstrap & regtest on AArch64 passes with no regressions.

I think this is undesriable.  With fused you mean we use FMA?
I think you could use -ffp-contract=off for the TU instead.

Note you can't use __attribute__((noinline)) literally since the
host compiler might not support this.

Richard.

> gcc/ChangeLog:
> 
> 	* timevar.cc (get_time): Make this noinline to avoid fusing
> 	behaviour and associated test flakyness.
> 
> 
> N.b. I didn't know who to include as reviewer -- guessed Richard Biener as the
> global reviewer that had the most contributions to this file and Richard
> Sandiford since I've asked him for reviews a lot in the past.
> 
> 
> ###############     Attachment also inlined for ease of reply    ###############
> 
> 
> diff --git a/gcc/timevar.cc b/gcc/timevar.cc
> index d695297aae7f6b2a6de01a37fe86c2a232338df0..5ea4ec259e114f31f611e7105cd102f4c9552d18 100644
> --- a/gcc/timevar.cc
> +++ b/gcc/timevar.cc
> @@ -212,6 +212,7 @@ timer::named_items::print (FILE *fp, const timevar_time_def *total)
>     HAVE_WALL_TIME macros.  */
>  
>  static void
> +__attribute__((noinline))
>  get_time (struct timevar_time_def *now)
>  {
>    now->user = 0;
> 
> 
> 
>
  
Xi Ruoyao July 21, 2023, 12:49 p.m. UTC | #2
On Fri, 2023-07-21 at 13:11 +0100, Matthew Malcomson via Gcc-patches
wrote:
> This change ensures those operations are not fused and hence stops the test
> being flaky on that particular machine.  There is no expected change in the
> generated code.
> Bootstrap & regtest on AArch64 passes with no regressions.
> 
> gcc/ChangeLog:
> 
> 	* timevar.cc (get_time): Make this noinline to avoid fusing
> 	behaviour and associated test flakyness.

I don't think it's correct.  It will break bootstrapping GCC from other
ISO C++11 compilers, you need to at least guard it with #ifdef __GNUC__.
And IMO it's just hiding the real problem.

We need more info of the "particular machine".  Is this a hardware bug
(i.e. the machine violates the AArch64 spec) or a GCC code generation
issue?  Or should we generally use -ffp-contract=off in BOOT_CFLAGS?
  
Matthew Malcomson July 21, 2023, 1:11 p.m. UTC | #3
Responding to two emails at the same time ;-)

On 7/21/23 13:47, Richard Biener wrote:
> On Fri, 21 Jul 2023, Matthew Malcomson wrote:
> 
>> On some AArch64 bootstrapped builds, we were getting a flaky test
>> because the floating point operations in `get_time` were being fused
>> with the floating point operations in `timevar_accumulate`.
>>
>> This meant that the rounding behaviour of our multiplication with
>> `ticks_to_msec` was different when used in `timer::start` and when
>> performed in `timer::stop`.  These extra inaccuracies led to the
>> testcase `g++.dg/ext/timevar1.C` being flaky on some hardware.
>>
>> This change ensures those operations are not fused and hence stops the test
>> being flaky on that particular machine.  There is no expected change in the
>> generated code.
>> Bootstrap & regtest on AArch64 passes with no regressions.
> 
> I think this is undesriable.  With fused you mean we use FMA?
> I think you could use -ffp-contract=off for the TU instead.

Yeah -- we used fused multiply subtract because we combined the multiply 
in `get_time` with the subtract in `timevar_accumulate`.

> 
> Note you can't use __attribute__((noinline)) literally since the
> host compiler might not support this.
> 
> Richard.
> 

On 7/21/23 13:49, Xi Ruoyao wrote:
...
> I don't think it's correct.  It will break bootstrapping GCC from other
> ISO C++11 compilers, you need to at least guard it with #ifdef __GNUC__.
> And IMO it's just hiding the real problem.
> 
> We need more info of the "particular machine".  Is this a hardware bug
> (i.e. the machine violates the AArch64 spec) or a GCC code generation
> issue?  Or should we generally use -ffp-contract=off in BOOT_CFLAGS?
> 

My understanding is that this is not a hardware bug and that it's 
specified that rounding does not happen on the multiply "sub-part" in 
`FNMSUB`, but rounding happens on the `FMUL` that generates some input 
to it.

I was given to understand from discussions with others that this codegen 
is allowed -- though I honestly didn't confirm the line of reasoning 
through all the relevant standards.


------------------------
W.r.t. both:
Thanks for pointing out bootstrapping from other ISO C++ compilers -- 
(didn't realise that was a concern).

I can look into `-ffp-contract=off` as you both have recommended.
One question -- if we have concerns that the host compiler may not be 
able to handle `attribute((noinline))` would we also be concerned that 
this flag may not be supported?
(Or is the severity of lack of support sufficiently different in the two 
cases that this is fine -- i.e. not compile vs may trigger floating 
point rounding inaccuracies?)
  
Richard Biener July 21, 2023, 1:41 p.m. UTC | #4
> Am 21.07.2023 um 15:12 schrieb Matthew Malcomson <matthew.malcomson@arm.com>:
> 
> Responding to two emails at the same time ;-)
> 
>> On 7/21/23 13:47, Richard Biener wrote:
>>> On Fri, 21 Jul 2023, Matthew Malcomson wrote:
>>> On some AArch64 bootstrapped builds, we were getting a flaky test
>>> because the floating point operations in `get_time` were being fused
>>> with the floating point operations in `timevar_accumulate`.
>>> 
>>> This meant that the rounding behaviour of our multiplication with
>>> `ticks_to_msec` was different when used in `timer::start` and when
>>> performed in `timer::stop`.  These extra inaccuracies led to the
>>> testcase `g++.dg/ext/timevar1.C` being flaky on some hardware.
>>> 
>>> This change ensures those operations are not fused and hence stops the test
>>> being flaky on that particular machine.  There is no expected change in the
>>> generated code.
>>> Bootstrap & regtest on AArch64 passes with no regressions.
>> I think this is undesriable.  With fused you mean we use FMA?
>> I think you could use -ffp-contract=off for the TU instead.
> 
> Yeah -- we used fused multiply subtract because we combined the multiply in `get_time` with the subtract in `timevar_accumulate`.
> 
>> Note you can't use __attribute__((noinline)) literally since the
>> host compiler might not support this.
>> Richard.
> 
> On 7/21/23 13:49, Xi Ruoyao wrote:
> ...
>> I don't think it's correct.  It will break bootstrapping GCC from other
>> ISO C++11 compilers, you need to at least guard it with #ifdef __GNUC__.
>> And IMO it's just hiding the real problem.
>> We need more info of the "particular machine".  Is this a hardware bug
>> (i.e. the machine violates the AArch64 spec) or a GCC code generation
>> issue?  Or should we generally use -ffp-contract=off in BOOT_CFLAGS?
> 
> My understanding is that this is not a hardware bug and that it's specified that rounding does not happen on the multiply "sub-part" in `FNMSUB`, but rounding happens on the `FMUL` that generates some input to it.
> 
> I was given to understand from discussions with others that this codegen is allowed -- though I honestly didn't confirm the line of reasoning through all the relevant standards.
> 
> 
> ------------------------
> W.r.t. both:
> Thanks for pointing out bootstrapping from other ISO C++ compilers -- (didn't realise that was a concern).
> 
> I can look into `-ffp-contract=off` as you both have recommended.
> One question -- if we have concerns that the host compiler may not be able to handle `attribute((noinline))` would we also be concerned that this flag may not be supported?
> (Or is the severity of lack of support sufficiently different in the two cases that this is fine -- i.e. not compile vs may trigger floating point rounding inaccuracies?)

I’d only use it in stage2+ flags where we know we’re dealing with GCC 

Richard 

> 
>
  
Xi Ruoyao July 21, 2023, 1:45 p.m. UTC | #5
On Fri, 2023-07-21 at 14:11 +0100, Matthew Malcomson wrote:
> My understanding is that this is not a hardware bug and that it's 
> specified that rounding does not happen on the multiply "sub-part" in 
> `FNMSUB`, but rounding happens on the `FMUL` that generates some input
> to it.

AFAIK the C standard does only say "A floating *expression* may be
contracted".  I.e:

double r = a * b + c;

may be compiled to use FMA because "a * b + c" is a floating point
expression.  But

double t = a * b;
double r = t + c;

is not, because "a * b" and "t + c" are two separate floating point
expressions.

So a contraction across two functions is not allowed.  We now have -ffp-
contract=on (https://gcc.gnu.org/r14-2023) to only allow C-standard
contractions.

Perhaps -ffp-contract=on (not off) is enough to fix the issue (if you
are building GCC 14 snapshot).  The default is "fast" (if no -std=
option is used), which allows some contractions disallowed by the
standard.

But GCC is in C++ and I'm not sure if the C++ standard has the same
definition for allowed contractions as C.

> I can look into `-ffp-contract=off` as you both have recommended.
> One question -- if we have concerns that the host compiler may not be 
> able to handle `attribute((noinline))` would we also be concerned that
> this flag may not be supported?

Only use it in BOOT_CFLAGS, i. e. 'make BOOT_CFLAGS="-O2 -g -ffp-
contract=on"' (or "off" instead of "on").  In 3-stage bootstrapping it's
only applied in stage 2 and 3, during which GCC is compiled by itself.

> (Or is the severity of lack of support sufficiently different in the two 
> cases that this is fine -- i.e. not compile vs may trigger floating 
> point rounding inaccuracies?)

It's possible that the test itself is flaky.  Can you provide some
detail about how it fails?
  
Alexander Monakov July 21, 2023, 1:58 p.m. UTC | #6
On Fri, 21 Jul 2023, Xi Ruoyao via Gcc-patches wrote:

> Perhaps -ffp-contract=on (not off) is enough to fix the issue (if you
> are building GCC 14 snapshot).  The default is "fast" (if no -std=
> option is used), which allows some contractions disallowed by the
> standard.

Not fully, see below.

> But GCC is in C++ and I'm not sure if the C++ standard has the same
> definition for allowed contractions as C.

It doesn't, but in GCC we should aim to provide the same semantics in C++
as in C.

> > (Or is the severity of lack of support sufficiently different in the two 
> > cases that this is fine -- i.e. not compile vs may trigger floating 
> > point rounding inaccuracies?)
> 
> It's possible that the test itself is flaky.  Can you provide some
> detail about how it fails?

See also PR 99903 for an earlier known issue which appears due to x87
excess precision and so tweaking -ffp-contract wouldn't help:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

Now that multiple platforms are hitting this, can we _please_ get rid
of the questionable attempt to compute time in a floating-point variable
and just use an uint64_t storing nanoseconds?

Alexander
  
Xi Ruoyao July 21, 2023, 3:11 p.m. UTC | #7
On Fri, 2023-07-21 at 16:58 +0300, Alexander Monakov wrote:
> 
> On Fri, 21 Jul 2023, Xi Ruoyao via Gcc-patches wrote:
> 
> > Perhaps -ffp-contract=on (not off) is enough to fix the issue (if you
> > are building GCC 14 snapshot).  The default is "fast" (if no -std=
> > option is used), which allows some contractions disallowed by the
> > standard.
> 
> Not fully, see below.
> 
> > But GCC is in C++ and I'm not sure if the C++ standard has the same
> > definition for allowed contractions as C.
> 
> It doesn't, but in GCC we should aim to provide the same semantics in C++
> as in C.
> 
> > > (Or is the severity of lack of support sufficiently different in the two 
> > > cases that this is fine -- i.e. not compile vs may trigger floating 
> > > point rounding inaccuracies?)
> > 
> > It's possible that the test itself is flaky.  Can you provide some
> > detail about how it fails?
> 
> See also PR 99903 for an earlier known issue which appears due to x87
> excess precision and so tweaking -ffp-contract wouldn't help:
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

Does it affect AArch64 too?

> Now that multiple platforms are hitting this, can we _please_ get rid
> of the questionable attempt to compute time in a floating-point variable
> and just use an uint64_t storing nanoseconds?

To me this is the correct thing to do.
  
Matthew Malcomson July 21, 2023, 3:24 p.m. UTC | #8
On 7/21/23 14:45, Xi Ruoyao wrote:
> On Fri, 2023-07-21 at 14:11 +0100, Matthew Malcomson wrote:
>> My understanding is that this is not a hardware bug and that it's
>> specified that rounding does not happen on the multiply "sub-part" in
>> `FNMSUB`, but rounding happens on the `FMUL` that generates some input
>> to it.
> 
> AFAIK the C standard does only say "A floating *expression* may be
> contracted".  I.e:
> 
> double r = a * b + c;
> 
> may be compiled to use FMA because "a * b + c" is a floating point
> expression.  But
> 
> double t = a * b;
> double r = t + c;
> 
> is not, because "a * b" and "t + c" are two separate floating point
> expressions.
> 
> So a contraction across two functions is not allowed.  We now have -ffp-
> contract=on (https://gcc.gnu.org/r14-2023) to only allow C-standard
> contractions.
> 
> Perhaps -ffp-contract=on (not off) is enough to fix the issue (if you
> are building GCC 14 snapshot).  The default is "fast" (if no -std=
> option is used), which allows some contractions disallowed by the
> standard.
> 
> But GCC is in C++ and I'm not sure if the C++ standard has the same
> definition for allowed contractions as C.
> 

Thanks -- I'll look into whether `-ffp-contract=on` works.

> 
> It's possible that the test itself is flaky.  Can you provide some
> detail about how it fails?
> 

Sure -- The outline is that `timer::validate_phases` sees the sum of 
sub-part timers as greater than the timer for the "overall" time 
(outside of a tolerance of 1.000001).  It then complains and hits 
`gcc_unreachable()`.

While I found it difficult to get enough information out of the test 
that is run in the testsuite, I found that if passing an invalid 
argument to `cc1plus` all sub-parts would be zero, and sometimes the 
"total" would be negative.

This was due to the `times` syscall returning the same clock tick for 
start and end of the "total" timer and the difference in rounding 
between FNMSUB and FMUL means that depending on what that clock tick is 
the "elapsed time" can end up calculated as negative.

I didn't proove it 100% but I believe the same fundamental difference 
(but opposite rounding error) could trigger the testsuite failure -- if 
the "end" of one sub-phase timer is greater than the "start" of another 
sub-phase timer then sum of parts could be greater than total.

There is a "tolerance" in this test that I considered increasing, but 
since that would not affect the "invalid arguments" thing (where the 
total is negative and hence the tolerance multiplication of 1.000001 
would have to be supplemented by a positive offset) I suggested avoiding 
the inline.

W.r.t. the x86 bug that Alexander Monakov has pointed to, it's a very 
similar thing but in this case the problem is not bit-precision of 
values after the inlining, but rather a difference between fused and not 
fused operations after the inlining.

Agreed that using integral arithmetic is the more robust solution.
  
Alexander Monakov July 21, 2023, 3:56 p.m. UTC | #9
On Fri, 21 Jul 2023, Xi Ruoyao wrote:

> > See also PR 99903 for an earlier known issue which appears due to x87
> > excess precision and so tweaking -ffp-contract wouldn't help:
> > 
> >   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903
> 
> Does it affect AArch64 too?

Well, not literally (AArch64 doesn't have excess precision), but absence
of intermediate rounding in FMA is similar to excess precision.

I'm saying it's the same issue manifesting via different pathways on x86
and aarch64. Sorry if I misunderstood your question.

Alexander
  
Andrew Pinski July 21, 2023, 5:41 p.m. UTC | #10
On Fri, Jul 21, 2023 at 5:13 AM Matthew Malcomson via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On some AArch64 bootstrapped builds, we were getting a flaky test
> because the floating point operations in `get_time` were being fused
> with the floating point operations in `timevar_accumulate`.
>
> This meant that the rounding behaviour of our multiplication with
> `ticks_to_msec` was different when used in `timer::start` and when
> performed in `timer::stop`.  These extra inaccuracies led to the
> testcase `g++.dg/ext/timevar1.C` being flaky on some hardware.
>
> This change ensures those operations are not fused and hence stops the test
> being flaky on that particular machine.  There is no expected change in the
> generated code.
> Bootstrap & regtest on AArch64 passes with no regressions.

Oh this does explain why powerpc also sees it: https://gcc.gnu.org/PR110316 .
I wonder if not adding noinline here but rather changing the code to
tolerate the fused multiple-subtract instead
which is kinda related to what I suggested in comment #1 in PR 110316.

Thanks,
Andrew

>
> gcc/ChangeLog:
>
>         * timevar.cc (get_time): Make this noinline to avoid fusing
>         behaviour and associated test flakyness.
>
>
> N.b. I didn't know who to include as reviewer -- guessed Richard Biener as the
> global reviewer that had the most contributions to this file and Richard
> Sandiford since I've asked him for reviews a lot in the past.
>
>
> ###############     Attachment also inlined for ease of reply    ###############
>
>
> diff --git a/gcc/timevar.cc b/gcc/timevar.cc
> index d695297aae7f6b2a6de01a37fe86c2a232338df0..5ea4ec259e114f31f611e7105cd102f4c9552d18 100644
> --- a/gcc/timevar.cc
> +++ b/gcc/timevar.cc
> @@ -212,6 +212,7 @@ timer::named_items::print (FILE *fp, const timevar_time_def *total)
>     HAVE_WALL_TIME macros.  */
>
>  static void
> +__attribute__((noinline))
>  get_time (struct timevar_time_def *now)
>  {
>    now->user = 0;
>
>
>
  

Patch

diff --git a/gcc/timevar.cc b/gcc/timevar.cc
index d695297aae7f6b2a6de01a37fe86c2a232338df0..5ea4ec259e114f31f611e7105cd102f4c9552d18 100644
--- a/gcc/timevar.cc
+++ b/gcc/timevar.cc
@@ -212,6 +212,7 @@  timer::named_items::print (FILE *fp, const timevar_time_def *total)
    HAVE_WALL_TIME macros.  */
 
 static void
+__attribute__((noinline))
 get_time (struct timevar_time_def *now)
 {
   now->user = 0;