expr.cc: avoid unexpected side effects in expand_expr_divmod optimization

Message ID 20221208105944.660323-1-jose.marchesi@oracle.com
State Accepted
Headers
Series expr.cc: avoid unexpected side effects in expand_expr_divmod optimization |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Jose E. Marchesi Dec. 8, 2022, 10:59 a.m. UTC
  The expand_expr_divmod function in expr.cc attempts to optimize cases
where both arguments of a division/modulus are known to be positive
when interpreted as signed.  In these cases, both signed division and
unsigned division will raise the same value, and therefore the
cheapest option can be used.

In order to determine what is the cheaper option in the current
target, expand_expr_divmod actually expands both a signed divmod and
an unsigned divmod using local "sequences":

  start_sequence ();
  ...
  expand_divmod (... signed ...);
  ...
  end_sequence ();

  start_sequence ();
  ...
  expand_divmod (... unsigned ...);
  ...
  end_sequence ();

And then compares the cost of each generated sequence, choosing the
best one.  Finally, it emits the selected expanded sequence and
returns the rtx with the result.

This approach has a caveat.  Some targets do not provide instructions
for division/modulus instructions.  In the case of BPF, it provides
unsigned division/modulus, but not signed division/modulus.

In these cases, the expand_divmod tries can contain calls to funcalls.
For example, in BPF:

  start_sequence ();
  ...
  expand_divmod (... signed ...); -> This generates funcall to __divdi3
  ...
  end_sequence ();

  start_sequence ();
  ...
  expand_divmod (... unsigned ...); -> This generates direct `div' insn.
  ...
  end_sequence ();

The problem is that when a funcall is expanded, an accompanying global
symbol definition is written in the output stream:

  .global __divdi3

And this symbol definition remains in the compiled assembly file, even
if the sequence using the direct `div' instruction above is used.

This is particularly bad in BPF, because the kernel bpf loader chokes
on the spurious symbol __divdi3 and makes the resulting BPF object
unloadable (note that BPF objects are not linked before processed by
the kernel.)

In order to fix this, this patch modifies expand_expr_divmod in the
following way:

- When trying each sequence (signed, unsigned) the expand_divmod calls
  are told to _not_ use libcalls if everything else fails.  This is
  done by passing OPTAB_WIDEN as the `methods' argument.  (Before it
  was using the default value OPTAB_LIB_WIDEN.)

- If any of the tried expanded sequences contain a funcall, then the
  optimization is not attempted.

A couple of BPF tests are also added to make sure this doesn't break
at any point in the future.

Tested in bpf-unknown-none and x86_64-linux-gnu.
Regtested in x86_64-linux-gnu.  No regressions.

gcc/ChangeLog

	* expr.cc (expand_expr_divmod): Avoid side-effects of trying
	sequences involving funcalls in optimization.

gcc/testsuite/ChangeLog:

	* gcc.target/bpf/divmod-funcall-1.c: New test.
	* gcc.target/bpf/divmod-funcall-2.c: Likewise.
---
 gcc/expr.cc                                   | 44 +++++++++++--------
 .../gcc.target/bpf/divmod-funcall-1.c         |  8 ++++
 .../gcc.target/bpf/divmod-funcall-2.c         |  8 ++++
 3 files changed, 41 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
 create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
  

Comments

Jakub Jelinek Dec. 8, 2022, 12:20 p.m. UTC | #1
On Thu, Dec 08, 2022 at 11:59:44AM +0100, Jose E. Marchesi via Gcc-patches wrote:
> gcc/ChangeLog
> 
> 	* expr.cc (expand_expr_divmod): Avoid side-effects of trying
> 	sequences involving funcalls in optimization.

That looks wrong.
The globals for mentioned calls just shouldn't be emitted during expansion,
especially if it is bigger annoyance than just having some extra symbols
in the symbol table.
expand_expr_divmod is definitely not the only place where something is
expanded and later not used, lots of other places in the expander do that,
and more importantly, there are over 80 optimization passes after expansion,
many of them can remove code determined to be dead, and while lots of dead
code is removed in GIMPLE optimizations already, definitely not all.
So, rather than add hacks for this in a single spot, much better is to emit
the globals only for stuff that is actually needed (so during final or
immediately before it).

	Jakub
  
Jose E. Marchesi Dec. 8, 2022, 1:02 p.m. UTC | #2
Hi Jakub.

> On Thu, Dec 08, 2022 at 11:59:44AM +0100, Jose E. Marchesi via Gcc-patches wrote:
>> gcc/ChangeLog
>> 
>> 	* expr.cc (expand_expr_divmod): Avoid side-effects of trying
>> 	sequences involving funcalls in optimization.
>
> That looks wrong.
> The globals for mentioned calls just shouldn't be emitted during expansion,
> especially if it is bigger annoyance than just having some extra symbols
> in the symbol table.
> expand_expr_divmod is definitely not the only place where something is
> expanded and later not used, lots of other places in the expander do that,
> and more importantly, there are over 80 optimization passes after expansion,
> many of them can remove code determined to be dead, and while lots of dead
> code is removed in GIMPLE optimizations already, definitely not all.
> So, rather than add hacks for this in a single spot, much better is to emit
> the globals only for stuff that is actually needed (so during final or
> immediately before it).

Yeah I see the point.

The culprit of the leadked .global seems to be a call to
assemble_external_libcall in emit_library_call_value_1:

expand_expr_divmod
  expand_divmod -> This will result in libcall
   sign_expand_divmod
     emit_library_call_value
       emit_library_call_value_1
         ...
         /* If this machine requires an external definition for library
            functions, write one out.  */
         assemble_external_libcall (fun);
         ...

The documented purpose of assemble_external_libcall is, as stated in
output.h, to "Assemble a string constant".

So, it seems to me that emit_library_call_value should not assemble
anything, since it is used by expand functions whose expansions may be
eventually discarded.

However, simply removing that call to assemble_external_libcall makes
.global declarations to not be emitted even when the funcall is actually
emitted in final:

For:

  int foo(unsigned int len)
  {
    return ((long)len) * 234 / 5;
  }

we get:

    .file   "foo.c"
    .text
    <------------- NO .global __divdi3
    .align  3
    .global foo
    .type   foo, @function
  foo:
    mov32   %r1,%r1
    mov     %r2,5
    mul     %r1,234
    call    __divdi3
    exit
  .size   foo, .-foo
  .ident  "GCC: (GNU) 13.0.0 20221207 (experimental)"

Note that BPF lacks signed division instructions.

So, I guess the right fix would be to call assemble_external_libcall
during final?  The `.global FOO' directive would be generated
immediately before the call sequence, but I guess that would be ok.

WDYT?
  
Jakub Jelinek Dec. 8, 2022, 1:19 p.m. UTC | #3
On Thu, Dec 08, 2022 at 02:02:36PM +0100, Jose E. Marchesi wrote:
> So, I guess the right fix would be to call assemble_external_libcall
> during final?  The `.global FOO' directive would be generated
> immediately before the call sequence, but I guess that would be ok.

During final only if all the targets can deal with the effects of
assemble_external_libcall being done in the middle of emitting assembly
for the function.

Otherwise, it could be e.g. done in the first loop of shorten_branches.

Note, in calls.cc it is done only for emit_library_call_value_1
and not for emit_call_1, so if we do it late, we need to be able to find
out what call is to a libcall and what is to a normal call.  If there is
no way to differentiate it right now, perhaps we need some flag somewhere,
say on a SYMBOL_REF.  And then assemble_external_libcall either only
if such a SYMBOL_REF appears in CALL_INSN or sibcall JUMP_INSN, or
perhaps anywhere in the function and its constant pool.

	Jakub
  
Richard Biener Dec. 8, 2022, 1:42 p.m. UTC | #4
> Am 08.12.2022 um 11:56 schrieb Jose E. Marchesi via Gcc-patches <gcc-patches@gcc.gnu.org>:
> 
> The expand_expr_divmod function in expr.cc attempts to optimize cases
> where both arguments of a division/modulus are known to be positive
> when interpreted as signed.  In these cases, both signed division and
> unsigned division will raise the same value, and therefore the
> cheapest option can be used.
> 
> In order to determine what is the cheaper option in the current
> target, expand_expr_divmod actually expands both a signed divmod and
> an unsigned divmod using local "sequences":
> 
>  start_sequence ();
>  ...
>  expand_divmod (... signed ...);
>  ...
>  end_sequence ();
> 
>  start_sequence ();
>  ...
>  expand_divmod (... unsigned ...);
>  ...
>  end_sequence ();
> 
> And then compares the cost of each generated sequence, choosing the
> best one.  Finally, it emits the selected expanded sequence and
> returns the rtx with the result.
> 
> This approach has a caveat.  Some targets do not provide instructions
> for division/modulus instructions.  In the case of BPF, it provides
> unsigned division/modulus, but not signed division/modulus.
> 
> In these cases, the expand_divmod tries can contain calls to funcalls.
> For example, in BPF:
> 
>  start_sequence ();
>  ...
>  expand_divmod (... signed ...); -> This generates funcall to __divdi3
>  ...
>  end_sequence ();
> 
>  start_sequence ();
>  ...
>  expand_divmod (... unsigned ...); -> This generates direct `div' insn.
>  ...
>  end_sequence ();
> 
> The problem is that when a funcall is expanded, an accompanying global
> symbol definition is written in the output stream:
> 
>  .global __divdi3
> 
> And this symbol definition remains in the compiled assembly file, even
> if the sequence using the direct `div' instruction above is used.
> 
> This is particularly bad in BPF, because the kernel bpf loader chokes
> on the spurious symbol __divdi3 and makes the resulting BPF object
> unloadable (note that BPF objects are not linked before processed by
> the kernel.)
> 
> In order to fix this, this patch modifies expand_expr_divmod in the
> following way:
> 
> - When trying each sequence (signed, unsigned) the expand_divmod calls
>  are told to _not_ use libcalls if everything else fails.  This is
>  done by passing OPTAB_WIDEN as the `methods' argument.  (Before it
>  was using the default value OPTAB_LIB_WIDEN.)
> 
> - If any of the tried expanded sequences contain a funcall, then the
>  optimization is not attempted.

How do libcalls appear in iff you specify OPTABS_WIDEN only?  Doesn’t that allow to simplify this and also use the sequence without a libcall?

Richard 

> 
> A couple of BPF tests are also added to make sure this doesn't break
> at any point in the future.
> 
> Tested in bpf-unknown-none and x86_64-linux-gnu.
> Regtested in x86_64-linux-gnu.  No regressions.
> 
> gcc/ChangeLog
> 
>    * expr.cc (expand_expr_divmod): Avoid side-effects of trying
>    sequences involving funcalls in optimization.
> 
> gcc/testsuite/ChangeLog:
> 
>    * gcc.target/bpf/divmod-funcall-1.c: New test.
>    * gcc.target/bpf/divmod-funcall-2.c: Likewise.
> ---
> gcc/expr.cc                                   | 44 +++++++++++--------
> .../gcc.target/bpf/divmod-funcall-1.c         |  8 ++++
> .../gcc.target/bpf/divmod-funcall-2.c         |  8 ++++
> 3 files changed, 41 insertions(+), 19 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
> create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
> 
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index d9407432ea5..4d4be5d7bda 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -9168,32 +9168,38 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>       do_pending_stack_adjust ();
>       start_sequence ();
>       rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -                   op0, op1, target, 1);
> +                   op0, op1, target, 1, OPTAB_WIDEN);
>       rtx_insn *uns_insns = get_insns ();
>       end_sequence ();
>       start_sequence ();
>       rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -                   op0, op1, target, 0);
> +                   op0, op1, target, 0, OPTAB_WIDEN);
>       rtx_insn *sgn_insns = get_insns ();
>       end_sequence ();
> -      unsigned uns_cost = seq_cost (uns_insns, speed_p);
> -      unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
> 
> -      /* If costs are the same then use as tie breaker the other other
> -     factor.  */
> -      if (uns_cost == sgn_cost)
> -    {
> -      uns_cost = seq_cost (uns_insns, !speed_p);
> -      sgn_cost = seq_cost (sgn_insns, !speed_p);
> -    }
> -
> -      if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
> -    {
> -      emit_insn (uns_insns);
> -      return uns_ret;
> -    }
> -      emit_insn (sgn_insns);
> -      return sgn_ret;
> +      /* Do not try to optimize if any of the sequences tried above
> +         resulted in a funcall.  */
> +      if (uns_ret && sgn_ret)
> +        {
> +          unsigned uns_cost = seq_cost (uns_insns, speed_p);
> +          unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
> +
> +          /* If costs are the same then use as tie breaker the other
> +             other factor.  */
> +          if (uns_cost == sgn_cost)
> +            {
> +              uns_cost = seq_cost (uns_insns, !speed_p);
> +              sgn_cost = seq_cost (sgn_insns, !speed_p);
> +            }
> +
> +          if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
> +            {
> +              emit_insn (uns_insns);
> +              return uns_ret;
> +            }
> +          emit_insn (sgn_insns);
> +          return sgn_ret;
> +        }
>     }
>   return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>            op0, op1, target, unsignedp);
> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
> new file mode 100644
> index 00000000000..dffb1506f06
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
> @@ -0,0 +1,8 @@
> +/* { dg-do compile } */
> +/* { dg-final { scan-assembler-not "__divdi3" } } */
> +
> +int
> +foo (unsigned int len)
> +{
> +  return ((unsigned long)len) * 234 / 5;
> +}
> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
> new file mode 100644
> index 00000000000..41e8e40c35c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
> @@ -0,0 +1,8 @@
> +/* { dg-do compile } */
> +/* { dg-final { scan-assembler-not "__moddi3" } } */
> +
> +int
> +foo (unsigned int len)
> +{
> +  return ((unsigned long)len) * 234 % 5;
> +}
> -- 
> 2.30.2
>
  
Jose E. Marchesi Dec. 8, 2022, 4:03 p.m. UTC | #5
>> Am 08.12.2022 um 11:56 schrieb Jose E. Marchesi via Gcc-patches <gcc-patches@gcc.gnu.org>:
>> 
>> The expand_expr_divmod function in expr.cc attempts to optimize cases
>> where both arguments of a division/modulus are known to be positive
>> when interpreted as signed.  In these cases, both signed division and
>> unsigned division will raise the same value, and therefore the
>> cheapest option can be used.
>> 
>> In order to determine what is the cheaper option in the current
>> target, expand_expr_divmod actually expands both a signed divmod and
>> an unsigned divmod using local "sequences":
>> 
>>  start_sequence ();
>>  ...
>>  expand_divmod (... signed ...);
>>  ...
>>  end_sequence ();
>> 
>>  start_sequence ();
>>  ...
>>  expand_divmod (... unsigned ...);
>>  ...
>>  end_sequence ();
>> 
>> And then compares the cost of each generated sequence, choosing the
>> best one.  Finally, it emits the selected expanded sequence and
>> returns the rtx with the result.
>> 
>> This approach has a caveat.  Some targets do not provide instructions
>> for division/modulus instructions.  In the case of BPF, it provides
>> unsigned division/modulus, but not signed division/modulus.
>> 
>> In these cases, the expand_divmod tries can contain calls to funcalls.
>> For example, in BPF:
>> 
>>  start_sequence ();
>>  ...
>>  expand_divmod (... signed ...); -> This generates funcall to __divdi3
>>  ...
>>  end_sequence ();
>> 
>>  start_sequence ();
>>  ...
>>  expand_divmod (... unsigned ...); -> This generates direct `div' insn.
>>  ...
>>  end_sequence ();
>> 
>> The problem is that when a funcall is expanded, an accompanying global
>> symbol definition is written in the output stream:
>> 
>>  .global __divdi3
>> 
>> And this symbol definition remains in the compiled assembly file, even
>> if the sequence using the direct `div' instruction above is used.
>> 
>> This is particularly bad in BPF, because the kernel bpf loader chokes
>> on the spurious symbol __divdi3 and makes the resulting BPF object
>> unloadable (note that BPF objects are not linked before processed by
>> the kernel.)
>> 
>> In order to fix this, this patch modifies expand_expr_divmod in the
>> following way:
>> 
>> - When trying each sequence (signed, unsigned) the expand_divmod calls
>>  are told to _not_ use libcalls if everything else fails.  This is
>>  done by passing OPTAB_WIDEN as the `methods' argument.  (Before it
>>  was using the default value OPTAB_LIB_WIDEN.)
>> 
>> - If any of the tried expanded sequences contain a funcall, then the
>>  optimization is not attempted.
>
> How do libcalls appear in iff you specify OPTABS_WIDEN only?  Doesn’t
> that allow to simplify this and also use the sequence without a
> libcall?

If you pass OPTABS_WIDEN only then libcalls are not an option and (as
far as I can tell) expand_divmod returns NULL if a libcall is the only
possibility.

> Richard 
>
>> 
>> A couple of BPF tests are also added to make sure this doesn't break
>> at any point in the future.
>> 
>> Tested in bpf-unknown-none and x86_64-linux-gnu.
>> Regtested in x86_64-linux-gnu.  No regressions.
>> 
>> gcc/ChangeLog
>> 
>>    * expr.cc (expand_expr_divmod): Avoid side-effects of trying
>>    sequences involving funcalls in optimization.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>    * gcc.target/bpf/divmod-funcall-1.c: New test.
>>    * gcc.target/bpf/divmod-funcall-2.c: Likewise.
>> ---
>> gcc/expr.cc                                   | 44 +++++++++++--------
>> .../gcc.target/bpf/divmod-funcall-1.c         |  8 ++++
>> .../gcc.target/bpf/divmod-funcall-2.c         |  8 ++++
>> 3 files changed, 41 insertions(+), 19 deletions(-)
>> create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
>> create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
>> 
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index d9407432ea5..4d4be5d7bda 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -9168,32 +9168,38 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>>       do_pending_stack_adjust ();
>>       start_sequence ();
>>       rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> -                   op0, op1, target, 1);
>> +                   op0, op1, target, 1, OPTAB_WIDEN);
>>       rtx_insn *uns_insns = get_insns ();
>>       end_sequence ();
>>       start_sequence ();
>>       rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> -                   op0, op1, target, 0);
>> +                   op0, op1, target, 0, OPTAB_WIDEN);
>>       rtx_insn *sgn_insns = get_insns ();
>>       end_sequence ();
>> -      unsigned uns_cost = seq_cost (uns_insns, speed_p);
>> -      unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
>> 
>> -      /* If costs are the same then use as tie breaker the other other
>> -     factor.  */
>> -      if (uns_cost == sgn_cost)
>> -    {
>> -      uns_cost = seq_cost (uns_insns, !speed_p);
>> -      sgn_cost = seq_cost (sgn_insns, !speed_p);
>> -    }
>> -
>> -      if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
>> -    {
>> -      emit_insn (uns_insns);
>> -      return uns_ret;
>> -    }
>> -      emit_insn (sgn_insns);
>> -      return sgn_ret;
>> +      /* Do not try to optimize if any of the sequences tried above
>> +         resulted in a funcall.  */
>> +      if (uns_ret && sgn_ret)
>> +        {
>> +          unsigned uns_cost = seq_cost (uns_insns, speed_p);
>> +          unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
>> +
>> +          /* If costs are the same then use as tie breaker the other
>> +             other factor.  */
>> +          if (uns_cost == sgn_cost)
>> +            {
>> +              uns_cost = seq_cost (uns_insns, !speed_p);
>> +              sgn_cost = seq_cost (sgn_insns, !speed_p);
>> +            }
>> +
>> +          if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
>> +            {
>> +              emit_insn (uns_insns);
>> +              return uns_ret;
>> +            }
>> +          emit_insn (sgn_insns);
>> +          return sgn_ret;
>> +        }
>>     }
>>   return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>>            op0, op1, target, unsignedp);
>> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
>> new file mode 100644
>> index 00000000000..dffb1506f06
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
>> @@ -0,0 +1,8 @@
>> +/* { dg-do compile } */
>> +/* { dg-final { scan-assembler-not "__divdi3" } } */
>> +
>> +int
>> +foo (unsigned int len)
>> +{
>> +  return ((unsigned long)len) * 234 / 5;
>> +}
>> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
>> new file mode 100644
>> index 00000000000..41e8e40c35c
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
>> @@ -0,0 +1,8 @@
>> +/* { dg-do compile } */
>> +/* { dg-final { scan-assembler-not "__moddi3" } } */
>> +
>> +int
>> +foo (unsigned int len)
>> +{
>> +  return ((unsigned long)len) * 234 % 5;
>> +}
>> -- 
>> 2.30.2
>>
  
Jose E. Marchesi Dec. 8, 2022, 10:40 p.m. UTC | #6
Hi Jakub.

> On Thu, Dec 08, 2022 at 02:02:36PM +0100, Jose E. Marchesi wrote:
>> So, I guess the right fix would be to call assemble_external_libcall
>> during final?  The `.global FOO' directive would be generated
>> immediately before the call sequence, but I guess that would be ok.
>
> During final only if all the targets can deal with the effects of
> assemble_external_libcall being done in the middle of emitting assembly
> for the function.
>
> Otherwise, it could be e.g. done in the first loop of shorten_branches.
>
> Note, in calls.cc it is done only for emit_library_call_value_1
> and not for emit_call_1, so if we do it late, we need to be able to find
> out what call is to a libcall and what is to a normal call.  If there is
> no way to differentiate it right now, perhaps we need some flag somewhere,
> say on a SYMBOL_REF.  And then assemble_external_libcall either only
> if such a SYMBOL_REF appears in CALL_INSN or sibcall JUMP_INSN, or
> perhaps anywhere in the function and its constant pool.

Allright, the quick-and-dirty patch below seems to DTRT with simple
examples.

First, when libcalls are generated.  Note only one .global is generated
for all calls, and actually it is around the same position than before:

  $ cat foo.c
  int foo(unsigned int len, int flag)
  {
    if (flag)
      return (((long)len) * 234 / 5);
    return (((long)len) * 2 / 5);
  }
  $ cc1 -O2 foo.c
  $ cat foo.c
	.file	"foo.c"
	.text
	.global	__divdi3
	.align	3
	.global	foo
	.type	foo, @function
  foo:
	mov32	%r1,%r1
	lsh	%r2,32
	jne	%r2,0,.L5
	mov	%r2,5
	lsh	%r1,1
	call	__divdi3
	lsh	%r0,32
	arsh	%r0,32
	exit
  .L5:
	mov	%r2,5
	mul	%r1,234
	call	__divdi3
	lsh	%r0,32
	arsh	%r0,32
	exit
	.size	foo, .-foo
	.ident	"GCC: (GNU) 13.0.0 20221207 (experimental)"

Second, when libcalls are tried by expand_moddiv in a sequence, but then
discarded and not linked in the main sequence:

  $ cat foo.c
  int foo(unsigned int len, int flag)
  {
    if (flag)
      return (((long)len) * 234 / 5);
    return (((long)len) * 2 / 5);
  }
  $ cc1 -O2 foo.c
  $ cat foo.c
	.file	"foo.c"
	.text
	.align	3
	.global	foo
	.type	foo, @function
  foo:
	mov32	%r0,%r1
	lsh	%r2,32
	jne	%r2,0,.L5
	add	%r0,%r0
	div	%r0,5
	lsh	%r0,32
	arsh	%r0,32
	exit
  .L5:
	mul	%r0,234
	div	%r0,5
	lsh	%r0,32
	arsh	%r0,32
	exit
	.size	foo, .-foo
	.ident	"GCC: (GNU) 13.0.0 20221207 (experimental)"

Note the .global now is not generated, as desired.

As you can see below, I am adding a new RTX flag `is_libcall', with
written form "/l".

Before I get into serious testing etc, can you please confirm whether
this is the right approach or not?

In particular, I am a little bit concerned about the expectation I am
using that the target of the `call' instruction emitted by emit_call_1
is always a (MEM (SYMBOL_REF ...)) when it is passed a SYMBOL_REF as the
first argument (`fun' in emit_library_call_value_1).

Thanks.

diff --git a/gcc/calls.cc b/gcc/calls.cc
index 6dd6f73e978..6c4a3725272 100644
--- a/gcc/calls.cc
+++ b/gcc/calls.cc
@@ -4370,10 +4370,6 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
 	|| argvec[i].partial != 0)
       update_stack_alignment_for_call (&argvec[i].locate);
 
-  /* If this machine requires an external definition for library
-     functions, write one out.  */
-  assemble_external_libcall (fun);
-
   original_args_size = args_size;
   args_size.constant = (aligned_upper_bound (args_size.constant
 					     + stack_pointer_delta,
@@ -4717,6 +4713,9 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
 	       valreg,
 	       old_inhibit_defer_pop + 1, call_fusage, flags, args_so_far);
 
+  /* Mark the emitted call as a libcall with the new flag.  */
+  RTL_LIBCALL_P (last_call_insn ()) = 1;
+
   if (flag_ipa_ra)
     {
       rtx datum = orgfun;
diff --git a/gcc/final.cc b/gcc/final.cc
index eea572238f6..df57de5afd0 100644
--- a/gcc/final.cc
+++ b/gcc/final.cc
@@ -815,6 +815,8 @@ make_pass_compute_alignments (gcc::context *ctxt)
    reorg.cc, since the branch splitting exposes new instructions with delay
    slots.  */
 
+static rtx call_from_call_insn (rtx_call_insn *insn);
+
 void
 shorten_branches (rtx_insn *first)
 {
@@ -850,6 +852,24 @@ shorten_branches (rtx_insn *first)
   for (insn = get_insns (), i = 1; insn; insn = NEXT_INSN (insn))
     {
       INSN_SHUID (insn) = i++;
+
+      /* If this is a `call' instruction that implements a libcall,
+         and this machine requires an external definition for library
+         functions, write one out.  */
+      if (CALL_P (insn) && RTL_LIBCALL_P (insn))
+        {
+          rtx nested_call = call_from_call_insn (safe_as_a <rtx_call_insn*> (insn));
+          rtx mem = XEXP (nested_call, 0);
+          gcc_assert (GET_CODE (mem) == MEM);
+          rtx fun = XEXP (mem, 0);
+          gcc_assert (GET_CODE (fun) == SYMBOL_REF);
+          assemble_external_libcall (fun);
+
+          /* Clear the LIBCALL flag to make sure we don't assemble the
+             external definition more than once.  */
+          RTL_LIBCALL_P (insn) = 0;
+        }
+
       if (INSN_P (insn))
 	continue;
 
diff --git a/gcc/print-rtl.cc b/gcc/print-rtl.cc
index e115f987173..26a06511619 100644
--- a/gcc/print-rtl.cc
+++ b/gcc/print-rtl.cc
@@ -882,6 +882,9 @@ rtx_writer::print_rtx (const_rtx in_rtx)
       if (RTX_FLAG (in_rtx, return_val))
 	fputs ("/i", m_outfile);
 
+      if (RTX_FLAG (in_rtx, is_libcall))
+        fputs ("/l", m_outfile);
+
       /* Print REG_NOTE names for EXPR_LIST and INSN_LIST.  */
       if ((GET_CODE (in_rtx) == EXPR_LIST
 	   || GET_CODE (in_rtx) == INSN_LIST
diff --git a/gcc/read-rtl.cc b/gcc/read-rtl.cc
index 62c7895af60..eb0ae150a5b 100644
--- a/gcc/read-rtl.cc
+++ b/gcc/read-rtl.cc
@@ -1543,6 +1543,9 @@ read_flags (rtx return_rtx)
 	  case 'i':
 	    RTX_FLAG (return_rtx, return_val) = 1;
 	    break;
+          case 'l':
+            RTX_FLAG (return_rtx, is_libcall) = 1;
+            break;
 	  default:
 	    fatal_with_file_and_line ("unrecognized flag: `%c'", flag_char);
 	}
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 7a8c4709257..92c802d3876 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -401,6 +401,10 @@ struct GTY((desc("0"), tag("0"),
      Dumped as "/i" in RTL dumps.  */
   unsigned return_val : 1;
 
+  /* 1 in a CALL_INSN if it is a libcall.
+     Dumped as "/l" in RTL dumps.  */
+  unsigned is_libcall : 1;
+
   union {
     /* The final union field is aligned to 64 bits on LP64 hosts,
        giving a 32-bit gap after the fields above.  We optimize the
@@ -1578,6 +1582,10 @@ jump_table_for_label (const rtx_code_label *label)
 #define RTL_PURE_CALL_P(RTX)					\
   (RTL_FLAG_CHECK1 ("RTL_PURE_CALL_P", (RTX), CALL_INSN)->return_val)
 
+/* 1 if RTX is a libcall.  */
+#define RTL_LIBCALL_P(RTX)                      \
+  (RTL_FLAG_CHECK1 ("RTL_LIBCALL_P", (RTX), CALL_INSN)->is_libcall)
+
 /* 1 if RTX is a call to a const or pure function.  */
 #define RTL_CONST_OR_PURE_CALL_P(RTX) \
   (RTL_CONST_CALL_P (RTX) || RTL_PURE_CALL_P (RTX))
  
Jose E. Marchesi Jan. 4, 2023, 8:58 a.m. UTC | #7
ping.
Would this be a good approach for fixing the issue?

> Hi Jakub.
>
>> On Thu, Dec 08, 2022 at 02:02:36PM +0100, Jose E. Marchesi wrote:
>>> So, I guess the right fix would be to call assemble_external_libcall
>>> during final?  The `.global FOO' directive would be generated
>>> immediately before the call sequence, but I guess that would be ok.
>>
>> During final only if all the targets can deal with the effects of
>> assemble_external_libcall being done in the middle of emitting assembly
>> for the function.
>>
>> Otherwise, it could be e.g. done in the first loop of shorten_branches.
>>
>> Note, in calls.cc it is done only for emit_library_call_value_1
>> and not for emit_call_1, so if we do it late, we need to be able to find
>> out what call is to a libcall and what is to a normal call.  If there is
>> no way to differentiate it right now, perhaps we need some flag somewhere,
>> say on a SYMBOL_REF.  And then assemble_external_libcall either only
>> if such a SYMBOL_REF appears in CALL_INSN or sibcall JUMP_INSN, or
>> perhaps anywhere in the function and its constant pool.
>
> Allright, the quick-and-dirty patch below seems to DTRT with simple
> examples.
>
> First, when libcalls are generated.  Note only one .global is generated
> for all calls, and actually it is around the same position than before:
>
>   $ cat foo.c
>   int foo(unsigned int len, int flag)
>   {
>     if (flag)
>       return (((long)len) * 234 / 5);
>     return (((long)len) * 2 / 5);
>   }
>   $ cc1 -O2 foo.c
>   $ cat foo.c
> 	.file	"foo.c"
> 	.text
> 	.global	__divdi3
> 	.align	3
> 	.global	foo
> 	.type	foo, @function
>   foo:
> 	mov32	%r1,%r1
> 	lsh	%r2,32
> 	jne	%r2,0,.L5
> 	mov	%r2,5
> 	lsh	%r1,1
> 	call	__divdi3
> 	lsh	%r0,32
> 	arsh	%r0,32
> 	exit
>   .L5:
> 	mov	%r2,5
> 	mul	%r1,234
> 	call	__divdi3
> 	lsh	%r0,32
> 	arsh	%r0,32
> 	exit
> 	.size	foo, .-foo
> 	.ident	"GCC: (GNU) 13.0.0 20221207 (experimental)"
>
> Second, when libcalls are tried by expand_moddiv in a sequence, but then
> discarded and not linked in the main sequence:
>
>   $ cat foo.c
>   int foo(unsigned int len, int flag)
>   {
>     if (flag)
>       return (((long)len) * 234 / 5);
>     return (((long)len) * 2 / 5);
>   }
>   $ cc1 -O2 foo.c
>   $ cat foo.c
> 	.file	"foo.c"
> 	.text
> 	.align	3
> 	.global	foo
> 	.type	foo, @function
>   foo:
> 	mov32	%r0,%r1
> 	lsh	%r2,32
> 	jne	%r2,0,.L5
> 	add	%r0,%r0
> 	div	%r0,5
> 	lsh	%r0,32
> 	arsh	%r0,32
> 	exit
>   .L5:
> 	mul	%r0,234
> 	div	%r0,5
> 	lsh	%r0,32
> 	arsh	%r0,32
> 	exit
> 	.size	foo, .-foo
> 	.ident	"GCC: (GNU) 13.0.0 20221207 (experimental)"
>
> Note the .global now is not generated, as desired.
>
> As you can see below, I am adding a new RTX flag `is_libcall', with
> written form "/l".
>
> Before I get into serious testing etc, can you please confirm whether
> this is the right approach or not?
>
> In particular, I am a little bit concerned about the expectation I am
> using that the target of the `call' instruction emitted by emit_call_1
> is always a (MEM (SYMBOL_REF ...)) when it is passed a SYMBOL_REF as the
> first argument (`fun' in emit_library_call_value_1).
>
> Thanks.
>
> diff --git a/gcc/calls.cc b/gcc/calls.cc
> index 6dd6f73e978..6c4a3725272 100644
> --- a/gcc/calls.cc
> +++ b/gcc/calls.cc
> @@ -4370,10 +4370,6 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
>  	|| argvec[i].partial != 0)
>        update_stack_alignment_for_call (&argvec[i].locate);
>  
> -  /* If this machine requires an external definition for library
> -     functions, write one out.  */
> -  assemble_external_libcall (fun);
> -
>    original_args_size = args_size;
>    args_size.constant = (aligned_upper_bound (args_size.constant
>  					     + stack_pointer_delta,
> @@ -4717,6 +4713,9 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
>  	       valreg,
>  	       old_inhibit_defer_pop + 1, call_fusage, flags, args_so_far);
>  
> +  /* Mark the emitted call as a libcall with the new flag.  */
> +  RTL_LIBCALL_P (last_call_insn ()) = 1;
> +
>    if (flag_ipa_ra)
>      {
>        rtx datum = orgfun;
> diff --git a/gcc/final.cc b/gcc/final.cc
> index eea572238f6..df57de5afd0 100644
> --- a/gcc/final.cc
> +++ b/gcc/final.cc
> @@ -815,6 +815,8 @@ make_pass_compute_alignments (gcc::context *ctxt)
>     reorg.cc, since the branch splitting exposes new instructions with delay
>     slots.  */
>  
> +static rtx call_from_call_insn (rtx_call_insn *insn);
> +
>  void
>  shorten_branches (rtx_insn *first)
>  {
> @@ -850,6 +852,24 @@ shorten_branches (rtx_insn *first)
>    for (insn = get_insns (), i = 1; insn; insn = NEXT_INSN (insn))
>      {
>        INSN_SHUID (insn) = i++;
> +
> +      /* If this is a `call' instruction that implements a libcall,
> +         and this machine requires an external definition for library
> +         functions, write one out.  */
> +      if (CALL_P (insn) && RTL_LIBCALL_P (insn))
> +        {
> +          rtx nested_call = call_from_call_insn (safe_as_a <rtx_call_insn*> (insn));
> +          rtx mem = XEXP (nested_call, 0);
> +          gcc_assert (GET_CODE (mem) == MEM);
> +          rtx fun = XEXP (mem, 0);
> +          gcc_assert (GET_CODE (fun) == SYMBOL_REF);
> +          assemble_external_libcall (fun);
> +
> +          /* Clear the LIBCALL flag to make sure we don't assemble the
> +             external definition more than once.  */
> +          RTL_LIBCALL_P (insn) = 0;
> +        }
> +
>        if (INSN_P (insn))
>  	continue;
>  
> diff --git a/gcc/print-rtl.cc b/gcc/print-rtl.cc
> index e115f987173..26a06511619 100644
> --- a/gcc/print-rtl.cc
> +++ b/gcc/print-rtl.cc
> @@ -882,6 +882,9 @@ rtx_writer::print_rtx (const_rtx in_rtx)
>        if (RTX_FLAG (in_rtx, return_val))
>  	fputs ("/i", m_outfile);
>  
> +      if (RTX_FLAG (in_rtx, is_libcall))
> +        fputs ("/l", m_outfile);
> +
>        /* Print REG_NOTE names for EXPR_LIST and INSN_LIST.  */
>        if ((GET_CODE (in_rtx) == EXPR_LIST
>  	   || GET_CODE (in_rtx) == INSN_LIST
> diff --git a/gcc/read-rtl.cc b/gcc/read-rtl.cc
> index 62c7895af60..eb0ae150a5b 100644
> --- a/gcc/read-rtl.cc
> +++ b/gcc/read-rtl.cc
> @@ -1543,6 +1543,9 @@ read_flags (rtx return_rtx)
>  	  case 'i':
>  	    RTX_FLAG (return_rtx, return_val) = 1;
>  	    break;
> +          case 'l':
> +            RTX_FLAG (return_rtx, is_libcall) = 1;
> +            break;
>  	  default:
>  	    fatal_with_file_and_line ("unrecognized flag: `%c'", flag_char);
>  	}
> diff --git a/gcc/rtl.h b/gcc/rtl.h
> index 7a8c4709257..92c802d3876 100644
> --- a/gcc/rtl.h
> +++ b/gcc/rtl.h
> @@ -401,6 +401,10 @@ struct GTY((desc("0"), tag("0"),
>       Dumped as "/i" in RTL dumps.  */
>    unsigned return_val : 1;
>  
> +  /* 1 in a CALL_INSN if it is a libcall.
> +     Dumped as "/l" in RTL dumps.  */
> +  unsigned is_libcall : 1;
> +
>    union {
>      /* The final union field is aligned to 64 bits on LP64 hosts,
>         giving a 32-bit gap after the fields above.  We optimize the
> @@ -1578,6 +1582,10 @@ jump_table_for_label (const rtx_code_label *label)
>  #define RTL_PURE_CALL_P(RTX)					\
>    (RTL_FLAG_CHECK1 ("RTL_PURE_CALL_P", (RTX), CALL_INSN)->return_val)
>  
> +/* 1 if RTX is a libcall.  */
> +#define RTL_LIBCALL_P(RTX)                      \
> +  (RTL_FLAG_CHECK1 ("RTL_LIBCALL_P", (RTX), CALL_INSN)->is_libcall)
> +
>  /* 1 if RTX is a call to a const or pure function.  */
>  #define RTL_CONST_OR_PURE_CALL_P(RTX) \
>    (RTL_CONST_CALL_P (RTX) || RTL_PURE_CALL_P (RTX))
  
Richard Biener Jan. 9, 2023, 8:05 a.m. UTC | #8
On Wed, Jan 4, 2023 at 9:54 AM Jose E. Marchesi via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
> ping.
> Would this be a good approach for fixing the issue?

adding the is_libcall bit enlarges rtx_def by 8 bytes - there's no room for more
bits here.

I really wonder how other targets avoid the issue you are pointing out?
Do their assemblers prune unused (extern) .global?

> > Hi Jakub.
> >
> >> On Thu, Dec 08, 2022 at 02:02:36PM +0100, Jose E. Marchesi wrote:
> >>> So, I guess the right fix would be to call assemble_external_libcall
> >>> during final?  The `.global FOO' directive would be generated
> >>> immediately before the call sequence, but I guess that would be ok.
> >>
> >> During final only if all the targets can deal with the effects of
> >> assemble_external_libcall being done in the middle of emitting assembly
> >> for the function.
> >>
> >> Otherwise, it could be e.g. done in the first loop of shorten_branches.
> >>
> >> Note, in calls.cc it is done only for emit_library_call_value_1
> >> and not for emit_call_1, so if we do it late, we need to be able to find
> >> out what call is to a libcall and what is to a normal call.  If there is
> >> no way to differentiate it right now, perhaps we need some flag somewhere,
> >> say on a SYMBOL_REF.  And then assemble_external_libcall either only
> >> if such a SYMBOL_REF appears in CALL_INSN or sibcall JUMP_INSN, or
> >> perhaps anywhere in the function and its constant pool.
> >
> > Allright, the quick-and-dirty patch below seems to DTRT with simple
> > examples.
> >
> > First, when libcalls are generated.  Note only one .global is generated
> > for all calls, and actually it is around the same position than before:
> >
> >   $ cat foo.c
> >   int foo(unsigned int len, int flag)
> >   {
> >     if (flag)
> >       return (((long)len) * 234 / 5);
> >     return (((long)len) * 2 / 5);
> >   }
> >   $ cc1 -O2 foo.c
> >   $ cat foo.c
> >       .file   "foo.c"
> >       .text
> >       .global __divdi3
> >       .align  3
> >       .global foo
> >       .type   foo, @function
> >   foo:
> >       mov32   %r1,%r1
> >       lsh     %r2,32
> >       jne     %r2,0,.L5
> >       mov     %r2,5
> >       lsh     %r1,1
> >       call    __divdi3
> >       lsh     %r0,32
> >       arsh    %r0,32
> >       exit
> >   .L5:
> >       mov     %r2,5
> >       mul     %r1,234
> >       call    __divdi3
> >       lsh     %r0,32
> >       arsh    %r0,32
> >       exit
> >       .size   foo, .-foo
> >       .ident  "GCC: (GNU) 13.0.0 20221207 (experimental)"
> >
> > Second, when libcalls are tried by expand_moddiv in a sequence, but then
> > discarded and not linked in the main sequence:
> >
> >   $ cat foo.c
> >   int foo(unsigned int len, int flag)
> >   {
> >     if (flag)
> >       return (((long)len) * 234 / 5);
> >     return (((long)len) * 2 / 5);
> >   }
> >   $ cc1 -O2 foo.c
> >   $ cat foo.c
> >       .file   "foo.c"
> >       .text
> >       .align  3
> >       .global foo
> >       .type   foo, @function
> >   foo:
> >       mov32   %r0,%r1
> >       lsh     %r2,32
> >       jne     %r2,0,.L5
> >       add     %r0,%r0
> >       div     %r0,5
> >       lsh     %r0,32
> >       arsh    %r0,32
> >       exit
> >   .L5:
> >       mul     %r0,234
> >       div     %r0,5
> >       lsh     %r0,32
> >       arsh    %r0,32
> >       exit
> >       .size   foo, .-foo
> >       .ident  "GCC: (GNU) 13.0.0 20221207 (experimental)"
> >
> > Note the .global now is not generated, as desired.
> >
> > As you can see below, I am adding a new RTX flag `is_libcall', with
> > written form "/l".
> >
> > Before I get into serious testing etc, can you please confirm whether
> > this is the right approach or not?
> >
> > In particular, I am a little bit concerned about the expectation I am
> > using that the target of the `call' instruction emitted by emit_call_1
> > is always a (MEM (SYMBOL_REF ...)) when it is passed a SYMBOL_REF as the
> > first argument (`fun' in emit_library_call_value_1).
> >
> > Thanks.
> >
> > diff --git a/gcc/calls.cc b/gcc/calls.cc
> > index 6dd6f73e978..6c4a3725272 100644
> > --- a/gcc/calls.cc
> > +++ b/gcc/calls.cc
> > @@ -4370,10 +4370,6 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
> >       || argvec[i].partial != 0)
> >        update_stack_alignment_for_call (&argvec[i].locate);
> >
> > -  /* If this machine requires an external definition for library
> > -     functions, write one out.  */
> > -  assemble_external_libcall (fun);
> > -
> >    original_args_size = args_size;
> >    args_size.constant = (aligned_upper_bound (args_size.constant
> >                                            + stack_pointer_delta,
> > @@ -4717,6 +4713,9 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
> >              valreg,
> >              old_inhibit_defer_pop + 1, call_fusage, flags, args_so_far);
> >
> > +  /* Mark the emitted call as a libcall with the new flag.  */
> > +  RTL_LIBCALL_P (last_call_insn ()) = 1;
> > +
> >    if (flag_ipa_ra)
> >      {
> >        rtx datum = orgfun;
> > diff --git a/gcc/final.cc b/gcc/final.cc
> > index eea572238f6..df57de5afd0 100644
> > --- a/gcc/final.cc
> > +++ b/gcc/final.cc
> > @@ -815,6 +815,8 @@ make_pass_compute_alignments (gcc::context *ctxt)
> >     reorg.cc, since the branch splitting exposes new instructions with delay
> >     slots.  */
> >
> > +static rtx call_from_call_insn (rtx_call_insn *insn);
> > +
> >  void
> >  shorten_branches (rtx_insn *first)
> >  {
> > @@ -850,6 +852,24 @@ shorten_branches (rtx_insn *first)
> >    for (insn = get_insns (), i = 1; insn; insn = NEXT_INSN (insn))
> >      {
> >        INSN_SHUID (insn) = i++;
> > +
> > +      /* If this is a `call' instruction that implements a libcall,
> > +         and this machine requires an external definition for library
> > +         functions, write one out.  */
> > +      if (CALL_P (insn) && RTL_LIBCALL_P (insn))
> > +        {
> > +          rtx nested_call = call_from_call_insn (safe_as_a <rtx_call_insn*> (insn));
> > +          rtx mem = XEXP (nested_call, 0);
> > +          gcc_assert (GET_CODE (mem) == MEM);
> > +          rtx fun = XEXP (mem, 0);
> > +          gcc_assert (GET_CODE (fun) == SYMBOL_REF);
> > +          assemble_external_libcall (fun);
> > +
> > +          /* Clear the LIBCALL flag to make sure we don't assemble the
> > +             external definition more than once.  */
> > +          RTL_LIBCALL_P (insn) = 0;
> > +        }
> > +
> >        if (INSN_P (insn))
> >       continue;
> >
> > diff --git a/gcc/print-rtl.cc b/gcc/print-rtl.cc
> > index e115f987173..26a06511619 100644
> > --- a/gcc/print-rtl.cc
> > +++ b/gcc/print-rtl.cc
> > @@ -882,6 +882,9 @@ rtx_writer::print_rtx (const_rtx in_rtx)
> >        if (RTX_FLAG (in_rtx, return_val))
> >       fputs ("/i", m_outfile);
> >
> > +      if (RTX_FLAG (in_rtx, is_libcall))
> > +        fputs ("/l", m_outfile);
> > +
> >        /* Print REG_NOTE names for EXPR_LIST and INSN_LIST.  */
> >        if ((GET_CODE (in_rtx) == EXPR_LIST
> >          || GET_CODE (in_rtx) == INSN_LIST
> > diff --git a/gcc/read-rtl.cc b/gcc/read-rtl.cc
> > index 62c7895af60..eb0ae150a5b 100644
> > --- a/gcc/read-rtl.cc
> > +++ b/gcc/read-rtl.cc
> > @@ -1543,6 +1543,9 @@ read_flags (rtx return_rtx)
> >         case 'i':
> >           RTX_FLAG (return_rtx, return_val) = 1;
> >           break;
> > +          case 'l':
> > +            RTX_FLAG (return_rtx, is_libcall) = 1;
> > +            break;
> >         default:
> >           fatal_with_file_and_line ("unrecognized flag: `%c'", flag_char);
> >       }
> > diff --git a/gcc/rtl.h b/gcc/rtl.h
> > index 7a8c4709257..92c802d3876 100644
> > --- a/gcc/rtl.h
> > +++ b/gcc/rtl.h
> > @@ -401,6 +401,10 @@ struct GTY((desc("0"), tag("0"),
> >       Dumped as "/i" in RTL dumps.  */
> >    unsigned return_val : 1;
> >
> > +  /* 1 in a CALL_INSN if it is a libcall.
> > +     Dumped as "/l" in RTL dumps.  */
> > +  unsigned is_libcall : 1;
> > +
> >    union {
> >      /* The final union field is aligned to 64 bits on LP64 hosts,
> >         giving a 32-bit gap after the fields above.  We optimize the
> > @@ -1578,6 +1582,10 @@ jump_table_for_label (const rtx_code_label *label)
> >  #define RTL_PURE_CALL_P(RTX)                                 \
> >    (RTL_FLAG_CHECK1 ("RTL_PURE_CALL_P", (RTX), CALL_INSN)->return_val)
> >
> > +/* 1 if RTX is a libcall.  */
> > +#define RTL_LIBCALL_P(RTX)                      \
> > +  (RTL_FLAG_CHECK1 ("RTL_LIBCALL_P", (RTX), CALL_INSN)->is_libcall)
> > +
> >  /* 1 if RTX is a call to a const or pure function.  */
> >  #define RTL_CONST_OR_PURE_CALL_P(RTX) \
> >    (RTL_CONST_CALL_P (RTX) || RTL_PURE_CALL_P (RTX))
  
Jakub Jelinek Jan. 9, 2023, 9:57 a.m. UTC | #9
On Mon, Jan 09, 2023 at 09:05:26AM +0100, Richard Biener wrote:
> On Wed, Jan 4, 2023 at 9:54 AM Jose E. Marchesi via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> >
> > ping.
> > Would this be a good approach for fixing the issue?
> 
> adding the is_libcall bit enlarges rtx_def by 8 bytes - there's no room for more
> bits here.

That is obviously not the way to go, sure.

> I really wonder how other targets avoid the issue you are pointing out?
> Do their assemblers prune unused (extern) .global?

I think no target solves this, if they see an extern call during expansion
and emit some directive for those, they emit the global or whatever directive
which remains there.

If all bits for CALL_INSN are taken, can't we add a flag on the CALL
rtx inside of the CALL_INSN pattern?  Or a flag on the SYMBOL_REF inside of
it (libcalls are always direct calls, aren't they) or SYMBOL_REF_FLAGS ?

	Jakub
  
Richard Biener Jan. 9, 2023, 1:04 p.m. UTC | #10
On Mon, Jan 9, 2023 at 10:58 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Mon, Jan 09, 2023 at 09:05:26AM +0100, Richard Biener wrote:
> > On Wed, Jan 4, 2023 at 9:54 AM Jose E. Marchesi via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > >
> > > ping.
> > > Would this be a good approach for fixing the issue?
> >
> > adding the is_libcall bit enlarges rtx_def by 8 bytes - there's no room for more
> > bits here.
>
> That is obviously not the way to go, sure.
>
> > I really wonder how other targets avoid the issue you are pointing out?
> > Do their assemblers prune unused (extern) .global?
>
> I think no target solves this, if they see an extern call during expansion
> and emit some directive for those, they emit the global or whatever directive
> which remains there.
>
> If all bits for CALL_INSN are taken, can't we add a flag on the CALL
> rtx inside of the CALL_INSN pattern?  Or a flag on the SYMBOL_REF inside of
> it (libcalls are always direct calls, aren't they) or SYMBOL_REF_FLAGS ?

I suppose the SYMBOL_REF would be what I'd target here.  Note we already
have

/* 1 if RTX is a symbol_ref that has been the library function in
   emit_library_call.  */
#define SYMBOL_REF_USED(RTX)                                            \
  (RTL_FLAG_CHECK1 ("SYMBOL_REF_USED", (RTX), SYMBOL_REF)->used)

so can't we just use that during the final scan for the delayed assembling?

>
>         Jakub
>
  
Jakub Jelinek Jan. 9, 2023, 1:25 p.m. UTC | #11
On Mon, Jan 09, 2023 at 02:04:48PM +0100, Richard Biener via Gcc-patches wrote:
> On Mon, Jan 9, 2023 at 10:58 AM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Mon, Jan 09, 2023 at 09:05:26AM +0100, Richard Biener wrote:
> > > On Wed, Jan 4, 2023 at 9:54 AM Jose E. Marchesi via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > >
> > > > ping.
> > > > Would this be a good approach for fixing the issue?
> > >
> > > adding the is_libcall bit enlarges rtx_def by 8 bytes - there's no room for more
> > > bits here.
> >
> > That is obviously not the way to go, sure.
> >
> > > I really wonder how other targets avoid the issue you are pointing out?
> > > Do their assemblers prune unused (extern) .global?
> >
> > I think no target solves this, if they see an extern call during expansion
> > and emit some directive for those, they emit the global or whatever directive
> > which remains there.
> >
> > If all bits for CALL_INSN are taken, can't we add a flag on the CALL
> > rtx inside of the CALL_INSN pattern?  Or a flag on the SYMBOL_REF inside of
> > it (libcalls are always direct calls, aren't they) or SYMBOL_REF_FLAGS ?
> 
> I suppose the SYMBOL_REF would be what I'd target here.  Note we already
> have
> 
> /* 1 if RTX is a symbol_ref that has been the library function in
>    emit_library_call.  */
> #define SYMBOL_REF_USED(RTX)                                            \
>   (RTL_FLAG_CHECK1 ("SYMBOL_REF_USED", (RTX), SYMBOL_REF)->used)
> 
> so can't we just use that during the final scan for the delayed assembling?

No, this one can't, it is used to avoid emitting the external directive
multiple times.  We need something next to it to identify for which symbols
that should be done.  Or of course if we are really out of bits that could
be used for it, the above could be repurposed for SYMBOL_REF_LIBCALL and
the current SYMBOL_REF_USED could be handled with a hash set.

	Jakub
  
Jeff Law Jan. 9, 2023, 2:01 p.m. UTC | #12
On 1/9/23 02:57, Jakub Jelinek via Gcc-patches wrote:
> On Mon, Jan 09, 2023 at 09:05:26AM +0100, Richard Biener wrote:
>> On Wed, Jan 4, 2023 at 9:54 AM Jose E. Marchesi via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>>
>>> ping.
>>> Would this be a good approach for fixing the issue?
>>
>> adding the is_libcall bit enlarges rtx_def by 8 bytes - there's no room for more
>> bits here.
> 
> That is obviously not the way to go, sure.
> 
>> I really wonder how other targets avoid the issue you are pointing out?
>> Do their assemblers prune unused (extern) .global?
> 
> I think no target solves this, if they see an extern call during expansion
> and emit some directive for those, they emit the global or whatever directive
> which remains there.
> 
> If all bits for CALL_INSN are taken, can't we add a flag on the CALL
> rtx inside of the CALL_INSN pattern?  Or a flag on the SYMBOL_REF inside of
> it (libcalls are always direct calls, aren't they) or SYMBOL_REF_FLAGS ?
You might look at 32bit PA SOM.  It was always a bit odd in this respect.

You had to import every external symbol explicitly and it disliked 
importing something that wasn't used.  I recall some special handling 
for libcalls as well.

Jeff
  
Andrew Pinski Jan. 30, 2023, 6:45 p.m. UTC | #13
On Thu, Dec 8, 2022 at 2:56 AM Jose E. Marchesi via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The expand_expr_divmod function in expr.cc attempts to optimize cases
> where both arguments of a division/modulus are known to be positive
> when interpreted as signed.  In these cases, both signed division and
> unsigned division will raise the same value, and therefore the
> cheapest option can be used.

I suspect this issue is also the same as PR 48783 .

Thanks,
Andrew

>
> In order to determine what is the cheaper option in the current
> target, expand_expr_divmod actually expands both a signed divmod and
> an unsigned divmod using local "sequences":
>
>   start_sequence ();
>   ...
>   expand_divmod (... signed ...);
>   ...
>   end_sequence ();
>
>   start_sequence ();
>   ...
>   expand_divmod (... unsigned ...);
>   ...
>   end_sequence ();
>
> And then compares the cost of each generated sequence, choosing the
> best one.  Finally, it emits the selected expanded sequence and
> returns the rtx with the result.
>
> This approach has a caveat.  Some targets do not provide instructions
> for division/modulus instructions.  In the case of BPF, it provides
> unsigned division/modulus, but not signed division/modulus.
>
> In these cases, the expand_divmod tries can contain calls to funcalls.
> For example, in BPF:
>
>   start_sequence ();
>   ...
>   expand_divmod (... signed ...); -> This generates funcall to __divdi3
>   ...
>   end_sequence ();
>
>   start_sequence ();
>   ...
>   expand_divmod (... unsigned ...); -> This generates direct `div' insn.
>   ...
>   end_sequence ();
>
> The problem is that when a funcall is expanded, an accompanying global
> symbol definition is written in the output stream:
>
>   .global __divdi3
>
> And this symbol definition remains in the compiled assembly file, even
> if the sequence using the direct `div' instruction above is used.
>
> This is particularly bad in BPF, because the kernel bpf loader chokes
> on the spurious symbol __divdi3 and makes the resulting BPF object
> unloadable (note that BPF objects are not linked before processed by
> the kernel.)
>
> In order to fix this, this patch modifies expand_expr_divmod in the
> following way:
>
> - When trying each sequence (signed, unsigned) the expand_divmod calls
>   are told to _not_ use libcalls if everything else fails.  This is
>   done by passing OPTAB_WIDEN as the `methods' argument.  (Before it
>   was using the default value OPTAB_LIB_WIDEN.)
>
> - If any of the tried expanded sequences contain a funcall, then the
>   optimization is not attempted.
>
> A couple of BPF tests are also added to make sure this doesn't break
> at any point in the future.
>
> Tested in bpf-unknown-none and x86_64-linux-gnu.
> Regtested in x86_64-linux-gnu.  No regressions.
>
> gcc/ChangeLog
>
>         * expr.cc (expand_expr_divmod): Avoid side-effects of trying
>         sequences involving funcalls in optimization.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/bpf/divmod-funcall-1.c: New test.
>         * gcc.target/bpf/divmod-funcall-2.c: Likewise.
> ---
>  gcc/expr.cc                                   | 44 +++++++++++--------
>  .../gcc.target/bpf/divmod-funcall-1.c         |  8 ++++
>  .../gcc.target/bpf/divmod-funcall-2.c         |  8 ++++
>  3 files changed, 41 insertions(+), 19 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
>  create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index d9407432ea5..4d4be5d7bda 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -9168,32 +9168,38 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>        do_pending_stack_adjust ();
>        start_sequence ();
>        rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -                                  op0, op1, target, 1);
> +                                  op0, op1, target, 1, OPTAB_WIDEN);
>        rtx_insn *uns_insns = get_insns ();
>        end_sequence ();
>        start_sequence ();
>        rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
> -                                  op0, op1, target, 0);
> +                                  op0, op1, target, 0, OPTAB_WIDEN);
>        rtx_insn *sgn_insns = get_insns ();
>        end_sequence ();
> -      unsigned uns_cost = seq_cost (uns_insns, speed_p);
> -      unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
>
> -      /* If costs are the same then use as tie breaker the other other
> -        factor.  */
> -      if (uns_cost == sgn_cost)
> -       {
> -         uns_cost = seq_cost (uns_insns, !speed_p);
> -         sgn_cost = seq_cost (sgn_insns, !speed_p);
> -       }
> -
> -      if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
> -       {
> -         emit_insn (uns_insns);
> -         return uns_ret;
> -       }
> -      emit_insn (sgn_insns);
> -      return sgn_ret;
> +      /* Do not try to optimize if any of the sequences tried above
> +         resulted in a funcall.  */
> +      if (uns_ret && sgn_ret)
> +        {
> +          unsigned uns_cost = seq_cost (uns_insns, speed_p);
> +          unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
> +
> +          /* If costs are the same then use as tie breaker the other
> +             other factor.  */
> +          if (uns_cost == sgn_cost)
> +            {
> +              uns_cost = seq_cost (uns_insns, !speed_p);
> +              sgn_cost = seq_cost (sgn_insns, !speed_p);
> +            }
> +
> +          if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
> +            {
> +              emit_insn (uns_insns);
> +              return uns_ret;
> +            }
> +          emit_insn (sgn_insns);
> +          return sgn_ret;
> +        }
>      }
>    return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>                         op0, op1, target, unsignedp);
> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
> new file mode 100644
> index 00000000000..dffb1506f06
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
> @@ -0,0 +1,8 @@
> +/* { dg-do compile } */
> +/* { dg-final { scan-assembler-not "__divdi3" } } */
> +
> +int
> +foo (unsigned int len)
> +{
> +  return ((unsigned long)len) * 234 / 5;
> +}
> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
> new file mode 100644
> index 00000000000..41e8e40c35c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
> @@ -0,0 +1,8 @@
> +/* { dg-do compile } */
> +/* { dg-final { scan-assembler-not "__moddi3" } } */
> +
> +int
> +foo (unsigned int len)
> +{
> +  return ((unsigned long)len) * 234 % 5;
> +}
> --
> 2.30.2
>
  
Jose E. Marchesi Jan. 30, 2023, 6:55 p.m. UTC | #14
> On Thu, Dec 8, 2022 at 2:56 AM Jose E. Marchesi via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> The expand_expr_divmod function in expr.cc attempts to optimize cases
>> where both arguments of a division/modulus are known to be positive
>> when interpreted as signed.  In these cases, both signed division and
>> unsigned division will raise the same value, and therefore the
>> cheapest option can be used.
>
> I suspect this issue is also the same as PR 48783 .

Yeah I kinda dropped the ball here.

Will look into Jeff's suggestions on how to fix this without exhausting
the CALL_INSN bits.

> Thanks,
> Andrew
>
>>
>> In order to determine what is the cheaper option in the current
>> target, expand_expr_divmod actually expands both a signed divmod and
>> an unsigned divmod using local "sequences":
>>
>>   start_sequence ();
>>   ...
>>   expand_divmod (... signed ...);
>>   ...
>>   end_sequence ();
>>
>>   start_sequence ();
>>   ...
>>   expand_divmod (... unsigned ...);
>>   ...
>>   end_sequence ();
>>
>> And then compares the cost of each generated sequence, choosing the
>> best one.  Finally, it emits the selected expanded sequence and
>> returns the rtx with the result.
>>
>> This approach has a caveat.  Some targets do not provide instructions
>> for division/modulus instructions.  In the case of BPF, it provides
>> unsigned division/modulus, but not signed division/modulus.
>>
>> In these cases, the expand_divmod tries can contain calls to funcalls.
>> For example, in BPF:
>>
>>   start_sequence ();
>>   ...
>>   expand_divmod (... signed ...); -> This generates funcall to __divdi3
>>   ...
>>   end_sequence ();
>>
>>   start_sequence ();
>>   ...
>>   expand_divmod (... unsigned ...); -> This generates direct `div' insn.
>>   ...
>>   end_sequence ();
>>
>> The problem is that when a funcall is expanded, an accompanying global
>> symbol definition is written in the output stream:
>>
>>   .global __divdi3
>>
>> And this symbol definition remains in the compiled assembly file, even
>> if the sequence using the direct `div' instruction above is used.
>>
>> This is particularly bad in BPF, because the kernel bpf loader chokes
>> on the spurious symbol __divdi3 and makes the resulting BPF object
>> unloadable (note that BPF objects are not linked before processed by
>> the kernel.)
>>
>> In order to fix this, this patch modifies expand_expr_divmod in the
>> following way:
>>
>> - When trying each sequence (signed, unsigned) the expand_divmod calls
>>   are told to _not_ use libcalls if everything else fails.  This is
>>   done by passing OPTAB_WIDEN as the `methods' argument.  (Before it
>>   was using the default value OPTAB_LIB_WIDEN.)
>>
>> - If any of the tried expanded sequences contain a funcall, then the
>>   optimization is not attempted.
>>
>> A couple of BPF tests are also added to make sure this doesn't break
>> at any point in the future.
>>
>> Tested in bpf-unknown-none and x86_64-linux-gnu.
>> Regtested in x86_64-linux-gnu.  No regressions.
>>
>> gcc/ChangeLog
>>
>>         * expr.cc (expand_expr_divmod): Avoid side-effects of trying
>>         sequences involving funcalls in optimization.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/bpf/divmod-funcall-1.c: New test.
>>         * gcc.target/bpf/divmod-funcall-2.c: Likewise.
>> ---
>>  gcc/expr.cc                                   | 44 +++++++++++--------
>>  .../gcc.target/bpf/divmod-funcall-1.c         |  8 ++++
>>  .../gcc.target/bpf/divmod-funcall-2.c         |  8 ++++
>>  3 files changed, 41 insertions(+), 19 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
>>  create mode 100644 gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
>>
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index d9407432ea5..4d4be5d7bda 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -9168,32 +9168,38 @@ expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
>>        do_pending_stack_adjust ();
>>        start_sequence ();
>>        rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> -                                  op0, op1, target, 1);
>> +                                  op0, op1, target, 1, OPTAB_WIDEN);
>>        rtx_insn *uns_insns = get_insns ();
>>        end_sequence ();
>>        start_sequence ();
>>        rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
>> -                                  op0, op1, target, 0);
>> +                                  op0, op1, target, 0, OPTAB_WIDEN);
>>        rtx_insn *sgn_insns = get_insns ();
>>        end_sequence ();
>> -      unsigned uns_cost = seq_cost (uns_insns, speed_p);
>> -      unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
>>
>> -      /* If costs are the same then use as tie breaker the other other
>> -        factor.  */
>> -      if (uns_cost == sgn_cost)
>> -       {
>> -         uns_cost = seq_cost (uns_insns, !speed_p);
>> -         sgn_cost = seq_cost (sgn_insns, !speed_p);
>> -       }
>> -
>> -      if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
>> -       {
>> -         emit_insn (uns_insns);
>> -         return uns_ret;
>> -       }
>> -      emit_insn (sgn_insns);
>> -      return sgn_ret;
>> +      /* Do not try to optimize if any of the sequences tried above
>> +         resulted in a funcall.  */
>> +      if (uns_ret && sgn_ret)
>> +        {
>> +          unsigned uns_cost = seq_cost (uns_insns, speed_p);
>> +          unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
>> +
>> +          /* If costs are the same then use as tie breaker the other
>> +             other factor.  */
>> +          if (uns_cost == sgn_cost)
>> +            {
>> +              uns_cost = seq_cost (uns_insns, !speed_p);
>> +              sgn_cost = seq_cost (sgn_insns, !speed_p);
>> +            }
>> +
>> +          if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
>> +            {
>> +              emit_insn (uns_insns);
>> +              return uns_ret;
>> +            }
>> +          emit_insn (sgn_insns);
>> +          return sgn_ret;
>> +        }
>>      }
>>    return expand_divmod (mod_p, code, mode, treeop0, treeop1,
>>                         op0, op1, target, unsignedp);
>> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
>> new file mode 100644
>> index 00000000000..dffb1506f06
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
>> @@ -0,0 +1,8 @@
>> +/* { dg-do compile } */
>> +/* { dg-final { scan-assembler-not "__divdi3" } } */
>> +
>> +int
>> +foo (unsigned int len)
>> +{
>> +  return ((unsigned long)len) * 234 / 5;
>> +}
>> diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
>> new file mode 100644
>> index 00000000000..41e8e40c35c
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
>> @@ -0,0 +1,8 @@
>> +/* { dg-do compile } */
>> +/* { dg-final { scan-assembler-not "__moddi3" } } */
>> +
>> +int
>> +foo (unsigned int len)
>> +{
>> +  return ((unsigned long)len) * 234 % 5;
>> +}
>> --
>> 2.30.2
>>
  

Patch

diff --git a/gcc/expr.cc b/gcc/expr.cc
index d9407432ea5..4d4be5d7bda 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -9168,32 +9168,38 @@  expand_expr_divmod (tree_code code, machine_mode mode, tree treeop0,
       do_pending_stack_adjust ();
       start_sequence ();
       rtx uns_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 1);
+				   op0, op1, target, 1, OPTAB_WIDEN);
       rtx_insn *uns_insns = get_insns ();
       end_sequence ();
       start_sequence ();
       rtx sgn_ret = expand_divmod (mod_p, code, mode, treeop0, treeop1,
-				   op0, op1, target, 0);
+				   op0, op1, target, 0, OPTAB_WIDEN);
       rtx_insn *sgn_insns = get_insns ();
       end_sequence ();
-      unsigned uns_cost = seq_cost (uns_insns, speed_p);
-      unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
 
-      /* If costs are the same then use as tie breaker the other other
-	 factor.  */
-      if (uns_cost == sgn_cost)
-	{
-	  uns_cost = seq_cost (uns_insns, !speed_p);
-	  sgn_cost = seq_cost (sgn_insns, !speed_p);
-	}
-
-      if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
-	{
-	  emit_insn (uns_insns);
-	  return uns_ret;
-	}
-      emit_insn (sgn_insns);
-      return sgn_ret;
+      /* Do not try to optimize if any of the sequences tried above
+         resulted in a funcall.  */
+      if (uns_ret && sgn_ret)
+        {
+          unsigned uns_cost = seq_cost (uns_insns, speed_p);
+          unsigned sgn_cost = seq_cost (sgn_insns, speed_p);
+
+          /* If costs are the same then use as tie breaker the other
+             other factor.  */
+          if (uns_cost == sgn_cost)
+            {
+              uns_cost = seq_cost (uns_insns, !speed_p);
+              sgn_cost = seq_cost (sgn_insns, !speed_p);
+            }
+
+          if (uns_cost < sgn_cost || (uns_cost == sgn_cost && unsignedp))
+            {
+              emit_insn (uns_insns);
+              return uns_ret;
+            }
+          emit_insn (sgn_insns);
+          return sgn_ret;
+        }
     }
   return expand_divmod (mod_p, code, mode, treeop0, treeop1,
 			op0, op1, target, unsignedp);
diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
new file mode 100644
index 00000000000..dffb1506f06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-1.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-final { scan-assembler-not "__divdi3" } } */
+
+int
+foo (unsigned int len)
+{
+  return ((unsigned long)len) * 234 / 5;
+}
diff --git a/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
new file mode 100644
index 00000000000..41e8e40c35c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/divmod-funcall-2.c
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-final { scan-assembler-not "__moddi3" } } */
+
+int
+foo (unsigned int len)
+{
+  return ((unsigned long)len) * 234 % 5;
+}