Improved RTL expansion of field assignments into promoted registers.

Message ID 005901da399e$7d13b330$773b1990$@nextmovesoftware.com
State Accepted
Headers
Series Improved RTL expansion of field assignments into promoted registers. |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Roger Sayle Dec. 28, 2023, 2:59 p.m. UTC
  This patch fixes PR rtl-optmization/104914 by tweaking/improving the way
that fields are written into a pseudo register that needs to be kept sign
extended.

The motivating example from the bugzilla PR is:

extern void ext(int);
void foo(const unsigned char *buf) {
  int val;
  ((unsigned char*)&val)[0] = *buf++;
  ((unsigned char*)&val)[1] = *buf++;
  ((unsigned char*)&val)[2] = *buf++;
  ((unsigned char*)&val)[3] = *buf++;
  if(val > 0)
    ext(1);
  else
    ext(0);
}

which at the end of the tree optimization passes looks like:

void foo (const unsigned char * buf)
{
  int val;
  unsigned char _1;
  unsigned char _2;
  unsigned char _3;
  unsigned char _4;
  int val.5_5;

  <bb 2> [local count: 1073741824]:
  _1 = *buf_7(D);
  MEM[(unsigned char *)&val] = _1;
  _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
  MEM[(unsigned char *)&val + 1B] = _2;
  _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
  MEM[(unsigned char *)&val + 2B] = _3;
  _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
  MEM[(unsigned char *)&val + 3B] = _4;
  val.5_5 = val;
  if (val.5_5 > 0)
    goto <bb 3>; [59.00%]
  else
    goto <bb 4>; [41.00%]

  <bb 3> [local count: 633507681]:
  ext (1);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 440234144]:
  ext (0);

  <bb 5> [local count: 1073741824]:
  val ={v} {CLOBBER(eol)};
  return;

}

Here four bytes are being sequentially written into the SImode value
val.  On some platforms, such as MIPS64, this SImode value is kept in
a 64-bit register, suitably sign-extended.  The function expand_assignment
contains logic to handle this via SUBREG_PROMOTED_VAR_P (around line 6264
in expr.cc) which outputs an explicit extension operation after each
store_field (typically insv) to such promoted/extended pseudos.

The first observation is that there's no need to perform sign extension
after each byte in the example above; the extension is only required
after changes to the most significant byte (i.e. to a field that overlaps
the most significant bit).

The bug fix is actually a bit more subtle, but at this point during
code expansion it's not safe to use a SUBREG when sign-extending this
field.  Currently, GCC generates (sign_extend:DI (subreg:SI (reg:DI) 0))
but combine (and other RTL optimizers) later realize that because SImode
values are always sign-extended in their 64-bit hard registers that
this is a no-op and eliminates it.  The trouble is that it's unsafe to
refer to the SImode lowpart of a 64-bit register using SUBREG at those
critical points when temporarily the value isn't correctly sign-extended,
and the usual backend invariants don't hold.  At these critical points,
the middle-end needs to use an explicit TRUNCATE rtx (as this isn't a
TRULY_NOOP_TRUNCATION), so that the explicit sign-extension looks like
(sign_extend:DI (truncate:SI (reg:DI)), which avoids the problem.

Note that MODE_REP_EXTENDED (NARROW, WIDE) != UNKOWN implies (or should
imply) !TRULY_NOOP_TRUNCATION (NARROW, WIDE).  I've another (independent)
patch that I'll post in a few minutes.


This middle-end patch has been tested on x86_64-pc-linux-gnu with
make bootstrap and make -k check, both with and without
--target_board=unix{-m32} with no new failures.  The cc1 from a
cross-compiler to mips64 appears to generate much better code for
the above test case.  Ok for mainline?


2023-12-28  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR rtl-optimization/104914
        * expr.cc (expand_assignment): When target is SUBREG_PROMOTED_VAR_P
        a sign or zero extension is only required if the modified field
        overlaps the SUBREG's most significant bit.  On MODE_REP_EXTENDED
        targets, don't refer to the temporarily incorrectly extended value
        using a SUBREG, but instead generate an explicit TRUNCATE rtx.


Thanks in advance,
Roger
--
  

Comments

Jeff Law Dec. 28, 2023, 6:22 p.m. UTC | #1
On 12/28/23 07:59, Roger Sayle wrote:
> 
> This patch fixes PR rtl-optmization/104914 by tweaking/improving the way
> that fields are written into a pseudo register that needs to be kept sign
> extended.
Well, I think "fixes" is a bit of a stretch.  We're avoiding the issue 
by changing the early RTL generation, but if I understand what's going 
on in the RTL optimizers and MIPS backend correctly, the core bug still 
remains.  Admittedly I haven't put it under a debugger, but that MIPS 
definition of NOOP_TRUNCATION just seems badly wrong and is just waiting 
to pop it's ugly head up again.



> 
> The motivating example from the bugzilla PR is:
> 
> extern void ext(int);
> void foo(const unsigned char *buf) {
>    int val;
>    ((unsigned char*)&val)[0] = *buf++;
>    ((unsigned char*)&val)[1] = *buf++;
>    ((unsigned char*)&val)[2] = *buf++;
>    ((unsigned char*)&val)[3] = *buf++;
>    if(val > 0)
>      ext(1);
>    else
>      ext(0);
> }
> 
> which at the end of the tree optimization passes looks like:
> 
> void foo (const unsigned char * buf)
> {
>    int val;
>    unsigned char _1;
>    unsigned char _2;
>    unsigned char _3;
>    unsigned char _4;
>    int val.5_5;
> 
>    <bb 2> [local count: 1073741824]:
>    _1 = *buf_7(D);
>    MEM[(unsigned char *)&val] = _1;
>    _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
>    MEM[(unsigned char *)&val + 1B] = _2;
>    _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
>    MEM[(unsigned char *)&val + 2B] = _3;
>    _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
>    MEM[(unsigned char *)&val + 3B] = _4;
>    val.5_5 = val;
>    if (val.5_5 > 0)
>      goto <bb 3>; [59.00%]
>    else
>      goto <bb 4>; [41.00%]
> 
>    <bb 3> [local count: 633507681]:
>    ext (1);
>    goto <bb 5>; [100.00%]
> 
>    <bb 4> [local count: 440234144]:
>    ext (0);
> 
>    <bb 5> [local count: 1073741824]:
>    val ={v} {CLOBBER(eol)};
>    return;
> 
> }
> 
> Here four bytes are being sequentially written into the SImode value
> val.  On some platforms, such as MIPS64, this SImode value is kept in
> a 64-bit register, suitably sign-extended.  The function expand_assignment
> contains logic to handle this via SUBREG_PROMOTED_VAR_P (around line 6264
> in expr.cc) which outputs an explicit extension operation after each
> store_field (typically insv) to such promoted/extended pseudos.
> 
> The first observation is that there's no need to perform sign extension
> after each byte in the example above; the extension is only required
> after changes to the most significant byte (i.e. to a field that overlaps
> the most significant bit).
True.


> 
> The bug fix is actually a bit more subtle, but at this point during
> code expansion it's not safe to use a SUBREG when sign-extending this
> field.  Currently, GCC generates (sign_extend:DI (subreg:SI (reg:DI) 0))
> but combine (and other RTL optimizers) later realize that because SImode
> values are always sign-extended in their 64-bit hard registers that
> this is a no-op and eliminates it.  The trouble is that it's unsafe to
> refer to the SImode lowpart of a 64-bit register using SUBREG at those
> critical points when temporarily the value isn't correctly sign-extended,
> and the usual backend invariants don't hold.  At these critical points,
> the middle-end needs to use an explicit TRUNCATE rtx (as this isn't a
> TRULY_NOOP_TRUNCATION), so that the explicit sign-extension looks like
> (sign_extend:DI (truncate:SI (reg:DI)), which avoids the problem.


> 
> Note that MODE_REP_EXTENDED (NARROW, WIDE) != UNKOWN implies (or should
> imply) !TRULY_NOOP_TRUNCATION (NARROW, WIDE).  I've another (independent)
> patch that I'll post in a few minutes.
> 
> 
> This middle-end patch has been tested on x86_64-pc-linux-gnu with
> make bootstrap and make -k check, both with and without
> --target_board=unix{-m32} with no new failures.  The cc1 from a
> cross-compiler to mips64 appears to generate much better code for
> the above test case.  Ok for mainline?
> 
> 
> 2023-12-28  Roger Sayle  <roger@nextmovesoftware.com>
> 
> gcc/ChangeLog
>          PR rtl-optimization/104914
>          * expr.cc (expand_assignment): When target is SUBREG_PROMOTED_VAR_P
>          a sign or zero extension is only required if the modified field
>          overlaps the SUBREG's most significant bit.  On MODE_REP_EXTENDED
>          targets, don't refer to the temporarily incorrectly extended value
>          using a SUBREG, but instead generate an explicit TRUNCATE rtx.
[ ... ]


> +	      /* Check if the field overlaps the MSB, requiring extension.  */
> +	      else if (known_eq (bitpos + bitsize,
> +				 GET_MODE_BITSIZE (GET_MODE (to_rtx))))
Do you need to look at the size of the field as well?  ie, the starting 
position might be before the sign bit, but the width of the field might 
cover the mode's sign bit?

I'm not real good in the RTL expansion code, so if I'm offbase on this, 
just let me know.

jeff
  
Roger Sayle Dec. 28, 2023, 7:35 p.m. UTC | #2
Hi Jeff,
Thanks for the speedy review.

> On 12/28/23 07:59, Roger Sayle wrote:
> > This patch fixes PR rtl-optmization/104914 by tweaking/improving the
> > way that fields are written into a pseudo register that needs to be
> > kept sign extended.
> Well, I think "fixes" is a bit of a stretch.  We're avoiding the issue by changing the
> early RTL generation, but if I understand what's going on in the RTL optimizers
> and MIPS backend correctly, the core bug still remains.  Admittedly I haven't put it
> under a debugger, but that MIPS definition of NOOP_TRUNCATION just seems
> badly wrong and is just waiting to pop it's ugly head up again.

I think this really is the/a correct fix. The MIPS backend defines NOOP_TRUNCATION
to false, so it's not correct to use a SUBREG to convert from DImode to SImode.
The problem then is where in the compiler (middle-end or backend) is this invalid
SUBREG being created and how can it be fixed.  In this particular case, the fault
is in RTL expansion.  There may be other places where a SUBREG is inappropriately
used instead of a TRUNCATE, but this is the place where things go wrong for
PR rtl-optimization/104914.

Once an inappropriate SImode SUBREG is in the RTL stream, it can remain
harmlessly latent (most of the time), unless it gets split, simplified or spilled.
Copying this SImode expression into it's own pseudo, results in incorrect code.
One approach might be to use an UNSPEC for places where backend
invariants are temporarily invalid, but in this case it's machine independent
middle-end code that's using SUBREGs as though the target was an x86/pdp11.

So I agree that on the surface, both of these appear to be identical:
> (set (reg:DI) (sign_extend:DI (truncate:SI (reg:DI))))
> (set (reg:DI) (sign_extend:DI (subreg:SI (reg:DI))))

But should they get split or spilled by reload:

(set (reg_tmp:SI) (subreg:SI (reg:DI))
(set (reg:DI) (sign_extend:DI (reg_tmp:SI))

is invalid as the reg_tmp isn't correctly sign-extended for SImode.
But,

(set (reg_tmp:SI) (truncate:SI (reg:DI))
(set (reg:DI) (sign_extend:DI (reg_tmp:SI))

is fine.  The difference is the instant in time, when the SUBREG's invariants
aren't yet valid (and its contents shouldn't be thought of as SImode).

On nvptx, where truly_noop_truncation is always "false", it'd show the same
bug/failure, if it were not for that fact that nvptx doesn't attempt to store
values in "mode extended" (SUBREG_PROMOTED_VAR_P) registers.
The bug is really in MODE_REP_EXTENDED support.

> > The motivating example from the bugzilla PR is:
> >
> > extern void ext(int);
> > void foo(const unsigned char *buf) {
> >    int val;
> >    ((unsigned char*)&val)[0] = *buf++;
> >    ((unsigned char*)&val)[1] = *buf++;
> >    ((unsigned char*)&val)[2] = *buf++;
> >    ((unsigned char*)&val)[3] = *buf++;
> >    if(val > 0)
> >      ext(1);
> >    else
> >      ext(0);
> > }
> >
> > which at the end of the tree optimization passes looks like:
> >
> > void foo (const unsigned char * buf)
> > {
> >    int val;
> >    unsigned char _1;
> >    unsigned char _2;
> >    unsigned char _3;
> >    unsigned char _4;
> >    int val.5_5;
> >
> >    <bb 2> [local count: 1073741824]:
> >    _1 = *buf_7(D);
> >    MEM[(unsigned char *)&val] = _1;
> >    _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
> >    MEM[(unsigned char *)&val + 1B] = _2;
> >    _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
> >    MEM[(unsigned char *)&val + 2B] = _3;
> >    _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
> >    MEM[(unsigned char *)&val + 3B] = _4;
> >    val.5_5 = val;
> >    if (val.5_5 > 0)
> >      goto <bb 3>; [59.00%]
> >    else
> >      goto <bb 4>; [41.00%]
> >
> >    <bb 3> [local count: 633507681]:
> >    ext (1);
> >    goto <bb 5>; [100.00%]
> >
> >    <bb 4> [local count: 440234144]:
> >    ext (0);
> >
> >    <bb 5> [local count: 1073741824]:
> >    val ={v} {CLOBBER(eol)};
> >    return;
> >
> > }
> >
> > Here four bytes are being sequentially written into the SImode value
> > val.  On some platforms, such as MIPS64, this SImode value is kept in
> > a 64-bit register, suitably sign-extended.  The function
> > expand_assignment contains logic to handle this via
> > SUBREG_PROMOTED_VAR_P (around line 6264 in expr.cc) which outputs an
> > explicit extension operation after each store_field (typically insv) to such
> promoted/extended pseudos.
> >
> > The first observation is that there's no need to perform sign
> > extension after each byte in the example above; the extension is only
> > required after changes to the most significant byte (i.e. to a field
> > that overlaps the most significant bit).
> True.
> 
> > The bug fix is actually a bit more subtle, but at this point during
> > code expansion it's not safe to use a SUBREG when sign-extending this
> > field.  Currently, GCC generates (sign_extend:DI (subreg:SI (reg:DI)
> > 0)) but combine (and other RTL optimizers) later realize that because
> > SImode values are always sign-extended in their 64-bit hard registers
> > that this is a no-op and eliminates it.  The trouble is that it's
> > unsafe to refer to the SImode lowpart of a 64-bit register using
> > SUBREG at those critical points when temporarily the value isn't
> > correctly sign-extended, and the usual backend invariants don't hold.
> > At these critical points, the middle-end needs to use an explicit
> > TRUNCATE rtx (as this isn't a TRULY_NOOP_TRUNCATION), so that the
> > explicit sign-extension looks like (sign_extend:DI (truncate:SI (reg:DI)), which
> avoids the problem.
> 
> 
> > Note that MODE_REP_EXTENDED (NARROW, WIDE) != UNKOWN implies (or
> > should
> > imply) !TRULY_NOOP_TRUNCATION (NARROW, WIDE).  I've another
> > (independent) patch that I'll post in a few minutes.
> >
> >
> > This middle-end patch has been tested on x86_64-pc-linux-gnu with make
> > bootstrap and make -k check, both with and without
> > --target_board=unix{-m32} with no new failures.  The cc1 from a
> > cross-compiler to mips64 appears to generate much better code for the
> > above test case.  Ok for mainline?
> >
> >
> > 2023-12-28  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >          PR rtl-optimization/104914
> >          * expr.cc (expand_assignment): When target is
> SUBREG_PROMOTED_VAR_P
> >          a sign or zero extension is only required if the modified field
> >          overlaps the SUBREG's most significant bit.  On MODE_REP_EXTENDED
> >          targets, don't refer to the temporarily incorrectly extended value
> >          using a SUBREG, but instead generate an explicit TRUNCATE rtx.
> [ ... ]
> 
> 
> > +	      /* Check if the field overlaps the MSB, requiring extension.  */
> > +	      else if (known_eq (bitpos + bitsize,
> > +				 GET_MODE_BITSIZE (GET_MODE (to_rtx))))
> Do you need to look at the size of the field as well?  ie, the starting position might
> be before the sign bit, but the width of the field might cover the mode's sign bit?
> 
> I'm not real good in the RTL expansion code, so if I'm offbase on this, just let me
> know.

There are two things that help here.  The first is that the most significant
bit never appears in the middle of a field, so we don't have to worry about
overlapping, nor writes to the paradoxical bits of the SUBREG.  And secondly,
bits are numbered from zero for least significant, to MODE_BITSIZE (mode) - 1
for most significant, irrespective of the endian-ness.  So the code only needs
to check the highest value bitpos + bitsize is the maximum value for the mode.
The above logic stays the same, but which byte insert requires extension will
change between mips64be and mips64le.  i.e. we test that the most significant
bit of the field/byte being written in the most significant bit of the SUBREG
target. [That's my understanding/rationalization, I could wrong].

One thing I could be more cautious about is using maybe_eq instead of
known_eq, but the rest of the code (including truly_noop_truncation) assumes
scalar integer modes, so variable length vectors aren't (yet) a concern.
Would using maybe_eq be better coding style?


Cheers,
Roger
--
  
YunQiang Su Dec. 29, 2023, 1:51 a.m. UTC | #3
Jeff Law <jeffreyalaw@gmail.com> 于2023年12月29日周五 02:23写道:
>
>
>
> On 12/28/23 07:59, Roger Sayle wrote:
> >
> > This patch fixes PR rtl-optmization/104914 by tweaking/improving the way
> > that fields are written into a pseudo register that needs to be kept sign
> > extended.
> Well, I think "fixes" is a bit of a stretch.  We're avoiding the issue
> by changing the early RTL generation, but if I understand what's going
> on in the RTL optimizers and MIPS backend correctly, the core bug still
> remains.  Admittedly I haven't put it under a debugger, but that MIPS
> definition of NOOP_TRUNCATION just seems badly wrong and is just waiting
> to pop it's ugly head up again.
>

Yes. I am trying to get rid of it from MIPS64.
It may reduce our maintain workload.

>
>
> >
> > The motivating example from the bugzilla PR is:
> >
> > extern void ext(int);
> > void foo(const unsigned char *buf) {
> >    int val;
> >    ((unsigned char*)&val)[0] = *buf++;
> >    ((unsigned char*)&val)[1] = *buf++;
> >    ((unsigned char*)&val)[2] = *buf++;
> >    ((unsigned char*)&val)[3] = *buf++;
> >    if(val > 0)
> >      ext(1);
> >    else
> >      ext(0);
> > }
> >
> > which at the end of the tree optimization passes looks like:
> >
> > void foo (const unsigned char * buf)
> > {
> >    int val;
> >    unsigned char _1;
> >    unsigned char _2;
> >    unsigned char _3;
> >    unsigned char _4;
> >    int val.5_5;
> >
> >    <bb 2> [local count: 1073741824]:
> >    _1 = *buf_7(D);
> >    MEM[(unsigned char *)&val] = _1;
> >    _2 = MEM[(const unsigned char *)buf_7(D) + 1B];
> >    MEM[(unsigned char *)&val + 1B] = _2;
> >    _3 = MEM[(const unsigned char *)buf_7(D) + 2B];
> >    MEM[(unsigned char *)&val + 2B] = _3;
> >    _4 = MEM[(const unsigned char *)buf_7(D) + 3B];
> >    MEM[(unsigned char *)&val + 3B] = _4;
> >    val.5_5 = val;
> >    if (val.5_5 > 0)
> >      goto <bb 3>; [59.00%]
> >    else
> >      goto <bb 4>; [41.00%]
> >
> >    <bb 3> [local count: 633507681]:
> >    ext (1);
> >    goto <bb 5>; [100.00%]
> >
> >    <bb 4> [local count: 440234144]:
> >    ext (0);
> >
> >    <bb 5> [local count: 1073741824]:
> >    val ={v} {CLOBBER(eol)};
> >    return;
> >
> > }
> >
> > Here four bytes are being sequentially written into the SImode value
> > val.  On some platforms, such as MIPS64, this SImode value is kept in
> > a 64-bit register, suitably sign-extended.  The function expand_assignment
> > contains logic to handle this via SUBREG_PROMOTED_VAR_P (around line 6264
> > in expr.cc) which outputs an explicit extension operation after each
> > store_field (typically insv) to such promoted/extended pseudos.
> >
> > The first observation is that there's no need to perform sign extension
> > after each byte in the example above; the extension is only required
> > after changes to the most significant byte (i.e. to a field that overlaps
> > the most significant bit).
> True.
>
>
> >
> > The bug fix is actually a bit more subtle, but at this point during
> > code expansion it's not safe to use a SUBREG when sign-extending this
> > field.  Currently, GCC generates (sign_extend:DI (subreg:SI (reg:DI) 0))
> > but combine (and other RTL optimizers) later realize that because SImode
> > values are always sign-extended in their 64-bit hard registers that
> > this is a no-op and eliminates it.  The trouble is that it's unsafe to
> > refer to the SImode lowpart of a 64-bit register using SUBREG at those
> > critical points when temporarily the value isn't correctly sign-extended,
> > and the usual backend invariants don't hold.  At these critical points,
> > the middle-end needs to use an explicit TRUNCATE rtx (as this isn't a
> > TRULY_NOOP_TRUNCATION), so that the explicit sign-extension looks like
> > (sign_extend:DI (truncate:SI (reg:DI)), which avoids the problem.
>
>
> >
> > Note that MODE_REP_EXTENDED (NARROW, WIDE) != UNKOWN implies (or should
> > imply) !TRULY_NOOP_TRUNCATION (NARROW, WIDE).  I've another (independent)
> > patch that I'll post in a few minutes.
> >
> >
> > This middle-end patch has been tested on x86_64-pc-linux-gnu with
> > make bootstrap and make -k check, both with and without
> > --target_board=unix{-m32} with no new failures.  The cc1 from a
> > cross-compiler to mips64 appears to generate much better code for
> > the above test case.  Ok for mainline?
> >
> >
> > 2023-12-28  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >          PR rtl-optimization/104914
> >          * expr.cc (expand_assignment): When target is SUBREG_PROMOTED_VAR_P
> >          a sign or zero extension is only required if the modified field
> >          overlaps the SUBREG's most significant bit.  On MODE_REP_EXTENDED
> >          targets, don't refer to the temporarily incorrectly extended value
> >          using a SUBREG, but instead generate an explicit TRUNCATE rtx.
> [ ... ]
>
>
> > +           /* Check if the field overlaps the MSB, requiring extension.  */
> > +           else if (known_eq (bitpos + bitsize,
> > +                              GET_MODE_BITSIZE (GET_MODE (to_rtx))))
> Do you need to look at the size of the field as well?  ie, the starting
> position might be before the sign bit, but the width of the field might
> cover the mode's sign bit?
>
> I'm not real good in the RTL expansion code, so if I'm offbase on this,
> just let me know.
>

I guess `known_ge` may be better.
Is `bitregion_end` here better than `bitpos + bitsize` ?
  
YunQiang Su Dec. 29, 2023, 2:07 a.m. UTC | #4
In general, I agree with this change.
When gcc12 on RV64, more than one `sext.w` will be produced with our test.
(Note, use -O1).

>
> There are two things that help here.  The first is that the most significant
> bit never appears in the middle of a field, so we don't have to worry about
> overlapping, nor writes to the paradoxical bits of the SUBREG.  And secondly,
> bits are numbered from zero for least significant, to MODE_BITSIZE (mode) - 1
> for most significant, irrespective of the endian-ness.  So the code only needs

I am worrying that the higher bits than MODE_BITSIZE (mode) - 1 are also
modified. In this case, we also need do truncate/sign_extend.
While I cannot produce this C code yet.

> to check the highest value bitpos + bitsize is the maximum value for the mode.
> The above logic stays the same, but which byte insert requires extension will
> change between mips64be and mips64le.  i.e. we test that the most significant
> bit of the field/byte being written in the most significant bit of the SUBREG
> target. [That's my understanding/rationalization, I could wrong].
>

The bit higher than MODE_BITSIZE (mode) - 1 also matters.
Since MIPS ISA claims that the src register of SImode instructions should
be sign_extended, otherwise UNPREDICTABLE.
It means,
   li $r2, 0xfff0 ffff ffff 0001
   #              ^
   addu $r1, $r0, $r2
is not allowed.

> One thing I could be more cautious about is using maybe_eq instead of
> known_eq, but the rest of the code (including truly_noop_truncation) assumes
> scalar integer modes, so variable length vectors aren't (yet) a concern.
> Would using maybe_eq be better coding style?
>
>
> Cheers,
> Roger
> --
>
>
  
Jeff Law Dec. 30, 2023, 3:27 a.m. UTC | #5
On 12/28/23 19:07, YunQiang Su wrote:
> In general, I agree with this change.
> When gcc12 on RV64, more than one `sext.w` will be produced with our test.
> (Note, use -O1).
> 
>>
>> There are two things that help here.  The first is that the most significant
>> bit never appears in the middle of a field, so we don't have to worry about
>> overlapping, nor writes to the paradoxical bits of the SUBREG.  And secondly,
>> bits are numbered from zero for least significant, to MODE_BITSIZE (mode) - 1
>> for most significant, irrespective of the endian-ness.  So the code only needs
> 
> I am worrying that the higher bits than MODE_BITSIZE (mode) - 1 are also
> modified. In this case, we also need do truncate/sign_extend.
> While I cannot produce this C code yet.
> 
>> to check the highest value bitpos + bitsize is the maximum value for the mode.
>> The above logic stays the same, but which byte insert requires extension will
>> change between mips64be and mips64le.  i.e. we test that the most significant
>> bit of the field/byte being written in the most significant bit of the SUBREG
>> target. [That's my understanding/rationalization, I could wrong].
>>
> 
> The bit higher than MODE_BITSIZE (mode) - 1 also matters.
> Since MIPS ISA claims that the src register of SImode instructions should
> be sign_extended, otherwise UNPREDICTABLE.
> It means,
>     li $r2, 0xfff0 ffff ffff 0001
>     #              ^
>     addu $r1, $r0, $r2
> is not allowed.
Right.  But that's the whole point behind avoiding the narrowing subreg 
and forcing use of a truncate operation.

So basically the question becomes is there a way to modify those bits in 
a way that GCC doesn't know that it needs to to truncate/extend?

The most obvious concern would be bitfield insertions that modify those 
bits.  But in that case the destination must have been DImode and we 
must truncate it to SImode before we can do anything with the SImode 
object.  BUt that's all supposed to work as long as 
TRULY_NOOP_TRUNCATION is defined properly.

Jeff
  
YunQiang Su Dec. 31, 2023, 4:34 a.m. UTC | #6
> Right.  But that's the whole point behind avoiding the narrowing subreg
> and forcing use of a truncate operation.
>
> So basically the question becomes is there a way to modify those bits in
> a way that GCC doesn't know that it needs to to truncate/extend?
>

I guess that this code may cause some problem.
int test(int val, unsigned char c, int pos) {
  ((unsigned char*)&val)[pos+0] = c;
  return val;
}
GCC avoids using bitops, instead it uses load/store for it.
Any ISA has INSERT_CHAR_VAR instruction?
     INSERT_CHAR_VAR $rN, $rM,$rX

So I guess that  known_lt may be a better choice
if (known_lt)
    no_truncate_or_extend_needed;
else
    add_truncate_or_extend;

> The most obvious concern would be bitfield insertions that modify those
> bits.  But in that case the destination must have been DImode and we
> must truncate it to SImode before we can do anything with the SImode
> object.  BUt that's all supposed to work as long as
> TRULY_NOOP_TRUNCATION is defined properly.
>
> Jeff
  
Jeff Law Jan. 2, 2024, 4:46 p.m. UTC | #7
On 12/30/23 21:34, YunQiang Su wrote:
>> Right.  But that's the whole point behind avoiding the narrowing subreg
>> and forcing use of a truncate operation.
>>
>> So basically the question becomes is there a way to modify those bits in
>> a way that GCC doesn't know that it needs to to truncate/extend?
>>
> 
> I guess that this code may cause some problem.
> int test(int val, unsigned char c, int pos) {
>    ((unsigned char*)&val)[pos+0] = c;
>    return val;
> }
> GCC avoids using bitops, instead it uses load/store for it.
> Any ISA has INSERT_CHAR_VAR instruction?
>       INSERT_CHAR_VAR $rN, $rM,$rX
ISAs that can do that via load certainly exist, but none (to the best of 
my knowledge) need to extend sub-words for correctness reasons.



> 
> So I guess that  known_lt may be a better choice
> if (known_lt)
>      no_truncate_or_extend_needed;
> else
>      add_truncate_or_extend;
I think Roger's argument is these cases can't happen.

jeff
  
Jeff Law Jan. 2, 2024, 4:55 p.m. UTC | #8
On 12/28/23 12:35, Roger Sayle wrote:
> 
> Hi Jeff,
> Thanks for the speedy review.
> 
>> On 12/28/23 07:59, Roger Sayle wrote:
>>> This patch fixes PR rtl-optmization/104914 by tweaking/improving the
>>> way that fields are written into a pseudo register that needs to be
>>> kept sign extended.
>> Well, I think "fixes" is a bit of a stretch.  We're avoiding the issue by changing the
>> early RTL generation, but if I understand what's going on in the RTL optimizers
>> and MIPS backend correctly, the core bug still remains.  Admittedly I haven't put it
>> under a debugger, but that MIPS definition of NOOP_TRUNCATION just seems
>> badly wrong and is just waiting to pop it's ugly head up again.
> 
> I think this really is the/a correct fix. The MIPS backend defines NOOP_TRUNCATION
> to false, so it's not correct to use a SUBREG to convert from DImode to SImode.
> The problem then is where in the compiler (middle-end or backend) is this invalid
> SUBREG being created and how can it be fixed.  In this particular case, the fault
> is in RTL expansion.  There may be other places where a SUBREG is inappropriately
> used instead of a TRUNCATE, but this is the place where things go wrong for
> PR rtl-optimization/104914.
Maybe a better way to put it is I think you're patch one piece of the 
solution, but we still have the potential for bugs due to the seemingly 
bogus defintion of TRULY_NOOP_TRUNCATION in the mips port.



> 
> Once an inappropriate SImode SUBREG is in the RTL stream, it can remain
> harmlessly latent (most of the time), unless it gets split, simplified or spilled.
> Copying this SImode expression into it's own pseudo, results in incorrect code.
> One approach might be to use an UNSPEC for places where backend
> invariants are temporarily invalid, but in this case it's machine independent
> middle-end code that's using SUBREGs as though the target was an x86/pdp11.
> 
> So I agree that on the surface, both of these appear to be identical:
>> (set (reg:DI) (sign_extend:DI (truncate:SI (reg:DI))))
>> (set (reg:DI) (sign_extend:DI (subreg:SI (reg:DI))))
> 
> But should they get split or spilled by reload:
Even if they're not spilled, on a target like mips they're not 
equivalent.  It highlights the poor design around TRULY_NOOP_TRUNCATION. 
  Essentially it's out of band information on how RTL need to be 
interpreted.  We have similar problems with SHIFT_COUNT_TRUNCATED and 
other macros.




>>> 2023-12-28  Roger Sayle  <roger@nextmovesoftware.com>
>>>
>>> gcc/ChangeLog
>>>           PR rtl-optimization/104914
>>>           * expr.cc (expand_assignment): When target is
>> SUBREG_PROMOTED_VAR_P
>>>           a sign or zero extension is only required if the modified field
>>>           overlaps the SUBREG's most significant bit.  On MODE_REP_EXTENDED
>>>           targets, don't refer to the temporarily incorrectly extended value
>>>           using a SUBREG, but instead generate an explicit TRUNCATE rtx.
>> [ ... ]
>>
>>
>>> +	      /* Check if the field overlaps the MSB, requiring extension.  */
>>> +	      else if (known_eq (bitpos + bitsize,
>>> +				 GET_MODE_BITSIZE (GET_MODE (to_rtx))))
>> Do you need to look at the size of the field as well?  ie, the starting position might
>> be before the sign bit, but the width of the field might cover the mode's sign bit?
>>
>> I'm not real good in the RTL expansion code, so if I'm offbase on this, just let me
>> know.
> 
> There are two things that help here.  The first is that the most significant
> bit never appears in the middle of a field, so we don't have to worry about
> overlapping, nor writes to the paradoxical bits of the SUBREG.  And secondly,
> bits are numbered from zero for least significant, to MODE_BITSIZE (mode) - 1
> for most significant, irrespective of the endian-ness.  So the code only needs
> to check the highest value bitpos + bitsize is the maximum value for the mode.
> The above logic stays the same, but which byte insert requires extension will
> change between mips64be and mips64le.  i.e. we test that the most significant
> bit of the field/byte being written in the most significant bit of the SUBREG
> target. [That's my understanding/rationalization, I could wrong].
Yea, if the type is mode M, then we won't have a field that exceeds the 
size of M.  So we just need to know if the position + size lands 
squarely on the MSB.  Thanks for walking me what should have been fairly 
obvious in retrospect.


> 
> One thing I could be more cautious about is using maybe_eq instead of
> known_eq, but the rest of the code (including truly_noop_truncation) assumes
> scalar integer modes, so variable length vectors aren't (yet) a concern.
> Would using maybe_eq be better coding style?
It's probably better to use maybe_eq for future proofing.  OK with that 
change after the usual testing.

Thanks diving into this.

jeff
  

Patch

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 9fef2bf6585..1a34b48e38f 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -6272,19 +6272,32 @@  expand_assignment (tree to, tree from, bool nontemporal)
 		  && known_eq (bitpos, 0)
 		  && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (to_rtx))))
 		result = store_expr (from, to_rtx, 0, nontemporal, false);
-	      else
+	      /* Check if the field overlaps the MSB, requiring extension.  */
+	      else if (known_eq (bitpos + bitsize,
+				 GET_MODE_BITSIZE (GET_MODE (to_rtx))))
 		{
-		  rtx to_rtx1
-		    = lowpart_subreg (subreg_unpromoted_mode (to_rtx),
-				      SUBREG_REG (to_rtx),
-				      subreg_promoted_mode (to_rtx));
+		  scalar_int_mode imode = subreg_unpromoted_mode (to_rtx);
+		  scalar_int_mode omode = subreg_promoted_mode (to_rtx);
+		  rtx to_rtx1 = lowpart_subreg (imode, SUBREG_REG (to_rtx),
+						omode);
 		  result = store_field (to_rtx1, bitsize, bitpos,
 					bitregion_start, bitregion_end,
 					mode1, from, get_alias_set (to),
 					nontemporal, reversep);
+		  /* If the target usually keeps IMODE appropriately
+		     extended in OMODE it's unsafe to refer to it using
+		     a SUBREG whilst this invariant doesn't hold.  */
+		  if (targetm.mode_rep_extended (imode, omode) != UNKNOWN)
+		    to_rtx1 = simplify_gen_unary (TRUNCATE, imode,
+						  SUBREG_REG (to_rtx), omode);
 		  convert_move (SUBREG_REG (to_rtx), to_rtx1,
 				SUBREG_PROMOTED_SIGN (to_rtx));
 		}
+	      else
+		result = store_field (to_rtx, bitsize, bitpos,
+				      bitregion_start, bitregion_end,
+				      mode1, from, get_alias_set (to),
+				      nontemporal, reversep);
 	    }
 	  else
 	    result = store_field (to_rtx, bitsize, bitpos,