simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]
Checks
Commit Message
As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
(gdb) list
469 if (code == SUBREG)
470 {
471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
old_rtx, fn, data);
472 if (op0 == SUBREG_REG (x))
473 return x;
474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
475 GET_MODE (SUBREG_REG (x)),
476 SUBREG_BYTE (x));
477 return op0 ? op0 : x;
478 }
simplifies with following arguments:
(gdb) p debug_rtx (op0)
(const_vector:V4QI [
(const_int -52 [0xffffffffffffffcc]) repeated x4
])
(gdb) p debug_rtx (x)
(subreg:V16QI (reg:V4QI 98) 0)
to:
(gdb) p debug_rtx (op0)
(const_vector:V16QI [
(const_int -52 [0xffffffffffffffcc]) repeated x16
])
This simplification is invalid, it is not possible to get V16QImode vector
from V4QImode vector, even when all elements are duplicates.
The simplification happens in simplify_context::simplify_subreg:
(gdb) list
7558 if (VECTOR_MODE_P (outermode)
7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
7560 && vec_duplicate_p (op, &elt))
7561 return gen_vec_duplicate (outermode, elt);
but the above simplification is valid only for non-paradoxical registers,
where outermode <= innermode. We should not assume that elements outside
the original register are valid, let alone all duplicates.
PR target/110206
gcc/ChangeLog:
* simplify-rtx.cc (simplify_context::simplify_subreg):
Avoid returning a vector with duplicated value
outside the original register.
gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr110206.c: New test.
Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
OK for master and release branches?
Uros.
Comments
On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
>
> (gdb) list
> 469 if (code == SUBREG)
> 470 {
> 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> old_rtx, fn, data);
> 472 if (op0 == SUBREG_REG (x))
> 473 return x;
> 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> 475 GET_MODE (SUBREG_REG (x)),
> 476 SUBREG_BYTE (x));
> 477 return op0 ? op0 : x;
> 478 }
>
> simplifies with following arguments:
>
> (gdb) p debug_rtx (op0)
> (const_vector:V4QI [
> (const_int -52 [0xffffffffffffffcc]) repeated x4
> ])
> (gdb) p debug_rtx (x)
> (subreg:V16QI (reg:V4QI 98) 0)
>
> to:
>
> (gdb) p debug_rtx (op0)
> (const_vector:V16QI [
> (const_int -52 [0xffffffffffffffcc]) repeated x16
> ])
>
> This simplification is invalid, it is not possible to get V16QImode vector
> from V4QImode vector, even when all elements are duplicates.
>
> The simplification happens in simplify_context::simplify_subreg:
>
> (gdb) list
> 7558 if (VECTOR_MODE_P (outermode)
> 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> 7560 && vec_duplicate_p (op, &elt))
> 7561 return gen_vec_duplicate (outermode, elt);
>
> but the above simplification is valid only for non-paradoxical registers,
> where outermode <= innermode. We should not assume that elements outside
> the original register are valid, let alone all duplicates.
Hmm, but looking at the audit trail the x86 backend expects them to be zero?
Isn't that wrong as well?
That is, I think putting any random value into the upper lanes when
constant folding
a paradoxical subreg sounds OK to me, no?
Of course we might choose to not do such constant propagation for
efficiency reason - at least
when the resulting CONST_* would require a larger constant pool entry
or more costly
construction.
Thanks,
Richard.
> PR target/110206
>
> gcc/ChangeLog:
>
> * simplify-rtx.cc (simplify_context::simplify_subreg):
> Avoid returning a vector with duplicated value
> outside the original register.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/torture/pr110206.c: New test.
>
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
>
> OK for master and release branches?
>
> Uros.
On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> >
> > (gdb) list
> > 469 if (code == SUBREG)
> > 470 {
> > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > old_rtx, fn, data);
> > 472 if (op0 == SUBREG_REG (x))
> > 473 return x;
> > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > 475 GET_MODE (SUBREG_REG (x)),
> > 476 SUBREG_BYTE (x));
> > 477 return op0 ? op0 : x;
> > 478 }
> >
> > simplifies with following arguments:
> >
> > (gdb) p debug_rtx (op0)
> > (const_vector:V4QI [
> > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > ])
> > (gdb) p debug_rtx (x)
> > (subreg:V16QI (reg:V4QI 98) 0)
> >
> > to:
> >
> > (gdb) p debug_rtx (op0)
> > (const_vector:V16QI [
> > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > ])
> >
> > This simplification is invalid, it is not possible to get V16QImode vector
> > from V4QImode vector, even when all elements are duplicates.
> >
> > The simplification happens in simplify_context::simplify_subreg:
> >
> > (gdb) list
> > 7558 if (VECTOR_MODE_P (outermode)
> > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> > 7560 && vec_duplicate_p (op, &elt))
> > 7561 return gen_vec_duplicate (outermode, elt);
> >
> > but the above simplification is valid only for non-paradoxical registers,
> > where outermode <= innermode. We should not assume that elements outside
> > the original register are valid, let alone all duplicates.
>
> Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> Isn't that wrong as well?
If you mean Comment #10, it is just an observation that
simplify_replace_rtx simplifies arguments from Comment #9 to:
(gdb) p debug_rtx (src)
(const_vector:V8HI [
(const_int 204 [0xcc]) repeated x4
(const_int 0 [0]) repeated x4
])
instead of:
(gdb) p debug_rtx (src)
(const_vector:V8HI [
(const_int 204 [0xcc]) repeated x8
])
which is in line with the statement below.
>
> That is, I think putting any random value into the upper lanes when
> constant folding
> a paradoxical subreg sounds OK to me, no?
The compiler is putting zero there as can be seen from the above new RTX.
> Of course we might choose to not do such constant propagation for
> efficiency reason - at least
> when the resulting CONST_* would require a larger constant pool entry
> or more costly
> construction.
This is probably a follow-up improvement, where this patch tries to
fix a specific invalid simplification of simplify_replace_rtx that is
invalid universally.
Uros.
On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > >
> > > (gdb) list
> > > 469 if (code == SUBREG)
> > > 470 {
> > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > old_rtx, fn, data);
> > > 472 if (op0 == SUBREG_REG (x))
> > > 473 return x;
> > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > 475 GET_MODE (SUBREG_REG (x)),
> > > 476 SUBREG_BYTE (x));
> > > 477 return op0 ? op0 : x;
> > > 478 }
> > >
> > > simplifies with following arguments:
> > >
> > > (gdb) p debug_rtx (op0)
> > > (const_vector:V4QI [
> > > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > > ])
> > > (gdb) p debug_rtx (x)
> > > (subreg:V16QI (reg:V4QI 98) 0)
> > >
> > > to:
> > >
> > > (gdb) p debug_rtx (op0)
> > > (const_vector:V16QI [
> > > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > > ])
> > >
> > > This simplification is invalid, it is not possible to get V16QImode vector
> > > from V4QImode vector, even when all elements are duplicates.
^^^
I think this simplification is valid. A simplification to
(const_vector:V16QI [
(const_int -52 [0xffffffffffffffcc]) repeated x4
(const_int 0 [0]) repeated x12
])
would be valid as well.
> > > The simplification happens in simplify_context::simplify_subreg:
> > >
> > > (gdb) list
> > > 7558 if (VECTOR_MODE_P (outermode)
> > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> > > 7560 && vec_duplicate_p (op, &elt))
> > > 7561 return gen_vec_duplicate (outermode, elt);
> > >
> > > but the above simplification is valid only for non-paradoxical registers,
> > > where outermode <= innermode. We should not assume that elements outside
> > > the original register are valid, let alone all duplicates.
> >
> > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> > Isn't that wrong as well?
>
> If you mean Comment #10, it is just an observation that
> simplify_replace_rtx simplifies arguments from Comment #9 to:
>
> (gdb) p debug_rtx (src)
> (const_vector:V8HI [
> (const_int 204 [0xcc]) repeated x4
> (const_int 0 [0]) repeated x4
> ])
>
> instead of:
>
> (gdb) p debug_rtx (src)
> (const_vector:V8HI [
> (const_int 204 [0xcc]) repeated x8
> ])
>
> which is in line with the statement below.
> >
> > That is, I think putting any random value into the upper lanes when
> > constant folding
> > a paradoxical subreg sounds OK to me, no?
>
> The compiler is putting zero there as can be seen from the above new RTX.
>
> > Of course we might choose to not do such constant propagation for
> > efficiency reason - at least
> > when the resulting CONST_* would require a larger constant pool entry
> > or more costly
> > construction.
>
> This is probably a follow-up improvement, where this patch tries to
> fix a specific invalid simplification of simplify_replace_rtx that is
> invalid universally.
How so? What specifies the values of the paradoxical subreg for the
bytes not covered by the subreg operand?
>
> Uros.
On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > > >
> > > > (gdb) list
> > > > 469 if (code == SUBREG)
> > > > 470 {
> > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > > old_rtx, fn, data);
> > > > 472 if (op0 == SUBREG_REG (x))
> > > > 473 return x;
> > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > > 475 GET_MODE (SUBREG_REG (x)),
> > > > 476 SUBREG_BYTE (x));
> > > > 477 return op0 ? op0 : x;
> > > > 478 }
> > > >
> > > > simplifies with following arguments:
> > > >
> > > > (gdb) p debug_rtx (op0)
> > > > (const_vector:V4QI [
> > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > > > ])
> > > > (gdb) p debug_rtx (x)
> > > > (subreg:V16QI (reg:V4QI 98) 0)
> > > >
> > > > to:
> > > >
> > > > (gdb) p debug_rtx (op0)
> > > > (const_vector:V16QI [
> > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > > > ])
> > > >
> > > > This simplification is invalid, it is not possible to get V16QImode vector
> > > > from V4QImode vector, even when all elements are duplicates.
>
> ^^^
>
> I think this simplification is valid. A simplification to
>
> (const_vector:V16QI [
> (const_int -52 [0xffffffffffffffcc]) repeated x4
> (const_int 0 [0]) repeated x12
> ])
>
> would be valid as well.
>
> > > > The simplification happens in simplify_context::simplify_subreg:
> > > >
> > > > (gdb) list
> > > > 7558 if (VECTOR_MODE_P (outermode)
> > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> > > > 7560 && vec_duplicate_p (op, &elt))
> > > > 7561 return gen_vec_duplicate (outermode, elt);
> > > >
> > > > but the above simplification is valid only for non-paradoxical registers,
> > > > where outermode <= innermode. We should not assume that elements outside
> > > > the original register are valid, let alone all duplicates.
> > >
> > > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> > > Isn't that wrong as well?
> >
> > If you mean Comment #10, it is just an observation that
> > simplify_replace_rtx simplifies arguments from Comment #9 to:
> >
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x4
> > (const_int 0 [0]) repeated x4
> > ])
> >
> > instead of:
> >
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x8
> > ])
> >
> > which is in line with the statement below.
> > >
> > > That is, I think putting any random value into the upper lanes when
> > > constant folding
> > > a paradoxical subreg sounds OK to me, no?
> >
> > The compiler is putting zero there as can be seen from the above new RTX.
> >
> > > Of course we might choose to not do such constant propagation for
> > > efficiency reason - at least
> > > when the resulting CONST_* would require a larger constant pool entry
> > > or more costly
> > > construction.
> >
> > This is probably a follow-up improvement, where this patch tries to
> > fix a specific invalid simplification of simplify_replace_rtx that is
> > invalid universally.
>
> How so? What specifies the values of the paradoxical subreg for the
> bytes not covered by the subreg operand?
I don't know why 0 is generated here (and if it is valid) for
paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
the wrong constant and triggers unwanted propagation later on.
Uros.
On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > > > >
> > > > > (gdb) list
> > > > > 469 if (code == SUBREG)
> > > > > 470 {
> > > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > > > old_rtx, fn, data);
> > > > > 472 if (op0 == SUBREG_REG (x))
> > > > > 473 return x;
> > > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > > > 475 GET_MODE (SUBREG_REG (x)),
> > > > > 476 SUBREG_BYTE (x));
> > > > > 477 return op0 ? op0 : x;
> > > > > 478 }
> > > > >
> > > > > simplifies with following arguments:
> > > > >
> > > > > (gdb) p debug_rtx (op0)
> > > > > (const_vector:V4QI [
> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > > > > ])
> > > > > (gdb) p debug_rtx (x)
> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> > > > >
> > > > > to:
> > > > >
> > > > > (gdb) p debug_rtx (op0)
> > > > > (const_vector:V16QI [
> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > > > > ])
> > > > >
> > > > > This simplification is invalid, it is not possible to get V16QImode vector
> > > > > from V4QImode vector, even when all elements are duplicates.
> >
> > ^^^
> >
> > I think this simplification is valid. A simplification to
> >
> > (const_vector:V16QI [
> > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > (const_int 0 [0]) repeated x12
> > ])
> >
> > would be valid as well.
> >
> > > > > The simplification happens in simplify_context::simplify_subreg:
> > > > >
> > > > > (gdb) list
> > > > > 7558 if (VECTOR_MODE_P (outermode)
> > > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> > > > > 7560 && vec_duplicate_p (op, &elt))
> > > > > 7561 return gen_vec_duplicate (outermode, elt);
> > > > >
> > > > > but the above simplification is valid only for non-paradoxical registers,
> > > > > where outermode <= innermode. We should not assume that elements outside
> > > > > the original register are valid, let alone all duplicates.
> > > >
> > > > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> > > > Isn't that wrong as well?
> > >
> > > If you mean Comment #10, it is just an observation that
> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
> > >
> > > (gdb) p debug_rtx (src)
> > > (const_vector:V8HI [
> > > (const_int 204 [0xcc]) repeated x4
> > > (const_int 0 [0]) repeated x4
> > > ])
> > >
> > > instead of:
> > >
> > > (gdb) p debug_rtx (src)
> > > (const_vector:V8HI [
> > > (const_int 204 [0xcc]) repeated x8
> > > ])
> > >
> > > which is in line with the statement below.
> > > >
> > > > That is, I think putting any random value into the upper lanes when
> > > > constant folding
> > > > a paradoxical subreg sounds OK to me, no?
> > >
> > > The compiler is putting zero there as can be seen from the above new RTX.
> > >
> > > > Of course we might choose to not do such constant propagation for
> > > > efficiency reason - at least
> > > > when the resulting CONST_* would require a larger constant pool entry
> > > > or more costly
> > > > construction.
> > >
> > > This is probably a follow-up improvement, where this patch tries to
> > > fix a specific invalid simplification of simplify_replace_rtx that is
> > > invalid universally.
> >
> > How so? What specifies the values of the paradoxical subreg for the
> > bytes not covered by the subreg operand?
>
> I don't know why 0 is generated here (and if it is valid) for
> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
> the wrong constant and triggers unwanted propagation later on.
Quoting what I wrote in the PR below. I think pragmatically the fix is
good - we might miss some opportunistic folding this way but we for
sure may not optimistically register an equality via REG_EQUAL without
enforcing it (removing the producer and replacing it with the optimistic
constant).
So consider the patch approved if no other RTL maintainer chimes in
within 48h.
Thanks,
Richard.
I can see cprop1 adds the REG_EQUAL note:
(insn 22 21 23 4 (set (reg:V8HI 100)
(zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
(parallel [
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 3 [0x3])
(const_int 4 [0x4])
(const_int 5 [0x5])
(const_int 6 [0x6])
(const_int 7 [0x7])
])))) "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2}
- (expr_list:REG_DEAD (reg:V4QI 98)
- (nil)))
+ (expr_list:REG_EQUAL (const_vector:V8HI [
+ (const_int 204 [0xcc]) repeated x8
+ ])
+ (expr_list:REG_DEAD (reg:V4QI 98)
+ (nil))))
but I don't see yet what the actual wrong transform based on this
REG_EQUAL note is?
It looks like we CSE the above with
- 46: r122:V8QI=[`*.LC3']
- REG_EQUAL const_vector
- 48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel))
- REG_EQUAL const_vector
- REG_DEAD r122:V8QI
- 49: r126:V8HI=r124:V8HI*r125:V8HI
- REG_DEAD r125:V8HI
+ 49: r126:V8HI=r124:V8HI*r100:V8HI
but otherwise do nothing. So the issue is that we rely on the "undefined"
vals to have a specific value (from the earlier REG_EQUAL note) but actual
code generation doesn't ensure this (it doesn't need to). That said,
the issue isn't the constant folding per-se but that we do not actually
constant fold but register an equality that doesn't hold.
> Uros.
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>>
>> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>> >
>> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>> > >
>> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
>> > > <richard.guenther@gmail.com> wrote:
>> > > >
>> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
>> > > > <gcc-patches@gcc.gnu.org> wrote:
>> > > > >
>> > > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
>> > > > >
>> > > > > (gdb) list
>> > > > > 469 if (code == SUBREG)
>> > > > > 470 {
>> > > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
>> > > > > old_rtx, fn, data);
>> > > > > 472 if (op0 == SUBREG_REG (x))
>> > > > > 473 return x;
>> > > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
>> > > > > 475 GET_MODE (SUBREG_REG (x)),
>> > > > > 476 SUBREG_BYTE (x));
>> > > > > 477 return op0 ? op0 : x;
>> > > > > 478 }
>> > > > >
>> > > > > simplifies with following arguments:
>> > > > >
>> > > > > (gdb) p debug_rtx (op0)
>> > > > > (const_vector:V4QI [
>> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
>> > > > > ])
>> > > > > (gdb) p debug_rtx (x)
>> > > > > (subreg:V16QI (reg:V4QI 98) 0)
>> > > > >
>> > > > > to:
>> > > > >
>> > > > > (gdb) p debug_rtx (op0)
>> > > > > (const_vector:V16QI [
>> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
>> > > > > ])
>> > > > >
>> > > > > This simplification is invalid, it is not possible to get V16QImode vector
>> > > > > from V4QImode vector, even when all elements are duplicates.
>> >
>> > ^^^
>> >
>> > I think this simplification is valid. A simplification to
>> >
>> > (const_vector:V16QI [
>> > (const_int -52 [0xffffffffffffffcc]) repeated x4
>> > (const_int 0 [0]) repeated x12
>> > ])
>> >
>> > would be valid as well.
>> >
>> > > > > The simplification happens in simplify_context::simplify_subreg:
>> > > > >
>> > > > > (gdb) list
>> > > > > 7558 if (VECTOR_MODE_P (outermode)
>> > > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
>> > > > > 7560 && vec_duplicate_p (op, &elt))
>> > > > > 7561 return gen_vec_duplicate (outermode, elt);
>> > > > >
>> > > > > but the above simplification is valid only for non-paradoxical registers,
>> > > > > where outermode <= innermode. We should not assume that elements outside
>> > > > > the original register are valid, let alone all duplicates.
>> > > >
>> > > > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
>> > > > Isn't that wrong as well?
>> > >
>> > > If you mean Comment #10, it is just an observation that
>> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
>> > >
>> > > (gdb) p debug_rtx (src)
>> > > (const_vector:V8HI [
>> > > (const_int 204 [0xcc]) repeated x4
>> > > (const_int 0 [0]) repeated x4
>> > > ])
>> > >
>> > > instead of:
>> > >
>> > > (gdb) p debug_rtx (src)
>> > > (const_vector:V8HI [
>> > > (const_int 204 [0xcc]) repeated x8
>> > > ])
>> > >
>> > > which is in line with the statement below.
>> > > >
>> > > > That is, I think putting any random value into the upper lanes when
>> > > > constant folding
>> > > > a paradoxical subreg sounds OK to me, no?
>> > >
>> > > The compiler is putting zero there as can be seen from the above new RTX.
>> > >
>> > > > Of course we might choose to not do such constant propagation for
>> > > > efficiency reason - at least
>> > > > when the resulting CONST_* would require a larger constant pool entry
>> > > > or more costly
>> > > > construction.
>> > >
>> > > This is probably a follow-up improvement, where this patch tries to
>> > > fix a specific invalid simplification of simplify_replace_rtx that is
>> > > invalid universally.
>> >
>> > How so? What specifies the values of the paradoxical subreg for the
>> > bytes not covered by the subreg operand?
>>
>> I don't know why 0 is generated here (and if it is valid) for
>> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
>> the wrong constant and triggers unwanted propagation later on.
>
> Quoting what I wrote in the PR below. I think pragmatically the fix is
> good - we might miss some opportunistic folding this way but we for
> sure may not optimistically register an equality via REG_EQUAL without
> enforcing it (removing the producer and replacing it with the optimistic
> constant).
>
> So consider the patch approved if no other RTL maintainer chimes in
> within 48h.
Sorry, can you hold off a bit longer? Wanted to have a look but the
deadline is about to expire.
I think at least a comment is needed, since like Richard says,
the transformation seems correct for paradoxical subregs, even if it
isn't wanted for other reasons.
Thanks,
Richard
>
> Thanks,
> Richard.
>
>
> I can see cprop1 adds the REG_EQUAL note:
>
> (insn 22 21 23 4 (set (reg:V8HI 100)
> (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
> (parallel [
> (const_int 0 [0])
> (const_int 1 [0x1])
> (const_int 2 [0x2])
> (const_int 3 [0x3])
> (const_int 4 [0x4])
> (const_int 5 [0x5])
> (const_int 6 [0x6])
> (const_int 7 [0x7])
> ])))) "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2}
> - (expr_list:REG_DEAD (reg:V4QI 98)
> - (nil)))
> + (expr_list:REG_EQUAL (const_vector:V8HI [
> + (const_int 204 [0xcc]) repeated x8
> + ])
> + (expr_list:REG_DEAD (reg:V4QI 98)
> + (nil))))
>
> but I don't see yet what the actual wrong transform based on this
> REG_EQUAL note is?
>
> It looks like we CSE the above with
>
> - 46: r122:V8QI=[`*.LC3']
> - REG_EQUAL const_vector
> - 48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel))
> - REG_EQUAL const_vector
> - REG_DEAD r122:V8QI
> - 49: r126:V8HI=r124:V8HI*r125:V8HI
> - REG_DEAD r125:V8HI
> + 49: r126:V8HI=r124:V8HI*r100:V8HI
>
> but otherwise do nothing. So the issue is that we rely on the "undefined"
> vals to have a specific value (from the earlier REG_EQUAL note) but actual
> code generation doesn't ensure this (it doesn't need to). That said,
> the issue isn't the constant folding per-se but that we do not actually
> constant fold but register an equality that doesn't hold.
>
>
>
>> Uros.
On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> >>
> >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
> >> <richard.guenther@gmail.com> wrote:
> >> >
> >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >> > >
> >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> >> > > <richard.guenther@gmail.com> wrote:
> >> > > >
> >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> >> > > > <gcc-patches@gcc.gnu.org> wrote:
> >> > > > >
> >> > > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> >> > > > >
> >> > > > > (gdb) list
> >> > > > > 469 if (code == SUBREG)
> >> > > > > 470 {
> >> > > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> >> > > > > old_rtx, fn, data);
> >> > > > > 472 if (op0 == SUBREG_REG (x))
> >> > > > > 473 return x;
> >> > > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> >> > > > > 475 GET_MODE (SUBREG_REG (x)),
> >> > > > > 476 SUBREG_BYTE (x));
> >> > > > > 477 return op0 ? op0 : x;
> >> > > > > 478 }
> >> > > > >
> >> > > > > simplifies with following arguments:
> >> > > > >
> >> > > > > (gdb) p debug_rtx (op0)
> >> > > > > (const_vector:V4QI [
> >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
> >> > > > > ])
> >> > > > > (gdb) p debug_rtx (x)
> >> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> >> > > > >
> >> > > > > to:
> >> > > > >
> >> > > > > (gdb) p debug_rtx (op0)
> >> > > > > (const_vector:V16QI [
> >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
> >> > > > > ])
> >> > > > >
> >> > > > > This simplification is invalid, it is not possible to get V16QImode vector
> >> > > > > from V4QImode vector, even when all elements are duplicates.
> >> >
> >> > ^^^
> >> >
> >> > I think this simplification is valid. A simplification to
> >> >
> >> > (const_vector:V16QI [
> >> > (const_int -52 [0xffffffffffffffcc]) repeated x4
> >> > (const_int 0 [0]) repeated x12
> >> > ])
> >> >
> >> > would be valid as well.
> >> >
> >> > > > > The simplification happens in simplify_context::simplify_subreg:
> >> > > > >
> >> > > > > (gdb) list
> >> > > > > 7558 if (VECTOR_MODE_P (outermode)
> >> > > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> >> > > > > 7560 && vec_duplicate_p (op, &elt))
> >> > > > > 7561 return gen_vec_duplicate (outermode, elt);
> >> > > > >
> >> > > > > but the above simplification is valid only for non-paradoxical registers,
> >> > > > > where outermode <= innermode. We should not assume that elements outside
> >> > > > > the original register are valid, let alone all duplicates.
> >> > > >
> >> > > > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> >> > > > Isn't that wrong as well?
> >> > >
> >> > > If you mean Comment #10, it is just an observation that
> >> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
> >> > >
> >> > > (gdb) p debug_rtx (src)
> >> > > (const_vector:V8HI [
> >> > > (const_int 204 [0xcc]) repeated x4
> >> > > (const_int 0 [0]) repeated x4
> >> > > ])
> >> > >
> >> > > instead of:
> >> > >
> >> > > (gdb) p debug_rtx (src)
> >> > > (const_vector:V8HI [
> >> > > (const_int 204 [0xcc]) repeated x8
> >> > > ])
> >> > >
> >> > > which is in line with the statement below.
> >> > > >
> >> > > > That is, I think putting any random value into the upper lanes when
> >> > > > constant folding
> >> > > > a paradoxical subreg sounds OK to me, no?
> >> > >
> >> > > The compiler is putting zero there as can be seen from the above new RTX.
> >> > >
> >> > > > Of course we might choose to not do such constant propagation for
> >> > > > efficiency reason - at least
> >> > > > when the resulting CONST_* would require a larger constant pool entry
> >> > > > or more costly
> >> > > > construction.
> >> > >
> >> > > This is probably a follow-up improvement, where this patch tries to
> >> > > fix a specific invalid simplification of simplify_replace_rtx that is
> >> > > invalid universally.
> >> >
> >> > How so? What specifies the values of the paradoxical subreg for the
> >> > bytes not covered by the subreg operand?
> >>
> >> I don't know why 0 is generated here (and if it is valid) for
> >> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
> >> the wrong constant and triggers unwanted propagation later on.
> >
> > Quoting what I wrote in the PR below. I think pragmatically the fix is
> > good - we might miss some opportunistic folding this way but we for
> > sure may not optimistically register an equality via REG_EQUAL without
> > enforcing it (removing the producer and replacing it with the optimistic
> > constant).
> >
> > So consider the patch approved if no other RTL maintainer chimes in
> > within 48h.
>
> Sorry, can you hold off a bit longer? Wanted to have a look but the
> deadline is about to expire.
No problem, I will wait for you.
Thanks,
Uros.
>
> I think at least a comment is needed, since like Richard says,
> the transformation seems correct for paradoxical subregs, even if it
> isn't wanted for other reasons.
>
> Thanks,
> Richard
>
> >
> > Thanks,
> > Richard.
> >
> >
> > I can see cprop1 adds the REG_EQUAL note:
> >
> > (insn 22 21 23 4 (set (reg:V8HI 100)
> > (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
> > (parallel [
> > (const_int 0 [0])
> > (const_int 1 [0x1])
> > (const_int 2 [0x2])
> > (const_int 3 [0x3])
> > (const_int 4 [0x4])
> > (const_int 5 [0x5])
> > (const_int 6 [0x6])
> > (const_int 7 [0x7])
> > ])))) "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2}
> > - (expr_list:REG_DEAD (reg:V4QI 98)
> > - (nil)))
> > + (expr_list:REG_EQUAL (const_vector:V8HI [
> > + (const_int 204 [0xcc]) repeated x8
> > + ])
> > + (expr_list:REG_DEAD (reg:V4QI 98)
> > + (nil))))
> >
> > but I don't see yet what the actual wrong transform based on this
> > REG_EQUAL note is?
> >
> > It looks like we CSE the above with
> >
> > - 46: r122:V8QI=[`*.LC3']
> > - REG_EQUAL const_vector
> > - 48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel))
> > - REG_EQUAL const_vector
> > - REG_DEAD r122:V8QI
> > - 49: r126:V8HI=r124:V8HI*r125:V8HI
> > - REG_DEAD r125:V8HI
> > + 49: r126:V8HI=r124:V8HI*r100:V8HI
> >
> > but otherwise do nothing. So the issue is that we rely on the "undefined"
> > vals to have a specific value (from the earlier REG_EQUAL note) but actual
> > code generation doesn't ensure this (it doesn't need to). That said,
> > the issue isn't the constant folding per-se but that we do not actually
> > constant fold but register an equality that doesn't hold.
> >
> >
> >
> >> Uros.
On Wed, Jul 12, 2023 at 12:58 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >>
> > >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
> > >> <richard.guenther@gmail.com> wrote:
> > >> >
> > >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >> > >
> > >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> > >> > > <richard.guenther@gmail.com> wrote:
> > >> > > >
> > >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > >> > > > <gcc-patches@gcc.gnu.org> wrote:
> > >> > > > >
> > >> > > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > >> > > > >
> > >> > > > > (gdb) list
> > >> > > > > 469 if (code == SUBREG)
> > >> > > > > 470 {
> > >> > > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > >> > > > > old_rtx, fn, data);
> > >> > > > > 472 if (op0 == SUBREG_REG (x))
> > >> > > > > 473 return x;
> > >> > > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > >> > > > > 475 GET_MODE (SUBREG_REG (x)),
> > >> > > > > 476 SUBREG_BYTE (x));
> > >> > > > > 477 return op0 ? op0 : x;
> > >> > > > > 478 }
> > >> > > > >
> > >> > > > > simplifies with following arguments:
> > >> > > > >
> > >> > > > > (gdb) p debug_rtx (op0)
> > >> > > > > (const_vector:V4QI [
> > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > >> > > > > ])
> > >> > > > > (gdb) p debug_rtx (x)
> > >> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> > >> > > > >
> > >> > > > > to:
> > >> > > > >
> > >> > > > > (gdb) p debug_rtx (op0)
> > >> > > > > (const_vector:V16QI [
> > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > >> > > > > ])
> > >> > > > >
> > >> > > > > This simplification is invalid, it is not possible to get V16QImode vector
> > >> > > > > from V4QImode vector, even when all elements are duplicates.
> > >> >
> > >> > ^^^
> > >> >
> > >> > I think this simplification is valid. A simplification to
> > >> >
> > >> > (const_vector:V16QI [
> > >> > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > >> > (const_int 0 [0]) repeated x12
> > >> > ])
> > >> >
> > >> > would be valid as well.
> > >> >
> > >> > > > > The simplification happens in simplify_context::simplify_subreg:
> > >> > > > >
> > >> > > > > (gdb) list
> > >> > > > > 7558 if (VECTOR_MODE_P (outermode)
> > >> > > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> > >> > > > > 7560 && vec_duplicate_p (op, &elt))
> > >> > > > > 7561 return gen_vec_duplicate (outermode, elt);
> > >> > > > >
> > >> > > > > but the above simplification is valid only for non-paradoxical registers,
> > >> > > > > where outermode <= innermode. We should not assume that elements outside
> > >> > > > > the original register are valid, let alone all duplicates.
> > >> > > >
> > >> > > > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> > >> > > > Isn't that wrong as well?
> > >> > >
> > >> > > If you mean Comment #10, it is just an observation that
> > >> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
> > >> > >
> > >> > > (gdb) p debug_rtx (src)
> > >> > > (const_vector:V8HI [
> > >> > > (const_int 204 [0xcc]) repeated x4
> > >> > > (const_int 0 [0]) repeated x4
> > >> > > ])
> > >> > >
> > >> > > instead of:
> > >> > >
> > >> > > (gdb) p debug_rtx (src)
> > >> > > (const_vector:V8HI [
> > >> > > (const_int 204 [0xcc]) repeated x8
> > >> > > ])
> > >> > >
> > >> > > which is in line with the statement below.
> > >> > > >
> > >> > > > That is, I think putting any random value into the upper lanes when
> > >> > > > constant folding
> > >> > > > a paradoxical subreg sounds OK to me, no?
> > >> > >
> > >> > > The compiler is putting zero there as can be seen from the above new RTX.
> > >> > >
> > >> > > > Of course we might choose to not do such constant propagation for
> > >> > > > efficiency reason - at least
> > >> > > > when the resulting CONST_* would require a larger constant pool entry
> > >> > > > or more costly
> > >> > > > construction.
> > >> > >
> > >> > > This is probably a follow-up improvement, where this patch tries to
> > >> > > fix a specific invalid simplification of simplify_replace_rtx that is
> > >> > > invalid universally.
> > >> >
> > >> > How so? What specifies the values of the paradoxical subreg for the
> > >> > bytes not covered by the subreg operand?
> > >>
> > >> I don't know why 0 is generated here (and if it is valid) for
> > >> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
> > >> the wrong constant and triggers unwanted propagation later on.
> > >
> > > Quoting what I wrote in the PR below. I think pragmatically the fix is
> > > good - we might miss some opportunistic folding this way but we for
> > > sure may not optimistically register an equality via REG_EQUAL without
> > > enforcing it (removing the producer and replacing it with the optimistic
> > > constant).
> > >
> > > So consider the patch approved if no other RTL maintainer chimes in
> > > within 48h.
> >
> > Sorry, can you hold off a bit longer? Wanted to have a look but the
> > deadline is about to expire.
>
> No problem, I will wait for you.
Please also note Comment #14 in the PR. With the patch, the compiler sets
(const_vector:V8HI [
(const_int 204 [0xcc]) repeated x4
(const_int 0 [0]) repeated x4
])
as a new REG_EQUAL note. This also does not look OK to me. IMO, the
compiler should not emit REG_EQUAL note when some of the elements are
derived from undefined values outside the original register.
Uros.
On Wed, Jul 12, 2023 at 1:05 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 12:58 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> > > >>
> > > >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
> > > >> <richard.guenther@gmail.com> wrote:
> > > >> >
> > > >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > > >> > >
> > > >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> > > >> > > <richard.guenther@gmail.com> wrote:
> > > >> > > >
> > > >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > > >> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > >> > > > >
> > > >> > > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > > >> > > > >
> > > >> > > > > (gdb) list
> > > >> > > > > 469 if (code == SUBREG)
> > > >> > > > > 470 {
> > > >> > > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > >> > > > > old_rtx, fn, data);
> > > >> > > > > 472 if (op0 == SUBREG_REG (x))
> > > >> > > > > 473 return x;
> > > >> > > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > >> > > > > 475 GET_MODE (SUBREG_REG (x)),
> > > >> > > > > 476 SUBREG_BYTE (x));
> > > >> > > > > 477 return op0 ? op0 : x;
> > > >> > > > > 478 }
> > > >> > > > >
> > > >> > > > > simplifies with following arguments:
> > > >> > > > >
> > > >> > > > > (gdb) p debug_rtx (op0)
> > > >> > > > > (const_vector:V4QI [
> > > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > > >> > > > > ])
> > > >> > > > > (gdb) p debug_rtx (x)
> > > >> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> > > >> > > > >
> > > >> > > > > to:
> > > >> > > > >
> > > >> > > > > (gdb) p debug_rtx (op0)
> > > >> > > > > (const_vector:V16QI [
> > > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
> > > >> > > > > ])
> > > >> > > > >
> > > >> > > > > This simplification is invalid, it is not possible to get V16QImode vector
> > > >> > > > > from V4QImode vector, even when all elements are duplicates.
> > > >> >
> > > >> > ^^^
> > > >> >
> > > >> > I think this simplification is valid. A simplification to
> > > >> >
> > > >> > (const_vector:V16QI [
> > > >> > (const_int -52 [0xffffffffffffffcc]) repeated x4
> > > >> > (const_int 0 [0]) repeated x12
> > > >> > ])
> > > >> >
> > > >> > would be valid as well.
> > > >> >
> > > >> > > > > The simplification happens in simplify_context::simplify_subreg:
> > > >> > > > >
> > > >> > > > > (gdb) list
> > > >> > > > > 7558 if (VECTOR_MODE_P (outermode)
> > > >> > > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> > > >> > > > > 7560 && vec_duplicate_p (op, &elt))
> > > >> > > > > 7561 return gen_vec_duplicate (outermode, elt);
> > > >> > > > >
> > > >> > > > > but the above simplification is valid only for non-paradoxical registers,
> > > >> > > > > where outermode <= innermode. We should not assume that elements outside
> > > >> > > > > the original register are valid, let alone all duplicates.
> > > >> > > >
> > > >> > > > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> > > >> > > > Isn't that wrong as well?
> > > >> > >
> > > >> > > If you mean Comment #10, it is just an observation that
> > > >> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
> > > >> > >
> > > >> > > (gdb) p debug_rtx (src)
> > > >> > > (const_vector:V8HI [
> > > >> > > (const_int 204 [0xcc]) repeated x4
> > > >> > > (const_int 0 [0]) repeated x4
> > > >> > > ])
> > > >> > >
> > > >> > > instead of:
> > > >> > >
> > > >> > > (gdb) p debug_rtx (src)
> > > >> > > (const_vector:V8HI [
> > > >> > > (const_int 204 [0xcc]) repeated x8
> > > >> > > ])
> > > >> > >
> > > >> > > which is in line with the statement below.
> > > >> > > >
> > > >> > > > That is, I think putting any random value into the upper lanes when
> > > >> > > > constant folding
> > > >> > > > a paradoxical subreg sounds OK to me, no?
> > > >> > >
> > > >> > > The compiler is putting zero there as can be seen from the above new RTX.
> > > >> > >
> > > >> > > > Of course we might choose to not do such constant propagation for
> > > >> > > > efficiency reason - at least
> > > >> > > > when the resulting CONST_* would require a larger constant pool entry
> > > >> > > > or more costly
> > > >> > > > construction.
> > > >> > >
> > > >> > > This is probably a follow-up improvement, where this patch tries to
> > > >> > > fix a specific invalid simplification of simplify_replace_rtx that is
> > > >> > > invalid universally.
> > > >> >
> > > >> > How so? What specifies the values of the paradoxical subreg for the
> > > >> > bytes not covered by the subreg operand?
> > > >>
> > > >> I don't know why 0 is generated here (and if it is valid) for
> > > >> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
> > > >> the wrong constant and triggers unwanted propagation later on.
> > > >
> > > > Quoting what I wrote in the PR below. I think pragmatically the fix is
> > > > good - we might miss some opportunistic folding this way but we for
> > > > sure may not optimistically register an equality via REG_EQUAL without
> > > > enforcing it (removing the producer and replacing it with the optimistic
> > > > constant).
> > > >
> > > > So consider the patch approved if no other RTL maintainer chimes in
> > > > within 48h.
> > >
> > > Sorry, can you hold off a bit longer? Wanted to have a look but the
> > > deadline is about to expire.
> >
> > No problem, I will wait for you.
>
> Please also note Comment #14 in the PR. With the patch, the compiler sets
>
> (const_vector:V8HI [
> (const_int 204 [0xcc]) repeated x4
> (const_int 0 [0]) repeated x4
> ])
>
> as a new REG_EQUAL note. This also does not look OK to me. IMO, the
> compiler should not emit REG_EQUAL note when some of the elements are
> derived from undefined values outside the original register.
Yes. The reverse would also be true, so when we'd actually constant propagate
(which would be OK) we also may not put an REG_EQUAL note with the original
expression which produced the undefined values.
Within the current framework it looks difficult to guarantee either so the patch
removing the possibility of simplifying alltogether looks most reasonable to me.
Richard.
>
> Uros.
Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Jul 12, 2023 at 1:05 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>>
>> On Wed, Jul 12, 2023 at 12:58 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>> >
>> > On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> > >
>> > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > > > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>> > > >>
>> > > >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
>> > > >> <richard.guenther@gmail.com> wrote:
>> > > >> >
>> > > >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>> > > >> > >
>> > > >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
>> > > >> > > <richard.guenther@gmail.com> wrote:
>> > > >> > > >
>> > > >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
>> > > >> > > > <gcc-patches@gcc.gnu.org> wrote:
>> > > >> > > > >
>> > > >> > > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
>> > > >> > > > >
>> > > >> > > > > (gdb) list
>> > > >> > > > > 469 if (code == SUBREG)
>> > > >> > > > > 470 {
>> > > >> > > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
>> > > >> > > > > old_rtx, fn, data);
>> > > >> > > > > 472 if (op0 == SUBREG_REG (x))
>> > > >> > > > > 473 return x;
>> > > >> > > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0,
>> > > >> > > > > 475 GET_MODE (SUBREG_REG (x)),
>> > > >> > > > > 476 SUBREG_BYTE (x));
>> > > >> > > > > 477 return op0 ? op0 : x;
>> > > >> > > > > 478 }
>> > > >> > > > >
>> > > >> > > > > simplifies with following arguments:
>> > > >> > > > >
>> > > >> > > > > (gdb) p debug_rtx (op0)
>> > > >> > > > > (const_vector:V4QI [
>> > > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x4
>> > > >> > > > > ])
>> > > >> > > > > (gdb) p debug_rtx (x)
>> > > >> > > > > (subreg:V16QI (reg:V4QI 98) 0)
>> > > >> > > > >
>> > > >> > > > > to:
>> > > >> > > > >
>> > > >> > > > > (gdb) p debug_rtx (op0)
>> > > >> > > > > (const_vector:V16QI [
>> > > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x16
>> > > >> > > > > ])
>> > > >> > > > >
>> > > >> > > > > This simplification is invalid, it is not possible to get V16QImode vector
>> > > >> > > > > from V4QImode vector, even when all elements are duplicates.
>> > > >> >
>> > > >> > ^^^
>> > > >> >
>> > > >> > I think this simplification is valid. A simplification to
>> > > >> >
>> > > >> > (const_vector:V16QI [
>> > > >> > (const_int -52 [0xffffffffffffffcc]) repeated x4
>> > > >> > (const_int 0 [0]) repeated x12
>> > > >> > ])
>> > > >> >
>> > > >> > would be valid as well.
>> > > >> >
>> > > >> > > > > The simplification happens in simplify_context::simplify_subreg:
>> > > >> > > > >
>> > > >> > > > > (gdb) list
>> > > >> > > > > 7558 if (VECTOR_MODE_P (outermode)
>> > > >> > > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
>> > > >> > > > > 7560 && vec_duplicate_p (op, &elt))
>> > > >> > > > > 7561 return gen_vec_duplicate (outermode, elt);
>> > > >> > > > >
>> > > >> > > > > but the above simplification is valid only for non-paradoxical registers,
>> > > >> > > > > where outermode <= innermode. We should not assume that elements outside
>> > > >> > > > > the original register are valid, let alone all duplicates.
>> > > >> > > >
>> > > >> > > > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
>> > > >> > > > Isn't that wrong as well?
>> > > >> > >
>> > > >> > > If you mean Comment #10, it is just an observation that
>> > > >> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
>> > > >> > >
>> > > >> > > (gdb) p debug_rtx (src)
>> > > >> > > (const_vector:V8HI [
>> > > >> > > (const_int 204 [0xcc]) repeated x4
>> > > >> > > (const_int 0 [0]) repeated x4
>> > > >> > > ])
>> > > >> > >
>> > > >> > > instead of:
>> > > >> > >
>> > > >> > > (gdb) p debug_rtx (src)
>> > > >> > > (const_vector:V8HI [
>> > > >> > > (const_int 204 [0xcc]) repeated x8
>> > > >> > > ])
>> > > >> > >
>> > > >> > > which is in line with the statement below.
>> > > >> > > >
>> > > >> > > > That is, I think putting any random value into the upper lanes when
>> > > >> > > > constant folding
>> > > >> > > > a paradoxical subreg sounds OK to me, no?
>> > > >> > >
>> > > >> > > The compiler is putting zero there as can be seen from the above new RTX.
>> > > >> > >
>> > > >> > > > Of course we might choose to not do such constant propagation for
>> > > >> > > > efficiency reason - at least
>> > > >> > > > when the resulting CONST_* would require a larger constant pool entry
>> > > >> > > > or more costly
>> > > >> > > > construction.
>> > > >> > >
>> > > >> > > This is probably a follow-up improvement, where this patch tries to
>> > > >> > > fix a specific invalid simplification of simplify_replace_rtx that is
>> > > >> > > invalid universally.
>> > > >> >
>> > > >> > How so? What specifies the values of the paradoxical subreg for the
>> > > >> > bytes not covered by the subreg operand?
>> > > >>
>> > > >> I don't know why 0 is generated here (and if it is valid) for
>> > > >> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
>> > > >> the wrong constant and triggers unwanted propagation later on.
>> > > >
>> > > > Quoting what I wrote in the PR below. I think pragmatically the fix is
>> > > > good - we might miss some opportunistic folding this way but we for
>> > > > sure may not optimistically register an equality via REG_EQUAL without
>> > > > enforcing it (removing the producer and replacing it with the optimistic
>> > > > constant).
>> > > >
>> > > > So consider the patch approved if no other RTL maintainer chimes in
>> > > > within 48h.
>> > >
>> > > Sorry, can you hold off a bit longer? Wanted to have a look but the
>> > > deadline is about to expire.
>> >
>> > No problem, I will wait for you.
>>
>> Please also note Comment #14 in the PR. With the patch, the compiler sets
>>
>> (const_vector:V8HI [
>> (const_int 204 [0xcc]) repeated x4
>> (const_int 0 [0]) repeated x4
>> ])
>>
>> as a new REG_EQUAL note. This also does not look OK to me. IMO, the
>> compiler should not emit REG_EQUAL note when some of the elements are
>> derived from undefined values outside the original register.
>
> Yes. The reverse would also be true, so when we'd actually constant propagate
> (which would be OK) we also may not put an REG_EQUAL note with the original
> expression which produced the undefined values.
Yeah. Good, it sounds like we're all in agreement. :) For:
(insn 22 21 23 4 (set (reg:V8HI 100)
(zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
(parallel [
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 3 [0x3])
(const_int 4 [0x4])
(const_int 5 [0x5])
(const_int 6 [0x6])
(const_int 7 [0x7])
])))) "pr110206.c":12:42 7471 {sse4_1_zero_extendv8qiv8hi2}
(expr_list:REG_EQUAL (const_vector:V8HI [
(const_int 204 [0xcc]) repeated x8
])
(expr_list:REG_DEAD (reg:V4QI 98)
(nil))))
folding to:
(set (reg:V8HI 100)
(const_vector:V8HI [(const_int 204 [0xcc]) ...]))
would be OK, but adding the REG_EQUAL is not.
> Within the current framework it looks difficult to guarantee either so the patch
> removing the possibility of simplifying alltogether looks most reasonable to me.
But like Uros says, the alternative REG_EQUAL is equally wrong
in principle. It just happens to work out OK for this testcase.
fwprop.c has:
/* If there are any paradoxical SUBREGs, drop the REG_EQUAL note,
because the bits in there can be anything and so might not
match the REG_EQUAL note content. See PR70574. */
if (contains_paradoxical_subreg_p (SET_SRC (set)))
return false;
expand_mult_const has a similar guard. IMO the bug is that cprop is
lacking the same test.
Thanks,
Richard
@@ -7557,6 +7557,7 @@ simplify_context::simplify_subreg (machine_mode outermode, rtx op,
if (VECTOR_MODE_P (outermode)
&& GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
+ && !paradoxical_subreg_p (outermode, innermode)
&& vec_duplicate_p (op, &elt))
return gen_vec_duplicate (outermode, elt);
new file mode 100644
@@ -0,0 +1,30 @@
+/* PR target/110206 */
+/* { dg-do run { target x86_64-*-* i?86-*-* } } */
+
+typedef unsigned char __attribute__((__vector_size__ (4))) U;
+typedef unsigned char __attribute__((__vector_size__ (8))) V;
+typedef unsigned short u16;
+
+V g;
+
+void
+__attribute__((noinline))
+foo (U u, u16 c, V *r)
+{
+ if (!c)
+ __builtin_abort ();
+ V x = __builtin_shufflevector (u, (204 >> u), 7, 0, 5, 1, 3, 5, 0, 2);
+ V y = __builtin_shufflevector (g, (V) { }, 7, 6, 6, 7, 2, 6, 3, 5);
+ V z = __builtin_shufflevector (y, 204 * x, 3, 9, 8, 1, 4, 6, 14, 5);
+ *r = z;
+}
+
+int
+main (void)
+{
+ V r;
+ foo ((U){4}, 5, &r);
+ if (r[6] != 0x30)
+ __builtin_abort();
+ return 0;
+}