expmed: Allow extract_bit_field via mem for low-precision modes.

Message ID e52abe71-2364-32e7-d29a-dc05452a06f5@gmail.com
State Accepted
Headers
Series expmed: Allow extract_bit_field via mem for low-precision modes. |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Robin Dapp Aug. 30, 2023, 12:22 p.m. UTC
  Hi,

when looking at a riscv ICE in vect-live-6.c I noticed that we
assume that the variable part (coeffs[1] * x1) of the to-be-extracted
bit number in extract_bit_field_1 is a multiple of BITS_PER_UNIT.

This means that bits_to_bytes_round_down and num_trailing_bits
cannot handle e.g. extracting from a "VNx4BI"-mode vector which has
4-bit precision on riscv.

This patch adds a special case for that situation and sets bytenum to
zero as well as bitnum to its proper value.  It works for the riscv
case because in all other situations we can align to a byte boundary.
If x1 were 3 for some reason, however, the above assertion would still
fail.  I don't think this can happen for riscv as we only ever double
the number of chunks for larger vector sizes but not sure about the
general case.

If there's another, correct way to work around feel free to suggest.

Bootstrap/testsuite on aarch64 and x86 is running but I would be
surprised if there were any changes as riscv is the only target that
uses modes with precision < 8.

Regards
 Robin

gcc/ChangeLog:

	* expmed.cc (extract_bit_field_1): Handle bitnum with variable
	part less than BITS_PER_UNIT.
---
 gcc/expmed.cc | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)
  

Comments

Richard Sandiford Aug. 30, 2023, 1:08 p.m. UTC | #1
[Sorry for any weird MUA issues, don't have access to my usual set-up.]

> when looking at a riscv ICE in vect-live-6.c I noticed that we
> assume that the variable part (coeffs[1] * x1) of the to-be-extracted
> bit number in extract_bit_field_1 is a multiple of BITS_PER_UNIT.
>
> This means that bits_to_bytes_round_down and num_trailing_bits
> cannot handle e.g. extracting from a "VNx4BI"-mode vector which has
> 4-bit precision on riscv.

But in the VLA case, doesn't it instead have precision 4+4X?
The problem then is that we can't tell at compile time which
byte that corresponds to.  So...

> This patch adds a special case for that situation and sets bytenum to
> zero as well as bitnum to its proper value.  It works for the riscv
> case because in all other situations we can align to a byte boundary.
> If x1 were 3 for some reason, however, the above assertion would still
> fail.  I don't think this can happen for riscv as we only ever double
> the number of chunks for larger vector sizes but not sure about the
> general case.
>
> If there's another, correct way to work around feel free to suggest.
>
> Bootstrap/testsuite on aarch64 and x86 is running but I would be
> surprised if there were any changes as riscv is the only target that
> uses modes with precision < 8.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>	* expmed.cc (extract_bit_field_1): Handle bitnum with variable
>	part less than BITS_PER_UNIT.
> ---
>  gcc/expmed.cc | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index e22e43c8505..1b0119f9cfc 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -1858,8 +1858,22 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum,
>       but is useful for things like vector booleans.  */
>    if (MEM_P (op0) && !bitnum.is_constant ())
>      {
> -      bytenum = bits_to_bytes_round_down (bitnum);
> -      bitnum = num_trailing_bits (bitnum);
> +      /* bits_to_bytes_round_down tries to align to a byte (BITS_PER_UNIT)
> +	 boundary and asserts that bitnum.coeffs[1] % BITS_PER_UNIT == 0.
> +	 For modes with precision < BITS_PER_UNIT this fails but we can
> +	 still extract from the first byte.  */
> +      poly_uint16 prec = GET_MODE_PRECISION (outermode);
> +      if (prec.coeffs[1] < BITS_PER_UNIT && bitnum.coeffs[1] < BITS_PER_UNIT)
> +	{
> +	  bytenum = 0;
> +	  bitnum = bitnum.coeffs[0] & (BITS_PER_UNIT - 1);

...this doesn't look right.  We can't drop bitnum.coeffs[1] when it's
nonzero, because it says that for some runtime vector sizes, the bit
position might be higher than bitnum.coeffs[0].

Also, it's not possible to access coeffs[1] unconditionally in
target-independent code.

Thanks,
Richard

> +	}
> +      else
> +	{
> +	  bytenum = bits_to_bytes_round_down (bitnum);
> +	  bitnum = num_trailing_bits (bitnum);
> +	}
> +
>        poly_uint64 bytesize = bits_to_bytes_round_up (bitnum + bitsize);
>        op0 = adjust_bitfield_address_size (op0, BLKmode, bytenum, bytesize);
>        op0_mode = opt_scalar_int_mode ();
  
Robin Dapp Aug. 30, 2023, 3:06 p.m. UTC | #2
> But in the VLA case, doesn't it instead have precision 4+4X?
> The problem then is that we can't tell at compile time which
> byte that corresponds to.  So...

Yes 4 + 4x.  I keep getting confused with poly modes :)
In this case we want to extract the bitnum [3 4] = 3 + 4x which
would be in byte 0 for x = 0 or x = 1 and in byte 1 for x = 2, 3 and
so on.

Can't we still make that work somehow?  As far as I can tell we're looking
for the byte range to be accessed.  It's not like we have a precision or
bitnum of e.g. [3 17] where the access could be anywhere but still a pow2
fraction of BITS_PER_UNIT.

I'm just having trouble writing that down.

What about something like

    int factor = BITS_PER_UINT / prec.coeffs[0];
    bytenum = force_align_down_and_div (bitnum, prec.coeffs[0]);
    bytenum *= factor;

(or a similar thing done manually without helpers) guarded by the
proper condition?
Or do we need something more generic for the factor (i.e. prec.coeffs[0])
is not enough when we have a precision like [8 16]? Does that even exist?.

Regards
 Robin
  
Richard Sandiford Aug. 30, 2023, 5:13 p.m. UTC | #3
Robin Dapp <rdapp.gcc@gmail.com> writes:
>> But in the VLA case, doesn't it instead have precision 4+4X?
>> The problem then is that we can't tell at compile time which
>> byte that corresponds to.  So...
>
> Yes 4 + 4x.  I keep getting confused with poly modes :)
> In this case we want to extract the bitnum [3 4] = 3 + 4x which
> would be in byte 0 for x = 0 or x = 1 and in byte 1 for x = 2, 3 and
> so on.
>
> Can't we still make that work somehow?  As far as I can tell we're looking
> for the byte range to be accessed.  It's not like we have a precision or
> bitnum of e.g. [3 17] where the access could be anywhere but still a pow2
> fraction of BITS_PER_UNIT.
>
> I'm just having trouble writing that down.
>
> What about something like
>
>     int factor = BITS_PER_UINT / prec.coeffs[0];
>     bytenum = force_align_down_and_div (bitnum, prec.coeffs[0]);
>     bytenum *= factor;
>
> (or a similar thing done manually without helpers) guarded by the
> proper condition?
> Or do we need something more generic for the factor (i.e. prec.coeffs[0])
> is not enough when we have a precision like [8 16]? Does that even exist?.

It's not just a question of which byte though.  It's also a question
of which bit.

One option would be to code-generate for even X and for odd X, and select
between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.

Otherwise I think we need to treat the bit position as a variable,
with bitpos % 8 and bitpos / 8 being calculated at runtime.

Thanks,
Richard
  
Robin Dapp Sept. 1, 2023, 10:38 a.m. UTC | #4
> It's not just a question of which byte though.  It's also a question
> of which bit.
> 
> One option would be to code-generate for even X and for odd X, and select
> between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.
> 
> Otherwise I think we need to treat the bit position as a variable,
> with bitpos % 8 and bitpos / 8 being calculated at runtime.

Thanks.  I worked around it with a backend vec_extract<VNxBI>QI expander
so we don't run into that situation directly anymore.  The problem is of
course still latent and I'm going to look at it again after some other things
on my plate.

Regards
 Robin
  
Richard Sandiford Sept. 1, 2023, 10:45 a.m. UTC | #5
Robin Dapp via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> It's not just a question of which byte though.  It's also a question
>> of which bit.
>> 
>> One option would be to code-generate for even X and for odd X, and select
>> between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.
>> 
>> Otherwise I think we need to treat the bit position as a variable,
>> with bitpos % 8 and bitpos / 8 being calculated at runtime.
>
> Thanks.  I worked around it with a backend vec_extract<VNxBI>QI expander
> so we don't run into that situation directly anymore.  The problem is of
> course still latent and I'm going to look at it again after some other things
> on my plate.

Yeah, sounds like a good workaround.  If the target has an efficient way
of coping with the VLAness then the optab will probably be better than
whatever the generic code ends up being.

Thanks,
Richard
  

Patch

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index e22e43c8505..1b0119f9cfc 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -1858,8 +1858,22 @@  extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, poly_uint64 bitnum,
      but is useful for things like vector booleans.  */
   if (MEM_P (op0) && !bitnum.is_constant ())
     {
-      bytenum = bits_to_bytes_round_down (bitnum);
-      bitnum = num_trailing_bits (bitnum);
+      /* bits_to_bytes_round_down tries to align to a byte (BITS_PER_UNIT)
+	 boundary and asserts that bitnum.coeffs[1] % BITS_PER_UNIT == 0.
+	 For modes with precision < BITS_PER_UNIT this fails but we can
+	 still extract from the first byte.  */
+      poly_uint16 prec = GET_MODE_PRECISION (outermode);
+      if (prec.coeffs[1] < BITS_PER_UNIT && bitnum.coeffs[1] < BITS_PER_UNIT)
+	{
+	  bytenum = 0;
+	  bitnum = bitnum.coeffs[0] & (BITS_PER_UNIT - 1);
+	}
+      else
+	{
+	  bytenum = bits_to_bytes_round_down (bitnum);
+	  bitnum = num_trailing_bits (bitnum);
+	}
+
       poly_uint64 bytesize = bits_to_bytes_round_up (bitnum + bitsize);
       op0 = adjust_bitfield_address_size (op0, BLKmode, bytenum, bytesize);
       op0_mode = opt_scalar_int_mode ();