[v2] libgcc: Mostly vectorize CIE encoding extraction for FDEs

Message ID 878rkr12mj.fsf@oldenburg.str.redhat.com
State Accepted
Headers
Series [v2] libgcc: Mostly vectorize CIE encoding extraction for FDEs |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

Florian Weimer Nov. 4, 2022, 9:37 a.m. UTC
  "zR" and "zPLR" are the most common augmentations.  Use a simple
SIMD-with-in-a-register technique to check for both augmentations,
and that the following variable-length integers have length 1, to
get more quickly at the encoding field.

libgcc/

	* unwind-dw2-fde.c (get_cie_encoding_slow): Rename from
	get_cie_encoding.  Mark as noinline.
	(get_cie_encoding): Add fast path for "zR" and "zPLR"
	augmentations.  Call get_cie_encoding_slow as a fall-back.

---
v2: Use memcpy to avoid a potential aliasing violation.

 libgcc/unwind-dw2-fde.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 60 insertions(+), 2 deletions(-)


base-commit: e724b0480bfa5ec04f39be8c7290330b495c59de
  

Comments

Jakub Jelinek Nov. 4, 2022, 10:27 a.m. UTC | #1
On Fri, Nov 04, 2022 at 10:37:56AM +0100, Florian Weimer via Gcc-patches wrote:
> "zR" and "zPLR" are the most common augmentations.  Use a simple
> SIMD-with-in-a-register technique to check for both augmentations,
> and that the following variable-length integers have length 1, to
> get more quickly at the encoding field.
> 
> libgcc/
> 
> 	* unwind-dw2-fde.c (get_cie_encoding_slow): Rename from
> 	get_cie_encoding.  Mark as noinline.
> 	(get_cie_encoding): Add fast path for "zR" and "zPLR"
> 	augmentations.  Call get_cie_encoding_slow as a fall-back.
> 
> ---
> v2: Use memcpy to avoid a potential aliasing violation.
> 
>  libgcc/unwind-dw2-fde.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 60 insertions(+), 2 deletions(-)
> 
> diff --git a/libgcc/unwind-dw2-fde.c b/libgcc/unwind-dw2-fde.c
> index 3c0cc654ec0..21eee77882e 100644
> --- a/libgcc/unwind-dw2-fde.c
> +++ b/libgcc/unwind-dw2-fde.c
> @@ -333,8 +333,10 @@ base_from_object (unsigned char encoding, const struct object *ob)
>  /* Return the FDE pointer encoding from the CIE.  */
>  /* ??? This is a subset of extract_cie_info from unwind-dw2.c.  */
>  
> -static int
> -get_cie_encoding (const struct dwarf_cie *cie)
> +/* Disable inlining because the function is only used as a slow path in
> +   get_cie_encoding below.  */
> +static int __attribute__ ((noinline))
> +get_cie_encoding_slow (const struct dwarf_cie *cie)
>  {
>    const unsigned char *aug, *p;
>    _Unwind_Ptr dummy;
> @@ -389,6 +391,62 @@ get_cie_encoding (const struct dwarf_cie *cie)
>      }
>  }
>  
> +static inline int
> +get_cie_encoding (const struct dwarf_cie *cie)
> +{
> +  /* Fast path for some augmentations and single-byte variable-length
> +     integers.  Do this only for targets that align struct dwarf_cie to 8
> +     bytes, which ensures that at least 8 bytes are available starting at
> +     cie->version.  */
> +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ \
> +  || __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +  if (__alignof (*cie) == 8 && sizeof (unsigned long long) == 8)
> +    {
> +      unsigned long long value;
> +      memcpy (&value, &cie->version, sizeof (value));
> +
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +#define C(x) __builtin_bswap64 (x)
> +#else
> +#define C(x) x
> +#endif
> +
> +      /* Fast path for "zR".  Check for version 1, the "zR" string and that
> +	 the sleb128/uleb128 values are single bytes.  In the comments
> +	 below, '1', 'c', 'd', 'r', 'l' are version, code alignment, data
> +	 alignment, return address column, augmentation length.  Note that
> +	 with CIE version 1, the return address column is byte-encoded.  */
> +      unsigned long long expected =
> +	/*   1 z R 0 c d r l.  */
> +	C (0x017a520000000000ULL);
> +      unsigned long long mask =
> +	/*   1 z R 0 c d r l.  */
> +	C (0xffffffff80800080ULL);
> +
> +      if ((value & mask) == expected)
> +	return cie->augmentation[7];
> +
> +      /* Fast path for "zPLR".  */
> +      expected =
> +	/*   1 z P L R 0 c d.  */
> +	C (0x017a504c52000000ULL);
> +      mask =
> +	/*   1 z P L R 0 c d.  */
> +	C (0xffffffffffff8080ULL);

The formatting is all wrong, = should go before C, not at the end of lines.

> +#undef C
> +
> +      /* Validate the augmentation length, and return the enconding after
> +	 it.  No check for the return address column because it is
> +	 byte-encoded with CIE version 1.  */
> +      if (__builtin_expect ((value & mask) == expected
> +			    && (cie->augmentation[8] & 0x80) == 0, 1))
> +	  return cie->augmentation[9];

And the above line is indented too much, should be just tab.

But more importantly, I don't see how it can work at all correctly.
For the "zR" case:
  Version:               1
  Augmentation:          "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1b
I believe the function wants to return 0x1b, we have
1 z R 0 c d r l X
and we want to return the X, which is indeed at &cie->version + 8
aka cie->augmentation[7].
But with "zPLR"
  Version:               1
  Augmentation:          "zPLR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     03 00 00 00 00 03 1b
we have
1 z P L R 0 c d r l M N N N N O X 
and still want to return the X, while you return the M in there.
How large the N is depends on the M value.  The "P" data
is <encoding byte> + <encoded value>, "L" takes just <encoding byte>
and so does "R".
So, you'd need in addition to your (cie->augmentation[8] & 0x80) == 0
check (that augmentation data length is single byte uleb128) verify
that cie->augmentation[9] is DW_EH_PE_udata4, then you can skip
the 4 bytes after it and the one byte for "L" and return
cie->augmentation[15];  But you'd need to verify what different
DW_EH_PE_* values are used in the wild.

BTW, version 1 is there just because binutils emit those,
if one uses -fno-dwarf2-cfi-asm, there is:
  Version:               3
  Augmentation:          "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     03
and
  Version:               3
  Augmentation:          "zPLR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     03 00 00 00 00 03 03
instead (so 3 instead of 1, and leb128 RA instead of byte;
1 vs. 3 can be handled by mask and checking if top byte of the
RA isn't set could work too).  One should trie what one gets
with -fpic/-fpie/-fno-pie on various arches though.
E.g. doesn't aarch64 use augmentations with B in it?

Say ia32 I see:
  Version:               1
  Augmentation:          "zPLR"
  Code alignment factor: 1
  Data alignment factor: -4
  Return address column: 8
  Augmentation data:     00 00 00 00 00 00 1b
and
  Version:               1
  Augmentation:          "zPLR"
  Code alignment factor: 1
  Data alignment factor: -4
  Return address column: 8
  Augmentation data:     9b ed ff ff ff 1b 1b
so it is already DW_EH_absptr, DW_EH_PE_udata4 and
DW_EH_PE_pcrel|DW_EH_PE_sdata4 for the "P" encoding.
The latter 2 and the former on 32-bit arches have 4
bytes encoding, but still, perhaps it would be easier
to rely on the augmentation data length and return
the last byte of the augmentation (when "R" is known to be last)?

	Jakub
  
Florian Weimer Nov. 4, 2022, 12:49 p.m. UTC | #2
* Jakub Jelinek:

>> +#undef C
>> +
>> +      /* Validate the augmentation length, and return the enconding after
>> +	 it.  No check for the return address column because it is
>> +	 byte-encoded with CIE version 1.  */
>> +      if (__builtin_expect ((value & mask) == expected
>> +			    && (cie->augmentation[8] & 0x80) == 0, 1))
>> +	  return cie->augmentation[9];
>
> And the above line is indented too much, should be just tab.
>
> But more importantly, I don't see how it can work at all correctly.
> For the "zR" case:
>   Version:               1
>   Augmentation:          "zR"
>   Code alignment factor: 1
>   Data alignment factor: -8
>   Return address column: 16
>   Augmentation data:     1b
> I believe the function wants to return 0x1b, we have
> 1 z R 0 c d r l X
> and we want to return the X, which is indeed at &cie->version + 8
> aka cie->augmentation[7].
> But with "zPLR"
>   Version:               1
>   Augmentation:          "zPLR"
>   Code alignment factor: 1
>   Data alignment factor: -8
>   Return address column: 16
>   Augmentation data:     03 00 00 00 00 03 1b
> we have
> 1 z P L R 0 c d r l M N N N N O X 
> and still want to return the X, while you return the M in there.
> How large the N is depends on the M value.  The "P" data
> is <encoding byte> + <encoded value>, "L" takes just <encoding byte>
> and so does "R".
> So, you'd need in addition to your (cie->augmentation[8] & 0x80) == 0
> check (that augmentation data length is single byte uleb128) verify
> that cie->augmentation[9] is DW_EH_PE_udata4, then you can skip
> the 4 bytes after it and the one byte for "L" and return
> cie->augmentation[15];  But you'd need to verify what different
> DW_EH_PE_* values are used in the wild.

Meh, right.  I think what I'm doing is pretty pointless.  We call
extract_cie_info immediately after calling _Unwind_Find_FDE in
uw_frame_state_for.  That already gives us the FDE encoding.  So it
seems to make more sense to do the PC range check in uw_frame_state_for,
and _Unwind_Find_FDE won't need the FDE encoding anymore.

My testing doesn't catch this problem unfortunately.

> BTW, version 1 is there just because binutils emit those,
> if one uses -fno-dwarf2-cfi-asm, there is:
>   Version:               3
>   Augmentation:          "zR"
>   Code alignment factor: 1
>   Data alignment factor: -8
>   Return address column: 16
>   Augmentation data:     03
> and
>   Version:               3
>   Augmentation:          "zPLR"
>   Code alignment factor: 1
>   Data alignment factor: -8
>   Return address column: 16
>   Augmentation data:     03 00 00 00 00 03 03
> instead (so 3 instead of 1, and leb128 RA instead of byte;
> 1 vs. 3 can be handled by mask and checking if top byte of the
> RA isn't set could work too).  One should trie what one gets
> with -fpic/-fpie/-fno-pie on various arches though.
> E.g. doesn't aarch64 use augmentations with B in it?

I haven't seen those, and I also can't find the GCC code to emit it.  I
think it's now handled via aarch64_frob_update_context, not an
augmentation flag.  Perhaps we should remove the 'B' cases.
  

Patch

diff --git a/libgcc/unwind-dw2-fde.c b/libgcc/unwind-dw2-fde.c
index 3c0cc654ec0..21eee77882e 100644
--- a/libgcc/unwind-dw2-fde.c
+++ b/libgcc/unwind-dw2-fde.c
@@ -333,8 +333,10 @@  base_from_object (unsigned char encoding, const struct object *ob)
 /* Return the FDE pointer encoding from the CIE.  */
 /* ??? This is a subset of extract_cie_info from unwind-dw2.c.  */
 
-static int
-get_cie_encoding (const struct dwarf_cie *cie)
+/* Disable inlining because the function is only used as a slow path in
+   get_cie_encoding below.  */
+static int __attribute__ ((noinline))
+get_cie_encoding_slow (const struct dwarf_cie *cie)
 {
   const unsigned char *aug, *p;
   _Unwind_Ptr dummy;
@@ -389,6 +391,62 @@  get_cie_encoding (const struct dwarf_cie *cie)
     }
 }
 
+static inline int
+get_cie_encoding (const struct dwarf_cie *cie)
+{
+  /* Fast path for some augmentations and single-byte variable-length
+     integers.  Do this only for targets that align struct dwarf_cie to 8
+     bytes, which ensures that at least 8 bytes are available starting at
+     cie->version.  */
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ \
+  || __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+  if (__alignof (*cie) == 8 && sizeof (unsigned long long) == 8)
+    {
+      unsigned long long value;
+      memcpy (&value, &cie->version, sizeof (value));
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define C(x) __builtin_bswap64 (x)
+#else
+#define C(x) x
+#endif
+
+      /* Fast path for "zR".  Check for version 1, the "zR" string and that
+	 the sleb128/uleb128 values are single bytes.  In the comments
+	 below, '1', 'c', 'd', 'r', 'l' are version, code alignment, data
+	 alignment, return address column, augmentation length.  Note that
+	 with CIE version 1, the return address column is byte-encoded.  */
+      unsigned long long expected =
+	/*   1 z R 0 c d r l.  */
+	C (0x017a520000000000ULL);
+      unsigned long long mask =
+	/*   1 z R 0 c d r l.  */
+	C (0xffffffff80800080ULL);
+
+      if ((value & mask) == expected)
+	return cie->augmentation[7];
+
+      /* Fast path for "zPLR".  */
+      expected =
+	/*   1 z P L R 0 c d.  */
+	C (0x017a504c52000000ULL);
+      mask =
+	/*   1 z P L R 0 c d.  */
+	C (0xffffffffffff8080ULL);
+#undef C
+
+      /* Validate the augmentation length, and return the enconding after
+	 it.  No check for the return address column because it is
+	 byte-encoded with CIE version 1.  */
+      if (__builtin_expect ((value & mask) == expected
+			    && (cie->augmentation[8] & 0x80) == 0, 1))
+	  return cie->augmentation[9];
+    }
+#endif
+
+  return get_cie_encoding_slow (cie);
+}
+
 static inline int
 get_fde_encoding (const struct dwarf_fde *f)
 {