From patchwork Tue Nov 22 08:00:51 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Thomas Neumann <thomas.neumann@in.tum.de>
X-Patchwork-Id: 24179
Return-Path: <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:adf:f944:0:0:0:0:0 with SMTP id q4csp2065780wrr;
        Tue, 22 Nov 2022 00:02:00 -0800 (PST)
X-Google-Smtp-Source: 
 AA0mqf7oqzD+54OTVi7+nJYznW9GIb+MeBjtu8UzWDeTJRXnPK+huqSA/tGgR5/2gYPnTm4RLVRw
X-Received: by 2002:a17:907:7e86:b0:7af:bc9:5e8d with SMTP id
 qb6-20020a1709077e8600b007af0bc95e8dmr19160436ejc.3.1669104120667;
        Tue, 22 Nov 2022 00:02:00 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1669104120; cv=none;
        d=google.com; s=arc-20160816;
        b=MSokbza9X/j9ETlVMeL3JVU3W3gSs+cR41wALk4hwrx9llaEw7Vn4vk5UqynWAXebk
         LnGnG7e0e5+DTj+hiiRJSi8560sRzXP+JD23VD+vQAghE8d9KoeHOrXiV5vXXoMUeqzB
         HDvQCkxwbS19uF6CYlAxbC8g0b5JV5IthXP4uR3wq0sL93zthOmgOcTH6/q7HRupsZj/
         kj0j3Fh4jr62zYJ5uORHI41uJzCHxWReywaaa0LUNOXbj5V3LCRsnLI6LdW5oa58sdTs
         WQGkNAAN29ADh3zDER6bGQZiAE2WMJnz3gO74OeqywQliGRtp5AMD5YxQ7TCr8bSraPE
         NBww==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:from:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence
         :content-transfer-encoding:in-reply-to:content-language:references
         :cc:to:subject:user-agent:mime-version:date:message-id:dmarc-filter
         :delivered-to:dkim-signature:dkim-filter;
        bh=ouygyKIQ5fFTqdGtLxBsAjAfM44bRTTWCXSd/NH10qs=;
        b=03wvnPXnPr4xGzvJI9n1GsNJ5duIJssOx4oZmk8cxB30lWXCHbu5LApZSRsWhcDx7b
         i9SNKEL0syMHbAEd9phtkKMQb1ts4aUOnB18nzf+uRtfPc8Pk3zaWkxgj2izTl7mlkqO
         0P2qyMXKcCi6Zc+xDEXVu8NzervwSD3BTg6ZfTq1OEDUaJxXl8JdB9HIpwwGoek7/TJ6
         aGnOHTVCRV4qZyb/JcytPH7YM40HVbqjh80//ye1NR7LOHuAvK1v3k/KMGL+sr+BlVTb
         cNDvECPAeKXg5KOcCr7Q6e2Ph4Avranlyw+ERdOXGBkcr3I2McNvmf7LB899PHeP4Ksw
         WdqQ==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=pUTg5IKy;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from sourceware.org (server2.sourceware.org. [8.43.85.97])
        by mx.google.com with ESMTPS id
 s11-20020a170906a18b00b0078cdba56108si9815969ejy.296.2022.11.22.00.02.00
        for <ouuuleilei@gmail.com>
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 22 Nov 2022 00:02:00 -0800 (PST)
Received-SPF: pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender) client-ip=8.43.85.97;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@gcc.gnu.org header.s=default header.b=pUTg5IKy;
       spf=pass (google.com: domain of
 gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 8.43.85.97 as
 permitted sender)
 smtp.mailfrom="gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org";
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gnu.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 8BB5E3858D20
	for <ouuuleilei@gmail.com>; Tue, 22 Nov 2022 08:01:59 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8BB5E3858D20
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1669104119;
	bh=ouygyKIQ5fFTqdGtLxBsAjAfM44bRTTWCXSd/NH10qs=;
	h=Date:Subject:To:Cc:References:In-Reply-To:List-Id:
	 List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe:
	 From:Reply-To:From;
	b=pUTg5IKysuUK/IlSOtb/3+CPGNclXkCO9+eAd/NgP05c/ybCs3Yh32fPLUjk8lhg8
	 8jI0SY5Fvp0t6FNez+PBcRDG8z/ZeyKlol+fIcMrI9KpAPrtA5W2CD+LNs9oWrOcvw
	 wljDCmPHcrsKT/p1v7SxBp8PaDnS+pxUHkFzOfaY=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mailout1.rbg.tum.de (mailout1.rbg.tum.de [IPv6:2a09:80c0::201])
 by sourceware.org (Postfix) with ESMTPS id 67EFB3858D20
 for <gcc-patches@gcc.gnu.org>; Tue, 22 Nov 2022 08:00:57 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 67EFB3858D20
Received: from mailrelay1.rbg.tum.de (mailrelay1.in.tum.de [131.159.254.14])
 by mailout1.rbg.tum.de (Postfix) with ESMTPS id 1202A4D;
 Tue, 22 Nov 2022 09:00:54 +0100 (CET)
Received: by mailrelay1.rbg.tum.de (Postfix, from userid 112)
 id 0D55BDD; Tue, 22 Nov 2022 09:00:54 +0100 (CET)
Received: from mailrelay1.rbg.tum.de (localhost [127.0.0.1])
 by mailrelay1.rbg.tum.de (Postfix) with ESMTP id A081385;
 Tue, 22 Nov 2022 09:00:53 +0100 (CET)
Received: from mail.in.tum.de (vmrbg426.in.tum.de [131.159.0.73])
 by mailrelay1.rbg.tum.de (Postfix) with ESMTPS id 9A68322;
 Tue, 22 Nov 2022 09:00:53 +0100 (CET)
Received: by mail.in.tum.de (Postfix, from userid 112)
 id 9402C4A01A8; Tue, 22 Nov 2022 09:00:53 +0100 (CET)
Received: (Authenticated sender: neumann)
 by mail.in.tum.de (Postfix) with ESMTPSA id 365974A031D;
 Tue, 22 Nov 2022 09:00:52 +0100 (CET)
 (Extended-Queue-bit xtech_rk@fff.in.tum.de)
Message-ID: <80659153-a4ea-8f66-c317-a8a750f34a01@in.tum.de>
Date: Tue, 22 Nov 2022 09:00:51 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.4.2
Subject: [PATCH] speed up end_fde_sort using radix sort
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Cc: Tamar Christina <Tamar.Christina@arm.com>,
 Jason Merrill <jason@redhat.com>, Florian Weimer <fweimer@redhat.com>,
 Jonathan Wakely <jwakely.gcc@gmail.com>, "H.J. Lu" <hjl.tools@gmail.com>,
 Jakub Jelinek <jakub@redhat.com>
References: <2a4776b9-9271-bb3c-a626-d5ec22dae6f3@in.tum.de>
 <VI1PR08MB532513A4AF91751CAA34A2D7FF0A9@VI1PR08MB5325.eurprd08.prod.outlook.com>
 <b4b30340-0854-e2d2-03ce-a76c5052e335@in.tum.de> <Y3tljCN9GmZXiApb@tucnak>
 <CAMe9rOoAwZkJjPP23DPsP0P3JcZi_SxwFcNOuB1uxxj2rmaqtg@mail.gmail.com>
Content-Language: en-US
In-Reply-To: 
 <CAMe9rOoAwZkJjPP23DPsP0P3JcZi_SxwFcNOuB1uxxj2rmaqtg@mail.gmail.com>
X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, GIT_PATCH_0, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: Thomas Neumann via Gcc-patches
 <gcc-patches@gcc.gnu.org>
From: Thomas Neumann <thomas.neumann@in.tum.de>
Reply-To: Thomas Neumann <thomas.neumann@in.tum.de>
Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org>
X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
X-GMAIL-THRID: =?utf-8?q?1750182522348316897?=
X-GMAIL-MSGID: =?utf-8?q?1750182522348316897?=

When registering a dynamic unwinding frame the fde list is sorted.
Previously, we split the list into a sorted and an unsorted part,
sorted the later using heap sort, and merged both. That can be
quite slow due to the large number of (expensive) comparisons.

This patch replaces that logic with a radix sort instead. The
radix sort uses the same amount of memory as the old logic,
using the second list as auxiliary space, and it includes two
techniques to speed up sorting: First, it computes the pointer
addresses for blocks of values, reducing the decoding overhead.
And it recognizes when the data has reached a sorted state,
allowing for early termination. When running out of memory
we fall back to pure heap sort, as before.

For this test program

#include <cstdio>
int main(int argc, char** argv) {
      return 0;
}

compiled with g++ -O -o hello -static hello.c we get with
perf stat -r 200 on a 5950X the following performance numbers:

old logic:

               0,20 msec task-clock
            930.834      cycles
          3.079.765      instructions
         0,00030478 +- 0,00000237 seconds time elapsed

new logic:

               0,10 msec task-clock
            473.269      cycles
          1.239.077      instructions
         0,00021119 +- 0,00000168 seconds time elapsed

libgcc/ChangeLog:
         * unwind-dw2-fde.c: Use radix sort instead of split+sort+merge.
---
  libgcc/unwind-dw2-fde.c | 234 +++++++++++++++++++++++-----------------
  1 file changed, 134 insertions(+), 100 deletions(-)

diff --git a/libgcc/unwind-dw2-fde.c b/libgcc/unwind-dw2-fde.c
index 3c0cc654ec0..b81540c41a4 100644
--- a/libgcc/unwind-dw2-fde.c
+++ b/libgcc/unwind-dw2-fde.c
@@ -456,22 +456,52 @@ fde_mixed_encoding_compare (struct object *ob, const fde *x, const fde *y)
  
  typedef int (*fde_compare_t) (struct object *, const fde *, const fde *);
  
+// The extractor functions compute the pointer values for a block of
+// fdes. The block processing hides the call overhead.
  
-/* This is a special mix of insertion sort and heap sort, optimized for
-   the data sets that actually occur. They look like
-   101 102 103 127 128 105 108 110 190 111 115 119 125 160 126 129 130.
-   I.e. a linearly increasing sequence (coming from functions in the text
-   section), with additionally a few unordered elements (coming from functions
-   in gnu_linkonce sections) whose values are higher than the values in the
-   surrounding linear sequence (but not necessarily higher than the values
-   at the end of the linear sequence!).
-   The worst-case total run time is O(N) + O(n log (n)), where N is the
-   total number of FDEs and n is the number of erratic ones.  */
+static void
+fde_unencoded_extract (struct object *ob __attribute__ ((unused)),
+		       _Unwind_Ptr *target, const fde **x, int count)
+{
+  for (int index = 0; index < count; ++index)
+    memcpy (target + index, x[index]->pc_begin, sizeof (_Unwind_Ptr));
+}
+
+static void
+fde_single_encoding_extract (struct object *ob, _Unwind_Ptr *target,
+			     const fde **x, int count)
+{
+  _Unwind_Ptr base;
+
+  base = base_from_object (ob->s.b.encoding, ob);
+  for (int index = 0; index < count; ++index)
+    read_encoded_value_with_base (ob->s.b.encoding, base, x[index]->pc_begin,
+				  target + index);
+}
+
+static void
+fde_mixed_encoding_extract (struct object *ob, _Unwind_Ptr *target,
+			    const fde **x, int count)
+{
+  for (int index = 0; index < count; ++index)
+    {
+      int encoding = get_fde_encoding (x[index]);
+      read_encoded_value_with_base (encoding, base_from_object (encoding, ob),
+				    x[index]->pc_begin, target + index);
+    }
+}
+
+typedef void (*fde_extractor_t) (struct object *, _Unwind_Ptr *, const fde **,
+				 int);
+
+// Data is is sorted using radix sort if possible, using an temporary
+// auxiliary data structure of the same size as the input. When running
+// out of memory to in-place heap sort.
  
  struct fde_accumulator
  {
    struct fde_vector *linear;
-  struct fde_vector *erratic;
+  struct fde_vector *aux;
  };
  
  static inline int
@@ -485,8 +515,8 @@ start_fde_sort (struct fde_accumulator *accu, size_t count)
    if ((accu->linear = malloc (size)))
      {
        accu->linear->count = 0;
-      if ((accu->erratic = malloc (size)))
-	accu->erratic->count = 0;
+      if ((accu->aux = malloc (size)))
+	accu->aux->count = 0;
        return 1;
      }
    else
@@ -500,59 +530,6 @@ fde_insert (struct fde_accumulator *accu, const fde *this_fde)
      accu->linear->array[accu->linear->count++] = this_fde;
  }
  
-/* Split LINEAR into a linear sequence with low values and an erratic
-   sequence with high values, put the linear one (of longest possible
-   length) into LINEAR and the erratic one into ERRATIC. This is O(N).
-
-   Because the longest linear sequence we are trying to locate within the
-   incoming LINEAR array can be interspersed with (high valued) erratic
-   entries.  We construct a chain indicating the sequenced entries.
-   To avoid having to allocate this chain, we overlay it onto the space of
-   the ERRATIC array during construction.  A final pass iterates over the
-   chain to determine what should be placed in the ERRATIC array, and
-   what is the linear sequence.  This overlay is safe from aliasing.  */
-
-static inline void
-fde_split (struct object *ob, fde_compare_t fde_compare,
-	   struct fde_vector *linear, struct fde_vector *erratic)
-{
-  static const fde *marker;
-  size_t count = linear->count;
-  const fde *const *chain_end = &marker;
-  size_t i, j, k;
-
-  /* This should optimize out, but it is wise to make sure this assumption
-     is correct. Should these have different sizes, we cannot cast between
-     them and the overlaying onto ERRATIC will not work.  */
-  gcc_assert (sizeof (const fde *) == sizeof (const fde **));
-
-  for (i = 0; i < count; i++)
-    {
-      const fde *const *probe;
-
-      for (probe = chain_end;
-	   probe != &marker && fde_compare (ob, linear->array[i], *probe) < 0;
-	   probe = chain_end)
-	{
-	  chain_end = (const fde *const*) erratic->array[probe - linear->array];
-	  erratic->array[probe - linear->array] = NULL;
-	}
-      erratic->array[i] = (const fde *) chain_end;
-      chain_end = &linear->array[i];
-    }
-
-  /* Each entry in LINEAR which is part of the linear sequence we have
-     discovered will correspond to a non-NULL entry in the chain we built in
-     the ERRATIC array.  */
-  for (i = j = k = 0; i < count; i++)
-    if (erratic->array[i])
-      linear->array[j++] = linear->array[i];
-    else
-      erratic->array[k++] = linear->array[i];
-  linear->count = j;
-  erratic->count = k;
-}
-
  #define SWAP(x,y) do { const fde * tmp = x; x = y; y = tmp; } while (0)
  
  /* Convert a semi-heap to a heap.  A semi-heap is a heap except possibly
@@ -615,59 +592,116 @@ frame_heapsort (struct object *ob, fde_compare_t fde_compare,
  #undef SWAP
  }
  
-/* Merge V1 and V2, both sorted, and put the result into V1.  */
+// Radix sort data in V1 using V2 as aux memory. Runtime O(n).
  static inline void
-fde_merge (struct object *ob, fde_compare_t fde_compare,
-	   struct fde_vector *v1, struct fde_vector *v2)
+fde_radixsort (struct object *ob, fde_extractor_t fde_extractor,
+	       struct fde_vector *v1, struct fde_vector *v2)
  {
-  size_t i1, i2;
-  const fde * fde2;
-
-  i2 = v2->count;
-  if (i2 > 0)
+#define FANOUTBITS 8
+#define FANOUT (1 << FANOUTBITS)
+#define BLOCKSIZE 128
+  const unsigned rounds
+    = (__CHAR_BIT__ * sizeof (_Unwind_Ptr) + FANOUTBITS - 1) / FANOUTBITS;
+  const fde **a1 = v1->array, **a2 = v2->array;
+  _Unwind_Ptr ptrs[BLOCKSIZE + 1];
+  unsigned n = v1->count;
+  for (unsigned round = 0; round != rounds; ++round)
      {
-      i1 = v1->count;
-      do
+      unsigned counts[FANOUT] = {0};
+      unsigned violations = 0;
+
+      // Count the number of elements per bucket and check if we are already
+      // sorted.
+      _Unwind_Ptr last = 0;
+      for (unsigned i = 0; i < n;)
  	{
-	  i2--;
-	  fde2 = v2->array[i2];
-	  while (i1 > 0 && fde_compare (ob, v1->array[i1-1], fde2) > 0)
+	  unsigned chunk = ((n - i) <= BLOCKSIZE) ? (n - i) : BLOCKSIZE;
+	  fde_extractor (ob, ptrs + 1, a1 + i, chunk);
+	  ptrs[0] = last;
+	  for (unsigned j = 0; j < chunk; ++j)
  	    {
-	      v1->array[i1+i2] = v1->array[i1-1];
-	      i1--;
+	      unsigned b = (ptrs[j + 1] >> (round * FANOUTBITS)) & (FANOUT - 1);
+	      counts[b]++;
+	      // Use summation instead of an if to eliminate branches.
+	      violations += ptrs[j + 1] < ptrs[j];
  	    }
-	  v1->array[i1+i2] = fde2;
+	  i += chunk;
+	  last = ptrs[chunk];
  	}
-      while (i2 > 0);
-      v1->count += v2->count;
+
+      // Stop if we are already sorted.
+      if (!violations)
+	{
+	  // The sorted data is in a1 now.
+	  a2 = a1;
+	  break;
+	}
+
+      // Compute the prefix sum.
+      unsigned sum = 0;
+      for (unsigned i = 0; i != FANOUT; ++i)
+	{
+	  unsigned s = sum;
+	  sum += counts[i];
+	  counts[i] = s;
+	}
+
+      // Place all elements.
+      for (unsigned i = 0; i < n;)
+	{
+	  unsigned chunk = ((n - i) <= BLOCKSIZE) ? (n - i) : BLOCKSIZE;
+	  fde_extractor (ob, ptrs, a1 + i, chunk);
+	  for (unsigned j = 0; j < chunk; ++j)
+	    {
+	      unsigned b = (ptrs[j] >> (round * FANOUTBITS)) & (FANOUT - 1);
+	      a2[counts[b]++] = a1[i + j];
+	    }
+	  i += chunk;
+	}
+
+      // Swap a1 and a2.
+      const fde **tmp = a1;
+      a1 = a2;
+      a2 = tmp;
      }
+#undef BLOCKSIZE
+#undef FANOUT
+#undef FANOUTBITS
+
+  // The data is in a2 now, move in place if needed.
+  if (a2 != v1->array)
+    memcpy (v1->array, a2, sizeof (const fde *) * n);
  }
  
  static inline void
  end_fde_sort (struct object *ob, struct fde_accumulator *accu, size_t count)
  {
-  fde_compare_t fde_compare;
-
    gcc_assert (!accu->linear || accu->linear->count == count);
  
-  if (ob->s.b.mixed_encoding)
-    fde_compare = fde_mixed_encoding_compare;
-  else if (ob->s.b.encoding == DW_EH_PE_absptr)
-    fde_compare = fde_unencoded_compare;
-  else
-    fde_compare = fde_single_encoding_compare;
-
-  if (accu->erratic)
+  if (accu->aux)
      {
-      fde_split (ob, fde_compare, accu->linear, accu->erratic);
-      gcc_assert (accu->linear->count + accu->erratic->count == count);
-      frame_heapsort (ob, fde_compare, accu->erratic);
-      fde_merge (ob, fde_compare, accu->linear, accu->erratic);
-      free (accu->erratic);
+      fde_extractor_t fde_extractor;
+      if (ob->s.b.mixed_encoding)
+	fde_extractor = fde_mixed_encoding_extract;
+      else if (ob->s.b.encoding == DW_EH_PE_absptr)
+	fde_extractor = fde_unencoded_extract;
+      else
+	fde_extractor = fde_single_encoding_extract;
+
+      fde_radixsort (ob, fde_extractor, accu->linear, accu->aux);
+      free (accu->aux);
      }
    else
      {
-      /* We've not managed to malloc an erratic array,
+      fde_compare_t fde_compare;
+      if (ob->s.b.mixed_encoding)
+	fde_compare = fde_mixed_encoding_compare;
+      else if (ob->s.b.encoding == DW_EH_PE_absptr)
+	fde_compare = fde_unencoded_compare;
+      else
+	fde_compare = fde_single_encoding_compare;
+
+      /* We've not managed to malloc an aux array,
  	 so heap sort in the linear one.  */
        frame_heapsort (ob, fde_compare, accu->linear);
      }