[RFC,00/10] perf: user space sframe unwinding

Message ID cover.1699487758.git.jpoimboe@kernel.org

Josh Poimboeuf Nov. 9, 2023, 12:41 a.m. UTC
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space, the GNU assembler
can emit the SFrame ("Simple Frame") format starting with binutils
2.40.

These patches add support for unwinding user space from the kernel using
SFrame with perf.  It should be easy to add user unwinding support for
other components like ftrace.

I tested it on Gentoo by recompiling everything with -Wa,-gsframe and
using a custom glibc patch (which I'll send in a reply to this email).

The unwinding itself seems to work well, though I still have a major
problem: how do I tell the perf tool to stitch the separate kernel and
user callchains together into a single event?

Right now I have a hack which somehow causes the perf tool to overwrite
the kernel callchain with the user one.  I'm perf-clueless, so any ideas
or patches for a clean way to implement that would be most helpful.


Otherwise there were two main challenges:

1) Finding .sframe sections in shared/dlopened libraries

   The kernel has no visibility into the contents of shared libraries.
   This was solved by adding a PR_ADD_SFRAME option to prctl() which
   allows the runtime linker to manually provide the in-memory address
   of an .sframe section to the kernel.

2) Dealing with page faults

   Keeping all binaries' sframe data pinned would likely waste a lot of
   memory.  Instead, read it from user space on demand.  That can't be
   done from perf NMI context due to page faults, so defer the unwind to
   the next user exit.  Since the NMI handler doesn't do exit work,
   self-IPI and then schedule task work to be run on exit from the IPI.
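
The deferral in 2) can be pictured with a small user space toy model.
This is only a sketch of the control flow, not the kernel code in this
series: struct task, sample_from_nmi() and exit_to_user() are
hypothetical names, the "user stack" is a plain array standing in for
pageable user memory, and the self-IPI step is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAMES 8

/* Hypothetical stand-in for per-task state; not a kernel structure. */
struct task {
	bool unwind_pending;                  /* set from "NMI" context */
	unsigned long user_stack[MAX_FRAMES]; /* models pageable user memory */
	size_t user_depth;
	unsigned long callchain[MAX_FRAMES];  /* filled in at exit time */
	size_t nr;
};

/* "NMI" side: must not fault, so it only records that an unwind is wanted. */
void sample_from_nmi(struct task *t)
{
	t->unwind_pending = true;
}

/*
 * Exit-to-user side: faults would be allowed here, so the real walk
 * happens now.  Returns the number of frames recorded, or 0 if no
 * unwind was requested.
 */
size_t exit_to_user(struct task *t)
{
	if (!t->unwind_pending)
		return 0;

	t->unwind_pending = false;
	t->nr = 0;
	for (size_t i = 0; i < t->user_depth && i < MAX_FRAMES; i++)
		t->callchain[t->nr++] = t->user_stack[i];

	return t->nr;
}
```

In this model several samples taken before the next exit collapse into
one unwind, because the sample side only sets a flag.  Whether the real
implementation behaves that way is not something this sketch claims.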


Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design.  And to Steven for letting me do it ;-)


TODO:
- Stitch kernel+user events together in perf tool (help needed)
- Add arm64 support
- Add VDSO .sframe support
- Allow specifying FP vs sframe from perf tool?  Right now it's
  auto-detected; maybe that's enough
- Port ftrace and others to use sframe
- Support sframe v2
- Determine the impact of missing DRAP support (aligned stacks which
  SFrame doesn't currently support)
- Add debugging hooks



Josh Poimboeuf (10):
  perf: Remove get_perf_callchain() 'init_nr' argument
  perf: Remove get_perf_callchain() 'crosstask' argument
  perf: Simplify get_perf_callchain() user logic
  perf: Introduce deferred user callchains
  perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
  unwind: Introduce generic user space unwinding interfaces
  unwind/x86: Add HAVE_USER_UNWIND
  perf/x86: Use user_unwind interface
  unwind: Introduce SFrame user space unwinding
  unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME

 arch/Kconfig                       |   9 +
 arch/x86/Kconfig                   |   3 +
 arch/x86/events/core.c             |  65 ++---
 arch/x86/include/asm/mmu.h         |   2 +-
 arch/x86/include/asm/user_unwind.h |  11 +
 fs/binfmt_elf.c                    |  46 +++-
 include/linux/mm_types.h           |   3 +
 include/linux/perf_event.h         |  24 +-
 include/linux/sframe.h             |  46 ++++
 include/linux/user_unwind.h        |  33 +++
 include/uapi/linux/elf.h           |   1 +
 include/uapi/linux/perf_event.h    |   1 +
 include/uapi/linux/prctl.h         |   3 +
 kernel/Makefile                    |   1 +
 kernel/bpf/stackmap.c              |   6 +-
 kernel/events/callchain.c          |  39 ++-
 kernel/events/core.c               |  96 ++++++-
 kernel/fork.c                      |  10 +
 kernel/sys.c                       |  11 +
 kernel/unwind/Makefile             |   2 +
 kernel/unwind/sframe.c             | 414 +++++++++++++++++++++++++++++
 kernel/unwind/sframe.h             | 217 +++++++++++++++
 kernel/unwind/user.c               |  86 ++++++
 mm/init-mm.c                       |   2 +
 24 files changed, 1060 insertions(+), 71 deletions(-)
 create mode 100644 arch/x86/include/asm/user_unwind.h
 create mode 100644 include/linux/sframe.h
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h
 create mode 100644 kernel/unwind/user.c

Josh Poimboeuf Nov. 9, 2023, 12:45 a.m. UTC
On Wed, Nov 08, 2023 at 04:41:05PM -0800, Josh Poimboeuf wrote:
> Some distros have started compiling frame pointers into all their
> packages to enable the kernel to do system-wide profiling of user space.
> Unfortunately that creates a runtime performance penalty across the
> entire system.  Using DWARF (or .eh_frame) instead isn't feasible
> because of complexity and slowness.
> 
> For in-kernel unwinding we solved this problem with the creation of the
> ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
> has created the SFrame ("Simple Frame") format starting with binutils
> 2.40.
> 
> These patches add support for unwinding user space from the kernel using
> SFrame with perf.  It should be easy to add user unwinding support for
> other components like ftrace.
> 
> I tested it on Gentoo by recompiling everything with -Wa,-gsframe and
> using a custom glibc patch (which I'll send in a reply to this email).

Here's my glibc patch:

diff --git a/elf/dl-load.c b/elf/dl-load.c
index 2923b1141d..333d7c39fd 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -29,6 +29,7 @@
 #include <bits/wordsize.h>
 #include <sys/mman.h>
 #include <sys/param.h>
+#include <sys/prctl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <gnu/lib-names.h>
@@ -88,6 +89,10 @@ struct filebuf
 
 #define STRING(x) __STRING (x)
 
+#ifndef PT_GNU_SFRAME
+#define PT_GNU_SFRAME 0x6474e554
+#endif
+
 
 int __stack_prot attribute_hidden attribute_relro
 #if _STACK_GROWS_DOWN && defined PROT_GROWSDOWN
@@ -1213,6 +1218,10 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	  l->l_relro_addr = ph->p_vaddr;
 	  l->l_relro_size = ph->p_memsz;
 	  break;
+
+	case PT_GNU_SFRAME:
+	  l->l_sframe_addr = ph->p_vaddr;
+	  break;
 	}
 
     if (__glibc_unlikely (nloadcmds == 0))
@@ -1263,6 +1272,8 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	l->l_map_start = l->l_map_end = 0;
 	goto lose;
       }
+
+
   }
 
   if (l->l_ld != 0)
@@ -1376,6 +1387,13 @@ cannot enable executable stack as shared object requires");
 	break;
       }
 
+#define PR_ADD_SFRAME 71
+  if (l->l_sframe_addr != 0)
+  {
+    l->l_sframe_addr += l->l_addr;
+    __prctl(PR_ADD_SFRAME, l->l_sframe_addr, NULL, NULL, NULL);
+  }
+
   /* We are done mapping in the file.  We no longer need the descriptor.  */
   if (__glibc_unlikely (__close_nocancel (fd) != 0))
     {
diff --git a/include/link.h b/include/link.h
index c6af095d87..36ac75680f 100644
--- a/include/link.h
+++ b/include/link.h
@@ -348,6 +348,8 @@ struct link_map
     ElfW(Addr) l_relro_addr;
     size_t l_relro_size;
 
+    ElfW(Addr) l_sframe_addr;
+
     unsigned long long int l_serial;
   };
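
For anyone who wants to poke at the address computation outside of
glibc, here is a minimal standalone sketch of what the dl-load.c hunks
above compute: find the PT_GNU_SFRAME program header and add the load
bias to its p_vaddr.  The function name sframe_runtime_addr() is made up
for illustration, and the sketch assumes 64-bit ELF (Elf64_Phdr).  The
PT_GNU_SFRAME value is the one from the patch.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

#ifndef PT_GNU_SFRAME
#define PT_GNU_SFRAME 0x6474e554 /* same value as in the patch above */
#endif

/*
 * Hypothetical helper: return the in-memory address of the SFrame
 * segment for an object loaded at 'load_bias', or 0 if the object has
 * no PT_GNU_SFRAME program header.
 */
uintptr_t sframe_runtime_addr(uintptr_t load_bias,
			      const Elf64_Phdr *phdr, size_t phnum)
{
	for (size_t i = 0; i < phnum; i++) {
		if (phdr[i].p_type == PT_GNU_SFRAME)
			return load_bias + (uintptr_t)phdr[i].p_vaddr;
	}

	return 0;
}
```

In the patch the resulting address is what gets handed to
prctl(PR_ADD_SFRAME, ...).  The sketch deliberately stops short of that
call, since the option only exists on a kernel with this series applied.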