kbuild: Add inline-account tool
Commit Message
A common cause of binary code bloat is excessive inlining. Traditional
tools (like nm --size-sort -t d) don't address that directly because
they only see the final functions, but don't know about inlines.
This patch adds inline-account that makes it easy to track that down
by accounting code bytes to all functions visible in the debug information,
as well as code lines.
Here are some examples:
Show all inlines that increase code size by >1K in the core scheduler:
$ inline-account.py --min-bytes=1000 kernel/sched/core.o
Total code bytes seen 75690
Code bytes by functions:
Function Total Avg Num
rq_pin_lock 1401 (0.02%) 35 39
__sched_setscheduler 1277 (0.02%) 41 31
perf_fetch_caller_regs 1012 (0.01%) 17 58
Code bytes by nearby source line blocks:
prefix /home/ak/lsrc/git/linux/
Line Total
kernel/sched/sched.h:1610 1387 (0.02%)
include/trace/events/sched.h:16 1172 (0.02%)
include/trace/events/sched.h:222 1058 (0.01%)
This indicates that rq_pin_lock should likely be not inline,
and perhaps perf_fetch_caller_regs not either.
Note that not all large inlines are necessary bloat. If there is only
a single call site it isn't bloat (the tool currently cannot distinguish
that case). For example it is commonly seen with syscall definitions
that use single large inlines with only a single caller. In the example
above I think it's the case with __sched_setscheduler.
Show the >1K inlines in lib/maple_tree.o, which for some reason
comes in at a incredible 73k of code size:
$ inline-account.py --min-bytes 1000 lib/maple_tree.o
Total code bytes seen 73578
Code bytes by functions:
Function Total Avg Num
mas_mab_cp 5537 (0.08%) 37 149
mas_pop_node 3798 (0.05%) 28 131
ma_slots 2368 (0.03%) 14 162
ma_pivots 2353 (0.03%) 10 222
mas_destroy_rebalance 2056 (0.03%) 42 48
mas_start 1661 (0.02%) 13 125
mas_set_parent 1454 (0.02%) 20 72
mas_set_alloc_req 1410 (0.02%) 17 80
mte_node_type 1360 (0.02%) 5 228
mas_data_end 1189 (0.02%) 16 74
mte_to_node 1085 (0.01%) 3 276
mas_split 1053 (0.01%) 65 16
mas_topiary_replace 1033 (0.01%) 38 27
mas_root_expand 1001 (0.01%) 35 28
Code bytes by nearby source line blocks:
prefix /home/ak/lsrc/git/linux/
Line Total
lib/maple_tree.c:210 1360 (0.02%)
include/trace/events/maple_tree.h:80 1283 (0.02%)
lib/maple_tree.c:649 1193 (0.02%)
lib/maple_tree.c:288 1097 (0.01%)
It's clear there is a lot of potential for shrinking here, as a quick
experiment shows:
$ size lib/maple_tree.o
text data bss dec hex filename
72257 5312 8 77577 12f09 lib/maple_tree.o
$ sed -i -e s/__always_inline// -e 's/ inline/ /' lib/maple_tree.c
$ make -s lib/maple_tree.o
$ size lib/maple_tree.o
text data bss dec hex filename
47774 4720 8 52502 cd16 lib/maple_tree.o
34% reduction just from trusting the compiler. Most of it seems
to come from abuse of __always_inline. I suppose a large scale
tree purge of that would give some decent binary size results.
$ inline-account.py --show=5 kernel/workqueue.o
Total code bytes seen 40403
Code bytes by functions:
Function Total Avg Num
bitmap_copy 1477 (0.04%) 26 56
show_pwq 912 (0.02%) 76 12
workqueue_init_early 846 (0.02%) 29 29
__flush_workqueue 753 (0.02%) 31 24
alloc_and_link_pwqs 558 (0.01%) 69 8
Code bytes by nearby source line blocks:
prefix /home/ak/lsrc/git/linux/
Line Total
include/linux/bitmap.h:268 1336 (0.03%)
include/trace/events/workqueue.h:23 1038 (0.03%)
include/trace/events/workqueue.h:108 732 (0.02%)
include/trace/events/workqueue.h:59 694 (0.02%)
include/trace/events/workqueue.h:82 670 (0.02%)
$
This is an interesting case because bitmap_copy is just
static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
unsigned int nbits)
{
unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
if (small_const_nbits(nbits))
*dst = *src;
else
memcpy(dst, src, len);
}
memcpy (which is a macro) must sometimes generate a lot of
code. The small_const_nbits case definitely should be inlined though
because it's likely even smaller than a call. Would need
more investigation.
The other large inlines are trace points. Perhaps there is something
there that could be done to shrink that a bit.
Finally we can do a global accounting (currently with multiple runs):
(ignore the percentage numbers since they are just for the local file)
$ find -name '*.o' | xargs -n1 inline-account.py > a
$ sort -n -r -k 2 a | head -30
ZSTD_count 81799 (0.19%) 32 2514
ZSTD_count 52233 (0.25%) 33 1544
kmalloc 43324 (0.00%) 12 3334
pv_queued_spin_unlock 42027 (0.00%) 9 4580
constant_test_bit 41667 (0.00%) 5 8005
arch/x86/include/asm/paravirt.h:591 41044 (0.00%)
arch/x86/include/asm/bitops.h:207 40153 (0.00%)
__refcount_add 37968 (0.00%) 24 1532
page_fixed_fake_head 36368 (0.00%) 19 1832
include/linux/slab.h:599 35654 (0.00%)
arch/x86/include/asm/jump_label.h:27 35156 (0.00%)
spin_lock 32170 (0.00%) 10 3007
__refcount_sub_and_test 32068 (0.00%) 17 1842
include/linux/spinlock.h:351 31102 (0.00%)
arch_static_branch 30874 (0.00%) 4 7022
get_current 30714 (0.00%) 9 3351
arch/x86/include/asm/current.h:41 29912 (0.00%)
trace_trigger_soft_disabled 29814 (0.00%) 21 1368
perf_fetch_caller_regs 27504 (0.00%) 16 1634
ZSTD_storeSeq 26060 (0.06%) 30 862
hid_map_usage 25582 (0.00%) 88 288
ZSTD_compressBlock_lazy_generic 24953 (0.12%) 46 535
ZSTD_compressBlock_lazy_generic 24953 (0.06%) 46 535
paravirt_ret0 24152 (0.00%) 24152 1
spin_unlock_irqrestore 23253 (0.00%) 10 2281
include/linux/spinlock.h:406 22526 (0.00%)
ZSTD_RowFindBestMatch 21527 (0.10%) 23 922
ZSTD_RowFindBestMatch 21527 (0.05%) 23 922
__list_add 21209 (0.00%) 11 1851
include/linux/refcount.h:283 20642 (0.00%)
- So my kernel is spending around ~30K just for getting task_structs in
current.
- I'm sure ZSTD is great, but is it >200K in duplicated code worth great?
- page_fixed_fake_head probably shouldn't be inlined
- There might be some potential in out lining reference counts
(although that one might be truly performance critical)
- There's maybe some potential in shrinking trace point bloat?
... and more similar insights.
Also of course there can be critical inlines that really need
to be inline in many sites for best performance. But that's rarely the case
if they are big because it's unlikely the small call overhead is making
a significant difference for a large chunk of code.
In any case the tool is useful, so I think it deserves its place
in scripts/
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
scripts/inline-account.py | 173 ++++++++++++++++++++++++++++++++++++++
1 file changed, 173 insertions(+)
create mode 100755 scripts/inline-account.py
Comments
On Sat, Oct 21, 2023 at 1:12 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> A common cause of binary code bloat is excessive inlining. Traditional
> tools (like nm --size-sort -t d) don't address that directly because
> they only see the final functions, but don't know about inlines.
>
> This patch adds inline-account that makes it easy to track that down
> by accounting code bytes to all functions visible in the debug information,
> as well as code lines.
>
> Here are some examples:
>
> Show all inlines that increase code size by >1K in the core scheduler:
>
> $ inline-account.py --min-bytes=1000 kernel/sched/core.o
How to use this tool?
I did not get any output.
masahiro@zoe:~/workspace/linux-kbuild(a)$ ./scripts/inline-account.py
--min-bytes=1000 kernel/sched/core.o
Total code bytes seen 0
Code bytes by functions:
Function Total Avg Num
Code bytes by nearby source line blocks:
prefix
Line Total
masahiro@zoe:~/workspace/linux-kbuild(a)$ ./scripts/inline-account.py
kernel/sched/core.o
Total code bytes seen 0
Code bytes by functions:
Function Total Avg Num
Code bytes by nearby source line blocks:
prefix
Line Total
masahiro@zoe:~/workspace/linux-kbuild(a)$ ./scripts/inline-account.py
lib/maple_tree.o
Total code bytes seen 0
Code bytes by functions:
Function Total Avg Num
Code bytes by nearby source line blocks:
prefix
Line Total
I do not think this patch is mature enough
regarding the coding style.
Please polish the code.
> How to use this tool?
>
> I did not get any output?
Hmm, perhaps you need CONFIG_DEBUG_INFO=y ?
> I do not think this patch is mature enough
> regarding the coding style.
>
> Please polish the code.
Don't follow. It's standard python code style, and the
other python scripts in scripts/ use the same.
-Andi
On Sat, Oct 21, 2023 at 1:12 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> A common cause of binary code bloat is excessive inlining. Traditional
> tools (like nm --size-sort -t d) don't address that directly because
> they only see the final functions, but don't know about inlines.
>
> This patch adds inline-account that makes it easy to track that down
> by accounting code bytes to all functions visible in the debug information,
> as well as code lines.
>
> Here are some examples:
>
> Show all inlines that increase code size by >1K in the core scheduler:
>
> $ inline-account.py --min-bytes=1000 kernel/sched/core.o
> Total code bytes seen 75690
>
> Code bytes by functions:
> Function Total Avg Num
> rq_pin_lock 1401 (0.02%) 35 39
> __sched_setscheduler 1277 (0.02%) 41 31
> perf_fetch_caller_regs 1012 (0.01%) 17 58
>
> Code bytes by nearby source line blocks:
> prefix /home/ak/lsrc/git/linux/
> Line Total
> kernel/sched/sched.h:1610 1387 (0.02%)
> include/trace/events/sched.h:16 1172 (0.02%)
> include/trace/events/sched.h:222 1058 (0.01%)
>
> This indicates that rq_pin_lock should likely be not inline,
> and perhaps perf_fetch_caller_regs not either.
>
> Note that not all large inlines are necessary bloat. If there is only
> a single call site it isn't bloat (the tool currently cannot distinguish
> that case). For example it is commonly seen with syscall definitions
> that use single large inlines with only a single caller. In the example
> above I think it's the case with __sched_setscheduler.
>
> Show the >1K inlines in lib/maple_tree.o, which for some reason
> comes in at a incredible 73k of code size:
>
> $ inline-account.py --min-bytes 1000 lib/maple_tree.o
> Total code bytes seen 73578
>
> Code bytes by functions:
> Function Total Avg Num
> mas_mab_cp 5537 (0.08%) 37 149
> mas_pop_node 3798 (0.05%) 28 131
> ma_slots 2368 (0.03%) 14 162
> ma_pivots 2353 (0.03%) 10 222
> mas_destroy_rebalance 2056 (0.03%) 42 48
> mas_start 1661 (0.02%) 13 125
> mas_set_parent 1454 (0.02%) 20 72
> mas_set_alloc_req 1410 (0.02%) 17 80
> mte_node_type 1360 (0.02%) 5 228
> mas_data_end 1189 (0.02%) 16 74
> mte_to_node 1085 (0.01%) 3 276
> mas_split 1053 (0.01%) 65 16
> mas_topiary_replace 1033 (0.01%) 38 27
> mas_root_expand 1001 (0.01%) 35 28
>
> Code bytes by nearby source line blocks:
> prefix /home/ak/lsrc/git/linux/
> Line Total
> lib/maple_tree.c:210 1360 (0.02%)
> include/trace/events/maple_tree.h:80 1283 (0.02%)
> lib/maple_tree.c:649 1193 (0.02%)
> lib/maple_tree.c:288 1097 (0.01%)
>
> It's clear there is a lot of potential for shrinking here, as a quick
> experiment shows:
>
> $ size lib/maple_tree.o
> text data bss dec hex filename
> 72257 5312 8 77577 12f09 lib/maple_tree.o
> $ sed -i -e s/__always_inline// -e 's/ inline/ /' lib/maple_tree.c
> $ make -s lib/maple_tree.o
> $ size lib/maple_tree.o
> text data bss dec hex filename
> 47774 4720 8 52502 cd16 lib/maple_tree.o
>
> 34% reduction just from trusting the compiler. Most of it seems
> to come from abuse of __always_inline. I suppose a large scale
> tree purge of that would give some decent binary size results.
>
> $ inline-account.py --show=5 kernel/workqueue.o
> Total code bytes seen 40403
>
> Code bytes by functions:
> Function Total Avg Num
> bitmap_copy 1477 (0.04%) 26 56
> show_pwq 912 (0.02%) 76 12
> workqueue_init_early 846 (0.02%) 29 29
> __flush_workqueue 753 (0.02%) 31 24
> alloc_and_link_pwqs 558 (0.01%) 69 8
>
> Code bytes by nearby source line blocks:
> prefix /home/ak/lsrc/git/linux/
> Line Total
> include/linux/bitmap.h:268 1336 (0.03%)
> include/trace/events/workqueue.h:23 1038 (0.03%)
> include/trace/events/workqueue.h:108 732 (0.02%)
> include/trace/events/workqueue.h:59 694 (0.02%)
> include/trace/events/workqueue.h:82 670 (0.02%)
> $
>
> This is an interesting case because bitmap_copy is just
>
> static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
> unsigned int nbits)
> {
> unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
>
> if (small_const_nbits(nbits))
> *dst = *src;
> else
> memcpy(dst, src, len);
> }
>
> memcpy (which is a macro) must sometimes generate a lot of
> code. The small_const_nbits case definitely should be inlined though
> because it's likely even smaller than a call. Would need
> more investigation.
>
> The other large inlines are trace points. Perhaps there is something
> there that could be done to shrink that a bit.
>
> Finally we can do a global accounting (currently with multiple runs):
>
> (ignore the percentage numbers since they are just for the local file)
>
> $ find -name '*.o' | xargs -n1 inline-account.py > a
> $ sort -n -r -k 2 a | head -30
> ZSTD_count 81799 (0.19%) 32 2514
> ZSTD_count 52233 (0.25%) 33 1544
> kmalloc 43324 (0.00%) 12 3334
> pv_queued_spin_unlock 42027 (0.00%) 9 4580
> constant_test_bit 41667 (0.00%) 5 8005
> arch/x86/include/asm/paravirt.h:591 41044 (0.00%)
> arch/x86/include/asm/bitops.h:207 40153 (0.00%)
> __refcount_add 37968 (0.00%) 24 1532
> page_fixed_fake_head 36368 (0.00%) 19 1832
> include/linux/slab.h:599 35654 (0.00%)
> arch/x86/include/asm/jump_label.h:27 35156 (0.00%)
> spin_lock 32170 (0.00%) 10 3007
> __refcount_sub_and_test 32068 (0.00%) 17 1842
> include/linux/spinlock.h:351 31102 (0.00%)
> arch_static_branch 30874 (0.00%) 4 7022
> get_current 30714 (0.00%) 9 3351
> arch/x86/include/asm/current.h:41 29912 (0.00%)
> trace_trigger_soft_disabled 29814 (0.00%) 21 1368
> perf_fetch_caller_regs 27504 (0.00%) 16 1634
> ZSTD_storeSeq 26060 (0.06%) 30 862
> hid_map_usage 25582 (0.00%) 88 288
> ZSTD_compressBlock_lazy_generic 24953 (0.12%) 46 535
> ZSTD_compressBlock_lazy_generic 24953 (0.06%) 46 535
> paravirt_ret0 24152 (0.00%) 24152 1
> spin_unlock_irqrestore 23253 (0.00%) 10 2281
> include/linux/spinlock.h:406 22526 (0.00%)
> ZSTD_RowFindBestMatch 21527 (0.10%) 23 922
> ZSTD_RowFindBestMatch 21527 (0.05%) 23 922
> __list_add 21209 (0.00%) 11 1851
> include/linux/refcount.h:283 20642 (0.00%)
>
> - So my kernel is spending around ~30K just for getting task_structs in
> current.
> - I'm sure ZSTD is great, but is it >200K in duplicated code worth great?
> - page_fixed_fake_head probably shouldn't be inlined
> - There might be some potential in out lining reference counts
> (although that one might be truly performance critical)
> - There's maybe some potential in shrinking trace point bloat?
>
> ... and more similar insights.
>
> Also of course there can be critical inlines that really need
> to be inline in many sites for best performance. But that's rarely the case
> if they are big because it's unlikely the small call overhead is making
> a significant difference for a large chunk of code.
>
> In any case the tool is useful, so I think it deserves its place
> in scripts/
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
> scripts/inline-account.py | 173 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 173 insertions(+)
> create mode 100755 scripts/inline-account.py
>
> diff --git a/scripts/inline-account.py b/scripts/inline-account.py
> new file mode 100755
> index 000000000000..b6cfe195efe6
> --- /dev/null
> +++ b/scripts/inline-account.py
> @@ -0,0 +1,173 @@
> +#!/usr/bin/env python3
SPDX?
> +# account code bytes per source code / functions from objdump -Sl output
> +# useful to find inline bloat
> +# Author: Andi Kleen
A blank line please between the comment header
and the start of the code.
For example, see scripts/clang-tools/run-clang-tools.py
as a Python script written in a good coding style.
> +import os, sys, re, argparse, multiprocessing
> +from collections import Counter
> +from functools import reduce
> +
> +p = argparse.ArgumentParser(
> + description="""
> +Account code bytes per source code / functions from objdump.
> +Useful to find inline bloat.
> +
> +The line numbers are the beginning of a block, so the actual code can be later.
> +Line numbers can be a also little off due to objdump bugs
> +also some misaccounting can happen due to inexact gcc debug information.
> +The number output for functions may account a single large function multiple
> +times. program/object files need to be built with -g.
> +
> +This is somewhat slow due to objdump -S being slow. It helps to have
> +plenty of cores.""")
> +p.add_argument('--min-bytes', type=int, help='minimum bytes to report', default=100)
> +p.add_argument('--threads', '-t', type=int, default=multiprocessing.cpu_count(),
> + help='Number of objdump processes to run')
> +p.add_argument('--verbose', '-v', action='store_true', help="Print more")
> +p.add_argument('--show', type=int, help='Number of results to show')
> +p.add_argument('file', help='object file/program as input')
> +args = p.parse_args()
Put the code lines in a function,
except import directives and
if __name__ == '__main__':
main()
> +
> +def get_syms(fn):
> + f = os.popen("nm --print-size " + fn)
> + syms = []
> + pc = None
> + for l in f:
> + n = l.split()
> + if len(n) > 2 and n[2].upper() == "T":
> + pc = int(n[0], 16)
> + syms.append(pc)
> + ln = int(n[1], 16)
> + f.close()
with os.popen("nm --print-size " + fn) as f:
then, remove the f.close()
> + if not pc:
> + sys.exit(fn + " has no symbols")
> + syms.append(pc + ln)
I guess you intended 'pc + ln' as the
tail of the .text
Not really because the output from 'nm --print-size'
is not sorted by address.
You need to pass the -n flag to
make 'nm' sort the output by address.
> + return syms
> +
> +class Account:
> + pass
class without any method?
Please make add_account() and account_range()
class methods for readability.
Maybe account_range() to __init().
If you want to use map(),
al = list(map(lambda r: Account(r), ranges))
> +
> +def add_account(a, b):
> + a.funcbytes += b.funcbytes
> + a.linebytes += b.linebytes
> + a.funccount += b.funccount
> + a.nolinebytes += b.nolinebytes
> + a.nofuncbytes += b.nofuncbytes
> + a.total += b.total
> + return a
> +
> +# dont add sys.exit here, causes deadlocks
> +def account_range(r):
> + a = Account()
> + a.funcbytes = Counter()
> + a.linebytes = Counter()
> + a.funccount = Counter()
> + a.nolinebytes = 0
> + a.nofuncbytes = 0
> + a.total = 0
> +
> + line = None
> + func = None
> + codefunc = None
> +
> + cmd = ("objdump -Sl %s --start-address=%#x --stop-address=%#x" %
> + (args.file, r[0], r[1]))
> + f = os.popen(cmd)
Ditto. with os.popen(cmd) as f:
> + for l in f:
> + # 250: e8 00 00 00 00 callq 255 <proc_skip_spaces+0x5>
> + m = re.match(r'\s*([0-9a-fA-F]+):\s+(.*)', l)
> + if m:
> + #print "iscode", func, l,
Do not leave debugging code.
> + bytes = len(re.findall(r'[0-9a-f][0-9a-f] ', m.group(2)))
> + if not func:
> + a.nofuncbytes += bytes
> + continue
> + if not line:
> + a.nolinebytes += bytes
> + continue
> + a.total += bytes
> + a.funcbytes[func] += bytes
> + a.linebytes[(file, line)] += bytes
> + codefunc = func
> + continue
> +
> + # sysctl_init():
> + m = re.match(r'([a-zA-Z_][a-zA-Z0-9_]*)\(\):$', l)
> + if m:
> + if codefunc and m.group(1) != codefunc:
> + a.funccount[codefunc] += 1
> + codefunc = None
> + func = m.group(1)
> + continue
> +
> + # /sysctl.c:1666
> + m = re.match(r'^([^:]+):(\d+)$', l)
> + if m:
> + file, line = m.group(1), int(m.group(2))
> + continue
> + f.close()
> +
> + if codefunc:
> + a.funccount[codefunc] += 1
> + return a
> +
> +# objdump -S is slow, so we parallelize
> +
> +# split symbol table into chunks for parallelization
> +# we split on functions boundaries to avoid mis-accounting
> +# assumes functions have roughly similar length
> +syms = sorted(get_syms(args.file))
This sorted() is too late.
Sorting must be done by 'nm -n'.
> +chunk = max(int(min((len(syms) - 1) / args.threads, len(syms) - 1)), 1)
I do not understand.
Why min()?
If you check args.threads >= 1,
(len(syms) - 1) / args.threads is smaller, isn't it?
> +boundaries = [syms[x] for x in range(0, len(syms) - 1, chunk)] + [syms[-1]]
> +ranges = [(boundaries[x], boundaries[x+1]) for x in range(0, len(boundaries) - 1)
> + if boundaries[x+1] > boundaries[x]]
> +assert ranges[0][0] == syms[0]
> +assert ranges[-1][1] == syms[-1]
> +
> +# map-reduce
> +if args.threads == 1:
> + al = list(map(account_range, ranges))
> +else:
> + al = multiprocessing.Pool(args.threads).map(account_range, ranges)
> +a = reduce(add_account, al)
> +
> +print("Total code bytes seen", a.total)
> +if args.verbose:
> + print("Bytes with no function %d (%.2f%%)" % (a.nofuncbytes, 100.0*(float(a.nofuncbytes)/a.total)))
> + print("Bytes with no lines %d (%.2f%%)" % (a.nolinebytes, 100.0*(float(a.nolinebytes)/a.total)))
> +
> +def sort_map(m):
> + return sorted(list(m.keys()), key=lambda x: m[x], reverse=True)
> +
> +print("\nCode bytes by functions:")
> +print("%-50s %-5s %-5s %-5s %-5s" % ("Function", "Total", "", "Avg", "Num"))
> +for i, j in enumerate(sort_map(a.funcbytes)):
> + if a.funcbytes[j] < args.min_bytes:
> + break
> + if args.show and i >= args.show:
> + break
> + print("%-50s %-5d (%.2f%%) %-5d %-5d" % (
> + j,
> + a.funcbytes[j],
> + a.funcbytes[j] / float(a.total),
> + a.funcbytes[j] / a.funccount[j],
> + a.funccount[j]))
> +
> +for j in list(a.linebytes.keys()):
> + if a.linebytes[j] < args.min_bytes:
> + del a.linebytes[j]
> +
> +# os.path.commonprefix fails with >50k entries
> +# just use the first 10
> +prefix = os.path.commonprefix([x[0] for x in list(a.linebytes.keys())[:10]])
This does not seem to be a good idea.
I see a lot of absolute paths left behind,
and also empty lines.
> +
> +print("\nCode bytes by nearby source line blocks:")
> +print("prefix", prefix)
> +
> +print("%-50s %-5s" % ("Line", "Total"))
> +for i, j in enumerate(sort_map(a.linebytes)):
> + if args.show and i >= args.show:
> + break
> + print("%-50s %-5d (%.2f%%)" % (
> + "%s:%d" % (j[0].replace(prefix, ""), j[1]),
If j[0] and prefix are the same,
the file part disappears.
> + a.linebytes[j],
> + a.linebytes[j] / float(a.total)))
> --
> 2.41.0
>
--
Best Regards
Masahiro Yamada
On Mon, Oct 23, 2023 at 8:23 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> > How to use this tool?
> >
> > I did not get any output?
>
> Hmm, perhaps you need CONFIG_DEBUG_INFO=y ?
Please print a hint suggesting CONFIG_DEBUG_INFO
if "objdump --source" does not output the
source-mixed disassembly.
>
> > I do not think this patch is mature enough
> > regarding the coding style.
> >
> > Please polish the code.
>
> Don't follow. It's standard python code style, and the
> other python scripts in scripts/ use the same.
I hoped somebody would have provided some code reviews,
but it looks like it did not get people's interest.
See other scripts and check how the code is structured in functions.
I do not want to see a patch like
94a03e1d22e8dc75ddec159b2efadd4c6354503a
So, the script should be written in good quality
from day 1.
Also, PEP-8 is worth reading to learn good coding style
if you have not checked.
--
Best Regards
Masahiro Yamada
new file mode 100755
@@ -0,0 +1,173 @@
+#!/usr/bin/env python3
+# account code bytes per source code / functions from objdump -Sl output
+# useful to find inline bloat
+# Author: Andi Kleen
+import os, sys, re, argparse, multiprocessing
+from collections import Counter
+from functools import reduce
+
+p = argparse.ArgumentParser(
+ description="""
+Account code bytes per source code / functions from objdump.
+Useful to find inline bloat.
+
+The line numbers are the beginning of a block, so the actual code can be later.
+Line numbers can be a also little off due to objdump bugs
+also some misaccounting can happen due to inexact gcc debug information.
+The number output for functions may account a single large function multiple
+times. program/object files need to be built with -g.
+
+This is somewhat slow due to objdump -S being slow. It helps to have
+plenty of cores.""")
+p.add_argument('--min-bytes', type=int, help='minimum bytes to report', default=100)
+p.add_argument('--threads', '-t', type=int, default=multiprocessing.cpu_count(),
+ help='Number of objdump processes to run')
+p.add_argument('--verbose', '-v', action='store_true', help="Print more")
+p.add_argument('--show', type=int, help='Number of results to show')
+p.add_argument('file', help='object file/program as input')
+args = p.parse_args()
+
+def get_syms(fn):
+ f = os.popen("nm --print-size " + fn)
+ syms = []
+ pc = None
+ for l in f:
+ n = l.split()
+ if len(n) > 2 and n[2].upper() == "T":
+ pc = int(n[0], 16)
+ syms.append(pc)
+ ln = int(n[1], 16)
+ f.close()
+ if not pc:
+ sys.exit(fn + " has no symbols")
+ syms.append(pc + ln)
+ return syms
+
+class Account:
+ pass
+
+def add_account(a, b):
+ a.funcbytes += b.funcbytes
+ a.linebytes += b.linebytes
+ a.funccount += b.funccount
+ a.nolinebytes += b.nolinebytes
+ a.nofuncbytes += b.nofuncbytes
+ a.total += b.total
+ return a
+
+# dont add sys.exit here, causes deadlocks
+def account_range(r):
+ a = Account()
+ a.funcbytes = Counter()
+ a.linebytes = Counter()
+ a.funccount = Counter()
+ a.nolinebytes = 0
+ a.nofuncbytes = 0
+ a.total = 0
+
+ line = None
+ func = None
+ codefunc = None
+
+ cmd = ("objdump -Sl %s --start-address=%#x --stop-address=%#x" %
+ (args.file, r[0], r[1]))
+ f = os.popen(cmd)
+ for l in f:
+ # 250: e8 00 00 00 00 callq 255 <proc_skip_spaces+0x5>
+ m = re.match(r'\s*([0-9a-fA-F]+):\s+(.*)', l)
+ if m:
+ #print "iscode", func, l,
+ bytes = len(re.findall(r'[0-9a-f][0-9a-f] ', m.group(2)))
+ if not func:
+ a.nofuncbytes += bytes
+ continue
+ if not line:
+ a.nolinebytes += bytes
+ continue
+ a.total += bytes
+ a.funcbytes[func] += bytes
+ a.linebytes[(file, line)] += bytes
+ codefunc = func
+ continue
+
+ # sysctl_init():
+ m = re.match(r'([a-zA-Z_][a-zA-Z0-9_]*)\(\):$', l)
+ if m:
+ if codefunc and m.group(1) != codefunc:
+ a.funccount[codefunc] += 1
+ codefunc = None
+ func = m.group(1)
+ continue
+
+ # /sysctl.c:1666
+ m = re.match(r'^([^:]+):(\d+)$', l)
+ if m:
+ file, line = m.group(1), int(m.group(2))
+ continue
+ f.close()
+
+ if codefunc:
+ a.funccount[codefunc] += 1
+ return a
+
+# objdump -S is slow, so we parallelize
+
+# split symbol table into chunks for parallelization
+# we split on functions boundaries to avoid mis-accounting
+# assumes functions have roughly similar length
+syms = sorted(get_syms(args.file))
+chunk = max(int(min((len(syms) - 1) / args.threads, len(syms) - 1)), 1)
+boundaries = [syms[x] for x in range(0, len(syms) - 1, chunk)] + [syms[-1]]
+ranges = [(boundaries[x], boundaries[x+1]) for x in range(0, len(boundaries) - 1)
+ if boundaries[x+1] > boundaries[x]]
+assert ranges[0][0] == syms[0]
+assert ranges[-1][1] == syms[-1]
+
+# map-reduce
+if args.threads == 1:
+ al = list(map(account_range, ranges))
+else:
+ al = multiprocessing.Pool(args.threads).map(account_range, ranges)
+a = reduce(add_account, al)
+
+print("Total code bytes seen", a.total)
+if args.verbose:
+ print("Bytes with no function %d (%.2f%%)" % (a.nofuncbytes, 100.0*(float(a.nofuncbytes)/a.total)))
+ print("Bytes with no lines %d (%.2f%%)" % (a.nolinebytes, 100.0*(float(a.nolinebytes)/a.total)))
+
+def sort_map(m):
+ return sorted(list(m.keys()), key=lambda x: m[x], reverse=True)
+
+print("\nCode bytes by functions:")
+print("%-50s %-5s %-5s %-5s %-5s" % ("Function", "Total", "", "Avg", "Num"))
+for i, j in enumerate(sort_map(a.funcbytes)):
+ if a.funcbytes[j] < args.min_bytes:
+ break
+ if args.show and i >= args.show:
+ break
+ print("%-50s %-5d (%.2f%%) %-5d %-5d" % (
+ j,
+ a.funcbytes[j],
+ a.funcbytes[j] / float(a.total),
+ a.funcbytes[j] / a.funccount[j],
+ a.funccount[j]))
+
+for j in list(a.linebytes.keys()):
+ if a.linebytes[j] < args.min_bytes:
+ del a.linebytes[j]
+
+# os.path.commonprefix fails with >50k entries
+# just use the first 10
+prefix = os.path.commonprefix([x[0] for x in list(a.linebytes.keys())[:10]])
+
+print("\nCode bytes by nearby source line blocks:")
+print("prefix", prefix)
+
+print("%-50s %-5s" % ("Line", "Total"))
+for i, j in enumerate(sort_map(a.linebytes)):
+ if args.show and i >= args.show:
+ break
+ print("%-50s %-5d (%.2f%%)" % (
+ "%s:%d" % (j[0].replace(prefix, ""), j[1]),
+ a.linebytes[j],
+ a.linebytes[j] / float(a.total)))