[v1] arch: Enable function alignment for arm64

Message ID 20221208053649.540891-1-almasrymina@google.com
State New
Series [v1] arch: Enable function alignment for arm64

Commit Message

Mina Almasry Dec. 8, 2022, 5:36 a.m. UTC
  We recently ran into a double-digit percentage hackbench regression
when backporting commit 12df140f0bdf ("mm,hugetlb: take hugetlb_lock
before decrementing h->resv_huge_pages") to an older kernel. This was
surprising since hackbench does not use hugetlb pages at all and the
modified code is not invoked. After some debugging we found that the
regression can be fixed by back-porting commit d49a0626216b ("arch:
Introduce CONFIG_FUNCTION_ALIGNMENT") and enabling function alignment
for arm64. I suggest enabling it by default for arm64 if possible.

Tested by examining function alignment on a compiled object file
before/after:

After this patch:

$ ~/is-aligned.sh mm/hugetlb.o 16
file=mm/hugetlb.o, alignment=16
total number of functions: 146
total number of unaligned: 0

Before this patch:

$ ~/is-aligned.sh mm/hugetlb.o 16
file=mm/hugetlb.o, alignment=16
total number of functions: 146
total number of unaligned: 94
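
The is-aligned.sh script itself isn't included in this posting; the sketch
below is a hypothetical reconstruction of such a check. It counts defined
function symbols in the object file whose start address is not a multiple of
the requested alignment, and it assumes the .text sections themselves are at
least that aligned (symbol values in a .o are section-relative):

  #!/bin/sh
  # Hypothetical reconstruction of is-aligned.sh -- not the original script.
  # Usage: is-aligned.sh <object file> <alignment in bytes>
  file=$1
  align=$2
  echo "file=$file, alignment=$align"

  total=0
  unaligned=0
  # Walk the defined function symbols (nm type 't'/'T') and test each
  # start address against the requested alignment.
  for addr in $(nm --defined-only "$file" | awk '$2 ~ /^[tT]$/ { print $1 }'); do
      total=$((total + 1))
      if [ $(( 0x$addr % align )) -ne 0 ]; then
          unaligned=$((unaligned + 1))
      fi
  done

  echo "total number of functions: $total"
  echo "total number of unaligned: $unaligned"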

Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

--
2.39.0.rc0.267.gcb52ba06e7-goog
  

Comments

Mina Almasry Dec. 8, 2022, 6:47 a.m. UTC | #1
On Wed, Dec 7, 2022 at 9:36 PM Mina Almasry <almasrymina@google.com> wrote:
>
> We recently ran into a double-digit percentage hackbench regression
> when backporting commit 12df140f0bdf ("mm,hugetlb: take hugetlb_lock
> before decrementing h->resv_huge_pages") to an older kernel. This was
> surprising since hackbench does not use hugetlb pages at all and the
> modified code is not invoked. After some debugging we found that the
> regression can be fixed by back-porting commit d49a0626216b ("arch:
> Introduce CONFIG_FUNCTION_ALIGNMENT") and enabling function alignment
> for arm64. I suggest enabling it by default for arm64 if possible.
>
> Tested by examining function alignment on a compiled object file
> before/after:
>
> After this patch:
>
> $ ~/is-aligned.sh mm/hugetlb.o 16
> file=mm/hugetlb.o, alignment=16
> total number of functions: 146
> total number of unaligned: 0
>
> Before this patch:
>
> $ ~/is-aligned.sh mm/hugetlb.o 16
> file=mm/hugetlb.o, alignment=16
> total number of functions: 146
> total number of unaligned: 94
>
> Cc: Peter Zijlstra <peterz@infradead.org>

I missed the Signed-off-by: Mina Almasry <almasrymina@google.com> tag
here. I can add it with the next version.

> ---
>  arch/arm64/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index cf6d1cd8b6dc..bcc9e1578937 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -235,6 +235,7 @@ config ARM64
>         select TRACE_IRQFLAGS_SUPPORT
>         select TRACE_IRQFLAGS_NMI_SUPPORT
>         select HAVE_SOFTIRQ_ON_OWN_STACK
> +       select FUNCTION_ALIGNMENT_16B
>         help
>           ARM 64-bit (AArch64) Linux support.
>
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
  
Will Deacon Jan. 24, 2023, 12:09 p.m. UTC | #2
On Wed, Dec 07, 2022 at 09:36:48PM -0800, Mina Almasry wrote:
> We recently ran into a double-digit percentage hackbench regression
> when backporting commit 12df140f0bdf ("mm,hugetlb: take hugetlb_lock
> before decrementing h->resv_huge_pages") to an older kernel. This was
> surprising since hackbench does not use hugetlb pages at all and the
> modified code is not invoked. After some debugging we found that the
> regression can be fixed by back-porting commit d49a0626216b ("arch:
> Introduce CONFIG_FUNCTION_ALIGNMENT") and enabling function alignment
> for arm64. I suggest enabling it by default for arm64 if possible.
> 
> Tested by examining function alignment on a compiled object file
> before/after:
> 
> After this patch:
> 
> $ ~/is-aligned.sh mm/hugetlb.o 16
> file=mm/hugetlb.o, alignment=16
> total number of functions: 146
> total number of unaligned: 0
> 
> Before this patch:
> 
> $ ~/is-aligned.sh mm/hugetlb.o 16
> file=mm/hugetlb.o, alignment=16
> total number of functions: 146
> total number of unaligned: 94
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/arm64/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index cf6d1cd8b6dc..bcc9e1578937 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -235,6 +235,7 @@ config ARM64
>  	select TRACE_IRQFLAGS_SUPPORT
>  	select TRACE_IRQFLAGS_NMI_SUPPORT
>  	select HAVE_SOFTIRQ_ON_OWN_STACK
> +	select FUNCTION_ALIGNMENT_16B
>  	help
>  	  ARM 64-bit (AArch64) Linux support.

This increases the size of .text for a defconfig build by ~2%, so I think it
would be nice to have some real numbers for the performance uplift. Are you
able to elaborate beyond "double-digit percentage hackbench regression"?
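
A rough way to reproduce that .text comparison (the toolchain prefix is an
assumption, and exact numbers will vary with compiler version and config):

  $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
  $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j$(nproc) vmlinux
  $ size vmlinux    # note the "text" column, then apply/revert the
                    # FUNCTION_ALIGNMENT_16B change, rebuild, and compare.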

In general, however, I'm supportive of the patch (and it seems that x86
does the same thing) so:

Acked-by: Will Deacon <will@kernel.org>

Will
  
David Laight Jan. 25, 2023, 11:16 a.m. UTC | #3
From: Will Deacon <will@kernel.org>
> Sent: 24 January 2023 12:09
> 
> On Wed, Dec 07, 2022 at 09:36:48PM -0800, Mina Almasry wrote:
> > We recently ran into a double-digit percentage hackbench regression
> > when backporting commit 12df140f0bdf ("mm,hugetlb: take hugetlb_lock
> > before decrementing h->resv_huge_pages") to an older kernel. This was
> > surprising since hackbench does not use hugetlb pages at all and the
> > modified code is not invoked. After some debugging we found that the
> > regression can be fixed by back-porting commit d49a0626216b ("arch:
> > Introduce CONFIG_FUNCTION_ALIGNMENT") and enabling function alignment
> > for arm64. I suggest enabling it by default for arm64 if possible.
> >
...
> 
> This increases the size of .text for a defconfig build by ~2%, so I think it
> would be nice to have some real numbers for the performance uplift. Are you
> able to elaborate beyond "double-digit percentage hackbench regression"?
> 
> In general, however, I'm supportive of the patch (and it seems that x86
> does the same thing) so:

I bet it just changes the alignment of the code so that more
functions are using different cache lines.

All sorts of other random changes are likely to have a similar effect.

Cache-line aligning the start of a function probably reduces the
number of cache lines the functions needs - but that isn't guaranteed.
It also slightly reduces the delay on a cache miss - but they are so
slow it probably makes almost no difference.
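
As a purely illustrative example of that point: with 64-byte I-cache lines, a
100-byte function starting 8 bytes before a line boundary spans three lines
(8 + 64 + 28 bytes), while the same function starting on a line boundary fits
in two (64 + 36 bytes). 16-byte alignment makes such worst cases less likely
but, with 64-byte lines, does not rule them out, and the padding it adds
shifts everything that follows.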

	David

  
Mina Almasry Feb. 22, 2023, 10:08 p.m. UTC | #4
On Tue, Jan 24, 2023 at 4:09 AM Will Deacon <will@kernel.org> wrote:
>
> On Wed, Dec 07, 2022 at 09:36:48PM -0800, Mina Almasry wrote:
> > We recently ran into a double-digit percentage hackbench regression
> > when backporting commit 12df140f0bdf ("mm,hugetlb: take hugetlb_lock
> > before decrementing h->resv_huge_pages") to an older kernel. This was
> > surprising since hackbench does not use hugetlb pages at all and the
> > modified code is not invoked. After some debugging we found that the
> > regression can be fixed by back-porting commit d49a0626216b ("arch:
> > Introduce CONFIG_FUNCTION_ALIGNMENT") and enabling function alignment
> > for arm64. I suggest enabling it by default for arm64 if possible.
> >
> > Tested by examining function alignment on a compiled object file
> > before/after:
> >
> > After this patch:
> >
> > $ ~/is-aligned.sh mm/hugetlb.o 16
> > file=mm/hugetlb.o, alignment=16
> > total number of functions: 146
> > total number of unaligned: 0
> >
> > Before this patch:
> >
> > $ ~/is-aligned.sh mm/hugetlb.o 16
> > file=mm/hugetlb.o, alignment=16
> > total number of functions: 146
> > total number of unaligned: 94
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > ---
> >  arch/arm64/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index cf6d1cd8b6dc..bcc9e1578937 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -235,6 +235,7 @@ config ARM64
> >       select TRACE_IRQFLAGS_SUPPORT
> >       select TRACE_IRQFLAGS_NMI_SUPPORT
> >       select HAVE_SOFTIRQ_ON_OWN_STACK
> > +     select FUNCTION_ALIGNMENT_16B
> >       help
> >         ARM 64-bit (AArch64) Linux support.
>
> This increases the size of .text for a defconfig build by ~2%, so I think it
> would be nice to have some real numbers for the performance uplift. Are you
> able to elaborate beyond "double-digit percentage hackbench regression"?
>

(Sorry for the late reply)

The full story is already in the commit message. The only details I
omitted are the exact regression numbers we saw:

-16% on hackbench_process_pipes_234 (which should be `hackbench -pipe
234 process 1000`)
-23% on hackbench_process_sockets_234 (which should be `hackbench 234
process 1000`)
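
A rough reproduction sketch (assuming the classic standalone hackbench
binary; the run count and averaging here are illustrative, not exactly what
our harness does):

  $ for i in $(seq 1 10); do hackbench -pipe 234 process 1000; done   # pipes case
  $ for i in $(seq 1 10); do hackbench 234 process 1000; done         # sockets case

Compare the average reported "Time:" between kernels built with and without
FUNCTION_ALIGNMENT_16B.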

As the commit message says, it doesn't make much sense that
cherry-picking 12df140f0bdf ("mm,hugetlb: take hugetlb_lock before
decrementing h->resv_huge_pages") to our kernel causes such a huge
regression, because hackbench doesn't use hugetlb at all.

> In general, however, I'm supportive of the patch (and it seems that x86
> does the same thing) so:
>
> Acked-by: Will Deacon <will@kernel.org>
>
> Will
  
Mina Almasry Feb. 22, 2023, 10:16 p.m. UTC | #5
On Wed, Jan 25, 2023 at 3:16 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Will Deacon <will@kernel.org>
> > Sent: 24 January 2023 12:09
> >
> > On Wed, Dec 07, 2022 at 09:36:48PM -0800, Mina Almasry wrote:
> > > We recently ran into a double-digit percentage hackbench regression
> > > when backporting commit 12df140f0bdf ("mm,hugetlb: take hugetlb_lock
> > > before decrementing h->resv_huge_pages") to an older kernel. This was
> > > surprising since hackbench does not use hugetlb pages at all and the
> > > modified code is not invoked. After some debugging we found that the
> > > regression can be fixed by back-porting commit d49a0626216b ("arch:
> > > Introduce CONFIG_FUNCTION_ALIGNMENT") and enabling function alignment
> > > for arm64. I suggest enabling it by default for arm64 if possible.
> > >
> ...
> >
> > This increases the size of .text for a defconfig build by ~2%, so I think it
> > would be nice to have some real numbers for the performance uplift. Are you
> > able to elaborate beyond "double-digit percentage hackbench regression"?
> >
> > In general, however, I'm supportive of the patch (and it seems that x86
> > does the same thing) so:
>
> I bet it just changes the alignment of the code so that more
> functions are using different cache lines.
>
> All sorts of other random changes are likely to have a similar effect.
>
> Cache-line aligning the start of a function probably reduces the
> number of cache lines the functions needs - but that isn't guaranteed.
> It also slightly reduces the delay on a cache miss - but they are so
> slow it probably makes almost no difference.
>

David, my understanding is similar to yours. I.e. without explicit alignment:

1. Random changes to the code can cause critical path functions to
become aligned or unaligned which will cause perf
regressions/improvements.
2. Random changes to the code can cause critical path functions to be
placed near a cache line boundary, causing one more cache line to be
loaded when they are run, which will cause perf regressions.

So for these very reasons function alignment is a good thing.

>         David
>
  
David Laight Feb. 22, 2023, 10:42 p.m. UTC | #6
From: Mina Almasry
> Sent: 22 February 2023 22:16
> 
> On Wed, Jan 25, 2023 at 3:16 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Will Deacon <will@kernel.org>
> > > Sent: 24 January 2023 12:09
> > >
> > > On Wed, Dec 07, 2022 at 09:36:48PM -0800, Mina Almasry wrote:
> > > > We recently ran into a double-digit percentage hackbench regression
> > > > when backporting commit 12df140f0bdf ("mm,hugetlb: take hugetlb_lock
> > > > before decrementing h->resv_huge_pages") to an older kernel. This was
> > > > surprising since hackbench does not use hugetlb pages at all and the
> > > > modified code is not invoked. After some debugging we found that the
> > > > regression can be fixed by back-porting commit d49a0626216b ("arch:
> > > > Introduce CONFIG_FUNCTION_ALIGNMENT") and enabling function alignment
> > > > for arm64. I suggest enabling it by default for arm64 if possible.
> > > >
> > ...
> > >
> > > This increases the size of .text for a defconfig build by ~2%, so I think it
> > > would be nice to have some real numbers for the performance uplift. Are you
> > > able to elaborate beyond "double-digit percentage hackbench regression"?
> > >
> > > In general, however, I'm supportive of the patch (and it seems that x86
> > > does the same thing) so:
> >
> > I bet it just changes the alignment of the code so that more
> > functions are using different cache lines.
> >
> > All sorts of other random changes are likely to have a similar effect.
> >
> > Cache-line aligning the start of a function probably reduces the
> > number of cache lines the functions needs - but that isn't guaranteed.
> > It also slightly reduces the delay on a cache miss - but they are so
> > slow it probably makes almost no difference.
> >
> 
> David, my understanding is similar to yours. I.e. without explicit alignment:
> 
> 1. Random changes to the code can cause critical path functions to
> become aligned or unaligned which will cause perf
> regressions/improvements.
> 2. Random changes to the code can cause critical path functions to be
> placed near a cache line boundary, causing one more cache line to be
> loaded when they are run, which will cause perf regressions.
> 
> So for these very reasons function alignment is a good thing.

Except that aligning functions doesn't necessarily improve things.

Even within a function the alignment of the top of a loop (that
is executed a lot) might matter more than the alignment of the
function itself.

Any change will affect which code shares cache lines, and so can
stop the working set of some test (or a code loop with deep
function calls) from fitting in the I-cache, making it
much slower.

Changing the size also affects where the TLB boundaries are
(especially if not using very large pages).
If the number of pages exceeds the number of TLB entries, things
will slow down.

I think one version of gcc used to align most labels.
While the code might be slightly faster, the bloat actually
made it a net loss.

So aligning functions might help, but it might just
make things worse.
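
One way to see which effect dominates for a given workload is to compare
front-end counters with and without the alignment change; a sketch (event
names and their availability vary by CPU/PMU):

  $ perf stat -e instructions,L1-icache-load-misses,iTLB-load-misses \
        hackbench -pipe 234 process 1000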

	David

  

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index cf6d1cd8b6dc..bcc9e1578937 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -235,6 +235,7 @@  config ARM64
 	select TRACE_IRQFLAGS_SUPPORT
 	select TRACE_IRQFLAGS_NMI_SUPPORT
 	select HAVE_SOFTIRQ_ON_OWN_STACK
+	select FUNCTION_ALIGNMENT_16B
 	help
 	  ARM 64-bit (AArch64) Linux support.
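
Note on what the new select does (a sketch, assuming a tree that already
contains d49a0626216b): ARM64 selecting FUNCTION_ALIGNMENT_16B makes the
generic CONFIG_FUNCTION_ALIGNMENT option resolve to 16, which the top-level
Makefile passes to the compiler as -falign-functions=16. On a configured
arm64 tree this can be confirmed with:

  $ grep FUNCTION_ALIGNMENT .config    # output trimmed
  CONFIG_FUNCTION_ALIGNMENT_16B=y
  CONFIG_FUNCTION_ALIGNMENT=16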