Message ID | 20230515143536.114960-1-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Series | mm: Move mm_count into its own cache line
Commit Message
Mathieu Desnoyers
May 15, 2023, 2:35 p.m. UTC
The mm_struct mm_count field is frequently updated by the mmgrab()/mmdrop()
calls performed on context switch. This causes false-sharing with the
surrounding mm_struct fields, which are read-mostly.
This has been observed on a 2-socket / 112-core / 224-CPU Intel Sapphire
Rapids server running hackbench, and by the kernel test robot's
will-it-scale testcase.
Move the mm_count field into its own cache line to prevent false-sharing
with other mm_struct fields.
Make mm_count the first field of mm_struct to minimize the amount of
padding required: rather than adding padding both before and after the
mm_count field, padding only needs to be added after it.
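A minimal user-space sketch of this layout effect (the struct names and the
hard-coded 64-byte cache-line size are assumptions for illustration; this is
not the kernel code):

#include <stdio.h>

#define CACHELINE 64 /* assumed cache-line size */

/* Hot counter in the middle: padding is needed both before the aligned
 * anonymous struct (to reach a cache-line boundary) and inside it (to
 * fill the rest of its line). */
struct mm_middle {
        long a;
        struct {
                int mm_count;
        } __attribute__((aligned(CACHELINE)));
        long b;
};

/* Hot counter first: only the trailing padding remains. */
struct mm_first {
        struct {
                int mm_count;
        } __attribute__((aligned(CACHELINE)));
        long a;
        long b;
};

int main(void)
{
        printf("middle: %zu bytes, first: %zu bytes\n",
               sizeof(struct mm_middle), sizeof(struct mm_first));
        return 0;
}

On a typical LP64 build this reports 192 bytes for the middle placement vs
128 bytes for the first-field placement.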
Note that I noticed this odd comment in mm_struct:
commit 2e3025434a6b ("mm: relocate 'write_protect_seq' in struct mm_struct")
/*
* With some kernel config, the current mmap_lock's offset
* inside 'mm_struct' is at 0x120, which is very optimal, as
* its two hot fields 'count' and 'owner' sit in 2 different
* cachelines, and when mmap_lock is highly contended, both
* of the 2 fields will be accessed frequently, current layout
* will help to reduce cache bouncing.
*
* So please be careful with adding new fields before
* mmap_lock, which can easily push the 2 fields into one
* cacheline.
*/
struct rw_semaphore mmap_lock;
This comment is rather odd for a few reasons:
- It requires anyone adding or removing mm_struct fields to carefully
  consider the field alignment of _other_ fields,
- It expresses the wish to keep an "optimal" alignment for one specific
  kernel config.
I suspect that the author of this comment may want to revisit this topic
and perhaps introduce a split-struct approach for struct rw_semaphore,
if the need is to place various fields of this structure in different
cache lines.
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Link: https://lore.kernel.org/lkml/7a0c1db1-103d-d518-ed96-1584a28fbf32@efficios.com
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/oe-lkp/202305151017.27581d75-yujie.liu@intel.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Olivier Dion <odion@efficios.com>
Cc: michael.christie@oracle.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
---
include/linux/mm_types.h | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
Comments
On Mon, May 15, 2023 at 10:35:36AM -0400, Mathieu Desnoyers wrote:
[...]

Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Hi Mathieu,

On Mon, May 15, 2023 at 10:35:36PM +0800, Mathieu Desnoyers wrote:
[...]
> I suspect that the author of this comment may want to revisit this topic
> and perhaps introduce a split-struct approach for struct rw_semaphore,
> if the need is to place various fields of this structure in different
> cache lines.

Thanks for bringing this up. The full context of commit 2e3025434a6b is here:

https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/

Adding Linus and Waiman, who have analyzed this case: a commit changed the
cacheline layout of mmap_lock inside 'mm_struct', which caused a
will-it-scale regression. As false-sharing handling is tricky, we chose to
be defensive and just _restore_ its previous cacheline layout (even if it
is kind of weird that it is tied to specific kernel configs :)).

As for rw_semaphore, it is a fundamental thing, while that regression is
just a single micro-benchmark workload. IMHO, any change to its layout
should consider more workloads and deserves a wide range of benchmark tests.

I just checked the latest kernel, and the cache layout is already different
from what 2e3025434a6b tried to restore: the 'count' and 'owner' fields sit
in 2 different cachelines. So this patch won't 'hurt' in this regard.

Thanks,
Feng
On 5/15/23 10:35, Mathieu Desnoyers wrote:
[...]

Hi Andrew,

Given that this patch touches mm code, I think it should go through your
tree. Nobody has voiced any objection for a month, Aaron Lu gave his
Reviewed-by tag [1], and it solves measurable performance regressions.

Would you be willing to pick it up?

Please let me know if something else is needed.

Thanks!

Mathieu

[1] https://lore.kernel.org/lkml/20230516044050.GA315678@ziqianlu-desk2
On Mon, 15 May 2023 10:35:36 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
[...]
> +                /*
> +                 * Fields which are often written to are placed in a separate
> +                 * cache line.
> +                 */
> +                struct {
> +                        /**
> +                         * @mm_count: The number of references to &struct
> +                         * mm_struct (@mm_users count as 1).
> +                         *
> +                         * Use mmgrab()/mmdrop() to modify. When this drops to
> +                         * 0, the &struct mm_struct is freed.
> +                         */
> +                        atomic_t mm_count;
> +                } ____cacheline_aligned_in_smp;
> +

Why add the anonymous struct?

        atomic_t mm_count ____cacheline_aligned_in_smp;

would suffice?

Secondly, the ____cacheline_aligned_in_smp doesn't actually do anything?
mm_count is at offset 0, which is cacheline aligned anyway. The next field
(mm_mt) will share a cacheline with mm_count.

If the plan is to put mm_count in "its own" cacheline then padding will be
needed?
On 6/16/23 16:16, Andrew Morton wrote:
[...]
> Why add the anonymous struct?
>
>         atomic_t mm_count ____cacheline_aligned_in_smp;
>
> would suffice?

The anonymous struct is needed to ensure that mm_count is alone in its own
cacheline. An aligned attribute applied to an integer field only aligns the
offset of that field in the structure, without changing its size. An aligned
attribute applied to a structure both aligns its offset in the containing
structure _and_ extends its size with padding. This takes care of adding
padding _after_ mm_count as well.

Alignment on a structure requires that the structure size is extended with
padding to cover the required alignment. Otherwise an array of that
structure could not have each of its items on a multiple of the required
alignment.

> Secondly, the ____cacheline_aligned_in_smp doesn't actually do
> anything?  mm_count is at offset 0 which is cacheline aligned anyway.
> The next field (mm_mt) will share a cacheline with mm_count.

If applied to the integer field, I agree that it would not do anything.
However, applied to the anonymous structure, it takes care of adding
padding _after_ the mm_count field, which is exactly what we want here.

> If the plan is to put mm_count in "its own" cacheline then padding will
> be needed?

It's taken care of by the anonymous structure trick. Here is a quick
example showing the difference between the alignment attribute applied to
an integer type vs. to an anonymous structure:

#include <stdio.h>

struct s1 {
        int a __attribute__((aligned(32)));
        int b;
};

struct s2 {
        int a;
        int b __attribute__((aligned(32)));
};

struct s3 {
        struct {
                int a;
        } __attribute__((aligned(32)));
        int b;
};

int main(void)
{
        struct s1 t1;
        struct s2 t2;
        struct s3 t3;

        printf("%lu %lu\n", (unsigned long)&t1.a - (unsigned long)&t1,
                            (unsigned long)&t1.b - (unsigned long)&t1);
        printf("%lu %lu\n", (unsigned long)&t2.a - (unsigned long)&t2,
                            (unsigned long)&t2.b - (unsigned long)&t2);
        printf("%lu %lu\n", (unsigned long)&t3.a - (unsigned long)&t3,
                            (unsigned long)&t3.b - (unsigned long)&t3);
        return 0;
}

Result:

0 4
0 32
0 32

Applying the aligned attribute to the integer field would require explicit
padding, because it does not add padding after the field. Applying the
aligned attribute to the _following_ field would work, but it's rather odd
and error-prone. Applying the aligned attribute to an anonymous structure
clearly documents the intent, and adds the padding that guarantees this
variable is alone in its cache line.

Thanks,

Mathieu
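For reference, the ____cacheline_aligned_in_smp annotation used in the patch
boils down, on SMP configs, to the same attribute as in the example above;
roughly, from include/linux/cache.h:

#define ____cacheline_aligned           __attribute__((__aligned__(SMP_CACHE_BYTES)))
#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp    ____cacheline_aligned
#else
#define ____cacheline_aligned_in_smp
#endif

So, applied to the anonymous struct wrapping mm_count, it both places
mm_count on a cache-line boundary and pads that struct out to
SMP_CACHE_BYTES, which is what keeps mm_mt off the hot line on SMP kernels
(and it compiles away entirely on !SMP).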
On 6/16/23 13:38, Mathieu Desnoyers wrote:
...
>>> This comment is rather odd for a few reasons:
>>>
>>> - It requires anyone adding or removing mm_struct fields to carefully
>>>   consider the field alignment of _other_ fields,
>>> - It expresses the wish to keep an "optimal" alignment for one specific
>>>   kernel config.
>>>
>>> I suspect that the author of this comment may want to revisit this topic
>>> and perhaps introduce a split-struct approach for struct rw_semaphore,
>>> if the need is to place various fields of this structure in different
>>> cache lines.

Agreed. The whole thing is far too fragile, but when reviewing this I
wasn't sure what else to suggest. Now looking at it again with your
alignment suggestion, there is an interesting conflicting set of desires:

a) Here: Feng Tang discovered that .count and .owner are best put in
   separate cache lines for the contended case for mmap_lock, and

b) rwsem.h, which specifies precisely the opposite for the uncontended
   case:

 * For an uncontended rwsem, count and owner are the only fields a task
 * needs to touch when acquiring the rwsem. So they are put next to each
 * other to increase the chance that they will share the same cacheline.

I suspect that overall, it's "better" to align rw_semaphore's .count and
.owner fields so that the lock is optimized for the contended case, because
it's reasonable to claim that the benefit of having those two fields in the
same cacheline for the uncontended case is far less than the cost to the
contended case of keeping them close to each other.

However, it's still not unlikely that someone will measure a performance
regression if such a change is made.

Thoughts?

...
>> It's taken care of by the anonymous structure trick.

Thanks for explaining very clearly how that works, that's really helpful!

thanks,
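To make the two layouts concrete, here is a minimal user-space sketch
(hypothetical stand-in structs and a 64-byte cache-line assumption; not the
actual struct rw_semaphore) that prints where count and owner land in each
arrangement:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64 /* assumed cache-line size */

/* (b) uncontended-friendly: count and owner packed together, so an
 * uncontended acquire touches a single cache line. */
struct rwsem_packed {
        long count;
        void *owner;
};

/* (a) contended-friendly: count isolated in its own cache line, so
 * writers hammering count do not invalidate the line holding owner. */
struct rwsem_split {
        struct {
                long count;
        } __attribute__((aligned(CACHELINE)));
        void *owner;
};

int main(void)
{
        printf("packed: count@%zu owner@%zu size=%zu\n",
               offsetof(struct rwsem_packed, count),
               offsetof(struct rwsem_packed, owner),
               sizeof(struct rwsem_packed));
        printf("split:  count@%zu owner@%zu size=%zu\n",
               offsetof(struct rwsem_split, count),
               offsetof(struct rwsem_split, owner),
               sizeof(struct rwsem_split));
        return 0;
}

The split layout keeps a heavily written count from invalidating the line
that holds owner, at the cost of a larger footprint, which is essentially
the trade-off discussed above.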
On 5/15/23 07:35, Mathieu Desnoyers wrote:
[...]

Oh, and I almost forgot to add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
On 6/16/23 18:35, John Hubbard wrote:
[...]
> a) Here: Feng Tang discovered that .count and .owner are best put in
>    separate cache lines for the contended case for mmap_lock, and
>
> b) rwsem.h, which specifies precisely the opposite for the uncontended
>    case:
[...]
> However, it's still not unlikely that someone will measure a performance
> regression if such a change is made.
>
> Thoughts?

I suspect that the purpose of b) is only to maximize the functional density
of the data cache: using only a single cache line for the uncontended rwsem
case has the smallest footprint on the data cache.

However, if we look at the rwsem in the context of its use within another
data structure containing it, I think we can do better by analyzing the
access pattern of the _other_ fields of that structure.

I have faced a similar situation with liburcu's wfcqueue API [1], where it
is better for head and tail to sit on different cache lines to eliminate
false-sharing between enqueue and dequeue. I solved this by splitting the
head and tail parameters in the API. So the user can decide to place them
either on the same cache line or on different cache lines, depending on the
use-case. The user also has the freedom to place both head and tail on the
same cache line as _other_ fields based on the usage pattern.

By providing enough flexibility to place the rwsem fields so that the count
is on its own cache line, and owner is on the same cache line as other
fields touched when the rwsem is held, I suspect we can both improve
functional density _and_ eliminate false-sharing in the contended case.

Thanks,

Mathieu

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/wfcqueue.h#L279
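The same idea in miniature, as a sketch rather than the actual liburcu API
(the type and field names below are made up for illustration): the queue
only defines separate head and tail types, and the embedding structure
decides how they are laid out:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64 /* assumed cache-line size */

/* Head and tail are separate types, so the embedding structure (not the
 * queue implementation) decides whether they share a cache line. */
struct q_head { void *first; };
struct q_tail { void *last;  };

/* One possible embedding: producers only touch the tail, consumers only
 * the head, so give each its own cache line to avoid false sharing. */
struct channel {
        struct q_head head __attribute__((aligned(CACHELINE)));
        struct q_tail tail __attribute__((aligned(CACHELINE)));
};

int main(void)
{
        printf("head@%zu tail@%zu\n",
               offsetof(struct channel, head),
               offsetof(struct channel, tail));
        return 0;
}

An embedder that knows producers and consumers run on different CPUs can
split the two as above; one that does not care can simply place them
adjacently.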
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..de10fc797c8e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -583,6 +583,21 @@ struct mm_cid {
 struct kioctx_table;
 struct mm_struct {
         struct {
+                /*
+                 * Fields which are often written to are placed in a separate
+                 * cache line.
+                 */
+                struct {
+                        /**
+                         * @mm_count: The number of references to &struct
+                         * mm_struct (@mm_users count as 1).
+                         *
+                         * Use mmgrab()/mmdrop() to modify. When this drops to
+                         * 0, the &struct mm_struct is freed.
+                         */
+                        atomic_t mm_count;
+                } ____cacheline_aligned_in_smp;
+
                 struct maple_tree mm_mt;
 #ifdef CONFIG_MMU
                 unsigned long (*get_unmapped_area) (struct file *filp,
@@ -620,14 +635,6 @@ struct mm_struct {
                  */
                 atomic_t mm_users;
 
-                /**
-                 * @mm_count: The number of references to &struct mm_struct
-                 * (@mm_users count as 1).
-                 *
-                 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-                 * &struct mm_struct is freed.
-                 */
-                atomic_t mm_count;
 #ifdef CONFIG_SCHED_MM_CID
                 /**
                  * @pcpu_cid: Per-cpu current cid.