Message ID | 20230515143536.114960-1-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Series | mm: Move mm_count into its own cache line
Commit Message
Mathieu Desnoyers
May 15, 2023, 2:35 p.m. UTC
The mm_struct mm_count field is frequently updated by the mmgrab()/mmdrop()
calls performed on context switch. This causes false-sharing with the
surrounding mm_struct fields, which are read-mostly.
This has been observed on a 2-socket / 112-core / 224-CPU Intel Sapphire
Rapids server running hackbench, and by the kernel test robot's
will-it-scale testcase.
Move the mm_count field into its own cache line to prevent false-sharing
with other mm_struct fields.
Make mm_count the first field of mm_struct to minimize the amount of
padding required: rather than adding padding both before and after the
mm_count field, padding only needs to be added after it.
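A minimal user-space sketch of this layout effect (the struct names and the
hard-coded 64-byte cache-line size are assumptions for illustration; this is
not the kernel code):

#include <stdio.h>

#define CACHELINE 64 /* assumed cache-line size */

/* Hot counter in the middle: padding is needed both before the aligned
 * anonymous struct (to reach a cache-line boundary) and inside it (to
 * fill the rest of its line). */
struct mm_middle {
        long a;
        struct {
                int mm_count;
        } __attribute__((aligned(CACHELINE)));
        long b;
};

/* Hot counter first: only the trailing padding remains. */
struct mm_first {
        struct {
                int mm_count;
        } __attribute__((aligned(CACHELINE)));
        long a;
        long b;
};

int main(void)
{
        printf("middle: %zu bytes, first: %zu bytes\n",
               sizeof(struct mm_middle), sizeof(struct mm_first));
        return 0;
}

On a typical LP64 build this reports 192 bytes for the middle placement vs
128 bytes for the first-field placement.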
Note that I noticed this odd comment in mm_struct:
commit 2e3025434a6b ("mm: relocate 'write_protect_seq' in struct mm_struct")
/*
* With some kernel config, the current mmap_lock's offset
* inside 'mm_struct' is at 0x120, which is very optimal, as
* its two hot fields 'count' and 'owner' sit in 2 different
* cachelines, and when mmap_lock is highly contended, both
* of the 2 fields will be accessed frequently, current layout
* will help to reduce cache bouncing.
*
* So please be careful with adding new fields before
* mmap_lock, which can easily push the 2 fields into one
* cacheline.
*/
struct rw_semaphore mmap_lock;
This comment is rather odd for a few reasons:
- It requires anyone adding or removing mm_struct fields to carefully
  consider the field alignment of _other_ fields,
- It expresses the wish to keep an "optimal" alignment for one specific
  kernel config.
I suspect that the author of this comment may want to revisit this topic
and perhaps introduce a split-struct approach for struct rw_semaphore,
if the need is to place various fields of this structure in different
cache lines.
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Link: https://lore.kernel.org/lkml/7a0c1db1-103d-d518-ed96-1584a28fbf32@efficios.com
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/oe-lkp/202305151017.27581d75-yujie.liu@intel.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Olivier Dion <odion@efficios.com>
Cc: michael.christie@oracle.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
---
include/linux/mm_types.h | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
Comments
On Mon, May 15, 2023 at 10:35:36AM -0400, Mathieu Desnoyers wrote:
[...]

Reviewed-by: Aaron Lu <aaron.lu@intel.com>
Hi Mathieu,

On Mon, May 15, 2023 at 10:35:36PM +0800, Mathieu Desnoyers wrote:
[...]
> I suspect that the author of this comment may want to revisit this topic
> and perhaps introduce a split-struct approach for struct rw_semaphore,
> if the need is to place various fields of this structure in different
> cache lines.

Thanks for bringing this up. The full context of commit 2e3025434a6b is here:

https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/

Adding Linus and Waiman, who have analyzed this case: a commit changed the
cacheline layout of mmap_lock inside 'mm_struct', which caused a
will-it-scale regression. As false-sharing handling is tricky, we chose to
be defensive and just _restore_ its previous cacheline layout (even if it
is kind of weird that it is tied to specific kernel configs :)).

As for rw_semaphore, it is a fundamental thing, while that regression is
just a single micro-benchmark workload. IMHO, any change to its layout
should consider more workloads and deserves a wide range of benchmark tests.

I just checked the latest kernel, and the cache layout is already different
from what 2e3025434a6b tried to restore: the 'count' and 'owner' fields sit
in 2 different cachelines. So this patch won't 'hurt' in this regard.

Thanks,
Feng
On 5/15/23 10:35, Mathieu Desnoyers wrote:
[...]

Hi Andrew,

Given that this patch touches mm code, I think it should go through your
tree. Nobody has voiced any objection for a month, Aaron Lu gave his
Reviewed-by tag [1], and it solves measurable performance regressions.

Would you be willing to pick it up?

Please let me know if something else is needed.

Thanks!

Mathieu

[1] https://lore.kernel.org/lkml/20230516044050.GA315678@ziqianlu-desk2
On Mon, 15 May 2023 10:35:36 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
[...]
> +                /*
> +                 * Fields which are often written to are placed in a separate
> +                 * cache line.
> +                 */
> +                struct {
> +                        /**
> +                         * @mm_count: The number of references to &struct
> +                         * mm_struct (@mm_users count as 1).
> +                         *
> +                         * Use mmgrab()/mmdrop() to modify. When this drops to
> +                         * 0, the &struct mm_struct is freed.
> +                         */
> +                        atomic_t mm_count;
> +                } ____cacheline_aligned_in_smp;
> +

Why add the anonymous struct?

        atomic_t mm_count ____cacheline_aligned_in_smp;

would suffice?

Secondly, the ____cacheline_aligned_in_smp doesn't actually do anything?
mm_count is at offset 0, which is cacheline aligned anyway. The next field
(mm_mt) will share a cacheline with mm_count.

If the plan is to put mm_count in "its own" cacheline then padding will be
needed?
On 6/16/23 16:16, Andrew Morton wrote:
[...]
> Why add the anonymous struct?
>
>         atomic_t mm_count ____cacheline_aligned_in_smp;
>
> would suffice?

The anonymous struct is needed to ensure that mm_count is alone in its own
cacheline. An aligned attribute applied to an integer field only aligns the
offset of that field in the structure, without changing its size. An aligned
attribute applied to a structure both aligns its offset in the containing
structure _and_ extends its size with padding. This takes care of adding
padding _after_ mm_count as well.

Alignment on a structure requires that the structure size is extended with
padding to cover the required alignment. Otherwise an array of that
structure could not have each of its items on a multiple of the required
alignment.

> Secondly, the ____cacheline_aligned_in_smp doesn't actually do
> anything?  mm_count is at offset 0 which is cacheline aligned anyway.
> The next field (mm_mt) will share a cacheline with mm_count.

If applied to the integer field, I agree that it would not do anything.
However, applied to the anonymous structure, it takes care of adding
padding _after_ the mm_count field, which is exactly what we want here.

> If the plan is to put mm_count in "its own" cacheline then padding will
> be needed?

It's taken care of by the anonymous structure trick. Here is a quick
example showing the difference between the alignment attribute applied to
an integer type vs. to an anonymous structure:

#include <stdio.h>

struct s1 {
        int a __attribute__((aligned(32)));
        int b;
};

struct s2 {
        int a;
        int b __attribute__((aligned(32)));
};

struct s3 {
        struct {
                int a;
        } __attribute__((aligned(32)));
        int b;
};

int main(void)
{
        struct s1 t1;
        struct s2 t2;
        struct s3 t3;

        printf("%lu %lu\n", (unsigned long)&t1.a - (unsigned long)&t1,
                            (unsigned long)&t1.b - (unsigned long)&t1);
        printf("%lu %lu\n", (unsigned long)&t2.a - (unsigned long)&t2,
                            (unsigned long)&t2.b - (unsigned long)&t2);
        printf("%lu %lu\n", (unsigned long)&t3.a - (unsigned long)&t3,
                            (unsigned long)&t3.b - (unsigned long)&t3);
        return 0;
}

Result:

0 4
0 32
0 32

Applying the aligned attribute to the integer field would require explicit
padding, because it does not add padding after the field. Applying the
aligned attribute to the _following_ field would work, but it's rather odd
and error-prone. Applying the aligned attribute to an anonymous structure
clearly documents the intent, and adds the padding that guarantees this
variable is alone in its cache line.

Thanks,

Mathieu
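For reference, the ____cacheline_aligned_in_smp annotation used in the patch
boils down, on SMP configs, to the same attribute as in the example above;
roughly, from include/linux/cache.h:

#define ____cacheline_aligned           __attribute__((__aligned__(SMP_CACHE_BYTES)))
#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp    ____cacheline_aligned
#else
#define ____cacheline_aligned_in_smp
#endif

So, applied to the anonymous struct wrapping mm_count, it both places
mm_count on a cache-line boundary and pads that struct out to
SMP_CACHE_BYTES, which is what keeps mm_mt off the hot line on SMP kernels
(and it compiles away entirely on !SMP).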
On 6/16/23 13:38, Mathieu Desnoyers wrote:
...
>>> This comment is rather odd for a few reasons:
>>>
>>> - It requires anyone adding or removing mm_struct fields to carefully
>>>   consider the field alignment of _other_ fields,
>>> - It expresses the wish to keep an "optimal" alignment for one specific
>>>   kernel config.
>>>
>>> I suspect that the author of this comment may want to revisit this topic
>>> and perhaps introduce a split-struct approach for struct rw_semaphore,
>>> if the need is to place various fields of this structure in different
>>> cache lines.

Agreed. The whole thing is far too fragile, but when reviewing this I
wasn't sure what else to suggest. Now looking at it again with your
alignment suggestion, there is an interesting conflicting set of desires:

a) Here: Feng Tang discovered that .count and .owner are best put in
   separate cache lines for the contended case for mmap_lock, and

b) rwsem.h, which specifies precisely the opposite for the uncontended
   case:

 * For an uncontended rwsem, count and owner are the only fields a task
 * needs to touch when acquiring the rwsem. So they are put next to each
 * other to increase the chance that they will share the same cacheline.

I suspect that overall, it's "better" to align rw_semaphore's .count and
.owner fields so that the lock is optimized for the contended case, because
it's reasonable to claim that the benefit of having those two fields in the
same cacheline for the uncontended case is far less than the cost to the
contended case of keeping them close to each other.

However, it's still not unlikely that someone will measure a performance
regression if such a change is made.

Thoughts?

...
>> It's taken care of by the anonymous structure trick.

Thanks for explaining very clearly how that works, that's really helpful!

thanks,
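To make the two layouts concrete, here is a minimal user-space sketch
(hypothetical stand-in structs and a 64-byte cache-line assumption; not the
actual struct rw_semaphore) that prints where count and owner land in each
arrangement:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64 /* assumed cache-line size */

/* (b) uncontended-friendly: count and owner packed together, so an
 * uncontended acquire touches a single cache line. */
struct rwsem_packed {
        long count;
        void *owner;
};

/* (a) contended-friendly: count isolated in its own cache line, so
 * writers hammering count do not invalidate the line holding owner. */
struct rwsem_split {
        struct {
                long count;
        } __attribute__((aligned(CACHELINE)));
        void *owner;
};

int main(void)
{
        printf("packed: count@%zu owner@%zu size=%zu\n",
               offsetof(struct rwsem_packed, count),
               offsetof(struct rwsem_packed, owner),
               sizeof(struct rwsem_packed));
        printf("split:  count@%zu owner@%zu size=%zu\n",
               offsetof(struct rwsem_split, count),
               offsetof(struct rwsem_split, owner),
               sizeof(struct rwsem_split));
        return 0;
}

The split layout keeps a heavily written count from invalidating the line
that holds owner, at the cost of a larger footprint, which is essentially
the trade-off discussed above.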
On 5/15/23 07:35, Mathieu Desnoyers wrote:
[...]

Oh, and I almost forgot to add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
On 6/16/23 18:35, John Hubbard wrote:
[...]
> a) Here: Feng Tang discovered that .count and .owner are best put in
>    separate cache lines for the contended case for mmap_lock, and
>
> b) rwsem.h, which specifies precisely the opposite for the uncontended
>    case:
[...]
> However, it's still not unlikely that someone will measure a performance
> regression if such a change is made.
>
> Thoughts?

I suspect that the purpose of b) is only to maximize the functional density
of the data cache: using only a single cache line for the uncontended rwsem
case has the smallest footprint on the data cache.

However, if we look at the rwsem in the context of its use within another
data structure containing it, I think we can do better by analyzing the
access pattern of the _other_ fields of that structure.

I have faced a similar situation with liburcu's wfcqueue API [1], where it
is better for head and tail to sit on different cache lines to eliminate
false-sharing between enqueue and dequeue. I solved this by splitting the
head and tail parameters in the API. So the user can decide to place them
either on the same cache line or on different cache lines, depending on the
use-case. The user also has the freedom to place both head and tail on the
same cache line as _other_ fields based on the usage pattern.

By providing enough flexibility to place the rwsem fields so that the count
is on its own cache line, and owner is on the same cache line as other
fields touched when the rwsem is held, I suspect we can both improve
functional density _and_ eliminate false-sharing in the contended case.

Thanks,

Mathieu

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/wfcqueue.h#L279
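The same idea in miniature, as a sketch rather than the actual liburcu API
(the type and field names below are made up for illustration): the queue
only defines separate head and tail types, and the embedding structure
decides how they are laid out:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64 /* assumed cache-line size */

/* Head and tail are separate types, so the embedding structure (not the
 * queue implementation) decides whether they share a cache line. */
struct q_head { void *first; };
struct q_tail { void *last;  };

/* One possible embedding: producers only touch the tail, consumers only
 * the head, so give each its own cache line to avoid false sharing. */
struct channel {
        struct q_head head __attribute__((aligned(CACHELINE)));
        struct q_tail tail __attribute__((aligned(CACHELINE)));
};

int main(void)
{
        printf("head@%zu tail@%zu\n",
               offsetof(struct channel, head),
               offsetof(struct channel, tail));
        return 0;
}

An embedder that knows producers and consumers run on different CPUs can
split the two as above; one that does not care can simply place them
adjacently.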
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..de10fc797c8e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -583,6 +583,21 @@ struct mm_cid {
 struct kioctx_table;
 struct mm_struct {
         struct {
+                /*
+                 * Fields which are often written to are placed in a separate
+                 * cache line.
+                 */
+                struct {
+                        /**
+                         * @mm_count: The number of references to &struct
+                         * mm_struct (@mm_users count as 1).
+                         *
+                         * Use mmgrab()/mmdrop() to modify. When this drops to
+                         * 0, the &struct mm_struct is freed.
+                         */
+                        atomic_t mm_count;
+                } ____cacheline_aligned_in_smp;
+
                 struct maple_tree mm_mt;
 #ifdef CONFIG_MMU
                 unsigned long (*get_unmapped_area) (struct file *filp,
@@ -620,14 +635,6 @@ struct mm_struct {
                  */
                 atomic_t mm_users;
 
-                /**
-                 * @mm_count: The number of references to &struct mm_struct
-                 * (@mm_users count as 1).
-                 *
-                 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-                 * &struct mm_struct is freed.
-                 */
-                atomic_t mm_count;
 #ifdef CONFIG_SCHED_MM_CID
                 /**
                  * @pcpu_cid: Per-cpu current cid.