Message ID: 20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
Headers:
From: Chris Li <chrisl@kernel.org>
Subject: [PATCH 0/2] RFC: zswap tree use xarray instead of RB tree
Date: Wed, 17 Jan 2024 19:05:40 -0800
Message-Id: <20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Wei Xu <weixugc@google.com>, Yu Zhao <yuzhao@google.com>, Greg Thelen <gthelen@google.com>, Chun-Tse Shao <ctshao@google.com>, Suren Baghdasaryan <surenb@google.com>, Yosry Ahmed <yosryahmed@google.com>, Brian Geffon <bgeffon@google.com>, Minchan Kim <minchan@kernel.org>, Michal Hocko <mhocko@suse.com>, Mel Gorman <mgorman@techsingularity.net>, Huang Ying <ying.huang@intel.com>, Nhat Pham <nphamcs@gmail.com>, Johannes Weiner <hannes@cmpxchg.org>, Kairui Song <kasong@tencent.com>, Zhongkun He <hezhongkun.hzk@bytedance.com>, Kemeng Shi <shikemeng@huaweicloud.com>, Barry Song <v-songbaohua@oppo.com>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Liam R. Howlett" <Liam.Howlett@oracle.com>, Joel Fernandes <joel@joelfernandes.org>, Chengming Zhou <zhouchengming@bytedance.com>, Chris Li <chrisl@kernel.org>
X-Mailer: b4 0.12.3
Series: RFC: zswap tree use xarray instead of RB tree
Message
Chris Li
Jan. 18, 2024, 3:05 a.m. UTC
The RB tree shows some contribution to the swap fault
long tail latency due to two factors:
1) The RB tree requires re-balancing from time to time.
2) The zswap RB tree has a tree-level spin lock protecting
tree access.

The swap cache already uses an xarray. A breakdown of swap cache
access does not show the same long tail as the zswap RB tree.

Moving the zswap entries to an xarray enables the read side to
take only the RCU read lock.
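To make the locking difference concrete, here is a rough kernel-style sketch of the two read paths (function and field names are illustrative, not taken from the actual patch):

```c
/* Illustrative sketch only -- names are hypothetical, not from the patch. */

/* Today: every lookup takes the per-tree spin lock and walks an RB tree,
 * so concurrent readers serialize on the lock. */
struct zswap_entry *lookup_rbtree(struct zswap_tree *tree, pgoff_t offset)
{
	struct zswap_entry *entry;

	spin_lock(&tree->lock);
	entry = zswap_rb_search(&tree->rbroot, offset);
	spin_unlock(&tree->lock);
	return entry;
}

/* With an xarray: lookups run under the RCU read lock only.
 * xa_load() enters an RCU read-side critical section internally,
 * so readers never contend on a lock. */
struct zswap_entry *lookup_xarray(struct xarray *xa, pgoff_t offset)
{
	return xa_load(xa, offset);
}
```

Writers still serialize on the xarray's internal spinlock (`xa_lock`), so only the read side benefits directly from this conversion.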
The first patch adds the xarray alongside the RB tree.
There are some debug checks asserting that the xarray agrees with
the RB tree results.

The second patch removes the zswap RB tree.

I expect to merge the zswap RB tree spin lock with the xarray
lock in follow-up changes.

I could certainly use some help in reviewing and testing.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Chris Li (2):
mm: zswap.c: add xarray tree to zswap
mm: zswap.c: remove RB tree
mm/zswap.c | 120 ++++++++++++++++++++++++++++++-------------------------------
1 file changed, 59 insertions(+), 61 deletions(-)
---
base-commit: d7ba3d7c3bf13e2faf419cce9e9bdfc3a1a50905
change-id: 20240104-zswap-xarray-716260e541e3
Best regards,
--
Chris Li <chrisl@kernel.org>
Comments
That's a long CC list for sure :)

On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
>
> The RB tree shows some contribution to the swap fault
> long tail latency due to two factors:
> 1) RB tree requires re-balance from time to time.
> 2) The zswap RB tree has a tree level spin lock protecting
> the tree access.
>
> The swap cache is using xarray. The break down the swap
> cache access does not have the similar long time as zswap
> RB tree.

I think the comparison to the swap cache may not be valid as the swap
cache has many trees per swapfile, while zswap has a single tree.

> Moving the zswap entry to xarray enable read side
> take read RCU lock only.

Nice.

> The first patch adds the xarray alongside the RB tree.
> There is some debug check asserting the xarray agrees with
> the RB tree results.
>
> The second patch removes the zwap RB tree.

The breakdown looks like something that would be a development step,
but for patch submission I think it makes more sense to have a single
patch replacing the rbtree with an xarray.

> I expect to merge the zswap rb tree spin lock with the xarray
> lock in the follow up changes.

Shouldn't this simply be changing uses of tree->lock to use
xa_{lock/unlock}? We also need to make sure we don't try to lock the
tree when operating on the xarray if the caller is already holding the
lock, but this seems to be straightforward enough to be done as part
of this patch or this series at least.

Am I missing something?

> I can surely use some help in reviewing and testing.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> [...]
On Wed, Jan 17, 2024 at 10:01 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> [...]
> Shouldn't this simply be changing uses of tree->lock to use
> xa_{lock/unlock}? We also need to make sure we don't try to lock the
> tree when operating on the xarray if the caller is already holding the
> lock, but this seems to be straightforward enough to be done as part
> of this patch or this series at least.
>
> Am I missing something?

Also, I assume we will only see performance improvements after the
tree lock in its current form is removed so that we get loads
protected only by RCU. Can we get some performance numbers to see how
the latency improves with the xarray under contention (unless
Chengming is already planning on testing this for his multi-tree
patches)?
On Wed, Jan 17, 2024 at 10:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> That's a long CC list for sure :)
> [...]
> I think the comparison to the swap cache may not be valid as the swap
> cache has many trees per swapfile, while zswap has a single tree.

Yes, good point. I think we can benchmark the xarray zswap vs the RB
tree zswap; that would be more of a direct comparison.

> > The first patch adds the xarray alongside the RB tree.
> > There is some debug check asserting the xarray agrees with
> > the RB tree results.
> >
> > The second patch removes the zwap RB tree.
>
> The breakdown looks like something that would be a development step,
> but for patch submission I think it makes more sense to have a single
> patch replacing the rbtree with an xarray.

I think it makes the review easier. The code being added and removed
does not have much overlap. Combining them into a single patch does not
save patch size. Having the assert check would be useful for bisecting
to narrow down which step caused a problem. I am fine with squashing it
into one patch as well.

> > I expect to merge the zswap rb tree spin lock with the xarray
> > lock in the follow up changes.
>
> Shouldn't this simply be changing uses of tree->lock to use
> xa_{lock/unlock}? We also need to make sure we don't try to lock the
> tree when operating on the xarray if the caller is already holding the
> lock, but this seems to be straightforward enough to be done as part
> of this patch or this series at least.
>
> Am I missing something?

Currently the zswap entry refcount is protected by the zswap tree spin
lock as well. We can't remove the tree spin lock without changing the
refcount code. I think the zswap entry search should just return the
entry with an atomic refcount increase done inside the RCU read section
or xarray lock. The previous zswap code did the find_and_get entry(),
which is closer to what I want.

Chris
Hi Yosry and Chris,

On 2024/1/18 14:39, Yosry Ahmed wrote:
> [...]
> Also, I assume we will only see performance improvements after the
> tree lock in its current form is removed so that we get loads
> protected only by RCU. Can we get some performance numbers to see how
> the latency improves with the xarray under contention (unless
> Chengming is already planning on testing this for his multi-tree
> patches).

I just gave it a try: the same test of a kernel build in tmpfs with the
zswap shrinker enabled, all based on the latest mm/mm-stable branch.

        mm-stable     zswap-split-tree   zswap-xarray
real    1m10.442s     1m4.157s           1m9.962s
user    17m48.232s    17m41.477s         17m45.887s
sys     8m13.517s     5m2.226s           7m59.305s

Looks like the contention from concurrency is still there; I haven't
looked into the code yet, will review it later.
On Wed, Jan 17, 2024 at 11:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jan 17, 2024 at 10:57 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> > [...]
> > I just give it a try, the same test of kernel build in tmpfs with zswap
> > shrinker enabled, all based on the latest mm/mm-stable branch.
> >
> > mm-stable zswap-split-tree zswap-xarray
> > real 1m10.442s 1m4.157s 1m9.962s
> > user 17m48.232s 17m41.477s 17m45.887s
> > sys 8m13.517s 5m2.226s 7m59.305s
> >
> > Looks like the contention of concurrency is still there, I haven't
> > look into the code yet, will review it later.

Thanks for the quick test. Interesting to see the sys usage drop for
the xarray case even with the spin lock. Not sure if the 13-second
saving is statistically significant or not.

We might need to have both the xarray and split trees for zswap. It is
likely that removing the spin lock alone wouldn't make up the 35%
difference. That is just my guess. There is only one way to find out.

BTW, do you have a script I can run to replicate your results?

> I think that's expected with the current version because the tree
> spin_lock is still there and we are still doing lookups with a
> spinlock.

Right.

Chris
On Wed, Jan 17, 2024 at 11:05 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> The name changes from Chris to Christopher are confusing :D
>
> > I think it makes the review easier. The code adding and removing does
> > not have much overlap. Combining it to a single patch does not save
> > patch size. Having the assert check would be useful for some bisecting
> > to narrow down which step causing the problem. I am fine with squash
> > it to one patch as well.
>
> I think having two patches is unnecessarily noisy, and we add some
> debug code in this patch that we remove in the next patch anyway.
> Let's see what others think, but personally I prefer a single patch.
>
> > > > I expect to merge the zswap rb tree spin lock with the xarray
> > > > lock in the follow up changes.
> > >
> > > Shouldn't this simply be changing uses of tree->lock to use
> > > xa_{lock/unlock}? [...]
> > >
> > > Am I missing something?
> >
> > Currently the zswap entry refcount is protected by the zswap tree spin
> > lock as well. Can't remove the tree spin lock without changing the
> > refcount code. I think the zswap search entry should just return the
> > entry with refcount atomic increase, inside the RCU read() or xarray
> > lock. The previous zswap code does the find_and_get entry() which is
> > closer to what I want.
>
> I think this can be done in an RCU read section surrounding xa_load()

xa_load() already takes the RCU read lock internally. If you do that,
you might just as well use the XAS API to work with the lock directly.

> and the refcount increment. Didn't look closely to check how much
> complexity this adds to manage refcounts with RCU, but I think there
> should be a lot of examples all around the kernel.

The complexity is not in adding the refcount inside xa_load(). It is in
the zswap code that calls zswap_search() and zswap_{insert,erase}(). As
far as I can tell, that code needs some tricky changes to go along with
the refcount change.

> IIUC, there are no performance benefits from this conversion until we
> remove the tree spinlock, right?

The original intent is to help the long tail case. The RB tree has
worse long tails than the xarray. I expect it will help the page fault
long tail even without removing the tree spinlock.

Chris
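Putting the two halves of this exchange together, the search path under discussion might look roughly like the following kernel-style sketch (illustrative only; `zswap_entry_tryget()` is a hypothetical helper, and the explicit RCU read section is shown because the reference must be taken before the section ends, even though `xa_load()` takes the RCU read lock internally):

```c
/* Illustrative sketch only; zswap_entry_tryget() is hypothetical. */
static struct zswap_entry *zswap_search_get(struct xarray *xa, pgoff_t offset)
{
	struct zswap_entry *entry;

	rcu_read_lock();
	entry = xa_load(xa, offset);
	/* The entry may be freed once we leave the RCU section, so take
	 * a reference now -- and fail the lookup if the refcount has
	 * already dropped to zero. */
	if (entry && !zswap_entry_tryget(entry))
		entry = NULL;
	rcu_read_unlock();
	return entry;
}
```

The caller then owns a reference and drops it with the usual put path; no tree-wide spin lock is taken on the read side.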
On 2024/1/18 15:19, Chris Li wrote:
> [...]
> Thanks for the quick test. Interesting to see the sys usage drop for
> the xarray case even with the spin lock.
> Not sure if the 13 second saving is statistically significant or not.
>
> We might need to have both xarray and split trees for the zswap. It is
> likely removing the spin lock wouldn't be able to make up the 35%
> difference. That is just my guess. There is only one way to find out.

Yes, I totally agree with this! IMHO, concurrent zswap_store paths still
have to contend for the xarray spinlock even though we would have
converted the rb-tree to the xarray structure at last. So I think we
should have both.

> BTW, do you have a script I can run to replicate your results?

```
#!/bin/bash

testname="build-kernel-tmpfs"
cgroup="/sys/fs/cgroup/$testname"

tmpdir="/tmp/vm-scalability-tmp"
workdir="$tmpdir/$testname"

memory_max="$((2 * 1024 * 1024 * 1024))"

linux_src="/root/zcm/linux-6.6.tar.xz"
NR_TASK=32

swapon ~/zcm/swapfile
echo 60 > /proc/sys/vm/swappiness

echo zsmalloc > /sys/module/zswap/parameters/zpool
echo lz4 > /sys/module/zswap/parameters/compressor
echo 1 > /sys/module/zswap/parameters/shrinker_enabled
echo 1 > /sys/module/zswap/parameters/enabled

if ! [ -d $tmpdir ]; then
	mkdir -p $tmpdir
	mount -t tmpfs -o size=100% nodev $tmpdir
fi

mkdir -p $cgroup
echo $memory_max > $cgroup/memory.max
echo $$ > $cgroup/cgroup.procs

rm -rf $workdir
mkdir -p $workdir
cd $workdir

tar xvf $linux_src
cd linux-6.6
make -j$NR_TASK clean
make defconfig
time make -j$NR_TASK
```
* Christopher Li <chrisl@kernel.org> [240118 01:48]:
> [...]
> I think it makes the review easier. The code adding and removing does
> not have much overlap. Combining it to a single patch does not save
> patch size. Having the assert check would be useful for some bisecting
> to narrow down which step causing the problem. I am fine with squash
> it to one patch as well.

I had thought similar when I replaced the rbtree with the maple tree in
the VMA space. That conversion was more involved and I wanted to detect
if there was ever any difference, and where I had made the error in the
multiple patch conversion.

This became rather painful once an issue was found, as then anyone
bisecting other issues could hit this difference and either blamed the
commit pointing at the BUG_ON() or gave up (I don't blame them for
giving up, I would). With only two commits, it may be easier for people
to see a fixed tag pointing to the same commit that bisect found (if
they check), but it proved an issue with my multiple patch conversion.

You may not experience this issue with the users of the zswap, but I
plan to avoid doing this again in the future. At least a WARN_ON_ONCE()
and a comment might help?

Thanks,
Liam
On Wed, Jan 17, 2024 at 11:35 PM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> >>>            mm-stable   zswap-split-tree  zswap-xarray
> >>> real       1m10.442s   1m4.157s          1m9.962s
> >>> user       17m48.232s  17m41.477s        17m45.887s
> >>> sys        8m13.517s   5m2.226s          7m59.305s
> >>>
> >>> Looks like the contention of concurrency is still there, I haven't
> >>> look into the code yet, will review it later.
> >
> > Thanks for the quick test. Interesting to see the sys usage drop for
> > the xarray case even with the spin lock.
> > Not sure if the 13 second saving is statistically significant or not.
> >
> > We might need to have both xarray and split trees for the zswap. It is
> > likely removing the spin lock wouldn't be able to make up the 35%
> > difference. That is just my guess. There is only one way to find out.
>
> Yes, I totally agree with this! IMHO, concurrent zswap_store paths still
> have to contend for the xarray spinlock even though we would have converted
> the rb-tree to the xarray structure at last. So I think we should have both.
>
> >
> > BTW, do you have a script I can run to replicate your results?

Hi Chengming,

Thanks for your script.

> ```
> #!/bin/bash
>
> testname="build-kernel-tmpfs"
> cgroup="/sys/fs/cgroup/$testname"
>
> tmpdir="/tmp/vm-scalability-tmp"
> workdir="$tmpdir/$testname"
>
> memory_max="$((2 * 1024 * 1024 * 1024))"
>
> linux_src="/root/zcm/linux-6.6.tar.xz"
> NR_TASK=32
>
> swapon ~/zcm/swapfile

How big is your swapfile here?

It seems you have only one swapfile there. That can explain the contention.
Have you tried multiple swapfiles for the same test?
That should reduce the contention without using your patch.

Chris

> echo 60 > /proc/sys/vm/swappiness
>
> echo zsmalloc > /sys/module/zswap/parameters/zpool
> echo lz4 > /sys/module/zswap/parameters/compressor
> echo 1 > /sys/module/zswap/parameters/shrinker_enabled
> echo 1 > /sys/module/zswap/parameters/enabled
>
> if ! [ -d $tmpdir ]; then
>         mkdir -p $tmpdir
>         mount -t tmpfs -o size=100% nodev $tmpdir
> fi
>
> mkdir -p $cgroup
> echo $memory_max > $cgroup/memory.max
> echo $$ > $cgroup/cgroup.procs
>
> rm -rf $workdir
> mkdir -p $workdir
> cd $workdir
>
> tar xvf $linux_src
> cd linux-6.6
> make -j$NR_TASK clean
> make defconfig
> time make -j$NR_TASK
> ```
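Chris's suggestion above — spreading the swap load over multiple swapfiles so each gets its own tree and lock — can be sketched as a toy model. This is plain Python, not kernel code; `SwapDevice` and `zswap_store` are illustrative stand-ins for the per-device zswap tree and its spinlock:

```python
import threading

# Toy model (not kernel code): each "swapfile" gets its own store and
# lock, analogous to one zswap tree / swap_info_struct lock per device.
class SwapDevice:
    def __init__(self):
        self.lock = threading.Lock()
        self.store = {}   # offset -> compressed page (stand-in)

def zswap_store(devices, swp_type, offset, page):
    dev = devices[swp_type]          # pick the owning device's tree
    with dev.lock:                   # contention is per device, not global
        dev.store[offset] = page

# With one swapfile every store serializes on the same lock; with four,
# concurrent stores to different devices never contend with each other.
devices = [SwapDevice() for _ in range(4)]
for i in range(100):
    zswap_store(devices, i % 4, i, f"page-{i}")

per_dev = [len(d.store) for d in devices]
print(per_dev)  # -> [25, 25, 25, 25]
```

The same partitioning idea underlies the zswap-split-tree series: even within one swapfile, hashing entries across several trees shrinks each lock's critical population.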
On Thu, Jan 18, 2024 at 11:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> > > > The first patch adds the xarray alongside the RB tree.
> > > > There is some debug check asserting the xarray agrees with
> > > > the RB tree results.
> > > >
> > > > The second patch removes the zswap RB tree.
> > >
> > > The breakdown looks like something that would be a development step,
> > > but for patch submission I think it makes more sense to have a single
> > > patch replacing the rbtree with an xarray.
> >
> > I think it makes the review easier. The code adding and removing does
> > not have much overlap. Combining it to a single patch does not save
> > patch size. Having the assert check would be useful for some bisecting
> > to narrow down which step causing the problem. I am fine with squash
> > it to one patch as well.
>
> I had thought similar when I replaced the rbtree with the maple tree in
> the VMA space. That conversion was more involved and I wanted to detect
> if there was ever any difference, and where I had made the error in the
> multiple patch conversion.
>
> This became rather painful once an issue was found, as then anyone
> bisecting other issues could hit this difference and either blamed the
> commit pointing at the BUG_ON() or gave up (I don't blame them for
> giving up, I would). With only two commits, it may be easier for people
> to see a fixed tag pointing to the same commit that bisect found (if
> they check), but it proved an issue with my multiple patch conversion.

Thanks for sharing your experience. That debug assert did help me catch
issues on my own internal version after rebasing to the latest mm tree.
If the user can't do the bisect, then I agree we don't need the assert
in the official version. I can always bisect on my own internal version.

> You may not experience this issue with the users of the zswap, but I
> plan to avoid doing this again in the future. At least a WARN_ON_ONCE()
> and a comment might help?

Sure, I might just merge the two patches. I don't have the BUG_ON()
any more.

Chris
On 2024/1/19 18:26, Chris Li wrote:
> On Thu, Jan 18, 2024 at 10:19 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2024/1/19 12:59, Chris Li wrote:
>>> On Wed, Jan 17, 2024 at 11:35 PM Chengming Zhou
>>> <zhouchengming@bytedance.com> wrote:
>>>
>>>>>>>            mm-stable   zswap-split-tree  zswap-xarray
>>>>>>> real       1m10.442s   1m4.157s          1m9.962s
>>>>>>> user       17m48.232s  17m41.477s        17m45.887s
>>>>>>> sys        8m13.517s   5m2.226s          7m59.305s
>>>>>>>
>>>>>>> Looks like the contention of concurrency is still there, I haven't
>>>>>>> look into the code yet, will review it later.
>>>>>
>>>>> Thanks for the quick test. Interesting to see the sys usage drop for
>>>>> the xarray case even with the spin lock.
>>>>> Not sure if the 13 second saving is statistically significant or not.
>>>>>
>>>>> We might need to have both xarray and split trees for the zswap. It is
>>>>> likely removing the spin lock wouldn't be able to make up the 35%
>>>>> difference. That is just my guess. There is only one way to find out.
>>>>
>>>> Yes, I totally agree with this! IMHO, concurrent zswap_store paths still
>>>> have to contend for the xarray spinlock even though we would have converted
>>>> the rb-tree to the xarray structure at last. So I think we should have both.
>>>>
>>>>>
>>>>> BTW, do you have a script I can run to replicate your results?
>>>
>>> Hi Chengming,
>>>
>>> Thanks for your script.
>>>
>>>>
>>>> ```
>>>> #!/bin/bash
>>>>
>>>> testname="build-kernel-tmpfs"
>>>> cgroup="/sys/fs/cgroup/$testname"
>>>>
>>>> tmpdir="/tmp/vm-scalability-tmp"
>>>> workdir="$tmpdir/$testname"
>>>>
>>>> memory_max="$((2 * 1024 * 1024 * 1024))"
>>>>
>>>> linux_src="/root/zcm/linux-6.6.tar.xz"
>>>> NR_TASK=32
>>>>
>>>> swapon ~/zcm/swapfile
>>>
>>> How big is your swapfile here?
>>
>> The swapfile is big enough here, I use a 50GB swapfile.
>
> Thanks,
>
>>
>>>
>>> It seems you have only one swapfile there. That can explain the contention.
>>> Have you tried multiple swapfiles for the same test?
>>> That should reduce the contention without using your patch.
>>
>> Do you mean to have many 64MB swapfiles to swapon at the same time?
>
> 64MB is too small. There are limits to MAX_SWAPFILES. It is less than
> (32 - n) swap files.
> If you want to use 50G swap space, you can have MAX_SWAPFILES, each
> swapfile 50GB / MAX_SWAPFILES.

Right.

>
>> Maybe it's feasible to test,
>
> Of course it is testable, I am curious to see the test results.
>
>> I'm not sure how swapout will choose.
>
> It will rotate through the same priority swap files first.
> swapfile.c: get_swap_pages().
>
>> But in our usecase, we normally have only one swapfile.
>
> Is there a good reason why you can't use more than one swapfile?

I think not, but it seems an unneeded change/burden for our admins. So I
just tested and optimized for the normal case.

> One swapfile will not take the full advantage of the existing code.
> Even if you split the zswap trees within a swapfile. With only one
> swapfile, you will still be having lock contention on "(struct
> swap_info_struct).lock".
> It is one lock per swapfile.
> Using more than one swap file should get you better results.

IIUC, we already have the per-cpu swap entry cache so we don't contend for
this lock? And I don't see this lock being very hot in the testing.

Thanks.
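The rotation behavior Chris points at — get_swap_pages() cycling through swap devices of equal priority — can be modeled in a few lines. This is a simulation under simplified assumptions (no full-device or plist handling); `SwapFile` and the rotor are illustrative, not the kernel's implementation:

```python
# Toy model of equal-priority swap rotation: the highest-priority tier
# wins, and ties within the tier are served round-robin, so N
# equal-priority swapfiles spread allocations -- and their per-device
# locks -- across all N devices.
class SwapFile:
    def __init__(self, name, prio):
        self.name, self.prio, self.used = name, prio, 0

def get_swap_page(swapfiles, rotor):
    top = max(sf.prio for sf in swapfiles)
    candidates = [sf for sf in swapfiles if sf.prio == top]
    sf = candidates[rotor[0] % len(candidates)]  # rotate within the tier
    rotor[0] += 1
    sf.used += 1
    return sf.name

swapfiles = [SwapFile(f"swap{i}", prio=-2) for i in range(4)]
rotor = [0]
allocs = [get_swap_page(swapfiles, rotor) for _ in range(8)]
print(allocs)                         # -> rotates swap0..swap3 twice
print([sf.used for sf in swapfiles])  # -> [2, 2, 2, 2]
```

This is why multiple same-priority swapfiles dilute contention on the per-device `swap_info_struct` lock without any zswap change, though as Chengming notes the per-cpu swap entry cache already batches allocations and softens that lock in practice.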