Message ID | 20221026200613.1031261-3-nphamcs@gmail.com |
---|---|
State | New |
Series | Implement writeback for zsmalloc |
Commit Message
Nhat Pham
Oct. 26, 2022, 8:06 p.m. UTC
Currently, zsmalloc has a hierarchy of locks, which includes a
pool-level migrate_lock, and a lock for each size class. We have to
obtain both locks in the hotpath in most cases anyway, except for
zs_malloc. This exception will no longer exist when we introduce an LRU
into the zs_pool for the new writeback functionality - we will need to
obtain a pool-level lock to synchronize LRU handling even in zs_malloc.

In preparation for zsmalloc writeback, consolidate these locks into a
single pool-level lock, which drastically reduces the complexity of
synchronization in zsmalloc.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/zsmalloc.c | 87 ++++++++++++++++++++++-----------------------------
1 file changed, 37 insertions(+), 50 deletions(-)
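Taken together, the patch replaces a pool-level rwlock plus a spinlock per size class with one pool-level spinlock. Below is a minimal sketch of what that means for a hot-path operation; the helper names (handle_to_zspage(), zspage_class(), obj_free_locked()) are illustrative assumptions, not the literal kernel symbols:

/* Before: two-level locking (sketch under the assumptions above). */
static void zs_free_before(struct zs_pool *pool, unsigned long handle)
{
        struct zspage *zspage;
        struct size_class *class;

        /* The pool rwlock pins the zspage against migration... */
        read_lock(&pool->migrate_lock);
        zspage = handle_to_zspage(handle);
        class = zspage_class(pool, zspage);

        /* ...and the per-class spinlock serializes the free itself. */
        spin_lock(&class->lock);
        obj_free_locked(class, handle);
        spin_unlock(&class->lock);
        read_unlock(&pool->migrate_lock);
}

/* After: a single pool-level spinlock covers both concerns. */
static void zs_free_after(struct zs_pool *pool, unsigned long handle)
{
        spin_lock(&pool->lock);
        obj_free_locked(zspage_class(pool, handle_to_zspage(handle)),
                        handle);
        spin_unlock(&pool->lock);
}

Since the pool rwlock already had to be taken on these paths, folding the class lock into it removes a lock/unlock pair and a lock-ordering rule; the trade-off, debated in the comments below, is that all size classes now contend on one lock.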
Comments
On Wed, Oct 26, 2022 at 01:06:10PM -0700, Nhat Pham wrote:
> Currently, zsmalloc has a hierarchy of locks, which includes a
> pool-level migrate_lock, and a lock for each size class. We have to
> obtain both locks in the hotpath in most cases anyway, except for
> zs_malloc. This exception will no longer exist when we introduce an LRU
> into the zs_pool for the new writeback functionality - we will need to
> obtain a pool-level lock to synchronize LRU handling even in zs_malloc.
>
> In preparation for zsmalloc writeback, consolidate these locks into a
> single pool-level lock, which drastically reduces the complexity of
> synchronization in zsmalloc.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On (22/10/26 13:06), Nhat Pham wrote:
>  struct size_class {
> -	spinlock_t lock;
>  	struct list_head fullness_list[NR_ZS_FULLNESS];
>  	/*
>  	 * Size of objects stored in this class. Must be multiple
> @@ -247,8 +245,7 @@ struct zs_pool {
>  #ifdef CONFIG_COMPACTION
>  	struct work_struct free_work;
>  #endif
> -	/* protect page/zspage migration */
> -	rwlock_t migrate_lock;
> +	spinlock_t lock;
>  };

I'm not in love with this, to be honest. One big pool lock instead
of 255 per-class locks doesn't look attractive, as one big pool lock
is going to be hammered quite a lot when zram is used, e.g. as a regular
block device with a file system and is under heavy parallel writes/reads.
On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote: > On (22/10/26 13:06), Nhat Pham wrote: > > struct size_class { > > - spinlock_t lock; > > struct list_head fullness_list[NR_ZS_FULLNESS]; > > /* > > * Size of objects stored in this class. Must be multiple > > @@ -247,8 +245,7 @@ struct zs_pool { > > #ifdef CONFIG_COMPACTION > > struct work_struct free_work; > > #endif > > - /* protect page/zspage migration */ > > - rwlock_t migrate_lock; > > + spinlock_t lock; > > }; > > I'm not in love with this, to be honest. One big pool lock instead > of 255 per-class locks doesn't look attractive, as one big pool lock > is going to be hammered quite a lot when zram is used, e.g. as a regular > block device with a file system and is under heavy parallel writes/reads. > I agree with Sergey. I am also worry about that LRU stuff should be part of allocator instead of higher level.
On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote: > On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote: > > On (22/10/26 13:06), Nhat Pham wrote: > > > struct size_class { > > > - spinlock_t lock; > > > struct list_head fullness_list[NR_ZS_FULLNESS]; > > > /* > > > * Size of objects stored in this class. Must be multiple > > > @@ -247,8 +245,7 @@ struct zs_pool { > > > #ifdef CONFIG_COMPACTION > > > struct work_struct free_work; > > > #endif > > > - /* protect page/zspage migration */ > > > - rwlock_t migrate_lock; > > > + spinlock_t lock; > > > }; > > > > I'm not in love with this, to be honest. One big pool lock instead > > of 255 per-class locks doesn't look attractive, as one big pool lock > > is going to be hammered quite a lot when zram is used, e.g. as a regular > > block device with a file system and is under heavy parallel writes/reads. TBH the class always struck me as an odd scope to split the lock. Lock contention depends on how variable the compression rate is of the hottest incoming data, which is unpredictable from a user POV. My understanding is that the primary usecase for zram is swapping, and the pool lock is the same granularity as the swap locking. Regardless, we'll do some benchmarks with filesystems to understand what a reasonable tradeoff would be between overhead and complexity. Do you have a particular one in mind? (I'm thinking journaled ones are not of much interest, since their IO tends to be fairly serialized.) btrfs? > I am also worry about that LRU stuff should be part of allocator > instead of higher level. I'm sorry, but that's not a reasonable objection. These patches implement a core feature of being a zswap backend, using standard LRU and locking techniques established by the other backends. I don't disagree that it would nicer if zswap had a strong abstraction for backend pages and a generalized LRU. But that is major surgery on a codebase of over 6,500 lines. It's not a reasonable ask to change all that first before implementing a basic feature that's useful now. I get that your main interest is zram, and so this feature isn't of interest to you. But zram isn't the only user, nor is it the primary user, of zsmalloc.
On Thu, Nov 03, 2022 at 11:18:04AM -0400, Johannes Weiner wrote: > On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote: > > On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote: > > > On (22/10/26 13:06), Nhat Pham wrote: > > > > struct size_class { > > > > - spinlock_t lock; > > > > struct list_head fullness_list[NR_ZS_FULLNESS]; > > > > /* > > > > * Size of objects stored in this class. Must be multiple > > > > @@ -247,8 +245,7 @@ struct zs_pool { > > > > #ifdef CONFIG_COMPACTION > > > > struct work_struct free_work; > > > > #endif > > > > - /* protect page/zspage migration */ > > > > - rwlock_t migrate_lock; > > > > + spinlock_t lock; > > > > }; > > > > > > I'm not in love with this, to be honest. One big pool lock instead > > > of 255 per-class locks doesn't look attractive, as one big pool lock > > > is going to be hammered quite a lot when zram is used, e.g. as a regular > > > block device with a file system and is under heavy parallel writes/reads. > > TBH the class always struck me as an odd scope to split the lock. Lock > contention depends on how variable the compression rate is of the > hottest incoming data, which is unpredictable from a user POV. > > My understanding is that the primary usecase for zram is swapping, and > the pool lock is the same granularity as the swap locking. People uses the zram to store caching object files in build server. > > Regardless, we'll do some benchmarks with filesystems to understand > what a reasonable tradeoff would be between overhead and complexity. Thanks. > Do you have a particular one in mind? (I'm thinking journaled ones are > not of much interest, since their IO tends to be fairly serialized.) > > btrfs? I am not sure what FSes others are using but at least for me, just plain ext4. > > > I am also worry about that LRU stuff should be part of allocator > > instead of higher level. > > I'm sorry, but that's not a reasonable objection. > > These patches implement a core feature of being a zswap backend, using > standard LRU and locking techniques established by the other backends. > > I don't disagree that it would nicer if zswap had a strong abstraction > for backend pages and a generalized LRU. But that is major surgery on > a codebase of over 6,500 lines. It's not a reasonable ask to change > all that first before implementing a basic feature that's useful now. With same logic, folks added the LRU logic into their allocators without the effort considering moving the LRU into upper layer. And then trend is still going on since I have seen multiple times people are trying to add more allocators. So if it's not a reasonable ask to consier, we couldn't stop the trend in the end. > > I get that your main interest is zram, and so this feature isn't of > interest to you. But zram isn't the only user, nor is it the primary I am interest to the feature but my interest is more of general swap layer to manage the LRU so that it could support any hierarchy among swap devices, not only zswap.
On Thu, Nov 03, 2022 at 08:53:29AM -0700, Minchan Kim wrote: > On Thu, Nov 03, 2022 at 11:18:04AM -0400, Johannes Weiner wrote: > > On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote: > > > On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote: > > > > On (22/10/26 13:06), Nhat Pham wrote: > > > > > struct size_class { > > > > > - spinlock_t lock; > > > > > struct list_head fullness_list[NR_ZS_FULLNESS]; > > > > > /* > > > > > * Size of objects stored in this class. Must be multiple > > > > > @@ -247,8 +245,7 @@ struct zs_pool { > > > > > #ifdef CONFIG_COMPACTION > > > > > struct work_struct free_work; > > > > > #endif > > > > > - /* protect page/zspage migration */ > > > > > - rwlock_t migrate_lock; > > > > > + spinlock_t lock; > > > > > }; > > > > > > > > I'm not in love with this, to be honest. One big pool lock instead > > > > of 255 per-class locks doesn't look attractive, as one big pool lock > > > > is going to be hammered quite a lot when zram is used, e.g. as a regular > > > > block device with a file system and is under heavy parallel writes/reads. > > > > TBH the class always struck me as an odd scope to split the lock. Lock > > contention depends on how variable the compression rate is of the > > hottest incoming data, which is unpredictable from a user POV. > > > > My understanding is that the primary usecase for zram is swapping, and > > the pool lock is the same granularity as the swap locking. > > People uses the zram to store caching object files in build server. Oh, interesting. We can try with a kernel build directory on zram. > > Do you have a particular one in mind? (I'm thinking journaled ones are > > not of much interest, since their IO tends to be fairly serialized.) > > > > btrfs? > > I am not sure what FSes others are using but at least for me, just > plain ext4. Okay, we can test with both. > > > I am also worry about that LRU stuff should be part of allocator > > > instead of higher level. > > > > I'm sorry, but that's not a reasonable objection. > > > > These patches implement a core feature of being a zswap backend, using > > standard LRU and locking techniques established by the other backends. > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > for backend pages and a generalized LRU. But that is major surgery on > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > all that first before implementing a basic feature that's useful now. > > With same logic, folks added the LRU logic into their allocators > without the effort considering moving the LRU into upper layer. > > And then trend is still going on since I have seen multiple times > people are trying to add more allocators. So if it's not a reasonable > ask to consier, we couldn't stop the trend in the end. So there is actually an ongoing effort to do that. Yosry and I have spent quite some time on coming up with an LRU design that's independent from compression policy over email and at Plumbers. My concern is more about the order of doing things: 1. The missing writeback support is a gaping hole in zsmalloc, which affects production systems. A generalized LRU list is a good idea, but it's a huge task that from a user pov really is not critical. Even from a kernel dev / maintainer POV, there are bigger fish to fry in the zswap code base and the backends than this. 2. Refactoring existing functionality is much easier than writing generalized code that simultaneously enables new behavior. 
zsmalloc is the most complex of our backends. To make its LRU writeback work we had to patch zswap's ->map ordering to accomodate it, e.g. Such tricky changes are easier to make and test incrementally. The generalized LRU project will hugely benefit from already having a proven writeback implementation in zsmalloc, because then all the requirements in zswap and zsmalloc will be in black and white. > > I get that your main interest is zram, and so this feature isn't of > > interest to you. But zram isn't the only user, nor is it the primary > > I am interest to the feature but my interest is more of general swap > layer to manage the LRU so that it could support any hierarchy among > swap devices, not only zswap. I think we're on the same page about the longer term goals.
On Thu, Nov 3, 2022 at 11:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Nov 03, 2022 at 08:53:29AM -0700, Minchan Kim wrote: > > On Thu, Nov 03, 2022 at 11:18:04AM -0400, Johannes Weiner wrote: > > > On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote: > > > > On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote: > > > > > On (22/10/26 13:06), Nhat Pham wrote: > > > > > > struct size_class { > > > > > > - spinlock_t lock; > > > > > > struct list_head fullness_list[NR_ZS_FULLNESS]; > > > > > > /* > > > > > > * Size of objects stored in this class. Must be multiple > > > > > > @@ -247,8 +245,7 @@ struct zs_pool { > > > > > > #ifdef CONFIG_COMPACTION > > > > > > struct work_struct free_work; > > > > > > #endif > > > > > > - /* protect page/zspage migration */ > > > > > > - rwlock_t migrate_lock; > > > > > > + spinlock_t lock; > > > > > > }; > > > > > > > > > > I'm not in love with this, to be honest. One big pool lock instead > > > > > of 255 per-class locks doesn't look attractive, as one big pool lock > > > > > is going to be hammered quite a lot when zram is used, e.g. as a regular > > > > > block device with a file system and is under heavy parallel writes/reads. > > > > > > TBH the class always struck me as an odd scope to split the lock. Lock > > > contention depends on how variable the compression rate is of the > > > hottest incoming data, which is unpredictable from a user POV. > > > > > > My understanding is that the primary usecase for zram is swapping, and > > > the pool lock is the same granularity as the swap locking. > > > > People uses the zram to store caching object files in build server. > > Oh, interesting. We can try with a kernel build directory on zram. > > > > Do you have a particular one in mind? (I'm thinking journaled ones are > > > not of much interest, since their IO tends to be fairly serialized.) > > > > > > btrfs? > > > > I am not sure what FSes others are using but at least for me, just > > plain ext4. > > Okay, we can test with both. > > > > > I am also worry about that LRU stuff should be part of allocator > > > > instead of higher level. > > > > > > I'm sorry, but that's not a reasonable objection. > > > > > > These patches implement a core feature of being a zswap backend, using > > > standard LRU and locking techniques established by the other backends. > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > > for backend pages and a generalized LRU. But that is major surgery on > > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > > all that first before implementing a basic feature that's useful now. > > > > With same logic, folks added the LRU logic into their allocators > > without the effort considering moving the LRU into upper layer. > > > > And then trend is still going on since I have seen multiple times > > people are trying to add more allocators. So if it's not a reasonable > > ask to consier, we couldn't stop the trend in the end. > > So there is actually an ongoing effort to do that. Yosry and I have > spent quite some time on coming up with an LRU design that's > independent from compression policy over email and at Plumbers. > > My concern is more about the order of doing things: > > 1. The missing writeback support is a gaping hole in zsmalloc, which > affects production systems. A generalized LRU list is a good idea, > but it's a huge task that from a user pov really is not > critical. 
Even from a kernel dev / maintainer POV, there are bigger > fish to fry in the zswap code base and the backends than this. > > 2. Refactoring existing functionality is much easier than writing > generalized code that simultaneously enables new behavior. zsmalloc > is the most complex of our backends. To make its LRU writeback work > we had to patch zswap's ->map ordering to accomodate it, e.g. Such > tricky changes are easier to make and test incrementally. > > The generalized LRU project will hugely benefit from already having > a proven writeback implementation in zsmalloc, because then all the > requirements in zswap and zsmalloc will be in black and white. > > > > I get that your main interest is zram, and so this feature isn't of > > > interest to you. But zram isn't the only user, nor is it the primary > > > > I am interest to the feature but my interest is more of general swap > > layer to manage the LRU so that it could support any hierarchy among > > swap devices, not only zswap. > > I think we're on the same page about the longer term goals. > Yeah. As Johannes said, I was also recently looking into this. This can also help solve other problems than consolidating implementations. Currently if zswap rejects a page, it goes into swap, which is more-or-less a violation of page LRUs since hotter pages that are more recently reclaimed end up in swap (slow), while colder pages that were reclaimed before are in zswap. Having a separate layer managing the LRU of swap pages can also make sure this doesn't happen. More broadly, making zswap a separate layer from swap enables other improvements such as using zswap regardless of the presence of a backend swapfile and not consuming space in swapfiles if a page is in zswap. Of course, this is a much larger surgery. I am intending to spend more time looking further into this, but other things keep popping up :)
On Thu, Nov 03, 2022 at 02:08:01PM -0400, Johannes Weiner wrote:
< snip >
> > > > I am also worry about that LRU stuff should be part of allocator
> > > > instead of higher level.
> > >
> > > I'm sorry, but that's not a reasonable objection.
> > >
> > > These patches implement a core feature of being a zswap backend, using
> > > standard LRU and locking techniques established by the other backends.
> > >
> > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > for backend pages and a generalized LRU. But that is major surgery on
> > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > all that first before implementing a basic feature that's useful now.
> >
> > With same logic, folks added the LRU logic into their allocators
> > without the effort considering moving the LRU into upper layer.
> >
> > And then trend is still going on since I have seen multiple times
> > people are trying to add more allocators. So if it's not a reasonable
> > ask to consier, we couldn't stop the trend in the end.
>
> So there is actually an ongoing effort to do that. Yosry and I have
> spent quite some time on coming up with an LRU design that's
> independent from compression policy over email and at Plumbers.

I am really glad to hear somebody is working toward the right direction.

>
> My concern is more about the order of doing things:
>
> 1. The missing writeback support is a gaping hole in zsmalloc, which
>    affects production systems. A generalized LRU list is a good idea,
>    but it's a huge task that from a user pov really is not
>    critical. Even from a kernel dev / maintainer POV, there are bigger
>    fish to fry in the zswap code base and the backends than this.

Even though I believe a general LRU in the swap subsystem is the way to
go, I was about to suggest putting the LRU logic in the zswap layer to
stop this trend, since at first glance it doesn't look too difficult
(sure, I might be missing something there). However, if you guys are
working toward the generalized direction, I am totally in favor of the
approach, and I look forward to the project, with the expectation that
we will clean up all the duplicated logic, fix the weird layering, and
finally support hierarchical swap writeback for any combination of
swap devices.

>
> 2. Refactoring existing functionality is much easier than writing
>    generalized code that simultaneously enables new behavior. zsmalloc
>    is the most complex of our backends. To make its LRU writeback work
>    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
>    tricky changes are easier to make and test incrementally.
>
>    The generalized LRU project will hugely benefit from already having
>    a proven writeback implementation in zsmalloc, because then all the
>    requirements in zswap and zsmalloc will be in black and white.

Agreed, as long as we are working in the right direction; this work can
help fill the gap until we can see all the requirements and achieve
them.

> > > I get that your main interest is zram, and so this feature isn't of
> > > interest to you. But zram isn't the only user, nor is it the primary
> >
> > I am interest to the feature but my interest is more of general swap
> > layer to manage the LRU so that it could support any hierarchy among
> > swap devices, not only zswap.
>
> I think we're on the same page about the longer term goals.

Fingers crossed.
On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote: < snip > > > > > > I am also worry about that LRU stuff should be part of allocator > > > > > instead of higher level. > > > > > > > > I'm sorry, but that's not a reasonable objection. > > > > > > > > These patches implement a core feature of being a zswap backend, using > > > > standard LRU and locking techniques established by the other backends. > > > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > > > for backend pages and a generalized LRU. But that is major surgery on > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > > > all that first before implementing a basic feature that's useful now. > > > > > > With same logic, folks added the LRU logic into their allocators > > > without the effort considering moving the LRU into upper layer. > > > > > > And then trend is still going on since I have seen multiple times > > > people are trying to add more allocators. So if it's not a reasonable > > > ask to consier, we couldn't stop the trend in the end. > > > > So there is actually an ongoing effort to do that. Yosry and I have > > spent quite some time on coming up with an LRU design that's > > independent from compression policy over email and at Plumbers. > > > > My concern is more about the order of doing things: > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which > > affects production systems. A generalized LRU list is a good idea, > > but it's a huge task that from a user pov really is not > > critical. Even from a kernel dev / maintainer POV, there are bigger > > fish to fry in the zswap code base and the backends than this. > > > > 2. Refactoring existing functionality is much easier than writing > > generalized code that simultaneously enables new behavior. zsmalloc > > is the most complex of our backends. To make its LRU writeback work > > we had to patch zswap's ->map ordering to accomodate it, e.g. Such > > tricky changes are easier to make and test incrementally. > > > > The generalized LRU project will hugely benefit from already having > > a proven writeback implementation in zsmalloc, because then all the > > requirements in zswap and zsmalloc will be in black and white. > > > > > > I get that your main interest is zram, and so this feature isn't of > > > > interest to you. But zram isn't the only user, nor is it the primary > > > > > > I am interest to the feature but my interest is more of general swap > > > layer to manage the LRU so that it could support any hierarchy among > > > swap devices, not only zswap. > > > > I think we're on the same page about the longer term goals. > > > > Yeah. As Johannes said, I was also recently looking into this. This > can also help solve other problems than consolidating implementations. > Currently if zswap rejects a page, it goes into swap, which is > more-or-less a violation of page LRUs since hotter pages that are more > recently reclaimed end up in swap (slow), while colder pages that were > reclaimed before are in zswap. Having a separate layer managing the > LRU of swap pages can also make sure this doesn't happen. True. > > More broadly, making zswap a separate layer from swap enables other > improvements such as using zswap regardless of the presence of a > backend swapfile and not consuming space in swapfiles if a page is in > zswap. Of course, this is a much larger surgery. 
If we could decouple the LRU writeback from zswap and supports compression without backing swapfile, sounds like becoming more of zram. ;-) > > I am intending to spend more time looking further into this, but other > things keep popping up :) Same with me. Thanks for looking it, Yosry!
On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote: > > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote: > < snip > > > > > > > > I am also worry about that LRU stuff should be part of allocator > > > > > > instead of higher level. > > > > > > > > > > I'm sorry, but that's not a reasonable objection. > > > > > > > > > > These patches implement a core feature of being a zswap backend, using > > > > > standard LRU and locking techniques established by the other backends. > > > > > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > > > > for backend pages and a generalized LRU. But that is major surgery on > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > > > > all that first before implementing a basic feature that's useful now. > > > > > > > > With same logic, folks added the LRU logic into their allocators > > > > without the effort considering moving the LRU into upper layer. > > > > > > > > And then trend is still going on since I have seen multiple times > > > > people are trying to add more allocators. So if it's not a reasonable > > > > ask to consier, we couldn't stop the trend in the end. > > > > > > So there is actually an ongoing effort to do that. Yosry and I have > > > spent quite some time on coming up with an LRU design that's > > > independent from compression policy over email and at Plumbers. > > > > > > My concern is more about the order of doing things: > > > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which > > > affects production systems. A generalized LRU list is a good idea, > > > but it's a huge task that from a user pov really is not > > > critical. Even from a kernel dev / maintainer POV, there are bigger > > > fish to fry in the zswap code base and the backends than this. > > > > > > 2. Refactoring existing functionality is much easier than writing > > > generalized code that simultaneously enables new behavior. zsmalloc > > > is the most complex of our backends. To make its LRU writeback work > > > we had to patch zswap's ->map ordering to accomodate it, e.g. Such > > > tricky changes are easier to make and test incrementally. > > > > > > The generalized LRU project will hugely benefit from already having > > > a proven writeback implementation in zsmalloc, because then all the > > > requirements in zswap and zsmalloc will be in black and white. > > > > > > > > I get that your main interest is zram, and so this feature isn't of > > > > > interest to you. But zram isn't the only user, nor is it the primary > > > > > > > > I am interest to the feature but my interest is more of general swap > > > > layer to manage the LRU so that it could support any hierarchy among > > > > swap devices, not only zswap. > > > > > > I think we're on the same page about the longer term goals. > > > > > > > Yeah. As Johannes said, I was also recently looking into this. This > > can also help solve other problems than consolidating implementations. > > Currently if zswap rejects a page, it goes into swap, which is > > more-or-less a violation of page LRUs since hotter pages that are more > > recently reclaimed end up in swap (slow), while colder pages that were > > reclaimed before are in zswap. Having a separate layer managing the > > LRU of swap pages can also make sure this doesn't happen. > > True. 
> > > > > More broadly, making zswap a separate layer from swap enables other > > improvements such as using zswap regardless of the presence of a > > backend swapfile and not consuming space in swapfiles if a page is in > > zswap. Of course, this is a much larger surgery. > > If we could decouple the LRU writeback from zswap and supports > compression without backing swapfile, sounds like becoming more of > zram. ;-) That's a little bit grey. Maybe we can consolidate them one day :) We have been using zswap without swapfile at Google for a while, this gives us the ability to reject pages that do not compress well enough for us, which I suspect zram would not support given that it is designed to be the final destination of the page. Also, having the same configuration and code running on machines whether or not they have a swapfile is nice, otherwise one would need to use zram if there is no swapfile and switch to zswap if there is one. > > > > > I am intending to spend more time looking further into this, but other > > things keep popping up :) > > Same with me. Thanks for looking it, Yosry!
On Thu, Nov 3, 2022 at 2:47 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote: > > > > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote: > > < snip > > > > > > > > > > I am also worry about that LRU stuff should be part of allocator > > > > > > > instead of higher level. > > > > > > > > > > > > I'm sorry, but that's not a reasonable objection. > > > > > > > > > > > > These patches implement a core feature of being a zswap backend, using > > > > > > standard LRU and locking techniques established by the other backends. > > > > > > > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > > > > > for backend pages and a generalized LRU. But that is major surgery on > > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > > > > > all that first before implementing a basic feature that's useful now. > > > > > > > > > > With same logic, folks added the LRU logic into their allocators > > > > > without the effort considering moving the LRU into upper layer. > > > > > > > > > > And then trend is still going on since I have seen multiple times > > > > > people are trying to add more allocators. So if it's not a reasonable > > > > > ask to consier, we couldn't stop the trend in the end. > > > > > > > > So there is actually an ongoing effort to do that. Yosry and I have > > > > spent quite some time on coming up with an LRU design that's > > > > independent from compression policy over email and at Plumbers. > > > > > > > > My concern is more about the order of doing things: > > > > > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which > > > > affects production systems. A generalized LRU list is a good idea, > > > > but it's a huge task that from a user pov really is not > > > > critical. Even from a kernel dev / maintainer POV, there are bigger > > > > fish to fry in the zswap code base and the backends than this. > > > > > > > > 2. Refactoring existing functionality is much easier than writing > > > > generalized code that simultaneously enables new behavior. zsmalloc > > > > is the most complex of our backends. To make its LRU writeback work > > > > we had to patch zswap's ->map ordering to accomodate it, e.g. Such > > > > tricky changes are easier to make and test incrementally. > > > > > > > > The generalized LRU project will hugely benefit from already having > > > > a proven writeback implementation in zsmalloc, because then all the > > > > requirements in zswap and zsmalloc will be in black and white. > > > > > > > > > > I get that your main interest is zram, and so this feature isn't of > > > > > > interest to you. But zram isn't the only user, nor is it the primary > > > > > > > > > > I am interest to the feature but my interest is more of general swap > > > > > layer to manage the LRU so that it could support any hierarchy among > > > > > swap devices, not only zswap. > > > > > > > > I think we're on the same page about the longer term goals. > > > > > > > > > > Yeah. As Johannes said, I was also recently looking into this. This > > > can also help solve other problems than consolidating implementations. > > > Currently if zswap rejects a page, it goes into swap, which is > > > more-or-less a violation of page LRUs since hotter pages that are more > > > recently reclaimed end up in swap (slow), while colder pages that were > > > reclaimed before are in zswap. 
Having a separate layer managing the > > > LRU of swap pages can also make sure this doesn't happen. > > > > True. > > > > > > > > More broadly, making zswap a separate layer from swap enables other > > > improvements such as using zswap regardless of the presence of a > > > backend swapfile and not consuming space in swapfiles if a page is in > > > zswap. Of course, this is a much larger surgery. > > > > If we could decouple the LRU writeback from zswap and supports > > compression without backing swapfile, sounds like becoming more of > > zram. ;-) > > That's a little bit grey. Maybe we can consolidate them one day :) > > We have been using zswap without swapfile at Google for a while We do require swapfiles for zswap -- they can be truncated. > this > gives us the ability to reject pages that do not compress well enough > for us, which I suspect zram would not support given that it is > designed to be the final destination of the page. Also, having the > same configuration and code running on machines whether or not they > have a swapfile is nice, otherwise one would need to use zram if there > is no swapfile and switch to zswap if there is one. > > > > > > > > > I am intending to spend more time looking further into this, but other > > > things keep popping up :) > > > > Same with me. Thanks for looking it, Yosry! >
On Thu, Nov 03, 2022 at 01:46:56PM -0700, Yosry Ahmed wrote: > On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote: > > > > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote: > > < snip > > > > > > > > > > I am also worry about that LRU stuff should be part of allocator > > > > > > > instead of higher level. > > > > > > > > > > > > I'm sorry, but that's not a reasonable objection. > > > > > > > > > > > > These patches implement a core feature of being a zswap backend, using > > > > > > standard LRU and locking techniques established by the other backends. > > > > > > > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > > > > > for backend pages and a generalized LRU. But that is major surgery on > > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > > > > > all that first before implementing a basic feature that's useful now. > > > > > > > > > > With same logic, folks added the LRU logic into their allocators > > > > > without the effort considering moving the LRU into upper layer. > > > > > > > > > > And then trend is still going on since I have seen multiple times > > > > > people are trying to add more allocators. So if it's not a reasonable > > > > > ask to consier, we couldn't stop the trend in the end. > > > > > > > > So there is actually an ongoing effort to do that. Yosry and I have > > > > spent quite some time on coming up with an LRU design that's > > > > independent from compression policy over email and at Plumbers. > > > > > > > > My concern is more about the order of doing things: > > > > > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which > > > > affects production systems. A generalized LRU list is a good idea, > > > > but it's a huge task that from a user pov really is not > > > > critical. Even from a kernel dev / maintainer POV, there are bigger > > > > fish to fry in the zswap code base and the backends than this. > > > > > > > > 2. Refactoring existing functionality is much easier than writing > > > > generalized code that simultaneously enables new behavior. zsmalloc > > > > is the most complex of our backends. To make its LRU writeback work > > > > we had to patch zswap's ->map ordering to accomodate it, e.g. Such > > > > tricky changes are easier to make and test incrementally. > > > > > > > > The generalized LRU project will hugely benefit from already having > > > > a proven writeback implementation in zsmalloc, because then all the > > > > requirements in zswap and zsmalloc will be in black and white. > > > > > > > > > > I get that your main interest is zram, and so this feature isn't of > > > > > > interest to you. But zram isn't the only user, nor is it the primary > > > > > > > > > > I am interest to the feature but my interest is more of general swap > > > > > layer to manage the LRU so that it could support any hierarchy among > > > > > swap devices, not only zswap. > > > > > > > > I think we're on the same page about the longer term goals. > > > > > > > > > > Yeah. As Johannes said, I was also recently looking into this. This > > > can also help solve other problems than consolidating implementations. > > > Currently if zswap rejects a page, it goes into swap, which is > > > more-or-less a violation of page LRUs since hotter pages that are more > > > recently reclaimed end up in swap (slow), while colder pages that were > > > reclaimed before are in zswap. 
Having a separate layer managing the > > > LRU of swap pages can also make sure this doesn't happen. > > > > True. > > > > > > > > More broadly, making zswap a separate layer from swap enables other > > > improvements such as using zswap regardless of the presence of a > > > backend swapfile and not consuming space in swapfiles if a page is in > > > zswap. Of course, this is a much larger surgery. > > > > If we could decouple the LRU writeback from zswap and supports > > compression without backing swapfile, sounds like becoming more of > > zram. ;-) > > That's a little bit grey. Maybe we can consolidate them one day :) > > We have been using zswap without swapfile at Google for a while, this > gives us the ability to reject pages that do not compress well enough > for us, which I suspect zram would not support given that it is > designed to be the final destination of the page. Also, having the zRAM could do with little change but at current implmentation, it will print swapout failure message(it's not a big deal since we could suppress) but I have thought rather than that, we needs to move the page unevictable LRU list with marking the page CoW to catch a time to move the page into evictable LRU list o provide second chance to be compressed. Just off-topic. > same configuration and code running on machines whether or not they > have a swapfile is nice, otherwise one would need to use zram if there > is no swapfile and switch to zswap if there is one.
On Thu, Nov 3, 2022 at 2:16 PM Yu Zhao <yuzhao@google.com> wrote: > > On Thu, Nov 3, 2022 at 2:47 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote: > > > > > > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote: > > > < snip > > > > > > > > > > > > I am also worry about that LRU stuff should be part of allocator > > > > > > > > instead of higher level. > > > > > > > > > > > > > > I'm sorry, but that's not a reasonable objection. > > > > > > > > > > > > > > These patches implement a core feature of being a zswap backend, using > > > > > > > standard LRU and locking techniques established by the other backends. > > > > > > > > > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > > > > > > for backend pages and a generalized LRU. But that is major surgery on > > > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > > > > > > all that first before implementing a basic feature that's useful now. > > > > > > > > > > > > With same logic, folks added the LRU logic into their allocators > > > > > > without the effort considering moving the LRU into upper layer. > > > > > > > > > > > > And then trend is still going on since I have seen multiple times > > > > > > people are trying to add more allocators. So if it's not a reasonable > > > > > > ask to consier, we couldn't stop the trend in the end. > > > > > > > > > > So there is actually an ongoing effort to do that. Yosry and I have > > > > > spent quite some time on coming up with an LRU design that's > > > > > independent from compression policy over email and at Plumbers. > > > > > > > > > > My concern is more about the order of doing things: > > > > > > > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which > > > > > affects production systems. A generalized LRU list is a good idea, > > > > > but it's a huge task that from a user pov really is not > > > > > critical. Even from a kernel dev / maintainer POV, there are bigger > > > > > fish to fry in the zswap code base and the backends than this. > > > > > > > > > > 2. Refactoring existing functionality is much easier than writing > > > > > generalized code that simultaneously enables new behavior. zsmalloc > > > > > is the most complex of our backends. To make its LRU writeback work > > > > > we had to patch zswap's ->map ordering to accomodate it, e.g. Such > > > > > tricky changes are easier to make and test incrementally. > > > > > > > > > > The generalized LRU project will hugely benefit from already having > > > > > a proven writeback implementation in zsmalloc, because then all the > > > > > requirements in zswap and zsmalloc will be in black and white. > > > > > > > > > > > > I get that your main interest is zram, and so this feature isn't of > > > > > > > interest to you. But zram isn't the only user, nor is it the primary > > > > > > > > > > > > I am interest to the feature but my interest is more of general swap > > > > > > layer to manage the LRU so that it could support any hierarchy among > > > > > > swap devices, not only zswap. > > > > > > > > > > I think we're on the same page about the longer term goals. > > > > > > > > > > > > > Yeah. As Johannes said, I was also recently looking into this. This > > > > can also help solve other problems than consolidating implementations. 
> > > > Currently if zswap rejects a page, it goes into swap, which is > > > > more-or-less a violation of page LRUs since hotter pages that are more > > > > recently reclaimed end up in swap (slow), while colder pages that were > > > > reclaimed before are in zswap. Having a separate layer managing the > > > > LRU of swap pages can also make sure this doesn't happen. > > > > > > True. > > > > > > > > > > > More broadly, making zswap a separate layer from swap enables other > > > > improvements such as using zswap regardless of the presence of a > > > > backend swapfile and not consuming space in swapfiles if a page is in > > > > zswap. Of course, this is a much larger surgery. > > > > > > If we could decouple the LRU writeback from zswap and supports > > > compression without backing swapfile, sounds like becoming more of > > > zram. ;-) > > > > That's a little bit grey. Maybe we can consolidate them one day :) > > > > We have been using zswap without swapfile at Google for a while > > We do require swapfiles for zswap -- they can be truncated. Right. We detect truncated swapfiles and use them a "ghost" swapfiles, which enables us to use zswap without a "real" swapfile more-or-less. > > > this > > gives us the ability to reject pages that do not compress well enough > > for us, which I suspect zram would not support given that it is > > designed to be the final destination of the page. Also, having the > > same configuration and code running on machines whether or not they > > have a swapfile is nice, otherwise one would need to use zram if there > > is no swapfile and switch to zswap if there is one. > > > > > > > > > > > > > I am intending to spend more time looking further into this, but other > > > > things keep popping up :) > > > > > > Same with me. Thanks for looking it, Yosry! > >
On Thu, Nov 3, 2022 at 2:43 PM Minchan Kim <minchan@kernel.org> wrote: > > On Thu, Nov 03, 2022 at 01:46:56PM -0700, Yosry Ahmed wrote: > > On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote: > > > > > > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote: > > > < snip > > > > > > > > > > > > I am also worry about that LRU stuff should be part of allocator > > > > > > > > instead of higher level. > > > > > > > > > > > > > > I'm sorry, but that's not a reasonable objection. > > > > > > > > > > > > > > These patches implement a core feature of being a zswap backend, using > > > > > > > standard LRU and locking techniques established by the other backends. > > > > > > > > > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction > > > > > > > for backend pages and a generalized LRU. But that is major surgery on > > > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change > > > > > > > all that first before implementing a basic feature that's useful now. > > > > > > > > > > > > With same logic, folks added the LRU logic into their allocators > > > > > > without the effort considering moving the LRU into upper layer. > > > > > > > > > > > > And then trend is still going on since I have seen multiple times > > > > > > people are trying to add more allocators. So if it's not a reasonable > > > > > > ask to consier, we couldn't stop the trend in the end. > > > > > > > > > > So there is actually an ongoing effort to do that. Yosry and I have > > > > > spent quite some time on coming up with an LRU design that's > > > > > independent from compression policy over email and at Plumbers. > > > > > > > > > > My concern is more about the order of doing things: > > > > > > > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which > > > > > affects production systems. A generalized LRU list is a good idea, > > > > > but it's a huge task that from a user pov really is not > > > > > critical. Even from a kernel dev / maintainer POV, there are bigger > > > > > fish to fry in the zswap code base and the backends than this. > > > > > > > > > > 2. Refactoring existing functionality is much easier than writing > > > > > generalized code that simultaneously enables new behavior. zsmalloc > > > > > is the most complex of our backends. To make its LRU writeback work > > > > > we had to patch zswap's ->map ordering to accomodate it, e.g. Such > > > > > tricky changes are easier to make and test incrementally. > > > > > > > > > > The generalized LRU project will hugely benefit from already having > > > > > a proven writeback implementation in zsmalloc, because then all the > > > > > requirements in zswap and zsmalloc will be in black and white. > > > > > > > > > > > > I get that your main interest is zram, and so this feature isn't of > > > > > > > interest to you. But zram isn't the only user, nor is it the primary > > > > > > > > > > > > I am interest to the feature but my interest is more of general swap > > > > > > layer to manage the LRU so that it could support any hierarchy among > > > > > > swap devices, not only zswap. > > > > > > > > > > I think we're on the same page about the longer term goals. > > > > > > > > > > > > > Yeah. As Johannes said, I was also recently looking into this. This > > > > can also help solve other problems than consolidating implementations. 
> > > > Currently if zswap rejects a page, it goes into swap, which is > > > > more-or-less a violation of page LRUs since hotter pages that are more > > > > recently reclaimed end up in swap (slow), while colder pages that were > > > > reclaimed before are in zswap. Having a separate layer managing the > > > > LRU of swap pages can also make sure this doesn't happen. > > > > > > True. > > > > > > > > > > > More broadly, making zswap a separate layer from swap enables other > > > > improvements such as using zswap regardless of the presence of a > > > > backend swapfile and not consuming space in swapfiles if a page is in > > > > zswap. Of course, this is a much larger surgery. > > > > > > If we could decouple the LRU writeback from zswap and supports > > > compression without backing swapfile, sounds like becoming more of > > > zram. ;-) > > > > That's a little bit grey. Maybe we can consolidate them one day :) > > > > We have been using zswap without swapfile at Google for a while, this > > gives us the ability to reject pages that do not compress well enough > > for us, which I suspect zram would not support given that it is > > designed to be the final destination of the page. Also, having the > > zRAM could do with little change but at current implmentation, it will > print swapout failure message(it's not a big deal since we could > suppress) but I have thought rather than that, we needs to move the > page unevictable LRU list with marking the page CoW to catch a time > to move the page into evictable LRU list o provide second chance to > be compressed. Just off-topic. Right. We do something similar-ish today. However, this does not work though for zswap if there is a backing swapfile, as the page needs to still be evictable to the swapfile. A decoupled LRU can also manage this appropriately. > > > same configuration and code running on machines whether or not they > > have a swapfile is nice, otherwise one would need to use zram if there > > is no swapfile and switch to zswap if there is one.
On (22/11/03 11:18), Johannes Weiner wrote:
> > > I'm not in love with this, to be honest. One big pool lock instead
> > > of 255 per-class locks doesn't look attractive, as one big pool lock
> > > is going to be hammered quite a lot when zram is used, e.g. as a regular
> > > block device with a file system and is under heavy parallel writes/reads.
>
> TBH the class always struck me as an odd scope to split the lock. Lock
> contention depends on how variable the compression rate is of the
> hottest incoming data, which is unpredictable from a user POV.
>
> My understanding is that the primary usecase for zram is swapping, and
> the pool lock is the same granularity as the swap locking.

That's what we thought until, a couple of merge windows ago, we found
out (the hard way) that SUSE uses ZRAM as a normal block device with a
real file-system on it. And they use it often enough to immediately
spot the regression we landed.

> Do you have a particular one in mind? (I'm thinking journaled ones are
> not of much interest, since their IO tends to be fairly serialized.)
>
> btrfs?

Probably some parallel fio workloads? Sequential and random
reads/writes from numerous workers.

I personally sometimes use ZRAM when I want to compile something and
care only about the package; I don't need the .o files for
recompilation or anything, just the final package.
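There is no concrete job file in the thread, but the access pattern Sergey describes is easy to approximate in plain C. Everything below (device path, thread count, I/O geometry) is an assumption for illustration, not a benchmark anyone in the thread ran:

/* Parallel random 4K writes against a block device, in the spirit of
 * the fio workloads suggested above. Build with: gcc -O2 -pthread */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define DEV      "/dev/zram0"   /* assumed zram device */
#define NTHREADS 32
#define IOSIZE   4096
#define NIOS     100000L
#define SPAN     (1ULL << 30)   /* exercise the first 1 GiB */

static void *worker(void *arg)
{
        unsigned int seed = (unsigned long)arg;
        char buf[IOSIZE];
        int fd = open(DEV, O_RDWR);

        if (fd < 0) {
                perror("open");
                return NULL;
        }
        /* Mixed content: a constant fill would compress too well and
         * concentrate every object in a single size class. */
        for (size_t i = 0; i < sizeof(buf); i++)
                buf[i] = rand_r(&seed);

        for (long i = 0; i < NIOS; i++) {
                off_t off = (off_t)(rand_r(&seed) % (SPAN / IOSIZE)) * IOSIZE;
                if (pwrite(fd, buf, IOSIZE, off) != IOSIZE)
                        perror("pwrite");
        }
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t tids[NTHREADS];

        for (unsigned long t = 0; t < NTHREADS; t++)
                pthread_create(&tids[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
                pthread_join(tids[t], NULL);
        return 0;
}

A real comparison would use fio's sequential and random read/write jobs from many workers, but even a sketch like this hammers the allocator from many CPUs at once, which is the contention scenario under discussion.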
We have benchmarked the lock consolidation to see the performance effect of
this change on zram. First, we ran a synthetic FS workload on a server machine
with 36 cores (same machine for all runs), using this benchmark script:

https://github.com/josefbacik/fs_mark

using 32 threads, and cranking the pressure up to > 80% FS usage.

Here is the result (unit is files/second):

With lock consolidation (btrfs):
Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028

Without lock consolidation (btrfs):
Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665

With lock consolidation (ext4):
Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668

Without lock consolidation (ext4):
Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469

As you can see, we observe a 0.3% regression for btrfs, and a 0.9% regression
for ext4. This is a small, barely measurable difference in my opinion.

For a more realistic scenario, we also tried building the kernel on zram.
Here is the time it takes (in seconds):

With lock consolidation (btrfs):
real
Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159
user
Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656
sys
Average: 521.4, Median: 522.0, Stddev: 1.51657508881031

Without lock consolidation (btrfs):
real
Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756
user
Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023
sys
Average: 520.6, Median: 521.0, Stddev: 1.140175425099138

With lock consolidation (ext4):
real
Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951
user
Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307
sys
Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317

Without lock consolidation (ext4):
real
Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159
user
Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523
sys
Average: 520.4, Median: 520.0, Stddev: 1.140175425099138

The difference is entirely within the noise of a typical run on zram. This
hardly justifies the complexity of maintaining both the pool lock and the class
lock. In fact, for writeback, we would need to introduce yet another lock to
prevent data races on the pool's LRU, further complicating the lock-handling
logic. IMHO, it is just better to collapse all of these into a single
pool-level lock.
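For reference, summary lines like the above can be computed from five per-run
results roughly as follows. This is a hedged sketch, not the actual script:
the input values are placeholders (they happen to reproduce the "Without lock
consolidation (ext4) real" row, but the real per-run data is not in this
thread), and the reported Stddev values appear consistent with the sample
standard deviation (n - 1 denominator), which is what the sketch uses.

/* Build with: cc -O2 -o stats stats.c -lm */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

int main(void)
{
	double runs[] = { 319.0, 319.0, 319.0, 320.0, 321.0 };	/* placeholder data */
	size_t n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0.0, sq = 0.0, avg, median, stddev;

	for (size_t i = 0; i < n; i++)
		sum += runs[i];
	avg = sum / n;

	for (size_t i = 0; i < n; i++)
		sq += (runs[i] - avg) * (runs[i] - avg);
	stddev = sqrt(sq / (n - 1));	/* sample stddev */

	qsort(runs, n, sizeof(runs[0]), cmp);
	median = runs[n / 2];		/* n is odd here */

	printf("Average: %.1f, Median: %.1f, Stddev: %.16g\n",
	       avg, median, stddev);
	return 0;
}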
On Mon, Nov 07, 2022 at 01:31:14PM -0800, Nhat Pham wrote:
> We have benchmarked the lock consolidation to see the performance effect of
> this change on zram. First, we ran a synthetic FS workload on a server machine
> with 36 cores (same machine for all runs), using this benchmark script:
>
> https://github.com/josefbacik/fs_mark
>
< snip >
>
> The difference is entirely within the noise of a typical run on zram. This
> hardly justifies the complexity of maintaining both the pool lock and the class
> lock. In fact, for writeback, we would need to introduce yet another lock to

I am glad to make the zsmalloc lock scheme simpler without a meaningful
regression, since it introduced a lot of mess. Please include the test
results in the patch description.

Thanks for the testing, Nhat.
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index d03941cace2c..326faa751f0a 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -33,8 +33,7 @@ /* * lock ordering: * page_lock - * pool->migrate_lock - * class->lock + * pool->lock * zspage->lock */ @@ -192,7 +191,6 @@ static const int fullness_threshold_frac = 4; static size_t huge_class_size; struct size_class { - spinlock_t lock; struct list_head fullness_list[NR_ZS_FULLNESS]; /* * Size of objects stored in this class. Must be multiple @@ -247,8 +245,7 @@ struct zs_pool { #ifdef CONFIG_COMPACTION struct work_struct free_work; #endif - /* protect page/zspage migration */ - rwlock_t migrate_lock; + spinlock_t lock; }; struct zspage { @@ -355,7 +352,7 @@ static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage) kmem_cache_free(pool->zspage_cachep, zspage); } -/* class->lock(which owns the handle) synchronizes races */ +/* pool->lock(which owns the handle) synchronizes races */ static void record_obj(unsigned long handle, unsigned long obj) { *(unsigned long *)handle = obj; @@ -452,7 +449,7 @@ static __maybe_unused int is_first_page(struct page *page) return PagePrivate(page); } -/* Protected by class->lock */ +/* Protected by pool->lock */ static inline int get_zspage_inuse(struct zspage *zspage) { return zspage->inuse; @@ -597,13 +594,13 @@ static int zs_stats_size_show(struct seq_file *s, void *v) if (class->index != i) continue; - spin_lock(&class->lock); + spin_lock(&pool->lock); class_almost_full = zs_stat_get(class, CLASS_ALMOST_FULL); class_almost_empty = zs_stat_get(class, CLASS_ALMOST_EMPTY); obj_allocated = zs_stat_get(class, OBJ_ALLOCATED); obj_used = zs_stat_get(class, OBJ_USED); freeable = zs_can_compact(class); - spin_unlock(&class->lock); + spin_unlock(&pool->lock); objs_per_zspage = class->objs_per_zspage; pages_used = obj_allocated / objs_per_zspage * @@ -916,7 +913,7 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class, get_zspage_mapping(zspage, &class_idx, &fg); - assert_spin_locked(&class->lock); + assert_spin_locked(&pool->lock); VM_BUG_ON(get_zspage_inuse(zspage)); VM_BUG_ON(fg != ZS_EMPTY); @@ -1247,19 +1244,19 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, BUG_ON(in_interrupt()); /* It guarantees it can get zspage from handle safely */ - read_lock(&pool->migrate_lock); + spin_lock(&pool->lock); obj = handle_to_obj(handle); obj_to_location(obj, &page, &obj_idx); zspage = get_zspage(page); /* - * migration cannot move any zpages in this zspage. Here, class->lock + * migration cannot move any zpages in this zspage. Here, pool->lock * is too heavy since callers would take some time until they calls * zs_unmap_object API so delegate the locking from class to zspage * which is smaller granularity. 
*/ migrate_read_lock(zspage); - read_unlock(&pool->migrate_lock); + spin_unlock(&pool->lock); class = zspage_class(pool, zspage); off = (class->size * obj_idx) & ~PAGE_MASK; @@ -1412,8 +1409,8 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) size += ZS_HANDLE_SIZE; class = pool->size_class[get_size_class_index(size)]; - /* class->lock effectively protects the zpage migration */ - spin_lock(&class->lock); + /* pool->lock effectively protects the zpage migration */ + spin_lock(&pool->lock); zspage = find_get_zspage(class); if (likely(zspage)) { obj = obj_malloc(pool, zspage, handle); @@ -1421,12 +1418,12 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) fix_fullness_group(class, zspage); record_obj(handle, obj); class_stat_inc(class, OBJ_USED, 1); - spin_unlock(&class->lock); + spin_unlock(&pool->lock); return handle; } - spin_unlock(&class->lock); + spin_unlock(&pool->lock); zspage = alloc_zspage(pool, class, gfp); if (!zspage) { @@ -1434,7 +1431,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) return (unsigned long)ERR_PTR(-ENOMEM); } - spin_lock(&class->lock); + spin_lock(&pool->lock); obj = obj_malloc(pool, zspage, handle); newfg = get_fullness_group(class, zspage); insert_zspage(class, zspage, newfg); @@ -1447,7 +1444,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) /* We completely set up zspage so mark them as movable */ SetZsPageMovable(pool, zspage); - spin_unlock(&class->lock); + spin_unlock(&pool->lock); return handle; } @@ -1491,16 +1488,14 @@ void zs_free(struct zs_pool *pool, unsigned long handle) return; /* - * The pool->migrate_lock protects the race with zpage's migration + * The pool->lock protects the race with zpage's migration * so it's safe to get the page from handle. */ - read_lock(&pool->migrate_lock); + spin_lock(&pool->lock); obj = handle_to_obj(handle); obj_to_page(obj, &f_page); zspage = get_zspage(f_page); class = zspage_class(pool, zspage); - spin_lock(&class->lock); - read_unlock(&pool->migrate_lock); obj_free(class->size, obj); class_stat_dec(class, OBJ_USED, 1); @@ -1510,7 +1505,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle) free_zspage(pool, class, zspage); out: - spin_unlock(&class->lock); + spin_unlock(&pool->lock); cache_free_handle(pool, handle); } EXPORT_SYMBOL_GPL(zs_free); @@ -1867,16 +1862,12 @@ static int zs_page_migrate(struct page *newpage, struct page *page, pool = zspage->pool; /* - * The pool migrate_lock protects the race between zpage migration + * The pool's lock protects the race between zpage migration * and zs_free. */ - write_lock(&pool->migrate_lock); + spin_lock(&pool->lock); class = zspage_class(pool, zspage); - /* - * the class lock protects zpage alloc/free in the zspage. - */ - spin_lock(&class->lock); /* the migrate_write_lock protects zpage access via zs_map_object */ migrate_write_lock(zspage); @@ -1906,10 +1897,9 @@ static int zs_page_migrate(struct page *newpage, struct page *page, replace_sub_page(class, zspage, newpage, page); /* * Since we complete the data copy and set up new zspage structure, - * it's okay to release migration_lock. + * it's okay to release the pool's lock. 
*/ - write_unlock(&pool->migrate_lock); - spin_unlock(&class->lock); + spin_unlock(&pool->lock); dec_zspage_isolation(zspage); migrate_write_unlock(zspage); @@ -1964,9 +1954,9 @@ static void async_free_zspage(struct work_struct *work) if (class->index != i) continue; - spin_lock(&class->lock); + spin_lock(&pool->lock); list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages); - spin_unlock(&class->lock); + spin_unlock(&pool->lock); } list_for_each_entry_safe(zspage, tmp, &free_pages, list) { @@ -1976,9 +1966,9 @@ static void async_free_zspage(struct work_struct *work) get_zspage_mapping(zspage, &class_idx, &fullness); VM_BUG_ON(fullness != ZS_EMPTY); class = pool->size_class[class_idx]; - spin_lock(&class->lock); + spin_lock(&pool->lock); __free_zspage(pool, class, zspage); - spin_unlock(&class->lock); + spin_unlock(&pool->lock); } }; @@ -2039,10 +2029,11 @@ static unsigned long __zs_compact(struct zs_pool *pool, struct zspage *dst_zspage = NULL; unsigned long pages_freed = 0; - /* protect the race between zpage migration and zs_free */ - write_lock(&pool->migrate_lock); - /* protect zpage allocation/free */ - spin_lock(&class->lock); + /* + * protect the race between zpage migration and zs_free + * as well as zpage allocation/free + */ + spin_lock(&pool->lock); while ((src_zspage = isolate_zspage(class, true))) { /* protect someone accessing the zspage(i.e., zs_map_object) */ migrate_write_lock(src_zspage); @@ -2067,7 +2058,7 @@ static unsigned long __zs_compact(struct zs_pool *pool, putback_zspage(class, dst_zspage); migrate_write_unlock(dst_zspage); dst_zspage = NULL; - if (rwlock_is_contended(&pool->migrate_lock)) + if (spin_is_contended(&pool->lock)) break; } @@ -2084,11 +2075,9 @@ static unsigned long __zs_compact(struct zs_pool *pool, pages_freed += class->pages_per_zspage; } else migrate_write_unlock(src_zspage); - spin_unlock(&class->lock); - write_unlock(&pool->migrate_lock); + spin_unlock(&pool->lock); cond_resched(); - write_lock(&pool->migrate_lock); - spin_lock(&class->lock); + spin_lock(&pool->lock); } if (src_zspage) { @@ -2096,8 +2085,7 @@ static unsigned long __zs_compact(struct zs_pool *pool, migrate_write_unlock(src_zspage); } - spin_unlock(&class->lock); - write_unlock(&pool->migrate_lock); + spin_unlock(&pool->lock); return pages_freed; } @@ -2200,7 +2188,7 @@ struct zs_pool *zs_create_pool(const char *name) return NULL; init_deferred_free(pool); - rwlock_init(&pool->migrate_lock); + spin_lock_init(&pool->lock); pool->name = kstrdup(name, GFP_KERNEL); if (!pool->name) @@ -2271,7 +2259,6 @@ struct zs_pool *zs_create_pool(const char *name) class->index = i; class->pages_per_zspage = pages_per_zspage; class->objs_per_zspage = objs_per_zspage; - spin_lock_init(&class->lock); pool->size_class[i] = class; for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS; fullness++)
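To make the effect of the consolidation concrete at the API boundary, here is
a rough illustration of a typical caller's store path after this patch,
annotated with the lock each step takes. It is a hedged sketch, not part of
the patch: the store_compressed() function and its buffer handling are
hypothetical, while zs_malloc(), zs_map_object(), zs_unmap_object(), zs_free()
and ZS_MM_WO are the real zsmalloc API (kernel context, <linux/zsmalloc.h>
and <linux/err.h>).

static int store_compressed(struct zs_pool *pool, const void *src, size_t len,
			    unsigned long *out_handle)
{
	unsigned long handle;
	void *dst;

	handle = zs_malloc(pool, len, GFP_KERNEL);	/* takes pool->lock */
	if (IS_ERR_VALUE(handle))			/* ERR_PTR-encoded error */
		return -ENOMEM;

	/*
	 * zs_map_object() briefly takes pool->lock to resolve the handle
	 * safely against migration, then drops it and holds only the
	 * per-zspage migrate lock, so the copy below does not pin pool->lock.
	 */
	dst = zs_map_object(pool, handle, ZS_MM_WO);
	memcpy(dst, src, len);
	zs_unmap_object(pool, handle);

	/* A later zs_free(pool, handle) now takes pool->lock once,
	 * instead of migrate_lock followed by class->lock. */
	*out_handle = handle;
	return 0;
}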