Message ID | 20221119001536.2086599-5-nphamcs@gmail.com
---|---
State | New
Headers |
From: Nhat Pham <nphamcs@gmail.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, minchan@kernel.org, ngupta@vflare.org, senozhatsky@chromium.org, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com
Subject: [PATCH v6 4/6] zsmalloc: Add a LRU to zs_pool to keep track of zspages in LRU order
Date: Fri, 18 Nov 2022 16:15:34 -0800
Message-Id: <20221119001536.2086599-5-nphamcs@gmail.com>
In-Reply-To: <20221119001536.2086599-1-nphamcs@gmail.com>
References: <20221119001536.2086599-1-nphamcs@gmail.com>
Series | Implement writeback for zsmalloc
Commit Message
Nhat Pham
Nov. 19, 2022, 12:15 a.m. UTC
This helps determine the coldest zspages as candidates for writeback.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/zsmalloc.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
--
2.30.2
Comments
On Fri, Nov 18, 2022 at 04:15:34PM -0800, Nhat Pham wrote:
> This helps determines the coldest zspages as candidates for writeback.
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

This looks good to me. The ifdefs are higher than usual, but in this case
they actually really nicely annotate exactly which hunks need to move to
zswap (as CONFIG_ZPOOL == CONFIG_ZSWAP) when we unify the LRU!

zbud and z3fold don't have those helpful annotations (since they're
zswap-only to begin with), which will make their conversion a bit more
laborious. But zsmalloc can be a (rough) guiding template for them.

Thanks
On Fri, Nov 18, 2022 at 04:15:34PM -0800, Nhat Pham wrote:
> This helps determines the coldest zspages as candidates for writeback.
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Minchan Kim <minchan@kernel.org>
On (22/11/18 16:15), Nhat Pham wrote:
[..]
> @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> 	obj_to_location(obj, &page, &obj_idx);
> 	zspage = get_zspage(page);
>
> +#ifdef CONFIG_ZPOOL
> +	/* Move the zspage to front of pool's LRU */
> +	if (mm == ZS_MM_WO) {
> +		if (!list_empty(&zspage->lru))
> +			list_del(&zspage->lru);
> +		list_add(&zspage->lru, &pool->lru);
> +	}
> +#endif

Do we consider pages that were mapped for MM_RO/MM_RW as cold?
I wonder why, we use them, so technically they are not exactly
"least recently used".
On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> On (22/11/18 16:15), Nhat Pham wrote:
> [..]
> > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > 	obj_to_location(obj, &page, &obj_idx);
> > 	zspage = get_zspage(page);
> >
> > +#ifdef CONFIG_ZPOOL
> > +	/* Move the zspage to front of pool's LRU */
> > +	if (mm == ZS_MM_WO) {
> > +		if (!list_empty(&zspage->lru))
> > +			list_del(&zspage->lru);
> > +		list_add(&zspage->lru, &pool->lru);
> > +	}
> > +#endif
>
> Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> I wonder why, we use them, so technically they are not exactly
> "least recently used".

This is a swap LRU. Per definition there are no ongoing accesses to
the memory while the page is swapped out that would make it "hot".
A new entry is hot, then ages to the tail until it gets either
written back or swaps back in.

Because of that, the zswap backends have traditionally had the lru-add
in the allocation function (zs_malloc, zbud_alloc, z3fold_alloc).
Minchan insisted we move it here for zsmalloc, since 'update lru on
data access' is more generic.

Unfortunately, one of the data accesses is when we write the swap
entry to disk - during reclaim when the page is isolated from the LRU!
Obviously we MUST NOT put it back on the LRU mid-reclaim.

So now we have very generic LRU code, and exactly one usecase that
needs exceptions from the generic behavior. The code is raising
questions, not surprisingly.

We can add a lengthy comment to it - a variant of the above text?

My vote would still be to just move it back to zs_malloc, where it
makes sense, is easier to explain, and matches the other backends.
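For reference, a minimal sketch of the allocation-time placement suggested above - purely illustrative, not code from this series; that the zspage is already known and pool->lock is held at this point in zs_malloc() is an assumption:

#ifdef CONFIG_ZPOOL
	/* Sketch only: a freshly stored object makes its zspage the most
	 * recently used, so do the LRU move at allocation time instead of
	 * in zs_map_object(). */
	if (!list_empty(&zspage->lru))
		list_del(&zspage->lru);
	list_add(&zspage->lru, &pool->lru);
#endif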
On (22/11/22 12:42), Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> > On (22/11/18 16:15), Nhat Pham wrote:
> > [..]
> > > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > > 	obj_to_location(obj, &page, &obj_idx);
> > > 	zspage = get_zspage(page);
> > >
> > > +#ifdef CONFIG_ZPOOL
> > > +	/* Move the zspage to front of pool's LRU */
> > > +	if (mm == ZS_MM_WO) {
> > > +		if (!list_empty(&zspage->lru))
> > > +			list_del(&zspage->lru);
> > > +		list_add(&zspage->lru, &pool->lru);
> > > +	}
> > > +#endif
> >
> > Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> > I wonder why, we use them, so technically they are not exactly
> > "least recently used".
>
> This is a swap LRU. Per definition there are no ongoing accesses to
> the memory while the page is swapped out that would make it "hot".

Hmm. Not arguing, just trying to understand some things.

There are no accesses to swapped out pages yes, but zspage holds multiple
objects, which are compressed swapped out pages in this particular case.
For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
is at the tail of the LRU list. Suppose that we page-faulted 20 times and
read 20 objects from that zspage, IOW zspage has been in use 20 times very
recently, while writeback still considers it to be "not-used" and will
evict it.

So if this works for you then I'm fine. But we probably, like you suggested,
can document a couple of things here - namely why WRITE access to zspage
counts as "zspage is in use" but READ access to the same zspage does not
count as "zspage is in use".
On (22/11/18 16:15), Nhat Pham wrote:
> +#ifdef CONFIG_ZPOOL
> +	/* Move the zspage to front of pool's LRU */
> +	if (mm == ZS_MM_WO) {
> +		if (!list_empty(&zspage->lru))
> +			list_del(&zspage->lru);
> +		list_add(&zspage->lru, &pool->lru);
> +	}
> +#endif

Just an idea. Have you considered having size class LRU instead of pool LRU?

Evicting pages from different classes can have different impact on the
system, in theory. For instance, ZS_FULL zspage in class size 3264 (bytes)
holds 5 compressed objects per-zspage, which are 5 compressed swapped out
pages. While zspage in a class size 176 (bytes) holds 93 compressed objects
(swapped pages). Both zspages consist of 4 non-contiguous 0-order physical
pages, so when we free zspage from these classes we release 4 physical
pages. However, in terms of major page faults evicting a page from size
class 3264 looks better than from a size class 176: 5 major page faults
vs 93 major page faults.

Does this make sense?
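As an illustration of the arithmetic above, a small standalone C sketch, assuming (as stated in the message) that both classes use zspages built from 4 non-contiguous 0-order pages of 4096 bytes:

	/* Objects per zspage for the two size classes discussed above. */
	#include <stdio.h>

	int main(void)
	{
		const unsigned long zspage_bytes = 4 * 4096;
		const unsigned long class_size[] = { 3264, 176 };

		for (int i = 0; i < 2; i++)
			printf("class %lu: %lu objects per zspage\n",
			       class_size[i], zspage_bytes / class_size[i]);
		/* Prints 5 and 93: freeing one full zspage costs 5 vs. 93
		 * potential major faults for the same 4 freed physical pages. */
		return 0;
	}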
On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (22/11/22 12:42), Johannes Weiner wrote:
> > On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> > > On (22/11/18 16:15), Nhat Pham wrote:
> > > [..]
> > > > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > > > 	obj_to_location(obj, &page, &obj_idx);
> > > > 	zspage = get_zspage(page);
> > > >
> > > > +#ifdef CONFIG_ZPOOL
> > > > +	/* Move the zspage to front of pool's LRU */
> > > > +	if (mm == ZS_MM_WO) {
> > > > +		if (!list_empty(&zspage->lru))
> > > > +			list_del(&zspage->lru);
> > > > +		list_add(&zspage->lru, &pool->lru);
> > > > +	}
> > > > +#endif
> > >
> > > Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> > > I wonder why, we use them, so technically they are not exactly
> > > "least recently used".
> >
> > This is a swap LRU. Per definition there are no ongoing accesses to
> > the memory while the page is swapped out that would make it "hot".
>
> Hmm. Not arguing, just trying to understand some things.
>
> There are no accesses to swapped out pages yes, but zspage holds multiple
> objects, which are compressed swapped out pages in this particular case.
> For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> read 20 objects from that zspage, IOW zspage has been in use 20 times very
> recently, while writeback still considers it to be "not-used" and will
> evict it.
>
> So if this works for you then I'm fine. But we probably, like you suggested,
> can document a couple of things here - namely why WRITE access to zspage
> counts as "zspage is in use" but READ access to the same zspage does not
> count as "zspage is in use".

I guess the key here is that we have an LRU of zspages, when we really
want an LRU of compressed objects. In some cases, we may end up
reclaiming the wrong pages.

Assuming we have 2 zspages, Z1 and Z2, and 4 physical pages that we
compress over time, P1 -> P4.

Let's assume P1 -> P4 get compressed in order (P4 is the hottest
page), and they get assigned to zspages as follows:
Z1: P1, P3
Z2: P2, P4

In this case, the zspages LRU would be Z2->Z1, because Z2 was touched
last when we compressed P4. Now if we want to writeback, we will look
at Z1, and we might end up reclaiming P3, depending on the order the
pages are stored in.

A worst case scenario of this would be if we have a large number of
pages, maybe 1000, P1->P1000 (where P1000 is the hottest), and they
all go into Z1 and Z2 in this way:
Z1: P1 -> P499, P1000
Z2: P500 -> P999

In this case, Z1 contains 499 cold pages, but it got P1000 at the end
which caused us to put it on the front of the LRU. Now writeback will
consistently use Z2. This is bad. Now I have no idea how practical
this is, but it seems fairly random, based on the compression size of
pages and access patterns.

Does this mean we should move zspages to the front of the LRU when we
writeback from them? No, I wouldn't say so. The same exact scenario
can happen because of this. Imagine the following assignment of the
1000 pages:
Z1: P<odd> (P1, P3, .., P999)
Z2: P<even> (P2, P4, .., P1000)

Z2 is at the front of the LRU because it has P1000, so the first time
we do writeback we will start at Z1. Once we reclaim one object from
Z1, we will start writeback from Z2 next time, and we will keep
alternating. Now if we are really unlucky, we can end up reclaiming in
this order P999, P1000, P997, P998, ... . So yeah I don't think
putting zspages in the front of the LRU when we writeback is the
answer. I would even say it's completely orthogonal to the problem,
because writing back an object from the zspage at the end of the LRU
gives us 0 information about the state of other objects on the same
zspage.

Ideally, we would have an LRU of objects instead, but this would be
very complicated with the current form of writeback. It would be much
easier if we have an LRU for zswap entries instead, which is something
I am looking into, and is a much bigger surgery, and should be
separate from this work. Today zswap inverts LRU priorities anyway by
sending hot pages to the swapfile when zswap is full, when colder
pages are in zswap, so I wouldn't really worry about this now :)
On Wed, Nov 23, 2022 at 12:02 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> [..]
> > So if this works for you then I'm fine. But we probably, like you suggested,
> > can document a couple of things here - namely why WRITE access to zspage
> > counts as "zspage is in use" but READ access to the same zspage does not
> > count as "zspage is in use".
>
> I guess the key here is that we have an LRU of zspages, when we really
> want an LRU of compressed objects. In some cases, we may end up
> reclaiming the wrong pages.
>
> [..]
>
> Ideally, we would have an LRU of objects instead, but this would be
> very complicated with the current form of writeback. It would be much
> easier if we have an LRU for zswap entries instead, which is something
> I am looking into, and is a much bigger surgery, and should be
> separate from this work. Today zswap inverts LRU priorities anyway by
> sending hot pages to the swapfile when zswap is full, when colder
> pages are in zswap, so I wouldn't really worry about this now :)

Oh, I didn't realize we reclaim all the objects in the zspage at the
end of the LRU. All the examples are wrong, but the concept still
stands, the problem is that we have an LRU of zspages not an LRU of
objects.

Nonetheless, the fact that we refaulted an object in a zspage does not
necessarily mean that other objects on the same are hotter than
objects in other zspages IIUC.
On Wed, Nov 23, 2022 at 12:11:24AM -0800, Yosry Ahmed wrote:
> On Wed, Nov 23, 2022 at 12:02 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky
> > > There are no accesses to swapped out pages yes, but zspage holds multiple
> > > objects, which are compressed swapped out pages in this particular case.
> > > For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> > > that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> > > is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> > > read 20 objects from that zspage, IOW zspage has been in use 20 times very
> > > recently, while writeback still considers it to be "not-used" and will
> > > evict it.
> > >
> > > So if this works for you then I'm fine. But we probably, like you suggested,
> > > can document a couple of things here - namely why WRITE access to zspage
> > > counts as "zspage is in use" but READ access to the same zspage does not
> > > count as "zspage is in use".
>
> Nonetheless, the fact that we refaulted an object in a zspage does not
> necessarily mean that other objects on the same are hotter than
> objects in other zspages IIUC.

Yes. On allocation, we know that there is at least one hot object in
the page. On refault, the connection between objects in a page is
weak. And it's weaker on zsmalloc than with other backends due to the
many size classes making temporal grouping less likely.

So I think you're quite right, Sergey, that a per-class LRU would be
more accurate. It's no-LRU < zspage-LRU < class-LRU < object-LRU.

Like Yosry said, the plan is to implement an object-LRU next as part
of the generalized LRU for zsmalloc, zbud and z3fold.

For now, the zspage LRU is an improvement to no-LRU. Our production
experiments confirmed that.
On (22/11/23 00:02), Yosry Ahmed wrote:
> > There are no accesses to swapped out pages yes, but zspage holds multiple
> > objects, which are compressed swapped out pages in this particular case.
> > For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> > that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> > is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> > read 20 objects from that zspage, IOW zspage has been in use 20 times very
> > recently, while writeback still considers it to be "not-used" and will
> > evict it.
> >
> > So if this works for you then I'm fine. But we probably, like you suggested,
> > can document a couple of things here - namely why WRITE access to zspage
> > counts as "zspage is in use" but READ access to the same zspage does not
> > count as "zspage is in use".
>
> I guess the key here is that we have an LRU of zspages, when we really
> want an LRU of compressed objects. In some cases, we may end up
> reclaiming the wrong pages.

Yes, completely agree.

[..]

> Ideally, we would have an LRU of objects instead, but this would be
> very complicated with the current form of writeback.

Right. So we have two writebacks now: one in zram and one in zsmalloc.
And zram writeback works with objects' access patterns, it simply tracks
timestamps per entry and it doesn't know/care about zspages. Writeback
targets in zram are selected by simply looking at timestamps of objects
(compressed normal pages). And that is the right level for LRU, allocator
is too low-level for this.

I'm looking forward to seeing new LRU implementation (at a level higher
than allocator) :)
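To make the contrast concrete, a rough sketch of the per-entry timestamp idea described above. The names below are made up for illustration and this is not zram's actual code; it only shows selection by object age rather than by zspage position:

	/* Illustrative only: per-entry access-time tracking. */
	struct entry_meta {
		unsigned long handle;	/* allocator handle, 0 if unused */
		ktime_t last_access;	/* updated on store and on read */
	};

	/* Pick the writeback target by oldest timestamp, not by zspage. */
	static struct entry_meta *pick_coldest(struct entry_meta *tab, size_t nr)
	{
		struct entry_meta *coldest = NULL;
		size_t i;

		for (i = 0; i < nr; i++) {
			if (!tab[i].handle)
				continue;
			if (!coldest ||
			    ktime_before(tab[i].last_access, coldest->last_access))
				coldest = &tab[i];
		}
		return coldest;
	}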
On (22/11/23 11:30), Johannes Weiner wrote:
> Like Yosry said, the plan is to implement an object-LRU next as part
> of the generalized LRU for zsmalloc, zbud and z3fold.
>
> For now, the zspage LRU is an improvement to no-LRU. Our production
> experiments confirmed that.

Sounds good!
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 326faa751f0a..7dd464b5a6a5 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -239,6 +239,11 @@ struct zs_pool {
 	/* Compact classes */
 	struct shrinker shrinker;
 
+#ifdef CONFIG_ZPOOL
+	/* List tracking the zspages in LRU order by most recently added object */
+	struct list_head lru;
+#endif
+
 #ifdef CONFIG_ZSMALLOC_STAT
 	struct dentry *stat_dentry;
 #endif
@@ -260,6 +265,12 @@ struct zspage {
 	unsigned int freeobj;
 	struct page *first_page;
 	struct list_head list; /* fullness list */
+
+#ifdef CONFIG_ZPOOL
+	/* links the zspage to the lru list in the pool */
+	struct list_head lru;
+#endif
+
 	struct zs_pool *pool;
 #ifdef CONFIG_COMPACTION
 	rwlock_t lock;
@@ -953,6 +964,9 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
 	}
 
 	remove_zspage(class, zspage, ZS_EMPTY);
+#ifdef CONFIG_ZPOOL
+	list_del(&zspage->lru);
+#endif
 	__free_zspage(pool, class, zspage);
 }
 
@@ -998,6 +1012,10 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 		off %= PAGE_SIZE;
 	}
 
+#ifdef CONFIG_ZPOOL
+	INIT_LIST_HEAD(&zspage->lru);
+#endif
+
 	set_freeobj(zspage, 0);
 }
 
@@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	obj_to_location(obj, &page, &obj_idx);
 	zspage = get_zspage(page);
 
+#ifdef CONFIG_ZPOOL
+	/* Move the zspage to front of pool's LRU */
+	if (mm == ZS_MM_WO) {
+		if (!list_empty(&zspage->lru))
+			list_del(&zspage->lru);
+		list_add(&zspage->lru, &pool->lru);
+	}
+#endif
+
 	/*
 	 * migration cannot move any zpages in this zspage. Here, pool->lock
 	 * is too heavy since callers would take some time until they calls
@@ -1967,6 +1994,9 @@ static void async_free_zspage(struct work_struct *work)
 		VM_BUG_ON(fullness != ZS_EMPTY);
 		class = pool->size_class[class_idx];
 		spin_lock(&pool->lock);
+#ifdef CONFIG_ZPOOL
+		list_del(&zspage->lru);
+#endif
 		__free_zspage(pool, class, zspage);
 		spin_unlock(&pool->lock);
 	}
@@ -2278,6 +2308,10 @@ struct zs_pool *zs_create_pool(const char *name)
 	 */
 	zs_register_shrinker(pool);
 
+#ifdef CONFIG_ZPOOL
+	INIT_LIST_HEAD(&pool->lru);
+#endif
+
 	return pool;
 
 err:
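To show how the new list is intended to be consumed, a minimal sketch of selecting a writeback candidate from the cold end of pool->lru. This is illustrative only: the function name and the simplified locking are assumptions, and the actual eviction code lands elsewhere in the series, not in this patch.

#ifdef CONFIG_ZPOOL
/*
 * Illustrative sketch, not part of this patch: the coldest zspage is the
 * one at the tail of pool->lru, since zs_map_object(..., ZS_MM_WO) moves
 * zspages to the head on writes.  A real eviction path would also need to
 * isolate the zspage before dropping pool->lock.
 */
static struct zspage *pick_writeback_candidate(struct zs_pool *pool)
{
	struct zspage *zspage = NULL;

	spin_lock(&pool->lock);
	if (!list_empty(&pool->lru))
		zspage = list_last_entry(&pool->lru, struct zspage, lru);
	spin_unlock(&pool->lock);

	return zspage;
}
#endif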