From patchwork Fri Feb 9 11:59:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chengming Zhou X-Patchwork-Id: 20153 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:50ea:b0:106:860b:bbdd with SMTP id r10csp797379dyd; Fri, 9 Feb 2024 04:01:09 -0800 (PST) X-Forwarded-Encrypted: i=3; AJvYcCU2fQgjJiHlKE1KG5OGX6J/ZMh6V1pch9heyuCmondIEoWrN3BDmwCLuMhsgBAWjhOWayCeMXraqsGEwOsyjky/KBKF5g== X-Google-Smtp-Source: AGHT+IF6+noNo1Bx3MLC4XtXihkrAX7rjaLpKpmQX+o9EdArrTpZLnqsz6Oy8WHanhyW9IoftP2N X-Received: by 2002:a05:622a:10f:b0:42b:e5e0:5db with SMTP id u15-20020a05622a010f00b0042be5e005dbmr1555026qtw.11.1707480069639; Fri, 09 Feb 2024 04:01:09 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1707480069; cv=pass; d=google.com; s=arc-20160816; b=Kp80GABOwALCg/vCrXqGFYaXtb1ggHkA8eZI2dpPe+SHY9AN8r7ve+TCYm2kYsKDnG OrJ9KUsxm26NRgr5YQy8wiEBbE0HT8z0QeAuudjji3HRFik/SJnVVSxIY0v5KBgcGUYR LynslP2umkHh+K83d2Hf1Q9YbI7s5BTBvRvgv3Ay6O0BN9LDPU0ZptIX6AbuV/HcO9zo DPryhKtnPzGbKqanaU3R7d1JcgMAfwbJUa5+d0jRzRh9XaAXKMxqpwHEiUD0kcsxTJZ/ XRJUJrp7Aydp/YcafuZLHmpzGgqzgDGioNiCZmCExtgZnTsOWMyIzJJx8WTc31kISxj8 nv2g== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:message-id:date:subject:cc:to :from:dkim-signature; bh=iu/SOvWf0hKfo9wq+PAj0R5rFj1AjOSdhRH6wieMVN0=; fh=h+JIOZUC7DGaoGewbIxcg8laI+j8d5WQTsJIWZklA74=; b=KVDQgBgLQt+G1SMgUszS2tOtNqfL4UY8n1mjjop/jICg15xe3DgSDo1ICDVYH6Sfjc x1oHkuwNP5pkhnHChKnv93nVgLe0+q9MwuF9ogAcv/MBV2GJR2veVuNZTGc8dc7LN/D1 GPNpZJs5dUeEMsLWq1/A+GqWTBmmh//Lgs2W4sggpLd3X1yLO3d2h66cAglWOrXPcook 3je/3i/xh3F6FLZJJFdm6i5qA2GGIKa34cY4uhLT3TtP5wCWIeQbtc1ofadlrVc1QK4R gNdh0GnCsf8APcDG8gXEeAjaRpdpN/K7WXQKLQxJLsAdUhrskDpnbAxNEOss6sT2H6rn 27+A==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=v1W4MpYz; arc=pass (i=1 spf=pass spfdomain=linux.dev dkim=pass dkdomain=linux.dev dmarc=pass fromdomain=linux.dev); spf=pass (google.com: domain of linux-kernel+bounces-59304-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-59304-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev X-Forwarded-Encrypted: i=2; AJvYcCUqYh3FLacnnZ9330LUbrmYzhqzRWmfg70pRtaDHZYEmSB5EE5W3aUwHWjutwcW0Z4csxrW/bdy8VFkQJRRFtvZ8Rzt8A== Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [147.75.199.223]) by mx.google.com with ESMTPS id y12-20020a05622a120c00b0042bb00c9584si1711993qtx.506.2024.02.09.04.01.09 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 09 Feb 2024 04:01:09 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-59304-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) client-ip=147.75.199.223; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=v1W4MpYz; arc=pass (i=1 spf=pass spfdomain=linux.dev dkim=pass dkdomain=linux.dev dmarc=pass fromdomain=linux.dev); spf=pass (google.com: domain of linux-kernel+bounces-59304-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-59304-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id 589501C2352A for ; Fri, 9 Feb 2024 12:01:09 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 211B433CE9; Fri, 9 Feb 2024 12:00:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="v1W4MpYz" Received: from out-181.mta0.migadu.com (out-181.mta0.migadu.com [91.218.175.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 815162E854 for ; Fri, 9 Feb 2024 12:00:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.181 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707480049; cv=none; b=fRtbAxTAmtfWOJcgXLrevP8U1JA3LKAQR5qg54yOpihWaDwPga1GDWys11nDQxDtb79y965+6QVVdgAFfyOT0TPBbKISLBJhblMf9BjA1qYfJd36hV9kY9akGIR1RNRE+gnSqIgfUb9XYAb0RWXfj52A8KQ+GTdsLa491o2/fZ0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707480049; c=relaxed/simple; bh=Zu7L06jYFQlWzbX9n5ooim16fw606MESl/qmxOkoCFI=; h=From:To:Cc:Subject:Date:Message-Id:MIME-Version; b=SjVVgUYp9WXXLZ+1egeG/axcCF2inE1BR6guVu9jNU19IqOnGUXLmQEmId3yu98Z624/8YLp2oV6TwgXRhrCnUFvMgGDokhaMWpIr3XT2tI73n9OtXvltFn3rmPReoqbA39VvZcsB4+wpeADTrtThm3atnTXTdAq48J58/z41nc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=v1W4MpYz; arc=none smtp.client-ip=91.218.175.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1707480040; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=iu/SOvWf0hKfo9wq+PAj0R5rFj1AjOSdhRH6wieMVN0=; b=v1W4MpYzxWVO+BDXLSHgO0OdvuI77K1u6J0SQ1NbDuYsn97FQagFhUMfUk/7oIg/zvkIOZ 9f1OpxCvDilIWqNBjYVqIYqQHrSMCYJwFbscR+qsUlLyMMWwzeTbWGOCZ3XFFl/RmQj7VG 8zRRwO0/xgjsMmdwUAofSI+LFgpR1pQ= From: chengming.zhou@linux.dev To: willy@infradead.org, hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, akpm@linux-foundation.org Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, chengming.zhou@linux.dev, Chengming Zhou Subject: [PATCH RFC 0/1] mm/zswap: fix LRU reclaim for zswap writeback folios Date: Fri, 9 Feb 2024 11:59:49 +0000 Message-Id: <20240209115950.3885183-1-chengming.zhou@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1790422621059629794 X-GMAIL-MSGID: 1790422621059629794 From: Chengming Zhou The memory shrinker of zswap is special: it has to first allocate more memory (folio) to free less memory (compressed copy), later that folio can be freed after written back. So it's better to evict these folios before trying to reclaim other folios. Here may come some problems of LRU management: 1. zswap_writeback_entry() reuses the __read_swap_cache_async(), which is normally only used in the swapin context. This may cause refault accounting and active/workingset information of the folio to be wrong. For example, zswap shrinker writeback try to reclaim the folio, but workingset_refault() mark it active to put it at head of active list. 2. folio_end_writeback() will check PG_reclaim flag which we did set in zswap_writeback_entry(), to try to rotate the folio to the tail of inactive list, to speed up its reclaim. But folio_rotate_reclaimable() won't work as we thought, actually all LRU move interfaces may don't work when the folio is isolated from LRU. (per-cpu add batch is somewhat like isolated from LRU) So when folio_end_writeback() calls folio_rotate_reclaimable(), it won't do nothing but just clear PG_reclaim flag if that folio is isolated (in per-cpu add batch or isolated by vmscan shrinker) 3. so the final result is the folio that has been written back and is expected to be evicted, but now is not at tail of inactive list. Meanwhile vmscan shrinker may try to evict other folios to cause more refaults. There is a report [1] of this problem. We should handle these cases better. First of all, we should consider when and where to put these folios on LRU. 1. after alloc: now we use folio_add_lru() to put folio on local batch, so it will be put at head of inactive/active list when batch drain. 2. after writeback: clear PG_reclaim and folio_rotate_reclaimable(). - after add batch drain: rotate successfully to tail of inactive list - before add batch drain: do nothing since folio is not on LRU list So these are two main time points we care about: the first is somewhat correct IMHO since the folio is under writeback and has PG_reclaim set, it may confuse shrinker if we put those to the tail of inactive list too early. If we really want to put it to tail, we can easily introduce another folio_add_lru_tail() to put on a new local lru_add_tail batch. The second is where we need to improve, we should rotate it to tail even when folio is in local batch. Since we can't manipulate folio LRU status when it's isolated in local batch, an obvious fix is to use folio flag to tell later lru_add_fn() where the folio should be put: active or inactive, head or tail. But the problem is that PG_readahead is the alias of PG_reclaim, we can't put readahead folio to the tail of inactive list obviously. So this patch changes folio_rotate_reclaimable() to queue to local rotate batch even when !PG_lru at first, hoping that: - folio_end_writeback() finish on the same cpu with lru_add batch, so it must be handled after the lru_add batch, after which it will see PG_lru and successfully rotate it to tail of inactive list. - even folio_end_writeback() is not finished on the same cpu, there maybe a big chance that rotate batch is handled after add batch. Testing kernel build in tmpfs with memory.max = 2GB. (zswap shrinker and writeback enabled with one 50GB swapfile) mm-unstable-hot zswap-lru-reclaim real 63.34 62.72 user 1063.20 1060.30 sys 272.04 256.14 workingset_refault_anon 2103297.00 1788155.80 workingset_refault_file 28638.20 39249.40 workingset_activate_anon 746134.00 695435.40 workingset_activate_file 4344.60 4255.80 workingset_restore_anon 653163.80 605315.60 workingset_restore_file 1079.00 883.00 workingset_nodereclaim 0.00 0.00 pgscan 12971305.60 12730331.20 pgscan_kswapd 0.00 0.00 pgscan_direct 12971305.60 12730331.20 pgscan_khugepaged 0.00 0.00 We can see that refault and sys cpu have some improvements. As for the workingset_refault() caused by zswap writeback, maybe we should remove it in zswap writeback case, but there are more pgscan and some regression. I don't know why, so just leave it as it is. This is RFC, any comment or discussion is welcome! Thanks! [1] https://lore.kernel.org/all/20231024142706.195517-1-hezhongkun.hzk@bytedance.com/ Chengming Zhou (1): mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru() mm/swap.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)