From patchwork Tue Oct 17 16:13:01 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 154385
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
        David Hildenbrand <david@redhat.com>,
        Matthew Wilcox <willy@infradead.org>,
        Huang Ying <ying.huang@intel.com>,
        Gao Xiang <xiang@kernel.org>, Yu Zhao <yuzhao@google.com>,
        Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
        Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org,
        linux-mm@kvack.org
Subject: [PATCH v2 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from
 swap_cluster_info:flags
Date: Tue, 17 Oct 2023 17:13:01 +0100
Message-Id: <20231017161302.2518826-2-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231017161302.2518826-1-ryan.roberts@arm.com>
References: <20231017161302.2518826-1-ryan.roberts@arm.com>
MIME-Version: 1.0

As preparation for supporting small-sized THP in the swap-out path
without first needing to split to order-0, remove CLUSTER_FLAG_HUGE.
When present, the flag always implies a PMD-sized THP, which is the same
size as a cluster.

The only use of the flag was to determine whether a swap entry refers to
a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.
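
As a rough illustration of the nr_pages-based check, here is a userspace
model (not the kernel code). The sketch_* names are hypothetical, the
map is a plain byte array standing in for si->swap_map, locking is
omitted, and the 0x40 cache flag is assumed to match SWAP_HAS_CACHE:

```c
#include <stdbool.h>

/* Assumed to mirror SWAP_HAS_CACHE: set when the entry is in the swap cache. */
#define SKETCH_HAS_CACHE 0x40

/* Hypothetical stand-in for swap_count(): strip the cache flag. */
static unsigned char sketch_swap_count(unsigned char ent)
{
	return ent & ~SKETCH_HAS_CACHE;
}

/*
 * Return true if any entry in the naturally aligned, nr_pages-sized range
 * containing roffset still has a non-zero swap count. nr_pages must be a
 * power of 2. With nr_pages == 1 this degenerates to checking one entry.
 */
static bool sketch_range_swapped(const unsigned char *map,
				 unsigned long roffset,
				 unsigned int nr_pages)
{
	/* Equivalent to round_down(roffset, nr_pages) for a power of 2. */
	unsigned long offset = roffset & ~((unsigned long)nr_pages - 1);
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		if (sketch_swap_count(map[offset + i]))
			return true;
	}
	return false;
}
```

The point of the sketch is that the range size comes from the caller
rather than from a per-cluster flag, so the same loop covers any order.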

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
avoid bothering to call __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be equivalent.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 19f30a29e1f1..a073366a227c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c9cbcbf6697e..46b3fb943207 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2597,9 +2597,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(head->mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	for (i = 0; i < nr; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..b83ad77e04c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -342,18 +342,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count(ci, SWAPFILE_CLUSTER);
 
 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1354,7 +1342,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1362,7 +1349,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1384,23 +1370,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1508,22 +1477,23 @@ int swp_swapcount(swp_entry_t entry)
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1545,7 +1515,7 @@ static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;
 
-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }
 
 /**
@@ -1606,8 +1576,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
+		if (count == SWAP_HAS_CACHE)
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
 	}

From patchwork Tue Oct 17 16:13:02 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 154386
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
        David Hildenbrand <david@redhat.com>,
        Matthew Wilcox <willy@infradead.org>,
        Huang Ying <ying.huang@intel.com>,
        Gao Xiang <xiang@kernel.org>, Yu Zhao <yuzhao@google.com>,
        Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
        Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org,
        linux-mm@kvack.org
Subject: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
Date: Tue, 17 Oct 2023 17:13:02 +0100
Message-Id: <20231017161302.2518826-3-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231017161302.2518826-1-ryan.roberts@arm.com>
References: <20231017161302.2518826-1-ryan.roberts@arm.com>
MIME-Version: 1.0

The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However,
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure,
performance regresses. This is due to bottlenecking on the increased
number of TLBIs caused by all the extra folio splitting.

Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, as is already done
for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
enabled, and when the swap backing store is a non-rotating block device.
These are the same constraints as for the existing PMD-sized THP
swap-out support.

Note that no attempt is made to swap-in THP here - this is still done
page-by-page, like for PMD-sized THP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [4, (1
<< PMD_ORDER)] (THP cannot support order-1 folios). This is done by
allocating a cluster for each distinct order and allocating sequentially
from it until the cluster is full. This ensures that we don't need to
search the map and we get no fragmentation due to alignment padding for
different orders in the cluster. If there is no current cluster for a
given order, we attempt to allocate a free cluster from the list. If
there are no free clusters, we fail the allocation and the caller falls
back to splitting the folio and allocates individual entries (as per
existing PMD-sized THP fallback).
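
The allocation scheme described above can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions: the
sketch_* names are hypothetical, the free cluster list is reduced to a
simple counter, a cluster is assumed to hold 512 entries (PMD_ORDER = 9),
and locking, per-cpu state and swap_map updates are left out:

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_CLUSTER   512u	/* entries per cluster (assumed) */
#define SKETCH_NR_ORDERS 8	/* orders 2..9, indexed by (order - 2) */

struct sketch_allocator {
	/* Next free offset in the current cluster per order, or UINT_MAX. */
	unsigned int next[SKETCH_NR_ORDERS];
	/* Stand-in for the free cluster list: clusters are handed out in order. */
	unsigned int next_free_cluster;
	unsigned int nr_clusters;
};

static void sketch_init(struct sketch_allocator *a, unsigned int nr_clusters)
{
	unsigned int i;

	for (i = 0; i < SKETCH_NR_ORDERS; i++)
		a->next[i] = UINT_MAX;
	a->next_free_cluster = 0;
	a->nr_clusters = nr_clusters;
}

/* Allocate nr_pages contiguous entries; on success store the offset in *slot. */
static bool sketch_alloc(struct sketch_allocator *a, unsigned int nr_pages,
			 unsigned int *slot)
{
	unsigned int order = 0, idx, offset;

	/* Only powers of 2 in [4, SKETCH_CLUSTER] are supported. */
	if (nr_pages < 4 || nr_pages > SKETCH_CLUSTER ||
	    (nr_pages & (nr_pages - 1)))
		return false;

	while ((1u << order) < nr_pages)
		order++;
	idx = order - 2;

	offset = a->next[idx];
	if (offset == UINT_MAX) {
		/* No current cluster for this order: take a free one or fail. */
		if (a->next_free_cluster == a->nr_clusters)
			return false;
		offset = a->next_free_cluster++ * SKETCH_CLUSTER;
	}

	*slot = offset;

	/* Advance sequentially; drop the cluster once it has been filled. */
	offset += nr_pages;
	a->next[idx] = (offset % SKETCH_CLUSTER) ? offset : UINT_MAX;
	return true;
}
```

Because each order allocates sequentially from its own cluster, every
entry is naturally aligned to its size and no padding is needed. A
failed allocation corresponds to the caller falling back to splitting
the folio.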

The per-order current clusters are maintained per-cpu using the existing
percpu_cluster infrastructure. This is done to avoid interleaving pages
from different tasks, which would prevent IO being batched. This is
already done for the order-0 allocations so we follow the same pattern.

As far as I can tell, this should not cause any extra fragmentation
concerns, given how similar it is to the existing PMD-sized THP
allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus
clusters in concurrent use though, which in a pathological case (cluster
set aside for every order for every cpu and only one huge entry
allocated from it) would tie up ~12MiB of unused swap entries for these
high orders (assuming PMD_ORDER=9). In practice, the number of orders in
use will be small and the amount of swap space reserved is very small
compared to a typical swap file.
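
As a rough check of that estimate, the arithmetic can be sketched as
below. The assumptions are mine: 4KiB base pages, 512-entry clusters
(PMD_ORDER = 9), and exactly one entry allocated from one cluster for
each of the orders 2..8 (a PMD-order allocation consumes its whole
cluster, so it ties up nothing). Under those assumptions the result is
the unused space for a single cpu's set of clusters:

```c
#define SKETCH_PMD_ORDER 9	/* assumed, as in the text above */
#define SKETCH_PAGE_KIB  4	/* assumed 4KiB base pages */

/*
 * Unused swap space, in KiB, when each order below PMD_ORDER has one
 * cluster set aside with a single entry allocated from it.
 */
static unsigned long sketch_unused_kib(void)
{
	unsigned long cluster = 1ul << SKETCH_PMD_ORDER;
	unsigned long unused = 0;
	unsigned int order;

	for (order = 2; order < SKETCH_PMD_ORDER; order++)
		unused += cluster - (1ul << order);

	return unused * SKETCH_PAGE_KIB;
}
```

With these assumptions, 7 clusters of 512 entries minus the 508 entries
actually allocated leaves 3076 entries, or 12304KiB, which is roughly
12MiB.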

Note that PMD_ORDER is not compile-time constant on powerpc, so we have
to allocate the large_next[] array at runtime.

I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
device as the swap device and from inside a memcg limited to 40G memory.
I've then run `usemem` from vm-scalability with 70 processes (each has
its own core), each allocating and writing 1G of memory. I've repeated
everything 5 times and taken the mean and stdev:

Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |       + this series |
|            |  v6.6-rc4+anonfolio |                     |
|:-----------|--------------------:|--------------------:|
| 4K Page    |                0.0% |                1.1% |
| 64K THP    |              -44.1% |                0.9% |
| 2M THP     |               56.0% |               56.4% |

So with this change, the regression for 64K swap performance goes away.
Both 4K and 64K benchmarks are now bottlenecked on TLBI performance from
try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in
the test, I see up to 2x performance of 64K THP swapping compared to 4K.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  6 ++++
 mm/swapfile.c        | 74 +++++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 10 +++---
 3 files changed, 71 insertions(+), 19 deletions(-)

--
2.25.1

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..35cbbe6509a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,6 +268,12 @@ struct swap_cluster_info {
 struct percpu_cluster {
 	struct swap_cluster_info index; /* Current cluster index */
 	unsigned int next; /* Likely next allocation offset */
+	unsigned int large_next[];	/*
+					 * next free offset within current
+					 * allocation cluster for large folios,
+					 * or UINT_MAX if no current cluster.
+					 * Index is (order - 2).
+					 */
 };

 struct swap_cluster_list {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b83ad77e04c0..625964e53c22 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }

-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
+			    unsigned int nr_pages)
 {
+	int order_idx;
 	unsigned long idx;
 	struct swap_cluster_info *ci;
+	struct percpu_cluster *cluster;
 	unsigned long offset;

 	/*
 	 * Should not even be attempting cluster allocations when huge
 	 * page swap is disabled.  Warn and fail the allocation.
 	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
+	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+	    nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER ||
+	    !is_power_of_2(nr_pages)) {
 		VM_WARN_ON_ONCE(1);
 		return 0;
 	}

-	if (cluster_list_empty(&si->free_clusters))
+	/*
+	 * Not using clusters so unable to allocate large entries.
+	 */
+	if (!si->cluster_info)
 		return 0;

-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
+	order_idx = ilog2(nr_pages) - 2;
+	cluster = this_cpu_ptr(si->percpu_cluster);
+	offset = cluster->large_next[order_idx];
+
+	if (offset == UINT_MAX) {
+		if (cluster_list_empty(&si->free_clusters))
+			return 0;
+
+		idx = cluster_list_first(&si->free_clusters);
+		offset = idx * SWAPFILE_CLUSTER;

-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+		ci = lock_cluster(si, offset);
+		alloc_cluster(si, idx);
+		cluster_set_count(ci, SWAPFILE_CLUSTER);
+
+		/*
+		 * If scan_swap_map_slots() can't find a free cluster, it will
+		 * check si->swap_map directly. To make sure this standby
+		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
+		 * entries bad (occupied). (same approach as discard).
+		 */
+		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
+			SWAPFILE_CLUSTER - nr_pages);
+	} else {
+		idx = offset / SWAPFILE_CLUSTER;
+		ci = lock_cluster(si, offset);
+	}
+
+	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
 	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
+	swap_range_alloc(si, offset, nr_pages);
 	*slot = swp_entry(si->type, offset);

+	offset += nr_pages;
+	if (idx != offset / SWAPFILE_CLUSTER)
+		offset = UINT_MAX;
+	cluster->large_next[order_idx] = offset;
+
 	return 1;
 }

@@ -1041,7 +1076,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int node;

 	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	WARN_ON_ONCE(n_goal > 1 && size > 1);

 	spin_lock(&swap_avail_lock);

@@ -1078,14 +1113,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
+		if (size > 1) {
 			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
+				n_ret = swap_alloc_large(si, swp_entries, size);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();

@@ -3046,6 +3081,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && bdev_nonrot(p->bdev)) {
 		int cpu;
 		unsigned long ci, nr_cluster;
+		int nr_order;
+		int i;

 		p->flags |= SWP_SOLIDSTATE;
 		p->cluster_next_cpu = alloc_percpu(unsigned int);
@@ -3073,7 +3110,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		for (ci = 0; ci < nr_cluster; ci++)
 			spin_lock_init(&((cluster_info + ci)->lock));

-		p->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER - 1 : 0;
+		p->percpu_cluster = __alloc_percpu(
+					struct_size(p->percpu_cluster,
+						    large_next,
+						    nr_order),
+					__alignof__(struct percpu_cluster));
 		if (!p->percpu_cluster) {
 			error = -ENOMEM;
 			goto bad_swap_unlock_inode;
@@ -3082,6 +3124,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			struct percpu_cluster *cluster;
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
 			cluster_set_null(&cluster->index);
+			for (i = 0; i < nr_order; i++)
+				cluster->large_next[i] = UINT_MAX;
 		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c16e2b1ea8ae..5984d2ae4547 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split PMD-mappable folios without a
+					 * PMD map right away. Chances are some
+					 * or all of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (folio_test_pmd_mappable(folio) &&
+					    !folio_entire_mapcount(folio) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;