From patchwork Fri Nov 24 13:26:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169418 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191191vqx; Fri, 24 Nov 2023 05:28:51 -0800 (PST) X-Google-Smtp-Source: AGHT+IHBBLQ+aHQXos17CeXspEwNUTLHw3nyvS26EggIuFS63+UFxmHP8zWd3M/jy+9157nvfv9v X-Received: by 2002:a17:90b:164b:b0:285:8a70:b56b with SMTP id il11-20020a17090b164b00b002858a70b56bmr1944480pjb.37.1700832531082; Fri, 24 Nov 2023 05:28:51 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832531; cv=none; d=google.com; s=arc-20160816; b=xdUESU6uOJbbf30pq5hvGqBZwnlGn/W7ZzND7YIdG7bt0rNxKF82RlUtiesrDg2pkL ee8Ixa5fc0Gdm25qLlst1/7YxadnSaWhcCt2WraPuRsmlaZMQwKjgqV7iLGqEOp+xmVq ovlUlv+9U31/RXOwIRKFHDez9H7E3zayd9sRYg4JBfTtoZGHXT2legM+Rv3UUq1LeUaR Fd9IgUZF+l7kuTQrpLd0lPrZQbnzH3vr08VrBhuPaOH4ZzMQDR/fVUwr3z2vaVnbLuEG GWUeFmLBuis8b0tWOM+95mcrKLxr4yOYIoHck4xyUjnAJ35bSQkFFAgQVOGd+PYt1q58 Fa1g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=2NlC2FNJ1Esba7mAbHYEoZIvTqXmerwYmRl67CA2zqk=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=wTYcKonWNrKmXD0Wg6/dYfnTts0S2m/YZPcpc4VIRt8tdYaLwJDGug2kMzgKHPcsGS 1S7XPrTB7cmGXnhBBwwfDg0ovqqnVW+/NfJfBf1OKP8RKDzEcRfNw0Gnqb9oPvfHn+qW Iddgw/4A3TUigibvQ/E+/jElzSzW9WAxTDiu4+uLqXsem4GbubFUFCRsgiB9kWvQX1uZ 9z05c+8JiVEd9dhoixCp7DapiSF0md7xBJsHS0c6dhV44NangNGhKv4IYIMeLhN5xwcR 46GX4aBRcL8iseISplHzUGKgLcewEIUVTC8LJZ6qJJTJiCkJkYBQGRDN20dA3xsA3epH Ywhw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=huSZIS7Z; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[2620:137:e000::3:7]) by mx.google.com with ESMTPS id e2-20020a170902744200b001c3411c9b83si3213964plt.454.2023.11.24.05.28.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:28:51 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=huSZIS7Z; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 5356E80A8B5F; Fri, 24 Nov 2023 05:26:55 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231233AbjKXN0j (ORCPT + 99 others); Fri, 24 Nov 2023 08:26:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46684 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231177AbjKXN0f (ORCPT ); Fri, 24 Nov 2023 08:26:35 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A2B1810C6 for ; Fri, 24 Nov 2023 05:26:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832400; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=2NlC2FNJ1Esba7mAbHYEoZIvTqXmerwYmRl67CA2zqk=; b=huSZIS7ZCZkJGKeKPRlHnAkMuo0OBY7JGZP8+/BCn5J+kOXfC8lbohmLRLnKrmKgA4EZ+/ zDgLByYhMR+rTvLoZXxVtaR0MAdWmZaXLgMvZDRISqJPtOYUoimusBtF4sBq9huyh2Ky6V KL5Di6fvd7+5QHda4J3e4hx32PKVACw= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-529-4jiaO-6aNt2opR94wzdfSg-1; Fri, 24 Nov 2023 08:26:35 -0500 X-MC-Unique: 4jiaO-6aNt2opR94wzdfSg-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1C5F185A58A; Fri, 24 Nov 2023 13:26:35 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 9B2C32166B2A; Fri, 24 Nov 2023 13:26:31 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 01/20] mm/rmap: factor out adding folio range into __folio_add_rmap_range() Date: Fri, 24 Nov 2023 14:26:06 +0100 Message-ID: <20231124132626.235350-2-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:26:57 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452172491865453 X-GMAIL-MSGID: 1783452172491865453 Let's factor it out, optimize for small folios, and add some more sanity checks. Signed-off-by: David Hildenbrand --- mm/rmap.c | 119 ++++++++++++++++++++++++------------------------------ 1 file changed, 53 insertions(+), 66 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 7a27a2b41802..afddf3d82a8f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1127,6 +1127,54 @@ int folio_total_mapcount(struct folio *folio) return mapcount; } +static unsigned int __folio_add_rmap_range(struct folio *folio, + struct page *page, unsigned int nr_pages, bool compound, + int *nr_pmdmapped) +{ + atomic_t *mapped = &folio->_nr_pages_mapped; + int first, nr = 0; + + VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); + VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); + VM_WARN_ON_FOLIO(compound && nr_pages != folio_nr_pages(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_large(folio) && nr_pages != 1, folio); + + if (likely(!folio_test_large(folio))) + return atomic_inc_and_test(&page->_mapcount); + + /* Is page being mapped by PTE? Is this its first map to be added? */ + if (!compound) { + do { + first = atomic_inc_and_test(&page->_mapcount); + if (first) { + first = atomic_inc_return_relaxed(mapped); + if (first < COMPOUND_MAPPED) + nr++; + } + } while (page++, --nr_pages > 0); + } else if (folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + first = atomic_inc_and_test(&folio->_entire_mapcount); + if (first) { + nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped); + if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of a remove and another add? */ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* Raced ahead of a remove of COMPOUND_MAPPED */ + nr = 0; + } + } + } else { + VM_WARN_ON_ONCE_FOLIO(true, folio); + } + return nr; +} + /** * folio_move_anon_rmap - move a folio to our anon_vma * @folio: The folio to move to our anon_vma @@ -1227,38 +1275,10 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address, rmap_t flags) { struct folio *folio = page_folio(page); - atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + unsigned int nr, nr_pmdmapped = 0; bool compound = flags & RMAP_COMPOUND; - bool first; - - /* Is page being mapped by PTE? Is this its first map to be added? 
*/ - if (likely(!compound)) { - first = atomic_inc_and_test(&page->_mapcount); - nr = first; - if (first && folio_test_large(folio)) { - nr = atomic_inc_return_relaxed(mapped); - nr = (nr < COMPOUND_MAPPED); - } - } else if (folio_test_pmd_mappable(folio)) { - /* That test is redundant: it's for safety or to optimize out */ - - first = atomic_inc_and_test(&folio->_entire_mapcount); - if (first) { - nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped); - if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { - nr_pmdmapped = folio_nr_pages(folio); - nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); - /* Raced ahead of a remove and another add? */ - if (unlikely(nr < 0)) - nr = 0; - } else { - /* Raced ahead of a remove of COMPOUND_MAPPED */ - nr = 0; - } - } - } + nr = __folio_add_rmap_range(folio, page, 1, compound, &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); if (nr) @@ -1349,43 +1369,10 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, unsigned int nr_pages, struct vm_area_struct *vma, bool compound) { - atomic_t *mapped = &folio->_nr_pages_mapped; - unsigned int nr_pmdmapped = 0, first; - int nr = 0; - - VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); - - /* Is page being mapped by PTE? Is this its first map to be added? */ - if (likely(!compound)) { - do { - first = atomic_inc_and_test(&page->_mapcount); - if (first && folio_test_large(folio)) { - first = atomic_inc_return_relaxed(mapped); - first = (first < COMPOUND_MAPPED); - } - - if (first) - nr++; - } while (page++, --nr_pages > 0); - } else if (folio_test_pmd_mappable(folio)) { - /* That test is redundant: it's for safety or to optimize out */ - - first = atomic_inc_and_test(&folio->_entire_mapcount); - if (first) { - nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped); - if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { - nr_pmdmapped = folio_nr_pages(folio); - nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); - /* Raced ahead of a remove and another add? */ - if (unlikely(nr < 0)) - nr = 0; - } else { - /* Raced ahead of a remove of COMPOUND_MAPPED */ - nr = 0; - } - } - } + unsigned int nr, nr_pmdmapped = 0; + nr = __folio_add_rmap_range(folio, page, nr_pages, compound, + &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ? 
NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped); From patchwork Fri Nov 24 13:26:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169412 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1190090vqx; Fri, 24 Nov 2023 05:27:18 -0800 (PST) X-Google-Smtp-Source: AGHT+IHTHae9JJ3zlYygxAgGAezxJmkYeqYlUd/5WR/lhtCrpQlhsb5vwxiCC/x4WShucJu26wAJ X-Received: by 2002:a05:6e02:1aab:b0:359:d2ed:15f4 with SMTP id l11-20020a056e021aab00b00359d2ed15f4mr3833882ilv.8.1700832438601; Fri, 24 Nov 2023 05:27:18 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832438; cv=none; d=google.com; s=arc-20160816; b=Fgjx8d1i5YixiogX0GIqfHlhd6jJbmX37ZLeem9Y+5ZgqrPfvOkuOzaVFZj8ES4BSu aczV9xIksSKn1fuJc0uhODzjUQJc4BqWEBZaImD4UBF8pvY36Rs/74uC1VXOh2ssjlFG 55/Wq6pDmLZtm3G77/tyFfDz3AO/8YlciLgehiRjlFVX9LgtaQ2A9xGOsugrEKQYp8kK CkNlvirnfUtzNGl30ibGFryO3G3CGFUg4B39AC36TydOXmWROPbcUDTSNrFr+DmvTPxn oC0ajtTs5vL40vhwSvCrXLGR4cBFBqCwFnELbyPp9J62aO146r2z4oIRV7YZDCCGywbF BBJw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=/66tdUTtIlpBX26NzgkLUQsEQZwgNiRXmWKFnjYfBOk=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=EmDqXQyLJPGSCrN+P86A8kfXWgc1nADpkrQnsAhPILQ6kFKogW/cUn8aqy+/9WPlO1 MENSc6lFKHyRFguOgTrAftooXlSUfC+gvlmXw5l68haO4aU1duPbmDpmU2uoQwVdzy/D VhKRG7CLRNY4rv5AlrLXjetIwtOBRiNJej2CkfyUMQaOljnBw4LR1iOr4AaNsOWa+tta 0s0mTvOnmyllvwk37FYwX/gqZX2T670gM5Bp0spcmxoD4QanNisEEGa3R1ODQ0JSDSqS 6ZsS/iMLYCKwHU8BC4CV4iBSGU3qVGHBhhLC4TFDI1dwAOgU0RhEBUvfCGMG0qr1r02j 3joQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="CrFVJ/yg"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from pete.vger.email (pete.vger.email. 
[23.128.96.36]) by mx.google.com with ESMTPS id s18-20020a635252000000b005c2201d6a55si3443451pgl.39.2023.11.24.05.27.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:27:18 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) client-ip=23.128.96.36; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="CrFVJ/yg"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id 30D1180AE573; Fri, 24 Nov 2023 05:27:10 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345280AbjKXN0m (ORCPT + 99 others); Fri, 24 Nov 2023 08:26:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46694 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229833AbjKXN0i (ORCPT ); Fri, 24 Nov 2023 08:26:38 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DA6141BE for ; Fri, 24 Nov 2023 05:26:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832403; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/66tdUTtIlpBX26NzgkLUQsEQZwgNiRXmWKFnjYfBOk=; b=CrFVJ/ygWBgIybi1/fSnA/wAKPEiWLgD0ZYic3dKzEo7XN3acPK5j4U63r7qSRCpCLuDDJ nkO4oYWIVrbYFGdhp025iQF6a2Eay6m+71puab/MJgGnD6Y63sJ1ovmczu/IfFKQu/oi39 MsNZ0NV1Pjophm8DPa4gTe83gBX68CA= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-74-ojdg3s0mO9myWpWvuZxs4w-1; Fri, 24 Nov 2023 08:26:39 -0500 X-MC-Unique: ojdg3s0mO9myWpWvuZxs4w-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 01090185A781; Fri, 24 Nov 2023 13:26:39 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7B9A22166B2A; Fri, 24 Nov 2023 13:26:35 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 02/20] mm: add a total mapcount for large folios Date: Fri, 24 Nov 2023 14:26:07 +0100 Message-ID: <20231124132626.235350-3-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:10 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452074892216481 X-GMAIL-MSGID: 1783452074892216481 Let's track the total mapcount for all large folios in the first subpage. The total mapcount is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and soon more widely used, we want to avoid looping over all pages of a folio just to calculate the total mapcount. Further, we might soon want to use the total mapcount in other context more frequently, so prepare for reading it efficiently and atomically. Maintain the total mapcount also for hugetlb pages. Use the total mapcount to implement folio_mapcount(). Make folio_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped() and move folio_large_total_mapcount() to mm.h. Similarly, get rid of folio_nr_pages_mapped() and stop dumping that value in __dump_page(). While at it, simplify total_mapcount() by calling folio_mapcount() and page_mapped() by calling folio_mapped(): it seems to add only one more MOV instruction on x86-64 to the compiled code, which we shouldn't have to worry about. _nr_pages_mapped is now only used in rmap code, so not accidentally externally where it might be used on arbitrary order-1 pages. The remaining usage is: (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED -> If we would account the total folio as mapped when mapping a page (based on the total mapcount), we could remove that usage. We'll have to be careful about memory-sensitive applications that also adjust /sys/kernel/debug/fault_around_bytes to not get a large folio completely mapped on page fault. (2) Detect when to add an anon folio to the deferred split queue -> If we would apply a different heuristic, or scan using the rmap on the memory reclaim path for partially mapped anon folios to split them, we could remove that usage as well. For now, these things remain as they are, they need more thought. Hugh really did a fantastic job implementing that tracking after all. Note that before the total mapcount would overflow, already our refcount would overflow: each distinct mapping requires a distinct reference. Probably, in the future, we want 64bit refcount+mapcount for larger folios. 
Reviewed-by: Zi Yan Reviewed-by: Ryan Roberts Reviewed-by: Yin Fengwei Signed-off-by: David Hildenbrand --- Documentation/mm/transhuge.rst | 12 +++++------ include/linux/mm.h | 37 +++++++++----------------------- include/linux/mm_types.h | 5 +++-- include/linux/rmap.h | 15 ++++++++----- mm/debug.c | 4 ++-- mm/hugetlb.c | 4 ++-- mm/internal.h | 10 +-------- mm/page_alloc.c | 4 ++++ mm/rmap.c | 39 ++++++++++++---------------------- 9 files changed, 52 insertions(+), 78 deletions(-) diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 9a607059ea11..b0d3b1d3e8ea 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -116,14 +116,14 @@ pages: succeeds on tail pages. - map/unmap of a PMD entry for the whole THP increment/decrement - folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount - goes from -1 to 0 or 0 to -1. + folio->_entire_mapcount, increment/decrement folio->_total_mapcount + and also increment/decrement folio->_nr_pages_mapped by COMPOUND_MAPPED + when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement - page->_mapcount and also increment/decrement folio->_nr_pages_mapped - when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts - the number of pages mapped by PTE. + page->_mapcount, increment/decrement folio->_total_mapcount and also + increment/decrement folio->_nr_pages_mapped when page->_mapcount goes + from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE. split_huge_page internally has to distribute the refcounts in the head page to the tail pages before clearing all PG_head/tail bits from the page diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..fe91aaefa3db 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1207,17 +1207,16 @@ static inline int page_mapcount(struct page *page) return mapcount; } -int folio_total_mapcount(struct folio *folio); +static inline int folio_total_mapcount(struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large(folio), folio); + return atomic_read(&folio->_total_mapcount) + 1; +} /** - * folio_mapcount() - Calculate the number of mappings of this folio. + * folio_mapcount() - Number of mappings of this folio. * @folio: The folio. * - * A large folio tracks both how many times the entire folio is mapped, - * and how many times each individual page in the folio is mapped. - * This function calculates the total number of times the folio is - * mapped. - * * Return: The number of times this folio is mapped. */ static inline int folio_mapcount(struct folio *folio) @@ -1229,19 +1228,7 @@ static inline int folio_mapcount(struct folio *folio) static inline int total_mapcount(struct page *page) { - if (likely(!PageCompound(page))) - return atomic_read(&page->_mapcount) + 1; - return folio_total_mapcount(page_folio(page)); -} - -static inline bool folio_large_is_mapped(struct folio *folio) -{ - /* - * Reading _entire_mapcount below could be omitted if hugetlb - * participated in incrementing nr_pages_mapped when compound mapped. 
- */ - return atomic_read(&folio->_nr_pages_mapped) > 0 || - atomic_read(&folio->_entire_mapcount) >= 0; + return folio_mapcount(page_folio(page)); } /** @@ -1252,9 +1239,7 @@ static inline bool folio_large_is_mapped(struct folio *folio) */ static inline bool folio_mapped(struct folio *folio) { - if (likely(!folio_test_large(folio))) - return atomic_read(&folio->_mapcount) >= 0; - return folio_large_is_mapped(folio); + return folio_mapcount(folio) > 0; } /* @@ -1264,9 +1249,7 @@ static inline bool folio_mapped(struct folio *folio) */ static inline bool page_mapped(struct page *page) { - if (likely(!PageCompound(page))) - return atomic_read(&page->_mapcount) >= 0; - return folio_large_is_mapped(page_folio(page)); + return folio_mapped(page_folio(page)); } static inline struct page *virt_to_head_page(const void *x) @@ -2139,7 +2122,7 @@ static inline size_t folio_size(struct folio *folio) * looking at the precise mapcount of the first subpage in the folio, and * assuming the other subpages are the same. This may not be true for large * folios. If you want exact mapcounts for exact calculations, look at - * page_mapcount() or folio_total_mapcount(). + * page_mapcount() or folio_mapcount(). * * Return: The estimated number of processes sharing a folio. */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 957ce38768b2..99b84b4797b9 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -264,7 +264,8 @@ typedef struct { * @virtual: Virtual address in the kernel direct map. * @_last_cpupid: IDs of last CPU and last process that accessed the folio. * @_entire_mapcount: Do not use directly, call folio_entire_mapcount(). - * @_nr_pages_mapped: Do not use directly, call folio_mapcount(). + * @_total_mapcount: Do not use directly, call folio_mapcount(). + * @_nr_pages_mapped: Do not use outside of rmap code. * @_pincount: Do not use directly, call folio_maybe_dma_pinned(). * @_folio_nr_pages: Do not use directly, call folio_nr_pages(). * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h. 
@@ -323,8 +324,8 @@ struct folio { struct { unsigned long _flags_1; unsigned long _head_1; - unsigned long _folio_avail; /* public: */ + atomic_t _total_mapcount; atomic_t _entire_mapcount; atomic_t _nr_pages_mapped; atomic_t _pincount; diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b26fe858fd44..42e2c74d4d6e 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -210,14 +210,19 @@ void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *, static inline void __page_dup_rmap(struct page *page, bool compound) { - if (compound) { - struct folio *folio = (struct folio *)page; + struct folio *folio = page_folio(page); - VM_BUG_ON_PAGE(compound && !PageHead(page), page); - atomic_inc(&folio->_entire_mapcount); - } else { + VM_BUG_ON_PAGE(compound && !PageHead(page), page); + if (likely(!folio_test_large(folio))) { atomic_inc(&page->_mapcount); + return; } + + if (compound) + atomic_inc(&folio->_entire_mapcount); + else + atomic_inc(&page->_mapcount); + atomic_inc(&folio->_total_mapcount); } static inline void page_dup_file_rmap(struct page *page, bool compound) diff --git a/mm/debug.c b/mm/debug.c index ee533a5ceb79..97f6f6b32ae7 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -99,10 +99,10 @@ static void __dump_page(struct page *page) page, page_ref_count(head), mapcount, mapping, page_to_pgoff(page), page_to_pfn(page)); if (compound) { - pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n", + pr_warn("head:%p order:%u entire_mapcount:%d total_mapcount:%d pincount:%d\n", head, compound_order(head), folio_entire_mapcount(folio), - folio_nr_pages_mapped(folio), + folio_mapcount(folio), atomic_read(&folio->_pincount)); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1169ef2f2176..cf84784064c7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1509,7 +1509,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio, struct page *p; atomic_set(&folio->_entire_mapcount, 0); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, 0); atomic_set(&folio->_pincount, 0); for (i = 1; i < nr_pages; i++) { @@ -2119,7 +2119,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, /* we rely on prep_new_hugetlb_folio to set the destructor */ folio_set_order(folio, order); atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_pincount, 0); return true; diff --git a/mm/internal.h b/mm/internal.h index b61034bd50f5..bb2e55c402e7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -67,15 +67,6 @@ void page_writeback_init(void); */ #define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ -/* - * How many individual pages have an elevated _mapcount. Excludes - * the folio's entire_mapcount. 
- */ -static inline int folio_nr_pages_mapped(struct folio *folio) -{ - return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED; -} - static inline void *folio_raw_mapping(struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; @@ -429,6 +420,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order) struct folio *folio = (struct folio *)page; folio_set_order(folio, order); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_entire_mapcount, -1); atomic_set(&folio->_nr_pages_mapped, 0); atomic_set(&folio->_pincount, 0); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 733732e7e0ba..aad45758c0c7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -988,6 +988,10 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) bad_page(page, "nonzero entire_mapcount"); goto out; } + if (unlikely(atomic_read(&folio->_total_mapcount) + 1)) { + bad_page(page, "nonzero total_mapcount"); + goto out; + } if (unlikely(atomic_read(&folio->_nr_pages_mapped))) { bad_page(page, "nonzero nr_pages_mapped"); goto out; diff --git a/mm/rmap.c b/mm/rmap.c index afddf3d82a8f..38765796dca8 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1104,35 +1104,12 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, return page_vma_mkclean_one(&pvmw); } -int folio_total_mapcount(struct folio *folio) -{ - int mapcount = folio_entire_mapcount(folio); - int nr_pages; - int i; - - /* In the common case, avoid the loop when no pages mapped by PTE */ - if (folio_nr_pages_mapped(folio) == 0) - return mapcount; - /* - * Add all the PTE mappings of those pages mapped by PTE. - * Limit the loop to folio_nr_pages_mapped()? - * Perhaps: given all the raciness, that may be a good or a bad idea. - */ - nr_pages = folio_nr_pages(folio); - for (i = 0; i < nr_pages; i++) - mapcount += atomic_read(&folio_page(folio, i)->_mapcount); - - /* But each of those _mapcounts was based on -1 */ - mapcount += nr_pages; - return mapcount; -} - static unsigned int __folio_add_rmap_range(struct folio *folio, struct page *page, unsigned int nr_pages, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; - int first, nr = 0; + int first, count, nr = 0; VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); @@ -1144,6 +1121,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, /* Is page being mapped by PTE? Is this its first map to be added? 
*/ if (!compound) { + count = nr_pages; do { first = atomic_inc_and_test(&page->_mapcount); if (first) { @@ -1151,7 +1129,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, if (first < COMPOUND_MAPPED) nr++; } - } while (page++, --nr_pages > 0); + } while (page++, --count > 0); + atomic_add(nr_pages, &folio->_total_mapcount); } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1169,6 +1148,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, nr = 0; } } + atomic_inc(&folio->_total_mapcount); } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } @@ -1348,6 +1328,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); } + if (folio_test_large(folio)) + /* increment count (starts at -1) */ + atomic_set(&folio->_total_mapcount, 0); + __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); __folio_set_anon(folio, vma, address, true); SetPageAnonExclusive(&folio->page); @@ -1427,6 +1411,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(compound && !PageHead(page), page); + if (folio_test_large(folio)) + atomic_dec(&folio->_total_mapcount); + /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ @@ -2576,6 +2563,7 @@ void hugepage_add_anon_rmap(struct folio *folio, struct vm_area_struct *vma, VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_total_mapcount); if (flags & RMAP_EXCLUSIVE) SetPageAnonExclusive(&folio->page); VM_WARN_ON_FOLIO(folio_entire_mapcount(folio) > 1 && @@ -2588,6 +2576,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio, BUG_ON(address < vma->vm_start || address >= vma->vm_end); /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); + atomic_set(&folio->_total_mapcount, 0); folio_clear_hugetlb_restore_reserve(folio); __folio_set_anon(folio, vma, address, true); SetPageAnonExclusive(&folio->page); From patchwork Fri Nov 24 13:26:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169413 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1190376vqx; Fri, 24 Nov 2023 05:27:43 -0800 (PST) X-Google-Smtp-Source: AGHT+IF5Nr8nn8XcGZPCJ1pBCtXwJNXwtQ+gO51EwFlRyhWU154Ak/YleLKhqwj5ud3Huoq8nxrD X-Received: by 2002:a05:6a20:e68b:b0:18b:d09d:6910 with SMTP id mz11-20020a056a20e68b00b0018bd09d6910mr4312817pzb.29.1700832463431; Fri, 24 Nov 2023 05:27:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832463; cv=none; d=google.com; s=arc-20160816; b=k3KvETnZi39aJ8j9+fhl2yp/5hpOY4/9gr4mq9Utm+WzGlPG3T8vEGjM30TXm2F5lr RX74imC8ElNoOMyxBNhRMuJISJijX+ZQQW/8O1nI+o7CBzqPInWXFyHlljhfQYcFtJ4i hwzRyaIRf+gNnQXKqPoLmWkdzPLIO7hBBvJV+1MHN4b3NrEsLjzJOBFy6El6Y3UiEQto 19GgicWw/xO/dNUjxBwClamC6giVxOZmSXhmMQGOud0lxSQORTKn3W0VapqKoe7rK86O +2BGdiZ41BH+9BXUugv0KY70j8s+4Q+RO8YHgbRcyeO1Kvy4iCaYIwQT0i5yZxecdiBL XGxQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=NG++BiHeM5ZPoN7pJxd4ztX7J63H+nPZ6vnmgIm1VTs=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; 
b=k1M6Is1qJDjuMvt+MxqiwiWRgJTTtdaGCqEbTwt4d6wKhqA8sk+zNZhFD0wQw9EMLv YFK1slh2H/7Tim/uPwBvu4B2onXV6J//FMNgWTKoT7a0xLu3WifuaamFhrKhymJZ0Ye4 Gr3dTbJNxquU9XRxsjH2DFgjhjrdDiu008H1S1dxEyj62NbRVkdGBzt1HcHFcRx1Sa6R hVFfpWV+bS3nVO5JKqHqSBFMgoTsCU5EkXSld5O+jD9tK/LMwoRBt+blgc/djf1+Upku tx5KNXCsaNSQpG4+wea+ji07lQy41h8gkz//pLm53tTkxAgXu0Lvds3lJlf+Onr4+fOY KHMw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=OylzUnkJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from howler.vger.email (howler.vger.email. [23.128.96.34]) by mx.google.com with ESMTPS id u9-20020a631409000000b005859b2d8d7asi3448415pgl.4.2023.11.24.05.27.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:27:43 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) client-ip=23.128.96.34; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=OylzUnkJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id A2C0A8047649; Fri, 24 Nov 2023 05:27:24 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231233AbjKXN1B (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:01 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46782 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345265AbjKXN0l (ORCPT ); Fri, 24 Nov 2023 08:26:41 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 79F7810C6 for ; Fri, 24 Nov 2023 05:26:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832406; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=NG++BiHeM5ZPoN7pJxd4ztX7J63H+nPZ6vnmgIm1VTs=; b=OylzUnkJL6FYu3Btmdy0zUw1R1cv0f54gxDEhntt2B3N3j/XH+YD8SaeA0dZZFEPNFLgCa qpwsi4c9Q8S7x8buln12N0uGRmR6PINAIqN2v4UOn31dPGjm8e6tYlq7pIGCH99V7nGdTq 70/DFohTmolUdY7h+KBf5SSvi2IVkOw= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-628-ZJGKa9ZtNFmIPA6BX8h3SQ-1; Fri, 24 Nov 2023 08:26:43 -0500 X-MC-Unique: ZJGKa9ZtNFmIPA6BX8h3SQ-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 0023E2806052; Fri, 24 Nov 2023 13:26:43 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com 
(Postfix) with ESMTP id 476F32166B2A; Fri, 24 Nov 2023 13:26:39 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. McKenney" Subject: [PATCH WIP v1 03/20] mm: convert folio_estimated_sharers() to folio_mapped_shared() and improve it Date: Fri, 24 Nov 2023 14:26:08 +0100 Message-ID: <20231124132626.235350-4-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:24 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452101322559914 X-GMAIL-MSGID: 1783452101322559914 Callers of folio_estimated_sharers() only care about "mapped shared vs. mapped exclusively". Let's rename the function and improve our detection for partially-mappable folios (i.e., PTE-mapped THPs). For now we can only implement, based on our guess, "certainly mapped shared vs. maybe mapped exclusively". Ideally, we'd have something like "maybe mapped shared vs. certainly mapped exclusively" -- or even better "certainly mapped shared vs. certainly mapped exclusively" instead. But these semantics are currently impossible with the guess-based heuristic we apply to partially-mappable folios. Naming the function "folio_certainly_mapped_shared" would be possible, but let's just keep it simple and call it "folio_mapped_shared" and document the fuzziness that applies for now. As we can now read the total mapcount of large folios very efficiently, use that to improve our implementation, falling back to making a guess only when the folio is not "obviously mapped shared". Signed-off-by: David Hildenbrand --- include/linux/mm.h | 68 +++++++++++++++++++++++++++++++++++++++------- mm/huge_memory.c | 2 +- mm/madvise.c | 6 ++-- mm/memory.c | 2 +- mm/mempolicy.c | 14 ++++------ mm/migrate.c | 2 +- 6 files changed, 70 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index fe91aaefa3db..17dac913f367 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2114,21 +2114,69 @@ static inline size_t folio_size(struct folio *folio) } /** - * folio_estimated_sharers - Estimate the number of sharers of a folio. + * folio_mapped_shared - Report if a folio is certainly mapped by + * multiple entities in their page tables * @folio: The folio. * - * folio_estimated_sharers() aims to serve as a function to efficiently - * estimate the number of processes sharing a folio. This is done by - * looking at the precise mapcount of the first subpage in the folio, and - * assuming the other subpages are the same. This may not be true for large - * folios. If you want exact mapcounts for exact calculations, look at - * page_mapcount() or folio_mapcount().
+ * This function checks if a folio is certainly *currently* mapped by + * multiple entities in their page table ("mapped shared") or if the folio + * may be mapped exclusively by a single entity ("mapped exclusively"). * - * Return: The estimated number of processes sharing a folio. + * Usually, we consider a single entity to be a single MM. However, some + * folios (KSM, pagecache) can be mapped multiple times into the same MM. + * + * For KSM folios, each individual page table mapping is considered a + * separate entity. So if a KSM folio is mapped multiple times into the + * same process, it is considered "mapped shared". + * + * For pagecache folios that are entirely mapped multiple times into the + * same MM (i.e., multiple VMAs in the same MM cover the same + * file range), we traditionally (and for simplicity) consider them, + * "mapped shared". For partially-mapped folios (e..g, PTE-mapped THP), we + * might detect them either as "mapped shared" or "mapped exclusively" -- + * whatever is simpler. + * + * For small folios and entirely mapped large folios (e.g., hugetlb, + * PMD-mapped PMD-sized THP), the result will be exactly correct. + * + * For all other (partially-mappable) folios, such as PTE-mapped THP, the + * return value is partially fuzzy: true is not fuzzy, because it means + * "certainly mapped shared", but false means "maybe mapped exclusively". + * + * Note that this function only considers *current* page table mappings + * tracked via rmap -- that properly adjusts the folio mapcount(s) -- and + * does not consider: + * (1) any way the folio might get mapped in the (near) future (e.g., + * swapcache, pagecache, temporary unmapping for migration). + * (2) any way a folio might be mapped besides using the rmap (PFN mappings). + * (3) any form of page table sharing. + * + * Return: Whether the folio is certainly mapped by multiple entities. */ -static inline int folio_estimated_sharers(struct folio *folio) +static inline bool folio_mapped_shared(struct folio *folio) { - return page_mapcount(folio_page(folio, 0)); + unsigned int total_mapcount; + + if (likely(!folio_test_large(folio))) + return atomic_read(&folio->page._mapcount) != 0; + total_mapcount = folio_total_mapcount(folio); + + /* A single mapping implies "mapped exclusively". */ + if (total_mapcount == 1) + return false; + + /* If there is an entire mapping, it must be the only mapping. */ + if (folio_entire_mapcount(folio) || unlikely(folio_test_hugetlb(folio))) + return total_mapcount != 1; + /* + * Partially-mappable folios are tricky ... but some are "obviously + * mapped shared": if we have more (PTE) mappings than we have pages + * in the folio, some other entity is certainly involved. + */ + if (total_mapcount > folio_nr_pages(folio)) + return true; + /* ... guess based on the mapcount of the first page of the folio. */ + return atomic_read(&folio->page._mapcount) > 0; } #ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f31f02472396..874eeeb90e0b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1638,7 +1638,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this folio, we couldn't discard * the folio unless they all do MADV_FREE so let's skip the folio. 
*/ - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) goto out; if (!folio_trylock(folio)) diff --git a/mm/madvise.c b/mm/madvise.c index cf4d694280e9..1a82867c8c2e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -365,7 +365,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, folio = pfn_folio(pmd_pfn(orig_pmd)); /* Do not interfere with other mappings of this folio */ - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) goto huge_unlock; if (pageout_anon_only_filter && !folio_test_anon(folio)) @@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err; - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break; @@ -665,7 +665,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err; - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) break; if (!folio_trylock(folio)) break; diff --git a/mm/memory.c b/mm/memory.c index 1f18ed4a5497..6bcfa763a146 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4848,7 +4848,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * Flag if the folio is shared between multiple address spaces. This * is later used when determining whether to group tasks together */ - if (folio_estimated_sharers(folio) > 1 && (vma->vm_flags & VM_SHARED)) + if (folio_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) flags |= TNF_SHARED; nid = folio_nid(folio); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..0492113497cc 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -605,12 +605,11 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. * Choosing not to migrate a shared folio is not counted as a failure. * - * To check if the folio is shared, ideally we want to make sure - * every page is mapped to the same process. Doing that is very - * expensive, so check the estimated sharers of the folio instead. + * See folio_mapped_shared() on possible imprecision when we cannot + * easily detect if a folio is shared. */ if ((flags & MPOL_MF_MOVE_ALL) || - (folio_estimated_sharers(folio) == 1 && !hugetlb_pmd_shared(pte))) + (!folio_mapped_shared(folio) && !hugetlb_pmd_shared(pte))) if (!isolate_hugetlb(folio, qp->pagelist)) qp->nr_failed++; unlock: @@ -988,11 +987,10 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. * Choosing not to migrate a shared folio is not counted as a failure. * - * To check if the folio is shared, ideally we want to make sure - * every page is mapped to the same process. Doing that is very - * expensive, so check the estimated sharers of the folio instead. + * See folio_mapped_shared() on possible imprecision when we cannot + * easily detect if a folio is shared. */ - if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) { + if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio)) { if (folio_isolate_lru(folio)) { list_add_tail(&folio->lru, foliolist); node_stat_mod_folio(folio, diff --git a/mm/migrate.c b/mm/migrate.c index 35a88334bb3c..fda41bc09903 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2559,7 +2559,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, * every page is mapped to the same process. 
Doing that is very * expensive, so check the estimated mapcount of the folio instead. */ - if (folio_estimated_sharers(folio) != 1 && folio_is_file_lru(folio) && + if (folio_mapped_shared(folio) && folio_is_file_lru(folio) && (vma->vm_flags & VM_EXEC)) goto out; From patchwork Fri Nov 24 13:26:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169420 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191325vqx; Fri, 24 Nov 2023 05:29:02 -0800 (PST) X-Google-Smtp-Source: AGHT+IEex0gHnWG/1cHHdYQ55b8l8za5zY3z3lu+x8Heq3T0xkqZiXegzCUb1cexGkt39iotJdoL X-Received: by 2002:a17:902:bb8f:b0:1cc:bfb4:2dca with SMTP id m15-20020a170902bb8f00b001ccbfb42dcamr2673250pls.6.1700832542179; Fri, 24 Nov 2023 05:29:02 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832542; cv=none; d=google.com; s=arc-20160816; b=MGTGS3XiR6HL6s8+bmnmptnpyOJ/MXs9Q5S5E8vTU3OJ7G1Yxkzos3nIvqjJJoaKcw bgVPKPrwQJIQpzMlc/KGotYgaYeC+PiYoANAWYaqmFCHOGnO4K6KpeKTtwjFIDrhpleC fokE6tbPaIOnVnuUGejFeocOBx1fU/3zTlAzYmto7Buzi9KAhH6lLI7hA92RWBnr3XCu uYyc7YOqTWDc35WK1EhS01vKn9KeQHWFTbIoex/WOCfxQrghaIkO0mQa/oSBW88ialX7 HLVLivTjoIA5cEhr3QyDL4hqsTSpWOsL2YlEtdZdN3TIcOIvcSKKhWjWptRkavBpuT4M T0/Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=lJ96IMZVe5oCg8NRltA+Bp5PZmcoSK+MjUKn4pYnke0=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=pq2wXSVCMvBigkyycGc0GmnYKPJGnCcMQRs6gQfOT5PRB15MQI6s5MkoR/M/KuKGo/ u5Zsyv2odKV/rvhq/HLik3/ra8FrZ3QnoMdbQDc/YF8soSKf7aMjCnG5IfKhvw+z0UtG i2sPP12MrTvrKN6I3/H0BNlBzSqo5WTm7B1A/+ohpiwizdK5l840VUfZJbK4dA8zLBWQ DEwfUdQqI0DeROxT+8zT6hVLuCs1DEvmGgQIOo/6jX0vAAf9QpzAbeH3RZqlaj1Ktais s3c4WoXRB2Uc5T/Gek6niPlgVTRygXdGyoYNZJd5uHuCbEC0g/uZtLaSs6WwADaf035p PC6w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=FM0OJ75b; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[23.128.96.37]) by mx.google.com with ESMTPS id y6-20020a170902ed4600b001cfa0b7c6a0si1895678plb.432.2023.11.24.05.29.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:02 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=FM0OJ75b; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 5B6B9818ABD7; Fri, 24 Nov 2023 05:27:33 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345242AbjKXN1K (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46782 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231494AbjKXN07 (ORCPT ); Fri, 24 Nov 2023 08:26:59 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0A438170C for ; Fri, 24 Nov 2023 05:26:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832409; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lJ96IMZVe5oCg8NRltA+Bp5PZmcoSK+MjUKn4pYnke0=; b=FM0OJ75bu4CiGpQIwlsLGqL+b36KzF5OtZ08MDNU3rz6kSWBiKvSVdCBLhiCSAcjcfzJ4x A5xYJeU7dBXk4M4oajgGIYdH7mVYaCmnthhpAIZjcIndrMg4ja61pqYJ3w2y1Ub1qYrZBb qZL72wZIhGyFcmcOP7X1Na1xsRMMu9g= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-507-j-wvy56dM2qBkwjy1JR5fQ-1; Fri, 24 Nov 2023 08:26:47 -0500 X-MC-Unique: j-wvy56dM2qBkwjy1JR5fQ-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 5E5A4811E7B; Fri, 24 Nov 2023 13:26:46 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 46A822166B2A; Fri, 24 Nov 2023 13:26:43 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 04/20] mm/rmap: pass dst_vma to page_try_dup_anon_rmap() and page_dup_file_rmap() Date: Fri, 24 Nov 2023 14:26:09 +0100 Message-ID: <20231124132626.235350-5-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:33 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452183445671441 X-GMAIL-MSGID: 1783452183445671441 We'll need access to the destination MM when modifying the total mapcount of a partially-mappable folio next. So pass in the destination VMA for consistency. While at it, change the parameter order for page_try_dup_anon_rmap() such that the "bool compound" parameter is last, to match the other rmap functions. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 21 +++++++++++++-------- mm/huge_memory.c | 2 +- mm/hugetlb.c | 9 +++++---- mm/memory.c | 6 +++--- mm/migrate.c | 2 +- 5 files changed, 23 insertions(+), 17 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 42e2c74d4d6e..6cb497f6feab 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -208,7 +208,8 @@ void hugepage_add_anon_rmap(struct folio *, struct vm_area_struct *, void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *, unsigned long address); -static inline void __page_dup_rmap(struct page *page, bool compound) +static inline void __page_dup_rmap(struct page *page, + struct vm_area_struct *dst_vma, bool compound) { struct folio *folio = page_folio(page); @@ -225,17 +226,19 @@ static inline void __page_dup_rmap(struct page *page, bool compound) atomic_inc(&folio->_total_mapcount); } -static inline void page_dup_file_rmap(struct page *page, bool compound) +static inline void page_dup_file_rmap(struct page *page, + struct vm_area_struct *dst_vma, bool compound) { - __page_dup_rmap(page, compound); + __page_dup_rmap(page, dst_vma, compound); } /** * page_try_dup_anon_rmap - try duplicating a mapping of an already mapped * anonymous page * @page: the page to duplicate the mapping for + * @dst_vma: the destination vma + * @src_vma: the source vma * @compound: the page is mapped as compound or as a small page - * @vma: the source vma * * The caller needs to hold the PT lock and the vma->vma_mm->write_protect_seq. * @@ -247,8 +250,10 @@ static inline void page_dup_file_rmap(struct page *page, bool compound) * * Returns 0 if duplicating the mapping succeeded. Returns -EBUSY otherwise. */ -static inline int page_try_dup_anon_rmap(struct page *page, bool compound, - struct vm_area_struct *vma) +static inline int page_try_dup_anon_rmap(struct page *page, + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + bool compound) { VM_BUG_ON_PAGE(!PageAnon(page), page); @@ -267,7 +272,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, * future on write faults. 
*/ if (likely(!is_device_private_page(page) && - unlikely(page_needs_cow_for_dma(vma, page)))) + unlikely(page_needs_cow_for_dma(src_vma, page)))) return -EBUSY; ClearPageAnonExclusive(page); @@ -276,7 +281,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, * the page R/O into both processes. */ dup: - __page_dup_rmap(page, compound); + __page_dup_rmap(page, dst_vma, compound); return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 874eeeb90e0b..51a878efca0e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1166,7 +1166,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, VM_BUG_ON_PAGE(!PageHead(src_page), src_page); get_page(src_page); - if (unlikely(page_try_dup_anon_rmap(src_page, true, src_vma))) { + if (unlikely(page_try_dup_anon_rmap(src_page, dst_vma, src_vma, true))) { /* Page maybe pinned: split and retry the fault on PTEs. */ put_page(src_page); pte_free(dst_mm, pgtable); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index cf84784064c7..1ddef4082cad 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5401,9 +5401,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, * sleep during the process. */ if (!folio_test_anon(pte_folio)) { - page_dup_file_rmap(&pte_folio->page, true); + page_dup_file_rmap(&pte_folio->page, dst_vma, + true); } else if (page_try_dup_anon_rmap(&pte_folio->page, - true, src_vma)) { + dst_vma, src_vma, true)) { pte_t src_pte_old = entry; struct folio *new_folio; @@ -6272,7 +6273,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, if (anon_rmap) hugepage_add_new_anon_rmap(folio, vma, haddr); else - page_dup_file_rmap(&folio->page, true); + page_dup_file_rmap(&folio->page, vma, true); new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED))); /* @@ -6723,7 +6724,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, goto out_release_unlock; if (folio_in_pagecache) - page_dup_file_rmap(&folio->page, true); + page_dup_file_rmap(&folio->page, dst_vma, true); else hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr); diff --git a/mm/memory.c b/mm/memory.c index 6bcfa763a146..14416d05e1b6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -836,7 +836,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, get_page(page); rss[mm_counter(page)]++; /* Cannot fail as these pages cannot get pinned. */ - BUG_ON(page_try_dup_anon_rmap(page, false, src_vma)); + BUG_ON(page_try_dup_anon_rmap(page, dst_vma, src_vma, false)); /* * We do not preserve soft-dirty information, because so @@ -950,7 +950,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, * future. */ folio_get(folio); - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) { + if (unlikely(page_try_dup_anon_rmap(page, dst_vma, src_vma, false))) { /* Page may be pinned, we have to copy. 
*/ folio_put(folio); return copy_present_page(dst_vma, src_vma, dst_pte, src_pte, @@ -959,7 +959,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, rss[MM_ANONPAGES]++; } else if (page) { folio_get(folio); - page_dup_file_rmap(page, false); + page_dup_file_rmap(page, dst_vma, false); rss[mm_counter_file(page)]++; } diff --git a/mm/migrate.c b/mm/migrate.c index fda41bc09903..341a84c3e8e4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -252,7 +252,7 @@ static bool remove_migration_pte(struct folio *folio, hugepage_add_anon_rmap(folio, vma, pvmw.address, rmap_flags); else - page_dup_file_rmap(new, true); + page_dup_file_rmap(new, vma, true); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte, psize); } else From patchwork Fri Nov 24 13:26:10 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169414 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1190555vqx; Fri, 24 Nov 2023 05:27:58 -0800 (PST) X-Google-Smtp-Source: AGHT+IHYKTVWgEPzTX+C3xcJxQFwiE8UHfJ+rpmWoZgEH4BpbV4cXKxbuQJhrMscAjiECgnphP1m X-Received: by 2002:a17:902:ab06:b0:1cc:3bfc:69b1 with SMTP id ik6-20020a170902ab0600b001cc3bfc69b1mr2518110plb.24.1700832477734; Fri, 24 Nov 2023 05:27:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832477; cv=none; d=google.com; s=arc-20160816; b=EzZfHwRRDveUH1+59ztZNsx32+tBLeyPhDiv5CfaR7Rb1tKqpwsXEytfgPhkg2dRzT PPwdQHsCjU1efBM9rr+X3RpWNjrTBEeuKuqLaM5EmzXK3zO6VxiQE2eumCCw7Vq8Whgv azbX2HUOcjI2pO61aaE+TGwjzA5rhNuXxLt1Dd5BW9FoPyrqBRt8COdP0A0luJlUF9gT G6Mw5UtG3QcT/FYFyfNNCTEltsUBerFuJ8IOTnv4BDijZY4vLe09EcnrLuf7GM0aqJg8 me92vajSWjZ5Sn4x8iqbW7sqFwa/1Y3G6VaEcogxNmSMpiqZ/mMU4nRxfd5DEcu+zBcU 0rzA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=/ZXmCy/cptnrW109242Bp79Yc/hIzrll+sgQ1b8Vj4U=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=RXupt6jICpZ7gcvKbrz7ax/5bY1Uor3TSbLWBdMYBVrdEO9jLi+i0tyTHkCyB9G5ML O+wFKoYX3Q5UYehChOddFgvVLKlB/HlBVq/+qLE81xsNZi8xg3bG7H4AxWMdqfCVAG/P YfQBREaiPRuphyVeEZjPyctiL07go67iOmvo6iucghnjGmuVqE6RD4APhihkbJYfe6YR pfk1Dt4Hgd96uAvSNgFL6FwDovzBriYqLPzwAQlA7y8JaR+OQp5NB56VWaznSvLY9jwa KBNwVEteXXL81IPDOc7C8qOrhpFhn21A8z/6Q07hhV1qorUBEwM4niXoA3OO0l8Uq5yv PpIQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=G82MZBUh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from agentk.vger.email (agentk.vger.email. 
[2620:137:e000::3:2]) by mx.google.com with ESMTPS id ik24-20020a170902ab1800b001cf54c3b19csi3334732plb.123.2023.11.24.05.27.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:27:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) client-ip=2620:137:e000::3:2; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=G82MZBUh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id 4EC0181825D7; Fri, 24 Nov 2023 05:27:45 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345297AbjKXN1M (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:12 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37602 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232837AbjKXN1B (ORCPT ); Fri, 24 Nov 2023 08:27:01 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C1FA31730 for ; Fri, 24 Nov 2023 05:26:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832412; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/ZXmCy/cptnrW109242Bp79Yc/hIzrll+sgQ1b8Vj4U=; b=G82MZBUh3iwaP5zwmz+tCYWCC+6wc2m3xOxqRSVZgmLMFmkLIb2n6qnTOCu27cu4Q/jI6e 9SWtNS6ByBuCPor/cFfZNaNoduD9ebWm29y7raoRBlJshg1jzPgFzjh3nFiSr0a0mHXBTR 7Neq1ivoNSIwK480TAFQJLrkITfTBd4= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-459-PPvWI2VbOfaXtOuRopE19A-1; Fri, 24 Nov 2023 08:26:50 -0500 X-MC-Unique: PPvWI2VbOfaXtOuRopE19A-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id D82C7811E7B; Fri, 24 Nov 2023 13:26:49 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id A59242166B2A; Fri, 24 Nov 2023 13:26:46 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 05/20] mm/rmap: abstract total mapcount operations for partially-mappable folios Date: Fri, 24 Nov 2023 14:26:10 +0100 Message-ID: <20231124132626.235350-6-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:46 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452116657587957 X-GMAIL-MSGID: 1783452116657587957 Let's prepare for doing additional accounting whenever modifying the total mapcount of partially-mappable (!hugetlb) folios. Pass the VMA as well. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 41 ++++++++++++++++++++++++++++++++++++++++- mm/rmap.c | 23 ++++++++++++----------- 2 files changed, 52 insertions(+), 12 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 6cb497f6feab..9d5c2ed6ced5 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -168,6 +168,39 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, struct anon_vma *folio_get_anon_vma(struct folio *folio); +static inline void folio_set_large_mapcount(struct folio *folio, + int count, struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + /* increment count (starts at -1) */ + atomic_set(&folio->_total_mapcount, count - 1); +} + +static inline void folio_inc_large_mapcount(struct folio *folio, + struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + atomic_inc(&folio->_total_mapcount); +} + +static inline void folio_add_large_mapcount(struct folio *folio, + int count, struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + atomic_add(count, &folio->_total_mapcount); +} + +static inline void folio_dec_large_mapcount(struct folio *folio, + struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + atomic_dec(&folio->_total_mapcount); +} + /* RMAP flags, currently only relevant for some anon rmap operations. 
*/ typedef int __bitwise rmap_t; @@ -219,11 +252,17 @@ static inline void __page_dup_rmap(struct page *page, return; } + if (unlikely(folio_test_hugetlb(folio))) { + atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_total_mapcount); + return; + } + if (compound) atomic_inc(&folio->_entire_mapcount); else atomic_inc(&page->_mapcount); - atomic_inc(&folio->_total_mapcount); + folio_inc_large_mapcount(folio, dst_vma); } static inline void page_dup_file_rmap(struct page *page, diff --git a/mm/rmap.c b/mm/rmap.c index 38765796dca8..689ad85cf87e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1105,8 +1105,8 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, } static unsigned int __folio_add_rmap_range(struct folio *folio, - struct page *page, unsigned int nr_pages, bool compound, - int *nr_pmdmapped) + struct page *page, unsigned int nr_pages, + struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; int first, count, nr = 0; @@ -1130,7 +1130,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, nr++; } } while (page++, --count > 0); - atomic_add(nr_pages, &folio->_total_mapcount); + folio_add_large_mapcount(folio, nr_pages, vma); } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1148,7 +1148,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, nr = 0; } } - atomic_inc(&folio->_total_mapcount); + folio_inc_large_mapcount(folio, vma); } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } @@ -1258,7 +1258,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned int nr, nr_pmdmapped = 0; bool compound = flags & RMAP_COMPOUND; - nr = __folio_add_rmap_range(folio, page, 1, compound, &nr_pmdmapped); + nr = __folio_add_rmap_range(folio, page, 1, vma, compound, + &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); if (nr) @@ -1329,8 +1330,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, } if (folio_test_large(folio)) - /* increment count (starts at -1) */ - atomic_set(&folio->_total_mapcount, 0); + folio_set_large_mapcount(folio, 1, vma); __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); __folio_set_anon(folio, vma, address, true); @@ -1355,7 +1355,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, { unsigned int nr, nr_pmdmapped = 0; - nr = __folio_add_rmap_range(folio, page, nr_pages, compound, + nr = __folio_add_rmap_range(folio, page, nr_pages, vma, compound, &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ? @@ -1411,16 +1411,17 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(compound && !PageHead(page), page); - if (folio_test_large(folio)) - atomic_dec(&folio->_total_mapcount); - /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ atomic_dec(&folio->_entire_mapcount); + atomic_dec(&folio->_total_mapcount); return; } + if (folio_test_large(folio)) + folio_dec_large_mapcount(folio, vma); + /* Is page being unmapped by PTE? Is this its last map to be removed? 
*/ if (likely(!compound)) { last = atomic_add_negative(-1, &page->_mapcount); From patchwork Fri Nov 24 13:26:11 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169421 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191341vqx; Fri, 24 Nov 2023 05:29:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IGprwUGWuR2whFKLKmqxKYLi5ESmm2ljdY+0sYmvSQFtIzfZmgY2ixptK9Hlw87/AIt6RE/ X-Received: by 2002:a05:6a00:44c8:b0:6cb:901a:9303 with SMTP id cv8-20020a056a0044c800b006cb901a9303mr2981119pfb.13.1700832544605; Fri, 24 Nov 2023 05:29:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832544; cv=none; d=google.com; s=arc-20160816; b=v7aBf63CncuXnuVYriIKtq6Oem7mX/w8dg6754A78ew/DV+GBcjLZkBL7yyM5mQ8kh rBtLFusw2rz6jFHKnkNsCMqZ2+mghQ30Rba4hpAb9jIOsklEJHeKT59a+NTdcp7+hks1 9E+/7ro0NLex+Oh+i8A0XcQiiexS02c7eoZdu3GfSK+JeErxkO/Q7kqeuA6cd2Gaitph 1tdZS8EYmRmv66+bBCQT2n7mB13eypJ6AoIvNEXF/YrSAdNydpF2l2gQ0Fm9izwMecFX E1kK0T7VAKckvVf7ny/lUS7dQ5PI/PbsXXQ+47gX3Yd8y4zbg02emmPASLjd1AhCq3rA 0c7Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=ujZJ3g6hK6oBZOKXOf3RJw4ZrpKoYgRWOTMhG7TrjzA=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=gNh2x6nBbJSrah7HII2nY/6tiYazNwX2s5gchLNnJVkTh73u8q/oGiiVIoJg31aZdA WW3hQ/GM/gBJRvL65W+W/fdi2dkw8iTWqzK2Vl4C2In5KVIfn37k49sNq++mnmcGlUzA CDAxzZGoXmMEgFNptVEP/hnTgjA3l+K/5cVQnSnvw4y7BV6wAWvMAOpbCP4EFalaJy+c 9ZQQQAqCJAv1373ZDcyitENSGnWzqR5XAJs0WZ3jxd5ylNylvtmhqyJS/fDQYHobR+Ci UNyrlflEte26rHG6Yo/IbLQ3GtYAveCCMfQjf22PEgy90zU2gc4VJbhM7kgysgWhi7DJ wHbw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=SeTVvvm8; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[2620:137:e000::3:7]) by mx.google.com with ESMTPS id d11-20020a63ed0b000000b00578a2da998asi3592316pgi.304.2023.11.24.05.29.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:04 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=SeTVvvm8; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id C8A618191468; Fri, 24 Nov 2023 05:27:43 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345606AbjKXN1a (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37040 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235270AbjKXN1E (ORCPT ); Fri, 24 Nov 2023 08:27:04 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A39E019B1 for ; Fri, 24 Nov 2023 05:27:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832420; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ujZJ3g6hK6oBZOKXOf3RJw4ZrpKoYgRWOTMhG7TrjzA=; b=SeTVvvm8Xc2oDJVbQuCbZFFk5U1dRzA3boOM/6Af8TardoT+WGyfOelh/gXMBEFo38N5d4 GBafbBUjG7moggr6Jqp4J5VRey9az6Mucx/P9NsXuikXQsz2isTSj5pU7k+JOrYqOElRmc RlAtB9tnF/ws1RxrqzSo7UWmn3aArzs= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-446-U_verE8yOyS0gXTE1oEkOw-1; Fri, 24 Nov 2023 08:26:54 -0500 X-MC-Unique: U_verE8yOyS0gXTE1oEkOw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id A1C4538116E0; Fri, 24 Nov 2023 13:26:53 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 232362166B2A; Fri, 24 Nov 2023 13:26:50 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 06/20] atomic_seqcount: new (raw) seqcount variant to support concurrent writers Date: Fri, 24 Nov 2023 14:26:11 +0100 Message-ID: <20231124132626.235350-7-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:44 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452186335281729 X-GMAIL-MSGID: 1783452186335281729 Assume we have a writer side that is fairly simple and only updates some counters by adding some values: folio->counter_a += diff_a; folio->counter_b += diff_b; folio->counter_c += diff_c; ... Further, assume that our readers want to always read consistent set of counters. That is, they not only want to read each counter atomically, but also get a consistent/atomic view across *all* counters, detecting the case where there are concurrent modifications of the counters. Traditionally, we'd use a seqcount protected by some locking on the writer side. The readers can run lockless, detect when there were concurrent updates, to simply retry again to re-read all values. However, a seqcount requires to serialize all writers to only allow for a single writer at a time. Alternatives might include per-cpu counters / local atomics, but for the target use cases, both primitives are not applicable: We want to store counters (2 to 7 for now, depending on the folio size) in the "struct folio" of some larger folios (order >=2 ) whereby the counters get adjusted whenever we (un)map part of a folio. (a) The reader side must be able to get a consistent view of the counters and be able to detect concurrent changes (i.e., concurrent (un)mapping), as described above. In some cases we can simply stop immediately if we detect any concurrent writer -- any concurrent (un)map activity. (b) The writer side updates the counters as described above and should ideally run completely lockless. In many cases, we always have a single write at a time. But in some scenarios, we can trigger a lot of concurrent writers. We want the writer side to be able to make progress instead of repeadetly spinning, waiting for possibly many other writers. (c) Space in the "struct folio" especially for smallish folios is very limited, and the "struct page" layout imposes various restrictions on where we can even put new data; growing the size of the "struct page" is not desired because it can result in serious metadata overhead and easily has performance implications (cache-line). So we cannot place ordinary spinlocks in there (especially also because they change their size based on lockdep and actual implementation), and the only real alternative is a bit spinlock, which is really undesired. 
If we want to allow concurrent writers, we can use atomic RMW operations when updating the counters:

  atomic_add(diff_a, &folio->counter_a);
  atomic_add(diff_b, &folio->counter_b);
  atomic_add(diff_c, &folio->counter_c);
  ...

But the existing seqcount that lets the reader side detect concurrent updates is not capable of handling concurrent writers. So let's add a new atomic seqcount for exactly that purpose.

Instead of using a single LSB in the seqcount to detect a single concurrent writer, it uses multiple LSBs to detect multiple concurrent writers. As the seqcount can be modified concurrently, it ends up being an atomic type.

In theory, each CPU can participate, so we have to steal quite some LSBs on 64bit. As that reduces the bits available for the actual sequence quite drastically especially on 64bit, and there is the concern that 16bit for the sequence might not be sufficient, just use an atomic_long_t for now. For the use case discussed, we will place the new atomic seqcount into the "struct folio"/"struct page", where the limitations as described above apply. For that use case, the "raw" variant -- raw_atomic_seqcount_t -- is required, so we only add that.

For the normal seqcount on the writer side, we have the following memory ordering:

  s->sequence++
  smp_wmb();
  [critical section]
  smp_wmb();
  s->sequence++

It's important that other CPUs don't observe stores to the sequence to be reordered with stores in the critical section. For the atomic seqcount, we could have similarly used:

  atomic_long_add(SHARED, &s->sequence);
  smp_wmb();
  [critical section]
  smp_wmb();
  atomic_long_add(STEP - SHARED, &s->sequence);

But especially on x86_64, the atomic_long_add() already implies a full memory barrier. So instead, we can do:

  atomic_long_add(SHARED, &s->sequence);
  __smp_mb__after_atomic();
  [critical section]
  __smp_mb__before_atomic();
  atomic_long_add(STEP - SHARED, &s->sequence);

Or alternatively:

  atomic_long_add_return(SHARED, &s->sequence);
  [critical section]
  atomic_long_add_return(STEP - SHARED, &s->sequence);

Could we use acquire-release semantics? Like the following:

  atomic_long_add_return_acquire(SHARED, &s->sequence)
  [critical section]
  atomic_long_add_return_release(STEP - SHARED, &s->sequence)

Maybe, but (a) it would make it different to normal seqcounts, because stores before/after the atomic_long_add_*() could now be reordered; and (b) memory-barriers.txt might indicate that the sequence counter store might be reordered: "For compound atomics performing both a load and a store, ACQUIRE semantics apply only to the load and RELEASE semantics apply only to the store portion of the operation." So let's keep it simple for now.

Effectively, with the atomic seqcount we end up with more atomic RMW operations in the critical section but get no writer starvation / lock contention in return.

We'll limit the implementation to !PREEMPT_RT and disallow readers/writers from interrupt context.
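To make the intended usage concrete, here is a minimal user-space model of the scheme just described, written with C11 atomics and fences in place of the kernel's atomic_long_t and smp_mb__{before,after}_atomic(). The bit layout, the SHARED_WRITER/SEQUENCE_STEP constants and all names are simplified stand-ins for illustration, not the interface this patch adds (that is raw_atomic_seqcount_t, below).

	/*
	 * User-space sketch of the atomic seqcount idea: low bits count the
	 * currently active writers, the remaining bits form the sequence.
	 * Writers never serialize; they only use atomic RMW operations.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>

	#define SHARED_WRITER	0x0001ul	/* one writer slot in the LSBs     */
	#define WRITERS_MASK	0xfffful	/* all writer bits                 */
	#define SEQUENCE_STEP	0x10000ul	/* lowest bit of the sequence part */

	typedef struct {
		atomic_ulong sequence;		/* writer bits + sequence */
		atomic_ulong counter_a;		/* data: updated via atomic RMW only */
		atomic_ulong counter_b;
	} atomic_seqcount_demo_t;

	static void writer_add(atomic_seqcount_demo_t *s, unsigned long da,
			       unsigned long db)
	{
		/* announce one more concurrent writer */
		atomic_fetch_add_explicit(&s->sequence, SHARED_WRITER,
					  memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst); /* ~ smp_mb__after_atomic() */

		/* critical section: concurrent writers are fine, RMW only */
		atomic_fetch_add_explicit(&s->counter_a, da, memory_order_relaxed);
		atomic_fetch_add_explicit(&s->counter_b, db, memory_order_relaxed);

		atomic_thread_fence(memory_order_seq_cst); /* ~ smp_mb__before_atomic() */
		/* drop our writer bit and advance the sequence in one RMW */
		atomic_fetch_add_explicit(&s->sequence,
					  SEQUENCE_STEP - SHARED_WRITER,
					  memory_order_relaxed);
	}

	static bool reader_snapshot(atomic_seqcount_demo_t *s, unsigned long *a,
				    unsigned long *b)
	{
		unsigned long seq;

		/* wait until no writer is active */
		do {
			seq = atomic_load_explicit(&s->sequence,
						   memory_order_relaxed);
		} while (seq & WRITERS_MASK);
		atomic_thread_fence(memory_order_acquire);	/* ~ smp_rmb() */

		*a = atomic_load_explicit(&s->counter_a, memory_order_relaxed);
		*b = atomic_load_explicit(&s->counter_b, memory_order_relaxed);

		atomic_thread_fence(memory_order_acquire);	/* ~ smp_rmb() */
		/* false if a writer started or finished meanwhile: retry */
		return atomic_load_explicit(&s->sequence, memory_order_relaxed) == seq;
	}

A reader whose final check fails simply retries. Note that the kernel version below additionally disables preemption across the write side, precisely because readers busy-wait for the writer bits to clear.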
Signed-off-by: David Hildenbrand --- include/linux/atomic_seqcount.h | 170 ++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 11 +++ 2 files changed, 181 insertions(+) create mode 100644 include/linux/atomic_seqcount.h diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h new file mode 100644 index 000000000000..109447b663a1 --- /dev/null +++ b/include/linux/atomic_seqcount.h @@ -0,0 +1,170 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __LINUX_ATOMIC_SEQLOCK_H +#define __LINUX_ATOMIC_SEQLOCK_H + +#include +#include +#include + +/* + * raw_atomic_seqcount_t -- a reader-writer consistency mechanism with + * lockless readers (read-only retry loops), and lockless writers. + * The writers must use atomic RMW operations in the critical section. + * + * This locking mechanism is applicable when all individual operations + * performed by writers can be expressed using atomic RMW operations + * (so they can run lockless) and readers only need a way to get an atomic + * view over all individual atomic values: like writers atomically updating + * multiple counters, and readers wanting to observe a consistent state + * across all these counters. + * + * For now, only the raw variant is implemented, that doesn't perform any + * lockdep checks. + * + * Copyright Red Hat, Inc. 2023 + * + * Author(s): David Hildenbrand + */ + +typedef struct raw_atomic_seqcount { + atomic_long_t sequence; +} raw_atomic_seqcount_t; + +#define raw_seqcount_init(s) atomic_long_set(&((s)->sequence), 0) + +#ifdef CONFIG_64BIT + +#define ATOMIC_SEQCOUNT_SHARED_WRITER 0x0000000000000001ul +/* 65536 CPUs */ +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000000fffful +/* We have 48bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000010000ul + +#else /* CONFIG_64BIT */ + +#define ATOMIC_SEQCOUNT_SHARED_WRITER 0x00000001ul +/* 64 CPUs */ +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x0000007ful +/* We have 25bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000080ul + +#endif /* CONFIG_64BIT */ + +#if CONFIG_NR_CPUS > ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX +#error "raw_atomic_seqcount_t does not support such large CONFIG_NR_CPUS" +#endif + +/** + * raw_read_atomic_seqcount() - read the raw_atomic_seqcount_t counter value + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_read_atomic_seqcount() opens a read critical section of the given + * raw_atomic_seqcount_t, and without checking or masking the sequence counter + * LSBs (using ATOMIC_SEQCOUNT_WRITERS_MASK). Calling code is responsible for + * handling that. + * + * Return: count to be passed to raw_read_atomic_seqcount_retry() + */ +static inline unsigned long raw_read_atomic_seqcount(raw_atomic_seqcount_t *s) +{ + unsigned long seq = atomic_long_read(&s->sequence); + + /* Read the sequence before anything in the critical section */ + smp_rmb(); + return seq; +} + +/** + * raw_read_atomic_seqcount_begin() - begin a raw_seqcount_t read section + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_read_atomic_seqcount_begin() opens a read critical section of the + * given raw_seqcount_t. This function must not be used in interrupt context. 
+ * + * Return: count to be passed to raw_read_atomic_seqcount_retry() + */ +static inline unsigned long raw_read_atomic_seqcount_begin(raw_atomic_seqcount_t *s) +{ + unsigned long seq; + + BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(in_interrupt()); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + while ((seq = atomic_long_read(&s->sequence)) & + ATOMIC_SEQCOUNT_WRITERS_MASK) + cpu_relax(); + + /* Load the sequence before any load in the critical section. */ + smp_rmb(); + return seq; +} + +/** + * raw_read_atomic_seqcount_retry() - end a raw_seqcount_t read critical section + * @s: Pointer to the raw_atomic_seqcount_t + * @start: count, for example from raw_read_atomic_seqcount_begin() + * + * raw_read_atomic_seqcount_retry() closes the read critical section of the + * given raw_seqcount_t. If the critical section was invalid, it must be ignored + * (and typically retried). + * + * Return: true if a read section retry is required, else false + */ +static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s, + unsigned long start) +{ + /* Load the sequence after any load in the critical section. */ + smp_rmb(); + return unlikely(atomic_long_read(&s->sequence) != start); +} + +/** + * raw_write_seqcount_begin() - start a raw_seqcount_t write critical section + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_write_seqcount_begin() opens the write critical section of the + * given raw_seqcount_t. This function must not be used in interrupt context. + */ +static inline void raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s) +{ + BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(in_interrupt()); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + preempt_disable(); + atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); + /* Store the sequence before any store in the critical section. */ + smp_mb__after_atomic(); +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON((atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > + ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ +} + +/** + * raw_write_seqcount_end() - end a raw_seqcount_t write critical section + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_write_seqcount_end() closes the write critical section of the + * given raw_seqcount_t. + */ +static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s) +{ +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + /* Store the sequence after any store in the critical section. */ + smp_mb__before_atomic(); + atomic_long_add(ATOMIC_SEQCOUNT_SEQUENCE_STEP - + ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); + preempt_enable(); +} + +#endif /* __LINUX_ATOMIC_SEQLOCK_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index cc7d53d9dc01..569c2c6ed47f 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1298,6 +1298,7 @@ config PROVE_LOCKING select DEBUG_MUTEXES if !PREEMPT_RT select DEBUG_RT_MUTEXES if RT_MUTEXES select DEBUG_RWSEMS + select DEBUG_ATOMIC_SEQCOUNT if !PREEMPT_RT select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC select PREEMPT_COUNT if !ARCH_NO_PREEMPT @@ -1425,6 +1426,16 @@ config DEBUG_RWSEMS This debugging feature allows mismatched rw semaphore locks and unlocks to be detected and reported. 
+config DEBUG_ATOMIC_SEQCOUNT + bool "Atomic seqcount debugging: basic checks" + depends on DEBUG_KERNEL && !PREEMPT_RT + help + This feature allows some atomic seqcount semantics violations to be + detected and reported. + + The debug checks are only performed when running code that actively + uses atomic seqcounts; there are no dedicated test cases yet. + config DEBUG_LOCK_ALLOC bool "Lock debugging: detect incorrect freeing of live locks" depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT From patchwork Fri Nov 24 13:26:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169423 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191397vqx; Fri, 24 Nov 2023 05:29:09 -0800 (PST) X-Google-Smtp-Source: AGHT+IFKMURYCnnzcMlYrhtFHxhM/ZYgC+Wtxr3W7HdzVzh6Uq4tUn5TvyzvI45j08Ym3SiRWHbL X-Received: by 2002:a17:902:e84f:b0:1cc:4a47:1fe5 with SMTP id t15-20020a170902e84f00b001cc4a471fe5mr3272056plg.59.1700832549355; Fri, 24 Nov 2023 05:29:09 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832549; cv=none; d=google.com; s=arc-20160816; b=FnunE8bpYBcsXJdW9KFwl3t0BFjBIq/aKQWEtBcd90Mp8LSeE8pTOjISMrcEgwozpG dDAxzGTl3x7FcC5FTvByuTE10wBos4meDmVRVQ3jwy3oJl5MhhajtgP5rpJtqbVvJPft NgmS2h5Ms0VsYxQA/9kX1JPL5j1HdytB0oUjYe+PvcT8glDN7dXtTQ4rhGrzdWPtcsW2 yWeJ3mngX+Rjz/ljOPjlhk8IRLI01lTbpL5F76TbkOAMdpAwo+aHBk7D3vPiWLfBTG1J PEZ6HxEtyu4VEgBfLVwwE+ngPLeqcF7273zzwFpoeV2LXLOpDiJz+ybjfIiNIqHoquno Krbw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=a5yfSJmvw+walwOxM5XR2U+SrOM6iOWFmuLu46mrnEU=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=Orz5qWBGbY8+0Plntx/Ru7gvV1sFAOUtF0HMPTELwJOzFMoLzjBKQD97UWa7kwypJY il7a2ew6cPHZ97UD/7K0Pzm/eQmEswoGO9+k9OpGGUTfPM65pgXXRGt+NQmLTw3OTOGs NJ9E/uZl+lWyaqQ674phsMIpnxzx+tPijqm2gxY7DmFjzzx9/yczNINLruAE01VeGcQX 24P4c8ankDRvuR/sK21D5ctICiDRvXbgUOYB3Ans4xNni86glnA/A+eqsZ3u9jASiDlJ P1wc/1u9h5gIfH+Q7myv13YsuJFedrU4F0/RfzKQn/QuePLX0WuvFAWN+a1stE5JL+2A 31KA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=blneaO6V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[23.128.96.37]) by mx.google.com with ESMTPS id y2-20020a170902700200b001c7264c458dsi3318846plk.181.2023.11.24.05.29.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:09 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=blneaO6V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 1B0C58191477; Fri, 24 Nov 2023 05:28:02 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345657AbjKXN1d (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37512 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235293AbjKXN1F (ORCPT ); Fri, 24 Nov 2023 08:27:05 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 55A5019AE for ; Fri, 24 Nov 2023 05:27:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832420; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=a5yfSJmvw+walwOxM5XR2U+SrOM6iOWFmuLu46mrnEU=; b=blneaO6Von/7If1kkxAo+LYa5rv//QxgVYhvipio1uf59ixRdPSzdR41gUbs9hq+vqvk5M 2ZAC6xSD6EvmYcR0DOSBpfzjKvkeJ7koYcTBinUVIQCNVd6HWHqwLzo4XGT6PvnNzVdQlA 45zF0teKtWOwMSDeBPtLcGhN7cMj5U8= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-689-ORZ29QTLNFSiOT-xdaI3QA-1; Fri, 24 Nov 2023 08:26:57 -0500 X-MC-Unique: ORZ29QTLNFSiOT-xdaI3QA-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id BD5003C108C7; Fri, 24 Nov 2023 13:26:56 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 09AB92166B2A; Fri, 24 Nov 2023 13:26:53 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 07/20] mm/rmap_id: track if one ore multiple MMs map a partially-mappable folio Date: Fri, 24 Nov 2023 14:26:12 +0100 Message-ID: <20231124132626.235350-8-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:28:03 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452190938775369 X-GMAIL-MSGID: 1783452190938775369 In contrast to small folios and hugetlb folios, for a partially-mappable folio (i.e., THP), the total mapcount is often not expressive to identify whether such a folio is "mapped shared" or "mapped exclusively". For small folios and hugetlb folios that are always entirely mapped, the single mapcount is traditionally used for that purpose: is it 1? Then the folio is currently mapped exclusively; is it bigger than 1? Then it's mapped at least twice, and, therefore, considered "mapped shared". For a partially-mappable folio, each individual PTE/PMD/... mapping requires exactly one folio reference and one folio mapcount; folio_mapcount() > 1 does not imply that the folio is "mapped shared". While there are some obvious cases when we can conclude that partially-mappable folios are "mapped shared" -- see folio_mapped_shared() -- but it is currently not always possible to precisely tell whether a folio is "mapped exclusively". For implementing a precise variant of folio_mapped_shared() and for COW-reuse support of PTE-mapped anon THP, we need an efficient and precise way to identify "mapped shared" vs. "mapped exclusively". So how could we track if more than one MM is currently mapping a folio in its page tables? Having a list of MMs per folio, or even a counter for each MM for each folio is clearly not feasible. ... but what if we could play some fun math games to perform this tracking while requiring a handful of counters per folio, the exact number of counters depending on the size of the folio? 1. !!! Experimental Feature !!! =============================== We'll only support CONFIG_64BIT and !CONFIG_PREEMPT_RT (implied by THP support) for now. As we currently never get partially-mappable folios without CONFIG_TRANSPARENT_HUGEPAGE, let's limit to that to avoid unnecessary rmap ID allocations for setups without THP. 32bit support might be possible if there is demand, limiting it to 64k rmap IDs and reasonably sized folio sizes (e.g., <= order-15). Similarly, RT might be possible if there is ever real demand for it. The feature will be experimental initially, and, therefore, disabled as default. Once the involved math is considered solid, the implementation saw extended testing, and the performance implications are clear and have either been optimized (e.g., rmap batching) or mitigated (e.g., do we really have to perform this tracking for folios that are always assumed shared, like folios mapping executables or shared libraries? 
Is some hardware problematic?), we can consider always enabling it as default. 2. Per-mm rmap IDs ================== We'll have to assign each MM an rmap ID that is smaller than 16*1024*1024 on 64bit. Note that these are significantly more than the maximum number of processes we can possibly have in the system. There isn't really a difference between supporting 16M IDs and 2M/4M IDs. Due to the ID size limitation, we cannot use the MM pointer value and need a separate ID allocator. Maybe, we want to cache some rmap IDs per CPU? Maybe we want to improve the allocation path? We can add such improvements when deemed necessary. In the distant future, we might want to allocate rmap IDs for selected VMAs: for example, imagine a systemcall that does something like fork (COW-sharing of pages) within a process for a range of anonymous memory, ending up with a new VMA that wants a separate rmap ID. For now, per-MM is simple and sufficient. 3. Tracking Overview ==================== We derive a sequence of special sub-IDs from our MM rmap ID. Any time we map/unmap a part (e.g., PTE, PMD) of a partially-mappable folio to/from a MM, we: (1) Adjust (increment/decrement) the mapcount of the folio (2) Adjust (add/remove) the folio rmap values using the MM sub-IDs So the rmap values are always linked to the folio mapcount. Consequently, we know that a single rmap value in the folio is the sum of exactly #folio_mapcount() rmap sub-IDs. To identify whether a single MM is responsible for all folio_mapcount() mappings of a folio ("mapped exclusively") or whether other MMs are involved ("mapped shared"), we perform the following checks: (1) Do we have more mappings than the folio has pages? Then the folio is certainly shared. That is, when "folio_mapcount() > folio_nr_pages()" (2) For each rmap value X, does that rmap value folio->_rmap_valX correspond to "folio_mapcount() * sub-ID[X]" of the MM? Then the folio is certainly exclusive. Note that we only check that when "folio_mapcount() <= folio_nr_pages()". 4. Synchronization ================== We're using an atomic seqcount, stored in the folio, to allow for readers to detect concurrent (un)mapping, whereby they could obtain a wrong snapshot of the mapcount+rmap values and make a wrong decision. Further, the mapcount and all rmap values are updated using RMW atomics, to allow for concurrent updates. 5. sub-IDs ========== To achieve (2), we generate sub-IDs that have the following property, assuming that our folio has P=folio_nr_pages() pages. "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs ... "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs The sub-IDs are generated in generations, whereby (1) Generation #0 is the number 0 (2) Generation #N takes all numbers from generations #0..#N-1 and adds (P + 1)^(N - 1), effectively doubling the number of sub-IDs Consequently, the smallest number S in gen #N is: S[#N] = (P + 1)^(N - 1) The largest number L in gen #N is: L[#N] = (P + 1)^(N - 1) + (P + 1)^(N - 2) + ... (P + 1)^0 + 0. 
-> [geometric sum with "P + 1 != 1"] = (1 - (P + 1)^N) / (1 - (P + 1)) = (1 - (P + 1)^N) / (-P) = ((P + 1)^N - 1) / P Example with P=4 (order-2 folio): Generation #0: 0 ------------------------ + (4 + 1)^0 = 1 Generation #1: 1 ------------------------ + (4 + 1)^1 = 5 Generation #2: 5 6 ------------------------ + (4 + 1)^2 = 25 Generation #3: 25 26 30 31 ------------------------ + (4 + 1)^3 = 125 [...] Intuitively, we are working with sub-counters that cannot overflow as long as we have <= P components. Let's consider the simple case of P=3, whereby our sub-counters are exactly 2-bit wide. Subid | Bits | Sub-counters -------------------------------- 0 | 0000 0000 | 0,0,0,0 1 | 0000 0001 | 0,0,0,1 4 | 0000 0100 | 0,0,1,0 5 | 0000 0101 | 0,0,1,1 16 | 0001 0000 | 0,1,0,0 17 | 0001 0001 | 0,1,0,1 20 | 0001 0100 | 0,1,1,0 21 | 0001 0101 | 0,1,1,1 64 | 0100 0000 | 1,0,0,0 65 | 0100 0001 | 1,0,0,1 68 | 0100 0100 | 1,0,1,0 69 | 0100 0101 | 1,0,1,1 80 | 0101 0100 | 1,1,0,0 81 | 0101 0001 | 1,1,0,1 84 | 0101 0100 | 1,1,1,0 85 | 0101 0101 | 1,1,1,1 So if we, say, have: 3 * 17 = 0,3,0,3 how could we possible get to that number by using 3 other subids? It's impossible, because the sub-counters won't overflow as long as we stay <= 3. Interesting side note that might come in handy at some point: we also cannot get to 0,3,0,3 by using 1 or 2 other subids. But, we could get to 1 * 17 = 0,1,0,1 by using 2 subids (16 and 1) or similarly to 2 * 17 = 0,2,0,2 by using 4 subids (2x16 and 2x1). Looks like we cannot get to X * subid using any 1..X other subids. Note 1: we'll add the actual detection logic used to be used by folio_mapped_shared() and wp_can_reuse_anon_folio() separately. Note 2: we might want to use that infrastructure for hugetlb as well in the future: there is nothing THP-specific about rmap ID handling. Signed-off-by: David Hildenbrand --- include/linux/mm_types.h | 58 +++++++ include/linux/rmap.h | 126 +++++++++++++- kernel/fork.c | 26 +++ mm/Kconfig | 21 +++ mm/Makefile | 1 + mm/huge_memory.c | 16 +- mm/init-mm.c | 4 + mm/page_alloc.c | 9 + mm/rmap_id.c | 351 +++++++++++++++++++++++++++++++++++++++ 9 files changed, 604 insertions(+), 8 deletions(-) create mode 100644 mm/rmap_id.c diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 99b84b4797b9..75305c57ef64 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -273,6 +274,14 @@ typedef struct { * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h. * @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head(). * @_deferred_list: Folios to be split under memory pressure. + * @_rmap_atomic_seqcount: Seqcount protecting _total_mapcount and _rmapX. + * Does not apply to hugetlb. + * @_rmap_val0 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val1 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val2 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val3 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val4 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val5 Do not use outside of rmap code. Does not apply to hugetlb. * * A folio is a physically, virtually and logically contiguous set * of bytes. 
It is a power-of-two in size, and it is aligned to that @@ -331,6 +340,9 @@ struct folio { atomic_t _pincount; #ifdef CONFIG_64BIT unsigned int _folio_nr_pages; +#ifdef CONFIG_RMAP_ID + raw_atomic_seqcount_t _rmap_atomic_seqcount; +#endif /* CONFIG_RMAP_ID */ #endif /* private: the union with struct page is transitional */ }; @@ -356,6 +368,34 @@ struct folio { }; struct page __page_2; }; + union { + struct { + unsigned long _flags_3; + unsigned long _head_3; + /* public: */ +#ifdef CONFIG_RMAP_ID + atomic_long_t _rmap_val0; + atomic_long_t _rmap_val1; + atomic_long_t _rmap_val2; + atomic_long_t _rmap_val3; +#endif /* CONFIG_RMAP_ID */ + /* private: the union with struct page is transitional */ + }; + struct page __page_3; + }; + union { + struct { + unsigned long _flags_4; + unsigned long _head_4; + /* public: */ +#ifdef CONFIG_RMAP_ID + atomic_long_t _rmap_val4; + atomic_long_t _rmap_val5; +#endif /* CONFIG_RMAP_ID */ + /* private: the union with struct page is transitional */ + }; + struct page __page_4; + }; }; #define FOLIO_MATCH(pg, fl) \ @@ -392,6 +432,20 @@ FOLIO_MATCH(compound_head, _head_2); FOLIO_MATCH(flags, _flags_2a); FOLIO_MATCH(compound_head, _head_2a); #undef FOLIO_MATCH +#define FOLIO_MATCH(pg, fl) \ + static_assert(offsetof(struct folio, fl) == \ + offsetof(struct page, pg) + 3 * sizeof(struct page)) +FOLIO_MATCH(flags, _flags_3); +FOLIO_MATCH(compound_head, _head_3); +#undef FOLIO_MATCH +#undef FOLIO_MATCH +#define FOLIO_MATCH(pg, fl) \ + static_assert(offsetof(struct folio, fl) == \ + offsetof(struct page, pg) + 4 * sizeof(struct page)) +FOLIO_MATCH(flags, _flags_4); +FOLIO_MATCH(compound_head, _head_4); +#undef FOLIO_MATCH + /** * struct ptdesc - Memory descriptor for page tables. @@ -975,6 +1029,10 @@ struct mm_struct { #endif } lru_gen; #endif /* CONFIG_LRU_GEN */ + +#ifdef CONFIG_RMAP_ID + int mm_rmap_id; +#endif /* CONFIG_RMAP_ID */ } __randomize_layout; /* diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 9d5c2ed6ced5..19c9dc3216df 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -168,6 +168,116 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, struct anon_vma *folio_get_anon_vma(struct folio *folio); +#ifdef CONFIG_RMAP_ID +/* + * For init_mm and friends, we don't actually expect to ever rmap pages. So + * we use a reserved dummy ID that we'll never hand out the normal way. + */ +#define RMAP_ID_DUMMY 0 +#define RMAP_ID_MIN (RMAP_ID_DUMMY + 1) +#define RMAP_ID_MAX (16 * 1024 * 1024u - 1) + +void free_rmap_id(int id); +int alloc_rmap_id(void); + +#define RMAP_SUBID_4_MAX_ORDER 10 +#define RMAP_SUBID_5_MIN_ORDER 11 +#define RMAP_SUBID_5_MAX_ORDER 12 +#define RMAP_SUBID_6_MIN_ORDER 13 +#define RMAP_SUBID_6_MAX_ORDER 15 + +static inline void __folio_prep_large_rmap(struct folio *folio) +{ + const unsigned int order = folio_order(folio); + + raw_seqcount_init(&folio->_rmap_atomic_seqcount); + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + atomic_long_set(&folio->_rmap_val5, 0); + fallthrough; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... 
RMAP_SUBID_5_MAX_ORDER: + atomic_long_set(&folio->_rmap_val4, 0); + fallthrough; +#endif + default: + atomic_long_set(&folio->_rmap_val3, 0); + atomic_long_set(&folio->_rmap_val2, 0); + atomic_long_set(&folio->_rmap_val1, 0); + atomic_long_set(&folio->_rmap_val0, 0); + break; + } +} + +static inline void __folio_undo_large_rmap(struct folio *folio) +{ +#ifdef CONFIG_DEBUG_VM + const unsigned int order = folio_order(folio); + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val5)); + fallthrough; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val4)); + fallthrough; +#endif + default: + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val3)); + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val2)); + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val1)); + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val0)); + break; + } +#endif +} + +static inline void __folio_write_large_rmap_begin(struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount); +} + +static inline void __folio_write_large_rmap_end(struct folio *folio) +{ + raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount); +} + +void __folio_set_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm); +void __folio_add_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm); +#else +static inline void __folio_prep_large_rmap(struct folio *folio) +{ +} +static inline void __folio_undo_large_rmap(struct folio *folio) +{ +} +static inline void __folio_write_large_rmap_begin(struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); +} +static inline void __folio_write_large_rmap_end(struct folio *folio) +{ +} +static inline void __folio_set_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ +} +static inline void __folio_add_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ +} +#endif /* CONFIG_RMAP_ID */ + static inline void folio_set_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { @@ -175,30 +285,34 @@ static inline void folio_set_large_mapcount(struct folio *folio, VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); /* increment count (starts at -1) */ atomic_set(&folio->_total_mapcount, count - 1); + __folio_set_large_rmap_val(folio, count, vma->vm_mm); } static inline void folio_inc_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); - VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + __folio_write_large_rmap_begin(folio); atomic_inc(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, 1, vma->vm_mm); + __folio_write_large_rmap_end(folio); } static inline void folio_add_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { - VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); - VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + __folio_write_large_rmap_begin(folio); atomic_add(count, &folio->_total_mapcount); + __folio_add_large_rmap_val(folio, count, vma->vm_mm); + __folio_write_large_rmap_end(folio); } static inline void folio_dec_large_mapcount(struct 
folio *folio, struct vm_area_struct *vma) { - VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); - VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + __folio_write_large_rmap_begin(folio); atomic_dec(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, -1, vma->vm_mm); + __folio_write_large_rmap_end(folio); } /* RMAP flags, currently only relevant for some anon rmap operations. */ diff --git a/kernel/fork.c b/kernel/fork.c index 10917c3e1f03..773c93613ca2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -814,6 +814,26 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) #define mm_free_pgd(mm) #endif /* CONFIG_MMU */ +#ifdef CONFIG_RMAP_ID +static inline int mm_alloc_rmap_id(struct mm_struct *mm) +{ + int id = alloc_rmap_id(); + + if (id < 0) + return id; + mm->mm_rmap_id = id; + return 0; +} + +static inline void mm_free_rmap_id(struct mm_struct *mm) +{ + free_rmap_id(mm->mm_rmap_id); +} +#else +#define mm_alloc_rmap_id(mm) (0) +#define mm_free_rmap_id(mm) +#endif /* CONFIG_RMAP_ID */ + static void check_mm(struct mm_struct *mm) { int i; @@ -917,6 +937,7 @@ void __mmdrop(struct mm_struct *mm) WARN_ON_ONCE(mm == current->active_mm); mm_free_pgd(mm); + mm_free_rmap_id(mm); destroy_context(mm); mmu_notifier_subscriptions_destroy(mm); check_mm(mm); @@ -1298,6 +1319,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (mm_alloc_pgd(mm)) goto fail_nopgd; + if (mm_alloc_rmap_id(mm)) + goto fail_normapid; + if (init_new_context(p, mm)) goto fail_nocontext; @@ -1317,6 +1341,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, fail_cid: destroy_context(mm); fail_nocontext: + mm_free_rmap_id(mm); +fail_normapid: mm_free_pgd(mm); fail_nopgd: free_mm(mm); diff --git a/mm/Kconfig b/mm/Kconfig index 89971a894b60..bb0b7b885ada 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -861,6 +861,27 @@ choice benefit. endchoice +menuconfig RMAP_ID + bool "Rmap ID tracking (EXPERIMENTAL)" + depends on TRANSPARENT_HUGEPAGE && 64BIT + help + Use per-MM rmap IDs and the unleashed power of math to track + whether partially-mappable hugepages (i.e., THPs for now) are + "mapped shared" or "mapped exclusively". + + This tracking allow for efficiently and precisely detecting + whether a PTE-mapped THP is mapped by a single process + ("mapped exclusively") or mapped by multiple ones ("mapped + shared"), with the cost of additional tracking when (un)mapping + (parts of) such a THP. + + If this configuration is not enabled, an heuristic is used + instead that might result in false "mapped exclusively" + detection; some features relying on this information might + operate slightly imprecise (e.g., MADV_PAGEOUT succeeds although + it should fail) or might not be available at all (e.g., + Copy-on-Write reuse support). 
+ config THP_SWAP def_bool y depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP && 64BIT diff --git a/mm/Makefile b/mm/Makefile index 33873c8aedb3..b0cf2563f33a 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o +obj-$(CONFIG_RMAP_ID) += rmap_id.o diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 51a878efca0e..0228b04c4053 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -614,6 +614,7 @@ void folio_prep_large_rmappable(struct folio *folio) { VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); INIT_LIST_HEAD(&folio->_deferred_list); + __folio_prep_large_rmap(folio); folio_set_large_rmappable(folio); } @@ -2478,8 +2479,8 @@ static void __split_huge_page_tail(struct folio *folio, int tail, (1L << PG_dirty) | LRU_GEN_MASK | LRU_REFS_MASK)); - /* ->mapping in first and second tail page is replaced by other uses */ - VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, + /* ->mapping in some tail page is replaced by other uses */ + VM_BUG_ON_PAGE(tail > 4 && page_tail->mapping != TAIL_MAPPING, page_tail); page_tail->mapping = head->mapping; page_tail->index = head->index + tail; @@ -2550,6 +2551,16 @@ static void __split_huge_page(struct page *page, struct list_head *list, ClearPageHasHWPoisoned(head); +#ifdef CONFIG_RMAP_ID + /* + * Make sure folio->_rmap_atomic_seqcount, which overlays + * tail->private, is 0. All other folio->_rmap_valX should be 0 + * after unmapping the folio. + */ + if (likely(nr >= 4)) + raw_seqcount_init(&folio->_rmap_atomic_seqcount); +#endif /* CONFIG_RMAP_ID */ + for (i = nr - 1; i >= 1; i--) { __split_huge_page_tail(folio, i, lruvec, list); /* Some pages can be beyond EOF: drop them from page cache */ @@ -2809,6 +2820,7 @@ void folio_undo_large_rmappable(struct folio *folio) struct deferred_split *ds_queue; unsigned long flags; + __folio_undo_large_rmap(folio); /* * At this point, there is no one trying to add the folio to * deferred_list. If folio is not in deferred_list, it's safe diff --git a/mm/init-mm.c b/mm/init-mm.c index cfd367822cdd..8890271b50c6 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include @@ -46,6 +47,9 @@ struct mm_struct init_mm = { .cpu_bitmap = CPU_BITS_NONE, #ifdef CONFIG_IOMMU_SVA .pasid = IOMMU_PASID_INVALID, +#endif +#ifdef CONFIG_RMAP_ID + .mm_rmap_id = RMAP_ID_DUMMY, #endif INIT_MM_CONTEXT(init_mm) }; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aad45758c0c7..c1dd039801e7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1007,6 +1007,15 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) * deferred_list.next -- ignore value. */ break; +#ifdef CONFIG_RMAP_ID + case 3: + case 4: + /* + * the third and fourth tail page: ->mapping may be + * used to store RMAP values for RMAP ID tracking. + */ + break; +#endif /* CONFIG_RMAP_ID */ default: if (page->mapping != TAIL_MAPPING) { bad_page(page, "corrupted mapping in tail page"); diff --git a/mm/rmap_id.c b/mm/rmap_id.c new file mode 100644 index 000000000000..e66b0f5aea2d --- /dev/null +++ b/mm/rmap_id.c @@ -0,0 +1,351 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * rmap ID tracking for precise "mapped shared" vs. "mapped exclusively" + * detection of partially-mappable folios (e.g., PTE-mapped THP). + * + * Copyright Red Hat, Inc. 
2023
+ *
+ * Author(s): David Hildenbrand
+ */
+
+#include
+#include
+#include
+
+#include "internal.h"
+
+static DEFINE_SPINLOCK(rmap_id_lock);
+static DEFINE_IDA(rmap_ida);
+
+/* For now we only expect folios from the buddy, not hugetlb folios. */
+#if MAX_ORDER > RMAP_SUBID_6_MAX_ORDER
+#error "rmap ID tracking does not support such large MAX_ORDER"
+#endif
+
+/*
+ * We assign each MM a unique rmap ID and derive from it a sequence of
+ * special sub-IDs. We add/remove these sub-IDs to/from the corresponding
+ * folio rmap values (folio->rmap_valX) whenever (un)mapping (parts of) a
+ * partially mappable folio.
+ *
+ * With 24bit rmap IDs, and a folio size that is compatible with 4
+ * rmap values (more below), we calculate the sub-ID sequence like this:
+ *
+ *  rmap ID    : | 3 3 3 3 3 3 | 2 2 2 2 2 2 | 1 1 1 1 1 1 | 0 0 0 0 0 0 |
+ *  sub-ID IDX : |   IDX #3    |   IDX #2    |   IDX #1    |   IDX #0    |
+ *
+ *  sub-IDs    : [ subid_4(#3), subid_4(#2), subid_4(#1), subid_4(#0) ]
+ *  rmap value : [ _rmap_val3,  _rmap_val2,  _rmap_val1,  _rmap_val0  ]
+ *
+ * Any time we map/unmap a part (e.g., PTE, PMD) of a partially-mappable
+ * folio to/from a MM, we:
+ *  (1) Adjust (increment/decrement) the mapcount of the folio
+ *  (2) Adjust (add/remove) the folio rmap values using the MM sub-IDs
+ *
+ * So the rmap values are always linked to the folio mapcount.
+ * Consequently, we know that a single rmap value in the folio is the sum
+ * of exactly #folio_mapcount() rmap sub-IDs. As one example, if the folio
+ * is completely unmapped, the rmap values must be 0. As another example,
+ * if the folio is mapped exactly once, the rmap values correspond to the
+ * MM sub-IDs.
+ *
+ * To identify whether a given MM is responsible for all #folio_mapcount()
+ * mappings of a folio ("mapped exclusively") or whether other MMs are
+ * involved ("mapped shared"), we perform the following checks:
+ *  (1) Do we have more mappings than the folio has pages? Then the folio
+ *      is mapped shared. So when "folio_mapcount() > folio_nr_pages()".
+ *  (2) Do the rmap values correspond to "#folio_mapcount() * sub-IDs" of
+ *      the MM? Then the folio is mapped exclusively.
+ *
+ * To achieve (2), we generate sub-IDs that have the following property,
+ * assuming that our folio has P=folio_nr_pages() pages.
+ *  "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs
+ *  "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs
+ *  "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs
+ *  ...
+ *  "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs
+ *
+ * Further, we want "P * sub-ID" (the maximum number we will ever look at)
+ * to not overflow. If we overflow with " > P" mappings, we don't care as
+ * we won't be looking at the numbers until they are fully expressive
+ * again.
+ *
+ * Consequently, to not overflow 64bit values with "P * sub-ID", folios
+ * with large P require more rmap values (we cannot generate that many
+ * sub-IDs), whereby folios with smaller P can get away with fewer rmap
+ * values (we can generate more sub-IDs).
+ *
+ * The sub-IDs are generated in generations, whereby
+ *  (1) Generation #0 is the number 0
+ *  (2) Generation #N takes all numbers from generations #0..#N-1 and adds
+ *      (P + 1)^(N - 1), effectively doubling the number of sub-IDs
+ *
+ * Note: a PMD-sized THP can, for a short time while PTE-mapping it, be
+ * mapped using PTEs and a single PMD, resulting in "P + 1" mappings.
+ * For now, we don't consider this case, as we are ususally not + * looking at such folios while they being remapped, because the + * involved page tables are locked and stop any page table walkers. + */ + +/* + * With 1024 (order-10) possible exclusive mappings per folio, we can have 64 + * sub-IDs per 64bit value. + * + * With 4 such 64bit values, we can support 64^4 == 16M IDs. + */ +static const unsigned long rmap_subids_4[64] = { + 0ul, + 1ul, + 1025ul, + 1026ul, + 1050625ul, + 1050626ul, + 1051650ul, + 1051651ul, + 1076890625ul, + 1076890626ul, + 1076891650ul, + 1076891651ul, + 1077941250ul, + 1077941251ul, + 1077942275ul, + 1077942276ul, + 1103812890625ul, + 1103812890626ul, + 1103812891650ul, + 1103812891651ul, + 1103813941250ul, + 1103813941251ul, + 1103813942275ul, + 1103813942276ul, + 1104889781250ul, + 1104889781251ul, + 1104889782275ul, + 1104889782276ul, + 1104890831875ul, + 1104890831876ul, + 1104890832900ul, + 1104890832901ul, + 1131408212890625ul, + 1131408212890626ul, + 1131408212891650ul, + 1131408212891651ul, + 1131408213941250ul, + 1131408213941251ul, + 1131408213942275ul, + 1131408213942276ul, + 1131409289781250ul, + 1131409289781251ul, + 1131409289782275ul, + 1131409289782276ul, + 1131409290831875ul, + 1131409290831876ul, + 1131409290832900ul, + 1131409290832901ul, + 1132512025781250ul, + 1132512025781251ul, + 1132512025782275ul, + 1132512025782276ul, + 1132512026831875ul, + 1132512026831876ul, + 1132512026832900ul, + 1132512026832901ul, + 1132513102671875ul, + 1132513102671876ul, + 1132513102672900ul, + 1132513102672901ul, + 1132513103722500ul, + 1132513103722501ul, + 1132513103723525ul, + 1132513103723526ul, +}; + +static unsigned long get_rmap_subid_4(struct mm_struct *mm, int nr) +{ + const unsigned int rmap_id = mm->mm_rmap_id; + + VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 3); + return rmap_subids_4[(rmap_id >> (nr * 6)) & 0x3f]; +} + +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER +/* + * With 4096 (order-12) possible exclusive mappings per folio, we can have + * 32 sub-IDs per 64bit value. + * + * With 5 such 64bit values, we can support 32^5 > 16M IDs. + */ +static const unsigned long rmap_subids_5[32] = { + 0ul, + 1ul, + 4097ul, + 4098ul, + 16785409ul, + 16785410ul, + 16789506ul, + 16789507ul, + 68769820673ul, + 68769820674ul, + 68769824770ul, + 68769824771ul, + 68786606082ul, + 68786606083ul, + 68786610179ul, + 68786610180ul, + 281749955297281ul, + 281749955297282ul, + 281749955301378ul, + 281749955301379ul, + 281749972082690ul, + 281749972082691ul, + 281749972086787ul, + 281749972086788ul, + 281818725117954ul, + 281818725117955ul, + 281818725122051ul, + 281818725122052ul, + 281818741903363ul, + 281818741903364ul, + 281818741907460ul, + 281818741907461ul, +}; + +static unsigned long get_rmap_subid_5(struct mm_struct *mm, int nr) +{ + const unsigned int rmap_id = mm->mm_rmap_id; + + VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 4); + return rmap_subids_5[(rmap_id >> (nr * 5)) & 0x1f]; +} +#endif + +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER +/* + * With 32768 (order-15) possible exclusive mappings per folio, we can have + * 16 sub-IDs per 64bit value. + * + * With 6 such 64bit values, we can support 8^6 == 16M IDs. 
+ */ +static const unsigned long rmap_subids_6[16] = { + 0ul, + 1ul, + 32769ul, + 32770ul, + 1073807361ul, + 1073807362ul, + 1073840130ul, + 1073840131ul, + 35187593412609ul, + 35187593412610ul, + 35187593445378ul, + 35187593445379ul, + 35188667219970ul, + 35188667219971ul, + 35188667252739ul, + 35188667252740ul, +}; + +static unsigned long get_rmap_subid_6(struct mm_struct *mm, int nr) +{ + const unsigned int rmap_id = mm->mm_rmap_id; + + VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 15); + return rmap_subids_6[(rmap_id >> (nr * 4)) & 0xf]; +} +#endif + +void __folio_set_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_6(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_6(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_6(mm, 2) * count); + atomic_long_set(&folio->_rmap_val3, get_rmap_subid_6(mm, 3) * count); + atomic_long_set(&folio->_rmap_val4, get_rmap_subid_6(mm, 4) * count); + atomic_long_set(&folio->_rmap_val5, get_rmap_subid_6(mm, 5) * count); + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_5(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_5(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_5(mm, 2) * count); + atomic_long_set(&folio->_rmap_val3, get_rmap_subid_5(mm, 3) * count); + atomic_long_set(&folio->_rmap_val4, get_rmap_subid_5(mm, 4) * count); + break; +#endif + default: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_4(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_4(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_4(mm, 2) * count); + atomic_long_set(&folio->_rmap_val3, get_rmap_subid_4(mm, 3) * count); + break; + } +} + +void __folio_add_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + atomic_long_add(get_rmap_subid_6(mm, 0) * count, &folio->_rmap_val0); + atomic_long_add(get_rmap_subid_6(mm, 1) * count, &folio->_rmap_val1); + atomic_long_add(get_rmap_subid_6(mm, 2) * count, &folio->_rmap_val2); + atomic_long_add(get_rmap_subid_6(mm, 3) * count, &folio->_rmap_val3); + atomic_long_add(get_rmap_subid_6(mm, 4) * count, &folio->_rmap_val4); + atomic_long_add(get_rmap_subid_6(mm, 5) * count, &folio->_rmap_val5); + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... 
RMAP_SUBID_5_MAX_ORDER:
+		atomic_long_add(get_rmap_subid_5(mm, 0) * count, &folio->_rmap_val0);
+		atomic_long_add(get_rmap_subid_5(mm, 1) * count, &folio->_rmap_val1);
+		atomic_long_add(get_rmap_subid_5(mm, 2) * count, &folio->_rmap_val2);
+		atomic_long_add(get_rmap_subid_5(mm, 3) * count, &folio->_rmap_val3);
+		atomic_long_add(get_rmap_subid_5(mm, 4) * count, &folio->_rmap_val4);
+		break;
+#endif
+	default:
+		atomic_long_add(get_rmap_subid_4(mm, 0) * count, &folio->_rmap_val0);
+		atomic_long_add(get_rmap_subid_4(mm, 1) * count, &folio->_rmap_val1);
+		atomic_long_add(get_rmap_subid_4(mm, 2) * count, &folio->_rmap_val2);
+		atomic_long_add(get_rmap_subid_4(mm, 3) * count, &folio->_rmap_val3);
+		break;
+	}
+}
+
+int alloc_rmap_id(void)
+{
+	int id;
+
+	/*
+	 * We cannot use a mutex, because free_rmap_id() might get called
+	 * when we are not allowed to sleep.
+	 *
+	 * TODO: do we need something like idr_preload()?
+	 */
+	spin_lock(&rmap_id_lock);
+	id = ida_alloc_range(&rmap_ida, RMAP_ID_MIN, RMAP_ID_MAX, GFP_ATOMIC);
+	spin_unlock(&rmap_id_lock);
+
+	return id;
+}
+
+void free_rmap_id(int id)
+{
+	if (id == RMAP_ID_DUMMY)
+		return;
+	if (WARN_ON_ONCE(id < RMAP_ID_MIN || id > RMAP_ID_MAX))
+		return;
+	spin_lock(&rmap_id_lock);
+	ida_free(&rmap_ida, id);
+	spin_unlock(&rmap_id_lock);
+}
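The hard-coded rmap_subids_4[] table above follows directly from the generation rule described in the comment. As a standalone illustration (plain userspace C, not kernel code and not part of this patch), the following sketch re-derives the 64 order-10 sub-IDs from that rule:

#include <stdio.h>

/*
 * Illustrative sketch: re-derive the 64 sub-IDs used for order-10 folios
 * (P = 1024 possible exclusive mappings) from the generation rule above:
 * generation #0 is {0}; generation #N adds (P + 1)^(N - 1) to everything
 * generated so far, doubling the number of sub-IDs each time.
 */
int main(void)
{
	const unsigned long long P = 1024;
	unsigned long long subids[64];
	unsigned long long step = 1;	/* (P + 1)^(N - 1), starting at (P + 1)^0 */
	unsigned int count = 1, i;

	subids[0] = 0;
	while (count < 64) {
		for (i = 0; i < count; i++)
			subids[count + i] = subids[i] + step;
		count *= 2;
		step *= P + 1;
	}

	/* Prints 0, 1, 1025, 1026, 1050625, ... matching rmap_subids_4[]. */
	for (i = 0; i < 64; i++)
		printf("%lluul,\n", subids[i]);
	return 0;
}

The same rule with base 4097 (order-12) or base 32769 (order-15) reproduces the rmap_subids_5[] and rmap_subids_6[] tables, respectively.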
From patchwork Fri Nov 24 13:26:13 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 08/20] mm: pass MM to folio_mapped_shared()
Date: Fri, 24 Nov 2023 14:26:13 +0100
Message-ID: <20231124132626.235350-9-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

We'll need the MM next to make a better decision regarding
partially-mappable folios (e.g., PTE-mapped THP) using per-MM rmap IDs.

Signed-off-by: David Hildenbrand
---
 include/linux/mm.h |  4 +++-
 mm/huge_memory.c   |  2 +-
 mm/madvise.c       |  6 +++---
 mm/memory.c        |  2 +-
 mm/mempolicy.c     | 14 +++++++-------
 mm/migrate.c       |  2 +-
 6 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17dac913f367..765e688690f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2117,6 +2117,7 @@ static inline size_t folio_size(struct folio *folio)
  * folio_mapped_shared - Report if a folio is certainly mapped by
  *			 multiple entities in their page tables
  * @folio: The folio.
+ * @mm: The mm the folio is mapped into.
  *
  * This function checks if a folio is certainly *currently* mapped by
  * multiple entities in their page table ("mapped shared") or if the folio
@@ -2153,7 +2154,8 @@ static inline size_t folio_size(struct folio *folio)
  *
  * Return: Whether the folio is certainly mapped by multiple entities.
  */
-static inline bool folio_mapped_shared(struct folio *folio)
+static inline bool folio_mapped_shared(struct folio *folio,
+		struct mm_struct *mm)
 {
 	unsigned int total_mapcount;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0228b04c4053..fd7251923557 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1639,7 +1639,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * If other processes are mapping this folio, we couldn't discard
 	 * the folio unless they all do MADV_FREE so let's skip the folio.
*/ - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) goto out; if (!folio_trylock(folio)) diff --git a/mm/madvise.c b/mm/madvise.c index 1a82867c8c2e..e3e4f3ea5f6d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -365,7 +365,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, folio = pfn_folio(pmd_pfn(orig_pmd)); /* Do not interfere with other mappings of this folio */ - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) goto huge_unlock; if (pageout_anon_only_filter && !folio_test_anon(folio)) @@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err; - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break; @@ -665,7 +665,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err; - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) break; if (!folio_trylock(folio)) break; diff --git a/mm/memory.c b/mm/memory.c index 14416d05e1b6..5048d58d6174 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4848,7 +4848,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * Flag if the folio is shared between multiple address spaces. This * is later used when determining whether to group tasks together */ - if (folio_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) + if (folio_mapped_shared(folio, vma->vm_mm) && (vma->vm_flags & VM_SHARED)) flags |= TNF_SHARED; nid = folio_nid(folio); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0492113497cc..bd0243da26bf 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -418,7 +418,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { }; static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, - unsigned long flags); + struct mm_struct *mm, unsigned long flags); static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, pgoff_t ilx, int *nid); @@ -481,7 +481,7 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk) return; if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || !vma_migratable(walk->vma) || - !migrate_folio_add(folio, qp->pagelist, qp->flags)) + !migrate_folio_add(folio, qp->pagelist, walk->mm, qp->flags)) qp->nr_failed++; } @@ -561,7 +561,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, } if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || !vma_migratable(vma) || - !migrate_folio_add(folio, qp->pagelist, flags)) { + !migrate_folio_add(folio, qp->pagelist, walk->mm, flags)) { qp->nr_failed++; if (strictly_unmovable(flags)) break; @@ -609,7 +609,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, * easily detect if a folio is shared. */ if ((flags & MPOL_MF_MOVE_ALL) || - (!folio_mapped_shared(folio) && !hugetlb_pmd_shared(pte))) + (!folio_mapped_shared(folio, walk->mm) && !hugetlb_pmd_shared(pte))) if (!isolate_hugetlb(folio, qp->pagelist)) qp->nr_failed++; unlock: @@ -981,7 +981,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, #ifdef CONFIG_MIGRATION static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, - unsigned long flags) + struct mm_struct *mm, unsigned long flags) { /* * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. 
@@ -990,7 +990,7 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
 	 * See folio_mapped_shared() on possible imprecision when we cannot
 	 * easily detect if a folio is shared.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio)) {
+	if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio, mm)) {
 		if (folio_isolate_lru(folio)) {
 			list_add_tail(&folio->lru, foliolist);
 			node_stat_mod_folio(folio,
@@ -1195,7 +1195,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 #else
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
-		unsigned long flags)
+		struct mm_struct *mm, unsigned long flags)
 {
 	return false;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 341a84c3e8e4..8a1d75ff2dc6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2559,7 +2559,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 	 * every page is mapped to the same process. Doing that is very
 	 * expensive, so check the estimated mapcount of the folio instead.
 	 */
-	if (folio_mapped_shared(folio) && folio_is_file_lru(folio) &&
+	if (folio_mapped_shared(folio, vma->vm_mm) && folio_is_file_lru(folio) &&
 	    (vma->vm_flags & VM_EXEC))
 		goto out;
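The reason the MM has to be passed down becomes visible in the next patch: with rmap IDs, "mapped exclusively" is always a statement relative to one particular MM. The following standalone sketch (plain C; the toy_* names are made up for illustration, this is not kernel code) models the per-MM bookkeeping the series relies on:

#include <assert.h>
#include <stdio.h>

/*
 * Toy model of the per-MM rmap-ID bookkeeping: every mapping adds the
 * mapping MM's sub-ID to the folio's rmap value, every unmapping removes
 * it again. A folio is "mapped exclusively" by an MM iff the accumulated
 * value equals mapcount * sub-ID(MM).
 */
struct toy_folio {
	unsigned long rmap_val;
	long mapcount;
};

static void toy_map(struct toy_folio *f, unsigned long subid, long nr)
{
	f->mapcount += nr;
	f->rmap_val += subid * nr;
}

static void toy_unmap(struct toy_folio *f, unsigned long subid, long nr)
{
	f->mapcount -= nr;
	f->rmap_val -= subid * nr;
}

static int toy_mapped_shared(const struct toy_folio *f, unsigned long subid)
{
	return f->rmap_val != (unsigned long)f->mapcount * subid;
}

int main(void)
{
	/* Two MMs, identified by two valid order-10 sub-IDs from the table. */
	const unsigned long mm_a = 1, mm_b = 1025;
	struct toy_folio f = { 0, 0 };

	toy_map(&f, mm_a, 512);			/* A maps all 512 PTEs */
	assert(!toy_mapped_shared(&f, mm_a));

	toy_map(&f, mm_b, 1);			/* B maps a single PTE as well */
	assert(toy_mapped_shared(&f, mm_a));
	assert(toy_mapped_shared(&f, mm_b));

	toy_unmap(&f, mm_a, 512);		/* A unmaps everything again */
	assert(!toy_mapped_shared(&f, mm_b));

	printf("toy rmap-ID bookkeeping OK\n");
	return 0;
}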
From patchwork Fri Nov 24 13:26:14 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 09/20] mm: improve folio_mapped_shared() for partially-mappable folios using rmap IDs
Date: Fri, 24 Nov 2023 14:26:14 +0100
Message-ID: <20231124132626.235350-10-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

Let's make folio_mapped_shared() precise by using our rmap ID magic to
identify if a single MM is responsible for all mappings.

If there is a lot of concurrent (un)map activity, we could theoretically
spin for quite a while. But we're only looking at the rmap values in case
we didn't already identify the folio as "obviously shared". In most
cases, there should only be one or a handful of page tables involved.

For current THPs with ~512 .. 2048 subpages, we really shouldn't see a
lot of concurrent updates that keep us spinning for a long time. Anyhow,
if this ever becomes a problem, it can be optimized later if there is
real demand.

Signed-off-by: David Hildenbrand
---
 include/linux/mm.h   | 21 ++++++++++++---
 include/linux/rmap.h |  2 ++
 mm/rmap_id.c         | 63 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 765e688690f1..1081a8faa1a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2113,6 +2113,17 @@ static inline size_t folio_size(struct folio *folio)
 	return PAGE_SIZE << folio_order(folio);
 }
 
+#ifdef CONFIG_RMAP_ID
+bool __folio_large_mapped_shared(struct folio *folio, struct mm_struct *mm);
+#else
+static inline bool __folio_large_mapped_shared(struct folio *folio,
+		struct mm_struct *mm)
+{
+	/* ... guess based on the mapcount of the first page of the folio. */
+	return atomic_read(&folio->page._mapcount) > 0;
+}
+#endif
+
 /**
  * folio_mapped_shared - Report if a folio is certainly mapped by
  *			 multiple entities in their page tables
@@ -2141,8 +2152,11 @@ static inline size_t folio_size(struct folio *folio)
  * PMD-mapped PMD-sized THP), the result will be exactly correct.
  *
  * For all other (partially-mappable) folios, such as PTE-mapped THP, the
- * return value is partially fuzzy: true is not fuzzy, because it means
- * "certainly mapped shared", but false means "maybe mapped exclusively".
+ * return value is partially fuzzy without CONFIG_RMAP_ID: true is not fuzzy,
+ * because it means "certainly mapped shared", but false means
+ * "maybe mapped exclusively".
+ *
+ * With CONFIG_RMAP_ID, the result will be exactly correct.
  *
  * Note that this function only considers *current* page table mappings
  * tracked via rmap -- that properly adjusts the folio mapcount(s) -- and
@@ -2177,8 +2191,7 @@ static inline bool folio_mapped_shared(struct folio *folio,
 	 */
 	if (total_mapcount > folio_nr_pages(folio))
 		return true;
-	/* ...
guess based on the mapcount of the first page of the folio. */ - return atomic_read(&folio->page._mapcount) > 0; + return __folio_large_mapped_shared(folio, mm); } #ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 19c9dc3216df..a73e146d82d1 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -253,6 +253,8 @@ void __folio_set_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); void __folio_add_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); +bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, + struct mm_struct *mm); #else static inline void __folio_prep_large_rmap(struct folio *folio) { diff --git a/mm/rmap_id.c b/mm/rmap_id.c index e66b0f5aea2d..85a61c830f19 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -322,6 +322,69 @@ void __folio_add_large_rmap_val(struct folio *folio, int count, } } +bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + unsigned long diff = 0; + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER .. RMAP_SUBID_6_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_6(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_6(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_6(mm, 2) * count); + diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_6(mm, 3) * count); + diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_6(mm, 4) * count); + diff |= atomic_long_read(&folio->_rmap_val5) ^ (get_rmap_subid_6(mm, 5) * count); + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER .. RMAP_SUBID_5_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_5(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_5(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_5(mm, 2) * count); + diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_5(mm, 3) * count); + diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_5(mm, 4) * count); + break; +#endif + default: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_4(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_4(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_4(mm, 2) * count); + diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_4(mm, 3) * count); + break; + } + return !diff; +} + +bool __folio_large_mapped_shared(struct folio *folio, struct mm_struct *mm) +{ + unsigned long start; + bool exclusive; + int mapcount; + + VM_WARN_ON_ONCE(!folio_test_large_rmappable(folio)); + VM_WARN_ON_ONCE(folio_test_hugetlb(folio)); + + /* + * Livelocking here is unlikely, as the caller already handles the + * "obviously shared" cases. If ever an issue and there is too much + * concurrent (un)mapping happening (using different page tables), we + * could stop earlier and just return "shared". 
+	 */
+	do {
+		start = raw_read_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount);
+		mapcount = folio_mapcount(folio);
+		if (unlikely(mapcount > folio_nr_pages(folio)))
+			return true;
+		exclusive = __folio_has_large_matching_rmap_val(folio, mapcount, mm);
+	} while (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount,
+					        start));
+
+	return !exclusive;
+}
+
 int alloc_rmap_id(void)
 {
 	int id;
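The equality check performed by __folio_has_large_matching_rmap_val() is only conclusive because of the sub-ID property stated in the mm/rmap_id.c comment: with at most P mappings, "k * sub-ID" can only be produced by summing k copies of that very sub-ID. The following standalone brute-force check of that property uses a toy P = 4 (illustrative C, not kernel code; the sub-IDs follow the same generation rule with base P + 1 = 5):

#include <stdio.h>

#define P	4	/* toy folio size: at most P mappings considered */
#define NSUB	8	/* first three generations for base P + 1 = 5 */

static const unsigned long subid[NSUB] = { 0, 1, 5, 6, 25, 26, 30, 31 };

static unsigned long picked[P];
static int violations;

/* Enumerate all multisets of k sub-IDs and test the claimed property. */
static void check(int k, int depth, int start, unsigned long sum)
{
	int i, j, all_same;

	if (depth == k) {
		for (i = 0; i < NSUB; i++) {
			if (sum != k * subid[i])
				continue;
			all_same = 1;
			for (j = 0; j < k; j++)
				all_same &= (picked[j] == subid[i]);
			if (!all_same)
				violations++;
		}
		return;
	}
	for (i = start; i < NSUB; i++) {
		picked[depth] = subid[i];
		check(k, depth + 1, i, sum + subid[i]);
	}
}

int main(void)
{
	int k;

	for (k = 2; k <= P; k++)
		check(k, 0, 0, 0);
	printf("violations: %d\n", violations);	/* expect 0 */
	return 0;
}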
From patchwork Fri Nov 24 13:26:15 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 10/20] mm/memory: COW reuse support for PTE-mapped THP with rmap IDs
Date: Fri, 24 Nov 2023 14:26:15 +0100
Message-ID: <20231124132626.235350-11-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

For now, we only end up reusing small folios and PMD-mapped large folios
(i.e., THP) after fork(); PTE-mapped THPs are never reused, except when
only a single page of the folio remains mapped. Instead, we end up copying
each subpage even though the THP might be exclusive to the MM.

The logic we're using for small folios and PMD-mapped THPs is the
following: Is the only reference to the folio from a single page table
mapping? Then:
 (a) There are no other references to the folio from other MMs
     (e.g., page table mapping, GUP)
 (b) There are no other references to the folio from page migration/
     swapout/swapcache that might temporarily unmap the folio.

Consequently, the folio is exclusive to that process and can be reused.
In that case, we end up with folio_refcount(folio) == 1 and an implied
folio_mapcount(folio) == 1, while holding the page table lock and the
page lock to protect against possible races.

For PTE-mapped THP, however, we have not one, but multiple references
from page tables, whereby such THPs can be mapped into multiple page
tables in the MM.

Reusing the logic that we use for small folios and PMD-mapped THPs means
that, when reusing a PTE-mapped THP, we want to make sure that:
 (1) All folio references are from page table mappings.
 (2) All page table mappings belong to the same MM.
 (3) We didn't race with (un)mapping of the page related to other page
     tables, such that the mapcount and refcount are stable.

For (1), we can check folio_refcount(folio) == folio_mapcount(folio).
For (2) and (3), we can use our new rmap ID infrastructure.

We won't bother with the swapcache and LRU cache for now.

Add some sanity checks under CONFIG_DEBUG_VM, to identify any obvious
problems early.

Signed-off-by: David Hildenbrand
---
 mm/memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 5048d58d6174..fb533995ff68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3360,6 +3360,95 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 static bool wp_can_reuse_anon_folio(struct folio *folio,
 				    struct vm_area_struct *vma)
 {
+#ifdef CONFIG_RMAP_ID
+	if (folio_test_large(folio)) {
+		bool retried = false;
+		unsigned long start;
+		int mapcount, i;
+
+		/*
+		 * The assumption for anonymous folios is that each page can
+		 * only get mapped once into a MM. This also holds for
+		 * small folios -- except when KSM is involved. KSM currently
+		 * does not apply to large folios.
+ * + * Further, each taken mapcount must be paired with exactly one + * taken reference, whereby references must be incremented + * before the mapcount when mapping a page, and references must + * be decremented after the mapcount when unmapping a page. + * + * So if all references to a folio are from mappings, and all + * mappings are due to our (MM) page tables, and there was no + * concurrent (un)mapping, this folio is certainly exclusive. + * + * We currently don't optimize for: + * (a) folio is mapped into multiple page tables in this + * MM (e.g., mremap) and other page tables are + * concurrently (un)mapping the folio. + * (b) the folio is in the swapcache. Likely the other PTEs + * are still swap entries and folio_free_swap() would fail. + * (c) the folio is in the LRU cache. + */ +retry: + start = raw_read_atomic_seqcount(&folio->_rmap_atomic_seqcount); + if (start & ATOMIC_SEQCOUNT_WRITERS_MASK) + return false; + mapcount = folio_mapcount(folio); + + /* Is this folio possibly exclusive ... */ + if (mapcount > folio_nr_pages(folio) || folio_entire_mapcount(folio)) + return false; + + /* ... and are all references from mappings ... */ + if (folio_ref_count(folio) != mapcount) + return false; + + /* ... and do all mappings belong to us ... */ + if (!__folio_has_large_matching_rmap_val(folio, mapcount, vma->vm_mm)) + return false; + + /* ... and was there no concurrent (un)mapping ? */ + if (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount, + start)) + return false; + + /* Safety checks we might want to drop in the future. */ + if (IS_ENABLED(CONFIG_DEBUG_VM)) { + unsigned int mapcount; + + if (WARN_ON_ONCE(folio_test_ksm(folio))) + return false; + /* + * We might have raced against swapout code adding + * the folio to the swapcache (which, by itself, is not + * problematic). Let's simply check again if we would + * properly detect the additional reference now and + * properly fail. + */ + if (unlikely(folio_test_swapcache(folio))) { + if (WARN_ON_ONCE(retried)) + return false; + retried = true; + goto retry; + } + for (i = 0; i < folio_nr_pages(folio); i++) { + mapcount = page_mapcount(folio_page(folio, i)); + if (WARN_ON_ONCE(mapcount > 1)) + return false; + } + } + + /* + * This folio is exclusive to us. Do we need the page lock? + * Likely not, and a trylock would be unfortunate if this + * folio is mapped into multiple page tables and we get + * concurrent page faults. If there would be references from + * page migration/swapout/swapcache, we would have detected + * an additional reference and never ended up here. 
+		 */
+		return true;
+	}
+#endif /* CONFIG_RMAP_ID */
 	/*
 	 * We have to verify under folio lock: these early checks are
 	 * just an optimization to avoid locking the folio and freeing
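Taken together, the reuse path boils down to a sequence of cheap checks that only count if the folio's _rmap_atomic_seqcount stayed stable. The following standalone sketch (plain C; the toy_* names are hypothetical, this is not kernel code) condenses that decision into one function, with a trivial single-threaded stand-in for the seqcount:

#include <stdio.h>

/*
 * Single-threaded toy model of the COW-reuse decision above: snapshot a
 * sequence count, check mapcount/refcount/rmap value, and only trust the
 * result if no writer was active and the sequence did not change.
 */
struct toy_folio {
	unsigned long seq;	/* odd would mean "writer active" */
	int mapcount;
	int refcount;
	unsigned long rmap_val;
};

static int toy_can_reuse(const struct toy_folio *f, unsigned long subid,
			 int nr_pages)
{
	unsigned long start = f->seq;

	if (start & 1)					/* concurrent writer */
		return 0;
	if (f->mapcount > nr_pages)			/* certainly shared */
		return 0;
	if (f->refcount != f->mapcount)			/* GUP, swapcache, ... */
		return 0;
	if (f->rmap_val != (unsigned long)f->mapcount * subid)
		return 0;				/* other MMs involved */
	return f->seq == start;				/* nothing changed */
}

int main(void)
{
	/* 512 PTE mappings, all from the MM with sub-ID 1025, no extra refs. */
	struct toy_folio f = { 0, 512, 512, 512 * 1025UL };

	printf("reuse: %d\n", toy_can_reuse(&f, 1025, 512));	/* 1 */
	f.refcount++;	/* e.g., a GUP pin shows up */
	printf("reuse: %d\n", toy_can_reuse(&f, 1025, 512));	/* 0 */
	return 0;
}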
From patchwork Fri Nov 24 13:26:16 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 11/20] mm/rmap_id: support for 1, 2 and 3 values by manual calculation
Date: Fri, 24 Nov 2023 14:26:16 +0100
Message-ID: <20231124132626.235350-12-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

For smaller folios, we can use fewer rmap values:
* <= order-2: 1x 64bit value
* <= order-5: 2x 64bit values
* <= order-9: 3x 64bit values

We end up with a lot of subids, so we cannot really use lookup tables.
Pre-calculate the subids per MM.

For order-9 we could think about having a lookup table with 128bit
entries. Further, we could calculate them only when really required.

With 2 MiB THP this now implies only 3 instead of 4 values.

Signed-off-by: David Hildenbrand
---
 include/linux/mm_types.h |  3 ++
 include/linux/rmap.h     | 58 ++++++++++++++++++++++++++++-
 kernel/fork.c            |  6 +++
 mm/rmap_id.c             | 79 +++++++++++++++++++++++++++++++++++++---
 4 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 75305c57ef64..0ca5004e8f4a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1032,6 +1032,9 @@ struct mm_struct {
 
 #ifdef CONFIG_RMAP_ID
 		int mm_rmap_id;
+		unsigned long mm_rmap_subid_1;
+		unsigned long mm_rmap_subid_2[2];
+		unsigned long mm_rmap_subid_3[3];
 #endif /* CONFIG_RMAP_ID */
 	} __randomize_layout;
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a73e146d82d1..39aeab457f4a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -180,12 +180,54 @@ struct anon_vma *folio_get_anon_vma(struct folio *folio);
 void free_rmap_id(int id);
 int alloc_rmap_id(void);
 
+#define RMAP_SUBID_1_MAX_ORDER		2
+#define RMAP_SUBID_2_MIN_ORDER		3
+#define RMAP_SUBID_2_MAX_ORDER		5
+#define RMAP_SUBID_3_MIN_ORDER		6
+#define RMAP_SUBID_3_MAX_ORDER		9
+#define RMAP_SUBID_4_MIN_ORDER		10
 #define RMAP_SUBID_4_MAX_ORDER		10
 #define RMAP_SUBID_5_MIN_ORDER		11
 #define RMAP_SUBID_5_MAX_ORDER		12
 #define RMAP_SUBID_6_MIN_ORDER		13
 #define RMAP_SUBID_6_MAX_ORDER		15
 
+static inline unsigned long calc_rmap_subid(unsigned int n, unsigned int i)
+{
+	unsigned long nr = 0, mult = 1;
+
+	while (i) {
+		if (i & 1)
+			nr += mult;
+		mult *= (n + 1);
+		i >>= 1;
+	}
+	return nr;
+}
+
+static inline unsigned long calc_rmap_subid_1(int rmap_id)
+{
+	VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX);
+
+	return calc_rmap_subid(1u << RMAP_SUBID_1_MAX_ORDER, rmap_id);
+}
+
+static inline unsigned long calc_rmap_subid_2(int rmap_id, int nr)
+{
+	VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 1);
+
+	return calc_rmap_subid(1u << RMAP_SUBID_2_MAX_ORDER,
+			       (rmap_id >> (nr * 12)) & 0xfff);
+}
+
+static inline unsigned long calc_rmap_subid_3(int rmap_id, int nr)
+{
+
VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 2); + + return calc_rmap_subid(1u << RMAP_SUBID_3_MAX_ORDER, + (rmap_id >> (nr * 8)) & 0xff); +} + static inline void __folio_prep_large_rmap(struct folio *folio) { const unsigned int order = folio_order(folio); @@ -202,10 +244,16 @@ static inline void __folio_prep_large_rmap(struct folio *folio) atomic_long_set(&folio->_rmap_val4, 0); fallthrough; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: atomic_long_set(&folio->_rmap_val3, 0); + fallthrough; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: atomic_long_set(&folio->_rmap_val2, 0); + fallthrough; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: atomic_long_set(&folio->_rmap_val1, 0); + fallthrough; + default: atomic_long_set(&folio->_rmap_val0, 0); break; } @@ -227,10 +275,16 @@ static inline void __folio_undo_large_rmap(struct folio *folio) VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val4)); fallthrough; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val3)); + fallthrough; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val2)); + fallthrough; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val1)); + fallthrough; + default: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val0)); break; } diff --git a/kernel/fork.c b/kernel/fork.c index 773c93613ca2..1d2f6248c83e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -822,6 +822,12 @@ static inline int mm_alloc_rmap_id(struct mm_struct *mm) if (id < 0) return id; mm->mm_rmap_id = id; + mm->mm_rmap_subid_1 = calc_rmap_subid_1(id); + mm->mm_rmap_subid_2[0] = calc_rmap_subid_2(id, 0); + mm->mm_rmap_subid_2[1] = calc_rmap_subid_2(id, 1); + mm->mm_rmap_subid_3[0] = calc_rmap_subid_3(id, 0); + mm->mm_rmap_subid_3[1] = calc_rmap_subid_3(id, 1); + mm->mm_rmap_subid_3[2] = calc_rmap_subid_3(id, 2); return 0; } diff --git a/mm/rmap_id.c b/mm/rmap_id.c index 85a61c830f19..6c3187547741 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -87,6 +87,39 @@ static DEFINE_IDA(rmap_ida); * involved page tables are locked and stop any page table walkers. */ +/* + * With 4 (order-2) possible exclusive mappings per folio, we can have + * 16777216 = 16M sub-IDs per 64bit value. + */ +static unsigned long get_rmap_subid_1(struct mm_struct *mm) +{ + return mm->mm_rmap_subid_1; +} + +/* + * With 32 (order-5) possible exclusive mappings per folio, we can have + * 4096 sub-IDs per 64bit value. + * + * With 2 such 64bit values, we can support 4096^2 == 16M IDs. + */ +static unsigned long get_rmap_subid_2(struct mm_struct *mm, int nr) +{ + VM_WARN_ON_ONCE(nr > 1); + return mm->mm_rmap_subid_2[nr]; +} + +/* + * With 512 (order-9) possible exclusive mappings per folio, we can have + * 128 sub-IDs per 64bit value. + * + * With 3 such 64bit values, we can support 128^3 == 16M IDs. + */ +static unsigned long get_rmap_subid_3(struct mm_struct *mm, int nr) +{ + VM_WARN_ON_ONCE(nr > 2); + return mm->mm_rmap_subid_3[nr]; +} + /* * With 1024 (order-10) possible exclusive mappings per folio, we can have 64 * sub-IDs per 64bit value. @@ -279,12 +312,24 @@ void __folio_set_large_rmap_val(struct folio *folio, int count, atomic_long_set(&folio->_rmap_val4, get_rmap_subid_5(mm, 4) * count); break; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... 
RMAP_SUBID_4_MAX_ORDER: atomic_long_set(&folio->_rmap_val0, get_rmap_subid_4(mm, 0) * count); atomic_long_set(&folio->_rmap_val1, get_rmap_subid_4(mm, 1) * count); atomic_long_set(&folio->_rmap_val2, get_rmap_subid_4(mm, 2) * count); atomic_long_set(&folio->_rmap_val3, get_rmap_subid_4(mm, 3) * count); break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_3(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_3(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_3(mm, 2) * count); + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_2(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_2(mm, 1) * count); + break; + default: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_1(mm) * count); + break; } } @@ -313,12 +358,24 @@ void __folio_add_large_rmap_val(struct folio *folio, int count, atomic_long_add(get_rmap_subid_5(mm, 4) * count, &folio->_rmap_val4); break; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: atomic_long_add(get_rmap_subid_4(mm, 0) * count, &folio->_rmap_val0); atomic_long_add(get_rmap_subid_4(mm, 1) * count, &folio->_rmap_val1); atomic_long_add(get_rmap_subid_4(mm, 2) * count, &folio->_rmap_val2); atomic_long_add(get_rmap_subid_4(mm, 3) * count, &folio->_rmap_val3); break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + atomic_long_add(get_rmap_subid_3(mm, 0) * count, &folio->_rmap_val0); + atomic_long_add(get_rmap_subid_3(mm, 1) * count, &folio->_rmap_val1); + atomic_long_add(get_rmap_subid_3(mm, 2) * count, &folio->_rmap_val2); + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + atomic_long_add(get_rmap_subid_2(mm, 0) * count, &folio->_rmap_val0); + atomic_long_add(get_rmap_subid_2(mm, 1) * count, &folio->_rmap_val1); + break; + default: + atomic_long_add(get_rmap_subid_1(mm) * count, &folio->_rmap_val0); + break; } } @@ -330,7 +387,7 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, switch (order) { #if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER - case RMAP_SUBID_6_MIN_ORDER .. RMAP_SUBID_6_MAX_ORDER: + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_6(mm, 0) * count); diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_6(mm, 1) * count); diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_6(mm, 2) * count); @@ -340,7 +397,7 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, break; #endif #if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER - case RMAP_SUBID_5_MIN_ORDER .. RMAP_SUBID_5_MAX_ORDER: + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_5(mm, 0) * count); diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_5(mm, 1) * count); diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_5(mm, 2) * count); @@ -348,12 +405,24 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_5(mm, 4) * count); break; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... 
RMAP_SUBID_4_MAX_ORDER: diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_4(mm, 0) * count); diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_4(mm, 1) * count); diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_4(mm, 2) * count); diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_4(mm, 3) * count); break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_3(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_3(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_3(mm, 2) * count); + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_2(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_2(mm, 1) * count); + break; + default: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_1(mm) * count); + break; } return !diff; } From patchwork Fri Nov 24 13:26:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169417 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191155vqx; Fri, 24 Nov 2023 05:28:48 -0800 (PST) X-Google-Smtp-Source: AGHT+IFhPzFW3SP+JwPvAmm/3aa8lHRynDIcgou45bAKLC64fipODMQGG+Bu/HCK/jnr8zKey3V9 X-Received: by 2002:a17:90b:4d0d:b0:280:4ec6:97e9 with SMTP id mw13-20020a17090b4d0d00b002804ec697e9mr3191988pjb.30.1700832528048; Fri, 24 Nov 2023 05:28:48 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832528; cv=none; d=google.com; s=arc-20160816; b=h76trDTXCA011FRN2nCePP3hWA5+IiI1kcGXMrn0p+UM9W5Fc6ts405FgviGViBu1M eKNmuJknOAsqPFPLdnhq7x3wK/u9ruN6FcXmDKij4gKUzzdU1nmVx9TYhChh+5C6/3mA EbhWn3ovma5pWFWl7S8INBPCmnRtuGRO4L8f+k+7bizje/Y9SJ+h/2uzS++LrToQYt3R 9dVF121LoJXmFR3GLUrdFVI2FUurJkNSqlRewa2DYwLvQBeqLouViNVTgYpCssEv4Iz5 Oss4WbH/hbHEARHaxAooGGuBsn+5OWrXZcVgMn5pqDW2c/xXDZdai666dAiXbGrZtlUi Arsg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=XtSW4MwxrPZp0IyU2r08B+zJ2gAs9ZHqKJrFZADgzdE=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=vnbva+/iJS8ruCE0dzrGn80kHW654jfZnm9XJfUq4dAocDvLRy3K/G+igy5xBiMXOn VV6gSgoXrVZwQUBcMGfPj93uaevunj3+j4cSgjR2OMivVUwElifCu1FnkZVwjGdLFurW ImlxzzLjrKcpAEE0coWQTvHXfcticFflUnM+iDKtoFejvk/MdjpszbHCZ0jrW/eDR9Fg sgclF4NOiYxobMy3nlsUFn/STfm5x+N0Gh7X8cf4PWXLt99UPk71nYjpL015EF5IEqUI 6sYm1usUU7PqqAb1AN37GP9eySPJ0D8xvQlnp/UH0j3wBQiMA5OZgfkfUpnRdYnPW9ny DOVg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WeCLO36I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from lipwig.vger.email (lipwig.vger.email. 
[23.128.96.33]) by mx.google.com with ESMTPS id x4-20020a17090a6c0400b002803ec7393csi4079945pjj.27.2023.11.24.05.28.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:28:48 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) client-ip=23.128.96.33; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WeCLO36I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id 205B48030A57; Fri, 24 Nov 2023 05:28:45 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235206AbjKXN2V (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58382 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345684AbjKXN1n (ORCPT ); Fri, 24 Nov 2023 08:27:43 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4746C1BF5 for ; Fri, 24 Nov 2023 05:27:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832440; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=XtSW4MwxrPZp0IyU2r08B+zJ2gAs9ZHqKJrFZADgzdE=; b=WeCLO36IZ9SNhaMFgnZu8voNGRCkCCJPoksdZVs0/7brTpb5DEAxPDojyg+BjTAiO+0/nA SayEjB50qiNcOo6MATyLZ4UAULHm3GhKd6PNukhIhK5FVxVNwjrV/rOt1Ko6lLUsPGVCuG MuF0ZlgAJrryMx5SaInDNhLl6h8JuXs= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-624-S2Vde6e2MxqrSRyf_jNx7w-1; Fri, 24 Nov 2023 08:27:15 -0500 X-MC-Unique: S2Vde6e2MxqrSRyf_jNx7w-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id B241E185A784; Fri, 24 Nov 2023 13:27:14 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 3636A2166B2A; Fri, 24 Nov 2023 13:27:11 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 12/20] mm/rmap: introduce folio_add_anon_rmap_range() Date: Fri, 24 Nov 2023 14:26:17 +0100 Message-ID: <20231124132626.235350-13-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:28:45 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452168986567177 X-GMAIL-MSGID: 1783452168986567177 There are probably ways to have an even cleaner interface (e.g., pass the mapping granularity instead of "compound"). For now, let's handle it like folio_add_file_rmap_range(). Use separate loops for handling the "SetPageAnonExclusive()" case and performing debug checks. The latter should get optimized out automatically without CONFIG_DEBUG_VM. We'll use this function to batch rmap operations when PTE-remapping a PMD-mapped THP next. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 3 ++ mm/rmap.c | 69 +++++++++++++++++++++++++++++++++----------- 2 files changed, 55 insertions(+), 17 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 39aeab457f4a..76e6fb1dad5c 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -393,6 +393,9 @@ typedef int __bitwise rmap_t; * rmap interfaces called when adding or removing pte of page */ void folio_move_anon_rmap(struct folio *, struct vm_area_struct *); +void folio_add_anon_rmap_range(struct folio *, struct page *, + unsigned int nr_pages, struct vm_area_struct *, + unsigned long address, rmap_t flags); void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, rmap_t flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, diff --git a/mm/rmap.c b/mm/rmap.c index 689ad85cf87e..da7fa46a18fc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1240,25 +1240,29 @@ static void __page_check_anon_rmap(struct folio *folio, struct page *page, } /** - * page_add_anon_rmap - add pte mapping to an anonymous page - * @page: the page to add the mapping to - * @vma: the vm area in which the mapping is added - * @address: the user virtual address mapped - * @flags: the rmap flags + * folio_add_anon_rmap_range - add mappings to a page range of an anon folio + * @folio: The folio to add the mapping to + * @page: The first page to add + * @nr_pages: The number of pages which will be mapped + * @vma: The vm area in which the mapping is added + * @address: The user virtual address of the first page to map + * @flags: The rmap flags + * + * The page range of folio is defined by [first_page, first_page + nr_pages) * * The caller needs to hold the pte lock, and the page must be locked in * the anon_vma case: to serialize mapping,index checking after setting, - * and to ensure that PageAnon is not being upgraded racily to PageKsm - * (but PageKsm is never downgraded to PageAnon). + * and to ensure that an anon folio is not being upgraded racily to a KSM folio + * (but KSM folios are never downgraded). 
*/ -void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, +void folio_add_anon_rmap_range(struct folio *folio, struct page *page, + unsigned int nr_pages, struct vm_area_struct *vma, unsigned long address, rmap_t flags) { - struct folio *folio = page_folio(page); - unsigned int nr, nr_pmdmapped = 0; + unsigned int i, nr, nr_pmdmapped = 0; bool compound = flags & RMAP_COMPOUND; - nr = __folio_add_rmap_range(folio, page, 1, vma, compound, + nr = __folio_add_rmap_range(folio, page, nr_pages, vma, compound, &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); @@ -1279,12 +1283,20 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, } else if (likely(!folio_test_ksm(folio))) { __page_check_anon_rmap(folio, page, vma, address); } - if (flags & RMAP_EXCLUSIVE) - SetPageAnonExclusive(page); - /* While PTE-mapping a THP we have a PMD and a PTE mapping. */ - VM_WARN_ON_FOLIO((atomic_read(&page->_mapcount) > 0 || - (folio_test_large(folio) && folio_entire_mapcount(folio) > 1)) && - PageAnonExclusive(page), folio); + + if (flags & RMAP_EXCLUSIVE) { + for (i = 0; i < nr_pages; i++) + SetPageAnonExclusive(page + i); + } + for (i = 0; i < nr_pages; i++) { + struct page *cur_page = page + i; + + /* While PTE-mapping a THP we have a PMD and a PTE mapping. */ + VM_WARN_ON_FOLIO((atomic_read(&cur_page->_mapcount) > 0 || + (folio_test_large(folio) && + folio_entire_mapcount(folio) > 1)) && + PageAnonExclusive(cur_page), folio); + } /* * For large folio, only mlock it if it's fully mapped to VMA. It's @@ -1296,6 +1308,29 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, mlock_vma_folio(folio, vma); } +/** + * page_add_anon_rmap - add mappings to an anonymous page + * @page: The page to add the mapping to + * @vma: The vm area in which the mapping is added + * @address: The user virtual address of the page to map + * @flags: The rmap flags + * + * See folio_add_anon_rmap_range(). + */ +void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, + unsigned long address, rmap_t flags) +{ + struct folio *folio = page_folio(page); + unsigned int nr_pages; + + if (likely(!(flags & RMAP_COMPOUND))) + nr_pages = 1; + else + nr_pages = folio_nr_pages(folio); + + folio_add_anon_rmap_range(folio, page, nr_pages, vma, address, flags); +} + /** * folio_add_new_anon_rmap - Add mapping to a new anonymous folio. * @folio: The folio to add the mapping to. 
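[Editor's note: the sub-ID construction from patch 11 above is easier to follow with a worked example. Below is a minimal, stand-alone user-space sketch -- the main() harness and the chosen IDs are illustrative only, not part of the series -- that mirrors calc_rmap_subid(): the bits of an rmap ID become digits in base (n + 1), where n is the number of pages in the folio (the "possible exclusive mappings per folio"), so per-MM mapping counts of up to n accumulate without carrying between digits.]

#include <assert.h>
#include <stdio.h>

/* Mirrors calc_rmap_subid() from patch 11: base-(n + 1) expansion of i's bits. */
static unsigned long calc_rmap_subid(unsigned int n, unsigned int i)
{
	unsigned long nr = 0, mult = 1;

	while (i) {
		if (i & 1)
			nr += mult;
		mult *= (n + 1);
		i >>= 1;
	}
	return nr;
}

int main(void)
{
	/* Order-2 folio: n = 4 pages per folio, so digits are base 5. */
	const unsigned int n = 4;
	const unsigned long a = calc_rmap_subid(n, 0x5);	/* 5^0 + 5^2 = 26 */
	const unsigned long b = calc_rmap_subid(n, 0x6);	/* 5^1 + 5^2 = 30 */
	unsigned long folio_val;

	/* One MM (ID 0x5) maps all 4 pages: the accumulated value matches it. */
	folio_val = 4 * a;
	assert(folio_val == a * 4 && folio_val != b * 4);

	/* Two pages mapped by each MM: neither "subid * 4" matches. */
	folio_val = 2 * a + 2 * b;
	assert(folio_val != a * 4 && folio_val != b * 4);

	printf("subid(0x5) = %lu, subid(0x6) = %lu\n", a, b);
	return 0;
}

[This no-carry property is what __folio_has_large_matching_rmap_val() relies on when it XOR-compares the per-folio values against "subid * count".]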
From patchwork Fri Nov 24 13:26:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169419 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191208vqx; Fri, 24 Nov 2023 05:28:52 -0800 (PST) X-Google-Smtp-Source: AGHT+IHV7lwS1kCbql6avzeY9WARUGI18hjlsZdWOhiN0wQWSaGFYYycxRbp/opWcHNvvt7BKJ8R X-Received: by 2002:a05:6a00:7cf:b0:6c4:d6fa:ee9d with SMTP id n15-20020a056a0007cf00b006c4d6faee9dmr6488454pfu.1.1700832532433; Fri, 24 Nov 2023 05:28:52 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832532; cv=none; d=google.com; s=arc-20160816; b=BRbJyo0PqmJRjr2cT3teXRLacUOLeUqMlD6FbMopSPFQk3kiRfY2O8v2VnLlgkT4rD oVla/TW9y06fTku3782QWALB2lX8n2vpvSK4LObnmAy7+wwicK3gmyrXy+88bows/Ua0 rN/aMo3iwbQihs/ZBRL/kwrsyDxe7forO64Xt6EA4q4ppgISebXm5aN3DaQmT3BRPLDi lwbX5iOrSBh8N7Bu5ihuWdKFPD4RdYG9AiDsQpwhOw/gfMYmBD33cjWdtkqWNG8yy7AN wcT8hs9ZngBeDvBaiozZ8UtFbl1Rk2+0JU0jJTRlzWxn2by+SyPFi9euU4xjC24K6rMa H0JQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=vCiq1mHEXhReKPGNK7ma5q/JWzLE5up94nYDqOWMaw0=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=fs5nUznSp5YtUUlEUcJCJY3y/aWk0KVgeqUAcIflg0wzXVk5Q8tP3D/z9VG+cOP80r SdjvmW2gi6C1wpooym+sB4af1jkYTl08TuY1+MGR0WHvBF+QyRm7gKQN6rFZ9NTw4oKA KRacAtl3Vzz464SA6nXabDlnoaO7fFpcwgKZmum7VWnEfgvQfK2vaS1ReerztLNGKMHW /aPRWhDfPgq86qsm3pHstC0spwoB2i4pWp/30NjSoI8UQuga9tZgi0bNNWRLoGqkStl0 nhWmSzx9rOCD2RPA0ziTa/ST+Y2RRk938Vce6z2FnWyf6h1tHGlhnaqylQAG5R0MtQdJ NIDw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=KMtMf+7d; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from agentk.vger.email (agentk.vger.email. 
[23.128.96.32]) by mx.google.com with ESMTPS id z1-20020aa78881000000b006cbb7fdff10si3526805pfe.194.2023.11.24.05.28.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:28:52 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) client-ip=23.128.96.32; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=KMtMf+7d; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id C957B81825CD; Fri, 24 Nov 2023 05:28:49 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345366AbjKXN2Y (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48830 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231172AbjKXN1r (ORCPT ); Fri, 24 Nov 2023 08:27:47 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0CB1E1FD5 for ; Fri, 24 Nov 2023 05:27:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832443; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=vCiq1mHEXhReKPGNK7ma5q/JWzLE5up94nYDqOWMaw0=; b=KMtMf+7dmBRtjbwxbYQXOrYTQcwZiZJDTFkfqfBZDVirfeYkN8AH/PvV7p0IXhMRq4/keh bYPgAC+g3Q580Km9kmZK6U8f8eP8GSYKjt1GZ0kjLjD9w//Bl75M7BQoShXl0h6+1VUUjC Q20c2Ym3edm/eG74EUt849f10Tw5pJc= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-654-EfryieQzPKuzYd766I-Umw-1; Fri, 24 Nov 2023 08:27:19 -0500 X-MC-Unique: EfryieQzPKuzYd766I-Umw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 805CE1C05142; Fri, 24 Nov 2023 13:27:18 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 0717E2166B2A; Fri, 24 Nov 2023 13:27:14 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 13/20] mm/huge_memory: batch rmap operations in __split_huge_pmd_locked() Date: Fri, 24 Nov 2023 14:26:18 +0100 Message-ID: <20231124132626.235350-14-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:28:50 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452173437447670 X-GMAIL-MSGID: 1783452173437447670 Let's batch the rmap operations, as a preparation to making individual page_add_anon_rmap() calls more expensive. While at it, use more folio operations (but only in the code branch we're touching), use VM_WARN_ON_FOLIO(), and pass RMAP_COMPOUND instead of manually setting PageAnonExclusive. We should never see non-anon pages on that branch: otherwise, the existing page_add_anon_rmap() call would have been flawed already. Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fd7251923557..f47971d1afbf 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2100,6 +2100,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, unsigned long haddr, bool freeze) { struct mm_struct *mm = vma->vm_mm; + struct folio *folio; struct page *page; pgtable_t pgtable; pmd_t old_pmd, _pmd; @@ -2195,16 +2196,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, uffd_wp = pmd_swp_uffd_wp(old_pmd); } else { page = pmd_page(old_pmd); + folio = page_folio(page); if (pmd_dirty(old_pmd)) { dirty = true; - SetPageDirty(page); + folio_set_dirty(folio); } write = pmd_write(old_pmd); young = pmd_young(old_pmd); soft_dirty = pmd_soft_dirty(old_pmd); uffd_wp = pmd_uffd_wp(old_pmd); - VM_BUG_ON_PAGE(!page_count(page), page); + VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); /* * Without "freeze", we'll simply split the PMD, propagating the @@ -2221,11 +2224,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, * * See page_try_share_anon_rmap(): invalidate PMD first. 
*/ - anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + anon_exclusive = PageAnonExclusive(page); if (freeze && anon_exclusive && page_try_share_anon_rmap(page)) freeze = false; - if (!freeze) - page_ref_add(page, HPAGE_PMD_NR - 1); + if (!freeze) { + rmap_t rmap_flags = RMAP_NONE; + + folio_ref_add(folio, HPAGE_PMD_NR - 1); + if (anon_exclusive) + rmap_flags = RMAP_EXCLUSIVE; + folio_add_anon_rmap_range(folio, page, HPAGE_PMD_NR, + vma, haddr, rmap_flags); + } } /* @@ -2268,8 +2278,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); if (write) entry = pte_mkwrite(entry, vma); - if (anon_exclusive) - SetPageAnonExclusive(page + i); if (!young) entry = pte_mkold(entry); /* NOTE: this may set soft-dirty too on some archs */ @@ -2279,7 +2287,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = pte_mksoft_dirty(entry); if (uffd_wp) entry = pte_mkuffd_wp(entry); - page_add_anon_rmap(page + i, vma, addr, RMAP_NONE); } VM_BUG_ON(!pte_none(ptep_get(pte))); set_pte_at(mm, addr, pte, entry); From patchwork Fri Nov 24 13:26:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169428 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191728vqx; Fri, 24 Nov 2023 05:29:41 -0800 (PST) X-Google-Smtp-Source: AGHT+IHgYE+zg/SzrX4uBWWWclrIzHkXkhvhD2AdPxOq3kkrTopSK4r0L6beJ93WO9SvokO9VemC X-Received: by 2002:a05:6a21:6da3:b0:188:f3d:ea35 with SMTP id wl35-20020a056a216da300b001880f3dea35mr4202670pzb.50.1700832581479; Fri, 24 Nov 2023 05:29:41 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832581; cv=none; d=google.com; s=arc-20160816; b=P1KBK6+u/H3ePAr0ADCID2C8+tRz7rFxxlVZeuFSrVTZunCaaUD048wuUXApUxLEzF +1YnIHMWy3FobK+C2xRcx9uNbbdLWltmGJ4Zr23mVOM4X6ItQBDdrAOzmah02EKIuXcI HuIu8Gl2ZDm+0V7gnLKMXV+ZPhd+YsXkRQuAgMkHDk3BpPgK2grcWQzopX+0J7W+il90 kCoplsf2Gu8Xv6ew1YM4bsNMTVy/QsfdbY1iLsKwPqzIleHCiKU8v6Mxlm8h9/ZI3zWo cxtyNjmKuRYnIUT5poJGA66kHkqXaZHss0LvgdXmTDaj2KjXTD+hV0U0+U6Mz8nUxAHL 6bwA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=8wkdTzr4ovr6JqvWUuBFMPrmTwPV2fAdlmREPfnLqjk=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=0j17+20a6N89IgfVXu9QnG9A5t37ulMUPSevUmleXgCfV5Lr3yR0xYKeJbMOAkU7hf MLf/ncO0NFHwGr5HvinP7+r2IFpPlh6Wp+lnLv0l6FG73SU4y4ZxcNNpq24e7+gbnb0h v5g+vMATeaTeKWV+YphnynV+jYLdqtEHniQiswvMtWkqiB4tf9GuTIwxP8ASeRkzcdCQ 0azMFPoCRh5QUgimLKQRilptcf71UjQNP/hKM2s1Okd6EphpKm2iB/Q1ExObuI5seNPo D39pQUvko/2lvxiHZX6GZ0u4hQVCZWXqOp95mH4MihD18RgyuuKeIGR7vTW1Gvc2U2YM z6dw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=P3OryhCw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from howler.vger.email (howler.vger.email. 
[2620:137:e000::3:4]) by mx.google.com with ESMTPS id k32-20020a634b60000000b005bd335981e2si3463709pgl.678.2023.11.24.05.29.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:41 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) client-ip=2620:137:e000::3:4; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=P3OryhCw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id 303308047649; Fri, 24 Nov 2023 05:29:26 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345427AbjKXN2m (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37522 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230104AbjKXN2S (ORCPT ); Fri, 24 Nov 2023 08:28:18 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6CA4F2105 for ; Fri, 24 Nov 2023 05:27:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832451; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8wkdTzr4ovr6JqvWUuBFMPrmTwPV2fAdlmREPfnLqjk=; b=P3OryhCwrrW3/y4KdyQKeVXl4ZsGXGCqICZDVyfWOr3qPxyr9OAPSTyhSzJPwo15ndIdhf M9IMRawBLym9dumKCK8uEAQfKEZt9+p2UK+VeGksUrqDmGNIoljzhaRSXa/a2ERoC8pkii QVZwgR8zn6qn35NAhZmGrVTX0kFoxO4= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-631-Mzyg-U9HMR2q8VnSaYE9mw-1; Fri, 24 Nov 2023 08:27:23 -0500 X-MC-Unique: Mzyg-U9HMR2q8VnSaYE9mw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 7F79C85A58C; Fri, 24 Nov 2023 13:27:22 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id E251F2166B2B; Fri, 24 Nov 2023 13:27:18 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 14/20] mm/huge_memory: avoid folio_refcount() < folio_mapcount() in __split_huge_pmd_locked() Date: Fri, 24 Nov 2023 14:26:19 +0100 Message-ID: <20231124132626.235350-15-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:29:26 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452224988353516 X-GMAIL-MSGID: 1783452224988353516 Currently, there is a short period in time where the refcount is smaller than the mapcount. Let's just make sure we obey the rules of refcount vs. mapcount: increment the refcount before incrementing the mapcount and decrement the refcount after decrementing the mapcount. While this could make code like can_split_folio() fail to detect other folio references, such code is (currently) racy already and this change shouldn't actually be considered a real fix but rather an improvement/ cleanup. The refcount vs. mapcount changes are now well balanced in the code, with the cost of one additional refcount change, which really shouldn't matter here that much -- we're usually touching >= 512 subpage mapcounts and much more after all. Found while playing with some sanity checks to detect such cases, which we might add at some later point. 
Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f47971d1afbf..9639b4edc8a5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2230,7 +2230,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (!freeze) { rmap_t rmap_flags = RMAP_NONE; - folio_ref_add(folio, HPAGE_PMD_NR - 1); + folio_ref_add(folio, HPAGE_PMD_NR); if (anon_exclusive) rmap_flags = RMAP_EXCLUSIVE; folio_add_anon_rmap_range(folio, page, HPAGE_PMD_NR, @@ -2294,10 +2294,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } pte_unmap(pte - 1); - if (!pmd_migration) + if (!pmd_migration) { page_remove_rmap(page, vma, true); - if (freeze) put_page(page); + } smp_wmb(); /* make pte visible before pmd */ pmd_populate(mm, pmd, pgtable); From patchwork Fri Nov 24 13:26:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169427 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191635vqx; Fri, 24 Nov 2023 05:29:32 -0800 (PST) X-Google-Smtp-Source: AGHT+IGvEuN2azV0Bqb3M23sEN3xY1ZqWuELkmR9Ulsse2wPtZ8t48Vz6/1VYj4BIJiSAn33hcuR X-Received: by 2002:a17:903:18f:b0:1cf:591c:a8b1 with SMTP id z15-20020a170903018f00b001cf591ca8b1mr3063643plg.15.1700832571746; Fri, 24 Nov 2023 05:29:31 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832571; cv=none; d=google.com; s=arc-20160816; b=SlSzbAozrR34Nx1qrQwKxrELTUF5R1N2CrvCoU/ozb8mMK84DXH5hbhBux5VdxFcnJ 7KRwqmqxlGMStNsIhZKMxhYOMvrVlGKhOZnUeHNzz0ToeLBDmw4MPFSi5sAuBk8A3LU/ 7HIJTJDS/H1MRQkPxj+gTkQ1A19TXuQyE0kMk36vX4/wOxlxCqMxR0iUW8K131DQFmIq 5bAEFbte4JUCLc5muYpea/YppH8Emhrj3Vb91THKsD/ViFtGFfcprj3GbYgiEU7ciJRb zAEd+QlZ40QlTS+4QIK5BctVKN3Vn80/0bM1Olp3BEKReekk8ufxhf2SK2htCGPcLAqW 9igA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=qX2dA3aqMotkmkgBnT9s/2Q/bW+jQxPjHk8bIRl2TXE=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=F4pkQ0+r0qBKKK0sJhGzwiIIFE/2lMyQftrMw/U7GArPOQV0I4wD7EG3r8iwy/cbyv BY4gT4Z0Y/qmDd+omABHcPDiL+2fmArxO2xp+U+CuGxwatw+3XEZZ5z8A3PwQmBwiuLy Xbw5FTpoBTQ1WTtqUDJ8pGedulC8+oOlJbVKhTTeoGfPYT3IOFZYLTWRK3Xhi7LDlPPV eCwdwG2n9MIGWe1VVYqSHuQ0w9W6k9jkRahxFM7J0hJse9Ik+h0URbuyy1YKnTvo94T8 YrDHN6/9t2/6+4rZBezfPtkm2ZNpaG6/q6+KscAhlLWjrXto6CwdNazi1l7hnw02QRGQ xYWA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=dQWeGBfK; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from lipwig.vger.email (lipwig.vger.email. 
[2620:137:e000::3:3]) by mx.google.com with ESMTPS id jg17-20020a17090326d100b001cf68d3e90csi3313912plb.98.2023.11.24.05.29.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:31 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) client-ip=2620:137:e000::3:3; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=dQWeGBfK; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id CFA0E80816A1; Fri, 24 Nov 2023 05:29:24 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231494AbjKXN3H (ORCPT + 99 others); Fri, 24 Nov 2023 08:29:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48808 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235220AbjKXN2V (ORCPT ); Fri, 24 Nov 2023 08:28:21 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 252A619BB for ; Fri, 24 Nov 2023 05:27:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832456; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qX2dA3aqMotkmkgBnT9s/2Q/bW+jQxPjHk8bIRl2TXE=; b=dQWeGBfKXMPea9Kj/9qHdaRFhs6cbWq9G+QHWZkqxUminddsyiLqg5ILBCnQ+Wo/rkcETm zXzv++mcKEnuIMWkojOJbWVHLodNAEnCEXh1dke6pZ4PN0lLN3o9uK0a0GvPYKHe2/G6SM 4Lpjtx7snDBxf0B4qa2on9d+IaitCvM= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-198-yyv_70-8PeWkL3Q6JW939Q-1; Fri, 24 Nov 2023 08:27:26 -0500 X-MC-Unique: yyv_70-8PeWkL3Q6JW939Q-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id CD8A52806053; Fri, 24 Nov 2023 13:27:25 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id DE44C2166B2B; Fri, 24 Nov 2023 13:27:22 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 15/20] mm/rmap_id: verify precalculated subids with CONFIG_DEBUG_VM Date: Fri, 24 Nov 2023 14:26:20 +0100 Message-ID: <20231124132626.235350-16-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:29:24 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452214628749976 X-GMAIL-MSGID: 1783452214628749976 Let's verify the precalculated subids for 4/5/6 values. Signed-off-by: David Hildenbrand --- mm/rmap_id.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/mm/rmap_id.c b/mm/rmap_id.c index 6c3187547741..421d8d2b646c 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -481,3 +481,29 @@ void free_rmap_id(int id) ida_free(&rmap_ida, id); spin_unlock(&rmap_id_lock); } + +#ifdef CONFIG_DEBUG_VM +static int __init rmap_id_init(void) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(rmap_subids_4); i++) + WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_4_MAX_ORDER, i) != + rmap_subids_4[i]); + +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + for (i = 0; i < ARRAY_SIZE(rmap_subids_5); i++) + WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_5_MAX_ORDER, i) != + rmap_subids_5[i]); +#endif + +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + for (i = 0; i < ARRAY_SIZE(rmap_subids_6); i++) + WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_6_MAX_ORDER, i) != + rmap_subids_6[i]); +#endif + + return 0; +} +module_init(rmap_id_init) +#endif /* CONFIG_DEBUG_VM */ From patchwork Fri Nov 24 13:26:21 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169433 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1192576vqx; Fri, 24 Nov 2023 05:30:38 -0800 (PST) X-Google-Smtp-Source: AGHT+IF9L0ad+gMup0/KzCJNTr9aiiCU9MNYG3xJGnmpt/E9sP+h0TMdHzz0XlOL1j5KRVnOrLGe X-Received: by 2002:a05:6a20:8f1a:b0:18b:962c:1ddf with SMTP id b26-20020a056a208f1a00b0018b962c1ddfmr3060156pzk.56.1700832638133; Fri, 24 Nov 2023 05:30:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832638; cv=none; d=google.com; s=arc-20160816; b=rNCOoC7fotfipjmHMYqjGMVOhW5onFbkcYXlUrCItHhDtAVBpwLYK49neHlL2RmFJy vafEFsb7XLAZUDDRIrZZ6Ty4amysJ3VQcURt5nsq3cmNYrCdo83ZE9RGvt6Rb80xCVQh eSGlKXtY02eBNTJGm6lYhG4WLjyxVfRWRwaJNwcsiOB0cGwefbAfRb8Xf9jimglzC1Qa /uvGvngix5WZFKSClfpftDMMc6QBqHtZIyGrcXCGGKXNJ3u2KY0a5Do7yNTC1IGx64Cl NDXAgpOMw7QC2w5ai3qGGB3QzfVbm39PN5dWpwzpslujYRadlcw7pBuOUn3d5v2RUHPP 2UIg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=IWxh+srF4eII3yzN2Ch68mHrKm0yO3+Xlp6snuZfYi4=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=gHDhYFHinZ1ENbVEPJEsvDwHxhWU/b9ROpzF/GECKT/AF9I2JzhEkATtCejzpBwdX8 
UHWDnm2JAhOB9IbLNtWFU7frG5louUdT82mMOAeVl9wFBaUotN5smAxu4V17PGd1G/HY TX6sKonWnJK3FPC5nOYdXVYFklSWoB7wmNJK4/sD+5JSwl0jREJRA4SoKYxvyPjaewA/ XAAQF0p6wR0NYLa2CQsV2SFHpIIMquxWw3hvWRm31cJvkmf85R0CJAwuquUZD9zbSyIe phivvVeLAF3tfDIz+nWsJvpHzl+FTOdIsj2iE9MY9dwJDM2RI4pWxNymKI9Qax3J4vec MNhg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="R/Vj3wZQ"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from howler.vger.email (howler.vger.email. [23.128.96.34]) by mx.google.com with ESMTPS id by37-20020a056a0205a500b005b942de1e92si3835842pgb.443.2023.11.24.05.30.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:30:38 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) client-ip=23.128.96.34; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="R/Vj3wZQ"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id 23880807F497; Fri, 24 Nov 2023 05:30:29 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232949AbjKXN27 (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37654 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232787AbjKXN2U (ORCPT ); Fri, 24 Nov 2023 08:28:20 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6D2662108 for ; Fri, 24 Nov 2023 05:27:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832453; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IWxh+srF4eII3yzN2Ch68mHrKm0yO3+Xlp6snuZfYi4=; b=R/Vj3wZQr5otgXJxdaACSiB4pyJvJfVdaSa+/u5vg6ZjxlQfHjqKf1vDePuZ1Ux96U7ypJ LQ/g0nPGw0usvd7PfsyDn/PwqVMuhyIOJzePDIR/r7Usi3JnRd+Ioz0c0i7+5oAQKMuk8e QFvLt9A+cZI52SKnua9hOTuQ9Cf6lWY= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-344-yC66DjjfM-2TnstsUc80dA-1; Fri, 24 Nov 2023 08:27:30 -0500 X-MC-Unique: yC66DjjfM-2TnstsUc80dA-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 5C903185A780; Fri, 24 Nov 2023 13:27:29 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 150282166B2A; Fri, 24 Nov 2023 
13:27:25 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. McKenney" Subject: [PATCH WIP v1 16/20] atomic_seqcount: support a single exclusive writer in the absence of other writers Date: Fri, 24 Nov 2023 14:26:21 +0100 Message-ID: <20231124132626.235350-17-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:30:29 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452284420170537 X-GMAIL-MSGID: 1783452284420170537 The current atomic seqcount requires that all writers must use atomic RMW operations in the critical section, which can result in quite some overhead on some platforms. In the common case, there is only a single writer, and ideally we'd be able to not use atomic RMW operations in that case, to reduce the overall number of atomic RMW operations on the fast path. So let's add support for a single exclusive writer. If there are no other writers, a writer can become the single exclusive writer by using an atomic cmpxchg on the atomic seqcount. However, if there is any concurrent writer (shared or exclusive), the writers become shared and only have to wait for a single exclusive writer to finish. So shared writers might be delayed a bit by the single exclusive writer, but they don't starve as they are guaranteed to make progress after the exclusive writer finished (that ideally runs faster than any shared writer due to no atomic RMW operations in the critical section). The exclusive path now effectively acts as a lock: if the trylock fails, we fallback to the shared path. We need acquire-release semantics that are implied by the full memory barriers that we are enforcing. Instead of the atomic_long_add_return(), we could keep using an atomic_long_add() + atomic_long_read(). But I suspect that doesn't really matter. If it ever matters, if will be easy to optimize. Signed-off-by: David Hildenbrand --- include/linux/atomic_seqcount.h | 101 ++++++++++++++++++++++++++------ include/linux/rmap.h | 5 +- 2 files changed, 85 insertions(+), 21 deletions(-) diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h index 109447b663a1..00286a9da221 100644 --- a/include/linux/atomic_seqcount.h +++ b/include/linux/atomic_seqcount.h @@ -8,8 +8,11 @@ /* * raw_atomic_seqcount_t -- a reader-writer consistency mechanism with - * lockless readers (read-only retry loops), and lockless writers. - * The writers must use atomic RMW operations in the critical section. + * lockless readers (read-only retry loops), and (almost) lockless writers. 
+ * Shared writers must use atomic RMW operations in the critical section, + * a single exclusive writer can avoid atomic RMW operations in the critical + * section. Shared writers will always have to wait for at most one exclusive + * writer to finish in order to make progress. * * This locking mechanism is applicable when all individual operations * performed by writers can be expressed using atomic RMW operations @@ -38,9 +41,10 @@ typedef struct raw_atomic_seqcount { /* 65536 CPUs */ #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful -#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000000fffful +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x0000000000010000ul +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000001fffful /* We have 48bit for the actual sequence. */ -#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000010000ul +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000020000ul #else /* CONFIG_64BIT */ @@ -48,9 +52,10 @@ typedef struct raw_atomic_seqcount { /* 64 CPUs */ #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful -#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x0000007ful -/* We have 25bit for the actual sequence. */ -#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000080ul +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x00000080ul +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000fful +/* We have 24bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000100ul #endif /* CONFIG_64BIT */ @@ -126,44 +131,102 @@ static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s, /** * raw_write_seqcount_begin() - start a raw_seqcount_t write critical section * @s: Pointer to the raw_atomic_seqcount_t + * @try_exclusive: Whether to try becoming the exclusive writer. * * raw_write_seqcount_begin() opens the write critical section of the * given raw_seqcount_t. This function must not be used in interrupt context. + * + * Return: "true" when we are the exclusive writer and can avoid atomic RMW + * operations in the critical section. Otherwise, we are a shared + * writer and have to use atomic RMW operations in the critical + * section. Will always return "false" if @try_exclusive is not "true". */ -static inline void raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s) +static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, + bool try_exclusive) { + unsigned long seqcount, seqcount_new; + BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT DEBUG_LOCKS_WARN_ON(in_interrupt()); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ preempt_disable(); - atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); - /* Store the sequence before any store in the critical section. */ - smp_mb__after_atomic(); + + /* If requested, can we just become the exclusive writer? */ + if (!try_exclusive) + goto shared; + + seqcount = atomic_long_read(&s->sequence); + if (unlikely(seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK)) + goto shared; + + seqcount_new = seqcount | ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; + /* + * Store the sequence before any store in the critical section. Further, + * this implies an acquire so loads within the critical section are + * not reordered to be outside the critical section. + */ + if (atomic_long_try_cmpxchg(&s->sequence, &seqcount, seqcount_new)) + return true; +shared: + /* + * Indicate that there is a shared writer, and spin until the exclusive + * writer is done. 
This avoids writer starvation, because we'll always + * have to wait for at most one writer. + * + * We spin with preemption disabled to not reschedule to a reader that + * cannot make any progress either way. + * + * Store the sequence before any store in the critical section. + */ + seqcount = atomic_long_add_return(ATOMIC_SEQCOUNT_SHARED_WRITER, + &s->sequence); #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT - DEBUG_LOCKS_WARN_ON((atomic_long_read(&s->sequence) & - ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > + DEBUG_LOCKS_WARN_ON((seqcount & ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER))) + return false; + + while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER) + cpu_relax(); + return false; } /** * raw_write_seqcount_end() - end a raw_seqcount_t write critical section * @s: Pointer to the raw_atomic_seqcount_t + * @exclusive: Return value of raw_write_atomic_seqcount_begin(). * * raw_write_seqcount_end() closes the write critical section of the * given raw_seqcount_t. */ -static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s) +static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s, + bool exclusive) { + unsigned long val = ATOMIC_SEQCOUNT_SEQUENCE_STEP; + + if (likely(exclusive)) { +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + val -= ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; + } else { #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT - DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & - ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); + DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ - /* Store the sequence after any store in the critical section. */ + val -= ATOMIC_SEQCOUNT_SHARED_WRITER; + } + /* + * Store the sequence after any store in the critical section. For + * the exclusive path, this further implies a release, so loads + * within the critical section are not reordered to be outside the + * cricial section. 
+ */ smp_mb__before_atomic(); - atomic_long_add(ATOMIC_SEQCOUNT_SEQUENCE_STEP - - ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); + atomic_long_add(val, &s->sequence); preempt_enable(); } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 76e6fb1dad5c..0758dddc5528 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -295,12 +295,13 @@ static inline void __folio_write_large_rmap_begin(struct folio *folio) { VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); - raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount); + raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, + false); } static inline void __folio_write_large_rmap_end(struct folio *folio) { - raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount); + raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, false); } void __folio_set_large_rmap_val(struct folio *folio, int count, From patchwork Fri Nov 24 13:26:22 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169430 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1192154vqx; Fri, 24 Nov 2023 05:30:12 -0800 (PST) X-Google-Smtp-Source: AGHT+IHK2I5S0fw8XsKyKLtD5lQEutx5kkH5WjxXm4+jIx7ZX4L4FV8tvvv/ajF7tGQ4B9fhmuNw X-Received: by 2002:a17:90a:d494:b0:285:8939:c4b3 with SMTP id s20-20020a17090ad49400b002858939c4b3mr2156489pju.13.1700832612318; Fri, 24 Nov 2023 05:30:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832612; cv=none; d=google.com; s=arc-20160816; b=B9M3rZEkO2A25l1lrmPZID8KbtICDdkw0O3UnW5ZMUYdzHi8SEw5zn/97Q+eZGGp3V FE9nhxAV1ZdFQ2Zg0CdiqPrTrMYqnCaSbjtCToFWWETkibfmL3rjhaT5aJP5WaXDWVHX l/QDfI3iC/W+KsoIcbQS4BbofPBMuedLjYUi9++rc9CNAFPOH/h/FjlD9Jz7nTz3Tql7 25V5BC/G4VITucuM6MJ1/K+Imy9SyrmC+PHbLUsUtiWahXQaTR+I2aHNedAvoZQr5bt7 nI0bnJinUVAvmwJQmZI7mtGhE3fQCMiD2+79i4h2hmTG+3W9bMi0anoFP4wn5Cynns3D 08Vw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=Cfr/GJlqpI78eOuvYmduIMFro8OwYW0A0XSs9rIUuNI=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=EgFaEfn4IhN5w/fjB2qAsAnAs5smb70VDpb/OzZTHbscH00NmiActd5iM+uUyKe0BZ kpGEGM9nfBZV6XJFF3fKWS9lnNlgqzUU3+J4kNmzwE/KrzWu+RTD+PSUKNGifbHy8OW6 Pjri3KFDk2oXH1kaTFY7oLyboIu4Nmcyv64F0+u1AoOq0+a35sNhM3HgdCGLhR6a9+C0 1jLJfNskzpPIYdqIK6uyXZokMy9DZ7Hx5lucGfpQSwbonUpOAdzsFAxY4hQexrOiiWGF O4BEd3NiCvfyQ5Grjv0j09GT3d2JAjllvxcBYIsW2e8rzpthKZm+Q85XZQ964XyKEi2p TIsw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=FyleEjPm; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from lipwig.vger.email (lipwig.vger.email. 
McKenney" Subject: [PATCH WIP v1 17/20] mm/rmap_id: reduce atomic RMW operations when we are the exclusive writer Date: Fri, 24 Nov 2023 14:26:22 +0100 Message-ID: <20231124132626.235350-18-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:29:42 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452256988591114 X-GMAIL-MSGID: 1783452256988591114 We can reduce the number of atomic RMW operations when we are the single exclusive writer -- the common case. So instead of always requiring (1) 2 atomic RMW operations for adjusting the atomic seqcount (2) 1 atomic RMW operation for adjusting the total mapcount (3) 1 to 6 atomic RMW operation for adjusting the rmap values We can avoid (2) and (3) if we are the exclusive writer and limit it to the 2 atomic RMW operations from (1). Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 81 +++++++++++++++++++++++++++++++++----------- mm/rmap_id.c | 52 ++++++++++++++++++++++++++++ 2 files changed, 114 insertions(+), 19 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 0758dddc5528..538c23d3c0c9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -291,23 +291,36 @@ static inline void __folio_undo_large_rmap(struct folio *folio) #endif } -static inline void __folio_write_large_rmap_begin(struct folio *folio) +static inline bool __folio_write_large_rmap_begin(struct folio *folio) { + bool exclusive; + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); - raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, - false); + + exclusive = raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, + true); + if (likely(exclusive)) { + prefetchw(&folio->_rmap_val0); + if (unlikely(folio_order(folio) > RMAP_SUBID_4_MAX_ORDER)) + prefetchw(&folio->_rmap_val4); + } + return exclusive; } -static inline void __folio_write_large_rmap_end(struct folio *folio) +static inline void __folio_write_large_rmap_end(struct folio *folio, + bool exclusive) { - raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, false); + raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, + exclusive); } void __folio_set_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); void __folio_add_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); +void __folio_add_large_rmap_val_exclusive(struct folio *folio, int count, + struct mm_struct *mm); bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, struct mm_struct *mm); #else @@ -317,12 +330,14 @@ static inline void __folio_prep_large_rmap(struct folio *folio) static inline void __folio_undo_large_rmap(struct folio *folio) { } -static inline void __folio_write_large_rmap_begin(struct folio *folio) +static inline bool __folio_write_large_rmap_begin(struct folio *folio) { 
VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + return false; } -static inline void __folio_write_large_rmap_end(struct folio *folio) +static inline void __folio_write_large_rmap_end(struct folio *folio, + bool exclusive) { } static inline void __folio_set_large_rmap_val(struct folio *folio, int count, @@ -333,6 +348,10 @@ static inline void __folio_add_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm) { } +static inline void __folio_add_large_rmap_val_exclusive(struct folio *folio, + int count, struct mm_struct *mm) +{ +} #endif /* CONFIG_RMAP_ID */ static inline void folio_set_large_mapcount(struct folio *folio, @@ -348,28 +367,52 @@ static inline void folio_set_large_mapcount(struct folio *folio, static inline void folio_inc_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - __folio_write_large_rmap_begin(folio); - atomic_inc(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, 1, vma->vm_mm); - __folio_write_large_rmap_end(folio); + bool exclusive; + + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + 1); + __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm); + } else { + atomic_inc(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, 1, vma->vm_mm); + } + __folio_write_large_rmap_end(folio, exclusive); } static inline void folio_add_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { - __folio_write_large_rmap_begin(folio); - atomic_add(count, &folio->_total_mapcount); - __folio_add_large_rmap_val(folio, count, vma->vm_mm); - __folio_write_large_rmap_end(folio); + bool exclusive; + + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + count); + __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm); + } else { + atomic_add(count, &folio->_total_mapcount); + __folio_add_large_rmap_val(folio, count, vma->vm_mm); + } + __folio_write_large_rmap_end(folio, exclusive); } static inline void folio_dec_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - __folio_write_large_rmap_begin(folio); - atomic_dec(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, -1, vma->vm_mm); - __folio_write_large_rmap_end(folio); + bool exclusive; + + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) - 1); + __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm); + } else { + atomic_dec(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, -1, vma->vm_mm); + } + __folio_write_large_rmap_end(folio, exclusive); } /* RMAP flags, currently only relevant for some anon rmap operations. */ diff --git a/mm/rmap_id.c b/mm/rmap_id.c index 421d8d2b646c..5009c6e43965 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -379,6 +379,58 @@ void __folio_add_large_rmap_val(struct folio *folio, int count, } } +void __folio_add_large_rmap_val_exclusive(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + + /* + * Concurrent rmap value modifications are impossible. We don't care + * about store tearing because readers will realize the concurrent + * updates using the seqcount and simply retry. So adjust the bare + * atomic counter instead. 
+ */ + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_6(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_6(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_6(mm, 2) * count; + folio->_rmap_val3.counter += get_rmap_subid_6(mm, 3) * count; + folio->_rmap_val4.counter += get_rmap_subid_6(mm, 4) * count; + folio->_rmap_val5.counter += get_rmap_subid_6(mm, 5) * count; + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_5(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_5(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_5(mm, 2) * count; + folio->_rmap_val3.counter += get_rmap_subid_5(mm, 3) * count; + folio->_rmap_val4.counter += get_rmap_subid_5(mm, 4) * count; + break; +#endif + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_4(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_4(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_4(mm, 2) * count; + folio->_rmap_val3.counter += get_rmap_subid_4(mm, 3) * count; + break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_3(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_3(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_3(mm, 2) * count; + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_2(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_2(mm, 1) * count; + break; + default: + folio->_rmap_val0.counter += get_rmap_subid_1(mm); + break; + } +} + bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, struct mm_struct *mm) {
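
The exclusive path added above replaces atomic RMW operations on the mapcount/rmap values with plain read+write updates. A compact sketch of why that is safe under the seqcount; the names are stand-ins, and the kernel uses atomic_read()/atomic_set() rather than C11 atomics:

#include <stdatomic.h>
#include <stdbool.h>

typedef _Atomic int counter_t;	/* stand-in for the kernel's atomic_t */

static inline void counter_add(counter_t *c, int delta, bool exclusive)
{
	if (exclusive) {
		/*
		 * The seqcount guarantees no concurrent writer exists, and
		 * concurrent readers detect the in-flight update via the
		 * sequence and retry, so a plain load + store (no atomic
		 * RMW, tearing tolerated) is sufficient here.
		 */
		atomic_store_explicit(c,
			atomic_load_explicit(c, memory_order_relaxed) + delta,
			memory_order_relaxed);
	} else {
		/* Other shared writers may race: a real atomic RMW is needed. */
		atomic_fetch_add(c, delta);
	}
}
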
From patchwork Fri Nov 24 13:26:23 2023
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 169429
From: David Hildenbrand
Subject: [PATCH WIP v1 18/20] atomic_seqcount: use atomic add-return instead of atomic cmpxchg on 64bit
Date: Fri, 24 Nov 2023 14:26:23 +0100
Message-ID: <20231124132626.235350-19-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

Turns out that it can be beneficial on some HW to use an add-return
instead of an atomic cmpxchg. However, we have to deal with more possible
races now: in the worst case, each and every CPU might try becoming the
exclusive writer at the same time, so we need the same number of bits as
for the shared writer case.

In case we detect that we didn't end up being the exclusive writer,
simply back off and convert to a shared writer.

Only implement this optimization on 64bit, where we can steal more bits
from the actual sequence without sorrow.

Signed-off-by: David Hildenbrand
---
 include/linux/atomic_seqcount.h | 43 +++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h index 00286a9da221..9cd40903863d 100644 --- a/include/linux/atomic_seqcount.h +++ b/include/linux/atomic_seqcount.h @@ -42,9 +42,10 @@ typedef struct raw_atomic_seqcount { #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful #define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x0000000000010000ul -#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000001fffful -/* We have 48bit for the actual sequence. */ -#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000020000ul +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK 0x00000000ffff0000ul +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x00000000fffffffful +/* We have 32bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000100000000ul #else /* CONFIG_64BIT */ @@ -53,6 +54,7 @@ typedef struct raw_atomic_seqcount { #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful #define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x00000080ul +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK 0x00000080ul #define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000fful /* We have 24bit for the actual sequence.
*/ #define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000100ul @@ -144,7 +146,7 @@ static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s, static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, bool try_exclusive) { - unsigned long seqcount, seqcount_new; + unsigned long __maybe_unused seqcount, seqcount_new; BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT @@ -160,6 +162,32 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, if (unlikely(seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK)) goto shared; +#ifdef CONFIG_64BIT + BUILD_BUG_ON(__builtin_popcount(ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK) != + __builtin_popcount(ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); + + /* See comment for atomic_long_try_cmpxchg() below. */ + seqcount = atomic_long_add_return(ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER, + &s->sequence); + if (likely((seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK) == + ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)) + return true; + + /* + * Whoops, we raced with another writer. Back off, converting ourselves + * to a shared writer and wait for any exclusive writers. + */ + atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER - ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER, + &s->sequence); + /* + * No need for __smp_mb__after_atomic(): the reader side already + * realizes that it has to retry and the memory barrier from + * atomic_long_add_return() is sufficient for that. + */ + while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK) + cpu_relax(); + return false; +#else seqcount_new = seqcount | ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; /* * Store the sequence before any store in the critical section. Further, @@ -168,6 +196,7 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, */ if (atomic_long_try_cmpxchg(&s->sequence, &seqcount, seqcount_new)) return true; +#endif shared: /* * Indicate that there is a shared writer, and spin until the exclusive @@ -185,10 +214,10 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, DEBUG_LOCKS_WARN_ON((seqcount & ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ - if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER))) + if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK))) return false; - while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER) + while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK) cpu_relax(); return false; } @@ -209,7 +238,7 @@ static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s, if (likely(exclusive)) { #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & - ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)); + ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK)); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ val -= ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; } else {
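
A userspace sketch of the 64bit fast path above: try to become the exclusive writer with a single add-return, and convert to a shared writer on a race. The constants mirror the 64bit layout in the hunk, but the helper itself is only an illustration of the idea, not the kernel code:

#include <stdatomic.h>
#include <stdbool.h>

#define SHARED_WRITER           0x0000000000000001ul
#define EXCLUSIVE_WRITER        0x0000000000010000ul
#define EXCLUSIVE_WRITERS_MASK  0x00000000ffff0000ul
#define WRITERS_MASK            0x00000000fffffffful

static _Atomic unsigned long sequence;

static bool write_begin(void)
{
	/* add-return: atomic_fetch_add() returns the old value, so add again. */
	unsigned long val = atomic_fetch_add(&sequence, EXCLUSIVE_WRITER) +
			    EXCLUSIVE_WRITER;

	if ((val & WRITERS_MASK) == EXCLUSIVE_WRITER)
		return true;	/* nobody else is writing: we are exclusive */

	/* Raced with other writers: back off into a shared writer ... */
	atomic_fetch_add(&sequence, SHARED_WRITER - EXCLUSIVE_WRITER);
	/* ... and wait until no CPU still claims exclusive access. */
	while (atomic_load(&sequence) & EXCLUSIVE_WRITERS_MASK)
		;	/* cpu_relax() in the kernel */
	return false;
}
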
From patchwork Fri Nov 24 13:26:24 2023
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 169432
From: David Hildenbrand
Subject: [PATCH WIP v1 19/20] mm/rmap: factor out removing folio range into __folio_remove_rmap_range()
Date: Fri, 24 Nov 2023 14:26:24 +0100
Message-ID: <20231124132626.235350-20-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

Let's factor it out, optimize for small folios, and compact it a bit.

Well, we're adding the range part, but that will surely come in handy
soon -- and it's now easier to compare it with __folio_add_rmap_range().

Signed-off-by: David Hildenbrand
---
 mm/rmap.c | 90 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 55 insertions(+), 35 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c index da7fa46a18fc..80ac53633332 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1155,6 +1155,57 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, return nr; } +static unsigned int __folio_remove_rmap_range(struct folio *folio, + struct page *page, unsigned int nr_pages, + struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) +{ + atomic_t *mapped = &folio->_nr_pages_mapped; + int last, count, nr = 0; + + VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); + VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); + VM_WARN_ON_FOLIO(compound && nr_pages != folio_nr_pages(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_large(folio) && nr_pages != 1, folio); + + if (likely(!folio_test_large(folio))) + return atomic_add_negative(-1, &page->_mapcount); + + /* Is page being unmapped by PTE? Is this its last map to be removed?
*/ + if (!compound) { + folio_add_large_mapcount(folio, -nr_pages, vma); + count = nr_pages; + do { + last = atomic_add_negative(-1, &page->_mapcount); + if (last) { + last = atomic_dec_return_relaxed(mapped); + if (last < COMPOUND_MAPPED) + nr++; + } + } while (page++, --count > 0); + } else if (folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + folio_dec_large_mapcount(folio, vma); + last = atomic_add_negative(-1, &folio->_entire_mapcount); + if (last) { + nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped); + if (likely(nr < COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of another remove and an add? */ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* An add of COMPOUND_MAPPED raced ahead */ + nr = 0; + } + } + } else { + VM_WARN_ON_ONCE_FOLIO(true, folio); + } + return nr; +} + /** * folio_move_anon_rmap - move a folio to our anon_vma * @folio: The folio to move to our anon_vma @@ -1439,13 +1490,10 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, bool compound) { struct folio *folio = page_folio(page); - atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; - bool last; + unsigned long nr_pages = compound ? folio_nr_pages(folio) : 1; + unsigned int nr, nr_pmdmapped = 0; enum node_stat_item idx; - VM_BUG_ON_PAGE(compound && !PageHead(page), page); - /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ @@ -1454,36 +1502,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, return; } - if (folio_test_large(folio)) - folio_dec_large_mapcount(folio, vma); - - /* Is page being unmapped by PTE? Is this its last map to be removed? */ - if (likely(!compound)) { - last = atomic_add_negative(-1, &page->_mapcount); - nr = last; - if (last && folio_test_large(folio)) { - nr = atomic_dec_return_relaxed(mapped); - nr = (nr < COMPOUND_MAPPED); - } - } else if (folio_test_pmd_mappable(folio)) { - /* That test is redundant: it's for safety or to optimize out */ - - last = atomic_add_negative(-1, &folio->_entire_mapcount); - if (last) { - nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped); - if (likely(nr < COMPOUND_MAPPED)) { - nr_pmdmapped = folio_nr_pages(folio); - nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); - /* Raced ahead of another remove and an add? 
*/ - if (unlikely(nr < 0)) - nr = 0; - } else { - /* An add of COMPOUND_MAPPED raced ahead */ - nr = 0; - } - } - } - + nr = __folio_remove_rmap_range(folio, page, nr_pages, vma, compound, + &nr_pmdmapped); if (nr_pmdmapped) { if (folio_test_anon(folio)) idx = NR_ANON_THPS;
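
As a rough model of the per-page loop that __folio_remove_rmap_range() runs for PTE unmaps, the sketch below counts how many pages of the range lost their last mapping; struct page is reduced to its mapcount, and COMPOUND_MAPPED stands in for the kernel's PMD-mapping bias (values and names here are assumptions for illustration):

#include <stdatomic.h>

#define COMPOUND_MAPPED 0x800000	/* assumed bias for a PMD mapping */

struct page_stub {
	_Atomic int mapcount;		/* like _mapcount: -1 == unmapped */
};

static int remove_rmap_range(struct page_stub *page, int nr_pages,
			     _Atomic int *nr_pages_mapped)
{
	int nr = 0;

	do {
		/* Did this page just lose its last PTE mapping? */
		if (atomic_fetch_sub(&page->mapcount, 1) - 1 < 0) {
			/*
			 * Unless a PMD mapping (the COMPOUND_MAPPED bias) is
			 * still present, the page is now fully unmapped and
			 * contributes to the NR_*MAPPED adjustment.
			 */
			if (atomic_fetch_sub(nr_pages_mapped, 1) - 1 < COMPOUND_MAPPED)
				nr++;
		}
	} while (page++, --nr_pages > 0);

	return nr;
}
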
McKenney" Subject: [PATCH WIP v1 20/20] mm/rmap: perform all mapcount operations of large folios under the rmap seqcount Date: Fri, 24 Nov 2023 14:26:25 +0100 Message-ID: <20231124132626.235350-21-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:30:07 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452370716979334 X-GMAIL-MSGID: 1783452370716979334 Let's extend the atomic seqcount to also protect modifications of: * The subpage mapcounts * The entire mapcount * folio->_nr_pages_mapped This way, we can avoid another 1/2 atomic RMW operations on the fast path (and significantly more when patching): When we are the exclusive writer, we only need two atomic RMW operations to manage the atomic seqcount. Let's document how the existing atomic seqcount memory barriers keep the old behavior unmodified: especially, how it makes sure that folio refcount updates cannot be reordered with folio mapcount updates. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 95 ++++++++++++++++++++++++++------------------ mm/rmap.c | 84 +++++++++++++++++++++++++++++++++++++-- 2 files changed, 137 insertions(+), 42 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 538c23d3c0c9..3cff4aa71393 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -301,6 +301,12 @@ static inline bool __folio_write_large_rmap_begin(struct folio *folio) exclusive = raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, true); if (likely(exclusive)) { + /* + * Note: raw_write_atomic_seqcount_begin() implies a full + * memory barrier like non-exclusive mapcount operations + * will. Any refcount updates that happened before this call + * are visible before any mapcount updates on other CPUs. + */ prefetchw(&folio->_rmap_val0); if (unlikely(folio_order(folio) > RMAP_SUBID_4_MAX_ORDER)) prefetchw(&folio->_rmap_val4); @@ -311,6 +317,12 @@ static inline bool __folio_write_large_rmap_begin(struct folio *folio) static inline void __folio_write_large_rmap_end(struct folio *folio, bool exclusive) { + /* + * Note: raw_write_atomic_seqcount_end() implies a full memory + * barrier like non-exclusive mapcount operations will. Any + * refcount updates happening after this call are visible after any + * mapcount updates on other CPUs. 
+ */ raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, exclusive); } @@ -367,52 +379,46 @@ static inline void folio_set_large_mapcount(struct folio *folio, static inline void folio_inc_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - bool exclusive; + atomic_inc(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, 1, vma->vm_mm); +} - exclusive = __folio_write_large_rmap_begin(folio); - if (likely(exclusive)) { - atomic_set(&folio->_total_mapcount, - atomic_read(&folio->_total_mapcount) + 1); - __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm); - } else { - atomic_inc(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, 1, vma->vm_mm); - } - __folio_write_large_rmap_end(folio, exclusive); +static inline void folio_inc_large_mapcount_exclusive(struct folio *folio, + struct vm_area_struct *vma) +{ + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + 1); + __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm); } static inline void folio_add_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { - bool exclusive; + atomic_add(count, &folio->_total_mapcount); + __folio_add_large_rmap_val(folio, count, vma->vm_mm); +} - exclusive = __folio_write_large_rmap_begin(folio); - if (likely(exclusive)) { - atomic_set(&folio->_total_mapcount, - atomic_read(&folio->_total_mapcount) + count); - __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm); - } else { - atomic_add(count, &folio->_total_mapcount); - __folio_add_large_rmap_val(folio, count, vma->vm_mm); - } - __folio_write_large_rmap_end(folio, exclusive); +static inline void folio_add_large_mapcount_exclusive(struct folio *folio, + int count, struct vm_area_struct *vma) +{ + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + count); + __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm); } static inline void folio_dec_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - bool exclusive; + atomic_dec(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, -1, vma->vm_mm); +} - exclusive = __folio_write_large_rmap_begin(folio); - if (likely(exclusive)) { - atomic_set(&folio->_total_mapcount, - atomic_read(&folio->_total_mapcount) - 1); - __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm); - } else { - atomic_dec(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, -1, vma->vm_mm); - } - __folio_write_large_rmap_end(folio, exclusive); +static inline void folio_dec_large_mapcount_exclusive(struct folio *folio, + struct vm_area_struct *vma) +{ + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) - 1); + __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm); } /* RMAP flags, currently only relevant for some anon rmap operations. 
*/ @@ -462,6 +468,7 @@ static inline void __page_dup_rmap(struct page *page, struct vm_area_struct *dst_vma, bool compound) { struct folio *folio = page_folio(page); + bool exclusive; VM_BUG_ON_PAGE(compound && !PageHead(page), page); if (likely(!folio_test_large(folio))) { @@ -475,11 +482,23 @@ static inline void __page_dup_rmap(struct page *page, return; } - if (compound) - atomic_inc(&folio->_entire_mapcount); - else - atomic_inc(&page->_mapcount); - folio_inc_large_mapcount(folio, dst_vma); + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + if (compound) + atomic_set(&folio->_entire_mapcount, + atomic_read(&folio->_entire_mapcount) + 1); + else + atomic_set(&page->_mapcount, + atomic_read(&page->_mapcount) + 1); + folio_inc_large_mapcount_exclusive(folio, dst_vma); + } else { + if (compound) + atomic_inc(&folio->_entire_mapcount); + else + atomic_inc(&page->_mapcount); + folio_inc_large_mapcount(folio, dst_vma); + } + __folio_write_large_rmap_end(folio, exclusive); } static inline void page_dup_file_rmap(struct page *page, diff --git a/mm/rmap.c b/mm/rmap.c index 80ac53633332..755a62b046e2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1109,7 +1109,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; - int first, count, nr = 0; + int first, val, count, nr = 0; + bool exclusive; VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); @@ -1119,8 +1120,23 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, if (likely(!folio_test_large(folio))) return atomic_inc_and_test(&page->_mapcount); + exclusive = __folio_write_large_rmap_begin(folio); + /* Is page being mapped by PTE? Is this its first map to be added? */ - if (!compound) { + if (likely(exclusive) && !compound) { + count = nr_pages; + do { + val = atomic_read(&page->_mapcount) + 1; + atomic_set(&page->_mapcount, val); + if (!val) { + val = atomic_read(mapped) + 1; + atomic_set(mapped, val); + if (val < COMPOUND_MAPPED) + nr++; + } + } while (page++, --count > 0); + folio_add_large_mapcount_exclusive(folio, nr_pages, vma); + } else if (!compound) { count = nr_pages; do { first = atomic_inc_and_test(&page->_mapcount); @@ -1131,6 +1147,26 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, } } while (page++, --count > 0); folio_add_large_mapcount(folio, nr_pages, vma); + } else if (likely(exclusive) && folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + val = atomic_read(&folio->_entire_mapcount) + 1; + atomic_set(&folio->_entire_mapcount, val); + if (!val) { + nr = atomic_read(mapped) + COMPOUND_MAPPED; + atomic_set(mapped, nr); + if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of a remove and another add? 
*/ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* Raced ahead of a remove of COMPOUND_MAPPED */ + nr = 0; + } + } + folio_inc_large_mapcount_exclusive(folio, vma); } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1152,6 +1188,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } + + __folio_write_large_rmap_end(folio, exclusive); return nr; } @@ -1160,7 +1198,8 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; - int last, count, nr = 0; + int last, val, count, nr = 0; + bool exclusive; VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); @@ -1170,8 +1209,23 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, if (likely(!folio_test_large(folio))) return atomic_add_negative(-1, &page->_mapcount); + exclusive = __folio_write_large_rmap_begin(folio); + /* Is page being unmapped by PTE? Is this its last map to be removed? */ - if (!compound) { + if (likely(exclusive) && !compound) { + folio_add_large_mapcount_exclusive(folio, -nr_pages, vma); + count = nr_pages; + do { + val = atomic_read(&page->_mapcount) - 1; + atomic_set(&page->_mapcount, val); + if (val < 0) { + val = atomic_read(mapped) - 1; + atomic_set(mapped, val); + if (val < COMPOUND_MAPPED) + nr++; + } + } while (page++, --count > 0); + } else if (!compound) { folio_add_large_mapcount(folio, -nr_pages, vma); count = nr_pages; do { @@ -1182,6 +1236,26 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, nr++; } } while (page++, --count > 0); + } else if (likely(exclusive) && folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + folio_dec_large_mapcount_exclusive(folio, vma); + val = atomic_read(&folio->_entire_mapcount) - 1; + atomic_set(&folio->_entire_mapcount, val); + if (val < 0) { + nr = atomic_read(mapped) - COMPOUND_MAPPED; + atomic_set(mapped, nr); + if (likely(nr < COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of another remove and an add? */ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* An add of COMPOUND_MAPPED raced ahead */ + nr = 0; + } + } } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1203,6 +1277,8 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } + + __folio_write_large_rmap_end(folio, exclusive); return nr; }
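
To close the series from the reader's point of view, here is a small sketch of the retry loop that makes the exclusive-writer shortcuts above safe: a reader snapshots the sequence, reads the values it needs, and retries whenever a writer was active in between. As in the earlier sketches, the constants and names are illustrative userspace stand-ins, not the kernel's:

#include <stdatomic.h>

#define WRITERS_MASK 0x1fffful		/* illustrative layout */

static _Atomic unsigned long sequence;
static _Atomic long rmap_val0;

static long read_rmap_val0(void)
{
	unsigned long snap;
	long val;

	do {
		snap = atomic_load(&sequence);		/* read_begin() */
		/* May race with a writer; the check below catches that. */
		val = atomic_load_explicit(&rmap_val0, memory_order_relaxed);
	} while ((snap & WRITERS_MASK) ||		/* writer active? */
		 atomic_load(&sequence) != snap);	/* sequence moved? */

	return val;
}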