From patchwork Fri Sep 29 11:44:12 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146507
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 1/9] mm: Allow deferred splitting of arbitrary anon large folios
Date: Fri, 29 Sep 2023 12:44:12 +0100
Message-Id: <20230929114421.3761121-2-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In preparation for the introduction of large folios for anonymous
memory, we would like to be able to split them when they have unmapped
subpages, in order to free those unused pages under memory pressure. So
remove the artificial requirement that the large folio must be at least
PMD-sized.
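To make the target scenario concrete: a large anon folio ends up with
unmapped subpages whenever userspace discards or unmaps only part of a
range that happens to be backed by one. The sketch below is illustrative
only - whether a THP/large folio actually backs the mapping depends on
the system's THP configuration, and the 2M region size is just an
example.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;	/* 2M: PMD-sized with 4K pages */
	char *p;

	/* Anonymous private mapping; THP may back it with a large folio. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch every page so the whole range is mapped. */
	memset(p, 0x5a, len);

	/*
	 * Discard only the second half. If a large folio backs the range,
	 * some of its pages are now unmapped while others remain mapped -
	 * the case where deferred splitting lets the kernel reclaim the
	 * unused subpages under memory pressure.
	 */
	if (madvise(p + len / 2, len / 2, MADV_DONTNEED))
		perror("madvise");

	munmap(p, len);
	return 0;
}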
Reviewed-by: Yu Zhao
Reviewed-by: Yin Fengwei
Reviewed-by: Matthew Wilcox (Oracle)
Reviewed-by: David Hildenbrand
Signed-off-by: Ryan Roberts
---
 mm/rmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 9f795b93cf40..8600bd029acf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1446,11 +1446,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		__lruvec_stat_mod_folio(folio, idx, -nr);

 		/*
-		 * Queue anon THP for deferred split if at least one
+		 * Queue anon large folio for deferred split if at least one
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}

From patchwork Fri Sep 29 11:44:13 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146503
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
Date: Fri, 29 Sep 2023 12:44:13 +0100
Message-Id: <20230929114421.3761121-3-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In preparation for anonymous large folio support, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable large folio to be
passed to it. In this case, all contained pages are accounted using the
order-0 folio (or base page) scheme.
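For orientation before the diff below, the three initialisation paths
can be modelled in a few lines of userspace C. This is only a sketch of
the accounting described above - the printed names mirror the kernel
fields the patch touches, nothing here is kernel API, and PMD_ORDER = 9
assumes 4K base pages with 2M PMDs.

#include <stdio.h>

#define PMD_ORDER 9	/* assumes 4K base pages and 2M PMDs */

/*
 * Userspace model of the three initialisation paths taken by
 * folio_add_new_anon_rmap() after this patch. Purely illustrative; it
 * only prints which counters the kernel would initialise.
 */
static void new_anon_rmap_model(unsigned int order)
{
	unsigned int nr = 1u << order;

	if (order == 0) {
		/* Base page: one folio-level mapcount. */
		printf("order-0:  _mapcount = 0, NR_ANON_MAPPED += 1\n");
	} else if (order < PMD_ORDER) {
		/* PTE-mapped large folio: per-page (order-0 style) accounting. */
		printf("order-%u: %u x page->_mapcount = 0, "
		       "_nr_pages_mapped = %u, NR_ANON_MAPPED += %u\n",
		       order, nr, nr, nr);
	} else {
		/* PMD-mapped THP: entire-folio accounting. */
		printf("order-%u: _entire_mapcount = 0, "
		       "_nr_pages_mapped = COMPOUND_MAPPED, "
		       "NR_ANON_MAPPED += %u, NR_ANON_THPS += %u\n",
		       order, nr, nr);
	}
}

int main(void)
{
	new_anon_rmap_model(0);		/* 4K  */
	new_anon_rmap_model(3);		/* 32K */
	new_anon_rmap_model(PMD_ORDER);	/* 2M  */
	return 0;
}

The middle branch is the new one: each subpage is accounted exactly as
an order-0 page would be, while _nr_pages_mapped additionally records
how many of the folio's pages are mapped.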
Reviewed-by: Yu Zhao
Reviewed-by: Yin Fengwei
Signed-off-by: Ryan Roberts
---
 mm/rmap.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 8600bd029acf..106149690366 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1266,31 +1266,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
  * This means the inc-and-test can be bypassed.
  * The folio does not have to be locked.
  *
- * If the folio is large, it is accounted as a THP. As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP. As the folio
  * is new, it's assumed to be mapped exclusively by a single process.
  */
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address)
 {
-	int nr;
+	int nr = folio_nr_pages(folio);

-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
 	__folio_set_swapbacked(folio);

-	if (likely(!folio_test_pmd_mappable(folio))) {
+	if (likely(!folio_test_large(folio))) {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_mapcount, 0);
-		nr = 1;
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+	} else if (!folio_test_pmd_mappable(folio)) {
+		int i;
+
+		for (i = 0; i < nr; i++) {
+			struct page *page = folio_page(folio, i);
+
+			/* increment count (starts at -1) */
+			atomic_set(&page->_mapcount, 0);
+			__page_set_anon_rmap(folio, page, vma,
+					address + (i << PAGE_SHIFT), 1);
+		}
+
+		atomic_set(&folio->_nr_pages_mapped, nr);
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-		nr = folio_nr_pages(folio);
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
 	}

 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }

 /**

From patchwork Fri Sep 29 11:44:14 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146508
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 3/9] mm: thp: Account pte-mapped anonymous THP usage
Date: Fri, 29 Sep 2023 12:44:14 +0100
Message-Id: <20230929114421.3761121-4-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Add accounting for pte-mapped anonymous transparent hugepages at
various locations. This visibility will aid in debugging and tuning
performance for the "small order" THP extension that will be added in a
subsequent commit, where hugepages can be allocated which are larger
than order-0 but smaller than PMD_ORDER.

This new accounting follows a similar pattern to the existing
NR_ANON_THPS, which measures pmd-mapped anonymous transparent
hugepages. We account pte-mapped anonymous THP mappings per-page, where
the page is mapped at least once via PTE and belongs to a large folio.
So when a page belonging to a large folio is PTE-mapped for the first
time, we add 1 to NR_ANON_THPS_PTEMAPPED; when such a page is
PTE-unmapped for the last time, we subtract 1 from
NR_ANON_THPS_PTEMAPPED.

/proc/meminfo:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios globally (similar to the
  AnonHugePages field).

/proc/vmstat:
  Introduce new "nr_anon_thp_pte" field, which reports the amount of
  memory (in pages) mapped from large folios globally (similar to the
  nr_anon_transparent_hugepages field).

/sys/devices/system/node/nodeX/meminfo:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios per-node (similar to the
  AnonHugePages field).

show_mem (panic logger):
  Introduce new "anon_thp_pte" field, which reports the amount of
  memory (in KiB) mapped from large folios per-node (similar to the
  anon_thp field).

memory.stat (cgroup v1 and v2):
  Introduce new "anon_thp_pte" field, which reports the amount of
  memory (in bytes) mapped from large folios in the memcg (similar to
  the rss_huge (v1) / anon_thp (v2) fields).

/proc/<pid>/smaps & /proc/<pid>/smaps_rollup:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios within the vma/process
  (similar to the AnonHugePages field).

NOTE on charge migration: The new NR_ANON_THPS_PTEMAPPED charge is NOT
moved between cgroups, even when the (v1)
memory.move_charge_at_immigrate feature is enabled. That feature is
marked deprecated, and the current code does not attempt to move the
NR_ANON_MAPPED charge for large PTE-mapped folios anyway (see the
comment in mem_cgroup_move_charge_pte_range()).
If this code were enhanced to allow moving the NR_ANON_MAPPED charge
for large PTE-mapped folios, we would also need to add support for
moving the new NR_ANON_THPS_PTEMAPPED charge. This would likely get
quite fiddly. Given the deprecation of memory.move_charge_at_immigrate,
I assume it is not valuable to implement.

NOTE on naming: Given that the new small-order anonymous THP feature
will be exposed to user space as an extension to THP, I've opted to
name the new counters after THP as well (as opposed to "large"/"large
folio"/etc.), so "huge" no longer strictly means PMD - one could argue
that hugetlb already breaks this rule anyway. I also did not want to
risk breaking backwards compatibility by renaming/redefining the
existing counters (which would have resulted in more consistent and
clearer names). So the existing NR_ANON_THPS counters remain and
continue to refer only to PMD-mapped THPs, and the new counters refer
only to PTE-mapped THPs.

Signed-off-by: Ryan Roberts
---
 Documentation/ABI/testing/procfs-smaps_rollup  |  1 +
 Documentation/admin-guide/cgroup-v1/memory.rst |  5 ++++-
 Documentation/admin-guide/cgroup-v2.rst        |  6 +++++-
 Documentation/admin-guide/mm/transhuge.rst     | 11 +++++++----
 Documentation/filesystems/proc.rst             | 14 ++++++++++++--
 drivers/base/node.c                            |  2 ++
 fs/proc/meminfo.c                              |  2 ++
 fs/proc/task_mmu.c                             |  4 ++++
 include/linux/mmzone.h                         |  1 +
 mm/memcontrol.c                                |  8 ++++++++
 mm/rmap.c                                      | 11 +++++++++--
 mm/show_mem.c                                  |  2 ++
 mm/vmstat.c                                    |  1 +
 13 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup index b446a7154a1b..b50b3eda5a3f 100644 --- a/Documentation/ABI/testing/procfs-smaps_rollup +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -34,6 +34,7 @@ Description: Anonymous: 68 kB LazyFree: 0 kB AnonHugePages: 0 kB + AnonHugePteMap: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 5f502bf68fbc..b7efc7531896 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -535,7 +535,10 @@ memory.stat file includes following statistics: cache # of bytes of page cache memory. rss # of bytes of anonymous and swap cache memory (includes transparent hugepages). - rss_huge # of bytes of anonymous transparent hugepages. + rss_huge # of bytes of anonymous transparent hugepages, mapped by + PMD. + anon_thp_pte # of bytes of anonymous transparent hugepages, mapped by + PTE. mapped_file # of bytes of mapped file (includes tmpfs/shmem) pgpgin # of charging events to the memory cgroup. The charging event happens each time a page is accounted as either mapped
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index b26b5274eaaf..48b961b8fc6d 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1421,7 +1421,11 @@ PAGE_SIZE multiple when read back.
anon_thp Amount of memory used in anonymous mappings backed by - transparent hugepages + transparent hugepages, mapped by PMD + + anon_thp_pte + Amount of memory used in anonymous mappings backed by + transparent hugepages, mapped by PTE file_thp Amount of cached filesystem data backed by transparent diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b0cc8243e093..ebda57850643 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -291,10 +291,13 @@ Monitoring usage ================ The number of anonymous transparent huge pages currently used by the -system is available by reading the AnonHugePages field in ``/proc/meminfo``. -To identify what applications are using anonymous transparent huge pages, -it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields -for each mapping. +system is available by reading the AnonHugePages and AnonHugePteMap +fields in ``/proc/meminfo``. To identify what applications are using +anonymous transparent huge pages, it is necessary to read +``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap +fields for each mapping. Note that in both cases, AnonHugePages refers +only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped +using PTEs. The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 2b59cff8be17..ccbb76a509f0 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -464,6 +464,7 @@ Memory Area, or VMA) there is a series of lines such as the following:: KSM: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB + AnonHugePteMap: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB @@ -511,7 +512,11 @@ pressure if the memory is clean. Please note that the printed value might be lower than the real value due to optimizations used in the current implementation. If this is not desirable please file a bug report. -"AnonHugePages" shows the amount of memory backed by transparent hugepage. +"AnonHugePages" shows the amount of memory backed by transparent hugepage, +mapped by PMD. + +"AnonHugePteMap" shows the amount of memory backed by transparent hugepage, +mapped by PTE. "ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by huge pages. @@ -1006,6 +1011,7 @@ Example output. You may not have all of these fields. EarlyMemtestBad: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 4149248 kB + AnonHugePteMap: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB FileHugePages: 0 kB @@ -1165,7 +1171,11 @@ HardwareCorrupted The amount of RAM/memory in KB, the kernel identifies as corrupted. 
AnonHugePages - Non-file backed huge pages mapped into userspace page tables + Non-file backed huge pages mapped into userspace page tables by + PMD +AnonHugePteMap + Non-file backed huge pages mapped into userspace page tables by + PTE ShmemHugePages Memory used by shared memory (shmem) and tmpfs allocated with huge pages diff --git a/drivers/base/node.c b/drivers/base/node.c index 493d533f8375..08f1759387d2 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -443,6 +443,7 @@ static ssize_t node_read_meminfo(struct device *dev, "Node %d SUnreclaim: %8lu kB\n" #ifdef CONFIG_TRANSPARENT_HUGEPAGE "Node %d AnonHugePages: %8lu kB\n" + "Node %d AnonHugePteMap: %8lu kB\n" "Node %d ShmemHugePages: %8lu kB\n" "Node %d ShmemPmdMapped: %8lu kB\n" "Node %d FileHugePages: %8lu kB\n" @@ -475,6 +476,7 @@ static ssize_t node_read_meminfo(struct device *dev, #ifdef CONFIG_TRANSPARENT_HUGEPAGE , nid, K(node_page_state(pgdat, NR_ANON_THPS)), + nid, K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)), nid, K(node_page_state(pgdat, NR_SHMEM_THPS)), nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)), nid, K(node_page_state(pgdat, NR_FILE_THPS)), diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 45af9a989d40..bac20cc60b6a 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -143,6 +143,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) #ifdef CONFIG_TRANSPARENT_HUGEPAGE show_val_kb(m, "AnonHugePages: ", global_node_page_state(NR_ANON_THPS)); + show_val_kb(m, "AnonHugePteMap: ", + global_node_page_state(NR_ANON_THPS_PTEMAPPED)); show_val_kb(m, "ShmemHugePages: ", global_node_page_state(NR_SHMEM_THPS)); show_val_kb(m, "ShmemPmdMapped: ", diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3dd5be96691b..7b5dad163533 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -392,6 +392,7 @@ struct mem_size_stats { unsigned long anonymous; unsigned long lazyfree; unsigned long anonymous_thp; + unsigned long anonymous_thp_pte; unsigned long shmem_thp; unsigned long file_thp; unsigned long swap; @@ -452,6 +453,8 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page, mss->anonymous += size; if (!PageSwapBacked(page) && !dirty && !PageDirty(page)) mss->lazyfree += size; + if (!compound && PageTransCompound(page)) + mss->anonymous_thp_pte += size; } if (PageKsm(page)) @@ -833,6 +836,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss, SEQ_PUT_DEC(" kB\nKSM: ", mss->ksm); SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); + SEQ_PUT_DEC(" kB\nAnonHugePteMap: ", mss->anonymous_thp_pte); SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4106fbc5b4b3..5032fc31c651 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -186,6 +186,7 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, + NR_ANON_THPS_PTEMAPPED, NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d13dde2f8b56..07d8e0b55b0e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -809,6 +809,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, case NR_ANON_MAPPED: case NR_FILE_MAPPED: case NR_ANON_THPS: + case NR_ANON_THPS_PTEMAPPED: case 
NR_SHMEM_PMDMAPPED: case NR_FILE_PMDMAPPED: WARN_ON_ONCE(!in_task()); @@ -1512,6 +1513,7 @@ static const struct memory_stat memory_stats[] = { #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE { "anon_thp", NR_ANON_THPS }, + { "anon_thp_pte", NR_ANON_THPS_PTEMAPPED }, { "file_thp", NR_FILE_THPS }, { "shmem_thp", NR_SHMEM_THPS }, #endif @@ -4052,6 +4054,7 @@ static const unsigned int memcg1_stats[] = { NR_ANON_MAPPED, #ifdef CONFIG_TRANSPARENT_HUGEPAGE NR_ANON_THPS, + NR_ANON_THPS_PTEMAPPED, #endif NR_SHMEM, NR_FILE_MAPPED, @@ -4067,6 +4070,7 @@ static const char *const memcg1_stat_names[] = { "rss", #ifdef CONFIG_TRANSPARENT_HUGEPAGE "rss_huge", + "anon_thp_pte", #endif "shmem", "mapped_file", @@ -6259,6 +6263,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, * can be done but it would be too convoluted so simply * ignore such a partial THP and keep it in original * memcg. There should be somebody mapping the head. + * This simplification also means that pte-mapped large + * folios are never migrated, which means we don't need + * to worry about migrating the NR_ANON_THPS_PTEMAPPED + * accounting. */ if (PageTransCompound(page)) goto put; diff --git a/mm/rmap.c b/mm/rmap.c index 106149690366..52dabee73023 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1205,7 +1205,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0; bool compound = flags & RMAP_COMPOUND; bool first = true; @@ -1214,6 +1214,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, first = atomic_inc_and_test(&page->_mapcount); nr = first; if (first && folio_test_large(folio)) { + nr_lgmapped = 1; nr = atomic_inc_return_relaxed(mapped); nr = (nr < COMPOUND_MAPPED); } @@ -1241,6 +1242,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); + if (nr_lgmapped) + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr_lgmapped); if (nr) __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); @@ -1295,6 +1298,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, } atomic_set(&folio->_nr_pages_mapped, nr); + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr); } else { /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); @@ -1405,7 +1409,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0; bool last; enum node_stat_item idx; @@ -1423,6 +1427,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, last = atomic_add_negative(-1, &page->_mapcount); nr = last; if (last && folio_test_large(folio)) { + nr_lgmapped = 1; nr = atomic_dec_return_relaxed(mapped); nr = (nr < COMPOUND_MAPPED); } @@ -1454,6 +1459,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, idx = NR_FILE_PMDMAPPED; __lruvec_stat_mod_folio(folio, idx, -nr_pmdmapped); } + if (nr_lgmapped && folio_test_anon(folio)) + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, -nr_lgmapped); if (nr) { idx = folio_test_anon(folio) ? 
NR_ANON_MAPPED : NR_FILE_MAPPED; __lruvec_stat_mod_folio(folio, idx, -nr); diff --git a/mm/show_mem.c b/mm/show_mem.c index 4b888b18bdde..e648a815f0fb 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -254,6 +254,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z " shmem_thp:%lukB" " shmem_pmdmapped:%lukB" " anon_thp:%lukB" + " anon_thp_pte:%lukB" #endif " writeback_tmp:%lukB" " kernel_stack:%lukB" @@ -280,6 +281,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z K(node_page_state(pgdat, NR_SHMEM_THPS)), K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)), K(node_page_state(pgdat, NR_ANON_THPS)), + K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)), #endif K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), node_page_state(pgdat, NR_KERNEL_STACK_KB), diff --git a/mm/vmstat.c b/mm/vmstat.c index 00e81e99c6ee..267de0e4ddca 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1224,6 +1224,7 @@ const char * const vmstat_text[] = { "nr_file_hugepages", "nr_file_pmdmapped", "nr_anon_transparent_hugepages", + "nr_anon_thp_pte", "nr_vmscan_write", "nr_vmscan_immediate_reclaim", "nr_dirtied",

From patchwork Fri Sep 29 11:44:15 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146514
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
Date: Fri, 29 Sep 2023 12:44:15 +0100
Message-Id: <20230929114421.3761121-5-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In preparation for adding support for anonymous large folios that are
smaller than PMD-size, introduce 2 new sysfs files that will be used to
control the new behaviours via the transparent_hugepage interface. For
now, the kernel still only supports PMD-order anonymous THP, so when
reading back anon_orders, it will reflect that. Therefore there are no
behavioural changes intended here.
The bulk of the change is implemented by converting
transhuge_vma_suitable() and hugepage_vma_check() so that they take a
bitfield of orders for which the user wants to determine support, and
the functions filter out all the orders that can't be supported. If
only 1 order is set in the input then the output can continue to be
treated like a boolean; this is the case for most call sites.

The remainder of this description is copied from
Documentation/admin-guide/mm/transhuge.rst, as modified by this commit.
See that file for further details.

By default, allocation of anonymous THPs that are smaller than PMD-size
is disabled. These smaller allocation orders can be enabled by writing
an encoded set of orders as follows::

  echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders

Here an order refers to the number of pages in the large folio as
2^order, and each order is encoded in the written value such that each
set bit represents an enabled order; so setting bit-2 indicates that
order-2 folios are in use, and order-2 means 2^2=4 pages (=16K if the
page size is 4K). The example above enables order-9 (PMD-order) and
order-3.

By enabling multiple orders, allocation of each order will be
attempted, highest to lowest, until a successful allocation is made. If
the PMD-order is unset, then no PMD-sized THPs will be allocated.

The kernel will ignore any orders that it does not support, so read the
file back to determine which orders are enabled::

  cat /sys/kernel/mm/transparent_hugepage/anon_orders

For some workloads it may be desirable to limit some THP orders to be
used only for MADV_HUGEPAGE regions, while allowing others to be used
always. For example, a workload may only benefit from PMD-sized THP in
specific areas, but can benefit from 32K THP more generally. In this
case, THP can be enabled in ``madvise`` mode as normal, but specific
orders can be configured to be allocated as if in ``always`` mode. The
example below enables orders 9 and 3, with order-9 only applied to
MADV_HUGEPAGE regions, and order-3 applied always::

  echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
  echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
  echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask

Signed-off-by: Ryan Roberts
---
 Documentation/admin-guide/mm/transhuge.rst |  74 ++++++++--
 Documentation/filesystems/proc.rst         |   6 +-
 fs/proc/task_mmu.c                         |   3 +-
 include/linux/huge_mm.h                    |  93 +++++++++---
 mm/huge_memory.c                           | 164 ++++++++++++++++++---
 mm/khugepaged.c                            |  18 ++-
 mm/memory.c                                |   6 +-
 mm/page_vma_mapped.c                       |   3 +-
 8 files changed, 296 insertions(+), 71 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index ebda57850643..9f954e73a4ca 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -45,10 +45,22 @@ components: the two is using hugepages just because of the fact the TLB miss is going to run faster. +Furthermore, it is possible to configure THP to allocate large folios +to back anonymous memory, which are smaller than PMD-size (for example +16K, 32K, 64K, etc). These THPs continue to be PTE-mapped, but in many +cases can still provide the similar benefits to those outlined above: +Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, +etc), but latency spikes are much less prominent because the size of +each page isn't as huge as the PMD-sized variant and there is less +memory to clear in each page fault.
Some architectures also employ TLB +compression mechanisms to squeeze more entries in when a set of PTEs +are virtually and physically contiguous and approporiately aligned. In +this case, TLB misses will occur less often. + THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into huge pages. +collapses sequences of basic pages into PMD-sized huge pages. The THP behaviour is controlled via :ref:`sysfs ` interface and using madvise(2) and prctl(2) system calls. @@ -146,25 +158,69 @@ madvise never should be self-explanatory. -By default kernel tries to use huge zero page on read page fault to -anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1:: +By default kernel tries to use huge, PMD-mapped zero page on read page +fault to anonymous mapping. It's possible to disable huge zero page by +writing 0 or enable it back by writing 1:: echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage:: +library) may want to know the size (in bytes) of a PMD-mappable +transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size +By default, allocation of anonymous THPs that are smaller than +PMD-size is disabled. These smaller allocation orders can be enabled +by writing an encoded set of orders as follows:: + + echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders + +Where an order refers to the number of pages in the large folio as +2^order, and where each order is encoded in the written value such +that each set bit represents an enabled order; So setting bit-2 +indicates that order-2 folios are in use, and order-2 means 2^2=4 +pages (=16K if the page size is 4K). The example above enables order-9 +(PMD-order) and order-3. + +By enabling multiple orders, allocation of each order will be +attempted, highest to lowest, until a successful allocation is made. +If the PMD-order is unset, then no PMD-sized THPs will be allocated. + +The kernel will ignore any orders that it does not support so read the +file back to determine which orders are enabled:: + + cat /sys/kernel/mm/transparent_hugepage/anon_orders + +For some workloads it may be desirable to limit some THP orders to be +used only for MADV_HUGEPAGE regions, while allowing others to be used +always. For example, a workload may only benefit from PMD-sized THP in +specific areas, but can take benefit of 32K sized THP more generally. +In this case, THP can be enabled in ``madvise`` mode as normal, but +specific orders can be configured to be allocated as if in ``always`` +mode. The below example enables orders 9 and 3, with order-9 only +applied to MADV_HUGEPAGE regions, and order-3 applied always:: + + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders + echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask + khugepaged will be automatically started when -transparent_hugepage/enabled is set to "always" or "madvise, and it'll -be automatically shutdown if it's set to "never". 
+transparent_hugepage/enabled is set to "always" or "madvise", +providing the PMD-order is enabled in +transparent_hugepage/anon_orders, and it'll be automatically shutdown +if it's set to "never" or the PMD-order is disabled in +transparent_hugepage/anon_orders. Khugepaged controls ------------------- +.. note:: + khugepaged currently only searches for opportunities to collapse to + PMD-sized THP and no attempt is made to collapse to smaller order + THP. + khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's @@ -285,7 +341,7 @@ Need of application restart The transparent_hugepage/enabled values and tmpfs mount option only affect future behavior. So to make them effective you need to restart any application that could have been using hugepages. This also applies to the -regions registered in khugepaged. +regions registered in khugepaged, and transparent_hugepage/anon_orders. Monitoring usage ================ @@ -416,7 +472,7 @@ for huge pages. Optimizing the applications =========================== -To be guaranteed that the kernel will map a 2M page immediately in any +To be guaranteed that the kernel will map a thp immediately in any memory region, the mmap region has to be hugepage naturally aligned. posix_memalign() can provide that guarantee. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index ccbb76a509f0..72526f8bb658 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -533,9 +533,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. -"THPeligible" indicates whether the mapping is eligible for allocating THP -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise. -It just shows the current status. +"THPeligible" indicates whether the mapping is eligible for allocating +naturally aligned THP pages of any currently enabled order. 1 if true, 0 +otherwise. It just shows the current status. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 7b5dad163533..f978dce7f7ce 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -869,7 +869,8 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); seq_printf(m, "THPeligible: %8u\n", - hugepage_vma_check(vma, vma->vm_flags, true, false, true)); + !!hugepage_vma_check(vma, vma->vm_flags, true, false, true, + THP_ORDERS_ALL)); if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index fa0350b0812a..2e7c338229a6 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -67,6 +67,21 @@ extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) #define HPAGE_PMD_NR (1<vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR)) - return false; +static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma, + unsigned long addr, unsigned int orders) +{ + int order; + + /* + * Iterate over orders, highest to lowest, removing orders that don't + * meet alignment requirements from the set. 
Exit loop at first order + * that meets requirements, since all lower orders must also meet + * requirements. + */ + + order = first_order(orders); + + while (orders) { + unsigned long hpage_size = PAGE_SIZE << order; + unsigned long haddr = ALIGN_DOWN(addr, hpage_size); + + if (haddr >= vma->vm_start && + haddr + hpage_size <= vma->vm_end) { + if (!vma_is_anonymous(vma)) { + if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - + vma->vm_pgoff, + hpage_size >> PAGE_SHIFT)) + break; + } else + break; + } + + order = next_order(&orders, order); } - haddr = addr & HPAGE_PMD_MASK; - - if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) - return false; - return true; + return orders; } static inline bool file_thp_enabled(struct vm_area_struct *vma) @@ -130,8 +173,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); } -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, - bool smaps, bool in_pf, bool enforce_sysfs); +unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, bool in_pf, + bool enforce_sysfs, unsigned int orders); #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ @@ -267,17 +311,18 @@ static inline bool folio_test_pmd_mappable(struct folio *folio) return false; } -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, - unsigned long addr) +static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma, + unsigned long addr, unsigned int orders) { - return false; + return 0; } -static inline bool hugepage_vma_check(struct vm_area_struct *vma, - unsigned long vm_flags, bool smaps, - bool in_pf, bool enforce_sysfs) +static inline unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, + bool in_pf, bool enforce_sysfs, + unsigned int orders) { - return false; + return 0; } static inline void folio_prep_large_rmappable(struct folio *folio) {} diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 064fbd90822b..bcecce769017 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -70,12 +70,48 @@ static struct shrinker deferred_split_shrinker; static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; unsigned long huge_zero_pfn __read_mostly = ~0UL; +unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER); +static unsigned int huge_anon_always_mask __read_mostly; -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, - bool smaps, bool in_pf, bool enforce_sysfs) +/** + * hugepage_vma_check - determine which hugepage orders can be applied to vma + * @vma: the vm area to check + * @vm_flags: use these vm_flags instead of vma->vm_flags + * @smaps: whether answer will be used for smaps file + * @in_pf: whether answer will be used by page fault handler + * @enforce_sysfs: whether sysfs config should be taken into account + * @orders: bitfield of all orders to consider + * + * Calculates the intersection of the requested hugepage orders and the allowed + * hugepage orders for the provided vma. Permitted orders are encoded as a set + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3 + * corresponds to order-3, etc). Order-0 is never considered a hugepage order. + * + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage + * orders are allowed. 
+ */ +unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, bool in_pf, + bool enforce_sysfs, unsigned int orders) { + /* + * Fix up the orders mask; Supported orders for file vmas are static. + * Supported orders for anon vmas are configured dynamically - but only + * use the dynamic set if enforce_sysfs=true, otherwise use the full + * set. + */ + if (vma_is_anonymous(vma)) + orders &= enforce_sysfs ? READ_ONCE(huge_anon_orders) + : THP_ORDERS_ALL_ANON; + else + orders &= THP_ORDERS_ALL_FILE; + + /* No orders in the intersection. */ + if (!orders) + return 0; + if (!vma->vm_mm) /* vdso */ - return false; + return 0; /* * Explicitly disabled through madvise or prctl, or some @@ -84,16 +120,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * */ if ((vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) - return false; + return 0; /* * If the hardware/firmware marked hugepage support disabled. */ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED)) - return false; + return 0; /* khugepaged doesn't collapse DAX vma, but page fault is fine. */ if (vma_is_dax(vma)) - return in_pf; + return in_pf ? orders : 0; /* * Special VMA and hugetlb VMA. @@ -101,17 +137,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * VM_MIXEDMAP set. */ if (vm_flags & VM_NO_KHUGEPAGED) - return false; + return 0; /* - * Check alignment for file vma and size for both file and anon vma. + * Check alignment for file vma and size for both file and anon vma by + * filtering out the unsuitable orders. * * Skip the check for page fault. Huge fault does the check in fault - * handlers. And this check is not suitable for huge PUD fault. + * handlers. */ - if (!in_pf && - !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE))) - return false; + if (!in_pf) { + int order = first_order(orders); + unsigned long addr; + + while (orders) { + addr = vma->vm_end - (PAGE_SIZE << order); + if (transhuge_vma_suitable(vma, addr, BIT(order))) + break; + order = next_order(&orders, order); + } + + if (!orders) + return 0; + } /* * Enabled via shmem mount options or sysfs settings. @@ -120,23 +168,35 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, */ if (!in_pf && shmem_file(vma->vm_file)) return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, - !enforce_sysfs, vma->vm_mm, vm_flags); + !enforce_sysfs, vma->vm_mm, vm_flags) + ? orders : 0; /* Enforce sysfs THP requirements as necessary */ - if (enforce_sysfs && - (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) && - !hugepage_flags_always()))) - return false; + if (enforce_sysfs) { + /* enabled=never. */ + if (!hugepage_flags_enabled()) + return 0; + + /* enabled=madvise without VM_HUGEPAGE. */ + if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always()) { + if (vma_is_anonymous(vma)) { + orders &= READ_ONCE(huge_anon_always_mask); + if (!orders) + return 0; + } else + return 0; + } + } /* Only regular file is valid */ if (!in_pf && file_thp_enabled(vma)) - return true; + return orders; if (!vma_is_anonymous(vma)) - return false; + return 0; if (vma_is_temporary_stack(vma)) - return false; + return 0; /* * THPeligible bit of smaps should show 1 for proper VMAs even @@ -146,9 +206,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * the first page fault. */ if (!vma->anon_vma) - return (smaps || in_pf); + return (smaps || in_pf) ? 
orders : 0; - return true; + return orders; } static bool get_huge_zero_page(void) @@ -391,11 +451,69 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size); +static ssize_t anon_orders_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_orders)); +} + +static ssize_t anon_orders_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + int ret = count; + unsigned int orders; + + err = kstrtouint(buf, 0, &orders); + if (err) + ret = -EINVAL; + + if (ret > 0) { + orders &= THP_ORDERS_ALL_ANON; + WRITE_ONCE(huge_anon_orders, orders); + + err = start_stop_khugepaged(); + if (err) + ret = err; + } + + return ret; +} + +static struct kobj_attribute anon_orders_attr = __ATTR_RW(anon_orders); + +static ssize_t anon_always_mask_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_always_mask)); +} + +static ssize_t anon_always_mask_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned int always_mask; + + err = kstrtouint(buf, 0, &always_mask); + if (err) + return -EINVAL; + + WRITE_ONCE(huge_anon_always_mask, always_mask); + + return count; +} + +static struct kobj_attribute anon_always_mask_attr = __ATTR_RW(anon_always_mask); + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, &use_zero_page_attr.attr, &hpage_pmd_size_attr.attr, + &anon_orders_attr.attr, + &anon_always_mask_attr.attr, #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif @@ -778,7 +896,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) struct folio *folio; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - if (!transhuge_vma_suitable(vma, haddr)) + if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER))) return VM_FAULT_FALLBACK; if (unlikely(anon_vma_prepare(vma))) return VM_FAULT_OOM; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 88433cc25d8a..2b5c0321d96b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -446,7 +446,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && hugepage_flags_enabled()) { - if (hugepage_vma_check(vma, vm_flags, false, false, true)) + if (hugepage_vma_check(vma, vm_flags, false, false, true, + BIT(PMD_ORDER))) __khugepaged_enter(vma->vm_mm); } } @@ -921,10 +922,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, if (!vma) return SCAN_VMA_NULL; - if (!transhuge_vma_suitable(vma, address)) + if (!transhuge_vma_suitable(vma, address, BIT(PMD_ORDER))) return SCAN_ADDRESS_RANGE; if (!hugepage_vma_check(vma, vma->vm_flags, false, false, - cc->is_khugepaged)) + cc->is_khugepaged, BIT(PMD_ORDER))) return SCAN_VMA_CHECK; /* * Anon VMA expected, the address may be unmapped then @@ -1499,7 +1500,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, * and map it by a PMD, regardless of sysfs THP settings. As such, let's * analogously elide sysfs THP settings here. 
*/ - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false, + BIT(PMD_ORDER))) return SCAN_VMA_CHECK; /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ @@ -2369,7 +2371,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, progress++; break; } - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) { + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true, + BIT(PMD_ORDER))) { skip: progress++; continue; @@ -2626,7 +2629,7 @@ int start_stop_khugepaged(void) int err = 0; mutex_lock(&khugepaged_mutex); - if (hugepage_flags_enabled()) { + if (hugepage_flags_enabled() && (huge_anon_orders & BIT(PMD_ORDER))) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL, "khugepaged"); @@ -2706,7 +2709,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, *prev = vma; - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false, + BIT(PMD_ORDER))) return -EINVAL; cc = kmalloc(sizeof(*cc), GFP_KERNEL); diff --git a/mm/memory.c b/mm/memory.c index e4b0f6a461d8..b5b82fc8e164 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4256,7 +4256,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) pmd_t entry; vm_fault_t ret = VM_FAULT_FALLBACK; - if (!transhuge_vma_suitable(vma, haddr)) + if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER))) return ret; page = compound_head(page); @@ -5055,7 +5055,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, return VM_FAULT_OOM; retry_pud: if (pud_none(*vmf.pud) && - hugepage_vma_check(vma, vm_flags, false, true, true)) { + hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PUD_ORDER))) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -5089,7 +5089,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, goto retry_pud; if (pmd_none(*vmf.pmd) && - hugepage_vma_check(vma, vm_flags, false, true, true)) { + hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PMD_ORDER))) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index e0b368e545ed..5f7e89c5b595 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) * cleared *pmd but not decremented compound_mapcount(). 
*/ if ((pvmw->flags & PVMW_SYNC) && - transhuge_vma_suitable(vma, pvmw->address) && + transhuge_vma_suitable(vma, pvmw->address, + BIT(PMD_ORDER)) && (pvmw->nr_pages >= HPAGE_PMD_NR)) { spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); From patchwork Fri Sep 29 11:44:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146505 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3963064vqu; Fri, 29 Sep 2023 04:48:45 -0700 (PDT) X-Google-Smtp-Source: AGHT+IESeQwgdgwiQdnkmHiKM2bKdzsup4rcC8++ESbnrisyfX8TiEhWrndvuF6COFyXs+/79lfQ X-Received: by 2002:a17:90a:d810:b0:273:f887:be17 with SMTP id a16-20020a17090ad81000b00273f887be17mr3826657pjv.47.1695988124763; Fri, 29 Sep 2023 04:48:44 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988124; cv=none; d=google.com; s=arc-20160816; b=emQo7jkpfMC3MT5JVIXNagscrQIvbX4xlOjyWUTmqnhN4gzws0L2ErN884lcd/m99c Pn80sKMG0H6XrSVVEEEp/U3iHO2wjjo4OSvUrC10wwpGNIlWruZOujxHm1/QLqKPbHxR NADVjJo1eKHhMcp6in+c1+Yo4W8Sy3oZl7SH+jB5YPPbfmr4XIN5G9Ciro60AKP759Co xtV9rWiEvz+gQxUyCcMFr4Jz+MIrTOb6sU5i4t6JRhM90BZA5kYHdDyXRUCj2eK6wCuT 0cN3jlYIpj/48lt2gzTbTXLtBmFHr9j1sA5ApxuvIgD3+oj7wjxOxhnSndQEoUaM8seQ lGZg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=2gExMYv36b1IpQ9pVpTbz0yc/Ix+8JpSzZniXtjIQmc=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=Z4A2qRRfFFVc7xWVI3XwrqFYTdRa7MB7cDbpibJQPC+uOLt0mt30Ld1257P1bv5+LJ Xg/BnuyURM10FxKq4SKj+DOJjbykD0CLOUuLRymZALyXFfwyaGlzjk4w7AahogaJZl6Q KududaV4tpcashjtAK8xzxJjaagrQtuRE1CKwP7+W3NMa0dqiJc85RUdJc85o0udRyIh pnilM7KlnRUGtBldW1+MZItP+kp3DZyP32+/BUG4A25gDmXK0VXVRxMQjPY8rbBR872p kYGyF0Kge0M2MHE/7bwzf6OjtOXhl86xKvNCbJVTAWY8MOGfa5URpkhyBVL/wvkHk2LI rYOA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from howler.vger.email (howler.vger.email. 
From: Ryan Roberts
Subject: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
Date: Fri, 29 Sep 2023 12:44:16 +0100
Message-Id: <20230929114421.3761121-6-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Introduce the logic to allow THP to be configured (through the new anon_orders interface we just added) to allocate large folios to back anonymous memory, which are smaller than PMD-size (for example order-2, order-3, order-4, etc). These THPs continue to be PTE-mapped, but in many cases can still provide similar benefits to traditional PMD-sized THP: Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, etc.
depending on the configured order), but latency spikes are much less prominent because the size of each page isn't as huge as the PMD-sized variant and there is less memory to clear in each page fault. The number of per-page operations (e.g. ref counting, rmap management, lru list management) are also significantly reduced since those ops now become per-folio. Some architectures also employ TLB compression mechanisms to squeeze more entries in when a set of PTEs are virtually and physically contiguous and approporiately aligned. In this case, TLB misses will occur less often. The new behaviour is disabled by default because the anon_orders defaults to only enabling PMD-order, but can be enabled at runtime by writing to anon_orders (see documentation in previous commit). The long term aim is to default anon_orders to include suitable lower orders, but there are some risks around internal fragmentation that need to be better understood first. Signed-off-by: Ryan Roberts --- Documentation/admin-guide/mm/transhuge.rst | 9 +- include/linux/huge_mm.h | 6 +- mm/memory.c | 108 +++++++++++++++++++-- 3 files changed, 111 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 9f954e73a4ca..732c3b2f4ba8 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap fields for each mapping. Note that in both cases, AnonHugePages refers only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped -using PTEs. +using PTEs. This includes all THPs whose order is smaller than +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped +for other reasons. The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. @@ -367,6 +369,11 @@ frequently will incur overhead. There are a number of counters in ``/proc/vmstat`` that may be used to monitor how successfully the system is providing huge pages for use. +.. note:: + Currently the below counters only record events relating to + PMD-order THPs. Events relating to smaller order THPs are not + included. + thp_fault_alloc is incremented every time a huge page is successfully allocated to handle a page fault. diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2e7c338229a6..c4860476a1f5 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_NR (1<pte + i))) + return true; + } + + return false; +} + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static struct folio *alloc_anon_folio(struct vm_fault *vmf) +{ + gfp_t gfp; + pte_t *pte; + unsigned long addr; + struct folio *folio; + struct vm_area_struct *vma = vmf->vma; + unsigned int orders; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (userfaultfd_armed(vma)) + goto fallback; + + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * for this vma. Then filter out the orders that can't be allocated over + * the faulting address and still be fully contained in the vma. 
+ */ + orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true, + BIT(PMD_ORDER) - 1); + orders = transhuge_vma_suitable(vma, vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); + if (!pte) + return ERR_PTR(-EAGAIN); + + order = first_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + vmf->pte = pte + pte_index(addr); + if (!vmf_pte_range_changed(vmf, 1 << order)) + break; + order = next_order(&orders, order); + } + + vmf->pte = NULL; + pte_unmap(pte); + + gfp = vma_thp_gfp_mask(vma); + + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) { + clear_huge_page(&folio->page, addr, 1 << order); + return folio; + } + order = next_order(&orders, order); + } + +fallback: + return vma_alloc_zeroed_movable_folio(vma, vmf->address); +} +#else +#define alloc_anon_folio(vmf) \ + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address) +#endif + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + int i; + int nr_pages = 1; + unsigned long addr = vmf->address; bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Allocate our own private page. */ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vmf); + if (IS_ERR(folio)) + return 0; if (!folio) goto oom; + nr_pages = folio_nr_pages(folio); + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry), vma); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) goto release; - if (vmf_pte_changed(vmf)) { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (vmf_pte_range_changed(vmf, nr_pages)) { + for (i = 0; i < nr_pages; i++) + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i); goto release; } @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); - folio_add_new_anon_rmap(folio, vma, vmf->address); + folio_ref_add(folio, nr_pages - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); + folio_add_new_anon_rmap(folio, vma, addr); folio_add_lru_vma(folio, vma); setpte: if (uffd_wp) entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages); /* No need to invalidate - it was non-present before */ - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages); unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); From patchwork Fri Sep 29 11:44:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan 
Roberts
X-Patchwork-Id: 146506
From: Ryan Roberts
Subject: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
Date: Fri, 29 Sep 2023 12:44:17 +0100
Message-Id: <20230929114421.3761121-7-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In addition to passing a bitfield of folio orders to enable for THP, allow the string "recommend" to be written, which has the effect of causing the system to enable the orders preferred by the architecture and by the mm. The user can see what these orders are by subsequently reading back the file.
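As a minimal user-space sketch of that flow (not part of this patch; it assumes only the sysfs path documented below and the "0x%08x" read format from the earlier anon_orders_show() hunk, and it needs enough privilege to write the file):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/mm/transparent_hugepage/anon_orders";
	char buf[32] = { 0 };
	int fd;

	/* Ask the kernel to enable its recommended set of anon THP orders. */
	fd = open(path, O_WRONLY);
	if (fd < 0 || write(fd, "recommend", strlen("recommend")) < 0)
		return 1;
	close(fd);

	/* Read back the bitfield of orders that were actually enabled. */
	fd = open(path, O_RDONLY);
	if (fd < 0 || read(fd, buf, sizeof(buf) - 1) < 0)
		return 1;
	close(fd);

	printf("anon THP orders enabled: %s", buf); /* e.g. "0x00000208" with 4K pages */
	return 0;
}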
Note that these recommended orders are expected to be static for a given boot of the system, and so the keyword "auto" was deliberately not used, as I want to reserve it for a possible future use where the "best" order is chosen more dynamically at runtime.

Recommended orders are determined as follows:
- PMD_ORDER: The traditional THP size
- arch_wants_pte_order() if implemented by the arch
- PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list

arch_wants_pte_order() can be overridden by the architecture if desired. Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous set of ptes map physically contiguous, naturally aligned memory, so this mechanism allows the architecture to optimize as required. Here we add the default implementation of arch_wants_pte_order(), used when the architecture does not define it, which returns -1, implying that the HW has no preference. Signed-off-by: Ryan Roberts --- Documentation/admin-guide/mm/transhuge.rst | 4 ++++ include/linux/pgtable.h | 13 +++++++++++++ mm/huge_memory.c | 14 +++++++++++--- 3 files changed, 28 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 732c3b2f4ba8..d6363d4efa3a 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9 By enabling multiple orders, allocation of each order will be attempted, highest to lowest, until a successful allocation is made. If the PMD-order is unset, then no PMD-sized THPs will be allocated. +It is also possible to enable the recommended set of orders, which +will be optimized for the architecture and mm:: + + echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders The kernel will ignore any orders that it does not support so read the file back to determine which orders are enabled:: diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index af7639c3b0a3..0e110ce57cc3 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma, } #endif +#ifndef arch_wants_pte_order +/* + * Returns preferred folio order for pte-mapped memory. Must be in range [0, + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at + * least order-2. Negative value implies that the HW has no preference and mm + * will choose its own default order.
+ */ +static inline int arch_wants_pte_order(void) +{ + return -1; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bcecce769017..e2e2d3906a21 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj, int err; int ret = count; unsigned int orders; + int arch; - err = kstrtouint(buf, 0, &orders); - if (err) - ret = -EINVAL; + if (sysfs_streq(buf, "recommend")) { + arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); + orders = BIT(arch); + orders |= BIT(PAGE_ALLOC_COSTLY_ORDER); + orders |= BIT(PMD_ORDER); + } else { + err = kstrtouint(buf, 0, &orders); + if (err) + ret = -EINVAL; + } if (ret > 0) { orders &= THP_ORDERS_ALL_ANON; From patchwork Fri Sep 29 11:44:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146504 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3962520vqu; Fri, 29 Sep 2023 04:47:45 -0700 (PDT) X-Google-Smtp-Source: AGHT+IH8AfW7YNmLexx20T/qZl1SIXSUKvCAWdeniODKmO4QnqsGZUFSaGWgVQ64EOfxcCVqPMJy X-Received: by 2002:a05:6a21:193:b0:15e:22a4:b897 with SMTP id le19-20020a056a21019300b0015e22a4b897mr7485433pzb.10.1695988065399; Fri, 29 Sep 2023 04:47:45 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988065; cv=none; d=google.com; s=arc-20160816; b=CXZNz2q/1zkhZ5IutiM5vbXdmhvQsXBWdYwokvm6RRQWhZVaLkJ7VA13zr045+90se YdY0kwdhc9MFmT5OoeSQATH6j4ujWh7TzJeT4u8epUL0p9ixA3hMFob1nmMUQZs84lei XlSt+xfBCP1xZWu8/G4Ps+0W2CaR6vkX8zy+J6qbbabeAP/umOorteTJFRTunxpoDjCv fC2VCr2JgSHQWCSo1AzBlkeG6VyAw9qCQFEvLuDtoG+MUMt09bAo+MhWUOeifVQE66jW Hk6b5lgGqDs1UQ4JFyc1vMYd3RUp2xtUTmGjOzfooOotGANgfVPNdEq1uKxkGnnnFg9g 8HOA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=R12DE0IaGz6VExNqcRsTIxpYiYIG/gDiSa9v3a/m4ts=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=GPZXGG3L75b8vkTF8VhOTRJfakAk6JYBv4NXYX33mSmNnWZebOFuBA5U+JN1YmZRKM H5R2YOYjVgvaF5abtHrWqHHn2XjBQzzn5RmUsuGASkwgAjV/fZe3X/WptnDl3iejc75a gIsj98mYNj9E08sOA1m05j90f7hTfRXW/wIxH+A/0H3E215Zym9ruBoE+/hlL6MiSETH 5StcT9l7+WQsBg8XT37dGcvu8cr9559vk6yFgMXzmR+dzilpgEVruNaIh6fnVx/RqEdy XugnIPOwzKZkx+kLgtXeuX7jLONd17n4N6zzsmjpwLaUYx32kNhCdNyrkCQ9yrTzRNNE n5jw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from pete.vger.email (pete.vger.email. 
From: Ryan Roberts
Subject: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()
Date: Fri, 29 Sep 2023 12:44:18 +0100
Message-Id: <20230929114421.3761121-8-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Define an arch-specific override of arch_wants_pte_order() so that when anon_orders=recommend is set, large folios will be allocated for anonymous memory with an order that is compatible with arm64's HPA uarch feature.
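As a rough worked example (assuming a 4K base page size, where PMD_ORDER is 9 and PAGE_ALLOC_COSTLY_ORDER is 3): with this override returning 2, writing "recommend" evaluates to BIT(max(2, 3)) | BIT(3) | BIT(9) = 0x208, so reading anon_orders back would show order-3 (32K) and order-9 (2M) enabled for anonymous THP.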
Reviewed-by: Yu Zhao Signed-off-by: Ryan Roberts Acked-by: Catalin Marinas --- arch/arm64/include/asm/pgtable.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7f7d9b1df4e5..e3d2449dec5c 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma, extern void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t new_pte); + +#define arch_wants_pte_order arch_wants_pte_order +static inline int arch_wants_pte_order(void) +{ + /* + * Many arm64 CPUs support hardware page aggregation (HPA), which can + * coalesce 4 contiguous pages into a single TLB entry. + */ + return 2; +} #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ From patchwork Fri Sep 29 11:44:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146509 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3966633vqu; Fri, 29 Sep 2023 04:55:52 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEzxNXTjbTDeLw9BWSVOgaAvOOOC/WqDbIbcwkCMUJKM+xonlPFTmb+3Hq7mKKRDn3FUFVp X-Received: by 2002:a17:90a:f2d5:b0:26b:280b:d24c with SMTP id gt21-20020a17090af2d500b0026b280bd24cmr3459556pjb.42.1695988552406; Fri, 29 Sep 2023 04:55:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988552; cv=none; d=google.com; s=arc-20160816; b=cxUUgp8ACyzyDNoW6Rv6ntYKaOPnPRQtd1Uh7P3m39Yv2qRBk/nmKOXl+TL1Yx1bPt 0/JMSte/dcWa4KiFIWSpGMHf0P2PIQPdlhdWtdhwpQzeFQsZlCJFezSi5VDGiDzeQO5B C5RQvUYv9IRQB/6iDU13aOY0U0IJxFkaduhWbYNJSHF+Dur4FipfnjGILhKa2Vg4uuSW BkYyJpJpEqUXkmILG97tckbBUYelUuCebCjOA2gCJu8TxG8XOipUQCLzOzg3ySd7JDxI EoP3hJLQwhq6faF4+An6j8hw1+gLs3c4ESJgqLF4XF73WTA3b8yEN0QR/KWixqVgrs+W 8LqQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=quTNzgeRKuuO6Qb2fbhO+s0uzcvHJvNtQAmGH7Z0kG8=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=jnKMFLS8ELfvw3qShHFPkoJ8uR8Nx+prHvoVKv1CK4Frv4Ornehow5XWkulKNLKKKE 4I7AgezwNjO+2xoHU4SAnJkYpcBB1mxV4g0DQy6oFkRWUuSAkHSfEb2/KB/5TZU1NRXs R9Ygn/kzVzM7tHGP3DjMeXhkxf4v1d3wOVD9TbHlfiFfNYF237618/dO+BvNa4kXNKTl eDYczSdt0VJIAxR79u7dYYXUe9uJM/SHx2nShqhkIJuSgwajUhwQ9JkSzZrfQjlOoM8s km2wD4DwZMALrT9Cvk/lk24rTpVqhvG94Nw4o5cSO0NL4GpHNQzlU8cz1oFVDQ8z9Fiw 5RHg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from snail.vger.email (snail.vger.email. 
From: Ryan Roberts
Subject: [PATCH v6 8/9] selftests/mm/cow: Generalize do_run_with_thp() helper
Date: Fri, 29 Sep 2023 12:44:19 +0100
Message-Id: <20230929114421.3761121-9-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

do_run_with_thp() prepares (PMD-sized) THP memory into different states before running tests. With the introduction of THP orders that are smaller than PMD_ORDER, we would like to reuse this logic to also test those smaller orders. So let's add a size parameter which tells the function what size THP it should operate on. No functional change intended here, but a separate commit will add new tests for smaller order THP, where available.
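As an illustration of the new parameter (a hypothetical wrapper, not part of this patch; do_run_with_thp(), test_fn and the THP_RUN_* states are the existing cow.c helpers shown in the hunks below, and pmdsize/ptesize are the globals used by this and the following patch):

/*
 * Hypothetical illustration only: drive the same test body at the
 * traditional PMD size and, if a smaller THP size has been configured,
 * at that smaller (necessarily PTE-mapped) size too.
 */
static void run_at_both_sizes(test_fn fn, const char *desc)
{
	ksft_print_msg("[RUN] %s\n", desc);
	do_run_with_thp(fn, THP_RUN_PMD, pmdsize);
	if (ptesize)
		do_run_with_thp(fn, THP_RUN_PTE, ptesize);
}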
Signed-off-by: Ryan Roberts --- tools/testing/selftests/mm/cow.c | 151 +++++++++++++++++-------------- 1 file changed, 84 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 7324ce5363c0..d887ce454e34 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -32,7 +32,7 @@ static size_t pagesize; static int pagemap_fd; -static size_t thpsize; +static size_t pmdsize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; @@ -734,14 +734,14 @@ enum thp_run { THP_RUN_PARTIAL_SHARED, }; -static void do_run_with_thp(test_fn fn, enum thp_run thp_run) +static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t size) { char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED; - size_t size, mmap_size, mremap_size; + size_t mmap_size, mremap_size; int ret; - /* For alignment purposes, we need twice the thp size. */ - mmap_size = 2 * thpsize; + /* For alignment purposes, we need twice the requested size. */ + mmap_size = 2 * size; mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mmap_mem == MAP_FAILED) { @@ -749,36 +749,40 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) return; } - /* We need a THP-aligned memory area. */ - mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); + /* We need to naturally align the memory area. */ + mem = (char *)(((uintptr_t)mmap_mem + size) & ~(size - 1)); - ret = madvise(mem, thpsize, MADV_HUGEPAGE); + ret = madvise(mem, size, MADV_HUGEPAGE); if (ret) { ksft_test_result_fail("MADV_HUGEPAGE failed\n"); goto munmap; } /* - * Try to populate a THP. Touch the first sub-page and test if we get - * another sub-page populated automatically. + * Try to populate a THP. Touch the first sub-page and test if + * we get the last sub-page populated automatically. */ mem[0] = 0; - if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) { + if (!pagemap_is_populated(pagemap_fd, mem + size - pagesize)) { ksft_test_result_skip("Did not get a THP populated\n"); goto munmap; } - memset(mem, 0, thpsize); + memset(mem, 0, size); - size = thpsize; switch (thp_run) { case THP_RUN_PMD: case THP_RUN_PMD_SWAPOUT: + if (size != pmdsize) { + ksft_test_result_fail("test bug: can't PMD-map size\n"); + goto munmap; + } break; case THP_RUN_PTE: case THP_RUN_PTE_SWAPOUT: /* * Trigger PTE-mapping the THP by temporarily mapping a single - * subpage R/O. + * subpage R/O. This is a noop if the THP is not pmdsize (and + * therefore already PTE-mapped). */ ret = mprotect(mem + pagesize, pagesize, PROT_READ); if (ret) { @@ -797,7 +801,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * Discard all but a single subpage of that PTE-mapped THP. What * remains is a single PTE mapping a single subpage. */ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTNEED); + ret = madvise(mem + pagesize, size - pagesize, MADV_DONTNEED); if (ret) { ksft_test_result_fail("MADV_DONTNEED failed\n"); goto munmap; @@ -809,7 +813,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * Remap half of the THP. We need some new memory location * for that. */ - mremap_size = thpsize / 2; + mremap_size = size / 2; mremap_mem = mmap(NULL, mremap_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mem == MAP_FAILED) { @@ -830,7 +834,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * child. This will result in some parts of the THP never * have been shared. 
*/ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTFORK); + ret = madvise(mem + pagesize, size - pagesize, MADV_DONTFORK); if (ret) { ksft_test_result_fail("MADV_DONTFORK failed\n"); goto munmap; @@ -844,7 +848,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) } wait(&ret); /* Allow for sharing all pages again. */ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DOFORK); + ret = madvise(mem + pagesize, size - pagesize, MADV_DOFORK); if (ret) { ksft_test_result_fail("MADV_DOFORK failed\n"); goto munmap; @@ -875,52 +879,65 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) munmap(mremap_mem, mremap_size); } -static void run_with_thp(test_fn fn, const char *desc) +static int sz2ord(size_t size) +{ + return __builtin_ctzll(size / pagesize); +} + +static void run_with_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with THP\n", desc); - do_run_with_thp(fn, THP_RUN_PMD); + ksft_print_msg("[RUN] %s ... with order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PMD, size); } -static void run_with_thp_swap(test_fn fn, const char *desc) +static void run_with_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc); - do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT); + ksft_print_msg("[RUN] %s ... with swapped-out order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT, size); } -static void run_with_pte_mapped_thp(test_fn fn, const char *desc) +static void run_with_pte_mapped_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc); - do_run_with_thp(fn, THP_RUN_PTE); + ksft_print_msg("[RUN] %s ... with PTE-mapped order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PTE, size); } -static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc) +static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc); - do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT); + ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT, size); } -static void run_with_single_pte_of_thp(test_fn fn, const char *desc) +static void run_with_single_pte_of_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc); - do_run_with_thp(fn, THP_RUN_SINGLE_PTE); + ksft_print_msg("[RUN] %s ... with single PTE of order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_SINGLE_PTE, size); } -static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc) +static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc); - do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT); + ksft_print_msg("[RUN] %s ... with single PTE of swapped-out order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT, size); } -static void run_with_partial_mremap_thp(test_fn fn, const char *desc) +static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc); - do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP); + ksft_print_msg("[RUN] %s ... 
with partially mremap()'ed order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size); } -static void run_with_partial_shared_thp(test_fn fn, const char *desc) +static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc); - do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED); + ksft_print_msg("[RUN] %s ... with partially shared order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED, size); } static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize) @@ -1091,15 +1108,15 @@ static void run_anon_test_case(struct test_case const *test_case) run_with_base_page(test_case->fn, test_case->desc); run_with_base_page_swap(test_case->fn, test_case->desc); - if (thpsize) { - run_with_thp(test_case->fn, test_case->desc); - run_with_thp_swap(test_case->fn, test_case->desc); - run_with_pte_mapped_thp(test_case->fn, test_case->desc); - run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc); - run_with_single_pte_of_thp(test_case->fn, test_case->desc); - run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc); - run_with_partial_mremap_thp(test_case->fn, test_case->desc); - run_with_partial_shared_thp(test_case->fn, test_case->desc); + if (pmdsize) { + run_with_thp(test_case->fn, test_case->desc, pmdsize); + run_with_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize); + run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize); + run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize); + run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize); } for (i = 0; i < nr_hugetlbsizes; i++) run_with_hugetlb(test_case->fn, test_case->desc, @@ -1120,7 +1137,7 @@ static int tests_per_anon_test_case(void) { int tests = 2 + nr_hugetlbsizes; - if (thpsize) + if (pmdsize) tests += 8; return tests; } @@ -1329,7 +1346,7 @@ static void run_anon_thp_test_cases(void) { int i; - if (!thpsize) + if (!pmdsize) return; ksft_print_msg("[INFO] Anonymous THP tests\n"); @@ -1338,13 +1355,13 @@ static void run_anon_thp_test_cases(void) struct test_case const *test_case = &anon_thp_test_cases[i]; ksft_print_msg("[RUN] %s\n", test_case->desc); - do_run_with_thp(test_case->fn, THP_RUN_PMD); + do_run_with_thp(test_case->fn, THP_RUN_PMD, pmdsize); } } static int tests_per_anon_thp_test_case(void) { - return thpsize ? 1 : 0; + return pmdsize ? 1 : 0; } typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size); @@ -1419,7 +1436,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) } /* For alignment purposes, we need twice the thp size. */ - mmap_size = 2 * thpsize; + mmap_size = 2 * pmdsize; mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mmap_mem == MAP_FAILED) { @@ -1434,11 +1451,11 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) } /* We need a THP-aligned memory area. 
*/ - mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); - smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1)); + mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1)); + smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1)); - ret = madvise(mem, thpsize, MADV_HUGEPAGE); - ret |= madvise(smem, thpsize, MADV_HUGEPAGE); + ret = madvise(mem, pmdsize, MADV_HUGEPAGE); + ret |= madvise(smem, pmdsize, MADV_HUGEPAGE); if (ret) { ksft_test_result_fail("MADV_HUGEPAGE failed\n"); goto munmap; @@ -1457,7 +1474,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) goto munmap; } - fn(mem, smem, thpsize); + fn(mem, smem, pmdsize); munmap: munmap(mmap_mem, mmap_size); if (mmap_smem != MAP_FAILED) @@ -1650,7 +1667,7 @@ static void run_non_anon_test_case(struct non_anon_test_case const *test_case) run_with_zeropage(test_case->fn, test_case->desc); run_with_memfd(test_case->fn, test_case->desc); run_with_tmpfile(test_case->fn, test_case->desc); - if (thpsize) + if (pmdsize) run_with_huge_zeropage(test_case->fn, test_case->desc); for (i = 0; i < nr_hugetlbsizes; i++) run_with_memfd_hugetlb(test_case->fn, test_case->desc, @@ -1671,7 +1688,7 @@ static int tests_per_non_anon_test_case(void) { int tests = 3 + nr_hugetlbsizes; - if (thpsize) + if (pmdsize) tests += 1; return tests; } @@ -1681,10 +1698,10 @@ int main(int argc, char **argv) int err; pagesize = getpagesize(); - thpsize = read_pmd_pagesize(); - if (thpsize) - ksft_print_msg("[INFO] detected THP size: %zu KiB\n", - thpsize / 1024); + pmdsize = read_pmd_pagesize(); + if (pmdsize) + ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n", + pmdsize / 1024); nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); detect_huge_zeropage(); From patchwork Fri Sep 29 11:44:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146510 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3966758vqu; Fri, 29 Sep 2023 04:56:08 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGcVvHAeUy7G6nV1ZPSwLevl97VfHbUyu+vs74FAkH8MOhq9yXBWbStDQT/pS881VH4/dHp X-Received: by 2002:a05:6a00:2303:b0:68f:b3ed:7d4d with SMTP id h3-20020a056a00230300b0068fb3ed7d4dmr4033504pfh.15.1695988568327; Fri, 29 Sep 2023 04:56:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988568; cv=none; d=google.com; s=arc-20160816; b=zSxoPMhvX7XQY3ZhlDwJ8BPaCkuJyz2AvPdjgdvsp0gN1UOSJaQKtv8DOOjfqMnmB6 vRF+PGif2s5OfVo9TwhBtDEUnNd+UFmFclS4Rd7sWVv8NFIJ0pBFmVcAi3gCQ5hXiJKD 3mGE6g5AAaUTyZ7tLW+4hR7A11N7aA4naxptMfVrwg0mzVnexVWUROyRAW9EECbD3EDz EPPPL9jWED7u7tqYhm+DxkHFeG4u80DhZFEwl8c+g6d14UN6LnATy94j6wBMOPR6sZBm 0/maXWVaNNy23I+HypcZYj/bhsHxCnpLhj+SG168nBV3AGgUQXHdO5PsQYyUnNI7HKwZ dBTQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=xAX3/CyH7Dx1azBnB2aEK4gEWTkd0D8Qfh6cCpOBfVo=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=dIsGsAOHKCXcCG9aspIJFgsTc6kCFPl8Y8WXTwFSyMLwWxhdthB3F20K3cY1GRMQN+ rjnviurQ0x8oMHg5a/6R2glF41ofdoYjLgqWgbFhxrhl6c+zCipnjc21vFqY9Ddp8q6+ JKtfV4mzz+Jh8yuINoHNk2lEKkRwru9dzhyNvVGmn5ZsxEplrKLddLbyRSMyLw4AD+jI 9ErC8K9+pu2k3BVnFBhv3dC1J28B3766jMNxY5nlVmOMg7NzoUQAJniE/bXekejFk89A UCj24TOff92Y5qNMxor0H7tsM+bwurTjCPVtlHRKgBd/dniKpQexrLQKSJluPAvgiiyt 1XWQ== 
From: Ryan Roberts
Subject: [PATCH v6 9/9] selftests/mm/cow: Add tests for small-order anon THP
Date: Fri, 29 Sep 2023 12:44:20 +0100
Message-Id: <20230929114421.3761121-10-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Add tests similar to the existing THP tests, but which operate on memory backed by smaller-order, PTE-mapped THP.
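For example (assuming a 4K base page size and a 2M PMD-sized THP), the harness below enables PMD_ORDER and PMD_ORDER - 1 via anon_orders, so these new tests exercise order-8 (1M) folios, which are always PTE-mapped.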
This reuses all the existing infrastructure. If the test suite detects that small-order THP is not supported by the kernel, the new tests are skipped. Signed-off-by: Ryan Roberts --- tools/testing/selftests/mm/cow.c | 93 ++++++++++++++++++++++++++++++++ 1 file changed, 93 insertions(+) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index d887ce454e34..6c5e37d8bb69 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -33,10 +33,13 @@ static size_t pagesize; static int pagemap_fd; static size_t pmdsize; +static size_t ptesize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; static bool has_huge_zeropage; +static unsigned int orig_anon_orders; +static bool orig_anon_orders_valid; static void detect_huge_zeropage(void) { @@ -1118,6 +1121,14 @@ static void run_anon_test_case(struct test_case const *test_case) run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize); run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize); } + if (ptesize) { + run_with_pte_mapped_thp(test_case->fn, test_case->desc, ptesize); + run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, ptesize); + run_with_single_pte_of_thp(test_case->fn, test_case->desc, ptesize); + run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, ptesize); + run_with_partial_mremap_thp(test_case->fn, test_case->desc, ptesize); + run_with_partial_shared_thp(test_case->fn, test_case->desc, ptesize); + } for (i = 0; i < nr_hugetlbsizes; i++) run_with_hugetlb(test_case->fn, test_case->desc, hugetlbsizes[i]); @@ -1139,6 +1150,8 @@ static int tests_per_anon_test_case(void) if (pmdsize) tests += 8; + if (ptesize) + tests += 6; return tests; } @@ -1693,6 +1706,80 @@ static int tests_per_non_anon_test_case(void) return tests; } +#define ANON_ORDERS_FILE "/sys/kernel/mm/transparent_hugepage/anon_orders" + +static int read_anon_orders(unsigned int *orders) +{ + ssize_t buflen = 80; + char buf[buflen]; + int fd; + + fd = open(ANON_ORDERS_FILE, O_RDONLY); + if (fd == -1) + return -1; + + buflen = read(fd, buf, buflen); + close(fd); + + if (buflen < 1) + return -1; + + *orders = strtoul(buf, NULL, 16); + + return 0; +} + +static int write_anon_orders(unsigned int orders) +{ + ssize_t buflen = 80; + char buf[buflen]; + int fd; + + fd = open(ANON_ORDERS_FILE, O_WRONLY); + if (fd == -1) + return -1; + + buflen = snprintf(buf, buflen, "0x%08x\n", orders); + buflen = write(fd, buf, buflen); + close(fd); + + if (buflen < 1) + return -1; + + return 0; +} + +static size_t save_thp_anon_orders(void) +{ + /* + * If the kernel supports multiple orders for anon THP (indicated by the + * presence of anon_orders file), configure it for the PMD-order and the + * PMD-order - 1, which we will report back and use as the PTE-order THP + * size. Save the original value so that it can be restored on exit. If + * the kernel does not support multiple orders, report back 0 for the + * PTE-size so those tests are skipped. 
+ */ + + int pteorder = sz2ord(pmdsize) - 1; + unsigned int orders = (1UL << sz2ord(pmdsize)) | (1UL << pteorder); + + if (read_anon_orders(&orig_anon_orders)) + return 0; + + orig_anon_orders_valid = true; + + if (write_anon_orders(orders)) + return 0; + + return pagesize << pteorder; +} + +static void restore_thp_anon_orders(void) +{ + if (orig_anon_orders_valid) + write_anon_orders(orig_anon_orders); +} + int main(int argc, char **argv) { int err; @@ -1702,6 +1789,10 @@ int main(int argc, char **argv) if (pmdsize) ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n", pmdsize / 1024); + ptesize = save_thp_anon_orders(); + if (ptesize) + ksft_print_msg("[INFO] configured PTE-mapped THP size: %zu KiB\n", + ptesize / 1024); nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); detect_huge_zeropage(); @@ -1720,6 +1811,8 @@ int main(int argc, char **argv) run_anon_thp_test_cases(); run_non_anon_test_cases(); + restore_thp_anon_orders(); + err = ksft_get_fail_cnt(); if (err) ksft_exit_fail_msg("%d out of %d tests failed\n",