From patchwork Fri Sep 29 11:44:12 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146507
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 1/9] mm: Allow deferred splitting of arbitrary anon large folios
Date: Fri, 29 Sep 2023 12:44:12 +0100
Message-Id: <20230929114421.3761121-2-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In preparation for the introduction of large folios for anonymous
memory, we would like to be able to split them when they have unmapped
subpages, in order to free those unused pages under memory pressure. So
remove the artificial requirement that the large folio must be at least
PMD-sized.
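To make the target scenario concrete: a large anon folio ends up with
unmapped subpages whenever userspace discards or unmaps only part of a
range that happens to be backed by one. The sketch below is illustrative
only - whether a THP/large folio actually backs the mapping depends on
the system's THP configuration, and the 2M region size is just an
example.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;	/* 2M: PMD-sized with 4K pages */
	char *p;

	/* Anonymous private mapping; THP may back it with a large folio. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch every page so the whole range is mapped. */
	memset(p, 0x5a, len);

	/*
	 * Discard only the second half. If a large folio backs the range,
	 * some of its pages are now unmapped while others remain mapped -
	 * the case where deferred splitting lets the kernel reclaim the
	 * unused subpages under memory pressure.
	 */
	if (madvise(p + len / 2, len / 2, MADV_DONTNEED))
		perror("madvise");

	munmap(p, len);
	return 0;
}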
Reviewed-by: Yu Zhao
Reviewed-by: Yin Fengwei
Reviewed-by: Matthew Wilcox (Oracle)
Reviewed-by: David Hildenbrand
Signed-off-by: Ryan Roberts
---
 mm/rmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 9f795b93cf40..8600bd029acf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1446,11 +1446,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 		__lruvec_stat_mod_folio(folio, idx, -nr);

 		/*
-		 * Queue anon THP for deferred split if at least one
+		 * Queue anon large folio for deferred split if at least one
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}

From patchwork Fri Sep 29 11:44:13 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146503
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
Date: Fri, 29 Sep 2023 12:44:13 +0100
Message-Id: <20230929114421.3761121-3-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In preparation for anonymous large folio support, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable large folio to be
passed to it. In this case, all contained pages are accounted using the
order-0 folio (or base page) scheme.
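For orientation before the diff below, the three initialisation paths
can be modelled in a few lines of userspace C. This is only a sketch of
the accounting described above - the printed names mirror the kernel
fields the patch touches, nothing here is kernel API, and PMD_ORDER = 9
assumes 4K base pages with 2M PMDs.

#include <stdio.h>

#define PMD_ORDER 9	/* assumes 4K base pages and 2M PMDs */

/*
 * Userspace model of the three initialisation paths taken by
 * folio_add_new_anon_rmap() after this patch. Purely illustrative; it
 * only prints which counters the kernel would initialise.
 */
static void new_anon_rmap_model(unsigned int order)
{
	unsigned int nr = 1u << order;

	if (order == 0) {
		/* Base page: one folio-level mapcount. */
		printf("order-0:  _mapcount = 0, NR_ANON_MAPPED += 1\n");
	} else if (order < PMD_ORDER) {
		/* PTE-mapped large folio: per-page (order-0 style) accounting. */
		printf("order-%u: %u x page->_mapcount = 0, "
		       "_nr_pages_mapped = %u, NR_ANON_MAPPED += %u\n",
		       order, nr, nr, nr);
	} else {
		/* PMD-mapped THP: entire-folio accounting. */
		printf("order-%u: _entire_mapcount = 0, "
		       "_nr_pages_mapped = COMPOUND_MAPPED, "
		       "NR_ANON_MAPPED += %u, NR_ANON_THPS += %u\n",
		       order, nr, nr);
	}
}

int main(void)
{
	new_anon_rmap_model(0);		/* 4K  */
	new_anon_rmap_model(3);		/* 32K */
	new_anon_rmap_model(PMD_ORDER);	/* 2M  */
	return 0;
}

The middle branch is the new one: each subpage is accounted exactly as
an order-0 page would be, while _nr_pages_mapped additionally records
how many of the folio's pages are mapped.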
Reviewed-by: Yu Zhao
Reviewed-by: Yin Fengwei
Signed-off-by: Ryan Roberts
---
 mm/rmap.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 8600bd029acf..106149690366 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1266,31 +1266,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
  * This means the inc-and-test can be bypassed.
  * The folio does not have to be locked.
  *
- * If the folio is large, it is accounted as a THP. As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP. As the folio
  * is new, it's assumed to be mapped exclusively by a single process.
  */
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address)
 {
-	int nr;
+	int nr = folio_nr_pages(folio);

-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
 	__folio_set_swapbacked(folio);

-	if (likely(!folio_test_pmd_mappable(folio))) {
+	if (likely(!folio_test_large(folio))) {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_mapcount, 0);
-		nr = 1;
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+	} else if (!folio_test_pmd_mappable(folio)) {
+		int i;
+
+		for (i = 0; i < nr; i++) {
+			struct page *page = folio_page(folio, i);
+
+			/* increment count (starts at -1) */
+			atomic_set(&page->_mapcount, 0);
+			__page_set_anon_rmap(folio, page, vma,
+					address + (i << PAGE_SHIFT), 1);
+		}
+
+		atomic_set(&folio->_nr_pages_mapped, nr);
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-		nr = folio_nr_pages(folio);
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
 	}

 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }

 /**

From patchwork Fri Sep 29 11:44:14 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146508
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 3/9] mm: thp: Account pte-mapped anonymous THP usage
Date: Fri, 29 Sep 2023 12:44:14 +0100
Message-Id: <20230929114421.3761121-4-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Add accounting for pte-mapped anonymous transparent hugepages at
various locations. This visibility will aid in debugging and tuning
performance for the "small order" THP extension that will be added in a
subsequent commit, where hugepages can be allocated which are larger
than order-0 but smaller than PMD_ORDER.

This new accounting follows a similar pattern to the existing
NR_ANON_THPS, which measures pmd-mapped anonymous transparent
hugepages. We account pte-mapped anonymous THP mappings per-page, where
the page is mapped at least once via PTE and belongs to a large folio.
So when a page belonging to a large folio is PTE-mapped for the first
time, we add 1 to NR_ANON_THPS_PTEMAPPED; when such a page is
PTE-unmapped for the last time, we subtract 1 from
NR_ANON_THPS_PTEMAPPED.

/proc/meminfo:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios globally (similar to the
  AnonHugePages field).

/proc/vmstat:
  Introduce new "nr_anon_thp_pte" field, which reports the amount of
  memory (in pages) mapped from large folios globally (similar to the
  nr_anon_transparent_hugepages field).

/sys/devices/system/node/nodeX/meminfo:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios per-node (similar to the
  AnonHugePages field).

show_mem (panic logger):
  Introduce new "anon_thp_pte" field, which reports the amount of
  memory (in KiB) mapped from large folios per-node (similar to the
  anon_thp field).

memory.stat (cgroup v1 and v2):
  Introduce new "anon_thp_pte" field, which reports the amount of
  memory (in bytes) mapped from large folios in the memcg (similar to
  the rss_huge (v1) / anon_thp (v2) fields).

/proc/<pid>/smaps & /proc/<pid>/smaps_rollup:
  Introduce new "AnonHugePteMap" field, which reports the amount of
  memory (in KiB) mapped from large folios within the vma/process
  (similar to the AnonHugePages field).

NOTE on charge migration: The new NR_ANON_THPS_PTEMAPPED charge is NOT
moved between cgroups, even when the (v1)
memory.move_charge_at_immigrate feature is enabled. That feature is
marked deprecated, and the current code does not attempt to move the
NR_ANON_MAPPED charge for large PTE-mapped folios anyway (see the
comment in mem_cgroup_move_charge_pte_range()).
If this code were enhanced to allow moving the NR_ANON_MAPPED charge
for large PTE-mapped folios, we would also need to add support for
moving the new NR_ANON_THPS_PTEMAPPED charge. This would likely get
quite fiddly. Given the deprecation of memory.move_charge_at_immigrate,
I assume it is not valuable to implement.

NOTE on naming: Given that the new small-order anonymous THP feature
will be exposed to user space as an extension to THP, I've opted to
name the new counters after THP as well (as opposed to "large"/"large
folio"/etc.), so "huge" no longer strictly means PMD - one could argue
that hugetlb already breaks this rule anyway. I also did not want to
risk breaking backwards compatibility by renaming/redefining the
existing counters (which would have resulted in more consistent and
clearer names). So the existing NR_ANON_THPS counters remain and
continue to refer only to PMD-mapped THPs, and the new counters refer
only to PTE-mapped THPs.

Signed-off-by: Ryan Roberts
---
 Documentation/ABI/testing/procfs-smaps_rollup  |  1 +
 Documentation/admin-guide/cgroup-v1/memory.rst |  5 ++++-
 Documentation/admin-guide/cgroup-v2.rst        |  6 +++++-
 Documentation/admin-guide/mm/transhuge.rst     | 11 +++++++----
 Documentation/filesystems/proc.rst             | 14 ++++++++++++--
 drivers/base/node.c                            |  2 ++
 fs/proc/meminfo.c                              |  2 ++
 fs/proc/task_mmu.c                             |  4 ++++
 include/linux/mmzone.h                         |  1 +
 mm/memcontrol.c                                |  8 ++++++++
 mm/rmap.c                                      | 11 +++++++++--
 mm/show_mem.c                                  |  2 ++
 mm/vmstat.c                                    |  1 +
 13 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup index b446a7154a1b..b50b3eda5a3f 100644 --- a/Documentation/ABI/testing/procfs-smaps_rollup +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -34,6 +34,7 @@ Description: Anonymous: 68 kB LazyFree: 0 kB AnonHugePages: 0 kB + AnonHugePteMap: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 5f502bf68fbc..b7efc7531896 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -535,7 +535,10 @@ memory.stat file includes following statistics: cache # of bytes of page cache memory. rss # of bytes of anonymous and swap cache memory (includes transparent hugepages). - rss_huge # of bytes of anonymous transparent hugepages. + rss_huge # of bytes of anonymous transparent hugepages, mapped by + PMD. + anon_thp_pte # of bytes of anonymous transparent hugepages, mapped by + PTE. mapped_file # of bytes of mapped file (includes tmpfs/shmem) pgpgin # of charging events to the memory cgroup. The charging event happens each time a page is accounted as either mapped
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index b26b5274eaaf..48b961b8fc6d 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1421,7 +1421,11 @@ PAGE_SIZE multiple when read back.
anon_thp Amount of memory used in anonymous mappings backed by - transparent hugepages + transparent hugepages, mapped by PMD + + anon_thp_pte + Amount of memory used in anonymous mappings backed by + transparent hugepages, mapped by PTE file_thp Amount of cached filesystem data backed by transparent diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b0cc8243e093..ebda57850643 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -291,10 +291,13 @@ Monitoring usage ================ The number of anonymous transparent huge pages currently used by the -system is available by reading the AnonHugePages field in ``/proc/meminfo``. -To identify what applications are using anonymous transparent huge pages, -it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields -for each mapping. +system is available by reading the AnonHugePages and AnonHugePteMap +fields in ``/proc/meminfo``. To identify what applications are using +anonymous transparent huge pages, it is necessary to read +``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap +fields for each mapping. Note that in both cases, AnonHugePages refers +only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped +using PTEs. The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 2b59cff8be17..ccbb76a509f0 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -464,6 +464,7 @@ Memory Area, or VMA) there is a series of lines such as the following:: KSM: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB + AnonHugePteMap: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB @@ -511,7 +512,11 @@ pressure if the memory is clean. Please note that the printed value might be lower than the real value due to optimizations used in the current implementation. If this is not desirable please file a bug report. -"AnonHugePages" shows the amount of memory backed by transparent hugepage. +"AnonHugePages" shows the amount of memory backed by transparent hugepage, +mapped by PMD. + +"AnonHugePteMap" shows the amount of memory backed by transparent hugepage, +mapped by PTE. "ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by huge pages. @@ -1006,6 +1011,7 @@ Example output. You may not have all of these fields. EarlyMemtestBad: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 4149248 kB + AnonHugePteMap: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB FileHugePages: 0 kB @@ -1165,7 +1171,11 @@ HardwareCorrupted The amount of RAM/memory in KB, the kernel identifies as corrupted. 
AnonHugePages - Non-file backed huge pages mapped into userspace page tables + Non-file backed huge pages mapped into userspace page tables by + PMD +AnonHugePteMap + Non-file backed huge pages mapped into userspace page tables by + PTE ShmemHugePages Memory used by shared memory (shmem) and tmpfs allocated with huge pages diff --git a/drivers/base/node.c b/drivers/base/node.c index 493d533f8375..08f1759387d2 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -443,6 +443,7 @@ static ssize_t node_read_meminfo(struct device *dev, "Node %d SUnreclaim: %8lu kB\n" #ifdef CONFIG_TRANSPARENT_HUGEPAGE "Node %d AnonHugePages: %8lu kB\n" + "Node %d AnonHugePteMap: %8lu kB\n" "Node %d ShmemHugePages: %8lu kB\n" "Node %d ShmemPmdMapped: %8lu kB\n" "Node %d FileHugePages: %8lu kB\n" @@ -475,6 +476,7 @@ static ssize_t node_read_meminfo(struct device *dev, #ifdef CONFIG_TRANSPARENT_HUGEPAGE , nid, K(node_page_state(pgdat, NR_ANON_THPS)), + nid, K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)), nid, K(node_page_state(pgdat, NR_SHMEM_THPS)), nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)), nid, K(node_page_state(pgdat, NR_FILE_THPS)), diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 45af9a989d40..bac20cc60b6a 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -143,6 +143,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) #ifdef CONFIG_TRANSPARENT_HUGEPAGE show_val_kb(m, "AnonHugePages: ", global_node_page_state(NR_ANON_THPS)); + show_val_kb(m, "AnonHugePteMap: ", + global_node_page_state(NR_ANON_THPS_PTEMAPPED)); show_val_kb(m, "ShmemHugePages: ", global_node_page_state(NR_SHMEM_THPS)); show_val_kb(m, "ShmemPmdMapped: ", diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3dd5be96691b..7b5dad163533 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -392,6 +392,7 @@ struct mem_size_stats { unsigned long anonymous; unsigned long lazyfree; unsigned long anonymous_thp; + unsigned long anonymous_thp_pte; unsigned long shmem_thp; unsigned long file_thp; unsigned long swap; @@ -452,6 +453,8 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page, mss->anonymous += size; if (!PageSwapBacked(page) && !dirty && !PageDirty(page)) mss->lazyfree += size; + if (!compound && PageTransCompound(page)) + mss->anonymous_thp_pte += size; } if (PageKsm(page)) @@ -833,6 +836,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss, SEQ_PUT_DEC(" kB\nKSM: ", mss->ksm); SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); + SEQ_PUT_DEC(" kB\nAnonHugePteMap: ", mss->anonymous_thp_pte); SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4106fbc5b4b3..5032fc31c651 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -186,6 +186,7 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, + NR_ANON_THPS_PTEMAPPED, NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d13dde2f8b56..07d8e0b55b0e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -809,6 +809,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, case NR_ANON_MAPPED: case NR_FILE_MAPPED: case NR_ANON_THPS: + case NR_ANON_THPS_PTEMAPPED: case 
NR_SHMEM_PMDMAPPED: case NR_FILE_PMDMAPPED: WARN_ON_ONCE(!in_task()); @@ -1512,6 +1513,7 @@ static const struct memory_stat memory_stats[] = { #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE { "anon_thp", NR_ANON_THPS }, + { "anon_thp_pte", NR_ANON_THPS_PTEMAPPED }, { "file_thp", NR_FILE_THPS }, { "shmem_thp", NR_SHMEM_THPS }, #endif @@ -4052,6 +4054,7 @@ static const unsigned int memcg1_stats[] = { NR_ANON_MAPPED, #ifdef CONFIG_TRANSPARENT_HUGEPAGE NR_ANON_THPS, + NR_ANON_THPS_PTEMAPPED, #endif NR_SHMEM, NR_FILE_MAPPED, @@ -4067,6 +4070,7 @@ static const char *const memcg1_stat_names[] = { "rss", #ifdef CONFIG_TRANSPARENT_HUGEPAGE "rss_huge", + "anon_thp_pte", #endif "shmem", "mapped_file", @@ -6259,6 +6263,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, * can be done but it would be too convoluted so simply * ignore such a partial THP and keep it in original * memcg. There should be somebody mapping the head. + * This simplification also means that pte-mapped large + * folios are never migrated, which means we don't need + * to worry about migrating the NR_ANON_THPS_PTEMAPPED + * accounting. */ if (PageTransCompound(page)) goto put; diff --git a/mm/rmap.c b/mm/rmap.c index 106149690366..52dabee73023 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1205,7 +1205,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0; bool compound = flags & RMAP_COMPOUND; bool first = true; @@ -1214,6 +1214,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, first = atomic_inc_and_test(&page->_mapcount); nr = first; if (first && folio_test_large(folio)) { + nr_lgmapped = 1; nr = atomic_inc_return_relaxed(mapped); nr = (nr < COMPOUND_MAPPED); } @@ -1241,6 +1242,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); + if (nr_lgmapped) + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr_lgmapped); if (nr) __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); @@ -1295,6 +1298,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, } atomic_set(&folio->_nr_pages_mapped, nr); + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr); } else { /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); @@ -1405,7 +1409,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0; bool last; enum node_stat_item idx; @@ -1423,6 +1427,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, last = atomic_add_negative(-1, &page->_mapcount); nr = last; if (last && folio_test_large(folio)) { + nr_lgmapped = 1; nr = atomic_dec_return_relaxed(mapped); nr = (nr < COMPOUND_MAPPED); } @@ -1454,6 +1459,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, idx = NR_FILE_PMDMAPPED; __lruvec_stat_mod_folio(folio, idx, -nr_pmdmapped); } + if (nr_lgmapped && folio_test_anon(folio)) + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, -nr_lgmapped); if (nr) { idx = folio_test_anon(folio) ? 
NR_ANON_MAPPED : NR_FILE_MAPPED; __lruvec_stat_mod_folio(folio, idx, -nr); diff --git a/mm/show_mem.c b/mm/show_mem.c index 4b888b18bdde..e648a815f0fb 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -254,6 +254,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z " shmem_thp:%lukB" " shmem_pmdmapped:%lukB" " anon_thp:%lukB" + " anon_thp_pte:%lukB" #endif " writeback_tmp:%lukB" " kernel_stack:%lukB" @@ -280,6 +281,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z K(node_page_state(pgdat, NR_SHMEM_THPS)), K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)), K(node_page_state(pgdat, NR_ANON_THPS)), + K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)), #endif K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), node_page_state(pgdat, NR_KERNEL_STACK_KB), diff --git a/mm/vmstat.c b/mm/vmstat.c index 00e81e99c6ee..267de0e4ddca 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1224,6 +1224,7 @@ const char * const vmstat_text[] = { "nr_file_hugepages", "nr_file_pmdmapped", "nr_anon_transparent_hugepages", + "nr_anon_thp_pte", "nr_vmscan_write", "nr_vmscan_immediate_reclaim", "nr_dirtied",

From patchwork Fri Sep 29 11:44:15 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 146514
From: Ryan Roberts
To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand, Yu Zhao,
    Catalin Marinas, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov", John Hubbard,
    David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org
Subject: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files
Date: Fri, 29 Sep 2023 12:44:15 +0100
Message-Id: <20230929114421.3761121-5-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In preparation for adding support for anonymous large folios that are
smaller than PMD-size, introduce 2 new sysfs files that will be used to
control the new behaviours via the transparent_hugepage interface. For
now, the kernel still only supports PMD-order anonymous THP, so when
reading back anon_orders, it will reflect that. Therefore there are no
behavioural changes intended here.
The bulk of the change is implemented by converting
transhuge_vma_suitable() and hugepage_vma_check() so that they take a
bitfield of orders for which the user wants to determine support, and
the functions filter out all the orders that can't be supported. If
only 1 order is set in the input then the output can continue to be
treated like a boolean; this is the case for most call sites.

The remainder of this description is copied from
Documentation/admin-guide/mm/transhuge.rst, as modified by this commit.
See that file for further details.

By default, allocation of anonymous THPs that are smaller than PMD-size
is disabled. These smaller allocation orders can be enabled by writing
an encoded set of orders as follows::

  echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders

Here an order refers to the number of pages in the large folio as
2^order, and each order is encoded in the written value such that each
set bit represents an enabled order; so setting bit-2 indicates that
order-2 folios are in use, and order-2 means 2^2=4 pages (=16K if the
page size is 4K). The example above enables order-9 (PMD-order) and
order-3.

By enabling multiple orders, allocation of each order will be
attempted, highest to lowest, until a successful allocation is made. If
the PMD-order is unset, then no PMD-sized THPs will be allocated.

The kernel will ignore any orders that it does not support, so read the
file back to determine which orders are enabled::

  cat /sys/kernel/mm/transparent_hugepage/anon_orders

For some workloads it may be desirable to limit some THP orders to be
used only for MADV_HUGEPAGE regions, while allowing others to be used
always. For example, a workload may only benefit from PMD-sized THP in
specific areas, but can benefit from 32K THP more generally. In this
case, THP can be enabled in ``madvise`` mode as normal, but specific
orders can be configured to be allocated as if in ``always`` mode. The
example below enables orders 9 and 3, with order-9 only applied to
MADV_HUGEPAGE regions, and order-3 applied always::

  echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
  echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders
  echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask

Signed-off-by: Ryan Roberts
---
 Documentation/admin-guide/mm/transhuge.rst |  74 ++++++++--
 Documentation/filesystems/proc.rst         |   6 +-
 fs/proc/task_mmu.c                         |   3 +-
 include/linux/huge_mm.h                    |  93 +++++++++---
 mm/huge_memory.c                           | 164 ++++++++++++++++++---
 mm/khugepaged.c                            |  18 ++-
 mm/memory.c                                |   6 +-
 mm/page_vma_mapped.c                       |   3 +-
 8 files changed, 296 insertions(+), 71 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index ebda57850643..9f954e73a4ca 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -45,10 +45,22 @@ components: the two is using hugepages just because of the fact the TLB miss is going to run faster. +Furthermore, it is possible to configure THP to allocate large folios +to back anonymous memory, which are smaller than PMD-size (for example +16K, 32K, 64K, etc). These THPs continue to be PTE-mapped, but in many +cases can still provide the similar benefits to those outlined above: +Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, +etc), but latency spikes are much less prominent because the size of +each page isn't as huge as the PMD-sized variant and there is less +memory to clear in each page fault.
Some architectures also employ TLB +compression mechanisms to squeeze more entries in when a set of PTEs +are virtually and physically contiguous and approporiately aligned. In +this case, TLB misses will occur less often. + THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into huge pages. +collapses sequences of basic pages into PMD-sized huge pages. The THP behaviour is controlled via :ref:`sysfs ` interface and using madvise(2) and prctl(2) system calls. @@ -146,25 +158,69 @@ madvise never should be self-explanatory. -By default kernel tries to use huge zero page on read page fault to -anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1:: +By default kernel tries to use huge, PMD-mapped zero page on read page +fault to anonymous mapping. It's possible to disable huge zero page by +writing 0 or enable it back by writing 1:: echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage:: +library) may want to know the size (in bytes) of a PMD-mappable +transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size +By default, allocation of anonymous THPs that are smaller than +PMD-size is disabled. These smaller allocation orders can be enabled +by writing an encoded set of orders as follows:: + + echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders + +Where an order refers to the number of pages in the large folio as +2^order, and where each order is encoded in the written value such +that each set bit represents an enabled order; So setting bit-2 +indicates that order-2 folios are in use, and order-2 means 2^2=4 +pages (=16K if the page size is 4K). The example above enables order-9 +(PMD-order) and order-3. + +By enabling multiple orders, allocation of each order will be +attempted, highest to lowest, until a successful allocation is made. +If the PMD-order is unset, then no PMD-sized THPs will be allocated. + +The kernel will ignore any orders that it does not support so read the +file back to determine which orders are enabled:: + + cat /sys/kernel/mm/transparent_hugepage/anon_orders + +For some workloads it may be desirable to limit some THP orders to be +used only for MADV_HUGEPAGE regions, while allowing others to be used +always. For example, a workload may only benefit from PMD-sized THP in +specific areas, but can take benefit of 32K sized THP more generally. +In this case, THP can be enabled in ``madvise`` mode as normal, but +specific orders can be configured to be allocated as if in ``always`` +mode. The below example enables orders 9 and 3, with order-9 only +applied to MADV_HUGEPAGE regions, and order-3 applied always:: + + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders + echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask + khugepaged will be automatically started when -transparent_hugepage/enabled is set to "always" or "madvise, and it'll -be automatically shutdown if it's set to "never". 
+transparent_hugepage/enabled is set to "always" or "madvise", +providing the PMD-order is enabled in +transparent_hugepage/anon_orders, and it'll be automatically shutdown +if it's set to "never" or the PMD-order is disabled in +transparent_hugepage/anon_orders. Khugepaged controls ------------------- +.. note:: + khugepaged currently only searches for opportunities to collapse to + PMD-sized THP and no attempt is made to collapse to smaller order + THP. + khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's @@ -285,7 +341,7 @@ Need of application restart The transparent_hugepage/enabled values and tmpfs mount option only affect future behavior. So to make them effective you need to restart any application that could have been using hugepages. This also applies to the -regions registered in khugepaged. +regions registered in khugepaged, and transparent_hugepage/anon_orders. Monitoring usage ================ @@ -416,7 +472,7 @@ for huge pages. Optimizing the applications =========================== -To be guaranteed that the kernel will map a 2M page immediately in any +To be guaranteed that the kernel will map a thp immediately in any memory region, the mmap region has to be hugepage naturally aligned. posix_memalign() can provide that guarantee. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index ccbb76a509f0..72526f8bb658 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -533,9 +533,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. -"THPeligible" indicates whether the mapping is eligible for allocating THP -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise. -It just shows the current status. +"THPeligible" indicates whether the mapping is eligible for allocating +naturally aligned THP pages of any currently enabled order. 1 if true, 0 +otherwise. It just shows the current status. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 7b5dad163533..f978dce7f7ce 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -869,7 +869,8 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); seq_printf(m, "THPeligible: %8u\n", - hugepage_vma_check(vma, vma->vm_flags, true, false, true)); + !!hugepage_vma_check(vma, vma->vm_flags, true, false, true, + THP_ORDERS_ALL)); if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index fa0350b0812a..2e7c338229a6 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -67,6 +67,21 @@ extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) #define HPAGE_PMD_NR (1<vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR)) - return false; +static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma, + unsigned long addr, unsigned int orders) +{ + int order; + + /* + * Iterate over orders, highest to lowest, removing orders that don't + * meet alignment requirements from the set. 
Exit loop at first order + * that meets requirements, since all lower orders must also meet + * requirements. + */ + + order = first_order(orders); + + while (orders) { + unsigned long hpage_size = PAGE_SIZE << order; + unsigned long haddr = ALIGN_DOWN(addr, hpage_size); + + if (haddr >= vma->vm_start && + haddr + hpage_size <= vma->vm_end) { + if (!vma_is_anonymous(vma)) { + if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - + vma->vm_pgoff, + hpage_size >> PAGE_SHIFT)) + break; + } else + break; + } + + order = next_order(&orders, order); } - haddr = addr & HPAGE_PMD_MASK; - - if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) - return false; - return true; + return orders; } static inline bool file_thp_enabled(struct vm_area_struct *vma) @@ -130,8 +173,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); } -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, - bool smaps, bool in_pf, bool enforce_sysfs); +unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, bool in_pf, + bool enforce_sysfs, unsigned int orders); #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ @@ -267,17 +311,18 @@ static inline bool folio_test_pmd_mappable(struct folio *folio) return false; } -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, - unsigned long addr) +static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma, + unsigned long addr, unsigned int orders) { - return false; + return 0; } -static inline bool hugepage_vma_check(struct vm_area_struct *vma, - unsigned long vm_flags, bool smaps, - bool in_pf, bool enforce_sysfs) +static inline unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, + bool in_pf, bool enforce_sysfs, + unsigned int orders) { - return false; + return 0; } static inline void folio_prep_large_rmappable(struct folio *folio) {} diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 064fbd90822b..bcecce769017 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -70,12 +70,48 @@ static struct shrinker deferred_split_shrinker; static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; unsigned long huge_zero_pfn __read_mostly = ~0UL; +unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER); +static unsigned int huge_anon_always_mask __read_mostly; -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, - bool smaps, bool in_pf, bool enforce_sysfs) +/** + * hugepage_vma_check - determine which hugepage orders can be applied to vma + * @vma: the vm area to check + * @vm_flags: use these vm_flags instead of vma->vm_flags + * @smaps: whether answer will be used for smaps file + * @in_pf: whether answer will be used by page fault handler + * @enforce_sysfs: whether sysfs config should be taken into account + * @orders: bitfield of all orders to consider + * + * Calculates the intersection of the requested hugepage orders and the allowed + * hugepage orders for the provided vma. Permitted orders are encoded as a set + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3 + * corresponds to order-3, etc). Order-0 is never considered a hugepage order. + * + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage + * orders are allowed. 
+ */ +unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, bool in_pf, + bool enforce_sysfs, unsigned int orders) { + /* + * Fix up the orders mask; Supported orders for file vmas are static. + * Supported orders for anon vmas are configured dynamically - but only + * use the dynamic set if enforce_sysfs=true, otherwise use the full + * set. + */ + if (vma_is_anonymous(vma)) + orders &= enforce_sysfs ? READ_ONCE(huge_anon_orders) + : THP_ORDERS_ALL_ANON; + else + orders &= THP_ORDERS_ALL_FILE; + + /* No orders in the intersection. */ + if (!orders) + return 0; + if (!vma->vm_mm) /* vdso */ - return false; + return 0; /* * Explicitly disabled through madvise or prctl, or some @@ -84,16 +120,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * */ if ((vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) - return false; + return 0; /* * If the hardware/firmware marked hugepage support disabled. */ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED)) - return false; + return 0; /* khugepaged doesn't collapse DAX vma, but page fault is fine. */ if (vma_is_dax(vma)) - return in_pf; + return in_pf ? orders : 0; /* * Special VMA and hugetlb VMA. @@ -101,17 +137,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * VM_MIXEDMAP set. */ if (vm_flags & VM_NO_KHUGEPAGED) - return false; + return 0; /* - * Check alignment for file vma and size for both file and anon vma. + * Check alignment for file vma and size for both file and anon vma by + * filtering out the unsuitable orders. * * Skip the check for page fault. Huge fault does the check in fault - * handlers. And this check is not suitable for huge PUD fault. + * handlers. */ - if (!in_pf && - !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE))) - return false; + if (!in_pf) { + int order = first_order(orders); + unsigned long addr; + + while (orders) { + addr = vma->vm_end - (PAGE_SIZE << order); + if (transhuge_vma_suitable(vma, addr, BIT(order))) + break; + order = next_order(&orders, order); + } + + if (!orders) + return 0; + } /* * Enabled via shmem mount options or sysfs settings. @@ -120,23 +168,35 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, */ if (!in_pf && shmem_file(vma->vm_file)) return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, - !enforce_sysfs, vma->vm_mm, vm_flags); + !enforce_sysfs, vma->vm_mm, vm_flags) + ? orders : 0; /* Enforce sysfs THP requirements as necessary */ - if (enforce_sysfs && - (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) && - !hugepage_flags_always()))) - return false; + if (enforce_sysfs) { + /* enabled=never. */ + if (!hugepage_flags_enabled()) + return 0; + + /* enabled=madvise without VM_HUGEPAGE. */ + if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always()) { + if (vma_is_anonymous(vma)) { + orders &= READ_ONCE(huge_anon_always_mask); + if (!orders) + return 0; + } else + return 0; + } + } /* Only regular file is valid */ if (!in_pf && file_thp_enabled(vma)) - return true; + return orders; if (!vma_is_anonymous(vma)) - return false; + return 0; if (vma_is_temporary_stack(vma)) - return false; + return 0; /* * THPeligible bit of smaps should show 1 for proper VMAs even @@ -146,9 +206,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * the first page fault. */ if (!vma->anon_vma) - return (smaps || in_pf); + return (smaps || in_pf) ? 
orders : 0; - return true; + return orders; } static bool get_huge_zero_page(void) @@ -391,11 +451,69 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size); +static ssize_t anon_orders_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_orders)); +} + +static ssize_t anon_orders_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + int ret = count; + unsigned int orders; + + err = kstrtouint(buf, 0, &orders); + if (err) + ret = -EINVAL; + + if (ret > 0) { + orders &= THP_ORDERS_ALL_ANON; + WRITE_ONCE(huge_anon_orders, orders); + + err = start_stop_khugepaged(); + if (err) + ret = err; + } + + return ret; +} + +static struct kobj_attribute anon_orders_attr = __ATTR_RW(anon_orders); + +static ssize_t anon_always_mask_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_always_mask)); +} + +static ssize_t anon_always_mask_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned int always_mask; + + err = kstrtouint(buf, 0, &always_mask); + if (err) + return -EINVAL; + + WRITE_ONCE(huge_anon_always_mask, always_mask); + + return count; +} + +static struct kobj_attribute anon_always_mask_attr = __ATTR_RW(anon_always_mask); + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, &use_zero_page_attr.attr, &hpage_pmd_size_attr.attr, + &anon_orders_attr.attr, + &anon_always_mask_attr.attr, #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif @@ -778,7 +896,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) struct folio *folio; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - if (!transhuge_vma_suitable(vma, haddr)) + if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER))) return VM_FAULT_FALLBACK; if (unlikely(anon_vma_prepare(vma))) return VM_FAULT_OOM; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 88433cc25d8a..2b5c0321d96b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -446,7 +446,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && hugepage_flags_enabled()) { - if (hugepage_vma_check(vma, vm_flags, false, false, true)) + if (hugepage_vma_check(vma, vm_flags, false, false, true, + BIT(PMD_ORDER))) __khugepaged_enter(vma->vm_mm); } } @@ -921,10 +922,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, if (!vma) return SCAN_VMA_NULL; - if (!transhuge_vma_suitable(vma, address)) + if (!transhuge_vma_suitable(vma, address, BIT(PMD_ORDER))) return SCAN_ADDRESS_RANGE; if (!hugepage_vma_check(vma, vma->vm_flags, false, false, - cc->is_khugepaged)) + cc->is_khugepaged, BIT(PMD_ORDER))) return SCAN_VMA_CHECK; /* * Anon VMA expected, the address may be unmapped then @@ -1499,7 +1500,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, * and map it by a PMD, regardless of sysfs THP settings. As such, let's * analogously elide sysfs THP settings here. 
*/ - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false, + BIT(PMD_ORDER))) return SCAN_VMA_CHECK; /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ @@ -2369,7 +2371,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, progress++; break; } - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) { + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true, + BIT(PMD_ORDER))) { skip: progress++; continue; @@ -2626,7 +2629,7 @@ int start_stop_khugepaged(void) int err = 0; mutex_lock(&khugepaged_mutex); - if (hugepage_flags_enabled()) { + if (hugepage_flags_enabled() && (huge_anon_orders & BIT(PMD_ORDER))) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL, "khugepaged"); @@ -2706,7 +2709,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, *prev = vma; - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false, + BIT(PMD_ORDER))) return -EINVAL; cc = kmalloc(sizeof(*cc), GFP_KERNEL); diff --git a/mm/memory.c b/mm/memory.c index e4b0f6a461d8..b5b82fc8e164 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4256,7 +4256,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) pmd_t entry; vm_fault_t ret = VM_FAULT_FALLBACK; - if (!transhuge_vma_suitable(vma, haddr)) + if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER))) return ret; page = compound_head(page); @@ -5055,7 +5055,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, return VM_FAULT_OOM; retry_pud: if (pud_none(*vmf.pud) && - hugepage_vma_check(vma, vm_flags, false, true, true)) { + hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PUD_ORDER))) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -5089,7 +5089,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, goto retry_pud; if (pmd_none(*vmf.pmd) && - hugepage_vma_check(vma, vm_flags, false, true, true)) { + hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PMD_ORDER))) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index e0b368e545ed..5f7e89c5b595 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) * cleared *pmd but not decremented compound_mapcount(). 
*/ if ((pvmw->flags & PVMW_SYNC) && - transhuge_vma_suitable(vma, pvmw->address) && + transhuge_vma_suitable(vma, pvmw->address, + BIT(PMD_ORDER)) && (pvmw->nr_pages >= HPAGE_PMD_NR)) { spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); From patchwork Fri Sep 29 11:44:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146505 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3963064vqu; Fri, 29 Sep 2023 04:48:45 -0700 (PDT) X-Google-Smtp-Source: AGHT+IESeQwgdgwiQdnkmHiKM2bKdzsup4rcC8++ESbnrisyfX8TiEhWrndvuF6COFyXs+/79lfQ X-Received: by 2002:a17:90a:d810:b0:273:f887:be17 with SMTP id a16-20020a17090ad81000b00273f887be17mr3826657pjv.47.1695988124763; Fri, 29 Sep 2023 04:48:44 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988124; cv=none; d=google.com; s=arc-20160816; b=emQo7jkpfMC3MT5JVIXNagscrQIvbX4xlOjyWUTmqnhN4gzws0L2ErN884lcd/m99c Pn80sKMG0H6XrSVVEEEp/U3iHO2wjjo4OSvUrC10wwpGNIlWruZOujxHm1/QLqKPbHxR NADVjJo1eKHhMcp6in+c1+Yo4W8Sy3oZl7SH+jB5YPPbfmr4XIN5G9Ciro60AKP759Co xtV9rWiEvz+gQxUyCcMFr4Jz+MIrTOb6sU5i4t6JRhM90BZA5kYHdDyXRUCj2eK6wCuT 0cN3jlYIpj/48lt2gzTbTXLtBmFHr9j1sA5ApxuvIgD3+oj7wjxOxhnSndQEoUaM8seQ lGZg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=2gExMYv36b1IpQ9pVpTbz0yc/Ix+8JpSzZniXtjIQmc=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=Z4A2qRRfFFVc7xWVI3XwrqFYTdRa7MB7cDbpibJQPC+uOLt0mt30Ld1257P1bv5+LJ Xg/BnuyURM10FxKq4SKj+DOJjbykD0CLOUuLRymZALyXFfwyaGlzjk4w7AahogaJZl6Q KududaV4tpcashjtAK8xzxJjaagrQtuRE1CKwP7+W3NMa0dqiJc85RUdJc85o0udRyIh pnilM7KlnRUGtBldW1+MZItP+kp3DZyP32+/BUG4A25gDmXK0VXVRxMQjPY8rbBR872p kYGyF0Kge0M2MHE/7bwzf6OjtOXhl86xKvNCbJVTAWY8MOGfa5URpkhyBVL/wvkHk2LI rYOA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from howler.vger.email (howler.vger.email. 
From: Ryan Roberts
Subject: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
Date: Fri, 29 Sep 2023 12:44:16 +0100
Message-Id: <20230929114421.3761121-6-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Introduce the logic to allow THP to be configured (through the new anon_orders interface we just added) to allocate large folios to back anonymous memory, which are smaller than PMD-size (for example order-2, order-3, order-4, etc). These THPs continue to be PTE-mapped, but in many cases can still provide similar benefits to traditional PMD-sized THP: Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, etc.
depending on the configured order), but latency spikes are much less prominent because the size of each page isn't as huge as the PMD-sized variant and there is less memory to clear in each page fault. The number of per-page operations (e.g. ref counting, rmap management, lru list management) are also significantly reduced since those ops now become per-folio. Some architectures also employ TLB compression mechanisms to squeeze more entries in when a set of PTEs are virtually and physically contiguous and approporiately aligned. In this case, TLB misses will occur less often. The new behaviour is disabled by default because the anon_orders defaults to only enabling PMD-order, but can be enabled at runtime by writing to anon_orders (see documentation in previous commit). The long term aim is to default anon_orders to include suitable lower orders, but there are some risks around internal fragmentation that need to be better understood first. Signed-off-by: Ryan Roberts --- Documentation/admin-guide/mm/transhuge.rst | 9 +- include/linux/huge_mm.h | 6 +- mm/memory.c | 108 +++++++++++++++++++-- 3 files changed, 111 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 9f954e73a4ca..732c3b2f4ba8 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap fields for each mapping. Note that in both cases, AnonHugePages refers only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped -using PTEs. +using PTEs. This includes all THPs whose order is smaller than +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped +for other reasons. The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. @@ -367,6 +369,11 @@ frequently will incur overhead. There are a number of counters in ``/proc/vmstat`` that may be used to monitor how successfully the system is providing huge pages for use. +.. note:: + Currently the below counters only record events relating to + PMD-order THPs. Events relating to smaller order THPs are not + included. + thp_fault_alloc is incremented every time a huge page is successfully allocated to handle a page fault. diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2e7c338229a6..c4860476a1f5 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_NR (1<pte + i))) + return true; + } + + return false; +} + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static struct folio *alloc_anon_folio(struct vm_fault *vmf) +{ + gfp_t gfp; + pte_t *pte; + unsigned long addr; + struct folio *folio; + struct vm_area_struct *vma = vmf->vma; + unsigned int orders; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (userfaultfd_armed(vma)) + goto fallback; + + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * for this vma. Then filter out the orders that can't be allocated over + * the faulting address and still be fully contained in the vma. 
+ */ + orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true, + BIT(PMD_ORDER) - 1); + orders = transhuge_vma_suitable(vma, vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); + if (!pte) + return ERR_PTR(-EAGAIN); + + order = first_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + vmf->pte = pte + pte_index(addr); + if (!vmf_pte_range_changed(vmf, 1 << order)) + break; + order = next_order(&orders, order); + } + + vmf->pte = NULL; + pte_unmap(pte); + + gfp = vma_thp_gfp_mask(vma); + + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) { + clear_huge_page(&folio->page, addr, 1 << order); + return folio; + } + order = next_order(&orders, order); + } + +fallback: + return vma_alloc_zeroed_movable_folio(vma, vmf->address); +} +#else +#define alloc_anon_folio(vmf) \ + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address) +#endif + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + int i; + int nr_pages = 1; + unsigned long addr = vmf->address; bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Allocate our own private page. */ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vmf); + if (IS_ERR(folio)) + return 0; if (!folio) goto oom; + nr_pages = folio_nr_pages(folio); + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry), vma); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) goto release; - if (vmf_pte_changed(vmf)) { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (vmf_pte_range_changed(vmf, nr_pages)) { + for (i = 0; i < nr_pages; i++) + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i); goto release; } @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); - folio_add_new_anon_rmap(folio, vma, vmf->address); + folio_ref_add(folio, nr_pages - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); + folio_add_new_anon_rmap(folio, vma, addr); folio_add_lru_vma(folio, vma); setpte: if (uffd_wp) entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages); /* No need to invalidate - it was non-present before */ - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages); unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); From patchwork Fri Sep 29 11:44:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan 
Roberts
X-Patchwork-Id: 146506
From: Ryan Roberts
Subject: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders
Date: Fri, 29 Sep 2023 12:44:17 +0100
Message-Id: <20230929114421.3761121-7-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

In addition to passing a bitfield of folio orders to enable for THP, allow the string "recommend" to be written, which has the effect of causing the system to enable the orders preferred by the architecture and by the mm. The user can see what these orders are by subsequently reading back the file.
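As a minimal user-space sketch of that flow (not part of this patch; it assumes only the sysfs path documented below and the "0x%08x" read format from the earlier anon_orders_show() hunk, and it needs enough privilege to write the file):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/mm/transparent_hugepage/anon_orders";
	char buf[32] = { 0 };
	int fd;

	/* Ask the kernel to enable its recommended set of anon THP orders. */
	fd = open(path, O_WRONLY);
	if (fd < 0 || write(fd, "recommend", strlen("recommend")) < 0)
		return 1;
	close(fd);

	/* Read back the bitfield of orders that were actually enabled. */
	fd = open(path, O_RDONLY);
	if (fd < 0 || read(fd, buf, sizeof(buf) - 1) < 0)
		return 1;
	close(fd);

	printf("anon THP orders enabled: %s", buf); /* e.g. "0x00000208" with 4K pages */
	return 0;
}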
Note that these recommended orders are expected to be static for a given boot of the system, and so the keyword "auto" was deliberately not used, as I want to reserve it for a possible future use where the "best" order is chosen more dynamically at runtime.

Recommended orders are determined as follows:
- PMD_ORDER: The traditional THP size
- arch_wants_pte_order() if implemented by the arch
- PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list

arch_wants_pte_order() can be overridden by the architecture if desired. Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous set of ptes map physically contiguous, naturally aligned memory, so this mechanism allows the architecture to optimize as required. Here we add the default implementation of arch_wants_pte_order(), used when the architecture does not define it, which returns -1, implying that the HW has no preference. Signed-off-by: Ryan Roberts --- Documentation/admin-guide/mm/transhuge.rst | 4 ++++ include/linux/pgtable.h | 13 +++++++++++++ mm/huge_memory.c | 14 +++++++++++--- 3 files changed, 28 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 732c3b2f4ba8..d6363d4efa3a 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9 By enabling multiple orders, allocation of each order will be attempted, highest to lowest, until a successful allocation is made. If the PMD-order is unset, then no PMD-sized THPs will be allocated. +It is also possible to enable the recommended set of orders, which +will be optimized for the architecture and mm:: + + echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders The kernel will ignore any orders that it does not support so read the file back to determine which orders are enabled:: diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index af7639c3b0a3..0e110ce57cc3 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma, } #endif +#ifndef arch_wants_pte_order +/* + * Returns preferred folio order for pte-mapped memory. Must be in range [0, + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at + * least order-2. Negative value implies that the HW has no preference and mm + * will choose its own default order.
+ */ +static inline int arch_wants_pte_order(void) +{ + return -1; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bcecce769017..e2e2d3906a21 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj, int err; int ret = count; unsigned int orders; + int arch; - err = kstrtouint(buf, 0, &orders); - if (err) - ret = -EINVAL; + if (sysfs_streq(buf, "recommend")) { + arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); + orders = BIT(arch); + orders |= BIT(PAGE_ALLOC_COSTLY_ORDER); + orders |= BIT(PMD_ORDER); + } else { + err = kstrtouint(buf, 0, &orders); + if (err) + ret = -EINVAL; + } if (ret > 0) { orders &= THP_ORDERS_ALL_ANON; From patchwork Fri Sep 29 11:44:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146504 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3962520vqu; Fri, 29 Sep 2023 04:47:45 -0700 (PDT) X-Google-Smtp-Source: AGHT+IH8AfW7YNmLexx20T/qZl1SIXSUKvCAWdeniODKmO4QnqsGZUFSaGWgVQ64EOfxcCVqPMJy X-Received: by 2002:a05:6a21:193:b0:15e:22a4:b897 with SMTP id le19-20020a056a21019300b0015e22a4b897mr7485433pzb.10.1695988065399; Fri, 29 Sep 2023 04:47:45 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988065; cv=none; d=google.com; s=arc-20160816; b=CXZNz2q/1zkhZ5IutiM5vbXdmhvQsXBWdYwokvm6RRQWhZVaLkJ7VA13zr045+90se YdY0kwdhc9MFmT5OoeSQATH6j4ujWh7TzJeT4u8epUL0p9ixA3hMFob1nmMUQZs84lei XlSt+xfBCP1xZWu8/G4Ps+0W2CaR6vkX8zy+J6qbbabeAP/umOorteTJFRTunxpoDjCv fC2VCr2JgSHQWCSo1AzBlkeG6VyAw9qCQFEvLuDtoG+MUMt09bAo+MhWUOeifVQE66jW Hk6b5lgGqDs1UQ4JFyc1vMYd3RUp2xtUTmGjOzfooOotGANgfVPNdEq1uKxkGnnnFg9g 8HOA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=R12DE0IaGz6VExNqcRsTIxpYiYIG/gDiSa9v3a/m4ts=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=GPZXGG3L75b8vkTF8VhOTRJfakAk6JYBv4NXYX33mSmNnWZebOFuBA5U+JN1YmZRKM H5R2YOYjVgvaF5abtHrWqHHn2XjBQzzn5RmUsuGASkwgAjV/fZe3X/WptnDl3iejc75a gIsj98mYNj9E08sOA1m05j90f7hTfRXW/wIxH+A/0H3E215Zym9ruBoE+/hlL6MiSETH 5StcT9l7+WQsBg8XT37dGcvu8cr9559vk6yFgMXzmR+dzilpgEVruNaIh6fnVx/RqEdy XugnIPOwzKZkx+kLgtXeuX7jLONd17n4N6zzsmjpwLaUYx32kNhCdNyrkCQ9yrTzRNNE n5jw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from pete.vger.email (pete.vger.email. 
From: Ryan Roberts
Subject: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order()
Date: Fri, 29 Sep 2023 12:44:18 +0100
Message-Id: <20230929114421.3761121-8-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Define an arch-specific override of arch_wants_pte_order() so that when anon_orders=recommend is set, large folios will be allocated for anonymous memory with an order that is compatible with arm64's HPA uarch feature.
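As a rough worked example (assuming a 4K base page size, where PMD_ORDER is 9 and PAGE_ALLOC_COSTLY_ORDER is 3): with this override returning 2, writing "recommend" evaluates to BIT(max(2, 3)) | BIT(3) | BIT(9) = 0x208, so reading anon_orders back would show order-3 (32K) and order-9 (2M) enabled for anonymous THP.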
Reviewed-by: Yu Zhao Signed-off-by: Ryan Roberts Acked-by: Catalin Marinas --- arch/arm64/include/asm/pgtable.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7f7d9b1df4e5..e3d2449dec5c 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma, extern void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t new_pte); + +#define arch_wants_pte_order arch_wants_pte_order +static inline int arch_wants_pte_order(void) +{ + /* + * Many arm64 CPUs support hardware page aggregation (HPA), which can + * coalesce 4 contiguous pages into a single TLB entry. + */ + return 2; +} #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ From patchwork Fri Sep 29 11:44:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146509 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3966633vqu; Fri, 29 Sep 2023 04:55:52 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEzxNXTjbTDeLw9BWSVOgaAvOOOC/WqDbIbcwkCMUJKM+xonlPFTmb+3Hq7mKKRDn3FUFVp X-Received: by 2002:a17:90a:f2d5:b0:26b:280b:d24c with SMTP id gt21-20020a17090af2d500b0026b280bd24cmr3459556pjb.42.1695988552406; Fri, 29 Sep 2023 04:55:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988552; cv=none; d=google.com; s=arc-20160816; b=cxUUgp8ACyzyDNoW6Rv6ntYKaOPnPRQtd1Uh7P3m39Yv2qRBk/nmKOXl+TL1Yx1bPt 0/JMSte/dcWa4KiFIWSpGMHf0P2PIQPdlhdWtdhwpQzeFQsZlCJFezSi5VDGiDzeQO5B C5RQvUYv9IRQB/6iDU13aOY0U0IJxFkaduhWbYNJSHF+Dur4FipfnjGILhKa2Vg4uuSW BkYyJpJpEqUXkmILG97tckbBUYelUuCebCjOA2gCJu8TxG8XOipUQCLzOzg3ySd7JDxI EoP3hJLQwhq6faF4+An6j8hw1+gLs3c4ESJgqLF4XF73WTA3b8yEN0QR/KWixqVgrs+W 8LqQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=quTNzgeRKuuO6Qb2fbhO+s0uzcvHJvNtQAmGH7Z0kG8=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=jnKMFLS8ELfvw3qShHFPkoJ8uR8Nx+prHvoVKv1CK4Frv4Ornehow5XWkulKNLKKKE 4I7AgezwNjO+2xoHU4SAnJkYpcBB1mxV4g0DQy6oFkRWUuSAkHSfEb2/KB/5TZU1NRXs R9Ygn/kzVzM7tHGP3DjMeXhkxf4v1d3wOVD9TbHlfiFfNYF237618/dO+BvNa4kXNKTl eDYczSdt0VJIAxR79u7dYYXUe9uJM/SHx2nShqhkIJuSgwajUhwQ9JkSzZrfQjlOoM8s km2wD4DwZMALrT9Cvk/lk24rTpVqhvG94Nw4o5cSO0NL4GpHNQzlU8cz1oFVDQ8z9Fiw 5RHg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from snail.vger.email (snail.vger.email. 
From: Ryan Roberts
Subject: [PATCH v6 8/9] selftests/mm/cow: Generalize do_run_with_thp() helper
Date: Fri, 29 Sep 2023 12:44:19 +0100
Message-Id: <20230929114421.3761121-9-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

do_run_with_thp() prepares (PMD-sized) THP memory into different states before running tests. With the introduction of THP orders that are smaller than PMD_ORDER, we would like to reuse this logic to also test those smaller orders. So let's add a size parameter which tells the function what size THP it should operate on. No functional change intended here, but a separate commit will add new tests for smaller order THP, where available.
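As an illustration of the new parameter (a hypothetical wrapper, not part of this patch; do_run_with_thp(), test_fn and the THP_RUN_* states are the existing cow.c helpers shown in the hunks below, and pmdsize/ptesize are the globals used by this and the following patch):

/*
 * Hypothetical illustration only: drive the same test body at the
 * traditional PMD size and, if a smaller THP size has been configured,
 * at that smaller (necessarily PTE-mapped) size too.
 */
static void run_at_both_sizes(test_fn fn, const char *desc)
{
	ksft_print_msg("[RUN] %s\n", desc);
	do_run_with_thp(fn, THP_RUN_PMD, pmdsize);
	if (ptesize)
		do_run_with_thp(fn, THP_RUN_PTE, ptesize);
}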
Signed-off-by: Ryan Roberts --- tools/testing/selftests/mm/cow.c | 151 +++++++++++++++++-------------- 1 file changed, 84 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 7324ce5363c0..d887ce454e34 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -32,7 +32,7 @@ static size_t pagesize; static int pagemap_fd; -static size_t thpsize; +static size_t pmdsize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; @@ -734,14 +734,14 @@ enum thp_run { THP_RUN_PARTIAL_SHARED, }; -static void do_run_with_thp(test_fn fn, enum thp_run thp_run) +static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t size) { char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED; - size_t size, mmap_size, mremap_size; + size_t mmap_size, mremap_size; int ret; - /* For alignment purposes, we need twice the thp size. */ - mmap_size = 2 * thpsize; + /* For alignment purposes, we need twice the requested size. */ + mmap_size = 2 * size; mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mmap_mem == MAP_FAILED) { @@ -749,36 +749,40 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) return; } - /* We need a THP-aligned memory area. */ - mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); + /* We need to naturally align the memory area. */ + mem = (char *)(((uintptr_t)mmap_mem + size) & ~(size - 1)); - ret = madvise(mem, thpsize, MADV_HUGEPAGE); + ret = madvise(mem, size, MADV_HUGEPAGE); if (ret) { ksft_test_result_fail("MADV_HUGEPAGE failed\n"); goto munmap; } /* - * Try to populate a THP. Touch the first sub-page and test if we get - * another sub-page populated automatically. + * Try to populate a THP. Touch the first sub-page and test if + * we get the last sub-page populated automatically. */ mem[0] = 0; - if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) { + if (!pagemap_is_populated(pagemap_fd, mem + size - pagesize)) { ksft_test_result_skip("Did not get a THP populated\n"); goto munmap; } - memset(mem, 0, thpsize); + memset(mem, 0, size); - size = thpsize; switch (thp_run) { case THP_RUN_PMD: case THP_RUN_PMD_SWAPOUT: + if (size != pmdsize) { + ksft_test_result_fail("test bug: can't PMD-map size\n"); + goto munmap; + } break; case THP_RUN_PTE: case THP_RUN_PTE_SWAPOUT: /* * Trigger PTE-mapping the THP by temporarily mapping a single - * subpage R/O. + * subpage R/O. This is a noop if the THP is not pmdsize (and + * therefore already PTE-mapped). */ ret = mprotect(mem + pagesize, pagesize, PROT_READ); if (ret) { @@ -797,7 +801,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * Discard all but a single subpage of that PTE-mapped THP. What * remains is a single PTE mapping a single subpage. */ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTNEED); + ret = madvise(mem + pagesize, size - pagesize, MADV_DONTNEED); if (ret) { ksft_test_result_fail("MADV_DONTNEED failed\n"); goto munmap; @@ -809,7 +813,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * Remap half of the THP. We need some new memory location * for that. */ - mremap_size = thpsize / 2; + mremap_size = size / 2; mremap_mem = mmap(NULL, mremap_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mem == MAP_FAILED) { @@ -830,7 +834,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * child. This will result in some parts of the THP never * have been shared. 
*/ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTFORK); + ret = madvise(mem + pagesize, size - pagesize, MADV_DONTFORK); if (ret) { ksft_test_result_fail("MADV_DONTFORK failed\n"); goto munmap; @@ -844,7 +848,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) } wait(&ret); /* Allow for sharing all pages again. */ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DOFORK); + ret = madvise(mem + pagesize, size - pagesize, MADV_DOFORK); if (ret) { ksft_test_result_fail("MADV_DOFORK failed\n"); goto munmap; @@ -875,52 +879,65 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) munmap(mremap_mem, mremap_size); } -static void run_with_thp(test_fn fn, const char *desc) +static int sz2ord(size_t size) +{ + return __builtin_ctzll(size / pagesize); +} + +static void run_with_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with THP\n", desc); - do_run_with_thp(fn, THP_RUN_PMD); + ksft_print_msg("[RUN] %s ... with order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PMD, size); } -static void run_with_thp_swap(test_fn fn, const char *desc) +static void run_with_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc); - do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT); + ksft_print_msg("[RUN] %s ... with swapped-out order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT, size); } -static void run_with_pte_mapped_thp(test_fn fn, const char *desc) +static void run_with_pte_mapped_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc); - do_run_with_thp(fn, THP_RUN_PTE); + ksft_print_msg("[RUN] %s ... with PTE-mapped order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PTE, size); } -static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc) +static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc); - do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT); + ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT, size); } -static void run_with_single_pte_of_thp(test_fn fn, const char *desc) +static void run_with_single_pte_of_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc); - do_run_with_thp(fn, THP_RUN_SINGLE_PTE); + ksft_print_msg("[RUN] %s ... with single PTE of order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_SINGLE_PTE, size); } -static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc) +static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc); - do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT); + ksft_print_msg("[RUN] %s ... with single PTE of swapped-out order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT, size); } -static void run_with_partial_mremap_thp(test_fn fn, const char *desc) +static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc); - do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP); + ksft_print_msg("[RUN] %s ... 
with partially mremap()'ed order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size); } -static void run_with_partial_shared_thp(test_fn fn, const char *desc) +static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc); - do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED); + ksft_print_msg("[RUN] %s ... with partially shared order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED, size); } static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize) @@ -1091,15 +1108,15 @@ static void run_anon_test_case(struct test_case const *test_case) run_with_base_page(test_case->fn, test_case->desc); run_with_base_page_swap(test_case->fn, test_case->desc); - if (thpsize) { - run_with_thp(test_case->fn, test_case->desc); - run_with_thp_swap(test_case->fn, test_case->desc); - run_with_pte_mapped_thp(test_case->fn, test_case->desc); - run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc); - run_with_single_pte_of_thp(test_case->fn, test_case->desc); - run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc); - run_with_partial_mremap_thp(test_case->fn, test_case->desc); - run_with_partial_shared_thp(test_case->fn, test_case->desc); + if (pmdsize) { + run_with_thp(test_case->fn, test_case->desc, pmdsize); + run_with_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize); + run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize); + run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize); + run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize); } for (i = 0; i < nr_hugetlbsizes; i++) run_with_hugetlb(test_case->fn, test_case->desc, @@ -1120,7 +1137,7 @@ static int tests_per_anon_test_case(void) { int tests = 2 + nr_hugetlbsizes; - if (thpsize) + if (pmdsize) tests += 8; return tests; } @@ -1329,7 +1346,7 @@ static void run_anon_thp_test_cases(void) { int i; - if (!thpsize) + if (!pmdsize) return; ksft_print_msg("[INFO] Anonymous THP tests\n"); @@ -1338,13 +1355,13 @@ static void run_anon_thp_test_cases(void) struct test_case const *test_case = &anon_thp_test_cases[i]; ksft_print_msg("[RUN] %s\n", test_case->desc); - do_run_with_thp(test_case->fn, THP_RUN_PMD); + do_run_with_thp(test_case->fn, THP_RUN_PMD, pmdsize); } } static int tests_per_anon_thp_test_case(void) { - return thpsize ? 1 : 0; + return pmdsize ? 1 : 0; } typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size); @@ -1419,7 +1436,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) } /* For alignment purposes, we need twice the thp size. */ - mmap_size = 2 * thpsize; + mmap_size = 2 * pmdsize; mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mmap_mem == MAP_FAILED) { @@ -1434,11 +1451,11 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) } /* We need a THP-aligned memory area. 
*/ - mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); - smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1)); + mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1)); + smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1)); - ret = madvise(mem, thpsize, MADV_HUGEPAGE); - ret |= madvise(smem, thpsize, MADV_HUGEPAGE); + ret = madvise(mem, pmdsize, MADV_HUGEPAGE); + ret |= madvise(smem, pmdsize, MADV_HUGEPAGE); if (ret) { ksft_test_result_fail("MADV_HUGEPAGE failed\n"); goto munmap; @@ -1457,7 +1474,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) goto munmap; } - fn(mem, smem, thpsize); + fn(mem, smem, pmdsize); munmap: munmap(mmap_mem, mmap_size); if (mmap_smem != MAP_FAILED) @@ -1650,7 +1667,7 @@ static void run_non_anon_test_case(struct non_anon_test_case const *test_case) run_with_zeropage(test_case->fn, test_case->desc); run_with_memfd(test_case->fn, test_case->desc); run_with_tmpfile(test_case->fn, test_case->desc); - if (thpsize) + if (pmdsize) run_with_huge_zeropage(test_case->fn, test_case->desc); for (i = 0; i < nr_hugetlbsizes; i++) run_with_memfd_hugetlb(test_case->fn, test_case->desc, @@ -1671,7 +1688,7 @@ static int tests_per_non_anon_test_case(void) { int tests = 3 + nr_hugetlbsizes; - if (thpsize) + if (pmdsize) tests += 1; return tests; } @@ -1681,10 +1698,10 @@ int main(int argc, char **argv) int err; pagesize = getpagesize(); - thpsize = read_pmd_pagesize(); - if (thpsize) - ksft_print_msg("[INFO] detected THP size: %zu KiB\n", - thpsize / 1024); + pmdsize = read_pmd_pagesize(); + if (pmdsize) + ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n", + pmdsize / 1024); nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); detect_huge_zeropage(); From patchwork Fri Sep 29 11:44:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 146510 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:cae8:0:b0:403:3b70:6f57 with SMTP id r8csp3966758vqu; Fri, 29 Sep 2023 04:56:08 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGcVvHAeUy7G6nV1ZPSwLevl97VfHbUyu+vs74FAkH8MOhq9yXBWbStDQT/pS881VH4/dHp X-Received: by 2002:a05:6a00:2303:b0:68f:b3ed:7d4d with SMTP id h3-20020a056a00230300b0068fb3ed7d4dmr4033504pfh.15.1695988568327; Fri, 29 Sep 2023 04:56:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695988568; cv=none; d=google.com; s=arc-20160816; b=zSxoPMhvX7XQY3ZhlDwJ8BPaCkuJyz2AvPdjgdvsp0gN1UOSJaQKtv8DOOjfqMnmB6 vRF+PGif2s5OfVo9TwhBtDEUnNd+UFmFclS4Rd7sWVv8NFIJ0pBFmVcAi3gCQ5hXiJKD 3mGE6g5AAaUTyZ7tLW+4hR7A11N7aA4naxptMfVrwg0mzVnexVWUROyRAW9EECbD3EDz EPPPL9jWED7u7tqYhm+DxkHFeG4u80DhZFEwl8c+g6d14UN6LnATy94j6wBMOPR6sZBm 0/maXWVaNNy23I+HypcZYj/bhsHxCnpLhj+SG168nBV3AGgUQXHdO5PsQYyUnNI7HKwZ dBTQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=xAX3/CyH7Dx1azBnB2aEK4gEWTkd0D8Qfh6cCpOBfVo=; fh=smwoDWwCmhzJYttwqG7Q1aXJ58+o1gEThPYfICPOx+Y=; b=dIsGsAOHKCXcCG9aspIJFgsTc6kCFPl8Y8WXTwFSyMLwWxhdthB3F20K3cY1GRMQN+ rjnviurQ0x8oMHg5a/6R2glF41ofdoYjLgqWgbFhxrhl6c+zCipnjc21vFqY9Ddp8q6+ JKtfV4mzz+Jh8yuINoHNk2lEKkRwru9dzhyNvVGmn5ZsxEplrKLddLbyRSMyLw4AD+jI 9ErC8K9+pu2k3BVnFBhv3dC1J28B3766jMNxY5nlVmOMg7NzoUQAJniE/bXekejFk89A UCj24TOff92Y5qNMxor0H7tsM+bwurTjCPVtlHRKgBd/dniKpQexrLQKSJluPAvgiiyt 1XWQ== 
From: Ryan Roberts
Subject: [PATCH v6 9/9] selftests/mm/cow: Add tests for small-order anon THP
Date: Fri, 29 Sep 2023 12:44:20 +0100
Message-Id: <20230929114421.3761121-10-ryan.roberts@arm.com>
In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>

Add tests similar to the existing THP tests, but which operate on memory backed by smaller-order, PTE-mapped THP.
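For example (assuming a 4K base page size and a 2M PMD-sized THP), the harness below enables PMD_ORDER and PMD_ORDER - 1 via anon_orders, so these new tests exercise order-8 (1M) folios, which are always PTE-mapped.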
This reuses all the existing infrastructure. If the test suite detects that small-order THP is not supported by the kernel, the new tests are skipped. Signed-off-by: Ryan Roberts --- tools/testing/selftests/mm/cow.c | 93 ++++++++++++++++++++++++++++++++ 1 file changed, 93 insertions(+) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index d887ce454e34..6c5e37d8bb69 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -33,10 +33,13 @@ static size_t pagesize; static int pagemap_fd; static size_t pmdsize; +static size_t ptesize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; static bool has_huge_zeropage; +static unsigned int orig_anon_orders; +static bool orig_anon_orders_valid; static void detect_huge_zeropage(void) { @@ -1118,6 +1121,14 @@ static void run_anon_test_case(struct test_case const *test_case) run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize); run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize); } + if (ptesize) { + run_with_pte_mapped_thp(test_case->fn, test_case->desc, ptesize); + run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, ptesize); + run_with_single_pte_of_thp(test_case->fn, test_case->desc, ptesize); + run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, ptesize); + run_with_partial_mremap_thp(test_case->fn, test_case->desc, ptesize); + run_with_partial_shared_thp(test_case->fn, test_case->desc, ptesize); + } for (i = 0; i < nr_hugetlbsizes; i++) run_with_hugetlb(test_case->fn, test_case->desc, hugetlbsizes[i]); @@ -1139,6 +1150,8 @@ static int tests_per_anon_test_case(void) if (pmdsize) tests += 8; + if (ptesize) + tests += 6; return tests; } @@ -1693,6 +1706,80 @@ static int tests_per_non_anon_test_case(void) return tests; } +#define ANON_ORDERS_FILE "/sys/kernel/mm/transparent_hugepage/anon_orders" + +static int read_anon_orders(unsigned int *orders) +{ + ssize_t buflen = 80; + char buf[buflen]; + int fd; + + fd = open(ANON_ORDERS_FILE, O_RDONLY); + if (fd == -1) + return -1; + + buflen = read(fd, buf, buflen); + close(fd); + + if (buflen < 1) + return -1; + + *orders = strtoul(buf, NULL, 16); + + return 0; +} + +static int write_anon_orders(unsigned int orders) +{ + ssize_t buflen = 80; + char buf[buflen]; + int fd; + + fd = open(ANON_ORDERS_FILE, O_WRONLY); + if (fd == -1) + return -1; + + buflen = snprintf(buf, buflen, "0x%08x\n", orders); + buflen = write(fd, buf, buflen); + close(fd); + + if (buflen < 1) + return -1; + + return 0; +} + +static size_t save_thp_anon_orders(void) +{ + /* + * If the kernel supports multiple orders for anon THP (indicated by the + * presence of anon_orders file), configure it for the PMD-order and the + * PMD-order - 1, which we will report back and use as the PTE-order THP + * size. Save the original value so that it can be restored on exit. If + * the kernel does not support multiple orders, report back 0 for the + * PTE-size so those tests are skipped. 
+ */ + + int pteorder = sz2ord(pmdsize) - 1; + unsigned int orders = (1UL << sz2ord(pmdsize)) | (1UL << pteorder); + + if (read_anon_orders(&orig_anon_orders)) + return 0; + + orig_anon_orders_valid = true; + + if (write_anon_orders(orders)) + return 0; + + return pagesize << pteorder; +} + +static void restore_thp_anon_orders(void) +{ + if (orig_anon_orders_valid) + write_anon_orders(orig_anon_orders); +} + int main(int argc, char **argv) { int err; @@ -1702,6 +1789,10 @@ int main(int argc, char **argv) if (pmdsize) ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n", pmdsize / 1024); + ptesize = save_thp_anon_orders(); + if (ptesize) + ksft_print_msg("[INFO] configured PTE-mapped THP size: %zu KiB\n", + ptesize / 1024); nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); detect_huge_zeropage(); @@ -1720,6 +1811,8 @@ int main(int argc, char **argv) run_anon_thp_test_cases(); run_non_anon_test_cases(); + restore_thp_anon_orders(); + err = ksft_get_fail_cnt(); if (err) ksft_exit_fail_msg("%d out of %d tests failed\n",