From patchwork Fri Nov 24 13:26:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169418 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191191vqx; Fri, 24 Nov 2023 05:28:51 -0800 (PST) X-Google-Smtp-Source: AGHT+IHBBLQ+aHQXos17CeXspEwNUTLHw3nyvS26EggIuFS63+UFxmHP8zWd3M/jy+9157nvfv9v X-Received: by 2002:a17:90b:164b:b0:285:8a70:b56b with SMTP id il11-20020a17090b164b00b002858a70b56bmr1944480pjb.37.1700832531082; Fri, 24 Nov 2023 05:28:51 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832531; cv=none; d=google.com; s=arc-20160816; b=xdUESU6uOJbbf30pq5hvGqBZwnlGn/W7ZzND7YIdG7bt0rNxKF82RlUtiesrDg2pkL ee8Ixa5fc0Gdm25qLlst1/7YxadnSaWhcCt2WraPuRsmlaZMQwKjgqV7iLGqEOp+xmVq ovlUlv+9U31/RXOwIRKFHDez9H7E3zayd9sRYg4JBfTtoZGHXT2legM+Rv3UUq1LeUaR Fd9IgUZF+l7kuTQrpLd0lPrZQbnzH3vr08VrBhuPaOH4ZzMQDR/fVUwr3z2vaVnbLuEG GWUeFmLBuis8b0tWOM+95mcrKLxr4yOYIoHck4xyUjnAJ35bSQkFFAgQVOGd+PYt1q58 Fa1g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=2NlC2FNJ1Esba7mAbHYEoZIvTqXmerwYmRl67CA2zqk=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=wTYcKonWNrKmXD0Wg6/dYfnTts0S2m/YZPcpc4VIRt8tdYaLwJDGug2kMzgKHPcsGS 1S7XPrTB7cmGXnhBBwwfDg0ovqqnVW+/NfJfBf1OKP8RKDzEcRfNw0Gnqb9oPvfHn+qW Iddgw/4A3TUigibvQ/E+/jElzSzW9WAxTDiu4+uLqXsem4GbubFUFCRsgiB9kWvQX1uZ 9z05c+8JiVEd9dhoixCp7DapiSF0md7xBJsHS0c6dhV44NangNGhKv4IYIMeLhN5xwcR 46GX4aBRcL8iseISplHzUGKgLcewEIUVTC8LJZ6qJJTJiCkJkYBQGRDN20dA3xsA3epH Ywhw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=huSZIS7Z; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[2620:137:e000::3:7]) by mx.google.com with ESMTPS id e2-20020a170902744200b001c3411c9b83si3213964plt.454.2023.11.24.05.28.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:28:51 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=huSZIS7Z; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 5356E80A8B5F; Fri, 24 Nov 2023 05:26:55 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231233AbjKXN0j (ORCPT + 99 others); Fri, 24 Nov 2023 08:26:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46684 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231177AbjKXN0f (ORCPT ); Fri, 24 Nov 2023 08:26:35 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A2B1810C6 for ; Fri, 24 Nov 2023 05:26:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832400; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=2NlC2FNJ1Esba7mAbHYEoZIvTqXmerwYmRl67CA2zqk=; b=huSZIS7ZCZkJGKeKPRlHnAkMuo0OBY7JGZP8+/BCn5J+kOXfC8lbohmLRLnKrmKgA4EZ+/ zDgLByYhMR+rTvLoZXxVtaR0MAdWmZaXLgMvZDRISqJPtOYUoimusBtF4sBq9huyh2Ky6V KL5Di6fvd7+5QHda4J3e4hx32PKVACw= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-529-4jiaO-6aNt2opR94wzdfSg-1; Fri, 24 Nov 2023 08:26:35 -0500 X-MC-Unique: 4jiaO-6aNt2opR94wzdfSg-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 1C5F185A58A; Fri, 24 Nov 2023 13:26:35 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 9B2C32166B2A; Fri, 24 Nov 2023 13:26:31 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 01/20] mm/rmap: factor out adding folio range into __folio_add_rmap_range() Date: Fri, 24 Nov 2023 14:26:06 +0100 Message-ID: <20231124132626.235350-2-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:26:57 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452172491865453 X-GMAIL-MSGID: 1783452172491865453 Let's factor it out, optimize for small folios, and add some more sanity checks. Signed-off-by: David Hildenbrand --- mm/rmap.c | 119 ++++++++++++++++++++++++------------------------------ 1 file changed, 53 insertions(+), 66 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 7a27a2b41802..afddf3d82a8f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1127,6 +1127,54 @@ int folio_total_mapcount(struct folio *folio) return mapcount; } +static unsigned int __folio_add_rmap_range(struct folio *folio, + struct page *page, unsigned int nr_pages, bool compound, + int *nr_pmdmapped) +{ + atomic_t *mapped = &folio->_nr_pages_mapped; + int first, nr = 0; + + VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); + VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); + VM_WARN_ON_FOLIO(compound && nr_pages != folio_nr_pages(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_large(folio) && nr_pages != 1, folio); + + if (likely(!folio_test_large(folio))) + return atomic_inc_and_test(&page->_mapcount); + + /* Is page being mapped by PTE? Is this its first map to be added? */ + if (!compound) { + do { + first = atomic_inc_and_test(&page->_mapcount); + if (first) { + first = atomic_inc_return_relaxed(mapped); + if (first < COMPOUND_MAPPED) + nr++; + } + } while (page++, --nr_pages > 0); + } else if (folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + first = atomic_inc_and_test(&folio->_entire_mapcount); + if (first) { + nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped); + if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of a remove and another add? */ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* Raced ahead of a remove of COMPOUND_MAPPED */ + nr = 0; + } + } + } else { + VM_WARN_ON_ONCE_FOLIO(true, folio); + } + return nr; +} + /** * folio_move_anon_rmap - move a folio to our anon_vma * @folio: The folio to move to our anon_vma @@ -1227,38 +1275,10 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address, rmap_t flags) { struct folio *folio = page_folio(page); - atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + unsigned int nr, nr_pmdmapped = 0; bool compound = flags & RMAP_COMPOUND; - bool first; - - /* Is page being mapped by PTE? Is this its first map to be added? 
*/ - if (likely(!compound)) { - first = atomic_inc_and_test(&page->_mapcount); - nr = first; - if (first && folio_test_large(folio)) { - nr = atomic_inc_return_relaxed(mapped); - nr = (nr < COMPOUND_MAPPED); - } - } else if (folio_test_pmd_mappable(folio)) { - /* That test is redundant: it's for safety or to optimize out */ - - first = atomic_inc_and_test(&folio->_entire_mapcount); - if (first) { - nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped); - if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { - nr_pmdmapped = folio_nr_pages(folio); - nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); - /* Raced ahead of a remove and another add? */ - if (unlikely(nr < 0)) - nr = 0; - } else { - /* Raced ahead of a remove of COMPOUND_MAPPED */ - nr = 0; - } - } - } + nr = __folio_add_rmap_range(folio, page, 1, compound, &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); if (nr) @@ -1349,43 +1369,10 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, unsigned int nr_pages, struct vm_area_struct *vma, bool compound) { - atomic_t *mapped = &folio->_nr_pages_mapped; - unsigned int nr_pmdmapped = 0, first; - int nr = 0; - - VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); - - /* Is page being mapped by PTE? Is this its first map to be added? */ - if (likely(!compound)) { - do { - first = atomic_inc_and_test(&page->_mapcount); - if (first && folio_test_large(folio)) { - first = atomic_inc_return_relaxed(mapped); - first = (first < COMPOUND_MAPPED); - } - - if (first) - nr++; - } while (page++, --nr_pages > 0); - } else if (folio_test_pmd_mappable(folio)) { - /* That test is redundant: it's for safety or to optimize out */ - - first = atomic_inc_and_test(&folio->_entire_mapcount); - if (first) { - nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped); - if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { - nr_pmdmapped = folio_nr_pages(folio); - nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); - /* Raced ahead of a remove and another add? */ - if (unlikely(nr < 0)) - nr = 0; - } else { - /* Raced ahead of a remove of COMPOUND_MAPPED */ - nr = 0; - } - } - } + unsigned int nr, nr_pmdmapped = 0; + nr = __folio_add_rmap_range(folio, page, nr_pages, compound, + &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ? 
NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped); From patchwork Fri Nov 24 13:26:07 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169412 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1190090vqx; Fri, 24 Nov 2023 05:27:18 -0800 (PST) X-Google-Smtp-Source: AGHT+IHTHae9JJ3zlYygxAgGAezxJmkYeqYlUd/5WR/lhtCrpQlhsb5vwxiCC/x4WShucJu26wAJ X-Received: by 2002:a05:6e02:1aab:b0:359:d2ed:15f4 with SMTP id l11-20020a056e021aab00b00359d2ed15f4mr3833882ilv.8.1700832438601; Fri, 24 Nov 2023 05:27:18 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832438; cv=none; d=google.com; s=arc-20160816; b=Fgjx8d1i5YixiogX0GIqfHlhd6jJbmX37ZLeem9Y+5ZgqrPfvOkuOzaVFZj8ES4BSu aczV9xIksSKn1fuJc0uhODzjUQJc4BqWEBZaImD4UBF8pvY36Rs/74uC1VXOh2ssjlFG 55/Wq6pDmLZtm3G77/tyFfDz3AO/8YlciLgehiRjlFVX9LgtaQ2A9xGOsugrEKQYp8kK CkNlvirnfUtzNGl30ibGFryO3G3CGFUg4B39AC36TydOXmWROPbcUDTSNrFr+DmvTPxn oC0ajtTs5vL40vhwSvCrXLGR4cBFBqCwFnELbyPp9J62aO146r2z4oIRV7YZDCCGywbF BBJw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=/66tdUTtIlpBX26NzgkLUQsEQZwgNiRXmWKFnjYfBOk=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=EmDqXQyLJPGSCrN+P86A8kfXWgc1nADpkrQnsAhPILQ6kFKogW/cUn8aqy+/9WPlO1 MENSc6lFKHyRFguOgTrAftooXlSUfC+gvlmXw5l68haO4aU1duPbmDpmU2uoQwVdzy/D VhKRG7CLRNY4rv5AlrLXjetIwtOBRiNJej2CkfyUMQaOljnBw4LR1iOr4AaNsOWa+tta 0s0mTvOnmyllvwk37FYwX/gqZX2T670gM5Bp0spcmxoD4QanNisEEGa3R1ODQ0JSDSqS 6ZsS/iMLYCKwHU8BC4CV4iBSGU3qVGHBhhLC4TFDI1dwAOgU0RhEBUvfCGMG0qr1r02j 3joQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="CrFVJ/yg"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from pete.vger.email (pete.vger.email. 
[23.128.96.36]) by mx.google.com with ESMTPS id s18-20020a635252000000b005c2201d6a55si3443451pgl.39.2023.11.24.05.27.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:27:18 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) client-ip=23.128.96.36; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="CrFVJ/yg"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id 30D1180AE573; Fri, 24 Nov 2023 05:27:10 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345280AbjKXN0m (ORCPT + 99 others); Fri, 24 Nov 2023 08:26:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46694 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229833AbjKXN0i (ORCPT ); Fri, 24 Nov 2023 08:26:38 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DA6141BE for ; Fri, 24 Nov 2023 05:26:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832403; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/66tdUTtIlpBX26NzgkLUQsEQZwgNiRXmWKFnjYfBOk=; b=CrFVJ/ygWBgIybi1/fSnA/wAKPEiWLgD0ZYic3dKzEo7XN3acPK5j4U63r7qSRCpCLuDDJ nkO4oYWIVrbYFGdhp025iQF6a2Eay6m+71puab/MJgGnD6Y63sJ1ovmczu/IfFKQu/oi39 MsNZ0NV1Pjophm8DPa4gTe83gBX68CA= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-74-ojdg3s0mO9myWpWvuZxs4w-1; Fri, 24 Nov 2023 08:26:39 -0500 X-MC-Unique: ojdg3s0mO9myWpWvuZxs4w-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 01090185A781; Fri, 24 Nov 2023 13:26:39 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7B9A22166B2A; Fri, 24 Nov 2023 13:26:35 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 02/20] mm: add a total mapcount for large folios Date: Fri, 24 Nov 2023 14:26:07 +0100 Message-ID: <20231124132626.235350-3-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:10 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452074892216481 X-GMAIL-MSGID: 1783452074892216481 Let's track the total mapcount for all large folios in the first subpage. The total mapcount is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and soon more widely used, we want to avoid looping over all pages of a folio just to calculate the total mapcount. Further, we might soon want to use the total mapcount in other context more frequently, so prepare for reading it efficiently and atomically. Maintain the total mapcount also for hugetlb pages. Use the total mapcount to implement folio_mapcount(). Make folio_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped() and move folio_large_total_mapcount() to mm.h. Similarly, get rid of folio_nr_pages_mapped() and stop dumping that value in __dump_page(). While at it, simplify total_mapcount() by calling folio_mapcount() and page_mapped() by calling folio_mapped(): it seems to add only one more MOV instruction on x86-64 to the compiled code, which we shouldn't have to worry about. _nr_pages_mapped is now only used in rmap code, so not accidentally externally where it might be used on arbitrary order-1 pages. The remaining usage is: (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED -> If we would account the total folio as mapped when mapping a page (based on the total mapcount), we could remove that usage. We'll have to be careful about memory-sensitive applications that also adjust /sys/kernel/debug/fault_around_bytes to not get a large folio completely mapped on page fault. (2) Detect when to add an anon folio to the deferred split queue -> If we would apply a different heuristic, or scan using the rmap on the memory reclaim path for partially mapped anon folios to split them, we could remove that usage as well. For now, these things remain as they are, they need more thought. Hugh really did a fantastic job implementing that tracking after all. Note that before the total mapcount would overflow, already our refcount would overflow: each distinct mapping requires a distinct reference. Probably, in the future, we want 64bit refcount+mapcount for larger folios. 
Reviewed-by: Zi Yan Reviewed-by: Ryan Roberts Reviewed-by: Yin Fengwei Signed-off-by: David Hildenbrand --- Documentation/mm/transhuge.rst | 12 +++++------ include/linux/mm.h | 37 +++++++++----------------------- include/linux/mm_types.h | 5 +++-- include/linux/rmap.h | 15 ++++++++----- mm/debug.c | 4 ++-- mm/hugetlb.c | 4 ++-- mm/internal.h | 10 +-------- mm/page_alloc.c | 4 ++++ mm/rmap.c | 39 ++++++++++++---------------------- 9 files changed, 52 insertions(+), 78 deletions(-) diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 9a607059ea11..b0d3b1d3e8ea 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -116,14 +116,14 @@ pages: succeeds on tail pages. - map/unmap of a PMD entry for the whole THP increment/decrement - folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount - goes from -1 to 0 or 0 to -1. + folio->_entire_mapcount, increment/decrement folio->_total_mapcount + and also increment/decrement folio->_nr_pages_mapped by COMPOUND_MAPPED + when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement - page->_mapcount and also increment/decrement folio->_nr_pages_mapped - when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts - the number of pages mapped by PTE. + page->_mapcount, increment/decrement folio->_total_mapcount and also + increment/decrement folio->_nr_pages_mapped when page->_mapcount goes + from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE. split_huge_page internally has to distribute the refcounts in the head page to the tail pages before clearing all PG_head/tail bits from the page diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..fe91aaefa3db 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1207,17 +1207,16 @@ static inline int page_mapcount(struct page *page) return mapcount; } -int folio_total_mapcount(struct folio *folio); +static inline int folio_total_mapcount(struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large(folio), folio); + return atomic_read(&folio->_total_mapcount) + 1; +} /** - * folio_mapcount() - Calculate the number of mappings of this folio. + * folio_mapcount() - Number of mappings of this folio. * @folio: The folio. * - * A large folio tracks both how many times the entire folio is mapped, - * and how many times each individual page in the folio is mapped. - * This function calculates the total number of times the folio is - * mapped. - * * Return: The number of times this folio is mapped. */ static inline int folio_mapcount(struct folio *folio) @@ -1229,19 +1228,7 @@ static inline int folio_mapcount(struct folio *folio) static inline int total_mapcount(struct page *page) { - if (likely(!PageCompound(page))) - return atomic_read(&page->_mapcount) + 1; - return folio_total_mapcount(page_folio(page)); -} - -static inline bool folio_large_is_mapped(struct folio *folio) -{ - /* - * Reading _entire_mapcount below could be omitted if hugetlb - * participated in incrementing nr_pages_mapped when compound mapped. 
- */ - return atomic_read(&folio->_nr_pages_mapped) > 0 || - atomic_read(&folio->_entire_mapcount) >= 0; + return folio_mapcount(page_folio(page)); } /** @@ -1252,9 +1239,7 @@ static inline bool folio_large_is_mapped(struct folio *folio) */ static inline bool folio_mapped(struct folio *folio) { - if (likely(!folio_test_large(folio))) - return atomic_read(&folio->_mapcount) >= 0; - return folio_large_is_mapped(folio); + return folio_mapcount(folio) > 0; } /* @@ -1264,9 +1249,7 @@ static inline bool folio_mapped(struct folio *folio) */ static inline bool page_mapped(struct page *page) { - if (likely(!PageCompound(page))) - return atomic_read(&page->_mapcount) >= 0; - return folio_large_is_mapped(page_folio(page)); + return folio_mapped(page_folio(page)); } static inline struct page *virt_to_head_page(const void *x) @@ -2139,7 +2122,7 @@ static inline size_t folio_size(struct folio *folio) * looking at the precise mapcount of the first subpage in the folio, and * assuming the other subpages are the same. This may not be true for large * folios. If you want exact mapcounts for exact calculations, look at - * page_mapcount() or folio_total_mapcount(). + * page_mapcount() or folio_mapcount(). * * Return: The estimated number of processes sharing a folio. */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 957ce38768b2..99b84b4797b9 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -264,7 +264,8 @@ typedef struct { * @virtual: Virtual address in the kernel direct map. * @_last_cpupid: IDs of last CPU and last process that accessed the folio. * @_entire_mapcount: Do not use directly, call folio_entire_mapcount(). - * @_nr_pages_mapped: Do not use directly, call folio_mapcount(). + * @_total_mapcount: Do not use directly, call folio_mapcount(). + * @_nr_pages_mapped: Do not use outside of rmap code. * @_pincount: Do not use directly, call folio_maybe_dma_pinned(). * @_folio_nr_pages: Do not use directly, call folio_nr_pages(). * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h. 
@@ -323,8 +324,8 @@ struct folio { struct { unsigned long _flags_1; unsigned long _head_1; - unsigned long _folio_avail; /* public: */ + atomic_t _total_mapcount; atomic_t _entire_mapcount; atomic_t _nr_pages_mapped; atomic_t _pincount; diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b26fe858fd44..42e2c74d4d6e 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -210,14 +210,19 @@ void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *, static inline void __page_dup_rmap(struct page *page, bool compound) { - if (compound) { - struct folio *folio = (struct folio *)page; + struct folio *folio = page_folio(page); - VM_BUG_ON_PAGE(compound && !PageHead(page), page); - atomic_inc(&folio->_entire_mapcount); - } else { + VM_BUG_ON_PAGE(compound && !PageHead(page), page); + if (likely(!folio_test_large(folio))) { atomic_inc(&page->_mapcount); + return; } + + if (compound) + atomic_inc(&folio->_entire_mapcount); + else + atomic_inc(&page->_mapcount); + atomic_inc(&folio->_total_mapcount); } static inline void page_dup_file_rmap(struct page *page, bool compound) diff --git a/mm/debug.c b/mm/debug.c index ee533a5ceb79..97f6f6b32ae7 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -99,10 +99,10 @@ static void __dump_page(struct page *page) page, page_ref_count(head), mapcount, mapping, page_to_pgoff(page), page_to_pfn(page)); if (compound) { - pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n", + pr_warn("head:%p order:%u entire_mapcount:%d total_mapcount:%d pincount:%d\n", head, compound_order(head), folio_entire_mapcount(folio), - folio_nr_pages_mapped(folio), + folio_mapcount(folio), atomic_read(&folio->_pincount)); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1169ef2f2176..cf84784064c7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1509,7 +1509,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio, struct page *p; atomic_set(&folio->_entire_mapcount, 0); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, 0); atomic_set(&folio->_pincount, 0); for (i = 1; i < nr_pages; i++) { @@ -2119,7 +2119,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, /* we rely on prep_new_hugetlb_folio to set the destructor */ folio_set_order(folio, order); atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_pincount, 0); return true; diff --git a/mm/internal.h b/mm/internal.h index b61034bd50f5..bb2e55c402e7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -67,15 +67,6 @@ void page_writeback_init(void); */ #define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ -/* - * How many individual pages have an elevated _mapcount. Excludes - * the folio's entire_mapcount. 
- */ -static inline int folio_nr_pages_mapped(struct folio *folio) -{ - return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED; -} - static inline void *folio_raw_mapping(struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; @@ -429,6 +420,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order) struct folio *folio = (struct folio *)page; folio_set_order(folio, order); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_entire_mapcount, -1); atomic_set(&folio->_nr_pages_mapped, 0); atomic_set(&folio->_pincount, 0); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 733732e7e0ba..aad45758c0c7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -988,6 +988,10 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) bad_page(page, "nonzero entire_mapcount"); goto out; } + if (unlikely(atomic_read(&folio->_total_mapcount) + 1)) { + bad_page(page, "nonzero total_mapcount"); + goto out; + } if (unlikely(atomic_read(&folio->_nr_pages_mapped))) { bad_page(page, "nonzero nr_pages_mapped"); goto out; diff --git a/mm/rmap.c b/mm/rmap.c index afddf3d82a8f..38765796dca8 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1104,35 +1104,12 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, return page_vma_mkclean_one(&pvmw); } -int folio_total_mapcount(struct folio *folio) -{ - int mapcount = folio_entire_mapcount(folio); - int nr_pages; - int i; - - /* In the common case, avoid the loop when no pages mapped by PTE */ - if (folio_nr_pages_mapped(folio) == 0) - return mapcount; - /* - * Add all the PTE mappings of those pages mapped by PTE. - * Limit the loop to folio_nr_pages_mapped()? - * Perhaps: given all the raciness, that may be a good or a bad idea. - */ - nr_pages = folio_nr_pages(folio); - for (i = 0; i < nr_pages; i++) - mapcount += atomic_read(&folio_page(folio, i)->_mapcount); - - /* But each of those _mapcounts was based on -1 */ - mapcount += nr_pages; - return mapcount; -} - static unsigned int __folio_add_rmap_range(struct folio *folio, struct page *page, unsigned int nr_pages, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; - int first, nr = 0; + int first, count, nr = 0; VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); @@ -1144,6 +1121,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, /* Is page being mapped by PTE? Is this its first map to be added? 
*/ if (!compound) { + count = nr_pages; do { first = atomic_inc_and_test(&page->_mapcount); if (first) { @@ -1151,7 +1129,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, if (first < COMPOUND_MAPPED) nr++; } - } while (page++, --nr_pages > 0); + } while (page++, --count > 0); + atomic_add(nr_pages, &folio->_total_mapcount); } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1169,6 +1148,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, nr = 0; } } + atomic_inc(&folio->_total_mapcount); } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } @@ -1348,6 +1328,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); } + if (folio_test_large(folio)) + /* increment count (starts at -1) */ + atomic_set(&folio->_total_mapcount, 0); + __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); __folio_set_anon(folio, vma, address, true); SetPageAnonExclusive(&folio->page); @@ -1427,6 +1411,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(compound && !PageHead(page), page); + if (folio_test_large(folio)) + atomic_dec(&folio->_total_mapcount); + /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ @@ -2576,6 +2563,7 @@ void hugepage_add_anon_rmap(struct folio *folio, struct vm_area_struct *vma, VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_total_mapcount); if (flags & RMAP_EXCLUSIVE) SetPageAnonExclusive(&folio->page); VM_WARN_ON_FOLIO(folio_entire_mapcount(folio) > 1 && @@ -2588,6 +2576,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio, BUG_ON(address < vma->vm_start || address >= vma->vm_end); /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); + atomic_set(&folio->_total_mapcount, 0); folio_clear_hugetlb_restore_reserve(folio); __folio_set_anon(folio, vma, address, true); SetPageAnonExclusive(&folio->page); From patchwork Fri Nov 24 13:26:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169413 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1190376vqx; Fri, 24 Nov 2023 05:27:43 -0800 (PST) X-Google-Smtp-Source: AGHT+IF5Nr8nn8XcGZPCJ1pBCtXwJNXwtQ+gO51EwFlRyhWU154Ak/YleLKhqwj5ud3Huoq8nxrD X-Received: by 2002:a05:6a20:e68b:b0:18b:d09d:6910 with SMTP id mz11-20020a056a20e68b00b0018bd09d6910mr4312817pzb.29.1700832463431; Fri, 24 Nov 2023 05:27:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832463; cv=none; d=google.com; s=arc-20160816; b=k3KvETnZi39aJ8j9+fhl2yp/5hpOY4/9gr4mq9Utm+WzGlPG3T8vEGjM30TXm2F5lr RX74imC8ElNoOMyxBNhRMuJISJijX+ZQQW/8O1nI+o7CBzqPInWXFyHlljhfQYcFtJ4i hwzRyaIRf+gNnQXKqPoLmWkdzPLIO7hBBvJV+1MHN4b3NrEsLjzJOBFy6El6Y3UiEQto 19GgicWw/xO/dNUjxBwClamC6giVxOZmSXhmMQGOud0lxSQORTKn3W0VapqKoe7rK86O +2BGdiZ41BH+9BXUugv0KY70j8s+4Q+RO8YHgbRcyeO1Kvy4iCaYIwQT0i5yZxecdiBL XGxQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=NG++BiHeM5ZPoN7pJxd4ztX7J63H+nPZ6vnmgIm1VTs=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; 
b=k1M6Is1qJDjuMvt+MxqiwiWRgJTTtdaGCqEbTwt4d6wKhqA8sk+zNZhFD0wQw9EMLv YFK1slh2H/7Tim/uPwBvu4B2onXV6J//FMNgWTKoT7a0xLu3WifuaamFhrKhymJZ0Ye4 Gr3dTbJNxquU9XRxsjH2DFgjhjrdDiu008H1S1dxEyj62NbRVkdGBzt1HcHFcRx1Sa6R hVFfpWV+bS3nVO5JKqHqSBFMgoTsCU5EkXSld5O+jD9tK/LMwoRBt+blgc/djf1+Upku tx5KNXCsaNSQpG4+wea+ji07lQy41h8gkz//pLm53tTkxAgXu0Lvds3lJlf+Onr4+fOY KHMw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=OylzUnkJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from howler.vger.email (howler.vger.email. [23.128.96.34]) by mx.google.com with ESMTPS id u9-20020a631409000000b005859b2d8d7asi3448415pgl.4.2023.11.24.05.27.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:27:43 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) client-ip=23.128.96.34; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=OylzUnkJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id A2C0A8047649; Fri, 24 Nov 2023 05:27:24 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231233AbjKXN1B (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:01 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46782 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345265AbjKXN0l (ORCPT ); Fri, 24 Nov 2023 08:26:41 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 79F7810C6 for ; Fri, 24 Nov 2023 05:26:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832406; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=NG++BiHeM5ZPoN7pJxd4ztX7J63H+nPZ6vnmgIm1VTs=; b=OylzUnkJL6FYu3Btmdy0zUw1R1cv0f54gxDEhntt2B3N3j/XH+YD8SaeA0dZZFEPNFLgCa qpwsi4c9Q8S7x8buln12N0uGRmR6PINAIqN2v4UOn31dPGjm8e6tYlq7pIGCH99V7nGdTq 70/DFohTmolUdY7h+KBf5SSvi2IVkOw= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-628-ZJGKa9ZtNFmIPA6BX8h3SQ-1; Fri, 24 Nov 2023 08:26:43 -0500 X-MC-Unique: ZJGKa9ZtNFmIPA6BX8h3SQ-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 0023E2806052; Fri, 24 Nov 2023 13:26:43 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com 
(Postfix) with ESMTP id 476F32166B2A; Fri, 24 Nov 2023 13:26:39 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. McKenney" Subject: [PATCH WIP v1 03/20] mm: convert folio_estimated_sharers() to folio_mapped_shared() and improve it Date: Fri, 24 Nov 2023 14:26:08 +0100 Message-ID: <20231124132626.235350-4-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:24 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452101322559914 X-GMAIL-MSGID: 1783452101322559914 Callers of folio_estimated_sharers() only care about "mapped shared vs. mapped exclusively". Let's rename the function and improve our detection for partially-mappable folios (i.e., PTE-mapped THPs). For now we can only implement, based on our guess, "certainly mapped shared vs. maybe mapped exclusively". Ideally, we'd have something like "maybe mapped shared vs. certainly mapped exclusively" -- or even better "certainly mapped shared vs. certainly mapped exclusively" instead. But these semantics are currently impossible with the guess-based heuristic we apply to partially-mappable folios. Naming the function "folio_certainly_mapped_shared" would be possible, but let's just keep it simple and call it "folio_mapped_shared" and document the fuzziness that applies for now. As we can now read the total mapcount of large folios very efficiently, use that to improve our implementation, falling back to making a guess only when the folio is not "obviously mapped shared". Signed-off-by: David Hildenbrand --- include/linux/mm.h | 68 +++++++++++++++++++++++++++++++++++++++------- mm/huge_memory.c | 2 +- mm/madvise.c | 6 ++-- mm/memory.c | 2 +- mm/mempolicy.c | 14 ++++------ mm/migrate.c | 2 +- 6 files changed, 70 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index fe91aaefa3db..17dac913f367 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2114,21 +2114,69 @@ static inline size_t folio_size(struct folio *folio) } /** - * folio_estimated_sharers - Estimate the number of sharers of a folio. + * folio_mapped_shared - Report if a folio is certainly mapped by + * multiple entities in their page tables * @folio: The folio. * - * folio_estimated_sharers() aims to serve as a function to efficiently - * estimate the number of processes sharing a folio. This is done by - * looking at the precise mapcount of the first subpage in the folio, and - * assuming the other subpages are the same. This may not be true for large - * folios. If you want exact mapcounts for exact calculations, look at - * page_mapcount() or folio_mapcount().
+ * This function checks if a folio is certainly *currently* mapped by + * multiple entities in their page table ("mapped shared") or if the folio + * may be mapped exclusively by a single entity ("mapped exclusively"). * - * Return: The estimated number of processes sharing a folio. + * Usually, we consider a single entity to be a single MM. However, some + * folios (KSM, pagecache) can be mapped multiple times into the same MM. + * + * For KSM folios, each individual page table mapping is considered a + * separate entity. So if a KSM folio is mapped multiple times into the + * same process, it is considered "mapped shared". + * + * For pagecache folios that are entirely mapped multiple times into the + * same MM (i.e., multiple VMAs in the same MM cover the same + * file range), we traditionally (and for simplicity) consider them, + * "mapped shared". For partially-mapped folios (e..g, PTE-mapped THP), we + * might detect them either as "mapped shared" or "mapped exclusively" -- + * whatever is simpler. + * + * For small folios and entirely mapped large folios (e.g., hugetlb, + * PMD-mapped PMD-sized THP), the result will be exactly correct. + * + * For all other (partially-mappable) folios, such as PTE-mapped THP, the + * return value is partially fuzzy: true is not fuzzy, because it means + * "certainly mapped shared", but false means "maybe mapped exclusively". + * + * Note that this function only considers *current* page table mappings + * tracked via rmap -- that properly adjusts the folio mapcount(s) -- and + * does not consider: + * (1) any way the folio might get mapped in the (near) future (e.g., + * swapcache, pagecache, temporary unmapping for migration). + * (2) any way a folio might be mapped besides using the rmap (PFN mappings). + * (3) any form of page table sharing. + * + * Return: Whether the folio is certainly mapped by multiple entities. */ -static inline int folio_estimated_sharers(struct folio *folio) +static inline bool folio_mapped_shared(struct folio *folio) { - return page_mapcount(folio_page(folio, 0)); + unsigned int total_mapcount; + + if (likely(!folio_test_large(folio))) + return atomic_read(&folio->page._mapcount) != 0; + total_mapcount = folio_total_mapcount(folio); + + /* A single mapping implies "mapped exclusively". */ + if (total_mapcount == 1) + return false; + + /* If there is an entire mapping, it must be the only mapping. */ + if (folio_entire_mapcount(folio) || unlikely(folio_test_hugetlb(folio))) + return total_mapcount != 1; + /* + * Partially-mappable folios are tricky ... but some are "obviously + * mapped shared": if we have more (PTE) mappings than we have pages + * in the folio, some other entity is certainly involved. + */ + if (total_mapcount > folio_nr_pages(folio)) + return true; + /* ... guess based on the mapcount of the first page of the folio. */ + return atomic_read(&folio->page._mapcount) > 0; } #ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f31f02472396..874eeeb90e0b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1638,7 +1638,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this folio, we couldn't discard * the folio unless they all do MADV_FREE so let's skip the folio. 
*/ - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) goto out; if (!folio_trylock(folio)) diff --git a/mm/madvise.c b/mm/madvise.c index cf4d694280e9..1a82867c8c2e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -365,7 +365,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, folio = pfn_folio(pmd_pfn(orig_pmd)); /* Do not interfere with other mappings of this folio */ - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) goto huge_unlock; if (pageout_anon_only_filter && !folio_test_anon(folio)) @@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err; - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break; @@ -665,7 +665,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err; - if (folio_estimated_sharers(folio) != 1) + if (folio_mapped_shared(folio)) break; if (!folio_trylock(folio)) break; diff --git a/mm/memory.c b/mm/memory.c index 1f18ed4a5497..6bcfa763a146 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4848,7 +4848,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * Flag if the folio is shared between multiple address spaces. This * is later used when determining whether to group tasks together */ - if (folio_estimated_sharers(folio) > 1 && (vma->vm_flags & VM_SHARED)) + if (folio_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) flags |= TNF_SHARED; nid = folio_nid(folio); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..0492113497cc 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -605,12 +605,11 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. * Choosing not to migrate a shared folio is not counted as a failure. * - * To check if the folio is shared, ideally we want to make sure - * every page is mapped to the same process. Doing that is very - * expensive, so check the estimated sharers of the folio instead. + * See folio_mapped_shared() on possible imprecision when we cannot + * easily detect if a folio is shared. */ if ((flags & MPOL_MF_MOVE_ALL) || - (folio_estimated_sharers(folio) == 1 && !hugetlb_pmd_shared(pte))) + (!folio_mapped_shared(folio) && !hugetlb_pmd_shared(pte))) if (!isolate_hugetlb(folio, qp->pagelist)) qp->nr_failed++; unlock: @@ -988,11 +987,10 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. * Choosing not to migrate a shared folio is not counted as a failure. * - * To check if the folio is shared, ideally we want to make sure - * every page is mapped to the same process. Doing that is very - * expensive, so check the estimated sharers of the folio instead. + * See folio_mapped_shared() on possible imprecision when we cannot + * easily detect if a folio is shared. */ - if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) { + if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio)) { if (folio_isolate_lru(folio)) { list_add_tail(&folio->lru, foliolist); node_stat_mod_folio(folio, diff --git a/mm/migrate.c b/mm/migrate.c index 35a88334bb3c..fda41bc09903 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2559,7 +2559,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, * every page is mapped to the same process. 
Doing that is very * expensive, so check the estimated mapcount of the folio instead. */ - if (folio_estimated_sharers(folio) != 1 && folio_is_file_lru(folio) && + if (folio_mapped_shared(folio) && folio_is_file_lru(folio) && (vma->vm_flags & VM_EXEC)) goto out; From patchwork Fri Nov 24 13:26:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169420 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191325vqx; Fri, 24 Nov 2023 05:29:02 -0800 (PST) X-Google-Smtp-Source: AGHT+IEex0gHnWG/1cHHdYQ55b8l8za5zY3z3lu+x8Heq3T0xkqZiXegzCUb1cexGkt39iotJdoL X-Received: by 2002:a17:902:bb8f:b0:1cc:bfb4:2dca with SMTP id m15-20020a170902bb8f00b001ccbfb42dcamr2673250pls.6.1700832542179; Fri, 24 Nov 2023 05:29:02 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832542; cv=none; d=google.com; s=arc-20160816; b=MGTGS3XiR6HL6s8+bmnmptnpyOJ/MXs9Q5S5E8vTU3OJ7G1Yxkzos3nIvqjJJoaKcw bgVPKPrwQJIQpzMlc/KGotYgaYeC+PiYoANAWYaqmFCHOGnO4K6KpeKTtwjFIDrhpleC fokE6tbPaIOnVnuUGejFeocOBx1fU/3zTlAzYmto7Buzi9KAhH6lLI7hA92RWBnr3XCu uYyc7YOqTWDc35WK1EhS01vKn9KeQHWFTbIoex/WOCfxQrghaIkO0mQa/oSBW88ialX7 HLVLivTjoIA5cEhr3QyDL4hqsTSpWOsL2YlEtdZdN3TIcOIvcSKKhWjWptRkavBpuT4M T0/Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=lJ96IMZVe5oCg8NRltA+Bp5PZmcoSK+MjUKn4pYnke0=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=pq2wXSVCMvBigkyycGc0GmnYKPJGnCcMQRs6gQfOT5PRB15MQI6s5MkoR/M/KuKGo/ u5Zsyv2odKV/rvhq/HLik3/ra8FrZ3QnoMdbQDc/YF8soSKf7aMjCnG5IfKhvw+z0UtG i2sPP12MrTvrKN6I3/H0BNlBzSqo5WTm7B1A/+ohpiwizdK5l840VUfZJbK4dA8zLBWQ DEwfUdQqI0DeROxT+8zT6hVLuCs1DEvmGgQIOo/6jX0vAAf9QpzAbeH3RZqlaj1Ktais s3c4WoXRB2Uc5T/Gek6niPlgVTRygXdGyoYNZJd5uHuCbEC0g/uZtLaSs6WwADaf035p PC6w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=FM0OJ75b; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[23.128.96.37]) by mx.google.com with ESMTPS id y6-20020a170902ed4600b001cfa0b7c6a0si1895678plb.432.2023.11.24.05.29.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:02 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=FM0OJ75b; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 5B6B9818ABD7; Fri, 24 Nov 2023 05:27:33 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345242AbjKXN1K (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46782 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231494AbjKXN07 (ORCPT ); Fri, 24 Nov 2023 08:26:59 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0A438170C for ; Fri, 24 Nov 2023 05:26:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832409; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lJ96IMZVe5oCg8NRltA+Bp5PZmcoSK+MjUKn4pYnke0=; b=FM0OJ75bu4CiGpQIwlsLGqL+b36KzF5OtZ08MDNU3rz6kSWBiKvSVdCBLhiCSAcjcfzJ4x A5xYJeU7dBXk4M4oajgGIYdH7mVYaCmnthhpAIZjcIndrMg4ja61pqYJ3w2y1Ub1qYrZBb qZL72wZIhGyFcmcOP7X1Na1xsRMMu9g= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-507-j-wvy56dM2qBkwjy1JR5fQ-1; Fri, 24 Nov 2023 08:26:47 -0500 X-MC-Unique: j-wvy56dM2qBkwjy1JR5fQ-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 5E5A4811E7B; Fri, 24 Nov 2023 13:26:46 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 46A822166B2A; Fri, 24 Nov 2023 13:26:43 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 04/20] mm/rmap: pass dst_vma to page_try_dup_anon_rmap() and page_dup_file_rmap() Date: Fri, 24 Nov 2023 14:26:09 +0100 Message-ID: <20231124132626.235350-5-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:33 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452183445671441 X-GMAIL-MSGID: 1783452183445671441 We'll need access to the destination MM when modifying the total mapcount of a partially-mappable folio next. So pass in the destination VMA for consistency. While at it, change the parameter order for page_try_dup_anon_rmap() such that the "bool compound" parameter is last, to match the other rmap functions. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 21 +++++++++++++-------- mm/huge_memory.c | 2 +- mm/hugetlb.c | 9 +++++---- mm/memory.c | 6 +++--- mm/migrate.c | 2 +- 5 files changed, 23 insertions(+), 17 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 42e2c74d4d6e..6cb497f6feab 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -208,7 +208,8 @@ void hugepage_add_anon_rmap(struct folio *, struct vm_area_struct *, void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *, unsigned long address); -static inline void __page_dup_rmap(struct page *page, bool compound) +static inline void __page_dup_rmap(struct page *page, + struct vm_area_struct *dst_vma, bool compound) { struct folio *folio = page_folio(page); @@ -225,17 +226,19 @@ static inline void __page_dup_rmap(struct page *page, bool compound) atomic_inc(&folio->_total_mapcount); } -static inline void page_dup_file_rmap(struct page *page, bool compound) +static inline void page_dup_file_rmap(struct page *page, + struct vm_area_struct *dst_vma, bool compound) { - __page_dup_rmap(page, compound); + __page_dup_rmap(page, dst_vma, compound); } /** * page_try_dup_anon_rmap - try duplicating a mapping of an already mapped * anonymous page * @page: the page to duplicate the mapping for + * @dst_vma: the destination vma + * @src_vma: the source vma * @compound: the page is mapped as compound or as a small page - * @vma: the source vma * * The caller needs to hold the PT lock and the vma->vma_mm->write_protect_seq. * @@ -247,8 +250,10 @@ static inline void page_dup_file_rmap(struct page *page, bool compound) * * Returns 0 if duplicating the mapping succeeded. Returns -EBUSY otherwise. */ -static inline int page_try_dup_anon_rmap(struct page *page, bool compound, - struct vm_area_struct *vma) +static inline int page_try_dup_anon_rmap(struct page *page, + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + bool compound) { VM_BUG_ON_PAGE(!PageAnon(page), page); @@ -267,7 +272,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, * future on write faults. 
*/ if (likely(!is_device_private_page(page) && - unlikely(page_needs_cow_for_dma(vma, page)))) + unlikely(page_needs_cow_for_dma(src_vma, page)))) return -EBUSY; ClearPageAnonExclusive(page); @@ -276,7 +281,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, * the page R/O into both processes. */ dup: - __page_dup_rmap(page, compound); + __page_dup_rmap(page, dst_vma, compound); return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 874eeeb90e0b..51a878efca0e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1166,7 +1166,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, VM_BUG_ON_PAGE(!PageHead(src_page), src_page); get_page(src_page); - if (unlikely(page_try_dup_anon_rmap(src_page, true, src_vma))) { + if (unlikely(page_try_dup_anon_rmap(src_page, dst_vma, src_vma, true))) { /* Page maybe pinned: split and retry the fault on PTEs. */ put_page(src_page); pte_free(dst_mm, pgtable); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index cf84784064c7..1ddef4082cad 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5401,9 +5401,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, * sleep during the process. */ if (!folio_test_anon(pte_folio)) { - page_dup_file_rmap(&pte_folio->page, true); + page_dup_file_rmap(&pte_folio->page, dst_vma, + true); } else if (page_try_dup_anon_rmap(&pte_folio->page, - true, src_vma)) { + dst_vma, src_vma, true)) { pte_t src_pte_old = entry; struct folio *new_folio; @@ -6272,7 +6273,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, if (anon_rmap) hugepage_add_new_anon_rmap(folio, vma, haddr); else - page_dup_file_rmap(&folio->page, true); + page_dup_file_rmap(&folio->page, vma, true); new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED))); /* @@ -6723,7 +6724,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, goto out_release_unlock; if (folio_in_pagecache) - page_dup_file_rmap(&folio->page, true); + page_dup_file_rmap(&folio->page, dst_vma, true); else hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr); diff --git a/mm/memory.c b/mm/memory.c index 6bcfa763a146..14416d05e1b6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -836,7 +836,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, get_page(page); rss[mm_counter(page)]++; /* Cannot fail as these pages cannot get pinned. */ - BUG_ON(page_try_dup_anon_rmap(page, false, src_vma)); + BUG_ON(page_try_dup_anon_rmap(page, dst_vma, src_vma, false)); /* * We do not preserve soft-dirty information, because so @@ -950,7 +950,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, * future. */ folio_get(folio); - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) { + if (unlikely(page_try_dup_anon_rmap(page, dst_vma, src_vma, false))) { /* Page may be pinned, we have to copy. 
*/ folio_put(folio); return copy_present_page(dst_vma, src_vma, dst_pte, src_pte, @@ -959,7 +959,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, rss[MM_ANONPAGES]++; } else if (page) { folio_get(folio); - page_dup_file_rmap(page, false); + page_dup_file_rmap(page, dst_vma, false); rss[mm_counter_file(page)]++; } diff --git a/mm/migrate.c b/mm/migrate.c index fda41bc09903..341a84c3e8e4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -252,7 +252,7 @@ static bool remove_migration_pte(struct folio *folio, hugepage_add_anon_rmap(folio, vma, pvmw.address, rmap_flags); else - page_dup_file_rmap(new, true); + page_dup_file_rmap(new, vma, true); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte, psize); } else From patchwork Fri Nov 24 13:26:10 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169414 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1190555vqx; Fri, 24 Nov 2023 05:27:58 -0800 (PST) X-Google-Smtp-Source: AGHT+IHYKTVWgEPzTX+C3xcJxQFwiE8UHfJ+rpmWoZgEH4BpbV4cXKxbuQJhrMscAjiECgnphP1m X-Received: by 2002:a17:902:ab06:b0:1cc:3bfc:69b1 with SMTP id ik6-20020a170902ab0600b001cc3bfc69b1mr2518110plb.24.1700832477734; Fri, 24 Nov 2023 05:27:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832477; cv=none; d=google.com; s=arc-20160816; b=EzZfHwRRDveUH1+59ztZNsx32+tBLeyPhDiv5CfaR7Rb1tKqpwsXEytfgPhkg2dRzT PPwdQHsCjU1efBM9rr+X3RpWNjrTBEeuKuqLaM5EmzXK3zO6VxiQE2eumCCw7Vq8Whgv azbX2HUOcjI2pO61aaE+TGwjzA5rhNuXxLt1Dd5BW9FoPyrqBRt8COdP0A0luJlUF9gT G6Mw5UtG3QcT/FYFyfNNCTEltsUBerFuJ8IOTnv4BDijZY4vLe09EcnrLuf7GM0aqJg8 me92vajSWjZ5Sn4x8iqbW7sqFwa/1Y3G6VaEcogxNmSMpiqZ/mMU4nRxfd5DEcu+zBcU 0rzA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=/ZXmCy/cptnrW109242Bp79Yc/hIzrll+sgQ1b8Vj4U=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=RXupt6jICpZ7gcvKbrz7ax/5bY1Uor3TSbLWBdMYBVrdEO9jLi+i0tyTHkCyB9G5ML O+wFKoYX3Q5UYehChOddFgvVLKlB/HlBVq/+qLE81xsNZi8xg3bG7H4AxWMdqfCVAG/P YfQBREaiPRuphyVeEZjPyctiL07go67iOmvo6iucghnjGmuVqE6RD4APhihkbJYfe6YR pfk1Dt4Hgd96uAvSNgFL6FwDovzBriYqLPzwAQlA7y8JaR+OQp5NB56VWaznSvLY9jwa KBNwVEteXXL81IPDOc7C8qOrhpFhn21A8z/6Q07hhV1qorUBEwM4niXoA3OO0l8Uq5yv PpIQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=G82MZBUh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from agentk.vger.email (agentk.vger.email. 
[2620:137:e000::3:2]) by mx.google.com with ESMTPS id ik24-20020a170902ab1800b001cf54c3b19csi3334732plb.123.2023.11.24.05.27.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:27:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) client-ip=2620:137:e000::3:2; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=G82MZBUh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id 4EC0181825D7; Fri, 24 Nov 2023 05:27:45 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345297AbjKXN1M (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:12 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37602 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232837AbjKXN1B (ORCPT ); Fri, 24 Nov 2023 08:27:01 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C1FA31730 for ; Fri, 24 Nov 2023 05:26:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832412; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/ZXmCy/cptnrW109242Bp79Yc/hIzrll+sgQ1b8Vj4U=; b=G82MZBUh3iwaP5zwmz+tCYWCC+6wc2m3xOxqRSVZgmLMFmkLIb2n6qnTOCu27cu4Q/jI6e 9SWtNS6ByBuCPor/cFfZNaNoduD9ebWm29y7raoRBlJshg1jzPgFzjh3nFiSr0a0mHXBTR 7Neq1ivoNSIwK480TAFQJLrkITfTBd4= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-459-PPvWI2VbOfaXtOuRopE19A-1; Fri, 24 Nov 2023 08:26:50 -0500 X-MC-Unique: PPvWI2VbOfaXtOuRopE19A-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id D82C7811E7B; Fri, 24 Nov 2023 13:26:49 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id A59242166B2A; Fri, 24 Nov 2023 13:26:46 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 05/20] mm/rmap: abstract total mapcount operations for partially-mappable folios Date: Fri, 24 Nov 2023 14:26:10 +0100 Message-ID: <20231124132626.235350-6-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:46 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452116657587957 X-GMAIL-MSGID: 1783452116657587957 Let's prepare for doing additional accounting whenever modifying the total mapcount of partially-mappable (!hugetlb) folios. Pass the VMA as well. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 41 ++++++++++++++++++++++++++++++++++++++++- mm/rmap.c | 23 ++++++++++++----------- 2 files changed, 52 insertions(+), 12 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 6cb497f6feab..9d5c2ed6ced5 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -168,6 +168,39 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, struct anon_vma *folio_get_anon_vma(struct folio *folio); +static inline void folio_set_large_mapcount(struct folio *folio, + int count, struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + /* increment count (starts at -1) */ + atomic_set(&folio->_total_mapcount, count - 1); +} + +static inline void folio_inc_large_mapcount(struct folio *folio, + struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + atomic_inc(&folio->_total_mapcount); +} + +static inline void folio_add_large_mapcount(struct folio *folio, + int count, struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + atomic_add(count, &folio->_total_mapcount); +} + +static inline void folio_dec_large_mapcount(struct folio *folio, + struct vm_area_struct *vma) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + atomic_dec(&folio->_total_mapcount); +} + /* RMAP flags, currently only relevant for some anon rmap operations. 
*/ typedef int __bitwise rmap_t; @@ -219,11 +252,17 @@ static inline void __page_dup_rmap(struct page *page, return; } + if (unlikely(folio_test_hugetlb(folio))) { + atomic_inc(&folio->_entire_mapcount); + atomic_inc(&folio->_total_mapcount); + return; + } + if (compound) atomic_inc(&folio->_entire_mapcount); else atomic_inc(&page->_mapcount); - atomic_inc(&folio->_total_mapcount); + folio_inc_large_mapcount(folio, dst_vma); } static inline void page_dup_file_rmap(struct page *page, diff --git a/mm/rmap.c b/mm/rmap.c index 38765796dca8..689ad85cf87e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1105,8 +1105,8 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, } static unsigned int __folio_add_rmap_range(struct folio *folio, - struct page *page, unsigned int nr_pages, bool compound, - int *nr_pmdmapped) + struct page *page, unsigned int nr_pages, + struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; int first, count, nr = 0; @@ -1130,7 +1130,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, nr++; } } while (page++, --count > 0); - atomic_add(nr_pages, &folio->_total_mapcount); + folio_add_large_mapcount(folio, nr_pages, vma); } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1148,7 +1148,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, nr = 0; } } - atomic_inc(&folio->_total_mapcount); + folio_inc_large_mapcount(folio, vma); } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } @@ -1258,7 +1258,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned int nr, nr_pmdmapped = 0; bool compound = flags & RMAP_COMPOUND; - nr = __folio_add_rmap_range(folio, page, 1, compound, &nr_pmdmapped); + nr = __folio_add_rmap_range(folio, page, 1, vma, compound, + &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); if (nr) @@ -1329,8 +1330,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, } if (folio_test_large(folio)) - /* increment count (starts at -1) */ - atomic_set(&folio->_total_mapcount, 0); + folio_set_large_mapcount(folio, 1, vma); __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); __folio_set_anon(folio, vma, address, true); @@ -1355,7 +1355,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, { unsigned int nr, nr_pmdmapped = 0; - nr = __folio_add_rmap_range(folio, page, nr_pages, compound, + nr = __folio_add_rmap_range(folio, page, nr_pages, vma, compound, &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ? @@ -1411,16 +1411,17 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(compound && !PageHead(page), page); - if (folio_test_large(folio)) - atomic_dec(&folio->_total_mapcount); - /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ atomic_dec(&folio->_entire_mapcount); + atomic_dec(&folio->_total_mapcount); return; } + if (folio_test_large(folio)) + folio_dec_large_mapcount(folio, vma); + /* Is page being unmapped by PTE? Is this its last map to be removed? 
*/ if (likely(!compound)) { last = atomic_add_negative(-1, &page->_mapcount); From patchwork Fri Nov 24 13:26:11 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169421 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191341vqx; Fri, 24 Nov 2023 05:29:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IGprwUGWuR2whFKLKmqxKYLi5ESmm2ljdY+0sYmvSQFtIzfZmgY2ixptK9Hlw87/AIt6RE/ X-Received: by 2002:a05:6a00:44c8:b0:6cb:901a:9303 with SMTP id cv8-20020a056a0044c800b006cb901a9303mr2981119pfb.13.1700832544605; Fri, 24 Nov 2023 05:29:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832544; cv=none; d=google.com; s=arc-20160816; b=v7aBf63CncuXnuVYriIKtq6Oem7mX/w8dg6754A78ew/DV+GBcjLZkBL7yyM5mQ8kh rBtLFusw2rz6jFHKnkNsCMqZ2+mghQ30Rba4hpAb9jIOsklEJHeKT59a+NTdcp7+hks1 9E+/7ro0NLex+Oh+i8A0XcQiiexS02c7eoZdu3GfSK+JeErxkO/Q7kqeuA6cd2Gaitph 1tdZS8EYmRmv66+bBCQT2n7mB13eypJ6AoIvNEXF/YrSAdNydpF2l2gQ0Fm9izwMecFX E1kK0T7VAKckvVf7ny/lUS7dQ5PI/PbsXXQ+47gX3Yd8y4zbg02emmPASLjd1AhCq3rA 0c7Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=ujZJ3g6hK6oBZOKXOf3RJw4ZrpKoYgRWOTMhG7TrjzA=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=gNh2x6nBbJSrah7HII2nY/6tiYazNwX2s5gchLNnJVkTh73u8q/oGiiVIoJg31aZdA WW3hQ/GM/gBJRvL65W+W/fdi2dkw8iTWqzK2Vl4C2In5KVIfn37k49sNq++mnmcGlUzA CDAxzZGoXmMEgFNptVEP/hnTgjA3l+K/5cVQnSnvw4y7BV6wAWvMAOpbCP4EFalaJy+c 9ZQQQAqCJAv1373ZDcyitENSGnWzqR5XAJs0WZ3jxd5ylNylvtmhqyJS/fDQYHobR+Ci UNyrlflEte26rHG6Yo/IbLQ3GtYAveCCMfQjf22PEgy90zU2gc4VJbhM7kgysgWhi7DJ wHbw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=SeTVvvm8; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[2620:137:e000::3:7]) by mx.google.com with ESMTPS id d11-20020a63ed0b000000b00578a2da998asi3592316pgi.304.2023.11.24.05.29.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:04 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) client-ip=2620:137:e000::3:7; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=SeTVvvm8; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:7 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id C8A618191468; Fri, 24 Nov 2023 05:27:43 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345606AbjKXN1a (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37040 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235270AbjKXN1E (ORCPT ); Fri, 24 Nov 2023 08:27:04 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A39E019B1 for ; Fri, 24 Nov 2023 05:27:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832420; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ujZJ3g6hK6oBZOKXOf3RJw4ZrpKoYgRWOTMhG7TrjzA=; b=SeTVvvm8Xc2oDJVbQuCbZFFk5U1dRzA3boOM/6Af8TardoT+WGyfOelh/gXMBEFo38N5d4 GBafbBUjG7moggr6Jqp4J5VRey9az6Mucx/P9NsXuikXQsz2isTSj5pU7k+JOrYqOElRmc RlAtB9tnF/ws1RxrqzSo7UWmn3aArzs= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-446-U_verE8yOyS0gXTE1oEkOw-1; Fri, 24 Nov 2023 08:26:54 -0500 X-MC-Unique: U_verE8yOyS0gXTE1oEkOw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id A1C4538116E0; Fri, 24 Nov 2023 13:26:53 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 232362166B2A; Fri, 24 Nov 2023 13:26:50 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 06/20] atomic_seqcount: new (raw) seqcount variant to support concurrent writers Date: Fri, 24 Nov 2023 14:26:11 +0100 Message-ID: <20231124132626.235350-7-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:27:44 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452186335281729 X-GMAIL-MSGID: 1783452186335281729 Assume we have a writer side that is fairly simple and only updates some counters by adding some values: folio->counter_a += diff_a; folio->counter_b += diff_b; folio->counter_c += diff_c; ... Further, assume that our readers want to always read consistent set of counters. That is, they not only want to read each counter atomically, but also get a consistent/atomic view across *all* counters, detecting the case where there are concurrent modifications of the counters. Traditionally, we'd use a seqcount protected by some locking on the writer side. The readers can run lockless, detect when there were concurrent updates, to simply retry again to re-read all values. However, a seqcount requires to serialize all writers to only allow for a single writer at a time. Alternatives might include per-cpu counters / local atomics, but for the target use cases, both primitives are not applicable: We want to store counters (2 to 7 for now, depending on the folio size) in the "struct folio" of some larger folios (order >=2 ) whereby the counters get adjusted whenever we (un)map part of a folio. (a) The reader side must be able to get a consistent view of the counters and be able to detect concurrent changes (i.e., concurrent (un)mapping), as described above. In some cases we can simply stop immediately if we detect any concurrent writer -- any concurrent (un)map activity. (b) The writer side updates the counters as described above and should ideally run completely lockless. In many cases, we always have a single write at a time. But in some scenarios, we can trigger a lot of concurrent writers. We want the writer side to be able to make progress instead of repeadetly spinning, waiting for possibly many other writers. (c) Space in the "struct folio" especially for smallish folios is very limited, and the "struct page" layout imposes various restrictions on where we can even put new data; growing the size of the "struct page" is not desired because it can result in serious metadata overhead and easily has performance implications (cache-line). So we cannot place ordinary spinlocks in there (especially also because they change their size based on lockdep and actual implementation), and the only real alternative is a bit spinlock, which is really undesired. 
If we want to allow concurrent writers, we can use atomic RMW operations when updating the counters:

  atomic_add(diff_a, &folio->counter_a);
  atomic_add(diff_b, &folio->counter_b);
  atomic_add(diff_c, &folio->counter_c);
  ...

But the existing seqcount that lets the reader side detect concurrent updates is not capable of handling concurrent writers. So let's add a new atomic seqcount for exactly that purpose.

Instead of using a single LSB in the seqcount to detect a single concurrent writer, it uses multiple LSBs to detect multiple concurrent writers. As the seqcount can be modified concurrently, it ends up being an atomic type.

In theory, each CPU can participate, so we have to steal quite some LSBs on 64bit. As that reduces the bits available for the actual sequence quite drastically especially on 64bit, and there is the concern that 16bit for the sequence might not be sufficient, just use an atomic_long_t for now. For the use case discussed, we will place the new atomic seqcount into the "struct folio"/"struct page", where the limitations as described above apply. For that use case, the "raw" variant -- raw_atomic_seqcount_t -- is required, so we only add that.

For the normal seqcount on the writer side, we have the following memory ordering:

  s->sequence++
  smp_wmb();
  [critical section]
  smp_wmb();
  s->sequence++

It's important that other CPUs don't observe stores to the sequence to be reordered with stores in the critical section. For the atomic seqcount, we could have similarly used:

  atomic_long_add(SHARED, &s->sequence);
  smp_wmb();
  [critical section]
  smp_wmb();
  atomic_long_add(STEP - SHARED, &s->sequence);

But especially on x86_64, the atomic_long_add() already implies a full memory barrier. So instead, we can do:

  atomic_long_add(SHARED, &s->sequence);
  __smp_mb__after_atomic();
  [critical section]
  __smp_mb__before_atomic();
  atomic_long_add(STEP - SHARED, &s->sequence);

Or alternatively:

  atomic_long_add_return(SHARED, &s->sequence);
  [critical section]
  atomic_long_add_return(STEP - SHARED, &s->sequence);

Could we use acquire-release semantics? Like the following:

  atomic_long_add_return_acquire(SHARED, &s->sequence)
  [critical section]
  atomic_long_add_return_release(STEP - SHARED, &s->sequence)

Maybe, but (a) it would make it different to normal seqcounts, because stores before/after the atomic_long_add_*() could now be reordered; and (b) memory-barriers.txt might indicate that the sequence counter store might be reordered: "For compound atomics performing both a load and a store, ACQUIRE semantics apply only to the load and RELEASE semantics apply only to the store portion of the operation." So let's keep it simple for now.

Effectively, with the atomic seqcount we end up with more atomic RMW operations in the critical section but get no writer starvation / lock contention in return.

We'll limit the implementation to !PREEMPT_RT and disallow readers/writers from interrupt context.
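To make the intended usage concrete, here is a minimal user-space model of the scheme just described, written with C11 atomics and fences in place of the kernel's atomic_long_t and smp_mb__{before,after}_atomic(). The bit layout, the SHARED_WRITER/SEQUENCE_STEP constants and all names are simplified stand-ins for illustration, not the interface this patch adds (that is raw_atomic_seqcount_t, below).

	/*
	 * User-space sketch of the atomic seqcount idea: low bits count the
	 * currently active writers, the remaining bits form the sequence.
	 * Writers never serialize; they only use atomic RMW operations.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>

	#define SHARED_WRITER	0x0001ul	/* one writer slot in the LSBs     */
	#define WRITERS_MASK	0xfffful	/* all writer bits                 */
	#define SEQUENCE_STEP	0x10000ul	/* lowest bit of the sequence part */

	typedef struct {
		atomic_ulong sequence;		/* writer bits + sequence */
		atomic_ulong counter_a;		/* data: updated via atomic RMW only */
		atomic_ulong counter_b;
	} atomic_seqcount_demo_t;

	static void writer_add(atomic_seqcount_demo_t *s, unsigned long da,
			       unsigned long db)
	{
		/* announce one more concurrent writer */
		atomic_fetch_add_explicit(&s->sequence, SHARED_WRITER,
					  memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst); /* ~ smp_mb__after_atomic() */

		/* critical section: concurrent writers are fine, RMW only */
		atomic_fetch_add_explicit(&s->counter_a, da, memory_order_relaxed);
		atomic_fetch_add_explicit(&s->counter_b, db, memory_order_relaxed);

		atomic_thread_fence(memory_order_seq_cst); /* ~ smp_mb__before_atomic() */
		/* drop our writer bit and advance the sequence in one RMW */
		atomic_fetch_add_explicit(&s->sequence,
					  SEQUENCE_STEP - SHARED_WRITER,
					  memory_order_relaxed);
	}

	static bool reader_snapshot(atomic_seqcount_demo_t *s, unsigned long *a,
				    unsigned long *b)
	{
		unsigned long seq;

		/* wait until no writer is active */
		do {
			seq = atomic_load_explicit(&s->sequence,
						   memory_order_relaxed);
		} while (seq & WRITERS_MASK);
		atomic_thread_fence(memory_order_acquire);	/* ~ smp_rmb() */

		*a = atomic_load_explicit(&s->counter_a, memory_order_relaxed);
		*b = atomic_load_explicit(&s->counter_b, memory_order_relaxed);

		atomic_thread_fence(memory_order_acquire);	/* ~ smp_rmb() */
		/* false if a writer started or finished meanwhile: retry */
		return atomic_load_explicit(&s->sequence, memory_order_relaxed) == seq;
	}

A reader whose final check fails simply retries. Note that the kernel version below additionally disables preemption across the write side, precisely because readers busy-wait for the writer bits to clear.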
Signed-off-by: David Hildenbrand --- include/linux/atomic_seqcount.h | 170 ++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 11 +++ 2 files changed, 181 insertions(+) create mode 100644 include/linux/atomic_seqcount.h diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h new file mode 100644 index 000000000000..109447b663a1 --- /dev/null +++ b/include/linux/atomic_seqcount.h @@ -0,0 +1,170 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __LINUX_ATOMIC_SEQLOCK_H +#define __LINUX_ATOMIC_SEQLOCK_H + +#include +#include +#include + +/* + * raw_atomic_seqcount_t -- a reader-writer consistency mechanism with + * lockless readers (read-only retry loops), and lockless writers. + * The writers must use atomic RMW operations in the critical section. + * + * This locking mechanism is applicable when all individual operations + * performed by writers can be expressed using atomic RMW operations + * (so they can run lockless) and readers only need a way to get an atomic + * view over all individual atomic values: like writers atomically updating + * multiple counters, and readers wanting to observe a consistent state + * across all these counters. + * + * For now, only the raw variant is implemented, that doesn't perform any + * lockdep checks. + * + * Copyright Red Hat, Inc. 2023 + * + * Author(s): David Hildenbrand + */ + +typedef struct raw_atomic_seqcount { + atomic_long_t sequence; +} raw_atomic_seqcount_t; + +#define raw_seqcount_init(s) atomic_long_set(&((s)->sequence), 0) + +#ifdef CONFIG_64BIT + +#define ATOMIC_SEQCOUNT_SHARED_WRITER 0x0000000000000001ul +/* 65536 CPUs */ +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000000fffful +/* We have 48bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000010000ul + +#else /* CONFIG_64BIT */ + +#define ATOMIC_SEQCOUNT_SHARED_WRITER 0x00000001ul +/* 64 CPUs */ +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul +#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x0000007ful +/* We have 25bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000080ul + +#endif /* CONFIG_64BIT */ + +#if CONFIG_NR_CPUS > ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX +#error "raw_atomic_seqcount_t does not support such large CONFIG_NR_CPUS" +#endif + +/** + * raw_read_atomic_seqcount() - read the raw_atomic_seqcount_t counter value + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_read_atomic_seqcount() opens a read critical section of the given + * raw_atomic_seqcount_t, and without checking or masking the sequence counter + * LSBs (using ATOMIC_SEQCOUNT_WRITERS_MASK). Calling code is responsible for + * handling that. + * + * Return: count to be passed to raw_read_atomic_seqcount_retry() + */ +static inline unsigned long raw_read_atomic_seqcount(raw_atomic_seqcount_t *s) +{ + unsigned long seq = atomic_long_read(&s->sequence); + + /* Read the sequence before anything in the critical section */ + smp_rmb(); + return seq; +} + +/** + * raw_read_atomic_seqcount_begin() - begin a raw_seqcount_t read section + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_read_atomic_seqcount_begin() opens a read critical section of the + * given raw_seqcount_t. This function must not be used in interrupt context. 
+ * + * Return: count to be passed to raw_read_atomic_seqcount_retry() + */ +static inline unsigned long raw_read_atomic_seqcount_begin(raw_atomic_seqcount_t *s) +{ + unsigned long seq; + + BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(in_interrupt()); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + while ((seq = atomic_long_read(&s->sequence)) & + ATOMIC_SEQCOUNT_WRITERS_MASK) + cpu_relax(); + + /* Load the sequence before any load in the critical section. */ + smp_rmb(); + return seq; +} + +/** + * raw_read_atomic_seqcount_retry() - end a raw_seqcount_t read critical section + * @s: Pointer to the raw_atomic_seqcount_t + * @start: count, for example from raw_read_atomic_seqcount_begin() + * + * raw_read_atomic_seqcount_retry() closes the read critical section of the + * given raw_seqcount_t. If the critical section was invalid, it must be ignored + * (and typically retried). + * + * Return: true if a read section retry is required, else false + */ +static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s, + unsigned long start) +{ + /* Load the sequence after any load in the critical section. */ + smp_rmb(); + return unlikely(atomic_long_read(&s->sequence) != start); +} + +/** + * raw_write_seqcount_begin() - start a raw_seqcount_t write critical section + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_write_seqcount_begin() opens the write critical section of the + * given raw_seqcount_t. This function must not be used in interrupt context. + */ +static inline void raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s) +{ + BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(in_interrupt()); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + preempt_disable(); + atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); + /* Store the sequence before any store in the critical section. */ + smp_mb__after_atomic(); +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON((atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > + ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ +} + +/** + * raw_write_seqcount_end() - end a raw_seqcount_t write critical section + * @s: Pointer to the raw_atomic_seqcount_t + * + * raw_write_seqcount_end() closes the write critical section of the + * given raw_seqcount_t. + */ +static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s) +{ +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + /* Store the sequence after any store in the critical section. */ + smp_mb__before_atomic(); + atomic_long_add(ATOMIC_SEQCOUNT_SEQUENCE_STEP - + ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); + preempt_enable(); +} + +#endif /* __LINUX_ATOMIC_SEQLOCK_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index cc7d53d9dc01..569c2c6ed47f 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1298,6 +1298,7 @@ config PROVE_LOCKING select DEBUG_MUTEXES if !PREEMPT_RT select DEBUG_RT_MUTEXES if RT_MUTEXES select DEBUG_RWSEMS + select DEBUG_ATOMIC_SEQCOUNT if !PREEMPT_RT select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC select PREEMPT_COUNT if !ARCH_NO_PREEMPT @@ -1425,6 +1426,16 @@ config DEBUG_RWSEMS This debugging feature allows mismatched rw semaphore locks and unlocks to be detected and reported. 
+config DEBUG_ATOMIC_SEQCOUNT + bool "Atomic seqcount debugging: basic checks" + depends on DEBUG_KERNEL && !PREEMPT_RT + help + This feature allows some atomic seqcount semantics violations to be + detected and reported. + + The debug checks are only performed when running code that actively + uses atomic seqcounts; there are no dedicated test cases yet. + config DEBUG_LOCK_ALLOC bool "Lock debugging: detect incorrect freeing of live locks" depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT From patchwork Fri Nov 24 13:26:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169423 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191397vqx; Fri, 24 Nov 2023 05:29:09 -0800 (PST) X-Google-Smtp-Source: AGHT+IFKMURYCnnzcMlYrhtFHxhM/ZYgC+Wtxr3W7HdzVzh6Uq4tUn5TvyzvI45j08Ym3SiRWHbL X-Received: by 2002:a17:902:e84f:b0:1cc:4a47:1fe5 with SMTP id t15-20020a170902e84f00b001cc4a471fe5mr3272056plg.59.1700832549355; Fri, 24 Nov 2023 05:29:09 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832549; cv=none; d=google.com; s=arc-20160816; b=FnunE8bpYBcsXJdW9KFwl3t0BFjBIq/aKQWEtBcd90Mp8LSeE8pTOjISMrcEgwozpG dDAxzGTl3x7FcC5FTvByuTE10wBos4meDmVRVQ3jwy3oJl5MhhajtgP5rpJtqbVvJPft NgmS2h5Ms0VsYxQA/9kX1JPL5j1HdytB0oUjYe+PvcT8glDN7dXtTQ4rhGrzdWPtcsW2 yWeJ3mngX+Rjz/ljOPjlhk8IRLI01lTbpL5F76TbkOAMdpAwo+aHBk7D3vPiWLfBTG1J PEZ6HxEtyu4VEgBfLVwwE+ngPLeqcF7273zzwFpoeV2LXLOpDiJz+ybjfIiNIqHoquno Krbw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=a5yfSJmvw+walwOxM5XR2U+SrOM6iOWFmuLu46mrnEU=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=Orz5qWBGbY8+0Plntx/Ru7gvV1sFAOUtF0HMPTELwJOzFMoLzjBKQD97UWa7kwypJY il7a2ew6cPHZ97UD/7K0Pzm/eQmEswoGO9+k9OpGGUTfPM65pgXXRGt+NQmLTw3OTOGs NJ9E/uZl+lWyaqQ674phsMIpnxzx+tPijqm2gxY7DmFjzzx9/yczNINLruAE01VeGcQX 24P4c8ankDRvuR/sK21D5ctICiDRvXbgUOYB3Ans4xNni86glnA/A+eqsZ3u9jASiDlJ P1wc/1u9h5gIfH+Q7myv13YsuJFedrU4F0/RfzKQn/QuePLX0WuvFAWN+a1stE5JL+2A 31KA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=blneaO6V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from snail.vger.email (snail.vger.email. 
[23.128.96.37]) by mx.google.com with ESMTPS id y2-20020a170902700200b001c7264c458dsi3318846plk.181.2023.11.24.05.29.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:09 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=blneaO6V; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id 1B0C58191477; Fri, 24 Nov 2023 05:28:02 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345657AbjKXN1d (ORCPT + 99 others); Fri, 24 Nov 2023 08:27:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37512 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235293AbjKXN1F (ORCPT ); Fri, 24 Nov 2023 08:27:05 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 55A5019AE for ; Fri, 24 Nov 2023 05:27:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832420; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=a5yfSJmvw+walwOxM5XR2U+SrOM6iOWFmuLu46mrnEU=; b=blneaO6Von/7If1kkxAo+LYa5rv//QxgVYhvipio1uf59ixRdPSzdR41gUbs9hq+vqvk5M 2ZAC6xSD6EvmYcR0DOSBpfzjKvkeJ7koYcTBinUVIQCNVd6HWHqwLzo4XGT6PvnNzVdQlA 45zF0teKtWOwMSDeBPtLcGhN7cMj5U8= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-689-ORZ29QTLNFSiOT-xdaI3QA-1; Fri, 24 Nov 2023 08:26:57 -0500 X-MC-Unique: ORZ29QTLNFSiOT-xdaI3QA-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id BD5003C108C7; Fri, 24 Nov 2023 13:26:56 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 09AB92166B2A; Fri, 24 Nov 2023 13:26:53 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 07/20] mm/rmap_id: track if one ore multiple MMs map a partially-mappable folio Date: Fri, 24 Nov 2023 14:26:12 +0100 Message-ID: <20231124132626.235350-8-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:28:03 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452190938775369 X-GMAIL-MSGID: 1783452190938775369 In contrast to small folios and hugetlb folios, for a partially-mappable folio (i.e., THP), the total mapcount is often not expressive to identify whether such a folio is "mapped shared" or "mapped exclusively". For small folios and hugetlb folios that are always entirely mapped, the single mapcount is traditionally used for that purpose: is it 1? Then the folio is currently mapped exclusively; is it bigger than 1? Then it's mapped at least twice, and, therefore, considered "mapped shared". For a partially-mappable folio, each individual PTE/PMD/... mapping requires exactly one folio reference and one folio mapcount; folio_mapcount() > 1 does not imply that the folio is "mapped shared". While there are some obvious cases when we can conclude that partially-mappable folios are "mapped shared" -- see folio_mapped_shared() -- but it is currently not always possible to precisely tell whether a folio is "mapped exclusively". For implementing a precise variant of folio_mapped_shared() and for COW-reuse support of PTE-mapped anon THP, we need an efficient and precise way to identify "mapped shared" vs. "mapped exclusively". So how could we track if more than one MM is currently mapping a folio in its page tables? Having a list of MMs per folio, or even a counter for each MM for each folio is clearly not feasible. ... but what if we could play some fun math games to perform this tracking while requiring a handful of counters per folio, the exact number of counters depending on the size of the folio? 1. !!! Experimental Feature !!! =============================== We'll only support CONFIG_64BIT and !CONFIG_PREEMPT_RT (implied by THP support) for now. As we currently never get partially-mappable folios without CONFIG_TRANSPARENT_HUGEPAGE, let's limit to that to avoid unnecessary rmap ID allocations for setups without THP. 32bit support might be possible if there is demand, limiting it to 64k rmap IDs and reasonably sized folio sizes (e.g., <= order-15). Similarly, RT might be possible if there is ever real demand for it. The feature will be experimental initially, and, therefore, disabled as default. Once the involved math is considered solid, the implementation saw extended testing, and the performance implications are clear and have either been optimized (e.g., rmap batching) or mitigated (e.g., do we really have to perform this tracking for folios that are always assumed shared, like folios mapping executables or shared libraries? 
Is some hardware problematic?), we can consider always enabling it as default. 2. Per-mm rmap IDs ================== We'll have to assign each MM an rmap ID that is smaller than 16*1024*1024 on 64bit. Note that these are significantly more than the maximum number of processes we can possibly have in the system. There isn't really a difference between supporting 16M IDs and 2M/4M IDs. Due to the ID size limitation, we cannot use the MM pointer value and need a separate ID allocator. Maybe, we want to cache some rmap IDs per CPU? Maybe we want to improve the allocation path? We can add such improvements when deemed necessary. In the distant future, we might want to allocate rmap IDs for selected VMAs: for example, imagine a systemcall that does something like fork (COW-sharing of pages) within a process for a range of anonymous memory, ending up with a new VMA that wants a separate rmap ID. For now, per-MM is simple and sufficient. 3. Tracking Overview ==================== We derive a sequence of special sub-IDs from our MM rmap ID. Any time we map/unmap a part (e.g., PTE, PMD) of a partially-mappable folio to/from a MM, we: (1) Adjust (increment/decrement) the mapcount of the folio (2) Adjust (add/remove) the folio rmap values using the MM sub-IDs So the rmap values are always linked to the folio mapcount. Consequently, we know that a single rmap value in the folio is the sum of exactly #folio_mapcount() rmap sub-IDs. To identify whether a single MM is responsible for all folio_mapcount() mappings of a folio ("mapped exclusively") or whether other MMs are involved ("mapped shared"), we perform the following checks: (1) Do we have more mappings than the folio has pages? Then the folio is certainly shared. That is, when "folio_mapcount() > folio_nr_pages()" (2) For each rmap value X, does that rmap value folio->_rmap_valX correspond to "folio_mapcount() * sub-ID[X]" of the MM? Then the folio is certainly exclusive. Note that we only check that when "folio_mapcount() <= folio_nr_pages()". 4. Synchronization ================== We're using an atomic seqcount, stored in the folio, to allow for readers to detect concurrent (un)mapping, whereby they could obtain a wrong snapshot of the mapcount+rmap values and make a wrong decision. Further, the mapcount and all rmap values are updated using RMW atomics, to allow for concurrent updates. 5. sub-IDs ========== To achieve (2), we generate sub-IDs that have the following property, assuming that our folio has P=folio_nr_pages() pages. "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs ... "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs The sub-IDs are generated in generations, whereby (1) Generation #0 is the number 0 (2) Generation #N takes all numbers from generations #0..#N-1 and adds (P + 1)^(N - 1), effectively doubling the number of sub-IDs Consequently, the smallest number S in gen #N is: S[#N] = (P + 1)^(N - 1) The largest number L in gen #N is: L[#N] = (P + 1)^(N - 1) + (P + 1)^(N - 2) + ... (P + 1)^0 + 0. 
-> [geometric sum with "P + 1 != 1"] = (1 - (P + 1)^N) / (1 - (P + 1)) = (1 - (P + 1)^N) / (-P) = ((P + 1)^N - 1) / P Example with P=4 (order-2 folio): Generation #0: 0 ------------------------ + (4 + 1)^0 = 1 Generation #1: 1 ------------------------ + (4 + 1)^1 = 5 Generation #2: 5 6 ------------------------ + (4 + 1)^2 = 25 Generation #3: 25 26 30 31 ------------------------ + (4 + 1)^3 = 125 [...] Intuitively, we are working with sub-counters that cannot overflow as long as we have <= P components. Let's consider the simple case of P=3, whereby our sub-counters are exactly 2-bit wide. Subid | Bits | Sub-counters -------------------------------- 0 | 0000 0000 | 0,0,0,0 1 | 0000 0001 | 0,0,0,1 4 | 0000 0100 | 0,0,1,0 5 | 0000 0101 | 0,0,1,1 16 | 0001 0000 | 0,1,0,0 17 | 0001 0001 | 0,1,0,1 20 | 0001 0100 | 0,1,1,0 21 | 0001 0101 | 0,1,1,1 64 | 0100 0000 | 1,0,0,0 65 | 0100 0001 | 1,0,0,1 68 | 0100 0100 | 1,0,1,0 69 | 0100 0101 | 1,0,1,1 80 | 0101 0100 | 1,1,0,0 81 | 0101 0001 | 1,1,0,1 84 | 0101 0100 | 1,1,1,0 85 | 0101 0101 | 1,1,1,1 So if we, say, have: 3 * 17 = 0,3,0,3 how could we possible get to that number by using 3 other subids? It's impossible, because the sub-counters won't overflow as long as we stay <= 3. Interesting side note that might come in handy at some point: we also cannot get to 0,3,0,3 by using 1 or 2 other subids. But, we could get to 1 * 17 = 0,1,0,1 by using 2 subids (16 and 1) or similarly to 2 * 17 = 0,2,0,2 by using 4 subids (2x16 and 2x1). Looks like we cannot get to X * subid using any 1..X other subids. Note 1: we'll add the actual detection logic used to be used by folio_mapped_shared() and wp_can_reuse_anon_folio() separately. Note 2: we might want to use that infrastructure for hugetlb as well in the future: there is nothing THP-specific about rmap ID handling. Signed-off-by: David Hildenbrand --- include/linux/mm_types.h | 58 +++++++ include/linux/rmap.h | 126 +++++++++++++- kernel/fork.c | 26 +++ mm/Kconfig | 21 +++ mm/Makefile | 1 + mm/huge_memory.c | 16 +- mm/init-mm.c | 4 + mm/page_alloc.c | 9 + mm/rmap_id.c | 351 +++++++++++++++++++++++++++++++++++++++ 9 files changed, 604 insertions(+), 8 deletions(-) create mode 100644 mm/rmap_id.c diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 99b84b4797b9..75305c57ef64 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -273,6 +274,14 @@ typedef struct { * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h. * @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head(). * @_deferred_list: Folios to be split under memory pressure. + * @_rmap_atomic_seqcount: Seqcount protecting _total_mapcount and _rmapX. + * Does not apply to hugetlb. + * @_rmap_val0 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val1 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val2 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val3 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val4 Do not use outside of rmap code. Does not apply to hugetlb. + * @_rmap_val5 Do not use outside of rmap code. Does not apply to hugetlb. * * A folio is a physically, virtually and logically contiguous set * of bytes. 
It is a power-of-two in size, and it is aligned to that @@ -331,6 +340,9 @@ struct folio { atomic_t _pincount; #ifdef CONFIG_64BIT unsigned int _folio_nr_pages; +#ifdef CONFIG_RMAP_ID + raw_atomic_seqcount_t _rmap_atomic_seqcount; +#endif /* CONFIG_RMAP_ID */ #endif /* private: the union with struct page is transitional */ }; @@ -356,6 +368,34 @@ struct folio { }; struct page __page_2; }; + union { + struct { + unsigned long _flags_3; + unsigned long _head_3; + /* public: */ +#ifdef CONFIG_RMAP_ID + atomic_long_t _rmap_val0; + atomic_long_t _rmap_val1; + atomic_long_t _rmap_val2; + atomic_long_t _rmap_val3; +#endif /* CONFIG_RMAP_ID */ + /* private: the union with struct page is transitional */ + }; + struct page __page_3; + }; + union { + struct { + unsigned long _flags_4; + unsigned long _head_4; + /* public: */ +#ifdef CONFIG_RMAP_ID + atomic_long_t _rmap_val4; + atomic_long_t _rmap_val5; +#endif /* CONFIG_RMAP_ID */ + /* private: the union with struct page is transitional */ + }; + struct page __page_4; + }; }; #define FOLIO_MATCH(pg, fl) \ @@ -392,6 +432,20 @@ FOLIO_MATCH(compound_head, _head_2); FOLIO_MATCH(flags, _flags_2a); FOLIO_MATCH(compound_head, _head_2a); #undef FOLIO_MATCH +#define FOLIO_MATCH(pg, fl) \ + static_assert(offsetof(struct folio, fl) == \ + offsetof(struct page, pg) + 3 * sizeof(struct page)) +FOLIO_MATCH(flags, _flags_3); +FOLIO_MATCH(compound_head, _head_3); +#undef FOLIO_MATCH +#undef FOLIO_MATCH +#define FOLIO_MATCH(pg, fl) \ + static_assert(offsetof(struct folio, fl) == \ + offsetof(struct page, pg) + 4 * sizeof(struct page)) +FOLIO_MATCH(flags, _flags_4); +FOLIO_MATCH(compound_head, _head_4); +#undef FOLIO_MATCH + /** * struct ptdesc - Memory descriptor for page tables. @@ -975,6 +1029,10 @@ struct mm_struct { #endif } lru_gen; #endif /* CONFIG_LRU_GEN */ + +#ifdef CONFIG_RMAP_ID + int mm_rmap_id; +#endif /* CONFIG_RMAP_ID */ } __randomize_layout; /* diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 9d5c2ed6ced5..19c9dc3216df 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -168,6 +168,116 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, struct anon_vma *folio_get_anon_vma(struct folio *folio); +#ifdef CONFIG_RMAP_ID +/* + * For init_mm and friends, we don't actually expect to ever rmap pages. So + * we use a reserved dummy ID that we'll never hand out the normal way. + */ +#define RMAP_ID_DUMMY 0 +#define RMAP_ID_MIN (RMAP_ID_DUMMY + 1) +#define RMAP_ID_MAX (16 * 1024 * 1024u - 1) + +void free_rmap_id(int id); +int alloc_rmap_id(void); + +#define RMAP_SUBID_4_MAX_ORDER 10 +#define RMAP_SUBID_5_MIN_ORDER 11 +#define RMAP_SUBID_5_MAX_ORDER 12 +#define RMAP_SUBID_6_MIN_ORDER 13 +#define RMAP_SUBID_6_MAX_ORDER 15 + +static inline void __folio_prep_large_rmap(struct folio *folio) +{ + const unsigned int order = folio_order(folio); + + raw_seqcount_init(&folio->_rmap_atomic_seqcount); + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + atomic_long_set(&folio->_rmap_val5, 0); + fallthrough; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... 
RMAP_SUBID_5_MAX_ORDER: + atomic_long_set(&folio->_rmap_val4, 0); + fallthrough; +#endif + default: + atomic_long_set(&folio->_rmap_val3, 0); + atomic_long_set(&folio->_rmap_val2, 0); + atomic_long_set(&folio->_rmap_val1, 0); + atomic_long_set(&folio->_rmap_val0, 0); + break; + } +} + +static inline void __folio_undo_large_rmap(struct folio *folio) +{ +#ifdef CONFIG_DEBUG_VM + const unsigned int order = folio_order(folio); + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val5)); + fallthrough; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val4)); + fallthrough; +#endif + default: + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val3)); + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val2)); + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val1)); + VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val0)); + break; + } +#endif +} + +static inline void __folio_write_large_rmap_begin(struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount); +} + +static inline void __folio_write_large_rmap_end(struct folio *folio) +{ + raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount); +} + +void __folio_set_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm); +void __folio_add_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm); +#else +static inline void __folio_prep_large_rmap(struct folio *folio) +{ +} +static inline void __folio_undo_large_rmap(struct folio *folio) +{ +} +static inline void __folio_write_large_rmap_begin(struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); + VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); +} +static inline void __folio_write_large_rmap_end(struct folio *folio) +{ +} +static inline void __folio_set_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ +} +static inline void __folio_add_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ +} +#endif /* CONFIG_RMAP_ID */ + static inline void folio_set_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { @@ -175,30 +285,34 @@ static inline void folio_set_large_mapcount(struct folio *folio, VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); /* increment count (starts at -1) */ atomic_set(&folio->_total_mapcount, count - 1); + __folio_set_large_rmap_val(folio, count, vma->vm_mm); } static inline void folio_inc_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); - VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + __folio_write_large_rmap_begin(folio); atomic_inc(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, 1, vma->vm_mm); + __folio_write_large_rmap_end(folio); } static inline void folio_add_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { - VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); - VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + __folio_write_large_rmap_begin(folio); atomic_add(count, &folio->_total_mapcount); + __folio_add_large_rmap_val(folio, count, vma->vm_mm); + __folio_write_large_rmap_end(folio); } static inline void folio_dec_large_mapcount(struct 
folio *folio, struct vm_area_struct *vma) { - VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); - VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + __folio_write_large_rmap_begin(folio); atomic_dec(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, -1, vma->vm_mm); + __folio_write_large_rmap_end(folio); } /* RMAP flags, currently only relevant for some anon rmap operations. */ diff --git a/kernel/fork.c b/kernel/fork.c index 10917c3e1f03..773c93613ca2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -814,6 +814,26 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) #define mm_free_pgd(mm) #endif /* CONFIG_MMU */ +#ifdef CONFIG_RMAP_ID +static inline int mm_alloc_rmap_id(struct mm_struct *mm) +{ + int id = alloc_rmap_id(); + + if (id < 0) + return id; + mm->mm_rmap_id = id; + return 0; +} + +static inline void mm_free_rmap_id(struct mm_struct *mm) +{ + free_rmap_id(mm->mm_rmap_id); +} +#else +#define mm_alloc_rmap_id(mm) (0) +#define mm_free_rmap_id(mm) +#endif /* CONFIG_RMAP_ID */ + static void check_mm(struct mm_struct *mm) { int i; @@ -917,6 +937,7 @@ void __mmdrop(struct mm_struct *mm) WARN_ON_ONCE(mm == current->active_mm); mm_free_pgd(mm); + mm_free_rmap_id(mm); destroy_context(mm); mmu_notifier_subscriptions_destroy(mm); check_mm(mm); @@ -1298,6 +1319,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (mm_alloc_pgd(mm)) goto fail_nopgd; + if (mm_alloc_rmap_id(mm)) + goto fail_normapid; + if (init_new_context(p, mm)) goto fail_nocontext; @@ -1317,6 +1341,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, fail_cid: destroy_context(mm); fail_nocontext: + mm_free_rmap_id(mm); +fail_normapid: mm_free_pgd(mm); fail_nopgd: free_mm(mm); diff --git a/mm/Kconfig b/mm/Kconfig index 89971a894b60..bb0b7b885ada 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -861,6 +861,27 @@ choice benefit. endchoice +menuconfig RMAP_ID + bool "Rmap ID tracking (EXPERIMENTAL)" + depends on TRANSPARENT_HUGEPAGE && 64BIT + help + Use per-MM rmap IDs and the unleashed power of math to track + whether partially-mappable hugepages (i.e., THPs for now) are + "mapped shared" or "mapped exclusively". + + This tracking allow for efficiently and precisely detecting + whether a PTE-mapped THP is mapped by a single process + ("mapped exclusively") or mapped by multiple ones ("mapped + shared"), with the cost of additional tracking when (un)mapping + (parts of) such a THP. + + If this configuration is not enabled, an heuristic is used + instead that might result in false "mapped exclusively" + detection; some features relying on this information might + operate slightly imprecise (e.g., MADV_PAGEOUT succeeds although + it should fail) or might not be available at all (e.g., + Copy-on-Write reuse support). 
+ config THP_SWAP def_bool y depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP && 64BIT diff --git a/mm/Makefile b/mm/Makefile index 33873c8aedb3..b0cf2563f33a 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o +obj-$(CONFIG_RMAP_ID) += rmap_id.o diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 51a878efca0e..0228b04c4053 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -614,6 +614,7 @@ void folio_prep_large_rmappable(struct folio *folio) { VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); INIT_LIST_HEAD(&folio->_deferred_list); + __folio_prep_large_rmap(folio); folio_set_large_rmappable(folio); } @@ -2478,8 +2479,8 @@ static void __split_huge_page_tail(struct folio *folio, int tail, (1L << PG_dirty) | LRU_GEN_MASK | LRU_REFS_MASK)); - /* ->mapping in first and second tail page is replaced by other uses */ - VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, + /* ->mapping in some tail page is replaced by other uses */ + VM_BUG_ON_PAGE(tail > 4 && page_tail->mapping != TAIL_MAPPING, page_tail); page_tail->mapping = head->mapping; page_tail->index = head->index + tail; @@ -2550,6 +2551,16 @@ static void __split_huge_page(struct page *page, struct list_head *list, ClearPageHasHWPoisoned(head); +#ifdef CONFIG_RMAP_ID + /* + * Make sure folio->_rmap_atomic_seqcount, which overlays + * tail->private, is 0. All other folio->_rmap_valX should be 0 + * after unmapping the folio. + */ + if (likely(nr >= 4)) + raw_seqcount_init(&folio->_rmap_atomic_seqcount); +#endif /* CONFIG_RMAP_ID */ + for (i = nr - 1; i >= 1; i--) { __split_huge_page_tail(folio, i, lruvec, list); /* Some pages can be beyond EOF: drop them from page cache */ @@ -2809,6 +2820,7 @@ void folio_undo_large_rmappable(struct folio *folio) struct deferred_split *ds_queue; unsigned long flags; + __folio_undo_large_rmap(folio); /* * At this point, there is no one trying to add the folio to * deferred_list. If folio is not in deferred_list, it's safe diff --git a/mm/init-mm.c b/mm/init-mm.c index cfd367822cdd..8890271b50c6 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include @@ -46,6 +47,9 @@ struct mm_struct init_mm = { .cpu_bitmap = CPU_BITS_NONE, #ifdef CONFIG_IOMMU_SVA .pasid = IOMMU_PASID_INVALID, +#endif +#ifdef CONFIG_RMAP_ID + .mm_rmap_id = RMAP_ID_DUMMY, #endif INIT_MM_CONTEXT(init_mm) }; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aad45758c0c7..c1dd039801e7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1007,6 +1007,15 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) * deferred_list.next -- ignore value. */ break; +#ifdef CONFIG_RMAP_ID + case 3: + case 4: + /* + * the third and fourth tail page: ->mapping may be + * used to store RMAP values for RMAP ID tracking. + */ + break; +#endif /* CONFIG_RMAP_ID */ default: if (page->mapping != TAIL_MAPPING) { bad_page(page, "corrupted mapping in tail page"); diff --git a/mm/rmap_id.c b/mm/rmap_id.c new file mode 100644 index 000000000000..e66b0f5aea2d --- /dev/null +++ b/mm/rmap_id.c @@ -0,0 +1,351 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * rmap ID tracking for precise "mapped shared" vs. "mapped exclusively" + * detection of partially-mappable folios (e.g., PTE-mapped THP). + * + * Copyright Red Hat, Inc. 
2023
+ *
+ * Author(s): David Hildenbrand
+ */
+
+#include
+#include
+#include
+
+#include "internal.h"
+
+static DEFINE_SPINLOCK(rmap_id_lock);
+static DEFINE_IDA(rmap_ida);
+
+/* For now we only expect folios from the buddy, not hugetlb folios. */
+#if MAX_ORDER > RMAP_SUBID_6_MAX_ORDER
+#error "rmap ID tracking does not support such large MAX_ORDER"
+#endif
+
+/*
+ * We assign each MM a unique rmap ID and derive from it a sequence of
+ * special sub-IDs. We add/remove these sub-IDs to/from the corresponding
+ * folio rmap values (folio->rmap_valX) whenever (un)mapping (parts of) a
+ * partially mappable folio.
+ *
+ * With 24bit rmap IDs, and a folio size that is compatible with 4
+ * rmap values (more below), we calculate the sub-ID sequence like this:
+ *
+ *  rmap ID    : | 3 3 3 3 3 3 | 2 2 2 2 2 2 | 1 1 1 1 1 1 | 0 0 0 0 0 0 |
+ *  sub-ID IDX : |   IDX #3    |   IDX #2    |   IDX #1    |   IDX #0    |
+ *
+ *  sub-IDs    : [ subid_4(#3), subid_4(#2), subid_4(#1), subid_4(#0) ]
+ *  rmap value : [ _rmap_val3,  _rmap_val2,  _rmap_val1,  _rmap_val0  ]
+ *
+ * Any time we map/unmap a part (e.g., PTE, PMD) of a partially-mappable
+ * folio to/from a MM, we:
+ *  (1) Adjust (increment/decrement) the mapcount of the folio
+ *  (2) Adjust (add/remove) the folio rmap values using the MM sub-IDs
+ *
+ * So the rmap values are always linked to the folio mapcount.
+ * Consequently, we know that a single rmap value in the folio is the sum
+ * of exactly #folio_mapcount() rmap sub-IDs. As one example, if the folio
+ * is completely unmapped, the rmap values must be 0. As another example,
+ * if the folio is mapped exactly once, the rmap values correspond to the
+ * MM sub-IDs.
+ *
+ * To identify whether a given MM is responsible for all #folio_mapcount()
+ * mappings of a folio ("mapped exclusively") or whether other MMs are
+ * involved ("mapped shared"), we perform the following checks:
+ *  (1) Do we have more mappings than the folio has pages? Then the folio
+ *      is mapped shared. So when "folio_mapcount() > folio_nr_pages()".
+ *  (2) Do the rmap values correspond to "#folio_mapcount() * sub-IDs" of
+ *      the MM? Then the folio is mapped exclusively.
+ *
+ * To achieve (2), we generate sub-IDs that have the following property,
+ * assuming that our folio has P=folio_nr_pages() pages.
+ *  "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs
+ *  "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs
+ *  "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs
+ *  ...
+ *  "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs
+ *
+ * Further, we want "P * sub-ID" (the maximum number we will ever look at)
+ * to not overflow. If we overflow with " > P" mappings, we don't care as
+ * we won't be looking at the numbers until they are fully expressive
+ * again.
+ *
+ * Consequently, to not overflow 64bit values with "P * sub-ID", folios
+ * with large P require more rmap values (we cannot generate that many
+ * sub-IDs), whereby folios with smaller P can get away with fewer rmap
+ * values (we can generate more sub-IDs).
+ *
+ * The sub-IDs are generated in generations, whereby
+ *  (1) Generation #0 is the number 0
+ *  (2) Generation #N takes all numbers from generations #0..#N-1 and adds
+ *      (P + 1)^(N - 1), effectively doubling the number of sub-IDs
+ *
+ * Note: a PMD-sized THP can, for a short time while PTE-mapping it, be
+ * mapped using PTEs and a single PMD, resulting in "P + 1" mappings.
+ * For now, we don't consider this case, as we are ususally not + * looking at such folios while they being remapped, because the + * involved page tables are locked and stop any page table walkers. + */ + +/* + * With 1024 (order-10) possible exclusive mappings per folio, we can have 64 + * sub-IDs per 64bit value. + * + * With 4 such 64bit values, we can support 64^4 == 16M IDs. + */ +static const unsigned long rmap_subids_4[64] = { + 0ul, + 1ul, + 1025ul, + 1026ul, + 1050625ul, + 1050626ul, + 1051650ul, + 1051651ul, + 1076890625ul, + 1076890626ul, + 1076891650ul, + 1076891651ul, + 1077941250ul, + 1077941251ul, + 1077942275ul, + 1077942276ul, + 1103812890625ul, + 1103812890626ul, + 1103812891650ul, + 1103812891651ul, + 1103813941250ul, + 1103813941251ul, + 1103813942275ul, + 1103813942276ul, + 1104889781250ul, + 1104889781251ul, + 1104889782275ul, + 1104889782276ul, + 1104890831875ul, + 1104890831876ul, + 1104890832900ul, + 1104890832901ul, + 1131408212890625ul, + 1131408212890626ul, + 1131408212891650ul, + 1131408212891651ul, + 1131408213941250ul, + 1131408213941251ul, + 1131408213942275ul, + 1131408213942276ul, + 1131409289781250ul, + 1131409289781251ul, + 1131409289782275ul, + 1131409289782276ul, + 1131409290831875ul, + 1131409290831876ul, + 1131409290832900ul, + 1131409290832901ul, + 1132512025781250ul, + 1132512025781251ul, + 1132512025782275ul, + 1132512025782276ul, + 1132512026831875ul, + 1132512026831876ul, + 1132512026832900ul, + 1132512026832901ul, + 1132513102671875ul, + 1132513102671876ul, + 1132513102672900ul, + 1132513102672901ul, + 1132513103722500ul, + 1132513103722501ul, + 1132513103723525ul, + 1132513103723526ul, +}; + +static unsigned long get_rmap_subid_4(struct mm_struct *mm, int nr) +{ + const unsigned int rmap_id = mm->mm_rmap_id; + + VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 3); + return rmap_subids_4[(rmap_id >> (nr * 6)) & 0x3f]; +} + +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER +/* + * With 4096 (order-12) possible exclusive mappings per folio, we can have + * 32 sub-IDs per 64bit value. + * + * With 5 such 64bit values, we can support 32^5 > 16M IDs. + */ +static const unsigned long rmap_subids_5[32] = { + 0ul, + 1ul, + 4097ul, + 4098ul, + 16785409ul, + 16785410ul, + 16789506ul, + 16789507ul, + 68769820673ul, + 68769820674ul, + 68769824770ul, + 68769824771ul, + 68786606082ul, + 68786606083ul, + 68786610179ul, + 68786610180ul, + 281749955297281ul, + 281749955297282ul, + 281749955301378ul, + 281749955301379ul, + 281749972082690ul, + 281749972082691ul, + 281749972086787ul, + 281749972086788ul, + 281818725117954ul, + 281818725117955ul, + 281818725122051ul, + 281818725122052ul, + 281818741903363ul, + 281818741903364ul, + 281818741907460ul, + 281818741907461ul, +}; + +static unsigned long get_rmap_subid_5(struct mm_struct *mm, int nr) +{ + const unsigned int rmap_id = mm->mm_rmap_id; + + VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 4); + return rmap_subids_5[(rmap_id >> (nr * 5)) & 0x1f]; +} +#endif + +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER +/* + * With 32768 (order-15) possible exclusive mappings per folio, we can have + * 16 sub-IDs per 64bit value. + * + * With 6 such 64bit values, we can support 8^6 == 16M IDs. 
+ */ +static const unsigned long rmap_subids_6[16] = { + 0ul, + 1ul, + 32769ul, + 32770ul, + 1073807361ul, + 1073807362ul, + 1073840130ul, + 1073840131ul, + 35187593412609ul, + 35187593412610ul, + 35187593445378ul, + 35187593445379ul, + 35188667219970ul, + 35188667219971ul, + 35188667252739ul, + 35188667252740ul, +}; + +static unsigned long get_rmap_subid_6(struct mm_struct *mm, int nr) +{ + const unsigned int rmap_id = mm->mm_rmap_id; + + VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 15); + return rmap_subids_6[(rmap_id >> (nr * 4)) & 0xf]; +} +#endif + +void __folio_set_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_6(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_6(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_6(mm, 2) * count); + atomic_long_set(&folio->_rmap_val3, get_rmap_subid_6(mm, 3) * count); + atomic_long_set(&folio->_rmap_val4, get_rmap_subid_6(mm, 4) * count); + atomic_long_set(&folio->_rmap_val5, get_rmap_subid_6(mm, 5) * count); + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_5(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_5(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_5(mm, 2) * count); + atomic_long_set(&folio->_rmap_val3, get_rmap_subid_5(mm, 3) * count); + atomic_long_set(&folio->_rmap_val4, get_rmap_subid_5(mm, 4) * count); + break; +#endif + default: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_4(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_4(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_4(mm, 2) * count); + atomic_long_set(&folio->_rmap_val3, get_rmap_subid_4(mm, 3) * count); + break; + } +} + +void __folio_add_large_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + atomic_long_add(get_rmap_subid_6(mm, 0) * count, &folio->_rmap_val0); + atomic_long_add(get_rmap_subid_6(mm, 1) * count, &folio->_rmap_val1); + atomic_long_add(get_rmap_subid_6(mm, 2) * count, &folio->_rmap_val2); + atomic_long_add(get_rmap_subid_6(mm, 3) * count, &folio->_rmap_val3); + atomic_long_add(get_rmap_subid_6(mm, 4) * count, &folio->_rmap_val4); + atomic_long_add(get_rmap_subid_6(mm, 5) * count, &folio->_rmap_val5); + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... 
RMAP_SUBID_5_MAX_ORDER:
+		atomic_long_add(get_rmap_subid_5(mm, 0) * count, &folio->_rmap_val0);
+		atomic_long_add(get_rmap_subid_5(mm, 1) * count, &folio->_rmap_val1);
+		atomic_long_add(get_rmap_subid_5(mm, 2) * count, &folio->_rmap_val2);
+		atomic_long_add(get_rmap_subid_5(mm, 3) * count, &folio->_rmap_val3);
+		atomic_long_add(get_rmap_subid_5(mm, 4) * count, &folio->_rmap_val4);
+		break;
+#endif
+	default:
+		atomic_long_add(get_rmap_subid_4(mm, 0) * count, &folio->_rmap_val0);
+		atomic_long_add(get_rmap_subid_4(mm, 1) * count, &folio->_rmap_val1);
+		atomic_long_add(get_rmap_subid_4(mm, 2) * count, &folio->_rmap_val2);
+		atomic_long_add(get_rmap_subid_4(mm, 3) * count, &folio->_rmap_val3);
+		break;
+	}
+}
+
+int alloc_rmap_id(void)
+{
+	int id;
+
+	/*
+	 * We cannot use a mutex, because free_rmap_id() might get called
+	 * when we are not allowed to sleep.
+	 *
+	 * TODO: do we need something like idr_preload()?
+	 */
+	spin_lock(&rmap_id_lock);
+	id = ida_alloc_range(&rmap_ida, RMAP_ID_MIN, RMAP_ID_MAX, GFP_ATOMIC);
+	spin_unlock(&rmap_id_lock);
+
+	return id;
+}
+
+void free_rmap_id(int id)
+{
+	if (id == RMAP_ID_DUMMY)
+		return;
+	if (WARN_ON_ONCE(id < RMAP_ID_MIN || id > RMAP_ID_MAX))
+		return;
+	spin_lock(&rmap_id_lock);
+	ida_free(&rmap_ida, id);
+	spin_unlock(&rmap_id_lock);
+}
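The hard-coded rmap_subids_4[] table above follows directly from the generation rule described in the comment. As a standalone illustration (plain userspace C, not kernel code and not part of this patch), the following sketch re-derives the 64 order-10 sub-IDs from that rule:

#include <stdio.h>

/*
 * Illustrative sketch: re-derive the 64 sub-IDs used for order-10 folios
 * (P = 1024 possible exclusive mappings) from the generation rule above:
 * generation #0 is {0}; generation #N adds (P + 1)^(N - 1) to everything
 * generated so far, doubling the number of sub-IDs each time.
 */
int main(void)
{
	const unsigned long long P = 1024;
	unsigned long long subids[64];
	unsigned long long step = 1;	/* (P + 1)^(N - 1), starting at (P + 1)^0 */
	unsigned int count = 1, i;

	subids[0] = 0;
	while (count < 64) {
		for (i = 0; i < count; i++)
			subids[count + i] = subids[i] + step;
		count *= 2;
		step *= P + 1;
	}

	/* Prints 0, 1, 1025, 1026, 1050625, ... matching rmap_subids_4[]. */
	for (i = 0; i < 64; i++)
		printf("%lluul,\n", subids[i]);
	return 0;
}

The same rule with base 4097 (order-12) or base 32769 (order-15) reproduces the rmap_subids_5[] and rmap_subids_6[] tables, respectively.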
From patchwork Fri Nov 24 13:26:13 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 08/20] mm: pass MM to folio_mapped_shared()
Date: Fri, 24 Nov 2023 14:26:13 +0100
Message-ID: <20231124132626.235350-9-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

We'll need the MM next to make a better decision regarding
partially-mappable folios (e.g., PTE-mapped THP) using per-MM rmap IDs.

Signed-off-by: David Hildenbrand
---
 include/linux/mm.h |  4 +++-
 mm/huge_memory.c   |  2 +-
 mm/madvise.c       |  6 +++---
 mm/memory.c        |  2 +-
 mm/mempolicy.c     | 14 +++++++-------
 mm/migrate.c       |  2 +-
 6 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17dac913f367..765e688690f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2117,6 +2117,7 @@ static inline size_t folio_size(struct folio *folio)
  * folio_mapped_shared - Report if a folio is certainly mapped by
  *			 multiple entities in their page tables
  * @folio: The folio.
+ * @mm: The mm the folio is mapped into.
  *
  * This function checks if a folio is certainly *currently* mapped by
  * multiple entities in their page table ("mapped shared") or if the folio
@@ -2153,7 +2154,8 @@ static inline size_t folio_size(struct folio *folio)
  *
  * Return: Whether the folio is certainly mapped by multiple entities.
  */
-static inline bool folio_mapped_shared(struct folio *folio)
+static inline bool folio_mapped_shared(struct folio *folio,
+		struct mm_struct *mm)
 {
 	unsigned int total_mapcount;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0228b04c4053..fd7251923557 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1639,7 +1639,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * If other processes are mapping this folio, we couldn't discard
 	 * the folio unless they all do MADV_FREE so let's skip the folio.
*/ - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) goto out; if (!folio_trylock(folio)) diff --git a/mm/madvise.c b/mm/madvise.c index 1a82867c8c2e..e3e4f3ea5f6d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -365,7 +365,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, folio = pfn_folio(pmd_pfn(orig_pmd)); /* Do not interfere with other mappings of this folio */ - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) goto huge_unlock; if (pageout_anon_only_filter && !folio_test_anon(folio)) @@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err; - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break; @@ -665,7 +665,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err; - if (folio_mapped_shared(folio)) + if (folio_mapped_shared(folio, mm)) break; if (!folio_trylock(folio)) break; diff --git a/mm/memory.c b/mm/memory.c index 14416d05e1b6..5048d58d6174 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4848,7 +4848,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * Flag if the folio is shared between multiple address spaces. This * is later used when determining whether to group tasks together */ - if (folio_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) + if (folio_mapped_shared(folio, vma->vm_mm) && (vma->vm_flags & VM_SHARED)) flags |= TNF_SHARED; nid = folio_nid(folio); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0492113497cc..bd0243da26bf 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -418,7 +418,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { }; static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, - unsigned long flags); + struct mm_struct *mm, unsigned long flags); static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, pgoff_t ilx, int *nid); @@ -481,7 +481,7 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk) return; if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || !vma_migratable(walk->vma) || - !migrate_folio_add(folio, qp->pagelist, qp->flags)) + !migrate_folio_add(folio, qp->pagelist, walk->mm, qp->flags)) qp->nr_failed++; } @@ -561,7 +561,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, } if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || !vma_migratable(vma) || - !migrate_folio_add(folio, qp->pagelist, flags)) { + !migrate_folio_add(folio, qp->pagelist, walk->mm, flags)) { qp->nr_failed++; if (strictly_unmovable(flags)) break; @@ -609,7 +609,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, * easily detect if a folio is shared. */ if ((flags & MPOL_MF_MOVE_ALL) || - (!folio_mapped_shared(folio) && !hugetlb_pmd_shared(pte))) + (!folio_mapped_shared(folio, walk->mm) && !hugetlb_pmd_shared(pte))) if (!isolate_hugetlb(folio, qp->pagelist)) qp->nr_failed++; unlock: @@ -981,7 +981,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, #ifdef CONFIG_MIGRATION static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, - unsigned long flags) + struct mm_struct *mm, unsigned long flags) { /* * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. 
@@ -990,7 +990,7 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
 	 * See folio_mapped_shared() on possible imprecision when we cannot
 	 * easily detect if a folio is shared.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio)) {
+	if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio, mm)) {
 		if (folio_isolate_lru(folio)) {
 			list_add_tail(&folio->lru, foliolist);
 			node_stat_mod_folio(folio,
@@ -1195,7 +1195,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 #else
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
-		unsigned long flags)
+		struct mm_struct *mm, unsigned long flags)
 {
 	return false;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 341a84c3e8e4..8a1d75ff2dc6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2559,7 +2559,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 	 * every page is mapped to the same process. Doing that is very
 	 * expensive, so check the estimated mapcount of the folio instead.
 	 */
-	if (folio_mapped_shared(folio) && folio_is_file_lru(folio) &&
+	if (folio_mapped_shared(folio, vma->vm_mm) && folio_is_file_lru(folio) &&
 	    (vma->vm_flags & VM_EXEC))
 		goto out;
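The reason the MM has to be passed down becomes visible in the next patch: with rmap IDs, "mapped exclusively" is always a statement relative to one particular MM. The following standalone sketch (plain C; the toy_* names are made up for illustration, this is not kernel code) models the per-MM bookkeeping the series relies on:

#include <assert.h>
#include <stdio.h>

/*
 * Toy model of the per-MM rmap-ID bookkeeping: every mapping adds the
 * mapping MM's sub-ID to the folio's rmap value, every unmapping removes
 * it again. A folio is "mapped exclusively" by an MM iff the accumulated
 * value equals mapcount * sub-ID(MM).
 */
struct toy_folio {
	unsigned long rmap_val;
	long mapcount;
};

static void toy_map(struct toy_folio *f, unsigned long subid, long nr)
{
	f->mapcount += nr;
	f->rmap_val += subid * nr;
}

static void toy_unmap(struct toy_folio *f, unsigned long subid, long nr)
{
	f->mapcount -= nr;
	f->rmap_val -= subid * nr;
}

static int toy_mapped_shared(const struct toy_folio *f, unsigned long subid)
{
	return f->rmap_val != (unsigned long)f->mapcount * subid;
}

int main(void)
{
	/* Two MMs, identified by two valid order-10 sub-IDs from the table. */
	const unsigned long mm_a = 1, mm_b = 1025;
	struct toy_folio f = { 0, 0 };

	toy_map(&f, mm_a, 512);			/* A maps all 512 PTEs */
	assert(!toy_mapped_shared(&f, mm_a));

	toy_map(&f, mm_b, 1);			/* B maps a single PTE as well */
	assert(toy_mapped_shared(&f, mm_a));
	assert(toy_mapped_shared(&f, mm_b));

	toy_unmap(&f, mm_a, 512);		/* A unmaps everything again */
	assert(!toy_mapped_shared(&f, mm_b));

	printf("toy rmap-ID bookkeeping OK\n");
	return 0;
}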
From patchwork Fri Nov 24 13:26:14 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 09/20] mm: improve folio_mapped_shared() for partially-mappable folios using rmap IDs
Date: Fri, 24 Nov 2023 14:26:14 +0100
Message-ID: <20231124132626.235350-10-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

Let's make folio_mapped_shared() precise by using our rmap ID magic to
identify if a single MM is responsible for all mappings.

If there is a lot of concurrent (un)map activity, we could theoretically
spin for quite a while. But we're only looking at the rmap values in case
we didn't already identify the folio as "obviously shared". In most
cases, there should only be one or a handful of page tables involved.

For current THPs with ~512 .. 2048 subpages, we really shouldn't see a
lot of concurrent updates that keep us spinning for a long time. Anyhow,
if this ever becomes a problem, it can be optimized later if there is
real demand.

Signed-off-by: David Hildenbrand
---
 include/linux/mm.h   | 21 ++++++++++++---
 include/linux/rmap.h |  2 ++
 mm/rmap_id.c         | 63 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 765e688690f1..1081a8faa1a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2113,6 +2113,17 @@ static inline size_t folio_size(struct folio *folio)
 	return PAGE_SIZE << folio_order(folio);
 }
 
+#ifdef CONFIG_RMAP_ID
+bool __folio_large_mapped_shared(struct folio *folio, struct mm_struct *mm);
+#else
+static inline bool __folio_large_mapped_shared(struct folio *folio,
+		struct mm_struct *mm)
+{
+	/* ... guess based on the mapcount of the first page of the folio. */
+	return atomic_read(&folio->page._mapcount) > 0;
+}
+#endif
+
 /**
  * folio_mapped_shared - Report if a folio is certainly mapped by
  *			 multiple entities in their page tables
@@ -2141,8 +2152,11 @@ static inline size_t folio_size(struct folio *folio)
  * PMD-mapped PMD-sized THP), the result will be exactly correct.
  *
  * For all other (partially-mappable) folios, such as PTE-mapped THP, the
- * return value is partially fuzzy: true is not fuzzy, because it means
- * "certainly mapped shared", but false means "maybe mapped exclusively".
+ * return value is partially fuzzy without CONFIG_RMAP_ID: true is not fuzzy,
+ * because it means "certainly mapped shared", but false means
+ * "maybe mapped exclusively".
+ *
+ * With CONFIG_RMAP_ID, the result will be exactly correct.
  *
  * Note that this function only considers *current* page table mappings
  * tracked via rmap -- that properly adjusts the folio mapcount(s) -- and
@@ -2177,8 +2191,7 @@ static inline bool folio_mapped_shared(struct folio *folio,
 	 */
 	if (total_mapcount > folio_nr_pages(folio))
 		return true;
-	/* ...
guess based on the mapcount of the first page of the folio. */ - return atomic_read(&folio->page._mapcount) > 0; + return __folio_large_mapped_shared(folio, mm); } #ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 19c9dc3216df..a73e146d82d1 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -253,6 +253,8 @@ void __folio_set_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); void __folio_add_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); +bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, + struct mm_struct *mm); #else static inline void __folio_prep_large_rmap(struct folio *folio) { diff --git a/mm/rmap_id.c b/mm/rmap_id.c index e66b0f5aea2d..85a61c830f19 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -322,6 +322,69 @@ void __folio_add_large_rmap_val(struct folio *folio, int count, } } +bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + unsigned long diff = 0; + + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER .. RMAP_SUBID_6_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_6(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_6(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_6(mm, 2) * count); + diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_6(mm, 3) * count); + diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_6(mm, 4) * count); + diff |= atomic_long_read(&folio->_rmap_val5) ^ (get_rmap_subid_6(mm, 5) * count); + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER .. RMAP_SUBID_5_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_5(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_5(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_5(mm, 2) * count); + diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_5(mm, 3) * count); + diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_5(mm, 4) * count); + break; +#endif + default: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_4(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_4(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_4(mm, 2) * count); + diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_4(mm, 3) * count); + break; + } + return !diff; +} + +bool __folio_large_mapped_shared(struct folio *folio, struct mm_struct *mm) +{ + unsigned long start; + bool exclusive; + int mapcount; + + VM_WARN_ON_ONCE(!folio_test_large_rmappable(folio)); + VM_WARN_ON_ONCE(folio_test_hugetlb(folio)); + + /* + * Livelocking here is unlikely, as the caller already handles the + * "obviously shared" cases. If ever an issue and there is too much + * concurrent (un)mapping happening (using different page tables), we + * could stop earlier and just return "shared". 
+	 */
+	do {
+		start = raw_read_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount);
+		mapcount = folio_mapcount(folio);
+		if (unlikely(mapcount > folio_nr_pages(folio)))
+			return true;
+		exclusive = __folio_has_large_matching_rmap_val(folio, mapcount, mm);
+	} while (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount,
+					        start));
+
+	return !exclusive;
+}
+
 int alloc_rmap_id(void)
 {
 	int id;
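The equality check performed by __folio_has_large_matching_rmap_val() is only conclusive because of the sub-ID property stated in the mm/rmap_id.c comment: with at most P mappings, "k * sub-ID" can only be produced by summing k copies of that very sub-ID. The following standalone brute-force check of that property uses a toy P = 4 (illustrative C, not kernel code; the sub-IDs follow the same generation rule with base P + 1 = 5):

#include <stdio.h>

#define P	4	/* toy folio size: at most P mappings considered */
#define NSUB	8	/* first three generations for base P + 1 = 5 */

static const unsigned long subid[NSUB] = { 0, 1, 5, 6, 25, 26, 30, 31 };

static unsigned long picked[P];
static int violations;

/* Enumerate all multisets of k sub-IDs and test the claimed property. */
static void check(int k, int depth, int start, unsigned long sum)
{
	int i, j, all_same;

	if (depth == k) {
		for (i = 0; i < NSUB; i++) {
			if (sum != k * subid[i])
				continue;
			all_same = 1;
			for (j = 0; j < k; j++)
				all_same &= (picked[j] == subid[i]);
			if (!all_same)
				violations++;
		}
		return;
	}
	for (i = start; i < NSUB; i++) {
		picked[depth] = subid[i];
		check(k, depth + 1, i, sum + subid[i]);
	}
}

int main(void)
{
	int k;

	for (k = 2; k <= P; k++)
		check(k, 0, 0, 0);
	printf("violations: %d\n", violations);	/* expect 0 */
	return 0;
}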
From patchwork Fri Nov 24 13:26:15 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 10/20] mm/memory: COW reuse support for PTE-mapped THP with rmap IDs
Date: Fri, 24 Nov 2023 14:26:15 +0100
Message-ID: <20231124132626.235350-11-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

For now, we only end up reusing small folios and PMD-mapped large folios
(i.e., THP) after fork(); PTE-mapped THPs are never reused, except when
only a single page of the folio remains mapped. Instead, we end up copying
each subpage even though the THP might be exclusive to the MM.

The logic we're using for small folios and PMD-mapped THPs is the
following: Is the only reference to the folio from a single page table
mapping? Then:
 (a) There are no other references to the folio from other MMs
     (e.g., page table mapping, GUP)
 (b) There are no other references to the folio from page migration/
     swapout/swapcache that might temporarily unmap the folio.

Consequently, the folio is exclusive to that process and can be reused.
In that case, we end up with folio_refcount(folio) == 1 and an implied
folio_mapcount(folio) == 1, while holding the page table lock and the
page lock to protect against possible races.

For PTE-mapped THP, however, we have not one, but multiple references
from page tables, whereby such THPs can be mapped into multiple page
tables in the MM.

Reusing the logic that we use for small folios and PMD-mapped THPs means
that, when reusing a PTE-mapped THP, we want to make sure that:
 (1) All folio references are from page table mappings.
 (2) All page table mappings belong to the same MM.
 (3) We didn't race with (un)mapping of the page related to other page
     tables, such that the mapcount and refcount are stable.

For (1), we can check folio_refcount(folio) == folio_mapcount(folio).
For (2) and (3), we can use our new rmap ID infrastructure.

We won't bother with the swapcache and LRU cache for now.

Add some sanity checks under CONFIG_DEBUG_VM, to identify any obvious
problems early.

Signed-off-by: David Hildenbrand
---
 mm/memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 5048d58d6174..fb533995ff68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3360,6 +3360,95 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 static bool wp_can_reuse_anon_folio(struct folio *folio,
 				    struct vm_area_struct *vma)
 {
+#ifdef CONFIG_RMAP_ID
+	if (folio_test_large(folio)) {
+		bool retried = false;
+		unsigned long start;
+		int mapcount, i;
+
+		/*
+		 * The assumption for anonymous folios is that each page can
+		 * only get mapped once into a MM. This also holds for
+		 * small folios -- except when KSM is involved. KSM currently
+		 * does not apply to large folios.
+ * + * Further, each taken mapcount must be paired with exactly one + * taken reference, whereby references must be incremented + * before the mapcount when mapping a page, and references must + * be decremented after the mapcount when unmapping a page. + * + * So if all references to a folio are from mappings, and all + * mappings are due to our (MM) page tables, and there was no + * concurrent (un)mapping, this folio is certainly exclusive. + * + * We currently don't optimize for: + * (a) folio is mapped into multiple page tables in this + * MM (e.g., mremap) and other page tables are + * concurrently (un)mapping the folio. + * (b) the folio is in the swapcache. Likely the other PTEs + * are still swap entries and folio_free_swap() would fail. + * (c) the folio is in the LRU cache. + */ +retry: + start = raw_read_atomic_seqcount(&folio->_rmap_atomic_seqcount); + if (start & ATOMIC_SEQCOUNT_WRITERS_MASK) + return false; + mapcount = folio_mapcount(folio); + + /* Is this folio possibly exclusive ... */ + if (mapcount > folio_nr_pages(folio) || folio_entire_mapcount(folio)) + return false; + + /* ... and are all references from mappings ... */ + if (folio_ref_count(folio) != mapcount) + return false; + + /* ... and do all mappings belong to us ... */ + if (!__folio_has_large_matching_rmap_val(folio, mapcount, vma->vm_mm)) + return false; + + /* ... and was there no concurrent (un)mapping ? */ + if (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount, + start)) + return false; + + /* Safety checks we might want to drop in the future. */ + if (IS_ENABLED(CONFIG_DEBUG_VM)) { + unsigned int mapcount; + + if (WARN_ON_ONCE(folio_test_ksm(folio))) + return false; + /* + * We might have raced against swapout code adding + * the folio to the swapcache (which, by itself, is not + * problematic). Let's simply check again if we would + * properly detect the additional reference now and + * properly fail. + */ + if (unlikely(folio_test_swapcache(folio))) { + if (WARN_ON_ONCE(retried)) + return false; + retried = true; + goto retry; + } + for (i = 0; i < folio_nr_pages(folio); i++) { + mapcount = page_mapcount(folio_page(folio, i)); + if (WARN_ON_ONCE(mapcount > 1)) + return false; + } + } + + /* + * This folio is exclusive to us. Do we need the page lock? + * Likely not, and a trylock would be unfortunate if this + * folio is mapped into multiple page tables and we get + * concurrent page faults. If there would be references from + * page migration/swapout/swapcache, we would have detected + * an additional reference and never ended up here. 
+		 */
+		return true;
+	}
+#endif /* CONFIG_RMAP_ID */
 	/*
 	 * We have to verify under folio lock: these early checks are
 	 * just an optimization to avoid locking the folio and freeing
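Taken together, the reuse path boils down to a sequence of cheap checks that only count if the folio's _rmap_atomic_seqcount stayed stable. The following standalone sketch (plain C; the toy_* names are hypothetical, this is not kernel code) condenses that decision into one function, with a trivial single-threaded stand-in for the seqcount:

#include <stdio.h>

/*
 * Single-threaded toy model of the COW-reuse decision above: snapshot a
 * sequence count, check mapcount/refcount/rmap value, and only trust the
 * result if no writer was active and the sequence did not change.
 */
struct toy_folio {
	unsigned long seq;	/* odd would mean "writer active" */
	int mapcount;
	int refcount;
	unsigned long rmap_val;
};

static int toy_can_reuse(const struct toy_folio *f, unsigned long subid,
			 int nr_pages)
{
	unsigned long start = f->seq;

	if (start & 1)					/* concurrent writer */
		return 0;
	if (f->mapcount > nr_pages)			/* certainly shared */
		return 0;
	if (f->refcount != f->mapcount)			/* GUP, swapcache, ... */
		return 0;
	if (f->rmap_val != (unsigned long)f->mapcount * subid)
		return 0;				/* other MMs involved */
	return f->seq == start;				/* nothing changed */
}

int main(void)
{
	/* 512 PTE mappings, all from the MM with sub-ID 1025, no extra refs. */
	struct toy_folio f = { 0, 512, 512, 512 * 1025UL };

	printf("reuse: %d\n", toy_can_reuse(&f, 1025, 512));	/* 1 */
	f.refcount++;	/* e.g., a GUP pin shows up */
	printf("reuse: %d\n", toy_can_reuse(&f, 1025, 512));	/* 0 */
	return 0;
}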
From patchwork Fri Nov 24 13:26:16 2023
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Linus Torvalds, Ryan Roberts,
    Matthew Wilcox, Hugh Dickins, Yin Fengwei, Yang Shi, Ying Huang, Zi Yan,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, "Paul E. McKenney"
Subject: [PATCH WIP v1 11/20] mm/rmap_id: support for 1, 2 and 3 values by manual calculation
Date: Fri, 24 Nov 2023 14:26:16 +0100
Message-ID: <20231124132626.235350-12-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

For smaller folios, we can use fewer rmap values:
* <= order-2: 1x 64bit value
* <= order-5: 2x 64bit values
* <= order-9: 3x 64bit values

We end up with a lot of subids, so we cannot really use lookup tables.
Pre-calculate the subids per MM.

For order-9 we could think about having a lookup table with 128bit
entries. Further, we could calculate them only when really required.

With 2 MiB THP this now implies only 3 instead of 4 values.

Signed-off-by: David Hildenbrand
---
 include/linux/mm_types.h |  3 ++
 include/linux/rmap.h     | 58 ++++++++++++++++++++++++++++-
 kernel/fork.c            |  6 +++
 mm/rmap_id.c             | 79 +++++++++++++++++++++++++++++++++++++---
 4 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 75305c57ef64..0ca5004e8f4a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1032,6 +1032,9 @@ struct mm_struct {
 
 #ifdef CONFIG_RMAP_ID
 		int mm_rmap_id;
+		unsigned long mm_rmap_subid_1;
+		unsigned long mm_rmap_subid_2[2];
+		unsigned long mm_rmap_subid_3[3];
 #endif /* CONFIG_RMAP_ID */
 	} __randomize_layout;
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a73e146d82d1..39aeab457f4a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -180,12 +180,54 @@ struct anon_vma *folio_get_anon_vma(struct folio *folio);
 void free_rmap_id(int id);
 int alloc_rmap_id(void);
 
+#define RMAP_SUBID_1_MAX_ORDER		2
+#define RMAP_SUBID_2_MIN_ORDER		3
+#define RMAP_SUBID_2_MAX_ORDER		5
+#define RMAP_SUBID_3_MIN_ORDER		6
+#define RMAP_SUBID_3_MAX_ORDER		9
+#define RMAP_SUBID_4_MIN_ORDER		10
 #define RMAP_SUBID_4_MAX_ORDER		10
 #define RMAP_SUBID_5_MIN_ORDER		11
 #define RMAP_SUBID_5_MAX_ORDER		12
 #define RMAP_SUBID_6_MIN_ORDER		13
 #define RMAP_SUBID_6_MAX_ORDER		15
 
+static inline unsigned long calc_rmap_subid(unsigned int n, unsigned int i)
+{
+	unsigned long nr = 0, mult = 1;
+
+	while (i) {
+		if (i & 1)
+			nr += mult;
+		mult *= (n + 1);
+		i >>= 1;
+	}
+	return nr;
+}
+
+static inline unsigned long calc_rmap_subid_1(int rmap_id)
+{
+	VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX);
+
+	return calc_rmap_subid(1u << RMAP_SUBID_1_MAX_ORDER, rmap_id);
+}
+
+static inline unsigned long calc_rmap_subid_2(int rmap_id, int nr)
+{
+	VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 1);
+
+	return calc_rmap_subid(1u << RMAP_SUBID_2_MAX_ORDER,
+			       (rmap_id >> (nr * 12)) & 0xfff);
+}
+
+static inline unsigned long calc_rmap_subid_3(int rmap_id, int nr)
+{
+
VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 2); + + return calc_rmap_subid(1u << RMAP_SUBID_3_MAX_ORDER, + (rmap_id >> (nr * 8)) & 0xff); +} + static inline void __folio_prep_large_rmap(struct folio *folio) { const unsigned int order = folio_order(folio); @@ -202,10 +244,16 @@ static inline void __folio_prep_large_rmap(struct folio *folio) atomic_long_set(&folio->_rmap_val4, 0); fallthrough; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: atomic_long_set(&folio->_rmap_val3, 0); + fallthrough; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: atomic_long_set(&folio->_rmap_val2, 0); + fallthrough; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: atomic_long_set(&folio->_rmap_val1, 0); + fallthrough; + default: atomic_long_set(&folio->_rmap_val0, 0); break; } @@ -227,10 +275,16 @@ static inline void __folio_undo_large_rmap(struct folio *folio) VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val4)); fallthrough; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val3)); + fallthrough; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val2)); + fallthrough; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val1)); + fallthrough; + default: VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val0)); break; } diff --git a/kernel/fork.c b/kernel/fork.c index 773c93613ca2..1d2f6248c83e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -822,6 +822,12 @@ static inline int mm_alloc_rmap_id(struct mm_struct *mm) if (id < 0) return id; mm->mm_rmap_id = id; + mm->mm_rmap_subid_1 = calc_rmap_subid_1(id); + mm->mm_rmap_subid_2[0] = calc_rmap_subid_2(id, 0); + mm->mm_rmap_subid_2[1] = calc_rmap_subid_2(id, 1); + mm->mm_rmap_subid_3[0] = calc_rmap_subid_3(id, 0); + mm->mm_rmap_subid_3[1] = calc_rmap_subid_3(id, 1); + mm->mm_rmap_subid_3[2] = calc_rmap_subid_3(id, 2); return 0; } diff --git a/mm/rmap_id.c b/mm/rmap_id.c index 85a61c830f19..6c3187547741 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -87,6 +87,39 @@ static DEFINE_IDA(rmap_ida); * involved page tables are locked and stop any page table walkers. */ +/* + * With 4 (order-2) possible exclusive mappings per folio, we can have + * 16777216 = 16M sub-IDs per 64bit value. + */ +static unsigned long get_rmap_subid_1(struct mm_struct *mm) +{ + return mm->mm_rmap_subid_1; +} + +/* + * With 32 (order-5) possible exclusive mappings per folio, we can have + * 4096 sub-IDs per 64bit value. + * + * With 2 such 64bit values, we can support 4096^2 == 16M IDs. + */ +static unsigned long get_rmap_subid_2(struct mm_struct *mm, int nr) +{ + VM_WARN_ON_ONCE(nr > 1); + return mm->mm_rmap_subid_2[nr]; +} + +/* + * With 512 (order-9) possible exclusive mappings per folio, we can have + * 128 sub-IDs per 64bit value. + * + * With 3 such 64bit values, we can support 128^3 == 16M IDs. + */ +static unsigned long get_rmap_subid_3(struct mm_struct *mm, int nr) +{ + VM_WARN_ON_ONCE(nr > 2); + return mm->mm_rmap_subid_3[nr]; +} + /* * With 1024 (order-10) possible exclusive mappings per folio, we can have 64 * sub-IDs per 64bit value. @@ -279,12 +312,24 @@ void __folio_set_large_rmap_val(struct folio *folio, int count, atomic_long_set(&folio->_rmap_val4, get_rmap_subid_5(mm, 4) * count); break; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... 
RMAP_SUBID_4_MAX_ORDER: atomic_long_set(&folio->_rmap_val0, get_rmap_subid_4(mm, 0) * count); atomic_long_set(&folio->_rmap_val1, get_rmap_subid_4(mm, 1) * count); atomic_long_set(&folio->_rmap_val2, get_rmap_subid_4(mm, 2) * count); atomic_long_set(&folio->_rmap_val3, get_rmap_subid_4(mm, 3) * count); break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_3(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_3(mm, 1) * count); + atomic_long_set(&folio->_rmap_val2, get_rmap_subid_3(mm, 2) * count); + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_2(mm, 0) * count); + atomic_long_set(&folio->_rmap_val1, get_rmap_subid_2(mm, 1) * count); + break; + default: + atomic_long_set(&folio->_rmap_val0, get_rmap_subid_1(mm) * count); + break; } } @@ -313,12 +358,24 @@ void __folio_add_large_rmap_val(struct folio *folio, int count, atomic_long_add(get_rmap_subid_5(mm, 4) * count, &folio->_rmap_val4); break; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: atomic_long_add(get_rmap_subid_4(mm, 0) * count, &folio->_rmap_val0); atomic_long_add(get_rmap_subid_4(mm, 1) * count, &folio->_rmap_val1); atomic_long_add(get_rmap_subid_4(mm, 2) * count, &folio->_rmap_val2); atomic_long_add(get_rmap_subid_4(mm, 3) * count, &folio->_rmap_val3); break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + atomic_long_add(get_rmap_subid_3(mm, 0) * count, &folio->_rmap_val0); + atomic_long_add(get_rmap_subid_3(mm, 1) * count, &folio->_rmap_val1); + atomic_long_add(get_rmap_subid_3(mm, 2) * count, &folio->_rmap_val2); + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + atomic_long_add(get_rmap_subid_2(mm, 0) * count, &folio->_rmap_val0); + atomic_long_add(get_rmap_subid_2(mm, 1) * count, &folio->_rmap_val1); + break; + default: + atomic_long_add(get_rmap_subid_1(mm) * count, &folio->_rmap_val0); + break; } } @@ -330,7 +387,7 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, switch (order) { #if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER - case RMAP_SUBID_6_MIN_ORDER .. RMAP_SUBID_6_MAX_ORDER: + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_6(mm, 0) * count); diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_6(mm, 1) * count); diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_6(mm, 2) * count); @@ -340,7 +397,7 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, break; #endif #if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER - case RMAP_SUBID_5_MIN_ORDER .. RMAP_SUBID_5_MAX_ORDER: + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_5(mm, 0) * count); diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_5(mm, 1) * count); diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_5(mm, 2) * count); @@ -348,12 +405,24 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_5(mm, 4) * count); break; #endif - default: + case RMAP_SUBID_4_MIN_ORDER ... 
RMAP_SUBID_4_MAX_ORDER: diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_4(mm, 0) * count); diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_4(mm, 1) * count); diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_4(mm, 2) * count); diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_4(mm, 3) * count); break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_3(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_3(mm, 1) * count); + diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_3(mm, 2) * count); + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_2(mm, 0) * count); + diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_2(mm, 1) * count); + break; + default: + diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_1(mm) * count); + break; } return !diff; } From patchwork Fri Nov 24 13:26:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169417 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191155vqx; Fri, 24 Nov 2023 05:28:48 -0800 (PST) X-Google-Smtp-Source: AGHT+IFhPzFW3SP+JwPvAmm/3aa8lHRynDIcgou45bAKLC64fipODMQGG+Bu/HCK/jnr8zKey3V9 X-Received: by 2002:a17:90b:4d0d:b0:280:4ec6:97e9 with SMTP id mw13-20020a17090b4d0d00b002804ec697e9mr3191988pjb.30.1700832528048; Fri, 24 Nov 2023 05:28:48 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832528; cv=none; d=google.com; s=arc-20160816; b=h76trDTXCA011FRN2nCePP3hWA5+IiI1kcGXMrn0p+UM9W5Fc6ts405FgviGViBu1M eKNmuJknOAsqPFPLdnhq7x3wK/u9ruN6FcXmDKij4gKUzzdU1nmVx9TYhChh+5C6/3mA EbhWn3ovma5pWFWl7S8INBPCmnRtuGRO4L8f+k+7bizje/Y9SJ+h/2uzS++LrToQYt3R 9dVF121LoJXmFR3GLUrdFVI2FUurJkNSqlRewa2DYwLvQBeqLouViNVTgYpCssEv4Iz5 Oss4WbH/hbHEARHaxAooGGuBsn+5OWrXZcVgMn5pqDW2c/xXDZdai666dAiXbGrZtlUi Arsg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=XtSW4MwxrPZp0IyU2r08B+zJ2gAs9ZHqKJrFZADgzdE=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=vnbva+/iJS8ruCE0dzrGn80kHW654jfZnm9XJfUq4dAocDvLRy3K/G+igy5xBiMXOn VV6gSgoXrVZwQUBcMGfPj93uaevunj3+j4cSgjR2OMivVUwElifCu1FnkZVwjGdLFurW ImlxzzLjrKcpAEE0coWQTvHXfcticFflUnM+iDKtoFejvk/MdjpszbHCZ0jrW/eDR9Fg sgclF4NOiYxobMy3nlsUFn/STfm5x+N0Gh7X8cf4PWXLt99UPk71nYjpL015EF5IEqUI 6sYm1usUU7PqqAb1AN37GP9eySPJ0D8xvQlnp/UH0j3wBQiMA5OZgfkfUpnRdYnPW9ny DOVg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WeCLO36I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from lipwig.vger.email (lipwig.vger.email. 
[23.128.96.33]) by mx.google.com with ESMTPS id x4-20020a17090a6c0400b002803ec7393csi4079945pjj.27.2023.11.24.05.28.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:28:48 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) client-ip=23.128.96.33; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WeCLO36I; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id 205B48030A57; Fri, 24 Nov 2023 05:28:45 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235206AbjKXN2V (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58382 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345684AbjKXN1n (ORCPT ); Fri, 24 Nov 2023 08:27:43 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4746C1BF5 for ; Fri, 24 Nov 2023 05:27:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832440; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=XtSW4MwxrPZp0IyU2r08B+zJ2gAs9ZHqKJrFZADgzdE=; b=WeCLO36IZ9SNhaMFgnZu8voNGRCkCCJPoksdZVs0/7brTpb5DEAxPDojyg+BjTAiO+0/nA SayEjB50qiNcOo6MATyLZ4UAULHm3GhKd6PNukhIhK5FVxVNwjrV/rOt1Ko6lLUsPGVCuG MuF0ZlgAJrryMx5SaInDNhLl6h8JuXs= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-624-S2Vde6e2MxqrSRyf_jNx7w-1; Fri, 24 Nov 2023 08:27:15 -0500 X-MC-Unique: S2Vde6e2MxqrSRyf_jNx7w-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id B241E185A784; Fri, 24 Nov 2023 13:27:14 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 3636A2166B2A; Fri, 24 Nov 2023 13:27:11 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 12/20] mm/rmap: introduce folio_add_anon_rmap_range() Date: Fri, 24 Nov 2023 14:26:17 +0100 Message-ID: <20231124132626.235350-13-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:28:45 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452168986567177 X-GMAIL-MSGID: 1783452168986567177 There are probably ways to have an even cleaner interface (e.g., pass the mapping granularity instead of "compound"). For now, let's handle it like folio_add_file_rmap_range(). Use separate loops for handling the "SetPageAnonExclusive()" case and performing debug checks. The latter should get optimized out automatically without CONFIG_DEBUG_VM. We'll use this function to batch rmap operations when PTE-remapping a PMD-mapped THP next. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 3 ++ mm/rmap.c | 69 +++++++++++++++++++++++++++++++++----------- 2 files changed, 55 insertions(+), 17 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 39aeab457f4a..76e6fb1dad5c 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -393,6 +393,9 @@ typedef int __bitwise rmap_t; * rmap interfaces called when adding or removing pte of page */ void folio_move_anon_rmap(struct folio *, struct vm_area_struct *); +void folio_add_anon_rmap_range(struct folio *, struct page *, + unsigned int nr_pages, struct vm_area_struct *, + unsigned long address, rmap_t flags); void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, rmap_t flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, diff --git a/mm/rmap.c b/mm/rmap.c index 689ad85cf87e..da7fa46a18fc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1240,25 +1240,29 @@ static void __page_check_anon_rmap(struct folio *folio, struct page *page, } /** - * page_add_anon_rmap - add pte mapping to an anonymous page - * @page: the page to add the mapping to - * @vma: the vm area in which the mapping is added - * @address: the user virtual address mapped - * @flags: the rmap flags + * folio_add_anon_rmap_range - add mappings to a page range of an anon folio + * @folio: The folio to add the mapping to + * @page: The first page to add + * @nr_pages: The number of pages which will be mapped + * @vma: The vm area in which the mapping is added + * @address: The user virtual address of the first page to map + * @flags: The rmap flags + * + * The page range of folio is defined by [first_page, first_page + nr_pages) * * The caller needs to hold the pte lock, and the page must be locked in * the anon_vma case: to serialize mapping,index checking after setting, - * and to ensure that PageAnon is not being upgraded racily to PageKsm - * (but PageKsm is never downgraded to PageAnon). + * and to ensure that an anon folio is not being upgraded racily to a KSM folio + * (but KSM folios are never downgraded). 
*/ -void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, +void folio_add_anon_rmap_range(struct folio *folio, struct page *page, + unsigned int nr_pages, struct vm_area_struct *vma, unsigned long address, rmap_t flags) { - struct folio *folio = page_folio(page); - unsigned int nr, nr_pmdmapped = 0; + unsigned int i, nr, nr_pmdmapped = 0; bool compound = flags & RMAP_COMPOUND; - nr = __folio_add_rmap_range(folio, page, 1, vma, compound, + nr = __folio_add_rmap_range(folio, page, nr_pages, vma, compound, &nr_pmdmapped); if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); @@ -1279,12 +1283,20 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, } else if (likely(!folio_test_ksm(folio))) { __page_check_anon_rmap(folio, page, vma, address); } - if (flags & RMAP_EXCLUSIVE) - SetPageAnonExclusive(page); - /* While PTE-mapping a THP we have a PMD and a PTE mapping. */ - VM_WARN_ON_FOLIO((atomic_read(&page->_mapcount) > 0 || - (folio_test_large(folio) && folio_entire_mapcount(folio) > 1)) && - PageAnonExclusive(page), folio); + + if (flags & RMAP_EXCLUSIVE) { + for (i = 0; i < nr_pages; i++) + SetPageAnonExclusive(page + i); + } + for (i = 0; i < nr_pages; i++) { + struct page *cur_page = page + i; + + /* While PTE-mapping a THP we have a PMD and a PTE mapping. */ + VM_WARN_ON_FOLIO((atomic_read(&cur_page->_mapcount) > 0 || + (folio_test_large(folio) && + folio_entire_mapcount(folio) > 1)) && + PageAnonExclusive(cur_page), folio); + } /* * For large folio, only mlock it if it's fully mapped to VMA. It's @@ -1296,6 +1308,29 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, mlock_vma_folio(folio, vma); } +/** + * page_add_anon_rmap - add mappings to an anonymous page + * @page: The page to add the mapping to + * @vma: The vm area in which the mapping is added + * @address: The user virtual address of the page to map + * @flags: The rmap flags + * + * See folio_add_anon_rmap_range(). + */ +void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, + unsigned long address, rmap_t flags) +{ + struct folio *folio = page_folio(page); + unsigned int nr_pages; + + if (likely(!(flags & RMAP_COMPOUND))) + nr_pages = 1; + else + nr_pages = folio_nr_pages(folio); + + folio_add_anon_rmap_range(folio, page, nr_pages, vma, address, flags); +} + /** * folio_add_new_anon_rmap - Add mapping to a new anonymous folio. * @folio: The folio to add the mapping to. 
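[Editor's note: the sub-ID construction from patch 11 above is easier to follow with a worked example. Below is a minimal, stand-alone user-space sketch -- the main() harness and the chosen IDs are illustrative only, not part of the series -- that mirrors calc_rmap_subid(): the bits of an rmap ID become digits in base (n + 1), where n is the number of pages in the folio (the "possible exclusive mappings per folio"), so per-MM mapping counts of up to n accumulate without carrying between digits.]

#include <assert.h>
#include <stdio.h>

/* Mirrors calc_rmap_subid() from patch 11: base-(n + 1) expansion of i's bits. */
static unsigned long calc_rmap_subid(unsigned int n, unsigned int i)
{
	unsigned long nr = 0, mult = 1;

	while (i) {
		if (i & 1)
			nr += mult;
		mult *= (n + 1);
		i >>= 1;
	}
	return nr;
}

int main(void)
{
	/* Order-2 folio: n = 4 pages per folio, so digits are base 5. */
	const unsigned int n = 4;
	const unsigned long a = calc_rmap_subid(n, 0x5);	/* 5^0 + 5^2 = 26 */
	const unsigned long b = calc_rmap_subid(n, 0x6);	/* 5^1 + 5^2 = 30 */
	unsigned long folio_val;

	/* One MM (ID 0x5) maps all 4 pages: the accumulated value matches it. */
	folio_val = 4 * a;
	assert(folio_val == a * 4 && folio_val != b * 4);

	/* Two pages mapped by each MM: neither "subid * 4" matches. */
	folio_val = 2 * a + 2 * b;
	assert(folio_val != a * 4 && folio_val != b * 4);

	printf("subid(0x5) = %lu, subid(0x6) = %lu\n", a, b);
	return 0;
}

[This no-carry property is what __folio_has_large_matching_rmap_val() relies on when it XOR-compares the per-folio values against "subid * count".]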
From patchwork Fri Nov 24 13:26:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169419 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191208vqx; Fri, 24 Nov 2023 05:28:52 -0800 (PST) X-Google-Smtp-Source: AGHT+IHV7lwS1kCbql6avzeY9WARUGI18hjlsZdWOhiN0wQWSaGFYYycxRbp/opWcHNvvt7BKJ8R X-Received: by 2002:a05:6a00:7cf:b0:6c4:d6fa:ee9d with SMTP id n15-20020a056a0007cf00b006c4d6faee9dmr6488454pfu.1.1700832532433; Fri, 24 Nov 2023 05:28:52 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832532; cv=none; d=google.com; s=arc-20160816; b=BRbJyo0PqmJRjr2cT3teXRLacUOLeUqMlD6FbMopSPFQk3kiRfY2O8v2VnLlgkT4rD oVla/TW9y06fTku3782QWALB2lX8n2vpvSK4LObnmAy7+wwicK3gmyrXy+88bows/Ua0 rN/aMo3iwbQihs/ZBRL/kwrsyDxe7forO64Xt6EA4q4ppgISebXm5aN3DaQmT3BRPLDi lwbX5iOrSBh8N7Bu5ihuWdKFPD4RdYG9AiDsQpwhOw/gfMYmBD33cjWdtkqWNG8yy7AN wcT8hs9ZngBeDvBaiozZ8UtFbl1Rk2+0JU0jJTRlzWxn2by+SyPFi9euU4xjC24K6rMa H0JQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=vCiq1mHEXhReKPGNK7ma5q/JWzLE5up94nYDqOWMaw0=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=fs5nUznSp5YtUUlEUcJCJY3y/aWk0KVgeqUAcIflg0wzXVk5Q8tP3D/z9VG+cOP80r SdjvmW2gi6C1wpooym+sB4af1jkYTl08TuY1+MGR0WHvBF+QyRm7gKQN6rFZ9NTw4oKA KRacAtl3Vzz464SA6nXabDlnoaO7fFpcwgKZmum7VWnEfgvQfK2vaS1ReerztLNGKMHW /aPRWhDfPgq86qsm3pHstC0spwoB2i4pWp/30NjSoI8UQuga9tZgi0bNNWRLoGqkStl0 nhWmSzx9rOCD2RPA0ziTa/ST+Y2RRk938Vce6z2FnWyf6h1tHGlhnaqylQAG5R0MtQdJ NIDw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=KMtMf+7d; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from agentk.vger.email (agentk.vger.email. 
[23.128.96.32]) by mx.google.com with ESMTPS id z1-20020aa78881000000b006cbb7fdff10si3526805pfe.194.2023.11.24.05.28.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:28:52 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) client-ip=23.128.96.32; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=KMtMf+7d; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id C957B81825CD; Fri, 24 Nov 2023 05:28:49 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345366AbjKXN2Y (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48830 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231172AbjKXN1r (ORCPT ); Fri, 24 Nov 2023 08:27:47 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0CB1E1FD5 for ; Fri, 24 Nov 2023 05:27:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832443; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=vCiq1mHEXhReKPGNK7ma5q/JWzLE5up94nYDqOWMaw0=; b=KMtMf+7dmBRtjbwxbYQXOrYTQcwZiZJDTFkfqfBZDVirfeYkN8AH/PvV7p0IXhMRq4/keh bYPgAC+g3Q580Km9kmZK6U8f8eP8GSYKjt1GZ0kjLjD9w//Bl75M7BQoShXl0h6+1VUUjC Q20c2Ym3edm/eG74EUt849f10Tw5pJc= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-654-EfryieQzPKuzYd766I-Umw-1; Fri, 24 Nov 2023 08:27:19 -0500 X-MC-Unique: EfryieQzPKuzYd766I-Umw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 805CE1C05142; Fri, 24 Nov 2023 13:27:18 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 0717E2166B2A; Fri, 24 Nov 2023 13:27:14 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 13/20] mm/huge_memory: batch rmap operations in __split_huge_pmd_locked() Date: Fri, 24 Nov 2023 14:26:18 +0100 Message-ID: <20231124132626.235350-14-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:28:50 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452173437447670 X-GMAIL-MSGID: 1783452173437447670 Let's batch the rmap operations, as a preparation to making individual page_add_anon_rmap() calls more expensive. While at it, use more folio operations (but only in the code branch we're touching), use VM_WARN_ON_FOLIO(), and pass RMAP_COMPOUND instead of manually setting PageAnonExclusive. We should never see non-anon pages on that branch: otherwise, the existing page_add_anon_rmap() call would have been flawed already. Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fd7251923557..f47971d1afbf 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2100,6 +2100,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, unsigned long haddr, bool freeze) { struct mm_struct *mm = vma->vm_mm; + struct folio *folio; struct page *page; pgtable_t pgtable; pmd_t old_pmd, _pmd; @@ -2195,16 +2196,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, uffd_wp = pmd_swp_uffd_wp(old_pmd); } else { page = pmd_page(old_pmd); + folio = page_folio(page); if (pmd_dirty(old_pmd)) { dirty = true; - SetPageDirty(page); + folio_set_dirty(folio); } write = pmd_write(old_pmd); young = pmd_young(old_pmd); soft_dirty = pmd_soft_dirty(old_pmd); uffd_wp = pmd_uffd_wp(old_pmd); - VM_BUG_ON_PAGE(!page_count(page), page); + VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); /* * Without "freeze", we'll simply split the PMD, propagating the @@ -2221,11 +2224,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, * * See page_try_share_anon_rmap(): invalidate PMD first. 
*/ - anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + anon_exclusive = PageAnonExclusive(page); if (freeze && anon_exclusive && page_try_share_anon_rmap(page)) freeze = false; - if (!freeze) - page_ref_add(page, HPAGE_PMD_NR - 1); + if (!freeze) { + rmap_t rmap_flags = RMAP_NONE; + + folio_ref_add(folio, HPAGE_PMD_NR - 1); + if (anon_exclusive) + rmap_flags = RMAP_EXCLUSIVE; + folio_add_anon_rmap_range(folio, page, HPAGE_PMD_NR, + vma, haddr, rmap_flags); + } } /* @@ -2268,8 +2278,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); if (write) entry = pte_mkwrite(entry, vma); - if (anon_exclusive) - SetPageAnonExclusive(page + i); if (!young) entry = pte_mkold(entry); /* NOTE: this may set soft-dirty too on some archs */ @@ -2279,7 +2287,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = pte_mksoft_dirty(entry); if (uffd_wp) entry = pte_mkuffd_wp(entry); - page_add_anon_rmap(page + i, vma, addr, RMAP_NONE); } VM_BUG_ON(!pte_none(ptep_get(pte))); set_pte_at(mm, addr, pte, entry); From patchwork Fri Nov 24 13:26:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169428 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191728vqx; Fri, 24 Nov 2023 05:29:41 -0800 (PST) X-Google-Smtp-Source: AGHT+IHgYE+zg/SzrX4uBWWWclrIzHkXkhvhD2AdPxOq3kkrTopSK4r0L6beJ93WO9SvokO9VemC X-Received: by 2002:a05:6a21:6da3:b0:188:f3d:ea35 with SMTP id wl35-20020a056a216da300b001880f3dea35mr4202670pzb.50.1700832581479; Fri, 24 Nov 2023 05:29:41 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832581; cv=none; d=google.com; s=arc-20160816; b=P1KBK6+u/H3ePAr0ADCID2C8+tRz7rFxxlVZeuFSrVTZunCaaUD048wuUXApUxLEzF +1YnIHMWy3FobK+C2xRcx9uNbbdLWltmGJ4Zr23mVOM4X6ItQBDdrAOzmah02EKIuXcI HuIu8Gl2ZDm+0V7gnLKMXV+ZPhd+YsXkRQuAgMkHDk3BpPgK2grcWQzopX+0J7W+il90 kCoplsf2Gu8Xv6ew1YM4bsNMTVy/QsfdbY1iLsKwPqzIleHCiKU8v6Mxlm8h9/ZI3zWo cxtyNjmKuRYnIUT5poJGA66kHkqXaZHss0LvgdXmTDaj2KjXTD+hV0U0+U6Mz8nUxAHL 6bwA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=8wkdTzr4ovr6JqvWUuBFMPrmTwPV2fAdlmREPfnLqjk=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=0j17+20a6N89IgfVXu9QnG9A5t37ulMUPSevUmleXgCfV5Lr3yR0xYKeJbMOAkU7hf MLf/ncO0NFHwGr5HvinP7+r2IFpPlh6Wp+lnLv0l6FG73SU4y4ZxcNNpq24e7+gbnb0h v5g+vMATeaTeKWV+YphnynV+jYLdqtEHniQiswvMtWkqiB4tf9GuTIwxP8ASeRkzcdCQ 0azMFPoCRh5QUgimLKQRilptcf71UjQNP/hKM2s1Okd6EphpKm2iB/Q1ExObuI5seNPo D39pQUvko/2lvxiHZX6GZ0u4hQVCZWXqOp95mH4MihD18RgyuuKeIGR7vTW1Gvc2U2YM z6dw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=P3OryhCw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from howler.vger.email (howler.vger.email. 
[2620:137:e000::3:4]) by mx.google.com with ESMTPS id k32-20020a634b60000000b005bd335981e2si3463709pgl.678.2023.11.24.05.29.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:41 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) client-ip=2620:137:e000::3:4; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=P3OryhCw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id 303308047649; Fri, 24 Nov 2023 05:29:26 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345427AbjKXN2m (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37522 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230104AbjKXN2S (ORCPT ); Fri, 24 Nov 2023 08:28:18 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6CA4F2105 for ; Fri, 24 Nov 2023 05:27:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832451; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8wkdTzr4ovr6JqvWUuBFMPrmTwPV2fAdlmREPfnLqjk=; b=P3OryhCwrrW3/y4KdyQKeVXl4ZsGXGCqICZDVyfWOr3qPxyr9OAPSTyhSzJPwo15ndIdhf M9IMRawBLym9dumKCK8uEAQfKEZt9+p2UK+VeGksUrqDmGNIoljzhaRSXa/a2ERoC8pkii QVZwgR8zn6qn35NAhZmGrVTX0kFoxO4= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-631-Mzyg-U9HMR2q8VnSaYE9mw-1; Fri, 24 Nov 2023 08:27:23 -0500 X-MC-Unique: Mzyg-U9HMR2q8VnSaYE9mw-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 7F79C85A58C; Fri, 24 Nov 2023 13:27:22 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id E251F2166B2B; Fri, 24 Nov 2023 13:27:18 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 14/20] mm/huge_memory: avoid folio_refcount() < folio_mapcount() in __split_huge_pmd_locked() Date: Fri, 24 Nov 2023 14:26:19 +0100 Message-ID: <20231124132626.235350-15-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:29:26 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452224988353516 X-GMAIL-MSGID: 1783452224988353516 Currently, there is a short period in time where the refcount is smaller than the mapcount. Let's just make sure we obey the rules of refcount vs. mapcount: increment the refcount before incrementing the mapcount and decrement the refcount after decrementing the mapcount. While this could make code like can_split_folio() fail to detect other folio references, such code is (currently) racy already and this change shouldn't actually be considered a real fix but rather an improvement/ cleanup. The refcount vs. mapcount changes are now well balanced in the code, with the cost of one additional refcount change, which really shouldn't matter here that much -- we're usually touching >= 512 subpage mapcounts and much more after all. Found while playing with some sanity checks to detect such cases, which we might add at some later point. 
Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f47971d1afbf..9639b4edc8a5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2230,7 +2230,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (!freeze) { rmap_t rmap_flags = RMAP_NONE; - folio_ref_add(folio, HPAGE_PMD_NR - 1); + folio_ref_add(folio, HPAGE_PMD_NR); if (anon_exclusive) rmap_flags = RMAP_EXCLUSIVE; folio_add_anon_rmap_range(folio, page, HPAGE_PMD_NR, @@ -2294,10 +2294,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } pte_unmap(pte - 1); - if (!pmd_migration) + if (!pmd_migration) { page_remove_rmap(page, vma, true); - if (freeze) put_page(page); + } smp_wmb(); /* make pte visible before pmd */ pmd_populate(mm, pmd, pgtable); From patchwork Fri Nov 24 13:26:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169427 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1191635vqx; Fri, 24 Nov 2023 05:29:32 -0800 (PST) X-Google-Smtp-Source: AGHT+IGvEuN2azV0Bqb3M23sEN3xY1ZqWuELkmR9Ulsse2wPtZ8t48Vz6/1VYj4BIJiSAn33hcuR X-Received: by 2002:a17:903:18f:b0:1cf:591c:a8b1 with SMTP id z15-20020a170903018f00b001cf591ca8b1mr3063643plg.15.1700832571746; Fri, 24 Nov 2023 05:29:31 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832571; cv=none; d=google.com; s=arc-20160816; b=SlSzbAozrR34Nx1qrQwKxrELTUF5R1N2CrvCoU/ozb8mMK84DXH5hbhBux5VdxFcnJ 7KRwqmqxlGMStNsIhZKMxhYOMvrVlGKhOZnUeHNzz0ToeLBDmw4MPFSi5sAuBk8A3LU/ 7HIJTJDS/H1MRQkPxj+gTkQ1A19TXuQyE0kMk36vX4/wOxlxCqMxR0iUW8K131DQFmIq 5bAEFbte4JUCLc5muYpea/YppH8Emhrj3Vb91THKsD/ViFtGFfcprj3GbYgiEU7ciJRb zAEd+QlZ40QlTS+4QIK5BctVKN3Vn80/0bM1Olp3BEKReekk8ufxhf2SK2htCGPcLAqW 9igA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=qX2dA3aqMotkmkgBnT9s/2Q/bW+jQxPjHk8bIRl2TXE=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=F4pkQ0+r0qBKKK0sJhGzwiIIFE/2lMyQftrMw/U7GArPOQV0I4wD7EG3r8iwy/cbyv BY4gT4Z0Y/qmDd+omABHcPDiL+2fmArxO2xp+U+CuGxwatw+3XEZZ5z8A3PwQmBwiuLy Xbw5FTpoBTQ1WTtqUDJ8pGedulC8+oOlJbVKhTTeoGfPYT3IOFZYLTWRK3Xhi7LDlPPV eCwdwG2n9MIGWe1VVYqSHuQ0w9W6k9jkRahxFM7J0hJse9Ik+h0URbuyy1YKnTvo94T8 YrDHN6/9t2/6+4rZBezfPtkm2ZNpaG6/q6+KscAhlLWjrXto6CwdNazi1l7hnw02QRGQ xYWA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=dQWeGBfK; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from lipwig.vger.email (lipwig.vger.email. 
[2620:137:e000::3:3]) by mx.google.com with ESMTPS id jg17-20020a17090326d100b001cf68d3e90csi3313912plb.98.2023.11.24.05.29.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:29:31 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) client-ip=2620:137:e000::3:3; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=dQWeGBfK; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id CFA0E80816A1; Fri, 24 Nov 2023 05:29:24 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231494AbjKXN3H (ORCPT + 99 others); Fri, 24 Nov 2023 08:29:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48808 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235220AbjKXN2V (ORCPT ); Fri, 24 Nov 2023 08:28:21 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 252A619BB for ; Fri, 24 Nov 2023 05:27:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832456; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qX2dA3aqMotkmkgBnT9s/2Q/bW+jQxPjHk8bIRl2TXE=; b=dQWeGBfKXMPea9Kj/9qHdaRFhs6cbWq9G+QHWZkqxUminddsyiLqg5ILBCnQ+Wo/rkcETm zXzv++mcKEnuIMWkojOJbWVHLodNAEnCEXh1dke6pZ4PN0lLN3o9uK0a0GvPYKHe2/G6SM 4Lpjtx7snDBxf0B4qa2on9d+IaitCvM= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-198-yyv_70-8PeWkL3Q6JW939Q-1; Fri, 24 Nov 2023 08:27:26 -0500 X-MC-Unique: yyv_70-8PeWkL3Q6JW939Q-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id CD8A52806053; Fri, 24 Nov 2023 13:27:25 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id DE44C2166B2B; Fri, 24 Nov 2023 13:27:22 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. 
McKenney" Subject: [PATCH WIP v1 15/20] mm/rmap_id: verify precalculated subids with CONFIG_DEBUG_VM Date: Fri, 24 Nov 2023 14:26:20 +0100 Message-ID: <20231124132626.235350-16-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:29:24 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452214628749976 X-GMAIL-MSGID: 1783452214628749976 Let's verify the precalculated subids for 4/5/6 values. Signed-off-by: David Hildenbrand --- mm/rmap_id.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/mm/rmap_id.c b/mm/rmap_id.c index 6c3187547741..421d8d2b646c 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -481,3 +481,29 @@ void free_rmap_id(int id) ida_free(&rmap_ida, id); spin_unlock(&rmap_id_lock); } + +#ifdef CONFIG_DEBUG_VM +static int __init rmap_id_init(void) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(rmap_subids_4); i++) + WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_4_MAX_ORDER, i) != + rmap_subids_4[i]); + +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + for (i = 0; i < ARRAY_SIZE(rmap_subids_5); i++) + WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_5_MAX_ORDER, i) != + rmap_subids_5[i]); +#endif + +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + for (i = 0; i < ARRAY_SIZE(rmap_subids_6); i++) + WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_6_MAX_ORDER, i) != + rmap_subids_6[i]); +#endif + + return 0; +} +module_init(rmap_id_init) +#endif /* CONFIG_DEBUG_VM */ From patchwork Fri Nov 24 13:26:21 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169433 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1192576vqx; Fri, 24 Nov 2023 05:30:38 -0800 (PST) X-Google-Smtp-Source: AGHT+IF9L0ad+gMup0/KzCJNTr9aiiCU9MNYG3xJGnmpt/E9sP+h0TMdHzz0XlOL1j5KRVnOrLGe X-Received: by 2002:a05:6a20:8f1a:b0:18b:962c:1ddf with SMTP id b26-20020a056a208f1a00b0018b962c1ddfmr3060156pzk.56.1700832638133; Fri, 24 Nov 2023 05:30:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832638; cv=none; d=google.com; s=arc-20160816; b=rNCOoC7fotfipjmHMYqjGMVOhW5onFbkcYXlUrCItHhDtAVBpwLYK49neHlL2RmFJy vafEFsb7XLAZUDDRIrZZ6Ty4amysJ3VQcURt5nsq3cmNYrCdo83ZE9RGvt6Rb80xCVQh eSGlKXtY02eBNTJGm6lYhG4WLjyxVfRWRwaJNwcsiOB0cGwefbAfRb8Xf9jimglzC1Qa /uvGvngix5WZFKSClfpftDMMc6QBqHtZIyGrcXCGGKXNJ3u2KY0a5Do7yNTC1IGx64Cl NDXAgpOMw7QC2w5ai3qGGB3QzfVbm39PN5dWpwzpslujYRadlcw7pBuOUn3d5v2RUHPP 2UIg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=IWxh+srF4eII3yzN2Ch68mHrKm0yO3+Xlp6snuZfYi4=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=gHDhYFHinZ1ENbVEPJEsvDwHxhWU/b9ROpzF/GECKT/AF9I2JzhEkATtCejzpBwdX8 
UHWDnm2JAhOB9IbLNtWFU7frG5louUdT82mMOAeVl9wFBaUotN5smAxu4V17PGd1G/HY TX6sKonWnJK3FPC5nOYdXVYFklSWoB7wmNJK4/sD+5JSwl0jREJRA4SoKYxvyPjaewA/ XAAQF0p6wR0NYLa2CQsV2SFHpIIMquxWw3hvWRm31cJvkmf85R0CJAwuquUZD9zbSyIe phivvVeLAF3tfDIz+nWsJvpHzl+FTOdIsj2iE9MY9dwJDM2RI4pWxNymKI9Qax3J4vec MNhg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="R/Vj3wZQ"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from howler.vger.email (howler.vger.email. [23.128.96.34]) by mx.google.com with ESMTPS id by37-20020a056a0205a500b005b942de1e92si3835842pgb.443.2023.11.24.05.30.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 05:30:38 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) client-ip=23.128.96.34; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="R/Vj3wZQ"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id 23880807F497; Fri, 24 Nov 2023 05:30:29 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232949AbjKXN27 (ORCPT + 99 others); Fri, 24 Nov 2023 08:28:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37654 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232787AbjKXN2U (ORCPT ); Fri, 24 Nov 2023 08:28:20 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6D2662108 for ; Fri, 24 Nov 2023 05:27:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1700832453; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IWxh+srF4eII3yzN2Ch68mHrKm0yO3+Xlp6snuZfYi4=; b=R/Vj3wZQr5otgXJxdaACSiB4pyJvJfVdaSa+/u5vg6ZjxlQfHjqKf1vDePuZ1Ux96U7ypJ LQ/g0nPGw0usvd7PfsyDn/PwqVMuhyIOJzePDIR/r7Usi3JnRd+Ioz0c0i7+5oAQKMuk8e QFvLt9A+cZI52SKnua9hOTuQ9Cf6lWY= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-344-yC66DjjfM-2TnstsUc80dA-1; Fri, 24 Nov 2023 08:27:30 -0500 X-MC-Unique: yC66DjjfM-2TnstsUc80dA-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 5C903185A780; Fri, 24 Nov 2023 13:27:29 +0000 (UTC) Received: from t14s.fritz.box (unknown [10.39.194.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 150282166B2A; Fri, 24 Nov 2023 
13:27:25 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand , Andrew Morton , Linus Torvalds , Ryan Roberts , Matthew Wilcox , Hugh Dickins , Yin Fengwei , Yang Shi , Ying Huang , Zi Yan , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , "Paul E. McKenney" Subject: [PATCH WIP v1 16/20] atomic_seqcount: support a single exclusive writer in the absence of other writers Date: Fri, 24 Nov 2023 14:26:21 +0100 Message-ID: <20231124132626.235350-17-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:30:29 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452284420170537 X-GMAIL-MSGID: 1783452284420170537 The current atomic seqcount requires that all writers must use atomic RMW operations in the critical section, which can result in quite some overhead on some platforms. In the common case, there is only a single writer, and ideally we'd be able to not use atomic RMW operations in that case, to reduce the overall number of atomic RMW operations on the fast path. So let's add support for a single exclusive writer. If there are no other writers, a writer can become the single exclusive writer by using an atomic cmpxchg on the atomic seqcount. However, if there is any concurrent writer (shared or exclusive), the writers become shared and only have to wait for a single exclusive writer to finish. So shared writers might be delayed a bit by the single exclusive writer, but they don't starve as they are guaranteed to make progress after the exclusive writer finished (that ideally runs faster than any shared writer due to no atomic RMW operations in the critical section). The exclusive path now effectively acts as a lock: if the trylock fails, we fallback to the shared path. We need acquire-release semantics that are implied by the full memory barriers that we are enforcing. Instead of the atomic_long_add_return(), we could keep using an atomic_long_add() + atomic_long_read(). But I suspect that doesn't really matter. If it ever matters, if will be easy to optimize. Signed-off-by: David Hildenbrand --- include/linux/atomic_seqcount.h | 101 ++++++++++++++++++++++++++------ include/linux/rmap.h | 5 +- 2 files changed, 85 insertions(+), 21 deletions(-) diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h index 109447b663a1..00286a9da221 100644 --- a/include/linux/atomic_seqcount.h +++ b/include/linux/atomic_seqcount.h @@ -8,8 +8,11 @@ /* * raw_atomic_seqcount_t -- a reader-writer consistency mechanism with - * lockless readers (read-only retry loops), and lockless writers. - * The writers must use atomic RMW operations in the critical section. + * lockless readers (read-only retry loops), and (almost) lockless writers. 
+ * Shared writers must use atomic RMW operations in the critical section, + * a single exclusive writer can avoid atomic RMW operations in the critical + * section. Shared writers will always have to wait for at most one exclusive + * writer to finish in order to make progress. * * This locking mechanism is applicable when all individual operations * performed by writers can be expressed using atomic RMW operations @@ -38,9 +41,10 @@ typedef struct raw_atomic_seqcount { /* 65536 CPUs */ #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful -#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000000fffful +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x0000000000010000ul +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000001fffful /* We have 48bit for the actual sequence. */ -#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000010000ul +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000020000ul #else /* CONFIG_64BIT */ @@ -48,9 +52,10 @@ typedef struct raw_atomic_seqcount { /* 64 CPUs */ #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful -#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x0000007ful -/* We have 25bit for the actual sequence. */ -#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000080ul +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x00000080ul +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000fful +/* We have 24bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000100ul #endif /* CONFIG_64BIT */ @@ -126,44 +131,102 @@ static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s, /** * raw_write_seqcount_begin() - start a raw_seqcount_t write critical section * @s: Pointer to the raw_atomic_seqcount_t + * @try_exclusive: Whether to try becoming the exclusive writer. * * raw_write_seqcount_begin() opens the write critical section of the * given raw_seqcount_t. This function must not be used in interrupt context. + * + * Return: "true" when we are the exclusive writer and can avoid atomic RMW + * operations in the critical section. Otherwise, we are a shared + * writer and have to use atomic RMW operations in the critical + * section. Will always return "false" if @try_exclusive is not "true". */ -static inline void raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s) +static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, + bool try_exclusive) { + unsigned long seqcount, seqcount_new; + BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT DEBUG_LOCKS_WARN_ON(in_interrupt()); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ preempt_disable(); - atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); - /* Store the sequence before any store in the critical section. */ - smp_mb__after_atomic(); + + /* If requested, can we just become the exclusive writer? */ + if (!try_exclusive) + goto shared; + + seqcount = atomic_long_read(&s->sequence); + if (unlikely(seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK)) + goto shared; + + seqcount_new = seqcount | ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; + /* + * Store the sequence before any store in the critical section. Further, + * this implies an acquire so loads within the critical section are + * not reordered to be outside the critical section. + */ + if (atomic_long_try_cmpxchg(&s->sequence, &seqcount, seqcount_new)) + return true; +shared: + /* + * Indicate that there is a shared writer, and spin until the exclusive + * writer is done. 
This avoids writer starvation, because we'll always + * have to wait for at most one writer. + * + * We spin with preemption disabled to not reschedule to a reader that + * cannot make any progress either way. + * + * Store the sequence before any store in the critical section. + */ + seqcount = atomic_long_add_return(ATOMIC_SEQCOUNT_SHARED_WRITER, + &s->sequence); #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT - DEBUG_LOCKS_WARN_ON((atomic_long_read(&s->sequence) & - ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > + DEBUG_LOCKS_WARN_ON((seqcount & ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER))) + return false; + + while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER) + cpu_relax(); + return false; } /** * raw_write_seqcount_end() - end a raw_seqcount_t write critical section * @s: Pointer to the raw_atomic_seqcount_t + * @exclusive: Return value of raw_write_atomic_seqcount_begin(). * * raw_write_seqcount_end() closes the write critical section of the * given raw_seqcount_t. */ -static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s) +static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s, + bool exclusive) { + unsigned long val = ATOMIC_SEQCOUNT_SEQUENCE_STEP; + + if (likely(exclusive)) { +#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT + DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)); +#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ + val -= ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; + } else { #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT - DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & - ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); + DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & + ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ - /* Store the sequence after any store in the critical section. */ + val -= ATOMIC_SEQCOUNT_SHARED_WRITER; + } + /* + * Store the sequence after any store in the critical section. For + * the exclusive path, this further implies a release, so loads + * within the critical section are not reordered to be outside the + * cricial section. 
+ */ smp_mb__before_atomic(); - atomic_long_add(ATOMIC_SEQCOUNT_SEQUENCE_STEP - - ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence); + atomic_long_add(val, &s->sequence); preempt_enable(); } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 76e6fb1dad5c..0758dddc5528 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -295,12 +295,13 @@ static inline void __folio_write_large_rmap_begin(struct folio *folio) { VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); - raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount); + raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, + false); } static inline void __folio_write_large_rmap_end(struct folio *folio) { - raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount); + raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, false); } void __folio_set_large_rmap_val(struct folio *folio, int count, From patchwork Fri Nov 24 13:26:22 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 169430 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1192154vqx; Fri, 24 Nov 2023 05:30:12 -0800 (PST) X-Google-Smtp-Source: AGHT+IHK2I5S0fw8XsKyKLtD5lQEutx5kkH5WjxXm4+jIx7ZX4L4FV8tvvv/ajF7tGQ4B9fhmuNw X-Received: by 2002:a17:90a:d494:b0:285:8939:c4b3 with SMTP id s20-20020a17090ad49400b002858939c4b3mr2156489pju.13.1700832612318; Fri, 24 Nov 2023 05:30:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700832612; cv=none; d=google.com; s=arc-20160816; b=B9M3rZEkO2A25l1lrmPZID8KbtICDdkw0O3UnW5ZMUYdzHi8SEw5zn/97Q+eZGGp3V FE9nhxAV1ZdFQ2Zg0CdiqPrTrMYqnCaSbjtCToFWWETkibfmL3rjhaT5aJP5WaXDWVHX l/QDfI3iC/W+KsoIcbQS4BbofPBMuedLjYUi9++rc9CNAFPOH/h/FjlD9Jz7nTz3Tql7 25V5BC/G4VITucuM6MJ1/K+Imy9SyrmC+PHbLUsUtiWahXQaTR+I2aHNedAvoZQr5bt7 nI0bnJinUVAvmwJQmZI7mtGhE3fQCMiD2+79i4h2hmTG+3W9bMi0anoFP4wn5Cynns3D 08Vw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=Cfr/GJlqpI78eOuvYmduIMFro8OwYW0A0XSs9rIUuNI=; fh=UVV9UP3jB+jR+DJP/Pn6IwXb1P9vDh5E3FRL6G+mlkU=; b=EgFaEfn4IhN5w/fjB2qAsAnAs5smb70VDpb/OzZTHbscH00NmiActd5iM+uUyKe0BZ kpGEGM9nfBZV6XJFF3fKWS9lnNlgqzUU3+J4kNmzwE/KrzWu+RTD+PSUKNGifbHy8OW6 Pjri3KFDk2oXH1kaTFY7oLyboIu4Nmcyv64F0+u1AoOq0+a35sNhM3HgdCGLhR6a9+C0 1jLJfNskzpPIYdqIK6uyXZokMy9DZ7Hx5lucGfpQSwbonUpOAdzsFAxY4hQexrOiiWGF O4BEd3NiCvfyQ5Grjv0j09GT3d2JAjllvxcBYIsW2e8rzpthKZm+Q85XZQ964XyKEi2p TIsw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=FyleEjPm; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from lipwig.vger.email (lipwig.vger.email. 
McKenney" Subject: [PATCH WIP v1 17/20] mm/rmap_id: reduce atomic RMW operations when we are the exclusive writer Date: Fri, 24 Nov 2023 14:26:22 +0100 Message-ID: <20231124132626.235350-18-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:29:42 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452256988591114 X-GMAIL-MSGID: 1783452256988591114 We can reduce the number of atomic RMW operations when we are the single exclusive writer -- the common case. So instead of always requiring (1) 2 atomic RMW operations for adjusting the atomic seqcount (2) 1 atomic RMW operation for adjusting the total mapcount (3) 1 to 6 atomic RMW operation for adjusting the rmap values We can avoid (2) and (3) if we are the exclusive writer and limit it to the 2 atomic RMW operations from (1). Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 81 +++++++++++++++++++++++++++++++++----------- mm/rmap_id.c | 52 ++++++++++++++++++++++++++++ 2 files changed, 114 insertions(+), 19 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 0758dddc5528..538c23d3c0c9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -291,23 +291,36 @@ static inline void __folio_undo_large_rmap(struct folio *folio) #endif } -static inline void __folio_write_large_rmap_begin(struct folio *folio) +static inline bool __folio_write_large_rmap_begin(struct folio *folio) { + bool exclusive; + VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); - raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, - false); + + exclusive = raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, + true); + if (likely(exclusive)) { + prefetchw(&folio->_rmap_val0); + if (unlikely(folio_order(folio) > RMAP_SUBID_4_MAX_ORDER)) + prefetchw(&folio->_rmap_val4); + } + return exclusive; } -static inline void __folio_write_large_rmap_end(struct folio *folio) +static inline void __folio_write_large_rmap_end(struct folio *folio, + bool exclusive) { - raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, false); + raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, + exclusive); } void __folio_set_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); void __folio_add_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm); +void __folio_add_large_rmap_val_exclusive(struct folio *folio, int count, + struct mm_struct *mm); bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, struct mm_struct *mm); #else @@ -317,12 +330,14 @@ static inline void __folio_prep_large_rmap(struct folio *folio) static inline void __folio_undo_large_rmap(struct folio *folio) { } -static inline void __folio_write_large_rmap_begin(struct folio *folio) +static inline bool __folio_write_large_rmap_begin(struct folio *folio) { 
VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio); VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); + return false; } -static inline void __folio_write_large_rmap_end(struct folio *folio) +static inline void __folio_write_large_rmap_end(struct folio *folio, + bool exclusive) { } static inline void __folio_set_large_rmap_val(struct folio *folio, int count, @@ -333,6 +348,10 @@ static inline void __folio_add_large_rmap_val(struct folio *folio, int count, struct mm_struct *mm) { } +static inline void __folio_add_large_rmap_val_exclusive(struct folio *folio, + int count, struct mm_struct *mm) +{ +} #endif /* CONFIG_RMAP_ID */ static inline void folio_set_large_mapcount(struct folio *folio, @@ -348,28 +367,52 @@ static inline void folio_set_large_mapcount(struct folio *folio, static inline void folio_inc_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - __folio_write_large_rmap_begin(folio); - atomic_inc(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, 1, vma->vm_mm); - __folio_write_large_rmap_end(folio); + bool exclusive; + + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + 1); + __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm); + } else { + atomic_inc(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, 1, vma->vm_mm); + } + __folio_write_large_rmap_end(folio, exclusive); } static inline void folio_add_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { - __folio_write_large_rmap_begin(folio); - atomic_add(count, &folio->_total_mapcount); - __folio_add_large_rmap_val(folio, count, vma->vm_mm); - __folio_write_large_rmap_end(folio); + bool exclusive; + + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + count); + __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm); + } else { + atomic_add(count, &folio->_total_mapcount); + __folio_add_large_rmap_val(folio, count, vma->vm_mm); + } + __folio_write_large_rmap_end(folio, exclusive); } static inline void folio_dec_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - __folio_write_large_rmap_begin(folio); - atomic_dec(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, -1, vma->vm_mm); - __folio_write_large_rmap_end(folio); + bool exclusive; + + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) - 1); + __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm); + } else { + atomic_dec(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, -1, vma->vm_mm); + } + __folio_write_large_rmap_end(folio, exclusive); } /* RMAP flags, currently only relevant for some anon rmap operations. */ diff --git a/mm/rmap_id.c b/mm/rmap_id.c index 421d8d2b646c..5009c6e43965 100644 --- a/mm/rmap_id.c +++ b/mm/rmap_id.c @@ -379,6 +379,58 @@ void __folio_add_large_rmap_val(struct folio *folio, int count, } } +void __folio_add_large_rmap_val_exclusive(struct folio *folio, int count, + struct mm_struct *mm) +{ + const unsigned int order = folio_order(folio); + + /* + * Concurrent rmap value modifications are impossible. We don't care + * about store tearing because readers will realize the concurrent + * updates using the seqcount and simply retry. So adjust the bare + * atomic counter instead. 
+ */ + switch (order) { +#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER + case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_6(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_6(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_6(mm, 2) * count; + folio->_rmap_val3.counter += get_rmap_subid_6(mm, 3) * count; + folio->_rmap_val4.counter += get_rmap_subid_6(mm, 4) * count; + folio->_rmap_val5.counter += get_rmap_subid_6(mm, 5) * count; + break; +#endif +#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER + case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_5(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_5(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_5(mm, 2) * count; + folio->_rmap_val3.counter += get_rmap_subid_5(mm, 3) * count; + folio->_rmap_val4.counter += get_rmap_subid_5(mm, 4) * count; + break; +#endif + case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_4(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_4(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_4(mm, 2) * count; + folio->_rmap_val3.counter += get_rmap_subid_4(mm, 3) * count; + break; + case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_3(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_3(mm, 1) * count; + folio->_rmap_val2.counter += get_rmap_subid_3(mm, 2) * count; + break; + case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER: + folio->_rmap_val0.counter += get_rmap_subid_2(mm, 0) * count; + folio->_rmap_val1.counter += get_rmap_subid_2(mm, 1) * count; + break; + default: + folio->_rmap_val0.counter += get_rmap_subid_1(mm); + break; + } +} + bool __folio_has_large_matching_rmap_val(struct folio *folio, int count, struct mm_struct *mm) {
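
The exclusive path added above replaces atomic RMW operations on the mapcount/rmap values with plain read+write updates. A compact sketch of why that is safe under the seqcount; the names are stand-ins, and the kernel uses atomic_read()/atomic_set() rather than C11 atomics:

#include <stdatomic.h>
#include <stdbool.h>

typedef _Atomic int counter_t;	/* stand-in for the kernel's atomic_t */

static inline void counter_add(counter_t *c, int delta, bool exclusive)
{
	if (exclusive) {
		/*
		 * The seqcount guarantees no concurrent writer exists, and
		 * concurrent readers detect the in-flight update via the
		 * sequence and retry, so a plain load + store (no atomic
		 * RMW, tearing tolerated) is sufficient here.
		 */
		atomic_store_explicit(c,
			atomic_load_explicit(c, memory_order_relaxed) + delta,
			memory_order_relaxed);
	} else {
		/* Other shared writers may race: a real atomic RMW is needed. */
		atomic_fetch_add(c, delta);
	}
}
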
From patchwork Fri Nov 24 13:26:23 2023
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 169429
From: David Hildenbrand
Subject: [PATCH WIP v1 18/20] atomic_seqcount: use atomic add-return instead of atomic cmpxchg on 64bit
Date: Fri, 24 Nov 2023 14:26:23 +0100
Message-ID: <20231124132626.235350-19-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

Turns out that it can be beneficial on some HW to use an add-return
instead of an atomic cmpxchg. However, we have to deal with more possible
races now: in the worst case, each and every CPU might try becoming the
exclusive writer at the same time, so we need the same number of bits as
for the shared writer case.

In case we detect that we didn't end up being the exclusive writer,
simply back off and convert to a shared writer.

Only implement this optimization on 64bit, where we can steal more bits
from the actual sequence without sorrow.

Signed-off-by: David Hildenbrand
---
 include/linux/atomic_seqcount.h | 43 +++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h index 00286a9da221..9cd40903863d 100644 --- a/include/linux/atomic_seqcount.h +++ b/include/linux/atomic_seqcount.h @@ -42,9 +42,10 @@ typedef struct raw_atomic_seqcount { #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful #define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x0000000000010000ul -#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000001fffful -/* We have 48bit for the actual sequence. */ -#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000020000ul +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK 0x00000000ffff0000ul +#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x00000000fffffffful +/* We have 32bit for the actual sequence. */ +#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000100000000ul #else /* CONFIG_64BIT */ @@ -53,6 +54,7 @@ typedef struct raw_atomic_seqcount { #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul #define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful #define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x00000080ul +#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK 0x00000080ul #define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000fful /* We have 24bit for the actual sequence.
*/ #define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000100ul @@ -144,7 +146,7 @@ static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s, static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, bool try_exclusive) { - unsigned long seqcount, seqcount_new; + unsigned long __maybe_unused seqcount, seqcount_new; BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT)); #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT @@ -160,6 +162,32 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, if (unlikely(seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK)) goto shared; +#ifdef CONFIG_64BIT + BUILD_BUG_ON(__builtin_popcount(ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK) != + __builtin_popcount(ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)); + + /* See comment for atomic_long_try_cmpxchg() below. */ + seqcount = atomic_long_add_return(ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER, + &s->sequence); + if (likely((seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK) == + ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)) + return true; + + /* + * Whoops, we raced with another writer. Back off, converting ourselves + * to a shared writer and wait for any exclusive writers. + */ + atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER - ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER, + &s->sequence); + /* + * No need for __smp_mb__after_atomic(): the reader side already + * realizes that it has to retry and the memory barrier from + * atomic_long_add_return() is sufficient for that. + */ + while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK) + cpu_relax(); + return false; +#else seqcount_new = seqcount | ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; /* * Store the sequence before any store in the critical section. Further, @@ -168,6 +196,7 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, */ if (atomic_long_try_cmpxchg(&s->sequence, &seqcount, seqcount_new)) return true; +#endif shared: /* * Indicate that there is a shared writer, and spin until the exclusive @@ -185,10 +214,10 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s, DEBUG_LOCKS_WARN_ON((seqcount & ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) > ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ - if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER))) + if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK))) return false; - while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER) + while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK) cpu_relax(); return false; } @@ -209,7 +238,7 @@ static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s, if (likely(exclusive)) { #ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) & - ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)); + ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK)); #endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */ val -= ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER; } else {
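
A userspace sketch of the 64bit fast path above: try to become the exclusive writer with a single add-return, and convert to a shared writer on a race. The constants mirror the 64bit layout in the hunk, but the helper itself is only an illustration of the idea, not the kernel code:

#include <stdatomic.h>
#include <stdbool.h>

#define SHARED_WRITER           0x0000000000000001ul
#define EXCLUSIVE_WRITER        0x0000000000010000ul
#define EXCLUSIVE_WRITERS_MASK  0x00000000ffff0000ul
#define WRITERS_MASK            0x00000000fffffffful

static _Atomic unsigned long sequence;

static bool write_begin(void)
{
	/* add-return: atomic_fetch_add() returns the old value, so add again. */
	unsigned long val = atomic_fetch_add(&sequence, EXCLUSIVE_WRITER) +
			    EXCLUSIVE_WRITER;

	if ((val & WRITERS_MASK) == EXCLUSIVE_WRITER)
		return true;	/* nobody else is writing: we are exclusive */

	/* Raced with other writers: back off into a shared writer ... */
	atomic_fetch_add(&sequence, SHARED_WRITER - EXCLUSIVE_WRITER);
	/* ... and wait until no CPU still claims exclusive access. */
	while (atomic_load(&sequence) & EXCLUSIVE_WRITERS_MASK)
		;	/* cpu_relax() in the kernel */
	return false;
}
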
From patchwork Fri Nov 24 13:26:24 2023
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 169432
From: David Hildenbrand
Subject: [PATCH WIP v1 19/20] mm/rmap: factor out removing folio range into __folio_remove_rmap_range()
Date: Fri, 24 Nov 2023 14:26:24 +0100
Message-ID: <20231124132626.235350-20-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
References: <20231124132626.235350-1-david@redhat.com>

Let's factor it out, optimize for small folios, and compact it a bit.

Well, we're adding the range part, but that will surely come in handy
soon -- and it's now easier to compare it with __folio_add_rmap_range().

Signed-off-by: David Hildenbrand
---
 mm/rmap.c | 90 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 55 insertions(+), 35 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c index da7fa46a18fc..80ac53633332 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1155,6 +1155,57 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, return nr; } +static unsigned int __folio_remove_rmap_range(struct folio *folio, + struct page *page, unsigned int nr_pages, + struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) +{ + atomic_t *mapped = &folio->_nr_pages_mapped; + int last, count, nr = 0; + + VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); + VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); + VM_WARN_ON_FOLIO(compound && nr_pages != folio_nr_pages(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_large(folio) && nr_pages != 1, folio); + + if (likely(!folio_test_large(folio))) + return atomic_add_negative(-1, &page->_mapcount); + + /* Is page being unmapped by PTE? Is this its last map to be removed?
*/ + if (!compound) { + folio_add_large_mapcount(folio, -nr_pages, vma); + count = nr_pages; + do { + last = atomic_add_negative(-1, &page->_mapcount); + if (last) { + last = atomic_dec_return_relaxed(mapped); + if (last < COMPOUND_MAPPED) + nr++; + } + } while (page++, --count > 0); + } else if (folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + folio_dec_large_mapcount(folio, vma); + last = atomic_add_negative(-1, &folio->_entire_mapcount); + if (last) { + nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped); + if (likely(nr < COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of another remove and an add? */ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* An add of COMPOUND_MAPPED raced ahead */ + nr = 0; + } + } + } else { + VM_WARN_ON_ONCE_FOLIO(true, folio); + } + return nr; +} + /** * folio_move_anon_rmap - move a folio to our anon_vma * @folio: The folio to move to our anon_vma @@ -1439,13 +1490,10 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, bool compound) { struct folio *folio = page_folio(page); - atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; - bool last; + unsigned long nr_pages = compound ? folio_nr_pages(folio) : 1; + unsigned int nr, nr_pmdmapped = 0; enum node_stat_item idx; - VM_BUG_ON_PAGE(compound && !PageHead(page), page); - /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ @@ -1454,36 +1502,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, return; } - if (folio_test_large(folio)) - folio_dec_large_mapcount(folio, vma); - - /* Is page being unmapped by PTE? Is this its last map to be removed? */ - if (likely(!compound)) { - last = atomic_add_negative(-1, &page->_mapcount); - nr = last; - if (last && folio_test_large(folio)) { - nr = atomic_dec_return_relaxed(mapped); - nr = (nr < COMPOUND_MAPPED); - } - } else if (folio_test_pmd_mappable(folio)) { - /* That test is redundant: it's for safety or to optimize out */ - - last = atomic_add_negative(-1, &folio->_entire_mapcount); - if (last) { - nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped); - if (likely(nr < COMPOUND_MAPPED)) { - nr_pmdmapped = folio_nr_pages(folio); - nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); - /* Raced ahead of another remove and an add? 
*/ - if (unlikely(nr < 0)) - nr = 0; - } else { - /* An add of COMPOUND_MAPPED raced ahead */ - nr = 0; - } - } - } - + nr = __folio_remove_rmap_range(folio, page, nr_pages, vma, compound, + &nr_pmdmapped); if (nr_pmdmapped) { if (folio_test_anon(folio)) idx = NR_ANON_THPS;
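
As a rough model of the per-page loop that __folio_remove_rmap_range() runs for PTE unmaps, the sketch below counts how many pages of the range lost their last mapping; struct page is reduced to its mapcount, and COMPOUND_MAPPED stands in for the kernel's PMD-mapping bias (values and names here are assumptions for illustration):

#include <stdatomic.h>

#define COMPOUND_MAPPED 0x800000	/* assumed bias for a PMD mapping */

struct page_stub {
	_Atomic int mapcount;		/* like _mapcount: -1 == unmapped */
};

static int remove_rmap_range(struct page_stub *page, int nr_pages,
			     _Atomic int *nr_pages_mapped)
{
	int nr = 0;

	do {
		/* Did this page just lose its last PTE mapping? */
		if (atomic_fetch_sub(&page->mapcount, 1) - 1 < 0) {
			/*
			 * Unless a PMD mapping (the COMPOUND_MAPPED bias) is
			 * still present, the page is now fully unmapped and
			 * contributes to the NR_*MAPPED adjustment.
			 */
			if (atomic_fetch_sub(nr_pages_mapped, 1) - 1 < COMPOUND_MAPPED)
				nr++;
		}
	} while (page++, --nr_pages > 0);

	return nr;
}
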
McKenney" Subject: [PATCH WIP v1 20/20] mm/rmap: perform all mapcount operations of large folios under the rmap seqcount Date: Fri, 24 Nov 2023 14:26:25 +0100 Message-ID: <20231124132626.235350-21-david@redhat.com> In-Reply-To: <20231124132626.235350-1-david@redhat.com> References: <20231124132626.235350-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.6 X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 05:30:07 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783452370716979334 X-GMAIL-MSGID: 1783452370716979334 Let's extend the atomic seqcount to also protect modifications of: * The subpage mapcounts * The entire mapcount * folio->_nr_pages_mapped This way, we can avoid another 1/2 atomic RMW operations on the fast path (and significantly more when patching): When we are the exclusive writer, we only need two atomic RMW operations to manage the atomic seqcount. Let's document how the existing atomic seqcount memory barriers keep the old behavior unmodified: especially, how it makes sure that folio refcount updates cannot be reordered with folio mapcount updates. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 95 ++++++++++++++++++++++++++------------------ mm/rmap.c | 84 +++++++++++++++++++++++++++++++++++++-- 2 files changed, 137 insertions(+), 42 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 538c23d3c0c9..3cff4aa71393 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -301,6 +301,12 @@ static inline bool __folio_write_large_rmap_begin(struct folio *folio) exclusive = raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount, true); if (likely(exclusive)) { + /* + * Note: raw_write_atomic_seqcount_begin() implies a full + * memory barrier like non-exclusive mapcount operations + * will. Any refcount updates that happened before this call + * are visible before any mapcount updates on other CPUs. + */ prefetchw(&folio->_rmap_val0); if (unlikely(folio_order(folio) > RMAP_SUBID_4_MAX_ORDER)) prefetchw(&folio->_rmap_val4); @@ -311,6 +317,12 @@ static inline bool __folio_write_large_rmap_begin(struct folio *folio) static inline void __folio_write_large_rmap_end(struct folio *folio, bool exclusive) { + /* + * Note: raw_write_atomic_seqcount_end() implies a full memory + * barrier like non-exclusive mapcount operations will. Any + * refcount updates happening after this call are visible after any + * mapcount updates on other CPUs. 
+ */ raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, exclusive); } @@ -367,52 +379,46 @@ static inline void folio_set_large_mapcount(struct folio *folio, static inline void folio_inc_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - bool exclusive; + atomic_inc(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, 1, vma->vm_mm); +} - exclusive = __folio_write_large_rmap_begin(folio); - if (likely(exclusive)) { - atomic_set(&folio->_total_mapcount, - atomic_read(&folio->_total_mapcount) + 1); - __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm); - } else { - atomic_inc(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, 1, vma->vm_mm); - } - __folio_write_large_rmap_end(folio, exclusive); +static inline void folio_inc_large_mapcount_exclusive(struct folio *folio, + struct vm_area_struct *vma) +{ + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + 1); + __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm); } static inline void folio_add_large_mapcount(struct folio *folio, int count, struct vm_area_struct *vma) { - bool exclusive; + atomic_add(count, &folio->_total_mapcount); + __folio_add_large_rmap_val(folio, count, vma->vm_mm); +} - exclusive = __folio_write_large_rmap_begin(folio); - if (likely(exclusive)) { - atomic_set(&folio->_total_mapcount, - atomic_read(&folio->_total_mapcount) + count); - __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm); - } else { - atomic_add(count, &folio->_total_mapcount); - __folio_add_large_rmap_val(folio, count, vma->vm_mm); - } - __folio_write_large_rmap_end(folio, exclusive); +static inline void folio_add_large_mapcount_exclusive(struct folio *folio, + int count, struct vm_area_struct *vma) +{ + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) + count); + __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm); } static inline void folio_dec_large_mapcount(struct folio *folio, struct vm_area_struct *vma) { - bool exclusive; + atomic_dec(&folio->_total_mapcount); + __folio_add_large_rmap_val(folio, -1, vma->vm_mm); +} - exclusive = __folio_write_large_rmap_begin(folio); - if (likely(exclusive)) { - atomic_set(&folio->_total_mapcount, - atomic_read(&folio->_total_mapcount) - 1); - __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm); - } else { - atomic_dec(&folio->_total_mapcount); - __folio_add_large_rmap_val(folio, -1, vma->vm_mm); - } - __folio_write_large_rmap_end(folio, exclusive); +static inline void folio_dec_large_mapcount_exclusive(struct folio *folio, + struct vm_area_struct *vma) +{ + atomic_set(&folio->_total_mapcount, + atomic_read(&folio->_total_mapcount) - 1); + __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm); } /* RMAP flags, currently only relevant for some anon rmap operations. 
*/ @@ -462,6 +468,7 @@ static inline void __page_dup_rmap(struct page *page, struct vm_area_struct *dst_vma, bool compound) { struct folio *folio = page_folio(page); + bool exclusive; VM_BUG_ON_PAGE(compound && !PageHead(page), page); if (likely(!folio_test_large(folio))) { @@ -475,11 +482,23 @@ static inline void __page_dup_rmap(struct page *page, return; } - if (compound) - atomic_inc(&folio->_entire_mapcount); - else - atomic_inc(&page->_mapcount); - folio_inc_large_mapcount(folio, dst_vma); + exclusive = __folio_write_large_rmap_begin(folio); + if (likely(exclusive)) { + if (compound) + atomic_set(&folio->_entire_mapcount, + atomic_read(&folio->_entire_mapcount) + 1); + else + atomic_set(&page->_mapcount, + atomic_read(&page->_mapcount) + 1); + folio_inc_large_mapcount_exclusive(folio, dst_vma); + } else { + if (compound) + atomic_inc(&folio->_entire_mapcount); + else + atomic_inc(&page->_mapcount); + folio_inc_large_mapcount(folio, dst_vma); + } + __folio_write_large_rmap_end(folio, exclusive); } static inline void page_dup_file_rmap(struct page *page, diff --git a/mm/rmap.c b/mm/rmap.c index 80ac53633332..755a62b046e2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1109,7 +1109,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; - int first, count, nr = 0; + int first, val, count, nr = 0; + bool exclusive; VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); @@ -1119,8 +1120,23 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, if (likely(!folio_test_large(folio))) return atomic_inc_and_test(&page->_mapcount); + exclusive = __folio_write_large_rmap_begin(folio); + /* Is page being mapped by PTE? Is this its first map to be added? */ - if (!compound) { + if (likely(exclusive) && !compound) { + count = nr_pages; + do { + val = atomic_read(&page->_mapcount) + 1; + atomic_set(&page->_mapcount, val); + if (!val) { + val = atomic_read(mapped) + 1; + atomic_set(mapped, val); + if (val < COMPOUND_MAPPED) + nr++; + } + } while (page++, --count > 0); + folio_add_large_mapcount_exclusive(folio, nr_pages, vma); + } else if (!compound) { count = nr_pages; do { first = atomic_inc_and_test(&page->_mapcount); @@ -1131,6 +1147,26 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, } } while (page++, --count > 0); folio_add_large_mapcount(folio, nr_pages, vma); + } else if (likely(exclusive) && folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + val = atomic_read(&folio->_entire_mapcount) + 1; + atomic_set(&folio->_entire_mapcount, val); + if (!val) { + nr = atomic_read(mapped) + COMPOUND_MAPPED; + atomic_set(mapped, nr); + if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of a remove and another add? 
*/ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* Raced ahead of a remove of COMPOUND_MAPPED */ + nr = 0; + } + } + folio_inc_large_mapcount_exclusive(folio, vma); } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1152,6 +1188,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio, } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } + + __folio_write_large_rmap_end(folio, exclusive); return nr; } @@ -1160,7 +1198,8 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, struct vm_area_struct *vma, bool compound, int *nr_pmdmapped) { atomic_t *mapped = &folio->_nr_pages_mapped; - int last, count, nr = 0; + int last, val, count, nr = 0; + bool exclusive; VM_WARN_ON_FOLIO(compound && page != &folio->page, folio); VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); @@ -1170,8 +1209,23 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, if (likely(!folio_test_large(folio))) return atomic_add_negative(-1, &page->_mapcount); + exclusive = __folio_write_large_rmap_begin(folio); + /* Is page being unmapped by PTE? Is this its last map to be removed? */ - if (!compound) { + if (likely(exclusive) && !compound) { + folio_add_large_mapcount_exclusive(folio, -nr_pages, vma); + count = nr_pages; + do { + val = atomic_read(&page->_mapcount) - 1; + atomic_set(&page->_mapcount, val); + if (val < 0) { + val = atomic_read(mapped) - 1; + atomic_set(mapped, val); + if (val < COMPOUND_MAPPED) + nr++; + } + } while (page++, --count > 0); + } else if (!compound) { folio_add_large_mapcount(folio, -nr_pages, vma); count = nr_pages; do { @@ -1182,6 +1236,26 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, nr++; } } while (page++, --count > 0); + } else if (likely(exclusive) && folio_test_pmd_mappable(folio)) { + /* That test is redundant: it's for safety or to optimize out */ + + folio_dec_large_mapcount_exclusive(folio, vma); + val = atomic_read(&folio->_entire_mapcount) - 1; + atomic_set(&folio->_entire_mapcount, val); + if (val < 0) { + nr = atomic_read(mapped) - COMPOUND_MAPPED; + atomic_set(mapped, nr); + if (likely(nr < COMPOUND_MAPPED)) { + *nr_pmdmapped = folio_nr_pages(folio); + nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED); + /* Raced ahead of another remove and an add? */ + if (unlikely(nr < 0)) + nr = 0; + } else { + /* An add of COMPOUND_MAPPED raced ahead */ + nr = 0; + } + } } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1203,6 +1277,8 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio, } else { VM_WARN_ON_ONCE_FOLIO(true, folio); } + + __folio_write_large_rmap_end(folio, exclusive); return nr; }
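
To close the series from the reader's point of view, here is a small sketch of the retry loop that makes the exclusive-writer shortcuts above safe: a reader snapshots the sequence, reads the values it needs, and retries whenever a writer was active in between. As in the earlier sketches, the constants and names are illustrative userspace stand-ins, not the kernel's:

#include <stdatomic.h>

#define WRITERS_MASK 0x1fffful		/* illustrative layout */

static _Atomic unsigned long sequence;
static _Atomic long rmap_val0;

static long read_rmap_val0(void)
{
	unsigned long snap;
	long val;

	do {
		snap = atomic_load(&sequence);		/* read_begin() */
		/* May race with a writer; the check below catches that. */
		val = atomic_load_explicit(&rmap_val0, memory_order_relaxed);
	} while ((snap & WRITERS_MASK) ||		/* writer active? */
		 atomic_load(&sequence) != snap);	/* sequence moved? */

	return val;
}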