From patchwork Wed Feb 14 20:44:25 2024
X-Patchwork-Submitter: David Hildenbrand
X-Patchwork-Id: 20394
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Andrew Morton, Matthew Wilcox,
    Ryan Roberts, Catalin Marinas, Yin Fengwei, Michal Hocko, Will Deacon,
    "Aneesh Kumar K.V", Nick Piggin, Peter Zijlstra, Michael Ellerman,
    Christophe Leroy, "Naveen N. Rao", Heiko Carstens, Vasily Gorbik,
    Alexander Gordeev, Christian Borntraeger, Sven Schnelle, Arnd Bergmann,
    linux-arch@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
    linux-s390@vger.kernel.org
Subject: [PATCH v3 00/10] mm/memory: optimize unmap/zap with PTE-mapped THP
Date: Wed, 14 Feb 2024 21:44:25 +0100
Message-ID: <20240214204435.167852-1-david@redhat.com>

This series is based on [1]. Similar to what we did with fork(), let's
implement PTE batching during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for arm64;
in its latest iteration [2] the focus was on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
batches of pages that belong to the same large folio, and makes use of our
new rmap batching function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping of
consecutive pages that all belong to the same large folio. I'm being very
careful to not degrade order-0 performance, and it looks like I managed to
achieve that.

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64 [2], it's one of the requirements for
maintaining a total mapcount [3] for large folios with minimal added
overhead and further changes [4] that build up on top of the total
mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory [5]).
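For readers not familiar with the idea, here is a minimal, hypothetical
user-space model of the batching concept (not the kernel implementation;
all names below are made up): consecutive "PTEs" that map consecutive
pages of the same "folio" are grouped so that per-folio work happens once
per batch instead of once per page.

/*
 * Hypothetical user-space model of PTE batching; not kernel code.
 */
#include <stdio.h>
#include <stddef.h>

struct model_pte {
	unsigned long pfn;	/* page frame the entry maps */
	int folio_id;		/* which (model) large folio the page belongs to */
};

/* How many consecutive entries continue the batch started at 'start'? */
static size_t pte_batch_len(const struct model_pte *ptes, size_t n, size_t start)
{
	size_t len = 1;

	while (start + len < n &&
	       ptes[start + len].folio_id == ptes[start].folio_id &&
	       ptes[start + len].pfn == ptes[start].pfn + len)
		len++;
	return len;
}

int main(void)
{
	/* Two 4-page folios mapped contiguously, followed by an order-0 page. */
	struct model_pte ptes[] = {
		{100, 1}, {101, 1}, {102, 1}, {103, 1},
		{200, 2}, {201, 2}, {202, 2}, {203, 2},
		{300, 3},
	};
	size_t n = sizeof(ptes) / sizeof(ptes[0]), i = 0;

	while (i < n) {
		size_t len = pte_batch_len(ptes, n, i);

		/*
		 * In the kernel this would be one refcount adjustment, one
		 * rmap call, one batched PTE clear and one TLB entry removal
		 * per batch.
		 */
		printf("batch: folio %d, %zu page(s)\n", ptes[i].folio_id, len);
		i += len;
	}
	return 0;
}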
On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable |      New | Change
---------------------------------------------
      4KiB |    0.058110 | 0.057715 |   - 1%
     16KiB |    0.044198 | 0.035469 |   -20%
     32KiB |    0.034216 | 0.023522 |   -31%
     64KiB |    0.029207 | 0.018434 |   -37%
    128KiB |    0.026579 | 0.014026 |   -47%
    256KiB |    0.025130 | 0.011756 |   -53%
    512KiB |    0.024292 | 0.010703 |   -56%
   1024KiB |    0.023812 | 0.010294 |   -57%
   2048KiB |    0.023785 | 0.009910 |   -58%

CCing especially the s390x folks, because they have a TLB freeing hook
that needs adjustment.

Only tested on x86-64 for now; I will have to do some more stress testing.
Compile-tested on most other architectures. The PPC change is negligible
and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com

---

Sending this out earlier than I usually would, so we can get this into
mm-unstable for Ryan to base his cont-pte work on it ASAP. The performance
numbers are from v1; a quick benchmark run of v3 showed nothing
significantly changed, as the relevant code paths remained unchanged.

v2 -> v3:
 * "mm/mmu_gather: add __tlb_remove_folio_pages()"
   -> Slightly adjusted patch description
 * "mm/mmu_gather: improve cond_resched() handling with large folios and
    expensive page freeing"
   -> Use new macro for magic value and avoid code duplication
   -> Extend patch description
 * Pick up RBs

v1 -> v2:
 * "mm/memory: factor out zapping of present pte into zap_present_pte()"
   -> Initialize "struct folio *folio" to NULL
 * "mm/memory: handle !page case in zap_present_pte() separately"
   -> Extend description regarding arch_check_zapped_pte()
 * "mm/mmu_gather: add __tlb_remove_folio_pages()"
   -> ENCODED_PAGE_BIT_NR_PAGES_NEXT
   -> Extend patch description regarding "batching more"
 * "mm/mmu_gather: improve cond_resched() handling with large folios and
    expensive page freeing"
   -> Handle the (so far) theoretical case of possible soft lockups when
      we zero/poison memory when freeing pages. Try to keep the old
      behavior in that corner case to be safe.
 * "mm/memory: optimize unmap/zap with PTE-mapped THP"
   -> Clarify description of new ptep clearing functions regarding
      "present PTEs"
   -> Extend patch description regarding relaxed mapcount sanity checks
   -> Improve zap_present_ptes() description
 * Pick up RBs
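As background on the encoded_page enlightenment mentioned above, the trick
is plain pointer tagging: page pointers are aligned, so their low bits are
free to carry flags such as "delay rmap" or "the next array slot holds the
number of pages of this batch". The following is a hypothetical user-space
model of that idea, not the kernel's struct encoded_page; the flag names
are merely modeled after ENCODED_PAGE_FLAG_DELAY_RMAP and
ENCODED_PAGE_BIT_NR_PAGES_NEXT from this series.

/*
 * Hypothetical user-space model of pointer tagging; not kernel code.
 */
#include <stdio.h>
#include <stdint.h>
#include <assert.h>

#define FLAG_DELAY_RMAP     0x1UL	/* modeled after ENCODED_PAGE_FLAG_DELAY_RMAP */
#define FLAG_NR_PAGES_NEXT  0x2UL	/* modeled after ENCODED_PAGE_BIT_NR_PAGES_NEXT */
#define FLAG_MASK           0x3UL

struct page { char dummy[64]; } __attribute__((aligned(64)));

static uintptr_t encode_page(struct page *p, unsigned long flags)
{
	assert(((uintptr_t)p & FLAG_MASK) == 0);	/* alignment keeps low bits free */
	return (uintptr_t)p | flags;
}

static struct page *decode_page(uintptr_t enc)
{
	return (struct page *)(enc & ~FLAG_MASK);
}

int main(void)
{
	static struct page pg;
	/* A batch of 8 pages of one folio: tagged pointer, then the count. */
	uintptr_t batch[2];

	batch[0] = encode_page(&pg, FLAG_NR_PAGES_NEXT);
	batch[1] = 8;	/* nr_pages of the batch, consumed by the freeing code */

	printf("page %p, nr_pages_next=%d, nr_pages=%lu\n",
	       (void *)decode_page(batch[0]),
	       (batch[0] & FLAG_NR_PAGES_NEXT) ? 1 : 0,
	       (unsigned long)batch[1]);
	return 0;
}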
Rao" Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Sven Schnelle Cc: Arnd Bergmann Cc: linux-arch@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org David Hildenbrand (10): mm/memory: factor out zapping of present pte into zap_present_pte() mm/memory: handle !page case in zap_present_pte() separately mm/memory: further separate anon and pagecache folio handling in zap_present_pte() mm/memory: factor out zapping folio pte into zap_present_folio_pte() mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size() mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP mm/mmu_gather: add tlb_remove_tlb_entries() mm/mmu_gather: add __tlb_remove_folio_pages() mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing mm/memory: optimize unmap/zap with PTE-mapped THP arch/powerpc/include/asm/tlb.h | 2 + arch/s390/include/asm/tlb.h | 30 ++++-- include/asm-generic/tlb.h | 40 ++++++-- include/linux/mm_types.h | 37 ++++++-- include/linux/pgtable.h | 70 ++++++++++++++ mm/memory.c | 169 +++++++++++++++++++++++---------- mm/mmu_gather.c | 111 ++++++++++++++++++---- mm/swap.c | 12 ++- mm/swap_state.c | 15 ++- 9 files changed, 393 insertions(+), 93 deletions(-) base-commit: 7e56cf9a7f108e8129d75cea0dabc9488fb4defa