[v1,1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
Message ID: 20240117050217.43610-1-ioworker0@gmail.com
State: New
Headers (trimmed):
From: Lance Yang <ioworker0@gmail.com>
To: akpm@linux-foundation.org
Cc: zokeefe@google.com, songmuchun@bytedance.com, linux-kernel@vger.kernel.org, Lance Yang <ioworker0@gmail.com>
Subject: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
Date: Wed, 17 Jan 2024 13:02:16 +0800
Message-Id: <20240117050217.43610-1-ioworker0@gmail.com>
X-Mailer: git-send-email 2.33.1
Series: [v1,1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
Commit Message
Lance Yang
Jan. 17, 2024, 5:02 a.m. UTC
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
make a least-effort attempt at a synchronous collapse of memory at
their own expense.
The only difference from MADV_COLLAPSE is that the new hugepage allocation
avoids direct reclaim and/or compaction, quickly failing on allocation errors.
The benefits of this approach are:
* CPU is charged to the process that wants to spend the cycles for the THP
* Avoid unpredictable timing of khugepaged collapse
* Prevent unpredictable stalls caused by direct reclaim and/or compaction
Semantics
This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others. This implies a hugepage cannot cross a VMA boundary. If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.
The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range. The memory ranges must span at least one
hugepage-sized region.
All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage. Unmapped pages will have their data directly
initialized to 0 in the new hugepage. However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).
Allocation for the new hugepage will not enter direct reclaim and/or
compaction, instead failing quickly on allocation errors. When the system
has multiple NUMA nodes, the hugepage will be allocated from the node
providing the most native pages. This operation acts on the current state
of the specified process and makes no persistent changes or guarantees on
how pages will be mapped, constructed, or faulted in the future.
Return Value
If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful. On success, madvise(2) returns 0.
Else, -1 is returned and errno is set to indicate the error for the
most-recently attempted hugepage collapse. Note that many failures might
have occurred, since the operation may continue to collapse in the event a
single hugepage-sized/aligned region fails.
ENOMEM	Memory allocation failed or VMA not found
EBUSY	Memcg charging failed
EAGAIN	Required resource temporarily unavailable. Trying again
	might succeed.
EINVAL	Other error: no PMD found, subpage doesn't have the Present
	bit set, "special" page not backed by struct page, VMA
	incorrectly sized, address not page-aligned, ...
Use Cases
An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
[1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
[2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
[3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
[4] https://github.com/golang/go/issues/63334
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/linux/huge_mm.h | 5 +++--
include/uapi/asm-generic/mman-common.h | 1 +
mm/khugepaged.c | 19 ++++++++++++++++---
mm/madvise.c | 8 +++++++-
tools/include/uapi/asm-generic/mman-common.h | 1 +
9 files changed, 32 insertions(+), 6 deletions(-)
Comments
[+linux-mm & others]

On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@gmail.com> wrote:
> [quoted commit message trimmed]
> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> [4] https://github.com/golang/go/issues/63334

Thanks for the patch, Lance, and thanks for providing the links above,
referring to issues Go has seen.

I've reached out to the Go team to try and understand their use case,
and how we could help. It's not immediately clear whether a
lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
be.

That said, with respect to the implementation, should a need for a
lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
process_madvise(2) be the "v2" of madvise(2), where we can start
leveraging the forward-facing flags argument for these different
advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
("mm/madvise: remove racy mm ownership check") so that
process_madvise(2) can always operate on self. IIRC, this was ~ the
plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
sane default, and implement options in flags down the line).

That flag could be a MADV_F_COLLAPSE_LIGHT, where we use a lighter
allocation context, as well as, for example, only do a local
lru_add_drain() vs lru_add_drain_all(). But I'll refrain from thinking
too hard about it just yet.
Best,
Zach

> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
>  arch/alpha/include/uapi/asm/mman.h           | 1 +
>  arch/mips/include/uapi/asm/mman.h            | 1 +
>  arch/parisc/include/uapi/asm/mman.h          | 1 +
>  arch/xtensa/include/uapi/asm/mman.h          | 1 +
>  include/linux/huge_mm.h                      | 5 +++--
>  include/uapi/asm-generic/mman-common.h       | 1 +
>  mm/khugepaged.c                              | 19 ++++++++++++++++---
>  mm/madvise.c                                 | 8 +++++++-
>  tools/include/uapi/asm-generic/mman-common.h | 1 +
>  9 files changed, 32 insertions(+), 6 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..44aa1f57a982 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -77,6 +77,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE	25	/* Synchronous hugepage collapse */
> +#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..1ae16e5d7dfc 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -104,6 +104,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE	25	/* Synchronous hugepage collapse */
> +#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..f8d016ee1f98 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -71,6 +71,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE	25	/* Synchronous hugepage collapse */
> +#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  #define MADV_HWPOISON	100	/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE	101	/* soft offline page for testing */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..c495d1b39c83 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -112,6 +112,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE	25	/* Synchronous hugepage collapse */
> +#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..e1af75aa18fb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>  		     int advice);
>  int madvise_collapse(struct vm_area_struct *vma,
>  		     struct vm_area_struct **prev,
> -		     unsigned long start, unsigned long end);
> +		     unsigned long start, unsigned long end, bool is_try);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>  			   unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>
>  static inline int madvise_collapse(struct vm_area_struct *vma,
>  				   struct vm_area_struct **prev,
> -				   unsigned long start, unsigned long end)
> +				   unsigned long start, unsigned long end,
> +				   bool is_try)
>  {
>  	return -EINVAL;
>  }
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..a9e5273db5f6 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE	25	/* Synchronous hugepage collapse */
> +#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..c22703155b6e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -96,6 +96,7 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
>
>  struct collapse_control {
>  	bool is_khugepaged;
> +	bool is_try;
>
>  	/* Num pages scanned per node */
>  	u32 node_load[MAX_NUMNODES];
> @@ -1058,10 +1059,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
>  			      struct collapse_control *cc)
>  {
> -	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> -		     GFP_TRANSHUGE);
>  	int node = hpage_collapse_find_target_node(cc);
>  	struct folio *folio;
> +	gfp_t gfp;
> +
> +	if (cc->is_khugepaged)
> +		gfp = alloc_hugepage_khugepaged_gfpmask();
> +	else
> +		gfp = cc->is_try ? GFP_TRANSHUGE_LIGHT : GFP_TRANSHUGE;
>
>  	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
>  		*hpage = NULL;
> @@ -2697,7 +2702,7 @@ static int madvise_collapse_errno(enum scan_result r)
>  }
>
>  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> -		     unsigned long start, unsigned long end)
> +		     unsigned long start, unsigned long end, bool is_try)
>  {
>  	struct collapse_control *cc;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -2718,6 +2723,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	if (!cc)
>  		return -ENOMEM;
>  	cc->is_khugepaged = false;
> +	cc->is_try = is_try;
>
>  	mmgrab(mm);
>  	lru_add_drain_all();
> @@ -2773,6 +2779,13 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  			result = collapse_pte_mapped_thp(mm, addr, true);
>  			mmap_read_unlock(mm);
>  			goto handle_result;
> +		/* MADV_TRY_COLLAPSE: fail quickly */
> +		case SCAN_ALLOC_HUGE_PAGE_FAIL:
> +		case SCAN_CGROUP_CHARGE_FAIL:
> +			if (cc->is_try) {
> +				last_fail = result;
> +				goto out_maybelock;
> +			}
>  		/* Whitelisted set of results where continuing OK */
>  		case SCAN_PMD_NULL:
>  		case SCAN_PTE_NON_PRESENT:
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..5a359bcd286c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_POPULATE_READ:
>  	case MADV_POPULATE_WRITE:
>  	case MADV_COLLAPSE:
> +	case MADV_TRY_COLLAPSE:
>  		return 0;
>  	default:
>  		/* be safe, default to 1. list exceptions explicitly */
> @@ -1082,8 +1083,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_TRY_COLLAPSE:
> +		return madvise_collapse(vma, prev, start, end, true);
>  	case MADV_COLLAPSE:
> -		return madvise_collapse(vma, prev, start, end);
> +		return madvise_collapse(vma, prev, start, end, false);
>  	}
>
>  	anon_name = anon_vma_name(vma);
> @@ -1178,6 +1181,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_HUGEPAGE:
>  	case MADV_NOHUGEPAGE:
>  	case MADV_COLLAPSE:
> +	case MADV_TRY_COLLAPSE:
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> @@ -1368,6 +1372,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *		transparent huge pages so the existing pages will not be
>   *		coalesced into THP and new pages will not be allocated as THP.
>   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + *  MADV_TRY_COLLAPSE - similar to COLLAPSE, but avoids direct reclaim
> + *		and/or compaction.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..a9e5273db5f6 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE	25	/* Synchronous hugepage collapse */
> +#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE	0
> --
> 2.33.1
On 17.01.24 18:10, Zach O'Keefe wrote:
> [quoted patch and reply trimmed]
> That said, with respect to the implementation, should a need for a
> lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> process_madvise(2) be the "v2" of madvise(2), where we can start
> leveraging the forward-facing flags argument for these different
> advice flavors.

+1, using process_madvise() would likely be the right approach.
Hi Lance,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-madvise-add-MADV_TRY_COLLAPSE-to-process_madvise/20240117-130450
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240117050217.43610-1-ioworker0%40gmail.com
patch subject: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20240118/202401180500.SKo0zynj-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240118/202401180500.SKo0zynj-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401180500.SKo0zynj-lkp@intel.com/
All warnings (new ones prefixed by >>):
mm/khugepaged.c: In function 'madvise_collapse':
>> mm/khugepaged.c:2784:28: warning: this statement may fall through [-Wimplicit-fallthrough=]
2784 |                 if (cc->is_try) {
     |                    ^
mm/khugepaged.c:2789:17: note: here
2789 |                 case SCAN_PMD_NULL:
     |                 ^~~~
vim +2784 mm/khugepaged.c
  2702
  2703  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
  2704                       unsigned long start, unsigned long end, bool is_try)
  2705  {
  2706          struct collapse_control *cc;
  2707          struct mm_struct *mm = vma->vm_mm;
  2708          unsigned long hstart, hend, addr;
  2709          int thps = 0, last_fail = SCAN_FAIL;
  2710          bool mmap_locked = true;
  2711
  2712          BUG_ON(vma->vm_start > start);
  2713          BUG_ON(vma->vm_end < end);
  2714
  2715          *prev = vma;
  2716
  2717          if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, false,
  2718                                       PMD_ORDER))
  2719                  return -EINVAL;
  2720
  2721          cc = kmalloc(sizeof(*cc), GFP_KERNEL);
  2722          if (!cc)
  2723                  return -ENOMEM;
  2724          cc->is_khugepaged = false;
  2725          cc->is_try = is_try;
  2726
  2727          mmgrab(mm);
  2728          lru_add_drain_all();
  2729
  2730          hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
  2731          hend = end & HPAGE_PMD_MASK;
  2732
  2733          for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
  2734                  int result = SCAN_FAIL;
  2735
  2736                  if (!mmap_locked) {
  2737                          cond_resched();
  2738                          mmap_read_lock(mm);
  2739                          mmap_locked = true;
  2740                          result = hugepage_vma_revalidate(mm, addr, false, &vma,
  2741                                                           cc);
  2742                          if (result != SCAN_SUCCEED) {
  2743                                  last_fail = result;
  2744                                  goto out_nolock;
  2745                          }
  2746
  2747                          hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
  2748                  }
  2749                  mmap_assert_locked(mm);
  2750                  memset(cc->node_load, 0, sizeof(cc->node_load));
  2751                  nodes_clear(cc->alloc_nmask);
  2752                  if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
  2753                          struct file *file = get_file(vma->vm_file);
  2754                          pgoff_t pgoff = linear_page_index(vma, addr);
  2755
  2756                          mmap_read_unlock(mm);
  2757                          mmap_locked = false;
  2758                          result = hpage_collapse_scan_file(mm, addr, file, pgoff,
  2759                                                            cc);
  2760                          fput(file);
  2761                  } else {
  2762                          result = hpage_collapse_scan_pmd(mm, vma, addr,
  2763                                                           &mmap_locked, cc);
  2764                  }
  2765                  if (!mmap_locked)
  2766                          *prev = NULL;  /* Tell caller we dropped mmap_lock */
  2767
  2768  handle_result:
  2769                  switch (result) {
  2770                  case SCAN_SUCCEED:
  2771                  case SCAN_PMD_MAPPED:
  2772                          ++thps;
  2773                          break;
  2774                  case SCAN_PTE_MAPPED_HUGEPAGE:
  2775                          BUG_ON(mmap_locked);
  2776                          BUG_ON(*prev);
  2777                          mmap_read_lock(mm);
  2778                          result = collapse_pte_mapped_thp(mm, addr, true);
  2779                          mmap_read_unlock(mm);
  2780                          goto handle_result;
  2781                  /* MADV_TRY_COLLAPSE: fail quickly */
  2782                  case SCAN_ALLOC_HUGE_PAGE_FAIL:
  2783                  case SCAN_CGROUP_CHARGE_FAIL:
> 2784                          if (cc->is_try) {
Hi Lance,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-madvise-add-MADV_TRY_COLLAPSE-to-process_madvise/20240117-130450
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240117050217.43610-1-ioworker0%40gmail.com
patch subject: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
config: x86_64-rhel-8.3-bpf (https://download.01.org/0day-ci/archive/20240118/202401180810.sR4s25PR-lkp@intel.com/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240118/202401180810.sR4s25PR-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401180810.sR4s25PR-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/khugepaged.c:2789:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
2789 | case SCAN_PMD_NULL:
| ^
mm/khugepaged.c:2789:3: note: insert '__attribute__((fallthrough));' to silence this warning
2789 | case SCAN_PMD_NULL:
| ^
| __attribute__((fallthrough));
mm/khugepaged.c:2789:3: note: insert 'break;' to avoid fall-through
2789 | case SCAN_PMD_NULL:
| ^
| break;
1 warning generated.
vim +2789 mm/khugepaged.c
7d8faaf155454f Zach O'Keefe 2022-07-06 2702
7d8faaf155454f Zach O'Keefe 2022-07-06 2703 int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
a37dacf95f1857 Lance Yang 2024-01-17 2704 unsigned long start, unsigned long end, bool is_try)
7d8faaf155454f Zach O'Keefe 2022-07-06 2705 {
7d8faaf155454f Zach O'Keefe 2022-07-06 2706 struct collapse_control *cc;
7d8faaf155454f Zach O'Keefe 2022-07-06 2707 struct mm_struct *mm = vma->vm_mm;
7d8faaf155454f Zach O'Keefe 2022-07-06 2708 unsigned long hstart, hend, addr;
7d8faaf155454f Zach O'Keefe 2022-07-06 2709 int thps = 0, last_fail = SCAN_FAIL;
7d8faaf155454f Zach O'Keefe 2022-07-06 2710 bool mmap_locked = true;
7d8faaf155454f Zach O'Keefe 2022-07-06 2711
7d8faaf155454f Zach O'Keefe 2022-07-06 2712 BUG_ON(vma->vm_start > start);
7d8faaf155454f Zach O'Keefe 2022-07-06 2713 BUG_ON(vma->vm_end < end);
7d8faaf155454f Zach O'Keefe 2022-07-06 2714
7d8faaf155454f Zach O'Keefe 2022-07-06 2715 *prev = vma;
7d8faaf155454f Zach O'Keefe 2022-07-06 2716
3485b88390b0af Ryan Roberts 2023-12-07 2717 if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, false,
3485b88390b0af Ryan Roberts 2023-12-07 2718 PMD_ORDER))
7d8faaf155454f Zach O'Keefe 2022-07-06 2719 return -EINVAL;
7d8faaf155454f Zach O'Keefe 2022-07-06 2720
7d8faaf155454f Zach O'Keefe 2022-07-06 2721 cc = kmalloc(sizeof(*cc), GFP_KERNEL);
7d8faaf155454f Zach O'Keefe 2022-07-06 2722 if (!cc)
7d8faaf155454f Zach O'Keefe 2022-07-06 2723 return -ENOMEM;
7d8faaf155454f Zach O'Keefe 2022-07-06 2724 cc->is_khugepaged = false;
a37dacf95f1857 Lance Yang 2024-01-17 2725 cc->is_try = is_try;
7d8faaf155454f Zach O'Keefe 2022-07-06 2726
7d8faaf155454f Zach O'Keefe 2022-07-06 2727 mmgrab(mm);
7d8faaf155454f Zach O'Keefe 2022-07-06 2728 lru_add_drain_all();
7d8faaf155454f Zach O'Keefe 2022-07-06 2729
7d8faaf155454f Zach O'Keefe 2022-07-06 2730 hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
7d8faaf155454f Zach O'Keefe 2022-07-06 2731 hend = end & HPAGE_PMD_MASK;
7d8faaf155454f Zach O'Keefe 2022-07-06 2732
7d8faaf155454f Zach O'Keefe 2022-07-06 2733 for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
7d8faaf155454f Zach O'Keefe 2022-07-06 2734 int result = SCAN_FAIL;
7d8faaf155454f Zach O'Keefe 2022-07-06 2735
7d8faaf155454f Zach O'Keefe 2022-07-06 2736 if (!mmap_locked) {
7d8faaf155454f Zach O'Keefe 2022-07-06 2737 cond_resched();
7d8faaf155454f Zach O'Keefe 2022-07-06 2738 mmap_read_lock(mm);
7d8faaf155454f Zach O'Keefe 2022-07-06 2739 mmap_locked = true;
34488399fa08fa Zach O'Keefe 2022-09-22 2740 result = hugepage_vma_revalidate(mm, addr, false, &vma,
34488399fa08fa Zach O'Keefe 2022-09-22 2741 cc);
7d8faaf155454f Zach O'Keefe 2022-07-06 2742 if (result != SCAN_SUCCEED) {
7d8faaf155454f Zach O'Keefe 2022-07-06 2743 last_fail = result;
7d8faaf155454f Zach O'Keefe 2022-07-06 2744 goto out_nolock;
7d8faaf155454f Zach O'Keefe 2022-07-06 2745 }
4d24de9425f75f Yang Shi 2022-09-14 2746
52dc031088f00e Zach O'Keefe 2022-12-24 2747 hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
7d8faaf155454f Zach O'Keefe 2022-07-06 2748 }
7d8faaf155454f Zach O'Keefe 2022-07-06 2749 mmap_assert_locked(mm);
7d8faaf155454f Zach O'Keefe 2022-07-06 2750 memset(cc->node_load, 0, sizeof(cc->node_load));
e031ff96b334a0 Yang Shi 2022-11-08 2751 nodes_clear(cc->alloc_nmask);
34488399fa08fa Zach O'Keefe 2022-09-22 2752 if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
34488399fa08fa Zach O'Keefe 2022-09-22 2753 struct file *file = get_file(vma->vm_file);
34488399fa08fa Zach O'Keefe 2022-09-22 2754 pgoff_t pgoff = linear_page_index(vma, addr);
34488399fa08fa Zach O'Keefe 2022-09-22 2755
34488399fa08fa Zach O'Keefe 2022-09-22 2756 mmap_read_unlock(mm);
34488399fa08fa Zach O'Keefe 2022-09-22 2757 mmap_locked = false;
34488399fa08fa Zach O'Keefe 2022-09-22 2758 result = hpage_collapse_scan_file(mm, addr, file, pgoff,
7d2c4385c3417c Zach O'Keefe 2022-07-06 2759 cc);
34488399fa08fa Zach O'Keefe 2022-09-22 2760 fput(file);
34488399fa08fa Zach O'Keefe 2022-09-22 2761 } else {
34488399fa08fa Zach O'Keefe 2022-09-22 2762 result = hpage_collapse_scan_pmd(mm, vma, addr,
34488399fa08fa Zach O'Keefe 2022-09-22 2763 &mmap_locked, cc);
34488399fa08fa Zach O'Keefe 2022-09-22 2764 }
7d8faaf155454f Zach O'Keefe 2022-07-06 2765 if (!mmap_locked)
7d8faaf155454f Zach O'Keefe 2022-07-06 2766 *prev = NULL; /* Tell caller we dropped mmap_lock */
7d8faaf155454f Zach O'Keefe 2022-07-06 2767
34488399fa08fa Zach O'Keefe 2022-09-22 2768 handle_result:
7d8faaf155454f Zach O'Keefe 2022-07-06 2769 switch (result) {
7d8faaf155454f Zach O'Keefe 2022-07-06 2770 case SCAN_SUCCEED:
7d8faaf155454f Zach O'Keefe 2022-07-06 2771 case SCAN_PMD_MAPPED:
7d8faaf155454f Zach O'Keefe 2022-07-06 2772 ++thps;
7d8faaf155454f Zach O'Keefe 2022-07-06 2773 break;
34488399fa08fa Zach O'Keefe 2022-09-22 2774 case SCAN_PTE_MAPPED_HUGEPAGE:
34488399fa08fa Zach O'Keefe 2022-09-22 2775 BUG_ON(mmap_locked);
34488399fa08fa Zach O'Keefe 2022-09-22 2776 BUG_ON(*prev);
1043173eb5eb35 Hugh Dickins 2023-07-11 2777 mmap_read_lock(mm);
34488399fa08fa Zach O'Keefe 2022-09-22 2778 result = collapse_pte_mapped_thp(mm, addr, true);
1043173eb5eb35 Hugh Dickins 2023-07-11 2779 mmap_read_unlock(mm);
34488399fa08fa Zach O'Keefe 2022-09-22 2780 goto handle_result;
a37dacf95f1857 Lance Yang 2024-01-17 2781 /* MADV_TRY_COLLAPSE: fail quickly */
a37dacf95f1857 Lance Yang 2024-01-17 2782 case SCAN_ALLOC_HUGE_PAGE_FAIL:
a37dacf95f1857 Lance Yang 2024-01-17 2783 case SCAN_CGROUP_CHARGE_FAIL:
a37dacf95f1857 Lance Yang 2024-01-17 2784 if (cc->is_try) {
a37dacf95f1857 Lance Yang 2024-01-17 2785 last_fail = result;
a37dacf95f1857 Lance Yang 2024-01-17 2786 goto out_maybelock;
a37dacf95f1857 Lance Yang 2024-01-17 2787 }
7d8faaf155454f Zach O'Keefe 2022-07-06 2788 /* Whitelisted set of results where continuing OK */
7d8faaf155454f Zach O'Keefe 2022-07-06 @2789 case SCAN_PMD_NULL:
Hey Zach, Thanks for taking the time to review! Zach O'Keefe <zokeefe@google.com> 于2024年1月18日周四 01:11写道: > > [+linux-mm & others] > > On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@gmail.com> wrote: > > > > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1]. > > > > Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to > > make a least-effort attempt at a synchronous collapse of memory at > > their own expense. > > > > The only difference from MADV_COLLAPSE is that the new hugepage allocation > > avoids direct reclaim and/or compaction, quickly failing on allocation errors. > > > > The benefits of this approach are: > > > > * CPU is charged to the process that wants to spend the cycles for the THP > > * Avoid unpredictable timing of khugepaged collapse > > * Prevent unpredictable stalls caused by direct reclaim and/or compaction > > > > Semantics > > > > This call is independent of the system-wide THP sysfs settings, but will > > fail for memory marked VM_NOHUGEPAGE. If the ranges provided span > > multiple VMAs, the semantics of the collapse over each VMA is independent > > from the others. This implies a hugepage cannot cross a VMA boundary. If > > collapse of a given hugepage-aligned/sized region fails, the operation may > > continue to attempt collapsing the remainder of memory specified. > > > > The memory ranges provided must be page-aligned, but are not required to > > be hugepage-aligned. If the memory ranges are not hugepage-aligned, the > > start/end of the range will be clamped to the first/last hugepage-aligned > > address covered by said range. The memory ranges must span at least one > > hugepage-sized region. > > > > All non-resident pages covered by the range will first be > > swapped/faulted-in, before being internally copied onto a freshly > > allocated hugepage. Unmapped pages will have their data directly > > initialized to 0 in the new hugepage. 
However, for every eligible > > hugepage aligned/sized region to-be collapsed, at least one page must > > currently be backed by memory (a PMD covering the address range must > > already exist). > > > > Allocation for the new hugepage will not enter direct reclaim and/or > > compaction, quickly failing if allocation fails. When the system has > > multiple NUMA nodes, the hugepage will be allocated from the node providing > > the most native pages. This operation operates on the current state of the > > specified process and makes no persistent changes or guarantees on how pages > > will be mapped, constructed, or faulted in the future. > > > > Return Value > > > > If all hugepage-sized/aligned regions covered by the provided range were > > either successfully collapsed, or were already PMD-mapped THPs, this > > operation will be deemed successful. On success, madvise(2) returns 0. > > Else, -1 is returned and errno is set to indicate the error for the > > most-recently attempted hugepage collapse. Note that many failures might > > have occurred, since the operation may continue to collapse in the event a > > single hugepage-sized/aligned region fails. > > > > ENOMEM Memory allocation failed or VMA not found > > EBUSY Memcg charging failed > > EAGAIN Required resource temporarily unavailable. Try again > > might succeed. > > EINVAL Other error: No PMD found, subpage doesn't have Present > > bit set, "Special" page no backed by struct page, VMA > > incorrectly sized, address not page-aligned, ... > > > > Use Cases > > > > An immediate user of this new functionality is the Go runtime heap allocator > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > newly allocated chunk through mmap() or a reused chunk released by > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > respectively. 
However, both approaches resulted in performance issues; for > > both scenarios, there could be entries into direct reclaim and/or compaction, > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages. > > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > > [4] https://github.com/golang/go/issues/63334 > > Thanks for the patch, Lance, and thanks for providing the links above, > referring to issues Go has seen. > > I've reached out to the Go team to try and understand their use case, > and how we could help. It's not immediately clear whether a > lighter-weight MADV_COLLAPSE is the answer, but it could turn out to > be. > > That said, with respect to the implementation, should a need for a > lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see > process_madvise(2) be the "v2" of madvise(2), where we can start I agree with you; it's a good idea! > leveraging the forward-facing flags argument for these different > advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa > ("mm/madvise: remove racy mm ownership check") so that > process_madvise(2) can always operate on self. IIRC, this was ~ the > plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a > sane default, and implement options in flags down the line). > > That flag could be a MADV_F_COLLAPSE_LIGHT, where we use a lighter The name MADV_F_COLLAPSE_LIGHT sounds great for the flag, and its semantics are very clear. Thanks again for your review and your suggestion! Lance > allocation context, as well as, for example, only do a local > lru_add_drain() vs lru_add_drain_all(). But I'll refrain from thinking > too hard about it just yet. 
> > Best, > Zach > > > > > > Signed-off-by: Lance Yang <ioworker0@gmail.com> > > --- > > arch/alpha/include/uapi/asm/mman.h | 1 + > > arch/mips/include/uapi/asm/mman.h | 1 + > > arch/parisc/include/uapi/asm/mman.h | 1 + > > arch/xtensa/include/uapi/asm/mman.h | 1 + > > include/linux/huge_mm.h | 5 +++-- > > include/uapi/asm-generic/mman-common.h | 1 + > > mm/khugepaged.c | 19 ++++++++++++++++--- > > mm/madvise.c | 8 +++++++- > > tools/include/uapi/asm-generic/mman-common.h | 1 + > > 9 files changed, 32 insertions(+), 6 deletions(-) > > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > > index 763929e814e9..44aa1f57a982 100644 > > --- a/arch/alpha/include/uapi/asm/mman.h > > +++ b/arch/alpha/include/uapi/asm/mman.h > > @@ -77,6 +77,7 @@ > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > > > /* compatibility flags */ > > #define MAP_FILE 0 > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > > index c6e1fc77c996..1ae16e5d7dfc 100644 > > --- a/arch/mips/include/uapi/asm/mman.h > > +++ b/arch/mips/include/uapi/asm/mman.h > > @@ -104,6 +104,7 @@ > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > > > /* compatibility flags */ > > #define MAP_FILE 0 > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > > index 68c44f99bc93..f8d016ee1f98 100644 > > --- a/arch/parisc/include/uapi/asm/mman.h > > +++ b/arch/parisc/include/uapi/asm/mman.h > > @@ -71,6 +71,7 @@ > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > > > 
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > > > #define MADV_HWPOISON 100 /* poison a page for testing */ > > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > > index 1ff0c858544f..c495d1b39c83 100644 > > --- a/arch/xtensa/include/uapi/asm/mman.h > > +++ b/arch/xtensa/include/uapi/asm/mman.h > > @@ -112,6 +112,7 @@ > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > > > /* compatibility flags */ > > #define MAP_FILE 0 > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > index 5adb86af35fc..e1af75aa18fb 100644 > > --- a/include/linux/huge_mm.h > > +++ b/include/linux/huge_mm.h > > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, > > int advice); > > int madvise_collapse(struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > - unsigned long start, unsigned long end); > > + unsigned long start, unsigned long end, bool is_try); > > void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, > > unsigned long end, long adjust_next); > > spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma); > > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma, > > > > static inline int madvise_collapse(struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > - unsigned long start, unsigned long end) > > + unsigned long start, unsigned long end, > > + bool is_try) > > { > > return -EINVAL; > > } > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > > index 
6ce1f1ceb432..a9e5273db5f6 100644 > > --- a/include/uapi/asm-generic/mman-common.h > > +++ b/include/uapi/asm-generic/mman-common.h > > @@ -78,6 +78,7 @@ > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > > > /* compatibility flags */ > > #define MAP_FILE 0 > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 2b219acb528e..c22703155b6e 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -96,6 +96,7 @@ static struct kmem_cache *mm_slot_cache __ro_after_init; > > > > struct collapse_control { > > bool is_khugepaged; > > + bool is_try; > > > > /* Num pages scanned per node */ > > u32 node_load[MAX_NUMNODES]; > > @@ -1058,10 +1059,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, > > static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm, > > struct collapse_control *cc) > > { > > - gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() : > > - GFP_TRANSHUGE); > > int node = hpage_collapse_find_target_node(cc); > > struct folio *folio; > > + gfp_t gfp; > > + > > + if (cc->is_khugepaged) > > + gfp = alloc_hugepage_khugepaged_gfpmask(); > > + else > > + gfp = cc->is_try ? 
GFP_TRANSHUGE_LIGHT : GFP_TRANSHUGE; > > > > if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) { > > *hpage = NULL; > > @@ -2697,7 +2702,7 @@ static int madvise_collapse_errno(enum scan_result r) > > } > > > > int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > > - unsigned long start, unsigned long end) > > + unsigned long start, unsigned long end, bool is_try) > > { > > struct collapse_control *cc; > > struct mm_struct *mm = vma->vm_mm; > > @@ -2718,6 +2723,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > > if (!cc) > > return -ENOMEM; > > cc->is_khugepaged = false; > > + cc->is_try = is_try; > > > > mmgrab(mm); > > lru_add_drain_all(); > > @@ -2773,6 +2779,13 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > > result = collapse_pte_mapped_thp(mm, addr, true); > > mmap_read_unlock(mm); > > goto handle_result; > > + /* MADV_TRY_COLLAPSE: fail quickly */ > > + case SCAN_ALLOC_HUGE_PAGE_FAIL: > > + case SCAN_CGROUP_CHARGE_FAIL: > > + if (cc->is_try) { > > + last_fail = result; > > + goto out_maybelock; > > + } > > /* Whitelisted set of results where continuing OK */ > > case SCAN_PMD_NULL: > > case SCAN_PTE_NON_PRESENT: > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 912155a94ed5..5a359bcd286c 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > > case MADV_POPULATE_READ: > > case MADV_POPULATE_WRITE: > > case MADV_COLLAPSE: > > + case MADV_TRY_COLLAPSE: > > return 0; > > default: > > /* be safe, default to 1. 
list exceptions explicitly */ > > @@ -1082,8 +1083,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > > if (error) > > goto out; > > break; > > + case MADV_TRY_COLLAPSE: > > + return madvise_collapse(vma, prev, start, end, true); > > case MADV_COLLAPSE: > > - return madvise_collapse(vma, prev, start, end); > > + return madvise_collapse(vma, prev, start, end, false); > > } > > > > anon_name = anon_vma_name(vma); > > @@ -1178,6 +1181,7 @@ madvise_behavior_valid(int behavior) > > case MADV_HUGEPAGE: > > case MADV_NOHUGEPAGE: > > case MADV_COLLAPSE: > > + case MADV_TRY_COLLAPSE: > > #endif > > case MADV_DONTDUMP: > > case MADV_DODUMP: > > @@ -1368,6 +1372,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, > > * transparent huge pages so the existing pages will not be > > * coalesced into THP and new pages will not be allocated as THP. > > * MADV_COLLAPSE - synchronously coalesce pages into new THP. > > + * MADV_TRY_COLLAPSE - similar to COLLAPSE, but avoids direct reclaim > > + * and/or compaction. > > * MADV_DONTDUMP - the application wants to prevent pages in the given range > > * from being included in its core dump. > > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h > > index 6ce1f1ceb432..a9e5273db5f6 100644 > > --- a/tools/include/uapi/asm-generic/mman-common.h > > +++ b/tools/include/uapi/asm-generic/mman-common.h > > @@ -78,6 +78,7 @@ > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > > > /* compatibility flags */ > > #define MAP_FILE 0 > > -- > > 2.33.1 > >
Hey David, Thanks for taking the time to review! David Hildenbrand <david@redhat.com> 于2024年1月18日周四 02:41写道: > > On 17.01.24 18:10, Zach O'Keefe wrote: > > [+linux-mm & others] > > > > On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@gmail.com> wrote: > >> > >> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1]. > >> > >> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to > >> make a least-effort attempt at a synchronous collapse of memory at > >> their own expense. > >> > >> The only difference from MADV_COLLAPSE is that the new hugepage allocation > >> avoids direct reclaim and/or compaction, quickly failing on allocation errors. > >> > >> The benefits of this approach are: > >> > >> * CPU is charged to the process that wants to spend the cycles for the THP > >> * Avoid unpredictable timing of khugepaged collapse > >> * Prevent unpredictable stalls caused by direct reclaim and/or compaction > >> > >> Semantics > >> > >> This call is independent of the system-wide THP sysfs settings, but will > >> fail for memory marked VM_NOHUGEPAGE. If the ranges provided span > >> multiple VMAs, the semantics of the collapse over each VMA is independent > >> from the others. This implies a hugepage cannot cross a VMA boundary. If > >> collapse of a given hugepage-aligned/sized region fails, the operation may > >> continue to attempt collapsing the remainder of memory specified. > >> > >> The memory ranges provided must be page-aligned, but are not required to > >> be hugepage-aligned. If the memory ranges are not hugepage-aligned, the > >> start/end of the range will be clamped to the first/last hugepage-aligned > >> address covered by said range. The memory ranges must span at least one > >> hugepage-sized region. > >> > >> All non-resident pages covered by the range will first be > >> swapped/faulted-in, before being internally copied onto a freshly > >> allocated hugepage. 
Unmapped pages will have their data directly > >> initialized to 0 in the new hugepage. However, for every eligible > >> hugepage aligned/sized region to-be collapsed, at least one page must > >> currently be backed by memory (a PMD covering the address range must > >> already exist). > >> > >> Allocation for the new hugepage will not enter direct reclaim and/or > >> compaction, quickly failing if allocation fails. When the system has > >> multiple NUMA nodes, the hugepage will be allocated from the node providing > >> the most native pages. This operation operates on the current state of the > >> specified process and makes no persistent changes or guarantees on how pages > >> will be mapped, constructed, or faulted in the future. > >> > >> Return Value > >> > >> If all hugepage-sized/aligned regions covered by the provided range were > >> either successfully collapsed, or were already PMD-mapped THPs, this > >> operation will be deemed successful. On success, madvise(2) returns 0. > >> Else, -1 is returned and errno is set to indicate the error for the > >> most-recently attempted hugepage collapse. Note that many failures might > >> have occurred, since the operation may continue to collapse in the event a > >> single hugepage-sized/aligned region fails. > >> > >> ENOMEM Memory allocation failed or VMA not found > >> EBUSY Memcg charging failed > >> EAGAIN Required resource temporarily unavailable. Try again > >> might succeed. > >> EINVAL Other error: No PMD found, subpage doesn't have Present > >> bit set, "Special" page no backed by struct page, VMA > >> incorrectly sized, address not page-aligned, ... > >> > >> Use Cases > >> > >> An immediate user of this new functionality is the Go runtime heap allocator > >> that manages memory in hugepage-sized chunks. 
In the past, whether it was a > >> newly allocated chunk through mmap() or a reused chunk released by > >> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > >> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > >> respectively. However, both approaches resulted in performance issues; for > >> both scenarios, there could be entries into direct reclaim and/or compaction, > >> leading to unpredictable stalls[4]. Now, the allocator can confidently use > >> madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages. > >> > >> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > >> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > >> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > >> [4] https://github.com/golang/go/issues/63334 > > > > Thanks for the patch, Lance, and thanks for providing the links above, > > referring to issues Go has seen. > > > > I've reached out to the Go team to try and understand their use case, > > and how we could help. It's not immediately clear whether a > > lighter-weight MADV_COLLAPSE is the answer, but it could turn out to > > be. > > > > That said, with respect to the implementation, should a need for a > > lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see > > process_madvise(2) be the "v2" of madvise(2), where we can start > > leveraging the forward-facing flags argument for these different > > advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa > > ("mm/madvise: remove racy mm ownership check") so that > > process_madvise(2) can always operate on self. IIRC, this was ~ the > > plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a > > sane default, and implement options in flags down the line). > > +1, using process_madvise() would likely be the right approach. Thanks for your suggestion! 
I completely agree :) Lance > > -- > Cheers, > > David / dhildenb >
Lance Yang <ioworker0@gmail.com> writes: > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1]. > > Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to > make a least-effort attempt at a synchronous collapse of memory at > their own expense. > > The only difference from MADV_COLLAPSE is that the new hugepage allocation > avoids direct reclaim and/or compaction, quickly failing on allocation errors. > > The benefits of this approach are: > > * CPU is charged to the process that wants to spend the cycles for the THP > * Avoid unpredictable timing of khugepaged collapse > * Prevent unpredictable stalls caused by direct reclaim and/or > compaction I haven't completely followed the discussion, but it seems your second and third points could be addressed by an asynchronous THP fault without any new APIs: allocate 2MB while failing quickly, then on failure get a 4K page and provide it to the process, while asking khugepaged to convert the page ASAP in the background, but only after it managed to allocate a fresh 2MB page to minimize the process-visible downtime. I suppose that would be much more predictable, although there would be a slight risk of overwhelming khugepaged. The latter could be addressed by using a scalable workqueue that allocates more threads when needed. -Andi
Hey Andi,

Thanks for taking the time to review!

We are currently discussing this at
Link: https://lore.kernel.org/all/20240118120347.61817-1-ioworker0@gmail.com/

On Fri, Jan 19, 2024 at 9:41 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> Lance Yang <ioworker0@gmail.com> writes:
>
> > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >
> > Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> > make a least-effort attempt at a synchronous collapse of memory at
> > their own expense.
> >
> > The only difference from MADV_COLLAPSE is that the new hugepage allocation
> > avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the THP
> > * Avoid unpredictable timing of khugepaged collapse
> > * Prevent unpredictable stalls caused by direct reclaim and/or
> >   compaction
>
> I haven't completely followed the discussion, but it seems your second
> and third points could be addressed by an asynchronous THP fault without
> any new APIs: allocate 2MB while failing quickly; then, on failure, get
> a 4K page and provide it to the process, while asking khugepaged to
> convert the page ASAP in the background, but only after it has managed
> to allocate a fresh 2MB page, to minimize the process-visible down time.
>
> I suppose that would be much more predictable, although there would be a
> slight risk of overwhelming khugepaged. The latter could be addressed by
> using a scalable workqueue that allocates more threads when needed.

Thank you for your suggestion!

Unfortunately, AFAIK, the default THP behavior on most Linux distros is
that MADV_HUGEPAGE blocks while the kernel eagerly reclaims and compacts
memory to allocate a hugepage. In the era of cloud-native computing, it's
challenging for users to be aware of the THP configurations on all nodes
in a cluster, let alone have fine-grained control over them.
Simply disabling the use of huge pages due to concerns about potential
direct reclaim and/or compaction would be regrettable, as users are
deprived of the opportunity to experiment with huge page allocations.
However, relying solely on MADV_HUGEPAGE introduces the risk of
unpredictable stalls, making it a trade-off that users must carefully
consider.

MADV_COLLAPSE, since its introduction into the kernel, has not been
governed by the defrag mode, so it offers the potential for more
fine-grained control over the hugepage allocation strategy.

BR,
Lance

>
> -Andi
>
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..44aa1f57a982 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -77,6 +77,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..1ae16e5d7dfc 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -104,6 +104,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..f8d016ee1f98 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -71,6 +71,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..c495d1b39c83 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -112,6 +112,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..e1af75aa18fb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end);
+		     unsigned long start, unsigned long end, bool is_try);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 
 static inline int madvise_collapse(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev,
-				   unsigned long start, unsigned long end)
+				   unsigned long start, unsigned long end,
+				   bool is_try)
 {
 	return -EINVAL;
 }
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..a9e5273db5f6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b219acb528e..c22703155b6e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -96,6 +96,7 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
 
 struct collapse_control {
 	bool is_khugepaged;
+	bool is_try;
 
 	/* Num pages scanned per node */
 	u32 node_load[MAX_NUMNODES];
@@ -1058,10 +1059,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 			      struct collapse_control *cc)
 {
-	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
-		     GFP_TRANSHUGE);
 	int node = hpage_collapse_find_target_node(cc);
 	struct folio *folio;
+	gfp_t gfp;
+
+	if (cc->is_khugepaged)
+		gfp = alloc_hugepage_khugepaged_gfpmask();
+	else
+		gfp = cc->is_try ? GFP_TRANSHUGE_LIGHT : GFP_TRANSHUGE;
 
 	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
 		*hpage = NULL;
@@ -2697,7 +2702,7 @@ static int madvise_collapse_errno(enum scan_result r)
 }
 
 int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end)
+		     unsigned long start, unsigned long end, bool is_try)
 {
 	struct collapse_control *cc;
 	struct mm_struct *mm = vma->vm_mm;
@@ -2718,6 +2723,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (!cc)
 		return -ENOMEM;
 	cc->is_khugepaged = false;
+	cc->is_try = is_try;
 
 	mmgrab(mm);
 	lru_add_drain_all();
@@ -2773,6 +2779,13 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 			result = collapse_pte_mapped_thp(mm, addr, true);
 			mmap_read_unlock(mm);
 			goto handle_result;
+		/* MADV_TRY_COLLAPSE: fail quickly */
+		case SCAN_ALLOC_HUGE_PAGE_FAIL:
+		case SCAN_CGROUP_CHARGE_FAIL:
+			if (cc->is_try) {
+				last_fail = result;
+				goto out_maybelock;
+			}
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..5a359bcd286c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+	case MADV_TRY_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1082,8 +1083,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_TRY_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end, true);
 	case MADV_COLLAPSE:
-		return madvise_collapse(vma, prev, start, end);
+		return madvise_collapse(vma, prev, start, end, false);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1178,6 +1181,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 	case MADV_COLLAPSE:
+	case MADV_TRY_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1368,6 +1372,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
  *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_TRY_COLLAPSE - similar to COLLAPSE, but avoids direct reclaim
+ *		and/or compaction.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..a9e5273db5f6 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_TRY_COLLAPSE	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0