Message ID | 20240201125226.28372-1-ioworker0@gmail.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-48172-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:2719:b0:106:209c:c626 with SMTP id hl25csp141177dyb; Thu, 1 Feb 2024 05:16:13 -0800 (PST) X-Google-Smtp-Source: AGHT+IHZNaKvRMZ2+gSdl9zN54SuVDBA8TMKfzz0nNn2i/SavBnQcPpHP0QrakpkUukgj7Cs8CMB X-Received: by 2002:a17:90a:a617:b0:296:c0c:98f5 with SMTP id c23-20020a17090aa61700b002960c0c98f5mr1984015pjq.40.1706793373264; Thu, 01 Feb 2024 05:16:13 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1706793373; cv=pass; d=google.com; s=arc-20160816; b=obfkkKxAuZUsYoLXn/ZOPcZFhOPyNaVKkFjSKItGHkEqTdLpBC0DrYNN15sSdqaoXs Wr0xbRLgMJpJVYaZ48Iz7DvBJSU3pyu7uiHQuMNpR5DpVTqgWSp83zgbv1VqxMZCEzgg PToH+lNQXP4bHEHwj4PKnmJ8PEN1a1H0h/wQDv9IJGF75G9i+rHEVJCUyQNrxCO6Tc/9 jzV8oY4FiwfiXLt/d/cHN894/8UCvAHwqBiBqFcEL6sH9APA8UezewC2YZXhcEChsVYr 4cgmi1vK0BjhT6/KllqCSuTchpIzOIBCf7oxF0l3v/jV4THKgNizGszuIpxHom2oBGLL TKxQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:message-id:date:subject:cc:to :from:dkim-signature; bh=+dIpEuHcsVA+p6HpZJFPK5u2p7oWDuiKjAseXCAiZTo=; fh=eoUmRwZ9h5tudlhV/8iIU08S/vn0mY03AxJwOK3C7dg=; b=IzKiMSGfZkCAHuLaRYpdvKXm8w51dwiNsgS4Eb82d1kooJcywmzbYiEKaUaMkFMW1F 9XcrsEefD9I7d1J8WVRPpwhF90Z7c9qlWv7mb5HkfL1K8RIDG94rHy1xFEgphvvX/pdO CMKlF5NTCsJwc+EvxS8K14Pqd/IJLykPxBY76tdAJx+fNM5TOJIN+iOfGWWkn6ZnlHi8 nWHNLvQF+7h+e+sIYaE5scX0WiWQe1FDy3fYoXEUsEzvvaCtbLeEiJe7578GYKGrmm17 SJfHMolpLnzpB5OBxvA7Pv2NR3+HpvJSJ/r6cBa8pU3ly3O4bIunLELA4vuUT+YqCY4k kFRw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=SgsernW5; arc=pass (i=1 spf=pass spfdomain=gmail.com dkim=pass dkdomain=gmail.com dmarc=pass fromdomain=gmail.com); spf=pass (google.com: domain of linux-kernel+bounces-48172-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) smtp.mailfrom="linux-kernel+bounces-48172-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com X-Forwarded-Encrypted: i=1; AJvYcCVHRWVuwpMgyoTt3OxEe6YME/P4QQEevY6aQXcQMuGm0aurvaFQHPBmOh2UIytPI3YcuP8qoTGWHK9O/ROQDurm9afr9A== Received: from sy.mirrors.kernel.org (sy.mirrors.kernel.org. [147.75.48.161]) by mx.google.com with ESMTPS id l14-20020a170903120e00b001d768fd5927si12228325plh.314.2024.02.01.05.16.12 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 01 Feb 2024 05:16:13 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-48172-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) client-ip=147.75.48.161; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=SgsernW5; arc=pass (i=1 spf=pass spfdomain=gmail.com dkim=pass dkdomain=gmail.com dmarc=pass fromdomain=gmail.com); spf=pass (google.com: domain of linux-kernel+bounces-48172-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) smtp.mailfrom="linux-kernel+bounces-48172-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sy.mirrors.kernel.org (Postfix) with ESMTPS id CB40AB2772F for <ouuuleilei@gmail.com>; Thu, 1 Feb 2024 12:54:51 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 6F3D25D480; Thu, 1 Feb 2024 12:53:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="SgsernW5" Received: from mail-il1-f175.google.com (mail-il1-f175.google.com [209.85.166.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CC91B5336A for <linux-kernel@vger.kernel.org>; Thu, 1 Feb 2024 12:53:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.166.175 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706791998; cv=none; b=a6T4exDnm7HwbocBUE/UKsaSfWABVobZQI52/ipzsVrFIEmJV+vOIBuUpRH2J/jikvojCH96thJ2lEoOjDYifJjbFbQFjyi+ZyrGZMjIymGsRrcD/wIcX9sin5Sknu6v0cA1AyxJAb+s5+epKZGde+BAFUTNc+MBx+ggAIHgM1c= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706791998; c=relaxed/simple; bh=7PXAUnHUmoEF3UTxLOPfT0TJCC9/HwX2n79M8mfQ9Hc=; h=From:To:Cc:Subject:Date:Message-Id:MIME-Version; b=RAhfwFdS1cSOsSWBWOHa9hH9MvfGeEI9wQBjWh9s7pQDscq5rDS4t1sg+WuEYN8ux+8pSg1gPgF1eOjphJZj4BvZYLw7pzY8bBIzcyKfX6/4G4XMU7FXivMsoHkqx38PlqEn32YPBfZSFOAH5jgKDiItlapKWcwwsTOfhR0Z9GU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=SgsernW5; arc=none smtp.client-ip=209.85.166.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-il1-f175.google.com with SMTP id e9e14a558f8ab-363844ae819so3334235ab.1 for <linux-kernel@vger.kernel.org>; Thu, 01 Feb 2024 04:53:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1706791996; x=1707396796; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=+dIpEuHcsVA+p6HpZJFPK5u2p7oWDuiKjAseXCAiZTo=; b=SgsernW59t1ppqSoFFq1hQRjxtFsBWBplyJ0cd/gcqiqpn4VDYAV/iEAoZxJsldLuX YQGkYCFDU6FNnTM6jsSaKxTpjyAcF4J5z7opw/Sk+reTQy982TalAe7c2NCJc7LeFoz1 skdZoAxT6xcxaRkFEAYbGmJh85JQ2O14kWXkcjRToGKuaaFavuJe3ZdYQlUYPs9bhV/T AWrvzApU5dKHK1bnPV8lNWpDpdS3Pdjeie4MHG2De7r6+bdB+A5MAsSowEkQM+opSUOD qWaMHA+P4cU3tY1SEpT89GsMQmp3gycQF487BvkOEZXpW79VQ4Tv3yG3hMMKV9GT+hZF Juug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706791996; x=1707396796; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=+dIpEuHcsVA+p6HpZJFPK5u2p7oWDuiKjAseXCAiZTo=; b=QEHgQGZx1vYFFNsiv1bjN07lPsp0fx/sErNdtSilJrSdL+WyrnueTq1zXFTPHf3xDy 9AZMUSALaAYFLSUJcotgmPSKssIkT49/zngbPMeMcsCP+n/DbZuXiA6kUOffFQZCLko7 bt/e9Y9cRBOi6gwYtZWMP7bWBBJgooM5R610czp0z0ASJWvKy9iiIlB98kLeRslJn1Gh O2BMT2M1miMUqJjZYphyQ/oWqhYWEZJiCc/HN1Nk8FPQYSg2N0fj4SidGqYItjmhKe8i Z9cfeb5k4TnxdGZ8+qEQ8SPmmzgZhjGJlVVU2QErqnTGlu3FWRxtf2QDKAMJSBP+VyZf e3ow== X-Gm-Message-State: AOJu0YystGCgjw3gVh51TroqlvU5c09O/LzwS4qQZsavnVL+tPih8z4z 6U8aO4cQJrnlG6KqWg0zZmPmGA80fJ5ydMByzcXLxxsJPf2LJiq/ X-Received: by 2002:a92:d44c:0:b0:363:7a04:548e with SMTP id r12-20020a92d44c000000b003637a04548emr4612091ilm.9.1706791995971; Thu, 01 Feb 2024 04:53:15 -0800 (PST) X-Forwarded-Encrypted: i=0; AJvYcCUQmbzvwjjuar5/jjMYvrUfyQqXab4b3OcMUL5T6Ss1INNhxyM/EtSEtESt8NBceTbbqBw+exoQkoGSxIXlWYk9A6fm4vSBbVM4TGBGBYv1jEUsE2LIYM4raJEJQbIg91R+vXrspkuys7H+WdjiG5pCXXdycCqMtjyUGqg9OBJoQ2ZLPTAaXVlkVf60qmWPCqeN/0Z47xRoIJyNXO7MdPN5l35ojzS3wrt339yLwDcWBflMmKnZ5bWc/Y+lOPfbZs7chGjFduDN6/XTx01BxIg6o+bpGfBlXkYU9gNxb+k3a649thcqK5Vmvdw+BNkFqm8n Received: from LancedeMBP.lan ([112.10.240.49]) by smtp.gmail.com with ESMTPSA id k30-20020a63ba1e000000b005c661a432d7sm10863637pgf.75.2024.02.01.04.53.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 01 Feb 2024 04:53:15 -0800 (PST) From: Lance Yang <ioworker0@gmail.com> To: akpm@linux-foundation.org Cc: mhocko@suse.com, zokeefe@google.com, david@redhat.com, songmuchun@bytedance.com, shy828301@gmail.com, peterx@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Lance Yang <ioworker0@gmail.com> Subject: [PATCH 1/1] mm/khugepaged: skip copying lazyfree pages on collapse Date: Thu, 1 Feb 2024 20:52:26 +0800 Message-Id: <20240201125226.28372-1-ioworker0@gmail.com> X-Mailer: git-send-email 2.33.1 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1789702568584065590 X-GMAIL-MSGID: 1789702568584065590 |
Series |
[1/1] mm/khugepaged: skip copying lazyfree pages on collapse
|
|
Commit Message
Lance Yang
Feb. 1, 2024, 12:52 p.m. UTC
The collapsing behavior of khugepaged with pages
marked using MADV_FREE might cause confusion
among users.
For instance, allocate a 2MB chunk using mmap and
later release it by MADV_FREE. Khugepaged will not
collapse this chunk. From the user's perspective,
it treats lazyfree pages as pte_none. However,
for some pages marked as lazyfree with MADV_FREE,
khugepaged might collapse this chunk and copy
these pages to a new huge page. This inconsistency
in behavior could be confusing for users.
After a successful MADV_FREE operation, if there is
no subsequent write, the kernel can free the pages
at any time. Therefore, in my opinion, counting
lazyfree pages in max_pte_none seems reasonable.
Perhaps treating MADV_FREE like MADV_DONTNEED, not
copying lazyfree pages when khugepaged collapses
huge pages in the background better aligns with
user expectations.
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
mm/khugepaged.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
Comments
THP: enable=madivse defrag=defer max_ptes_none=511 scan_sleep_millisecs=1000 alloc_sleep_millisecs=1000 Test code: // allocate a 2MB chunk using mmap and later // release it by MADV_FRE Link: https://github.com/ioworker0/mmapvsmprotect/blob/main/test5.c root@x:/tmp# ./a.out | grep -B 23 hg 7f762a200000-7f762a400000 rw-p 00000000 00:00 0 Size: 2048 kB Anonymous: 2048 kB LazyFree: 2048 kB AnonHugePages: 0 kB THPeligible: 1 VmFlags: rd wr mr mw me ac sd hg // allocate a 2MB chunk using mmap and later // some pages marked as lazyfree with MADV_FREE Link: https://github.com/ioworker0/mmapvsmprotect/blob/main/test4.c root@x:/tmp# ./a.out | grep -B 23 hg 7f762a200000-7f762a400000 rw-p 00000000 00:00 0 Size: 2048 kB Anonymous: 2048 kB LazyFree: 0 kB AnonHugePages: 2048 kB THPeligible: 1 VmFlags: rd wr mr mw me ac sd hg root@x:/tmp# ./a.out [...] root@x:/tmp# echo $? 2 On Thu, Feb 1, 2024 at 8:53 PM Lance Yang <ioworker0@gmail.com> wrote: > > The collapsing behavior of khugepaged with pages > marked using MADV_FREE might cause confusion > among users. > > For instance, allocate a 2MB chunk using mmap and > later release it by MADV_FREE. Khugepaged will not > collapse this chunk. From the user's perspective, > it treats lazyfree pages as pte_none. However, > for some pages marked as lazyfree with MADV_FREE, > khugepaged might collapse this chunk and copy > these pages to a new huge page. This inconsistency > in behavior could be confusing for users. > > After a successful MADV_FREE operation, if there is > no subsequent write, the kernel can free the pages > at any time. Therefore, in my opinion, counting > lazyfree pages in max_pte_none seems reasonable. > > Perhaps treating MADV_FREE like MADV_DONTNEED, not > copying lazyfree pages when khugepaged collapses > huge pages in the background better aligns with > user expectations. > > Signed-off-by: Lance Yang <ioworker0@gmail.com> > --- > mm/khugepaged.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2b219acb528e..6cbf46d42c6a 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte, > pmd_t orig_pmd, > struct vm_area_struct *vma, > unsigned long address, > + struct collapse_control *cc, > spinlock_t *ptl, > struct list_head *compound_pagelist) > { > @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte, > continue; > } > src_page = pte_page(pteval); > + > + if (cc->is_khugepaged > + && !folio_test_swapbacked(page_folio(src_page))) { > + clear_user_highpage(page, _address); > + continue; > + } > + > if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) { > result = SCAN_COPY_MC; > break; > @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > anon_vma_unlock_write(vma->anon_vma); > > result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd, > - vma, address, pte_ptl, > + vma, address, cc, pte_ptl, > &compound_pagelist); > pte_unmap(pte); > if (unlikely(result != SCAN_SUCCEED)) > -- > 2.33.1 >
On Thu, Feb 1, 2024 at 4:53 AM Lance Yang <ioworker0@gmail.com> wrote: > > The collapsing behavior of khugepaged with pages > marked using MADV_FREE might cause confusion > among users. > > For instance, allocate a 2MB chunk using mmap and > later release it by MADV_FREE. Khugepaged will not > collapse this chunk. From the user's perspective, > it treats lazyfree pages as pte_none. However, > for some pages marked as lazyfree with MADV_FREE, > khugepaged might collapse this chunk and copy > these pages to a new huge page. This inconsistency > in behavior could be confusing for users. > > After a successful MADV_FREE operation, if there is > no subsequent write, the kernel can free the pages > at any time. Therefore, in my opinion, counting > lazyfree pages in max_pte_none seems reasonable. > > Perhaps treating MADV_FREE like MADV_DONTNEED, not > copying lazyfree pages when khugepaged collapses > huge pages in the background better aligns with > user expectations. > > Signed-off-by: Lance Yang <ioworker0@gmail.com> > --- > mm/khugepaged.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2b219acb528e..6cbf46d42c6a 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte, > pmd_t orig_pmd, > struct vm_area_struct *vma, > unsigned long address, > + struct collapse_control *cc, > spinlock_t *ptl, > struct list_head *compound_pagelist) > { > @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte, > continue; > } > src_page = pte_page(pteval); > + > + if (cc->is_khugepaged > + && !folio_test_swapbacked(page_folio(src_page))) { > + clear_user_highpage(page, _address); > + continue; If the page was written before khugepaged collapsed it, and khugepaged collapsed the page before memory reclaim kicked in, didn't this somehow cause data corruption? > + } > + > if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) { > result = SCAN_COPY_MC; > break; > @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > anon_vma_unlock_write(vma->anon_vma); > > result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd, > - vma, address, pte_ptl, > + vma, address, cc, pte_ptl, > &compound_pagelist); > pte_unmap(pte); > if (unlikely(result != SCAN_SUCCEED)) > -- > 2.33.1 >
On Thu 01-02-24 20:52:26, Lance Yang wrote: > The collapsing behavior of khugepaged with pages > marked using MADV_FREE might cause confusion > among users. > > For instance, allocate a 2MB chunk using mmap and > later release it by MADV_FREE. Khugepaged will not > collapse this chunk. From the user's perspective, > it treats lazyfree pages as pte_none. However, > for some pages marked as lazyfree with MADV_FREE, > khugepaged might collapse this chunk and copy > these pages to a new huge page. This inconsistency > in behavior could be confusing for users. Is that any more confusing than collapsing pte_none pages? TBH I do not really see why this is a problem. MADV_FREE are correctly recognized same as pte_none so the user defined trashold applies.
On Fri, Feb 2, 2024 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 01-02-24 20:52:26, Lance Yang wrote: > > The collapsing behavior of khugepaged with pages > > marked using MADV_FREE might cause confusion > > among users. > > > > For instance, allocate a 2MB chunk using mmap and > > later release it by MADV_FREE. Khugepaged will not > > collapse this chunk. From the user's perspective, > > it treats lazyfree pages as pte_none. However, > > for some pages marked as lazyfree with MADV_FREE, > > khugepaged might collapse this chunk and copy > > these pages to a new huge page. This inconsistency > > in behavior could be confusing for users. > > Is that any more confusing than collapsing pte_none > pages? > > TBH I do not really see why this is a problem. MADV_FREE > are correctly recognized same as pte_none so the user > defined trashold applies. Sorry, I might not have expressed it clearly. MADV_FREE is correctly recognized, just like pte_none. This is also reasonable and not an issue. IMO, since it's treated the same as pte_none, perhaps lazyfree pages shouldn't be copied to the new huge page. Thanks, Lance > -- > Michal Hocko > SUSE Labs
On Fri, Feb 2, 2024 at 4:37 AM Yang Shi <shy828301@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 4:53 AM Lance Yang <ioworker0@gmail.com> wrote: > > > > The collapsing behavior of khugepaged with pages > > marked using MADV_FREE might cause confusion > > among users. > > > > For instance, allocate a 2MB chunk using mmap and > > later release it by MADV_FREE. Khugepaged will not > > collapse this chunk. From the user's perspective, > > it treats lazyfree pages as pte_none. However, > > for some pages marked as lazyfree with MADV_FREE, > > khugepaged might collapse this chunk and copy > > these pages to a new huge page. This inconsistency > > in behavior could be confusing for users. > > > > After a successful MADV_FREE operation, if there is > > no subsequent write, the kernel can free the pages > > at any time. Therefore, in my opinion, counting > > lazyfree pages in max_pte_none seems reasonable. > > > > Perhaps treating MADV_FREE like MADV_DONTNEED, not > > copying lazyfree pages when khugepaged collapses > > huge pages in the background better aligns with > > user expectations. > > > > Signed-off-by: Lance Yang <ioworker0@gmail.com> > > --- > > mm/khugepaged.c | 10 +++++++++- > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index 2b219acb528e..6cbf46d42c6a 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte, > > pmd_t orig_pmd, > > struct vm_area_struct *vma, > > unsigned long address, > > + struct collapse_control *cc, > > spinlock_t *ptl, > > struct list_head *compound_pagelist) > > { > > @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte, > > continue; > > } > > src_page = pte_page(pteval); > > + > > + if (cc->is_khugepaged > > + && !folio_test_swapbacked(page_folio(src_page))) { > > + clear_user_highpage(page, _address); > > + continue; > > If the page was written before khugepaged collapsed it, and khugepaged > collapsed the page before memory reclaim kicked in, didn't this > somehow cause data corruption? > Thanks a lot! Yang, you're correct; indeed, there is a potential issue with data corruption. I took a look at the check for lazyfree pages in smaps_pte_entry. Here's the modification: if (cc->is_khugepaged && !PageSwapBacked(src_page) && !pte_dirty(pteval) && !PageDirty(src_page)) { clear_user_highpage(page, _address); continue; } Could you please take a look? Thanks, Lance > > + } > > + > > if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) { > > result = SCAN_COPY_MC; > > break; > > @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > anon_vma_unlock_write(vma->anon_vma); > > > > result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd, > > - vma, address, pte_ptl, > > + vma, address, cc, pte_ptl, > > &compound_pagelist); > > pte_unmap(pte); > > if (unlikely(result != SCAN_SUCCEED)) > > -- > > 2.33.1 > >
On Fri 02-02-24 19:18:31, Lance Yang wrote: > IMO, since it's treated the same as pte_none, > perhaps lazyfree pages shouldn't be copied to > the new huge page. Why? The content of MADV_FREE page is valid until it is reclaimed.
On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 02-02-24 19:18:31, Lance Yang wrote: > > IMO, since it's treated the same as pte_none, > > perhaps lazyfree pages shouldn't be copied to > > the new huge page. > > Why? The content of MADV_FREE page is valid until it is reclaimed. IMO, if MADV_FREE pages are considered valid until reclaimed, treating them the same as pte_none might pose a conflict. Thanks, Lance > -- > Michal Hocko > SUSE Labs
On Fri 02-02-24 20:52:48, Lance Yang wrote: > On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 02-02-24 19:18:31, Lance Yang wrote: > > > IMO, since it's treated the same as pte_none, > > > perhaps lazyfree pages shouldn't be copied to > > > the new huge page. > > > > Why? The content of MADV_FREE page is valid until it is reclaimed. > > IMO, if MADV_FREE pages are considered valid until > reclaimed, treating them the same as pte_none might > pose a conflict. What kind of conflict?
Here is a part from the man page explaining the MADV_FREE semantics: The kernel can thus free thesepages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page. If there is no subsequent write, the kernel can free the pages at any time. IIUC, if there is no subsequent write, lazyfree pages will eventually be reclaimed. khugepaged treats lazyfree pages the same as pte_none, avoiding copying them to the new huge page during collapse. It seems that lazyfree pages are reclaimed before khugepaged collapses them. This aligns with user expectations. However, IMO, if the content of MADV_FREE pages remains valid during collapse, then khugepaged treating lazyfree pages the same as pte_none might not be suitable. Thanks, Lance On Fri, Feb 2, 2024 at 8:57 PM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 02-02-24 20:52:48, Lance Yang wrote: > > On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Fri 02-02-24 19:18:31, Lance Yang wrote: > > > > IMO, since it's treated the same as pte_none, > > > > perhaps lazyfree pages shouldn't be copied to > > > > the new huge page. > > > > > > Why? The content of MADV_FREE page is valid until it is reclaimed. > > > > IMO, if MADV_FREE pages are considered valid until > > reclaimed, treating them the same as pte_none might > > pose a conflict. > > What kind of conflict? > -- > Michal Hocko > SUSE Labs
I'd like to add one more point. Users mark pages as lazyfree with MADV_FREE, expecting these pages to be reclaimed until memory pressure occurs. Currently, khugepaged treats lazyfree pages the same as pte_none, which seems reasonable. IMO, avoiding the copying of these pages to the new huge page during khugepaged collapse can better keep the semantics of MADV_FREE. It appears that lazyfree pages are reclaimed before khugepaged collapses them. Thanks, Lance On Fri, Feb 2, 2024 at 9:46 PM Lance Yang <ioworker0@gmail.com> wrote: > > Here is a part from the man page explaining > the MADV_FREE semantics: > > The kernel can thus free thesepages, but the > freeing could be delayed until memory pressure > occurs. For each of the pages that has been > marked to be freed but has not yet been freed, > the free operation will be canceled if the caller > writes into the page. If there is no subsequent > write, the kernel can free the pages at any time. > > IIUC, if there is no subsequent write, lazyfree > pages will eventually be reclaimed. khugepaged > treats lazyfree pages the same as pte_none, > avoiding copying them to the new huge page > during collapse. It seems that lazyfree pages > are reclaimed before khugepaged collapses them. > This aligns with user expectations. > > However, IMO, if the content of MADV_FREE pages > remains valid during collapse, then khugepaged > treating lazyfree pages the same as pte_none > might not be suitable. > > Thanks, > Lance > > On Fri, Feb 2, 2024 at 8:57 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 02-02-24 20:52:48, Lance Yang wrote: > > > On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Fri 02-02-24 19:18:31, Lance Yang wrote: > > > > > IMO, since it's treated the same as pte_none, > > > > > perhaps lazyfree pages shouldn't be copied to > > > > > the new huge page. > > > > > > > > Why? The content of MADV_FREE page is valid until it is reclaimed. > > > > > > IMO, if MADV_FREE pages are considered valid until > > > reclaimed, treating them the same as pte_none might > > > pose a conflict. > > > > What kind of conflict? > > -- > > Michal Hocko > > SUSE Labs
On Fri 02-02-24 21:46:45, Lance Yang wrote: > Here is a part from the man page explaining > the MADV_FREE semantics: > > The kernel can thus free thesepages, but the > freeing could be delayed until memory pressure > occurs. For each of the pages that has been > marked to be freed but has not yet been freed, > the free operation will be canceled if the caller > writes into the page. If there is no subsequent > write, the kernel can free the pages at any time. > > IIUC, if there is no subsequent write, lazyfree > pages will eventually be reclaimed. If there is no memory pressure then this might not ever happen. User cannot make any assumption about their content once madvise call has been done. The content has to be considered lost. Sure the userspace might have means to tell those pages from zero pages and recheck after the write but that is about it. > khugepaged > treats lazyfree pages the same as pte_none, > avoiding copying them to the new huge page > during collapse. It seems that lazyfree pages > are reclaimed before khugepaged collapses them. > This aligns with user expectations. > > However, IMO, if the content of MADV_FREE pages > remains valid during collapse, then khugepaged > treating lazyfree pages the same as pte_none > might not be suitable. Why? Unless I am missing something (which is possible of course) I do not really see why dropping the content of those pages and replacing them with a THP is any difference from reclaiming those pages and then faulting in a non-THP zero page. Now, if khugepaged reused the original content of MADV_FREE pages that would be a slightly different story. I can see why users would expect zero pages to back madvised area.
How about blocking khugepaged from collapsing lazyfree pages? This way, is it not better to keep the semantics of MADV_FREE? What do you think? Thanks, Lance On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 02-02-24 21:46:45, Lance Yang wrote: > > Here is a part from the man page explaining > > the MADV_FREE semantics: > > > > The kernel can thus free thesepages, but the > > freeing could be delayed until memory pressure > > occurs. For each of the pages that has been > > marked to be freed but has not yet been freed, > > the free operation will be canceled if the caller > > writes into the page. If there is no subsequent > > write, the kernel can free the pages at any time. > > > > IIUC, if there is no subsequent write, lazyfree > > pages will eventually be reclaimed. > > If there is no memory pressure then this might not > ever happen. User cannot make any assumption about > their content once madvise call has been done. The > content has to be considered lost. Sure the userspace > might have means to tell those pages from zero pages > and recheck after the write but that is about it. > > > khugepaged > > treats lazyfree pages the same as pte_none, > > avoiding copying them to the new huge page > > during collapse. It seems that lazyfree pages > > are reclaimed before khugepaged collapses them. > > This aligns with user expectations. > > > > However, IMO, if the content of MADV_FREE pages > > remains valid during collapse, then khugepaged > > treating lazyfree pages the same as pte_none > > might not be suitable. > > Why? > > Unless I am missing something (which is possible of > course) I do not really see why dropping the content > of those pages and replacing them with a THP is any > difference from reclaiming those pages and then faulting > in a non-THP zero page. > > Now, if khugepaged reused the original content of MADV_FREE > pages that would be a slightly different story. I can > see why users would expect zero pages to back madvised > area. > -- > Michal Hocko > SUSE Labs
On 02.02.24 15:52, Lance Yang wrote: > How about blocking khugepaged from > collapsing lazyfree pages? This way, > is it not better to keep the semantics > of MADV_FREE? > > What do you think? Why do we even want to change any behavior here? A lot of things "might cause confusion among users". Even worse, a lot of things do cause confusion ;)
On Fri 02-02-24 22:52:49, Lance Yang wrote: > How about blocking khugepaged from > collapsing lazyfree pages? This way, > is it not better to keep the semantics > of MADV_FREE? I do not see any reason why. And you are not providing any actual reasoning either. Unless you have a very specific example in mind I do not think we want to change the current behavior.
On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@gmail.com> wrote: > > How about blocking khugepaged from > collapsing lazyfree pages? This way, > is it not better to keep the semantics > of MADV_FREE? > > What do you think? First of all, khugepaged doesn't treat MADV_FREE pages as pte_none IIUC. The khugepaged does skip the 2M block if all the pages are old and unreferenced pages in the range in hpage_collapse_scan_pmd(), then repeat the check in collapse_huge_page() again. And MADV_FREE pages are just old and unreferenced. This is actually what your first test case does. The whole 2M range is MADV_FREE range, so they are skipped by khugepaged. But if the partial range is MADV_FREE, khugepaged won't skip them. This is what your second test case does. Secondly, I think it depends on the semantics of MADV_FREE, particularly how to treat the redirtied pages. TBH I'm always confused by the semantics. For example, the page contained "abcd", then it was MADV_FREE'ed, then it was written again with "1234" after "abcd". So the user should expect to see "abcd1234" or "00001234". I'm supposed it should be "abcd1234" since MADV_FREE pages are still valid and available, if I'm wrong please feel free to correct me. If so we should always copy MADV_FREE pages in khugepaged regardless of whether it is redirtied or not otherwise it may incur data corruption. If we don't copy, then the follow up redirty after collapse to the hugepage may return "00001234", right? The current behavior is copying the page. > > Thanks, > Lance > > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 02-02-24 21:46:45, Lance Yang wrote: > > > Here is a part from the man page explaining > > > the MADV_FREE semantics: > > > > > > The kernel can thus free thesepages, but the > > > freeing could be delayed until memory pressure > > > occurs. For each of the pages that has been > > > marked to be freed but has not yet been freed, > > > the free operation will be canceled if the caller > > > writes into the page. If there is no subsequent > > > write, the kernel can free the pages at any time. > > > > > > IIUC, if there is no subsequent write, lazyfree > > > pages will eventually be reclaimed. > > > > If there is no memory pressure then this might not > > ever happen. User cannot make any assumption about > > their content once madvise call has been done. The > > content has to be considered lost. Sure the userspace > > might have means to tell those pages from zero pages > > and recheck after the write but that is about it. > > > > > khugepaged > > > treats lazyfree pages the same as pte_none, > > > avoiding copying them to the new huge page > > > during collapse. It seems that lazyfree pages > > > are reclaimed before khugepaged collapses them. > > > This aligns with user expectations. > > > > > > However, IMO, if the content of MADV_FREE pages > > > remains valid during collapse, then khugepaged > > > treating lazyfree pages the same as pte_none > > > might not be suitable. > > > > Why? > > > > Unless I am missing something (which is possible of > > course) I do not really see why dropping the content > > of those pages and replacing them with a THP is any > > difference from reclaiming those pages and then faulting > > in a non-THP zero page. > > > > Now, if khugepaged reused the original content of MADV_FREE > > pages that would be a slightly different story. I can > > see why users would expect zero pages to back madvised > > area. > > -- > > Michal Hocko > > SUSE Labs
On Fri, Feb 2, 2024 at 3:23 AM Lance Yang <ioworker0@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 4:37 AM Yang Shi <shy828301@gmail.com> wrote: > > > > On Thu, Feb 1, 2024 at 4:53 AM Lance Yang <ioworker0@gmail.com> wrote: > > > > > > The collapsing behavior of khugepaged with pages > > > marked using MADV_FREE might cause confusion > > > among users. > > > > > > For instance, allocate a 2MB chunk using mmap and > > > later release it by MADV_FREE. Khugepaged will not > > > collapse this chunk. From the user's perspective, > > > it treats lazyfree pages as pte_none. However, > > > for some pages marked as lazyfree with MADV_FREE, > > > khugepaged might collapse this chunk and copy > > > these pages to a new huge page. This inconsistency > > > in behavior could be confusing for users. > > > > > > After a successful MADV_FREE operation, if there is > > > no subsequent write, the kernel can free the pages > > > at any time. Therefore, in my opinion, counting > > > lazyfree pages in max_pte_none seems reasonable. > > > > > > Perhaps treating MADV_FREE like MADV_DONTNEED, not > > > copying lazyfree pages when khugepaged collapses > > > huge pages in the background better aligns with > > > user expectations. > > > > > > Signed-off-by: Lance Yang <ioworker0@gmail.com> > > > --- > > > mm/khugepaged.c | 10 +++++++++- > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > index 2b219acb528e..6cbf46d42c6a 100644 > > > --- a/mm/khugepaged.c > > > +++ b/mm/khugepaged.c > > > @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte, > > > pmd_t orig_pmd, > > > struct vm_area_struct *vma, > > > unsigned long address, > > > + struct collapse_control *cc, > > > spinlock_t *ptl, > > > struct list_head *compound_pagelist) > > > { > > > @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte, > > > continue; > > > } > > > src_page = pte_page(pteval); > > > + > > > + if (cc->is_khugepaged > > > + && !folio_test_swapbacked(page_folio(src_page))) { > > > + clear_user_highpage(page, _address); > > > + continue; > > > > If the page was written before khugepaged collapsed it, and khugepaged > > collapsed the page before memory reclaim kicked in, didn't this > > somehow cause data corruption? > > > > Thanks a lot! Yang, you're correct; indeed, there is > a potential issue with data corruption. > > I took a look at the check for lazyfree pages in > smaps_pte_entry. > > Here's the modification: > if (cc->is_khugepaged && !PageSwapBacked(src_page) > && !pte_dirty(pteval) && !PageDirty(src_page)) { > clear_user_highpage(page, _address); > continue; > } This may be ok. But as I said in another reply, this may still incur data corruption. > > Could you please take a look? > > Thanks, > Lance > > > > + } > > > + > > > if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) { > > > result = SCAN_COPY_MC; > > > break; > > > @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > > anon_vma_unlock_write(vma->anon_vma); > > > > > > result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd, > > > - vma, address, pte_ptl, > > > + vma, address, cc, pte_ptl, > > > &compound_pagelist); > > > pte_unmap(pte); > > > if (unlikely(result != SCAN_SUCCEED)) > > > -- > > > 2.33.1 > > >
Hey Michal, David, Yang, I sincerely appreciate your time! I still have two questions that are perplexing me. First question: Given that khugepaged doesn't treat MADV_FREE pages as pte_none, why skip the 2M block when all the pages within the range are old and unreferenced, but won't skip if the partial range is MADV_FREE, even if it's not redirtied? Why make this distinction? Would it not be more straightforward to maintain if either all were skipped or not? Second question: Does copying lazyfree pages (not redirtied) to the new huge page during khugepaged collapse undermine the semantics of MADV_FREE? Users mark pages as lazyfree with MADV_FREE, expecting these pages to be eventually reclaimed. Even without subsequent writes, these pages will no longer be reclaimed, even if memory pressure occurs. BR, Lance On Sat, Feb 3, 2024 at 1:42 AM Yang Shi <shy828301@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@gmail.com> wrote: > > > > How about blocking khugepaged from > > collapsing lazyfree pages? This way, > > is it not better to keep the semantics > > of MADV_FREE? > > > > What do you think? > > First of all, khugepaged doesn't treat MADV_FREE pages as pte_none > IIUC. The khugepaged does skip the 2M block if all the pages are old > and unreferenced pages in the range in hpage_collapse_scan_pmd(), then > repeat the check in collapse_huge_page() again. > > And MADV_FREE pages are just old and unreferenced. This is actually > what your first test case does. The whole 2M range is MADV_FREE range, > so they are skipped by khugepaged. > > But if the partial range is MADV_FREE, khugepaged won't skip them. > This is what your second test case does. > > Secondly, I think it depends on the semantics of MADV_FREE, > particularly how to treat the redirtied pages. TBH I'm always confused > by the semantics. For example, the page contained "abcd", then it was > MADV_FREE'ed, then it was written again with "1234" after "abcd". So > the user should expect to see "abcd1234" or "00001234". > > I'm supposed it should be "abcd1234" since MADV_FREE pages are still > valid and available, if I'm wrong please feel free to correct me. If > so we should always copy MADV_FREE pages in khugepaged regardless of > whether it is redirtied or not otherwise it may incur data corruption. > If we don't copy, then the follow up redirty after collapse to the > hugepage may return "00001234", right? > > The current behavior is copying the page. > > > > > Thanks, > > Lance > > > > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Fri 02-02-24 21:46:45, Lance Yang wrote: > > > > Here is a part from the man page explaining > > > > the MADV_FREE semantics: > > > > > > > > The kernel can thus free thesepages, but the > > > > freeing could be delayed until memory pressure > > > > occurs. For each of the pages that has been > > > > marked to be freed but has not yet been freed, > > > > the free operation will be canceled if the caller > > > > writes into the page. If there is no subsequent > > > > write, the kernel can free the pages at any time. > > > > > > > > IIUC, if there is no subsequent write, lazyfree > > > > pages will eventually be reclaimed. > > > > > > If there is no memory pressure then this might not > > > ever happen. User cannot make any assumption about > > > their content once madvise call has been done. The > > > content has to be considered lost. Sure the userspace > > > might have means to tell those pages from zero pages > > > and recheck after the write but that is about it. > > > > > > > khugepaged > > > > treats lazyfree pages the same as pte_none, > > > > avoiding copying them to the new huge page > > > > during collapse. It seems that lazyfree pages > > > > are reclaimed before khugepaged collapses them. > > > > This aligns with user expectations. > > > > > > > > However, IMO, if the content of MADV_FREE pages > > > > remains valid during collapse, then khugepaged > > > > treating lazyfree pages the same as pte_none > > > > might not be suitable. > > > > > > Why? > > > > > > Unless I am missing something (which is possible of > > > course) I do not really see why dropping the content > > > of those pages and replacing them with a THP is any > > > difference from reclaiming those pages and then faulting > > > in a non-THP zero page. > > > > > > Now, if khugepaged reused the original content of MADV_FREE > > > pages that would be a slightly different story. I can > > > see why users would expect zero pages to back madvised > > > area. > > > -- > > > Michal Hocko > > > SUSE Labs
On Fri 02-02-24 09:42:27, Yang Shi wrote: > But if the partial range is MADV_FREE, khugepaged won't skip them. > This is what your second test case does. > > Secondly, I think it depends on the semantics of MADV_FREE, > particularly how to treat the redirtied pages. TBH I'm always confused > by the semantics. For example, the page contained "abcd", then it was > MADV_FREE'ed, then it was written again with "1234" after "abcd". So > the user should expect to see "abcd1234" or "00001234". Correct. You cannot assume the content of the first page as it could have been reclaimed at any time. > I'm supposed it should be "abcd1234" since MADV_FREE pages are still > valid and available, if I'm wrong please feel free to correct me. If > so we should always copy MADV_FREE pages in khugepaged regardless of > whether it is redirtied or not otherwise it may incur data corruption. > If we don't copy, then the follow up redirty after collapse to the > hugepage may return "00001234", right? Right. As pointed above this is a valid outcome if the page has been dropped. User has means to tell that from /proc/vmstat though. Not in a great precision but I think it would be really surprising to not see any pglazyfreed yet the content is gone. I think it would be legit to call it a bug. One could argue the bug would be in the accounting rather than the khugepaged implementation because madvised pages could be dropped at any time. But I think it makes more sense to copy the existing content.
On Fri, Feb 2, 2024 at 8:17 PM Lance Yang <ioworker0@gmail.com> wrote: > > Hey Michal, David, Yang, > > I sincerely appreciate your time! > > I still have two questions that are perplexing me. > > First question: > Given that khugepaged doesn't treat MADV_FREE > pages as pte_none, why skip the 2M block when all > the pages within the range are old and unreferenced, > but won't skip if the partial range is MADV_FREE, > even if it's not redirtied? Why make this distinction? > Would it not be more straightforward to maintain > if either all were skipped or not? It is just some heuristic in the code and may be some arbitrary choice. It could controlled in a more fine-grained way if we really see some workloads get benefit. > > Second question: > Does copying lazyfree pages (not redirtied) to the > new huge page during khugepaged collapse > undermine the semantics of MADV_FREE? > Users mark pages as lazyfree with MADV_FREE, > expecting these pages to be eventually reclaimed. > Even without subsequent writes, these pages will > no longer be reclaimed, even if memory pressure > occurs. Yeah, it just means khugepaged wins the race against page reclaim. I'm supposed the delayed free is one of the design goals of MADV_FREE, and the risk is the pages may not be freed eventually. If you want immediate free or more deterministic behavior, you should use MADV_DONTNEED or munmap IIUC. > > BR, > Lance > > On Sat, Feb 3, 2024 at 1:42 AM Yang Shi <shy828301@gmail.com> wrote: > > > > On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@gmail.com> wrote: > > > > > > How about blocking khugepaged from > > > collapsing lazyfree pages? This way, > > > is it not better to keep the semantics > > > of MADV_FREE? > > > > > > What do you think? > > > > First of all, khugepaged doesn't treat MADV_FREE pages as pte_none > > IIUC. The khugepaged does skip the 2M block if all the pages are old > > and unreferenced pages in the range in hpage_collapse_scan_pmd(), then > > repeat the check in collapse_huge_page() again. > > > > And MADV_FREE pages are just old and unreferenced. This is actually > > what your first test case does. The whole 2M range is MADV_FREE range, > > so they are skipped by khugepaged. > > > > But if the partial range is MADV_FREE, khugepaged won't skip them. > > This is what your second test case does. > > > > Secondly, I think it depends on the semantics of MADV_FREE, > > particularly how to treat the redirtied pages. TBH I'm always confused > > by the semantics. For example, the page contained "abcd", then it was > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So > > the user should expect to see "abcd1234" or "00001234". > > > > I'm supposed it should be "abcd1234" since MADV_FREE pages are still > > valid and available, if I'm wrong please feel free to correct me. If > > so we should always copy MADV_FREE pages in khugepaged regardless of > > whether it is redirtied or not otherwise it may incur data corruption. > > If we don't copy, then the follow up redirty after collapse to the > > hugepage may return "00001234", right? > > > > The current behavior is copying the page. > > > > > > > > Thanks, > > > Lance > > > > > > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Fri 02-02-24 21:46:45, Lance Yang wrote: > > > > > Here is a part from the man page explaining > > > > > the MADV_FREE semantics: > > > > > > > > > > The kernel can thus free thesepages, but the > > > > > freeing could be delayed until memory pressure > > > > > occurs. For each of the pages that has been > > > > > marked to be freed but has not yet been freed, > > > > > the free operation will be canceled if the caller > > > > > writes into the page. If there is no subsequent > > > > > write, the kernel can free the pages at any time. > > > > > > > > > > IIUC, if there is no subsequent write, lazyfree > > > > > pages will eventually be reclaimed. > > > > > > > > If there is no memory pressure then this might not > > > > ever happen. User cannot make any assumption about > > > > their content once madvise call has been done. The > > > > content has to be considered lost. Sure the userspace > > > > might have means to tell those pages from zero pages > > > > and recheck after the write but that is about it. > > > > > > > > > khugepaged > > > > > treats lazyfree pages the same as pte_none, > > > > > avoiding copying them to the new huge page > > > > > during collapse. It seems that lazyfree pages > > > > > are reclaimed before khugepaged collapses them. > > > > > This aligns with user expectations. > > > > > > > > > > However, IMO, if the content of MADV_FREE pages > > > > > remains valid during collapse, then khugepaged > > > > > treating lazyfree pages the same as pte_none > > > > > might not be suitable. > > > > > > > > Why? > > > > > > > > Unless I am missing something (which is possible of > > > > course) I do not really see why dropping the content > > > > of those pages and replacing them with a THP is any > > > > difference from reclaiming those pages and then faulting > > > > in a non-THP zero page. > > > > > > > > Now, if khugepaged reused the original content of MADV_FREE > > > > pages that would be a slightly different story. I can > > > > see why users would expect zero pages to back madvised > > > > area. > > > > -- > > > > Michal Hocko > > > > SUSE Labs
On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 02-02-24 09:42:27, Yang Shi wrote: > > But if the partial range is MADV_FREE, khugepaged won't skip them. > > This is what your second test case does. > > > > Secondly, I think it depends on the semantics of MADV_FREE, > > particularly how to treat the redirtied pages. TBH I'm always confused > > by the semantics. For example, the page contained "abcd", then it was > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So > > the user should expect to see "abcd1234" or "00001234". > > Correct. You cannot assume the content of the first page as it could > have been reclaimed at any time. > > > I'm supposed it should be "abcd1234" since MADV_FREE pages are still > > valid and available, if I'm wrong please feel free to correct me. If > > so we should always copy MADV_FREE pages in khugepaged regardless of > > whether it is redirtied or not otherwise it may incur data corruption. > > If we don't copy, then the follow up redirty after collapse to the > > hugepage may return "00001234", right? > > Right. As pointed above this is a valid outcome if the page has been > dropped. User has means to tell that from /proc/vmstat though. Not in a > great precision but I think it would be really surprising to not see any > pglazyfreed yet the content is gone. I think it would be legit to call > it a bug. One could argue the bug would be in the accounting rather than > the khugepaged implementation because madvised pages could be dropped at > any time. But I think it makes more sense to copy the existing content. Yeah, as long as khugepaged sees the MADV_FREE pages, it means they have "NOT" been dropped yet. It may be dropped later if memory pressure occurs, but anyway khugepaged wins the race and khugepaged can't assume the pages will be dropped before they get redirtied. So copying the content does make sense. > -- > Michal Hocko > SUSE Labs
On Mon, Feb 5, 2024 at 11:43 AM Yang Shi <shy828301@gmail.com> wrote: > > On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 02-02-24 09:42:27, Yang Shi wrote: > > > But if the partial range is MADV_FREE, khugepaged won't skip them. > > > This is what your second test case does. > > > > > > Secondly, I think it depends on the semantics of MADV_FREE, > > > particularly how to treat the redirtied pages. TBH I'm always confused > > > by the semantics. For example, the page contained "abcd", then it was > > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So > > > the user should expect to see "abcd1234" or "00001234". > > > > Correct. You cannot assume the content of the first page as it could > > have been reclaimed at any time. > > > > > I'm supposed it should be "abcd1234" since MADV_FREE pages are still > > > valid and available, if I'm wrong please feel free to correct me. If > > > so we should always copy MADV_FREE pages in khugepaged regardless of > > > whether it is redirtied or not otherwise it may incur data corruption. > > > If we don't copy, then the follow up redirty after collapse to the > > > hugepage may return "00001234", right? > > > > Right. As pointed above this is a valid outcome if the page has been > > dropped. User has means to tell that from /proc/vmstat though. Not in a > > great precision but I think it would be really surprising to not see any > > pglazyfreed yet the content is gone. I think it would be legit to call > > it a bug. One could argue the bug would be in the accounting rather than > > the khugepaged implementation because madvised pages could be dropped at > > any time. But I think it makes more sense to copy the existing content. +1. I agree that the content should be dropped iff pglazyfreed is incremented. Of course, we could do that here, but I don't think there is a good reason to, and we should just copy the contents. > Yeah, as long as khugepaged sees the MADV_FREE pages, it means they > have "NOT" been dropped yet. It may be dropped later if memory > pressure occurs, but anyway khugepaged wins the race and khugepaged > can't assume the pages will be dropped before they get redirtied. So > copying the content does make sense. Per Lance, I kinda get that this "undermines" MADV_FREE, insofar that, from the user's perspective, that memory which was intended as a buffer against OOM kill scenarios, is no longer there to reclaim trivially. I don't have a real world example where this is an issue, however. Also, not copying the contents doesn't change that fact. The proper alternative, if you want to make the "undermining" argument, is for khugepaged to stay away from hugepage regions with any MADV_FREE pages. I think it's fair to assume MADV_FREE'd memory is likely cold memory, and therefore not a good hugepage target anyways. However, it'd be unfortunate if there were a couple MADV_FREE pages in the middle of an otherwise hot / highly-utilized hugepage region that would prevent it from being pmd-mapped via khugepaged. But.. this is exactly-ish what you get when hugepage-ware system/runtime allocators split THPs to free up internal caches. Best, Zach > > -- > > Michal Hocko > > SUSE Labs
Hey Zach, Yang, Michal, and David, Please accept my sincerest apologies for the delayed response. Thanks for the replies; it‘s been very helpful to me! I also appreciate the valuable information you’ve shared! I agree that it’s not a good idea to let khugepaged avoid any pages marked with MADV_FREE. Thanks again for your time! Best, Lance On Tue, Feb 6, 2024 at 4:27 AM Zach O'Keefe <zokeefe@google.com> wrote: > > On Mon, Feb 5, 2024 at 11:43 AM Yang Shi <shy828301@gmail.com> wrote: > > > > On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Fri 02-02-24 09:42:27, Yang Shi wrote: > > > > But if the partial range is MADV_FREE, khugepaged won't skip them. > > > > This is what your second test case does. > > > > > > > > Secondly, I think it depends on the semantics of MADV_FREE, > > > > particularly how to treat the redirtied pages. TBH I'm always confused > > > > by the semantics. For example, the page contained "abcd", then it was > > > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So > > > > the user should expect to see "abcd1234" or "00001234". > > > > > > Correct. You cannot assume the content of the first page as it could > > > have been reclaimed at any time. > > > > > > > I'm supposed it should be "abcd1234" since MADV_FREE pages are still > > > > valid and available, if I'm wrong please feel free to correct me. If > > > > so we should always copy MADV_FREE pages in khugepaged regardless of > > > > whether it is redirtied or not otherwise it may incur data corruption. > > > > If we don't copy, then the follow up redirty after collapse to the > > > > hugepage may return "00001234", right? > > > > > > Right. As pointed above this is a valid outcome if the page has been > > > dropped. User has means to tell that from /proc/vmstat though. Not in a > > > great precision but I think it would be really surprising to not see any > > > pglazyfreed yet the content is gone. I think it would be legit to call > > > it a bug. One could argue the bug would be in the accounting rather than > > > the khugepaged implementation because madvised pages could be dropped at > > > any time. But I think it makes more sense to copy the existing content. > > +1. I agree that the content should be dropped iff pglazyfreed is > incremented. Of course, we could do that here, but I don't think there > is a good reason to, and we should just copy the contents. > > > Yeah, as long as khugepaged sees the MADV_FREE pages, it means they > > have "NOT" been dropped yet. It may be dropped later if memory > > pressure occurs, but anyway khugepaged wins the race and khugepaged > > can't assume the pages will be dropped before they get redirtied. So > > copying the content does make sense. > > Per Lance, I kinda get that this "undermines" MADV_FREE, insofar that, > from the user's perspective, that memory which was intended as a > buffer against OOM kill scenarios, is no longer there to reclaim trivially. I > don't have a real world example where this is an issue, however. Also, > not copying the contents doesn't change that fact. > > The proper alternative, if you want to make the "undermining" > argument, is for khugepaged to stay away from hugepage regions with > any MADV_FREE pages. I think it's fair to assume MADV_FREE'd memory is > likely cold memory, and therefore not a good hugepage target anyways. > However, it'd be unfortunate if there were a couple MADV_FREE pages in > the middle of an otherwise hot / highly-utilized hugepage region that > would prevent it from being pmd-mapped via khugepaged. But.. this is > exactly-ish what you get when hugepage-ware system/runtime allocators > split THPs to free up internal caches. > > Best, > Zach > > > > > -- > > > Michal Hocko > > > SUSE Labs
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2b219acb528e..6cbf46d42c6a 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte, pmd_t orig_pmd, struct vm_area_struct *vma, unsigned long address, + struct collapse_control *cc, spinlock_t *ptl, struct list_head *compound_pagelist) { @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte, continue; } src_page = pte_page(pteval); + + if (cc->is_khugepaged + && !folio_test_swapbacked(page_folio(src_page))) { + clear_user_highpage(page, _address); + continue; + } + if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) { result = SCAN_COPY_MC; break; @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, anon_vma_unlock_write(vma->anon_vma); result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd, - vma, address, pte_ptl, + vma, address, cc, pte_ptl, &compound_pagelist); pte_unmap(pte); if (unlikely(result != SCAN_SUCCEED))