[3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
Message ID | 8d7737208bd24e754dc7a538a3f7f02de84f1f72.1708097962.git.donettom@linux.ibm.com |
---|---|
State | New |
Headers |
From: Donet Tom <donettom@linux.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Aneesh Kumar <aneesh.kumar@kernel.org>, Huang Ying <ying.huang@intel.com>, Dave Hansen <dave.hansen@linux.intel.com>, Mel Gorman <mgorman@suse.de>, Ben Widawsky <ben.widawsky@intel.com>, Feng Tang <feng.tang@intel.com>, Michal Hocko <mhocko@kernel.org>, Andrea Arcangeli <aarcange@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Rik van Riel <riel@surriel.com>, Johannes Weiner <hannes@cmpxchg.org>, Matthew Wilcox <willy@infradead.org>, Mike Kravetz <mike.kravetz@oracle.com>, Vlastimil Babka <vbabka@suse.cz>, Dan Williams <dan.j.williams@intel.com>, Hugh Dickins <hughd@google.com>, Kefeng Wang <wangkefeng.wang@huawei.com>, Suren Baghdasaryan <surenb@google.com>
Subject: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
Date: Sat, 17 Feb 2024 01:31:35 -0600
Message-Id: <8d7737208bd24e754dc7a538a3f7f02de84f1f72.1708097962.git.donettom@linux.ibm.com>
In-Reply-To: <9c3f7b743477560d1c5b12b8c111a584a2cc92ee.1708097962.git.donettom@linux.ibm.com>
References: <9c3f7b743477560d1c5b12b8c111a584a2cc92ee.1708097962.git.donettom@linux.ibm.com> |
Series |
[1/3] mm/mempolicy: Use the already fetched local variable
|
|
Commit Message
Donet Tom
Feb. 17, 2024, 7:31 a.m. UTC
commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
nodes") added support for migrate on protnone reference with MPOL_BIND
memory policy. This allowed numa fault migration when the executing node
is part of the policy mask for MPOL_BIND. This patch extends migration
support to MPOL_PREFERRED_MANY policy.

Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
MPOL_F_NUMA_BALANCING. This causes issues when we want to use
NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
the kernel should not allocate pages from the slower memory tier via
allocation control zonelist fallback. Instead, we should move cold pages
from the faster memory node via memory demotion. For a page allocation,
kswapd is only woken up after we try to allocate pages from all nodes in
the allocation zone list. This implies that, without using memory
policies, we will end up allocating hot pages in the slower memory tier.

MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
allocation control when we have memory tiers in the system. With
MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
of faster memory nodes. When we fail to allocate pages from the faster
memory node, kswapd would be woken up, allowing demotion of cold pages
to slower memory nodes.

With the current kernel, such usage of memory policies implies we can't
do page promotion from a slower memory tier to a faster memory tier
using numa fault. This patch fixes this issue.

For MPOL_PREFERRED_MANY, if the executing node is in the policy node
mask, we allow numa migration to the executing nodes. If the executing
node is not in the policy node mask but the folio is already allocated
based on policy preference (the folio node is in the policy node mask),
we don't allow numa migration. If both the executing node and folio node
are outside the policy node mask, we allow numa migration to the
executing nodes.

Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)
Comments
On Sat 17-02-24 01:31:35, Donet Tom wrote:
> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
> nodes") added support for migrate on protnone reference with MPOL_BIND
> memory policy. This allowed numa fault migration when the executing node
> is part of the policy mask for MPOL_BIND. This patch extends migration
> support to MPOL_PREFERRED_MANY policy.
>
> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
> MPOL_F_NUMA_BALANCING. This causes issues when we want to use
> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
> the kernel should not allocate pages from the slower memory tier via
> allocation control zonelist fallback. Instead, we should move cold pages
> from the faster memory node via memory demotion. For a page allocation,
> kswapd is only woken up after we try to allocate pages from all nodes in
> the allocation zone list. This implies that, without using memory
> policies, we will end up allocating hot pages in the slower memory tier.
>
> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
> allocation control when we have memory tiers in the system. With
> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
> of faster memory nodes. When we fail to allocate pages from the faster
> memory node, kswapd would be woken up, allowing demotion of cold pages
> to slower memory nodes.
>
> With the current kernel, such usage of memory policies implies we can't
> do page promotion from a slower memory tier to a faster memory tier
> using numa fault. This patch fixes this issue.
>
> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
> mask, we allow numa migration to the executing nodes. If the executing
> node is not in the policy node mask but the folio is already allocated
> based on policy preference (the folio node is in the policy node mask),
> we don't allow numa migration. If both the executing node and folio node
> are outside the policy node mask, we allow numa migration to the
> executing nodes.

The feature makes sense to me. How has this been tested? Do you have any
numbers to present?

> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
> mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)

I haven't spotted anything obviously wrong in the patch itself but I
admit this is not an area I am actively familiar with so I might be
missing something.
On 2/19/24 17:37, Michal Hocko wrote:
> On Sat 17-02-24 01:31:35, Donet Tom wrote:
>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
>> nodes") added support for migrate on protnone reference with MPOL_BIND
>> memory policy. This allowed numa fault migration when the executing node
>> is part of the policy mask for MPOL_BIND. This patch extends migration
>> support to MPOL_PREFERRED_MANY policy.
>>
>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use
>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
>> the kernel should not allocate pages from the slower memory tier via
>> allocation control zonelist fallback. Instead, we should move cold pages
>> from the faster memory node via memory demotion. For a page allocation,
>> kswapd is only woken up after we try to allocate pages from all nodes in
>> the allocation zone list. This implies that, without using memory
>> policies, we will end up allocating hot pages in the slower memory tier.
>>
>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
>> allocation control when we have memory tiers in the system. With
>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
>> of faster memory nodes. When we fail to allocate pages from the faster
>> memory node, kswapd would be woken up, allowing demotion of cold pages
>> to slower memory nodes.
>>
>> With the current kernel, such usage of memory policies implies we can't
>> do page promotion from a slower memory tier to a faster memory tier
>> using numa fault. This patch fixes this issue.
>>
>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>> mask, we allow numa migration to the executing nodes. If the executing
>> node is not in the policy node mask but the folio is already allocated
>> based on policy preference (the folio node is in the policy node mask),
>> we don't allow numa migration. If both the executing node and folio node
>> are outside the policy node mask, we allow numa migration to the
>> executing nodes.
> The feature makes sense to me. How has this been tested? Do you have any
> numbers to present?

Hi Michal

I have a test program which allocates memory on a specified node and
triggers promotion or migration (by repeatedly accessing the pages).

Without this patch, if we set MPOL_PREFERRED_MANY, promotion or migration
did not happen; with this patch I could see pages getting migrated or
promoted.

My system has 2 CPU+DRAM nodes (Tier 1) and 1 PMEM node (Tier 2). Below
are my test results.

In the tables below, N0 and N1 are Tier 1 nodes and N6 is the Tier 2 node.
Exec_Node is the execution node, Policy is the set of nodes in the
nodemask, and "Curr Location Pages" is the node where the pages reside
before migration or promotion starts.

Test Results
------------------
Scenario 1: if the executing node is in the policy node mask
================================================================================
Exec_Node    Policy      Curr Location Pages    Observations
================================================================================
N0           N0 N1 N6    N1                     Pages Migrated from N1 to N0
N0           N0 N1 N6    N6                     Pages Promoted from N6 to N0
N0           N0 N1       N1                     Pages Migrated from N1 to N0
N0           N0 N1       N6                     Pages Promoted from N6 to N0

Scenario 2: If the folio node is in policy node mask and Exec node not in policy node mask
================================================================================
Exec_Node    Policy      Curr Location Pages    Observations
================================================================================
N0           N1 N6       N1                     Pages are not Migrating to N0
N0           N1 N6       N6                     Pages are not Migrating to N0
N0           N1          N1                     Pages are not Migrating to N0

Scenario 3: both the folio node and executing node are outside the policy nodemask
==============================================================================
Exec_Node    Policy      Curr Location Pages    Observations
==============================================================================
N0           N1          N6                     Pages Promoted from N6 to N0
N0           N6          N1                     Pages Migrated from N1 to N0

Thanks
Donet Tom
>
>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
>> 1 file changed, 26 insertions(+), 2 deletions(-)
> I haven't spotted anything obviously wrong in the patch itself but I
> admit this is not an area I am actively familiar with so I might be
> missing something.
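The three scenarios in the test results above follow directly from the rule stated in the commit message, and can be reproduced with a small userspace model. This is a sketch under stated assumptions, not kernel code: the nodemask is modelled as a 64-bit word, and the type, macro and function names (model_nodemask, NODE_BIT, model_should_migrate) are invented for illustration. It covers only the policy check; in the kernel the usual NUMA-balancing heuristics must also agree before a folio actually moves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-in for nodemask_t: bit n set means node n is in the mask. */
typedef uint64_t model_nodemask;

#define NODE_BIT(n) ((model_nodemask)1 << (n))

static bool model_node_isset(int node, model_nodemask mask)
{
	return (mask & NODE_BIT(node)) != 0;
}

/* The three cases the commit message describes for MPOL_PREFERRED_MANY. */
static bool model_should_migrate(int exec_node, int folio_node,
				 model_nodemask policy)
{
	if (model_node_isset(exec_node, policy))
		return true;	/* scenario 1: executing node is in the mask */
	if (model_node_isset(folio_node, policy))
		return false;	/* scenario 2: folio already placed per policy */
	return true;		/* scenario 3: both outside the mask */
}

int main(void)
{
	/* One representative row per policy mask from the reported tables. */
	const struct { int exec, folio; model_nodemask policy; } rows[] = {
		{ 0, 1, NODE_BIT(0) | NODE_BIT(1) | NODE_BIT(6) },
		{ 0, 6, NODE_BIT(0) | NODE_BIT(1) },
		{ 0, 1, NODE_BIT(1) | NODE_BIT(6) },
		{ 0, 6, NODE_BIT(1) | NODE_BIT(6) },
		{ 0, 6, NODE_BIT(1) },
		{ 0, 1, NODE_BIT(6) },
	};

	/* Sanity check: a folio inside the mask stays when the CPU is outside. */
	assert(!model_should_migrate(0, 1, NODE_BIT(1)));

	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("exec=N%d folio=N%d -> %s\n", rows[i].exec, rows[i].folio,
		       model_should_migrate(rows[i].exec, rows[i].folio,
					    rows[i].policy) ? "migrate" : "stay");
	return 0;
}
```

The model returns "stay" only for the two rows whose policy mask is N1 N6, which lines up with the "not Migrating" observations of Scenario 2.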
On Sat 17-02-24 01:31:35, Donet Tom wrote:
[...]
> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
> +						       struct mempolicy *pol)
> +{
> +	/* if the executing node is in the policy node mask, migrate */
> +	if (node_isset(exec_node, pol->nodes))
> +		return true;
> +
> +	/* If the folio node is in policy node mask, don't migrate */
> +	if (node_isset(folio_node, pol->nodes))
> +		return false;
> +	/*
> +	 * both the folio node and executing node are outside the policy nodemask,
> +	 * migrate as normal numa fault migration.
> +	 */
> +	return true;
> +}

I have looked at this again and only now noticed that this doesn't
really work as one would expect.

	case MPOL_PREFERRED_MANY:
		/*
		 * use current page if in policy nodemask,
		 * else select nearest allowed node, if any.
		 * If no allowed nodes, use current [!misplaced].
		 */
		if (node_isset(curnid, pol->nodes))
			goto out;
		z = first_zones_zonelist(
				node_zonelist(numa_node_id(), GFP_HIGHUSER),
				gfp_zone(GFP_HIGHUSER),
				&pol->nodes);
		polnid = zone_to_nid(z->zone);
		break;

Will collapse the whole MPOL_PREFERRED_MANY nodemask into the first
node in that mask. Is that really what we want here? Shouldn't we use
the full nodemask as the migration target?
On 2/19/24 19:50, Michal Hocko wrote:
> On Sat 17-02-24 01:31:35, Donet Tom wrote:
> [...]
>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
>> +						       struct mempolicy *pol)
>> +{
>> +	/* if the executing node is in the policy node mask, migrate */
>> +	if (node_isset(exec_node, pol->nodes))
>> +		return true;
>> +
>> +	/* If the folio node is in policy node mask, don't migrate */
>> +	if (node_isset(folio_node, pol->nodes))
>> +		return false;
>> +	/*
>> +	 * both the folio node and executing node are outside the policy nodemask,
>> +	 * migrate as normal numa fault migration.
>> +	 */
>> +	return true;
>> +}
> I have looked at this again and only now noticed that this doesn't
> really work as one would expect.
>
> 	case MPOL_PREFERRED_MANY:
> 		/*
> 		 * use current page if in policy nodemask,
> 		 * else select nearest allowed node, if any.
> 		 * If no allowed nodes, use current [!misplaced].
> 		 */
> 		if (node_isset(curnid, pol->nodes))
> 			goto out;
> 		z = first_zones_zonelist(
> 				node_zonelist(numa_node_id(), GFP_HIGHUSER),
> 				gfp_zone(GFP_HIGHUSER),
> 				&pol->nodes);
> 		polnid = zone_to_nid(z->zone);
> 		break;
>
> Will collapse the whole MPOL_PREFERRED_MANY nodemask into the first
> node in that mask. Is that really what we want here? Shouldn't we use
> the full nodemask as the migration target?

With this patch it will take the full nodemask and find out the correct migration target. It will not collapse into the first node.

For example, suppose we have 5 NUMA nodes in our system, N1 to N5, all five
are in the nodemask and the execution node is N3. With this fix,
mpol_preferred_should_numa_migrate() will return true because the
execution node is in the nodemask. So mpol_misplaced() will select N3 as
the migration target, since MPOL_F_MORON is set, and migrate the pages to N3.

	/* Migrate the folio towards the node whose CPU is referencing it */
	if (pol->flags & MPOL_F_MORON) {
		polnid = thisnid;

So with this patch pages will get migrated to the correct migration target.
On Mon 19-02-24 20:37:17, Donet Tom wrote:
>
> On 2/19/24 19:50, Michal Hocko wrote:
> > On Sat 17-02-24 01:31:35, Donet Tom wrote:
> > [...]
> > > +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
> > > +						       struct mempolicy *pol)
> > > +{
> > > +	/* if the executing node is in the policy node mask, migrate */
> > > +	if (node_isset(exec_node, pol->nodes))
> > > +		return true;
> > > +
> > > +	/* If the folio node is in policy node mask, don't migrate */
> > > +	if (node_isset(folio_node, pol->nodes))
> > > +		return false;
> > > +	/*
> > > +	 * both the folio node and executing node are outside the policy nodemask,
> > > +	 * migrate as normal numa fault migration.
> > > +	 */
> > > +	return true;
> > > +}
> > I have looked at this again and only now noticed that this doesn't
> > really work as one would expect.
> >
> > 	case MPOL_PREFERRED_MANY:
> > 		/*
> > 		 * use current page if in policy nodemask,
> > 		 * else select nearest allowed node, if any.
> > 		 * If no allowed nodes, use current [!misplaced].
> > 		 */
> > 		if (node_isset(curnid, pol->nodes))
> > 			goto out;
> > 		z = first_zones_zonelist(
> > 				node_zonelist(numa_node_id(), GFP_HIGHUSER),
> > 				gfp_zone(GFP_HIGHUSER),
> > 				&pol->nodes);
> > 		polnid = zone_to_nid(z->zone);
> > 		break;
> >
> > Will collapse the whole MPOL_PREFERRED_MANY nodemask into the first
> > node in that mask. Is that really what we want here? Shouldn't we use
> > the full nodemask as the migration target?
>
> With this patch it will take the full nodemask and find out the correct migration target. It will not collapse into the first node.

Correct me if I am wrong, but mpol_misplaced will return the first node
of the preferred node mask and then migrate_misplaced_folio would use
it as a target node for alloc_misplaced_dst_folio which performs
__GFP_THISNODE allocation so it won't fall back to a different node.
On 2/20/24 12:42 AM, Michal Hocko wrote:
> On Mon 19-02-24 20:37:17, Donet Tom wrote:
>>
>> On 2/19/24 19:50, Michal Hocko wrote:
>>> On Sat 17-02-24 01:31:35, Donet Tom wrote:
>>> [...]
>>>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
>>>> +						       struct mempolicy *pol)
>>>> +{
>>>> +	/* if the executing node is in the policy node mask, migrate */
>>>> +	if (node_isset(exec_node, pol->nodes))
>>>> +		return true;
>>>> +
>>>> +	/* If the folio node is in policy node mask, don't migrate */
>>>> +	if (node_isset(folio_node, pol->nodes))
>>>> +		return false;
>>>> +	/*
>>>> +	 * both the folio node and executing node are outside the policy nodemask,
>>>> +	 * migrate as normal numa fault migration.
>>>> +	 */
>>>> +	return true;
>>>> +}
>>> I have looked at this again and only now noticed that this doesn't
>>> really work as one would expect.
>>>
>>> 	case MPOL_PREFERRED_MANY:
>>> 		/*
>>> 		 * use current page if in policy nodemask,
>>> 		 * else select nearest allowed node, if any.
>>> 		 * If no allowed nodes, use current [!misplaced].
>>> 		 */
>>> 		if (node_isset(curnid, pol->nodes))
>>> 			goto out;
>>> 		z = first_zones_zonelist(
>>> 				node_zonelist(numa_node_id(), GFP_HIGHUSER),
>>> 				gfp_zone(GFP_HIGHUSER),
>>> 				&pol->nodes);
>>> 		polnid = zone_to_nid(z->zone);
>>> 		break;
>>>
>>> Will collapse the whole MPOL_PREFERRED_MANY nodemask into the first
>>> node in that mask. Is that really what we want here? Shouldn't we use
>>> the full nodemask as the migration target?
>>
>> With this patch it will take the full nodemask and find out the correct migration target. It will not collapse into the first node.
>
> Correct me if I am wrong, but mpol_misplaced will return the first node
> of the preferred node mask and then migrate_misplaced_folio would use
> it as a target node for alloc_misplaced_dst_folio which performs
> __GFP_THISNODE allocation so it won't fall back to a different node.

I think the confusion is between MPOL_F_MOF (migrate on fault) and
MPOL_F_MORON (protnone fault/numa fault). With MPOL_F_MOF alone, what we
wanted to achieve was to have mbind() lazily migrate the pages based on
the policy node mask. The change was introduced in commit b24f53a0bea3
("mm: mempolicy: Add MPOL_MF_LAZY") and later dropped by commit
2cafb582173f ("mempolicy: remove confusing MPOL_MF_LAZY dead code"). We
still have the mpol_misplaced changes that handle node selection for the
MPOL_F_MOF flag (this is dead code IIUC).

MPOL_F_MORON was added in commit 5606e3877ad8 ("mm: numa: Migrate on
reference policy"), and currently upstream only MPOL_BIND supports that
flag. With that flag specified and with the changes in the patch,
mpol_misplaced becomes

	case MPOL_PREFERRED_MANY:
		if (pol->flags & MPOL_F_MORON) {
			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
				goto out;
			break;
		}

		/*
		 * use current page if in policy nodemask,
		 * else select nearest allowed node, if any.
		 * If no allowed nodes, use current [!misplaced].
		 */
		if (node_isset(curnid, pol->nodes))
			goto out;
		z = first_zones_zonelist(
				node_zonelist(thisnid, GFP_HIGHUSER),
				gfp_zone(GFP_HIGHUSER),
				&pol->nodes);
		polnid = zone_to_nid(z->zone);
		break;
	....
	..
	}

	/* Migrate the folio towards the node whose CPU is referencing it */
	if (pol->flags & MPOL_F_MORON) {
		polnid = thisnid;

		if (!should_numa_migrate_memory(current, folio, curnid, thiscpu))
			goto out;
	}

	if (curnid != polnid)
		ret = polnid;
out:
	mpol_cond_put(pol);

	return ret;
}

i.e., if we can do numa migration, we select the currently executing node
as the target node; otherwise we end up returning from the function with
ret = NUMA_NO_NODE.

-aneesh
Donet Tom <donettom@linux.ibm.com> writes: > On 2/19/24 17:37, Michal Hocko wrote: >> On Sat 17-02-24 01:31:35, Donet Tom wrote: >>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>> nodes") added support for migrate on protnone reference with MPOL_BIND >>> memory policy. This allowed numa fault migration when the executing node >>> is part of the policy mask for MPOL_BIND. This patch extends migration >>> support to MPOL_PREFERRED_MANY policy. >>> >>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>> the kernel should not allocate pages from the slower memory tier via >>> allocation control zonelist fallback. Instead, we should move cold pages >>> from the faster memory node via memory demotion. For a page allocation, >>> kswapd is only woken up after we try to allocate pages from all nodes in >>> the allocation zone list. This implies that, without using memory >>> policies, we will end up allocating hot pages in the slower memory tier. >>> >>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>> allocation control when we have memory tiers in the system. With >>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>> of faster memory nodes. When we fail to allocate pages from the faster >>> memory node, kswapd would be woken up, allowing demotion of cold pages >>> to slower memory nodes. >>> >>> With the current kernel, such usage of memory policies implies we can't >>> do page promotion from a slower memory tier to a faster memory tier >>> using numa fault. This patch fixes this issue. >>> >>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>> mask, we allow numa migration to the executing nodes. 
If the executing >>> node is not in the policy node mask but the folio is already allocated >>> based on policy preference (the folio node is in the policy node mask), >>> we don't allow numa migration. If both the executing node and folio node >>> are outside the policy node mask, we allow numa migration to the >>> executing nodes. >> The feature makes sense to me. How has this been tested? Do you have any >> numbers to present? > > Hi Michal > > I have a test program which allocate memory on a specified node and > trigger the promotion or migration (Keep accessing the pages). > > Without this patch if we set MPOL_PREFERRED_MANY promotion or migration was not happening > with this patch I could see pages are getting migrated or promoted. > > My system has 2 CPU+DRAM node (Tier 1) and 1 PMEM node(Tier 2). Below > are my test results. > > In below table N0 and N1 are Tier1 Nodes. N6 is the Tier2 Node. > Exec_Node is the execution node, Policy is the nodes in nodemask and > "Curr Location Pages" is the node where pages present before migration > or promotion start. 
>
> Test Results
> ------------------
> Scenario 1: if the executing node is in the policy node mask
> ================================================================================
> Exec_Node    Policy      Curr Location Pages    Observations
> ================================================================================
> N0           N0 N1 N6    N1                     Pages Migrated from N1 to N0
> N0           N0 N1 N6    N6                     Pages Promoted from N6 to N0
> N0           N0 N1       N1                     Pages Migrated from N1 to N0
> N0           N0 N1       N6                     Pages Promoted from N6 to N0
>
> Scenario 2: If the folio node is in policy node mask and Exec node not in policy node mask
> ================================================================================
> Exec_Node    Policy      Curr Location Pages    Observations
> ================================================================================
> N0           N1 N6       N1                     Pages are not Migrating to N0
> N0           N1 N6       N6                     Pages are not Migrating to N0
> N0           N1          N1                     Pages are not Migrating to N0
>
> Scenario 3: both the folio node and executing node are outside the policy nodemask
> ==============================================================================
> Exec_Node    Policy      Curr Location Pages    Observations
> ==============================================================================
> N0           N1          N6                     Pages Promoted from N6 to N0
> N0           N6          N1                     Pages Migrated from N1 to N0
>

Please use some benchmarks (e.g., redis + memtier) and show the
proc-vmstat stats and benchmark score.

Not part of the kernel series, but don't forget to submit patches to the
man pages project and numactl tool to let users use it.
--
Best Regards,
Huang, Ying

> Thanks
> Donet Tom
>
>>
>>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> ---
>>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
>>> 1 file changed, 26 insertions(+), 2 deletions(-)
>> I haven't spotted anything obviously wrong in the patch itself but I
>> admit this is not an area I am actively familiar with so I might be
>> missing something.
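For readers who want to reproduce the kind of test described above, here is a minimal sketch of how a userspace program might request MPOL_PREFERRED_MANY together with MPOL_F_NUMA_BALANCING. It is not the test program used in this thread, which was not posted. The helper names (build_nodemask, preferred_many_balancing_mode, prefer_nodes_with_balancing) and the SKETCH_ constants are hypothetical; the constants are assumed to match the values in the Linux uapi header linux/mempolicy.h. The sketch goes through the raw set_mempolicy system call and only handles node ids below the width of one unsigned long.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed to match <linux/mempolicy.h>; renamed so they cannot clash with it. */
#define SKETCH_MPOL_PREFERRED_MANY    5
#define SKETCH_MPOL_F_NUMA_BALANCING  (1 << 13)

/* Build a one-word node mask from a list of node ids (hypothetical helper). */
unsigned long build_nodemask(const int *nodes, size_t count)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < count; i++) {
		/* Ignore ids that do not fit in a single unsigned long. */
		if (nodes[i] >= 0 && nodes[i] < (int)(8 * sizeof(mask)))
			mask |= 1UL << nodes[i];
	}
	return mask;
}

/* The mode word combines the policy with the optional mode flag. */
int preferred_many_balancing_mode(void)
{
	return SKETCH_MPOL_PREFERRED_MANY | SKETCH_MPOL_F_NUMA_BALANCING;
}

/*
 * Ask the kernel to prefer the given nodes and to allow NUMA balancing
 * migration. Returns 0 on success or a negative errno value. On a kernel
 * without this patch, -EINVAL is the expected result for this combination.
 */
int prefer_nodes_with_balancing(const int *nodes, size_t count)
{
	unsigned long mask = build_nodemask(nodes, count);
	/* Assumption: passing one more than the bit count covers the whole word. */
	unsigned long maxnode = (unsigned long)(8 * sizeof(mask)) + 1;
	long ret = syscall(SYS_set_mempolicy, preferred_many_balancing_mode(),
			   &mask, maxnode);

	return ret == 0 ? 0 : -errno;
}
```

With the topology from the test report (N0 and N1 as Tier 1 nodes), a caller would pass the node list {0, 1}, which yields the mask 0x3. Whether the call succeeds depends on the running kernel and on the privileges of the process.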
On 2/20/24 12:06 PM, Huang, Ying wrote: > Donet Tom <donettom@linux.ibm.com> writes: > >> On 2/19/24 17:37, Michal Hocko wrote: >>> On Sat 17-02-24 01:31:35, Donet Tom wrote: >>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>> memory policy. This allowed numa fault migration when the executing node >>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>> support to MPOL_PREFERRED_MANY policy. >>>> >>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>> the kernel should not allocate pages from the slower memory tier via >>>> allocation control zonelist fallback. Instead, we should move cold pages >>>> from the faster memory node via memory demotion. For a page allocation, >>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>> the allocation zone list. This implies that, without using memory >>>> policies, we will end up allocating hot pages in the slower memory tier. >>>> >>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>> allocation control when we have memory tiers in the system. With >>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>> of faster memory nodes. When we fail to allocate pages from the faster >>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>> to slower memory nodes. >>>> >>>> With the current kernel, such usage of memory policies implies we can't >>>> do page promotion from a slower memory tier to a faster memory tier >>>> using numa fault. This patch fixes this issue. 
>>>> >>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>> mask, we allow numa migration to the executing nodes. If the executing >>>> node is not in the policy node mask but the folio is already allocated >>>> based on policy preference (the folio node is in the policy node mask), >>>> we don't allow numa migration. If both the executing node and folio node >>>> are outside the policy node mask, we allow numa migration to the >>>> executing nodes. >>> The feature makes sense to me. How has this been tested? Do you have any >>> numbers to present? >> >> Hi Michal >> >> I have a test program which allocate memory on a specified node and >> trigger the promotion or migration (Keep accessing the pages). >> >> Without this patch if we set MPOL_PREFERRED_MANY promotion or migration was not happening >> with this patch I could see pages are getting migrated or promoted. >> >> My system has 2 CPU+DRAM node (Tier 1) and 1 PMEM node(Tier 2). Below >> are my test results. >> >> In below table N0 and N1 are Tier1 Nodes. N6 is the Tier2 Node. >> Exec_Node is the execution node, Policy is the nodes in nodemask and >> "Curr Location Pages" is the node where pages present before migration >> or promotion start. 
>>
>> Test Results
>> ------------------
>> Scenario 1: if the executing node is in the policy node mask
>> ================================================================================
>> Exec_Node    Policy      Curr Location Pages    Observations
>> ================================================================================
>> N0           N0 N1 N6    N1                     Pages Migrated from N1 to N0
>> N0           N0 N1 N6    N6                     Pages Promoted from N6 to N0
>> N0           N0 N1       N1                     Pages Migrated from N1 to N0
>> N0           N0 N1       N6                     Pages Promoted from N6 to N0
>>
>> Scenario 2: If the folio node is in policy node mask and Exec node not in policy node mask
>> ================================================================================
>> Exec_Node    Policy      Curr Location Pages    Observations
>> ================================================================================
>> N0           N1 N6       N1                     Pages are not Migrating to N0
>> N0           N1 N6       N6                     Pages are not Migrating to N0
>> N0           N1          N1                     Pages are not Migrating to N0
>>
>> Scenario 3: both the folio node and executing node are outside the policy nodemask
>> ==============================================================================
>> Exec_Node    Policy      Curr Location Pages    Observations
>> ==============================================================================
>> N0           N1          N6                     Pages Promoted from N6 to N0
>> N0           N6          N1                     Pages Migrated from N1 to N0
>>
>
> Please use some benchmarks (e.g., redis + memtier) and show the
> proc-vmstat stats and benchmark score.

Without this change, numa fault migration is not supported with the
MPOL_PREFERRED_MANY policy, so there is no performance comparison with
and without the patch. W.r.t. the effectiveness of numa fault migration,
that is a different topic from this patch.

-aneesh
Donet Tom <donettom@linux.ibm.com> writes: > commit bda420b98505 ("numa balancing: migrate on fault among multiple bound > nodes") added support for migrate on protnone reference with MPOL_BIND > memory policy. This allowed numa fault migration when the executing node > is part of the policy mask for MPOL_BIND. This patch extends migration > support to MPOL_PREFERRED_MANY policy. > > Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag > MPOL_F_NUMA_BALANCING. This causes issues when we want to use > NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, > the kernel should not allocate pages from the slower memory tier via > allocation control zonelist fallback. Instead, we should move cold pages > from the faster memory node via memory demotion. For a page allocation, > kswapd is only woken up after we try to allocate pages from all nodes in > the allocation zone list. This implies that, without using memory > policies, we will end up allocating hot pages in the slower memory tier. > > MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add > MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better > allocation control when we have memory tiers in the system. With > MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only > of faster memory nodes. When we fail to allocate pages from the faster > memory node, kswapd would be woken up, allowing demotion of cold pages > to slower memory nodes. > > With the current kernel, such usage of memory policies implies we can't > do page promotion from a slower memory tier to a faster memory tier > using numa fault. This patch fixes this issue. > > For MPOL_PREFERRED_MANY, if the executing node is in the policy node > mask, we allow numa migration to the executing nodes. 
If the executing
> node is not in the policy node mask but the folio is already allocated
> based on policy preference (the folio node is in the policy node mask),
> we don't allow numa migration. If both the executing node and folio node
> are outside the policy node mask, we allow numa migration to the
> executing nodes.
>
> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 73d698e21dae..8c4c92b10371 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
>  	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
>  		return -EINVAL;
>  	if (*flags & MPOL_F_NUMA_BALANCING) {
> -		if (*mode != MPOL_BIND)
> +		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
> +			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
> +		else
>  			return -EINVAL;
> -		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
>  	}
>  	return 0;
>  }
> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n)
>  	kmem_cache_free(sn_cache, n);
>  }
>
> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
> +							struct mempolicy *pol)
> +{
> +	/* if the executing node is in the policy node mask, migrate */
> +	if (node_isset(exec_node, pol->nodes))
> +		return true;
> +
> +	/* If the folio node is in policy node mask, don't migrate */
> +	if (node_isset(folio_node, pol->nodes))
> +		return false;
> +	/*
> +	 * both the folio node and executing node are outside the policy nodemask,
> +	 * migrate as normal numa fault migration.
> +	 */
> +	return true;

Why?  This may cause some unexpected result.  For example, pages may be
distributed among multiple sockets unexpectedly.  So, I prefer the more
conservative policy, that is, only migrate if this node is in
pol->nodes.
--
Best Regards,
Huang, Ying

> +}
> +
>  /**
>   * mpol_misplaced - check whether current folio node is valid in policy
>   *
> @@ -2526,6 +2544,12 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>  		break;
>
>  	case MPOL_PREFERRED_MANY:
> +		if (pol->flags & MPOL_F_MORON) {
> +			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
> +				goto out;
> +			break;
> +		}
> +
>  		/*
>  		 * use current page if in policy nodemask,
>  		 * else select nearest allowed node, if any.
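To make the point under review concrete, the decision in mpol_preferred_should_numa_migrate() can be modelled in userspace next to the stricter rule suggested in the review. This is only an illustrative sketch: nodemask_model_t, mask_of(), node_in_mask(), patch_should_migrate() and conservative_should_migrate() are hypothetical names, and a 64-bit integer stands in for the kernel's nodemask_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is in the mask. */
typedef uint64_t nodemask_model_t;

/* Mask containing a single node; combine several with bitwise OR. */
nodemask_model_t mask_of(int node)
{
	return (nodemask_model_t)1 << node;
}

bool node_in_mask(int node, nodemask_model_t mask)
{
	return node >= 0 && node < 64 && ((mask >> node) & 1u) != 0;
}

/* Mirrors the three-way decision of the helper in the posted patch. */
bool patch_should_migrate(int exec_node, int folio_node, nodemask_model_t policy)
{
	/* Executing node is preferred by the policy: migrate towards it. */
	if (node_in_mask(exec_node, policy))
		return true;
	/* Folio already sits on a preferred node: leave it there. */
	if (node_in_mask(folio_node, policy))
		return false;
	/* Neither node is preferred: behave like ordinary NUMA fault migration. */
	return true;
}

/* The stricter rule from the review: migrate only to a node in the mask. */
bool conservative_should_migrate(int exec_node, int folio_node, nodemask_model_t policy)
{
	(void)folio_node; /* the folio's current node does not matter here */
	return node_in_mask(exec_node, policy);
}
```

The two rules agree whenever the executing node is in the policy mask or the folio is already on a preferred node. They differ only when both nodes are outside the mask, for example executing on N0 with the folio on N6 and a policy of N1: the patch migrates, the conservative rule does not.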
"Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes: > On 2/20/24 12:06 PM, Huang, Ying wrote: >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> On 2/19/24 17:37, Michal Hocko wrote: >>>> On Sat 17-02-24 01:31:35, Donet Tom wrote: >>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>>> memory policy. This allowed numa fault migration when the executing node >>>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>>> support to MPOL_PREFERRED_MANY policy. >>>>> >>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>>> the kernel should not allocate pages from the slower memory tier via >>>>> allocation control zonelist fallback. Instead, we should move cold pages >>>>> from the faster memory node via memory demotion. For a page allocation, >>>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>>> the allocation zone list. This implies that, without using memory >>>>> policies, we will end up allocating hot pages in the slower memory tier. >>>>> >>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>>> allocation control when we have memory tiers in the system. With >>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>>> of faster memory nodes. When we fail to allocate pages from the faster >>>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>>> to slower memory nodes. >>>>> >>>>> With the current kernel, such usage of memory policies implies we can't >>>>> do page promotion from a slower memory tier to a faster memory tier >>>>> using numa fault. This patch fixes this issue. 
>>>>> >>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>>> mask, we allow numa migration to the executing nodes. If the executing >>>>> node is not in the policy node mask but the folio is already allocated >>>>> based on policy preference (the folio node is in the policy node mask), >>>>> we don't allow numa migration. If both the executing node and folio node >>>>> are outside the policy node mask, we allow numa migration to the >>>>> executing nodes. >>>> The feature makes sense to me. How has this been tested? Do you have any >>>> numbers to present? >>> >>> Hi Michal >>> >>> I have a test program which allocate memory on a specified node and >>> trigger the promotion or migration (Keep accessing the pages). >>> >>> Without this patch if we set MPOL_PREFERRED_MANY promotion or migration was not happening >>> with this patch I could see pages are getting migrated or promoted. >>> >>> My system has 2 CPU+DRAM node (Tier 1) and 1 PMEM node(Tier 2). Below >>> are my test results. >>> >>> In below table N0 and N1 are Tier1 Nodes. N6 is the Tier2 Node. >>> Exec_Node is the execution node, Policy is the nodes in nodemask and >>> "Curr Location Pages" is the node where pages present before migration >>> or promotion start. 
>>>
>>> Test Results
>>> ------------------
>>> Scenario 1: if the executing node is in the policy node mask
>>> ================================================================================
>>> Exec_Node    Policy      Curr Location Pages    Observations
>>> ================================================================================
>>> N0           N0 N1 N6    N1                     Pages Migrated from N1 to N0
>>> N0           N0 N1 N6    N6                     Pages Promoted from N6 to N0
>>> N0           N0 N1       N1                     Pages Migrated from N1 to N0
>>> N0           N0 N1       N6                     Pages Promoted from N6 to N0
>>>
>>> Scenario 2: If the folio node is in policy node mask and Exec node not in policy node mask
>>> ================================================================================
>>> Exec_Node    Policy      Curr Location Pages    Observations
>>> ================================================================================
>>> N0           N1 N6       N1                     Pages are not Migrating to N0
>>> N0           N1 N6       N6                     Pages are not Migrating to N0
>>> N0           N1          N1                     Pages are not Migrating to N0
>>>
>>> Scenario 3: both the folio node and executing node are outside the policy nodemask
>>> ==============================================================================
>>> Exec_Node    Policy      Curr Location Pages    Observations
>>> ==============================================================================
>>> N0           N1          N6                     Pages Promoted from N6 to N0
>>> N0           N6          N1                     Pages Migrated from N1 to N0
>>>
>>
>> Please use some benchmarks (e.g., redis + memtier) and show the
>> proc-vmstat stats and benchmark score.
>
>
> Without this change numa fault migration is not supported with MPOL_PREFERRED_MANY
> policy. So there is no performance comparison with and without patch. W.r.t effectiveness of numa
> fault migration, that is a different topic from this patch

IIUC, the goal of the patch is to optimize performance, right?  If so,
the benchmark score will help justify the change.

--
Best Regards,
Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes: > "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes: > >> On 2/20/24 12:06 PM, Huang, Ying wrote: >>> Donet Tom <donettom@linux.ibm.com> writes: >>> >>>> On 2/19/24 17:37, Michal Hocko wrote: >>>>> On Sat 17-02-24 01:31:35, Donet Tom wrote: >>>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>>>> memory policy. This allowed numa fault migration when the executing node >>>>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>>>> support to MPOL_PREFERRED_MANY policy. >>>>>> >>>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>>>> the kernel should not allocate pages from the slower memory tier via >>>>>> allocation control zonelist fallback. Instead, we should move cold pages >>>>>> from the faster memory node via memory demotion. For a page allocation, >>>>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>>>> the allocation zone list. This implies that, without using memory >>>>>> policies, we will end up allocating hot pages in the slower memory tier. >>>>>> >>>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>>>> allocation control when we have memory tiers in the system. With >>>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>>>> of faster memory nodes. When we fail to allocate pages from the faster >>>>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>>>> to slower memory nodes. 
>>>>>> >>>>>> With the current kernel, such usage of memory policies implies we can't >>>>>> do page promotion from a slower memory tier to a faster memory tier >>>>>> using numa fault. This patch fixes this issue. >>>>>> >>>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>>>> mask, we allow numa migration to the executing nodes. If the executing >>>>>> node is not in the policy node mask but the folio is already allocated >>>>>> based on policy preference (the folio node is in the policy node mask), >>>>>> we don't allow numa migration. If both the executing node and folio node >>>>>> are outside the policy node mask, we allow numa migration to the >>>>>> executing nodes. >>>>> The feature makes sense to me. How has this been tested? Do you have any >>>>> numbers to present? >>>> >>>> Hi Michal >>>> >>>> I have a test program which allocate memory on a specified node and >>>> trigger the promotion or migration (Keep accessing the pages). >>>> >>>> Without this patch if we set MPOL_PREFERRED_MANY promotion or migration was not happening >>>> with this patch I could see pages are getting migrated or promoted. >>>> >>>> My system has 2 CPU+DRAM node (Tier 1) and 1 PMEM node(Tier 2). Below >>>> are my test results. >>>> >>>> In below table N0 and N1 are Tier1 Nodes. N6 is the Tier2 Node. >>>> Exec_Node is the execution node, Policy is the nodes in nodemask and >>>> "Curr Location Pages" is the node where pages present before migration >>>> or promotion start. 
>>>>
>>>> Test Results
>>>> ------------------
>>>> Scenario 1: if the executing node is in the policy node mask
>>>> ================================================================================
>>>> Exec_Node    Policy      Curr Location Pages    Observations
>>>> ================================================================================
>>>> N0           N0 N1 N6    N1                     Pages Migrated from N1 to N0
>>>> N0           N0 N1 N6    N6                     Pages Promoted from N6 to N0
>>>> N0           N0 N1       N1                     Pages Migrated from N1 to N0
>>>> N0           N0 N1       N6                     Pages Promoted from N6 to N0
>>>>
>>>> Scenario 2: If the folio node is in policy node mask and Exec node not in policy node mask
>>>> ================================================================================
>>>> Exec_Node    Policy      Curr Location Pages    Observations
>>>> ================================================================================
>>>> N0           N1 N6       N1                     Pages are not Migrating to N0
>>>> N0           N1 N6       N6                     Pages are not Migrating to N0
>>>> N0           N1          N1                     Pages are not Migrating to N0
>>>>
>>>> Scenario 3: both the folio node and executing node are outside the policy nodemask
>>>> ==============================================================================
>>>> Exec_Node    Policy      Curr Location Pages    Observations
>>>> ==============================================================================
>>>> N0           N1          N6                     Pages Promoted from N6 to N0
>>>> N0           N6          N1                     Pages Migrated from N1 to N0
>>>>
>>>
>>> Please use some benchmarks (e.g., redis + memtier) and show the
>>> proc-vmstat stats and benchmark score.
>>
>>
>> Without this change numa fault migration is not supported with MPOL_PREFERRED_MANY
>> policy. So there is no performance comparison with and without patch. W.r.t effectiveness of numa
>> fault migration, that is a different topic from this patch
>
> IIUC, the goal of the patch is to optimize performance, right?  If so,
> the benchmark score will help justify the change.
>

The objective is to enable the use of the MPOL_PREFERRED_MANY policy,
which is essential for the correct functioning of memory demotion in
conjunction with memory promotion. Once we can use memory promotion, we
should be able to observe the same benefits as those provided by numa
fault memory promotion.

The actual benefit of numa fault migration depends on various factors,
such as the speed of the slower memory device and the access pattern of
the application. We are discussing its effectiveness and how to reduce
numa fault overhead in other forums. However, we believe that this
discussion should not hinder the merging of this patch. This change is
similar to commit bda420b98505 ("numa balancing: migrate on fault among
multiple bound nodes").

-aneesh
"Huang, Ying" <ying.huang@intel.com> writes: > Donet Tom <donettom@linux.ibm.com> writes: > >> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >> nodes") added support for migrate on protnone reference with MPOL_BIND >> memory policy. This allowed numa fault migration when the executing node >> is part of the policy mask for MPOL_BIND. This patch extends migration >> support to MPOL_PREFERRED_MANY policy. >> >> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >> the kernel should not allocate pages from the slower memory tier via >> allocation control zonelist fallback. Instead, we should move cold pages >> from the faster memory node via memory demotion. For a page allocation, >> kswapd is only woken up after we try to allocate pages from all nodes in >> the allocation zone list. This implies that, without using memory >> policies, we will end up allocating hot pages in the slower memory tier. >> >> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >> allocation control when we have memory tiers in the system. With >> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >> of faster memory nodes. When we fail to allocate pages from the faster >> memory node, kswapd would be woken up, allowing demotion of cold pages >> to slower memory nodes. >> >> With the current kernel, such usage of memory policies implies we can't >> do page promotion from a slower memory tier to a faster memory tier >> using numa fault. This patch fixes this issue. >> >> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >> mask, we allow numa migration to the executing nodes. 
If the executing >> node is not in the policy node mask but the folio is already allocated >> based on policy preference (the folio node is in the policy node mask), >> we don't allow numa migration. If both the executing node and folio node >> are outside the policy node mask, we allow numa migration to the >> executing nodes. >> >> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >> --- >> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >> 1 file changed, 26 insertions(+), 2 deletions(-) >> >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 73d698e21dae..8c4c92b10371 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >> return -EINVAL; >> if (*flags & MPOL_F_NUMA_BALANCING) { >> - if (*mode != MPOL_BIND) >> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >> + else >> return -EINVAL; >> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >> } >> return 0; >> } >> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >> kmem_cache_free(sn_cache, n); >> } >> >> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >> + struct mempolicy *pol) >> +{ >> + /* if the executing node is in the policy node mask, migrate */ >> + if (node_isset(exec_node, pol->nodes)) >> + return true; >> + >> + /* If the folio node is in policy node mask, don't migrate */ >> + if (node_isset(folio_node, pol->nodes)) >> + return false; >> + /* >> + * both the folio node and executing node are outside the policy nodemask, >> + * migrate as normal numa fault migration. >> + */ >> + return true; > > Why? This may cause some unexpected result. For example, pages may be > distributed among multiple sockets unexpectedly. 
So, I prefer the more
> conservative policy, that is, only migrate if this node is in
> pol->nodes.
>

This will only have an impact if the user specifies
MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting
for frequently accessed memory pages to be migrated. Memory policy
MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of
policy->nodes. For the specific use case that I am interested in, it
should be okay to restrict it to policy->nodes. However, I am wondering
if this is too restrictive given the definition of MPOL_PREFERRED_MANY.

-aneesh
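The statement that the new behaviour only applies when the user passes MPOL_F_NUMA_BALANCING follows from the sanitize_mpol_flags() hunk in the patch. Below is a small userspace model of just that hunk, as a sketch rather than the kernel function. The sketch_ and SK_ names are hypothetical, the other checks in the real function are omitted, and the flag values are assumed to match the Linux uapi header linux/mempolicy.h.

```c
#include <errno.h>

/* Policy modes in the order assumed from the uapi enum. */
enum sketch_mode {
	SK_MPOL_DEFAULT,
	SK_MPOL_PREFERRED,
	SK_MPOL_BIND,
	SK_MPOL_INTERLEAVE,
	SK_MPOL_LOCAL,
	SK_MPOL_PREFERRED_MANY
};

/* Flag values assumed from the uapi header; illustrative only. */
#define SK_MPOL_F_MOF             (1 << 3)   /* migrate on fault */
#define SK_MPOL_F_MORON           (1 << 4)   /* migrate on protnone reference on node */
#define SK_MPOL_F_NUMA_BALANCING  (1 << 13)  /* user-visible mode flag */

/*
 * Model of the patched check: the balancing flag is accepted for
 * MPOL_BIND and MPOL_PREFERRED_MANY and rejected for every other mode.
 * Without the user flag, no migrate-on-fault flags are added.
 */
int sketch_sanitize_balancing(int mode, unsigned short *flags)
{
	if (*flags & SK_MPOL_F_NUMA_BALANCING) {
		if (mode == SK_MPOL_BIND || mode == SK_MPOL_PREFERRED_MANY)
			*flags |= (SK_MPOL_F_MOF | SK_MPOL_F_MORON);
		else
			return -EINVAL;
	}
	return 0;
}
```

In this model a MPOL_PREFERRED_MANY policy created without the user flag never gets MPOL_F_MORON, so the new branch in mpol_misplaced() is skipped and the existing behaviour is kept.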
Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>> nodes") added support for migrate on protnone reference with MPOL_BIND >>> memory policy. This allowed numa fault migration when the executing node >>> is part of the policy mask for MPOL_BIND. This patch extends migration >>> support to MPOL_PREFERRED_MANY policy. >>> >>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>> the kernel should not allocate pages from the slower memory tier via >>> allocation control zonelist fallback. Instead, we should move cold pages >>> from the faster memory node via memory demotion. For a page allocation, >>> kswapd is only woken up after we try to allocate pages from all nodes in >>> the allocation zone list. This implies that, without using memory >>> policies, we will end up allocating hot pages in the slower memory tier. >>> >>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>> allocation control when we have memory tiers in the system. With >>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>> of faster memory nodes. When we fail to allocate pages from the faster >>> memory node, kswapd would be woken up, allowing demotion of cold pages >>> to slower memory nodes. >>> >>> With the current kernel, such usage of memory policies implies we can't >>> do page promotion from a slower memory tier to a faster memory tier >>> using numa fault. This patch fixes this issue. >>> >>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>> mask, we allow numa migration to the executing nodes. 
If the executing >>> node is not in the policy node mask but the folio is already allocated >>> based on policy preference (the folio node is in the policy node mask), >>> we don't allow numa migration. If both the executing node and folio node >>> are outside the policy node mask, we allow numa migration to the >>> executing nodes. >>> >>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>> --- >>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >>> 1 file changed, 26 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >>> index 73d698e21dae..8c4c92b10371 100644 >>> --- a/mm/mempolicy.c >>> +++ b/mm/mempolicy.c >>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >>> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >>> return -EINVAL; >>> if (*flags & MPOL_F_NUMA_BALANCING) { >>> - if (*mode != MPOL_BIND) >>> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >>> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>> + else >>> return -EINVAL; >>> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>> } >>> return 0; >>> } >>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >>> kmem_cache_free(sn_cache, n); >>> } >>> >>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >>> + struct mempolicy *pol) >>> +{ >>> + /* if the executing node is in the policy node mask, migrate */ >>> + if (node_isset(exec_node, pol->nodes)) >>> + return true; >>> + >>> + /* If the folio node is in policy node mask, don't migrate */ >>> + if (node_isset(folio_node, pol->nodes)) >>> + return false; >>> + /* >>> + * both the folio node and executing node are outside the policy nodemask, >>> + * migrate as normal numa fault migration. >>> + */ >>> + return true; >> >> Why? This may cause some unexpected result. 
For example, pages may be
>> distributed among multiple sockets unexpectedly.  So, I prefer the more
>> conservative policy, that is, only migrate if this node is in
>> pol->nodes.
>>
>
> This will only have an impact if the user specifies
> MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting
> for frequently accessed memory pages to be migrated. Memory policy
> MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of
> policy->nodes. For the specific use case that I am interested in, it
> should be okay to restrict it to policy->nodes. However, I am wondering
> if this is too restrictive given the definition of MPOL_PREFERRED_MANY.

IMHO, we can start with a conservative approach and expand it if that
proves necessary.

--
Best Regards,
Huang, Ying
Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> writes: >> >>> On 2/20/24 12:06 PM, Huang, Ying wrote: >>>> Donet Tom <donettom@linux.ibm.com> writes: >>>> >>>>> On 2/19/24 17:37, Michal Hocko wrote: >>>>>> On Sat 17-02-24 01:31:35, Donet Tom wrote: >>>>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>>>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>>>>> memory policy. This allowed numa fault migration when the executing node >>>>>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>>>>> support to MPOL_PREFERRED_MANY policy. >>>>>>> >>>>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>>>>> the kernel should not allocate pages from the slower memory tier via >>>>>>> allocation control zonelist fallback. Instead, we should move cold pages >>>>>>> from the faster memory node via memory demotion. For a page allocation, >>>>>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>>>>> the allocation zone list. This implies that, without using memory >>>>>>> policies, we will end up allocating hot pages in the slower memory tier. >>>>>>> >>>>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>>>>> allocation control when we have memory tiers in the system. With >>>>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>>>>> of faster memory nodes. When we fail to allocate pages from the faster >>>>>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>>>>> to slower memory nodes. 
>>>>>>> >>>>>>> With the current kernel, such usage of memory policies implies we can't >>>>>>> do page promotion from a slower memory tier to a faster memory tier >>>>>>> using numa fault. This patch fixes this issue. >>>>>>> >>>>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>>>>> mask, we allow numa migration to the executing nodes. If the executing >>>>>>> node is not in the policy node mask but the folio is already allocated >>>>>>> based on policy preference (the folio node is in the policy node mask), >>>>>>> we don't allow numa migration. If both the executing node and folio node >>>>>>> are outside the policy node mask, we allow numa migration to the >>>>>>> executing nodes. >>>>>> The feature makes sense to me. How has this been tested? Do you have any >>>>>> numbers to present? >>>>> >>>>> Hi Michal >>>>> >>>>> I have a test program which allocate memory on a specified node and >>>>> trigger the promotion or migration (Keep accessing the pages). >>>>> >>>>> Without this patch if we set MPOL_PREFERRED_MANY promotion or migration was not happening >>>>> with this patch I could see pages are getting migrated or promoted. >>>>> >>>>> My system has 2 CPU+DRAM node (Tier 1) and 1 PMEM node(Tier 2). Below >>>>> are my test results. >>>>> >>>>> In below table N0 and N1 are Tier1 Nodes. N6 is the Tier2 Node. >>>>> Exec_Node is the execution node, Policy is the nodes in nodemask and >>>>> "Curr Location Pages" is the node where pages present before migration >>>>> or promotion start. 
>>>>>
>>>>> Test Results
>>>>> ------------------
>>>>> Scenario 1: if the executing node is in the policy node mask
>>>>> ================================================================================
>>>>> Exec_Node    Policy      Curr Location Pages    Observations
>>>>> ================================================================================
>>>>> N0           N0 N1 N6    N1                     Pages Migrated from N1 to N0
>>>>> N0           N0 N1 N6    N6                     Pages Promoted from N6 to N0
>>>>> N0           N0 N1       N1                     Pages Migrated from N1 to N0
>>>>> N0           N0 N1       N6                     Pages Promoted from N6 to N0
>>>>>
>>>>> Scenario 2: If the folio node is in policy node mask and Exec node not in policy node mask
>>>>> ================================================================================
>>>>> Exec_Node    Policy      Curr Location Pages    Observations
>>>>> ================================================================================
>>>>> N0           N1 N6       N1                     Pages are not Migrating to N0
>>>>> N0           N1 N6       N6                     Pages are not Migrating to N0
>>>>> N0           N1          N1                     Pages are not Migrating to N0
>>>>>
>>>>> Scenario 3: both the folio node and executing node are outside the policy nodemask
>>>>> ==============================================================================
>>>>> Exec_Node    Policy      Curr Location Pages    Observations
>>>>> ==============================================================================
>>>>> N0           N1          N6                     Pages Promoted from N6 to N0
>>>>> N0           N6          N1                     Pages Migrated from N1 to N0
>>>>>
>>>>
>>>> Please use some benchmarks (e.g., redis + memtier) and show the
>>>> proc-vmstat stats and benchmark score.
>>>
>>>
>>> Without this change numa fault migration is not supported with MPOL_PREFERRED_MANY
>>> policy. So there is no performance comparison with and without patch. W.r.t. effectiveness of numa
>>> fault migration, that is a different topic from this patch
>>
>> IIUC, the goal of the patch is to optimize performance, right?  If so,
>> the benchmark score will help justify the change.
>>
>
> The objective is to enable the use of the MPOL_PREFERRED_MANY policy,
> which is essential for the correct functioning of memory demotion in
> conjunction with memory promotion. Once we can use memory promotion, we
> should be able to observe the same benefits as those provided by numa
> fault memory promotion. The actual benefit of numa fault migration is
> dependent on various factors such as the speed of the slower memory
> device, the access pattern of the application, etc. We are discussing
> its effectiveness and how to improve numa fault overhead in other
> forums. However, we believe that this discussion should not hinder the
> merging of this patch.
>
> This change is similar to commit bda420b98505 ("numa balancing: migrate
> on fault among multiple bound nodes")

We provide the performance data in the description of that commit :-)

--
Best Regards,
Huang, Ying
On Tue 20-02-24 09:27:25, Aneesh Kumar K.V wrote:
[...]
> 	case MPOL_PREFERRED_MANY:
> 		if (pol->flags & MPOL_F_MORON) {
> 			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
> 				goto out;
> 			break;
> 		}
>
> 		/*
> 		 * use current page if in policy nodemask,
> 		 * else select nearest allowed node, if any.
> 		 * If no allowed nodes, use current [!misplaced].
> 		 */
> 		if (node_isset(curnid, pol->nodes))
> 			goto out;
> 		z = first_zones_zonelist(
> 				node_zonelist(thisnid, GFP_HIGHUSER),
> 				gfp_zone(GFP_HIGHUSER),
> 				&pol->nodes);
> 		polnid = zone_to_nid(z->zone);
> 		break;
> ....
> ..
> 	}
>
> 	/* Migrate the folio towards the node whose CPU is referencing it */
> 	if (pol->flags & MPOL_F_MORON) {
> 		polnid = thisnid;
>
> 		if (!should_numa_migrate_memory(current, folio, curnid,
> 						thiscpu))
> 			goto out;
> 	}
>
> 	if (curnid != polnid)
> 		ret = polnid;
> out:
> 	mpol_cond_put(pol);
>
> 	return ret;
> }

Ohh, right, this code is confusing as hell. Thanks for the clarification.
With this in mind, there should be a comment warning about MPOL_F_MOF
always being unset, as userspace cannot really set it up.

Thanks!
On 2/20/24 14:18, Michal Hocko wrote:
> On Tue 20-02-24 09:27:25, Aneesh Kumar K.V wrote:
> [...]
>> 	case MPOL_PREFERRED_MANY:
>> 		if (pol->flags & MPOL_F_MORON) {
>> 			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
>> 				goto out;
>> 			break;
>> 		}
>>
>> 		/*
>> 		 * use current page if in policy nodemask,
>> 		 * else select nearest allowed node, if any.
>> 		 * If no allowed nodes, use current [!misplaced].
>> 		 */
>> 		if (node_isset(curnid, pol->nodes))
>> 			goto out;
>> 		z = first_zones_zonelist(
>> 				node_zonelist(thisnid, GFP_HIGHUSER),
>> 				gfp_zone(GFP_HIGHUSER),
>> 				&pol->nodes);
>> 		polnid = zone_to_nid(z->zone);
>> 		break;
>> ....
>> ..
>> 	}
>>
>> 	/* Migrate the folio towards the node whose CPU is referencing it */
>> 	if (pol->flags & MPOL_F_MORON) {
>> 		polnid = thisnid;
>>
>> 		if (!should_numa_migrate_memory(current, folio, curnid,
>> 						thiscpu))
>> 			goto out;
>> 	}
>>
>> 	if (curnid != polnid)
>> 		ret = polnid;
>> out:
>> 	mpol_cond_put(pol);
>>
>> 	return ret;
>> }
> Ohh, right this code is confusing as hell. Thanks for the clarification.
> With this in mind. There should be a comment warning about MPOL_F_MOF
> always being unset as the userspace cannot really set it up.
>
> Thanks!
>

Hi Michal

Sorry for the late reply.

If we set MPOL_F_NUMA_BALANCING from userspace, then the MPOL_F_MOF and
MPOL_F_MORON flags get set in the kernel.

/* Basic parameter sanity check used by both mbind() and set_mempolicy() */
static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
{
	*flags = *mode & MPOL_MODE_FLAGS;
	*mode &= ~MPOL_MODE_FLAGS;

	if ((unsigned int)(*mode) >= MPOL_MAX)
		return -EINVAL;
	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
		return -EINVAL;
	if (*flags & MPOL_F_NUMA_BALANCING) {
		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
		else
			return -EINVAL;
	}
	return 0;
}

In the current kernel this is supported only for MPOL_BIND; this patch
adds support for MPOL_PREFERRED_MANY as well.

Why is the MPOL_F_MOF flag required?
------------------------------------
For NUMA migration, the process memory is periodically unmapped by
"task_numa_work". If the unmapped memory is accessed again, a NUMA
hinting page fault occurs, and the pages get migrated in the page fault
handler.

If MPOL_F_MOF is not set, "task_numa_work" will not unmap the process
pages, so neither the NUMA hinting page fault nor the migration will
occur. This behaviour was introduced by commit fc3147245d193b (mm: numa:
Limit NUMA scanning to migrate-on-fault VMAs).

How the new implementation works
--------------------------------
MPOL_PREFERRED_MANY is able to set MPOL_F_MOF and MPOL_F_MORON through
MPOL_F_NUMA_BALANCING, so NUMA hinting page faults will occur. In
mpol_misplaced(), if we can do numa migration, we select the currently
executing node as the target node; otherwise we return from the function
with ret = NUMA_NO_NODE.

So, since MPOL_F_MOF can be set from userspace through
MPOL_F_NUMA_BALANCING, there is no need to add this comment, right?

Thanks
Donet Tom
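To make the flag handling described above easier to follow, here is a minimal userspace sketch of the patched check. It is a model, not the kernel code: model_sanitize_mpol_flags() and MODEL_MPOL_MAX are hypothetical names, and the numeric flag values are assumed to mirror include/uapi/linux/mempolicy.h, so please verify them against your own headers.

```c
#include <errno.h>

/*
 * Assumption: mode and flag values copied from the uapi mempolicy header
 * as I recall them; verify against your kernel tree before relying on them.
 */
enum {
	MPOL_DEFAULT,
	MPOL_PREFERRED,
	MPOL_BIND,
	MPOL_INTERLEAVE,
	MPOL_LOCAL,
	MPOL_PREFERRED_MANY,
	MODEL_MPOL_MAX,		/* model-only upper bound, hypothetical */
};

#define MPOL_F_STATIC_NODES	(1 << 15)
#define MPOL_F_RELATIVE_NODES	(1 << 14)
#define MPOL_F_NUMA_BALANCING	(1 << 13)
#define MPOL_MODE_FLAGS \
	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

/* Internal flags: derived by the kernel, not part of MPOL_MODE_FLAGS. */
#define MPOL_F_MOF	(1 << 3)	/* migrate on fault */
#define MPOL_F_MORON	(1 << 4)	/* migrate on protnone reference */

/* Hypothetical userspace copy of the patched sanitize_mpol_flags(). */
static int model_sanitize_mpol_flags(int *mode, unsigned short *flags)
{
	/* Split the combined argument into mode flags and the bare mode. */
	*flags = *mode & MPOL_MODE_FLAGS;
	*mode &= ~MPOL_MODE_FLAGS;

	if ((unsigned int)(*mode) >= MODEL_MPOL_MAX)
		return -EINVAL;
	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
		return -EINVAL;
	if (*flags & MPOL_F_NUMA_BALANCING) {
		/* With the patch, both BIND and PREFERRED_MANY are accepted. */
		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
		else
			return -EINVAL;
	}
	return 0;
}
```

Passing MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING through this model leaves the bare mode in *mode and turns on both internal flags in *flags, which is the point made above: userspace never passes MPOL_F_MOF directly, it is derived from MPOL_F_NUMA_BALANCING.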
"Huang, Ying" <ying.huang@intel.com> writes: > Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: > >> "Huang, Ying" <ying.huang@intel.com> writes: >> >>> Donet Tom <donettom@linux.ibm.com> writes: >>> >>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>> memory policy. This allowed numa fault migration when the executing node >>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>> support to MPOL_PREFERRED_MANY policy. >>>> >>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>> the kernel should not allocate pages from the slower memory tier via >>>> allocation control zonelist fallback. Instead, we should move cold pages >>>> from the faster memory node via memory demotion. For a page allocation, >>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>> the allocation zone list. This implies that, without using memory >>>> policies, we will end up allocating hot pages in the slower memory tier. >>>> >>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>> allocation control when we have memory tiers in the system. With >>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>> of faster memory nodes. When we fail to allocate pages from the faster >>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>> to slower memory nodes. >>>> >>>> With the current kernel, such usage of memory policies implies we can't >>>> do page promotion from a slower memory tier to a faster memory tier >>>> using numa fault. This patch fixes this issue. 
>>>> >>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>> mask, we allow numa migration to the executing nodes. If the executing >>>> node is not in the policy node mask but the folio is already allocated >>>> based on policy preference (the folio node is in the policy node mask), >>>> we don't allow numa migration. If both the executing node and folio node >>>> are outside the policy node mask, we allow numa migration to the >>>> executing nodes. >>>> >>>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>> --- >>>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >>>> 1 file changed, 26 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >>>> index 73d698e21dae..8c4c92b10371 100644 >>>> --- a/mm/mempolicy.c >>>> +++ b/mm/mempolicy.c >>>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >>>> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >>>> return -EINVAL; >>>> if (*flags & MPOL_F_NUMA_BALANCING) { >>>> - if (*mode != MPOL_BIND) >>>> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >>>> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>> + else >>>> return -EINVAL; >>>> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>> } >>>> return 0; >>>> } >>>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >>>> kmem_cache_free(sn_cache, n); >>>> } >>>> >>>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >>>> + struct mempolicy *pol) >>>> +{ >>>> + /* if the executing node is in the policy node mask, migrate */ >>>> + if (node_isset(exec_node, pol->nodes)) >>>> + return true; >>>> + >>>> + /* If the folio node is in policy node mask, don't migrate */ >>>> + if (node_isset(folio_node, pol->nodes)) >>>> + return false; >>>> + /* >>>> + * both the folio node and executing node are outside the policy nodemask, >>>> + * migrate as 
normal numa fault migration. >>>> + */ >>>> + return true; >>> >>> Why? This may cause some unexpected result. For example, pages may be >>> distributed among multiple sockets unexpectedly. So, I prefer the more >>> conservative policy, that is, only migrate if this node is in >>> pol->nodes. >>> >> >> This will only have an impact if the user specifies >> MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting >> for frequently accessed memory pages to be migrated. Memory policy >> MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of >> policy->nodes. For the specific use case that I am interested in, it >> should be okay to restrict it to policy->nodes. However, I am wondering >> if this is too restrictive given the definition of MPOL_PREFERRED_MANY. > > IMHO, we can start with some consecutive way and expand it if it's > proved necessary. > Is this good? 1 file changed, 14 insertions(+), 34 deletions(-) mm/mempolicy.c | 48 ++++++++++++++---------------------------------- modified mm/mempolicy.c @@ -2464,23 +2464,6 @@ static void sp_free(struct sp_node *n) kmem_cache_free(sn_cache, n); } -static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, - struct mempolicy *pol) -{ - /* if the executing node is in the policy node mask, migrate */ - if (node_isset(exec_node, pol->nodes)) - return true; - - /* If the folio node is in policy node mask, don't migrate */ - if (node_isset(folio_node, pol->nodes)) - return false; - /* - * both the folio node and executing node are outside the policy nodemask, - * migrate as normal numa fault migration. 
-	 */
-	return true;
-}
-
 /**
  * mpol_misplaced - check whether current folio node is valid in policy
  *
@@ -2533,29 +2516,26 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		break;
 
 	case MPOL_BIND:
-		/* Optimize placement among multiple nodes via NUMA balancing */
+	case MPOL_PREFERRED_MANY:
+		/*
+		 * Even though MPOL_PREFERRED_MANY can allocate pages outside
+		 * policy nodemask we don't allow numa migration to nodes
+		 * outside policy nodemask for now. This is done so that if we
+		 * want demotion to slow memory to happen, before allocating
+		 * from some DRAM node say 'x', we will end up using a
+		 * MPOL_PREFERRED_MANY mask excluding node 'x'. In such scenario
+		 * we should not promote to node 'x' from slow memory node.
+		 */
 		if (pol->flags & MPOL_F_MORON) {
+			/*
+			 * Optimize placement among multiple nodes
+			 * via NUMA balancing
+			 */
 			if (node_isset(thisnid, pol->nodes))
 				break;
 			goto out;
 		}
 
-		if (node_isset(curnid, pol->nodes))
-			goto out;
-		z = first_zones_zonelist(
-				node_zonelist(thisnid, GFP_HIGHUSER),
-				gfp_zone(GFP_HIGHUSER),
-				&pol->nodes);
-		polnid = zone_to_nid(z->zone);
-		break;
-
-	case MPOL_PREFERRED_MANY:
-		if (pol->flags & MPOL_F_MORON) {
-			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
-				goto out;
-			break;
-		}
-
 		/*
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
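As a rough illustration of the more conservative rule proposed in this revision, the sketch below models only the MPOL_F_MORON path: the executing node is chosen as the target only when it is in the policy nodemask. Everything here is hypothetical scaffolding (model_nodemask_t, MODEL_NODE(), model_misplaced_target()), a 64-bit mask stands in for nodemask_t, and the should_numa_migrate_memory() heuristics are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NUMA_NO_NODE	(-1)	/* mirrors the kernel's NUMA_NO_NODE */

/* Hypothetical stand-in for nodemask_t: bit n set means node n is allowed. */
typedef uint64_t model_nodemask_t;
#define MODEL_NODE(n)	((model_nodemask_t)1 << (n))

static bool model_node_isset(int node, model_nodemask_t mask)
{
	return ((mask >> node) & 1u) != 0;
}

/*
 * Returns the node the folio should move to, or MODEL_NUMA_NO_NODE if it
 * should stay where it is. Assumes MPOL_F_MORON is set on the policy.
 */
static int model_misplaced_target(int exec_node, int folio_node,
				  model_nodemask_t policy_nodes)
{
	/* Conservative rule: never migrate to a node outside the policy. */
	if (!model_node_isset(exec_node, policy_nodes))
		return MODEL_NUMA_NO_NODE;

	/* Already on the executing node: nothing to do. */
	if (folio_node == exec_node)
		return MODEL_NUMA_NO_NODE;

	return exec_node;
}
```

Under this model the earlier "Scenario 3" case (executing node N0, policy {N1}, folio on N6) no longer migrates, while promotion from N6 to N0 with policy {N0, N1} still does.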
Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes: >> >>> "Huang, Ying" <ying.huang@intel.com> writes: >>> >>>> Donet Tom <donettom@linux.ibm.com> writes: >>>> >>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound >>>>> nodes") added support for migrate on protnone reference with MPOL_BIND >>>>> memory policy. This allowed numa fault migration when the executing node >>>>> is part of the policy mask for MPOL_BIND. This patch extends migration >>>>> support to MPOL_PREFERRED_MANY policy. >>>>> >>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag >>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use >>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier, >>>>> the kernel should not allocate pages from the slower memory tier via >>>>> allocation control zonelist fallback. Instead, we should move cold pages >>>>> from the faster memory node via memory demotion. For a page allocation, >>>>> kswapd is only woken up after we try to allocate pages from all nodes in >>>>> the allocation zone list. This implies that, without using memory >>>>> policies, we will end up allocating hot pages in the slower memory tier. >>>>> >>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add >>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better >>>>> allocation control when we have memory tiers in the system. With >>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only >>>>> of faster memory nodes. When we fail to allocate pages from the faster >>>>> memory node, kswapd would be woken up, allowing demotion of cold pages >>>>> to slower memory nodes. >>>>> >>>>> With the current kernel, such usage of memory policies implies we can't >>>>> do page promotion from a slower memory tier to a faster memory tier >>>>> using numa fault. 
This patch fixes this issue. >>>>> >>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node >>>>> mask, we allow numa migration to the executing nodes. If the executing >>>>> node is not in the policy node mask but the folio is already allocated >>>>> based on policy preference (the folio node is in the policy node mask), >>>>> we don't allow numa migration. If both the executing node and folio node >>>>> are outside the policy node mask, we allow numa migration to the >>>>> executing nodes. >>>>> >>>>> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> >>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>>> --- >>>>> mm/mempolicy.c | 28 ++++++++++++++++++++++++++-- >>>>> 1 file changed, 26 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >>>>> index 73d698e21dae..8c4c92b10371 100644 >>>>> --- a/mm/mempolicy.c >>>>> +++ b/mm/mempolicy.c >>>>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) >>>>> if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) >>>>> return -EINVAL; >>>>> if (*flags & MPOL_F_NUMA_BALANCING) { >>>>> - if (*mode != MPOL_BIND) >>>>> + if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY) >>>>> + *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>>> + else >>>>> return -EINVAL; >>>>> - *flags |= (MPOL_F_MOF | MPOL_F_MORON); >>>>> } >>>>> return 0; >>>>> } >>>>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n) >>>>> kmem_cache_free(sn_cache, n); >>>>> } >>>>> >>>>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, >>>>> + struct mempolicy *pol) >>>>> +{ >>>>> + /* if the executing node is in the policy node mask, migrate */ >>>>> + if (node_isset(exec_node, pol->nodes)) >>>>> + return true; >>>>> + >>>>> + /* If the folio node is in policy node mask, don't migrate */ >>>>> + if (node_isset(folio_node, pol->nodes)) >>>>> + return false; >>>>> + /* >>>>> + * both the folio 
node and executing node are outside the policy nodemask, >>>>> + * migrate as normal numa fault migration. >>>>> + */ >>>>> + return true; >>>> >>>> Why? This may cause some unexpected result. For example, pages may be >>>> distributed among multiple sockets unexpectedly. So, I prefer the more >>>> conservative policy, that is, only migrate if this node is in >>>> pol->nodes. >>>> >>> >>> This will only have an impact if the user specifies >>> MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting >>> for frequently accessed memory pages to be migrated. Memory policy >>> MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of >>> policy->nodes. For the specific use case that I am interested in, it >>> should be okay to restrict it to policy->nodes. However, I am wondering >>> if this is too restrictive given the definition of MPOL_PREFERRED_MANY. >> >> IMHO, we can start with some consecutive way and expand it if it's >> proved necessary. >> > > Is this good? > > 1 file changed, 14 insertions(+), 34 deletions(-) > mm/mempolicy.c | 48 ++++++++++++++---------------------------------- > > modified mm/mempolicy.c > @@ -2464,23 +2464,6 @@ static void sp_free(struct sp_node *n) > kmem_cache_free(sn_cache, n); > } > > -static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node, > - struct mempolicy *pol) > -{ > - /* if the executing node is in the policy node mask, migrate */ > - if (node_isset(exec_node, pol->nodes)) > - return true; > - > - /* If the folio node is in policy node mask, don't migrate */ > - if (node_isset(folio_node, pol->nodes)) > - return false; > - /* > - * both the folio node and executing node are outside the policy nodemask, > - * migrate as normal numa fault migration. 
> - */ > - return true; > -} > - > /** > * mpol_misplaced - check whether current folio node is valid in policy > * > @@ -2533,29 +2516,26 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf, > break; > > case MPOL_BIND: > - /* Optimize placement among multiple nodes via NUMA balancing */ > + case MPOL_PREFERRED_MANY: > + /* > + * Even though MPOL_PREFERRED_MANY can allocate pages outside > + * policy nodemask we don't allow numa migration to nodes > + * outside policy nodemask for now. This is done so that if we > + * want demotion to slow memory to happen, before allocating > + * from some DRAM node say 'x', we will end up using a > + * MPOL_PREFERRED_MANY mask excluding node 'x'. In such scenario > + * we should not promote to node 'x' from slow memory node. > + */ > if (pol->flags & MPOL_F_MORON) { > + /* > + * Optimize placement among multiple nodes > + * via NUMA balancing > + */ > if (node_isset(thisnid, pol->nodes)) > break; > goto out; > } > > - if (node_isset(curnid, pol->nodes)) > - goto out; > - z = first_zones_zonelist( > - node_zonelist(thisnid, GFP_HIGHUSER), > - gfp_zone(GFP_HIGHUSER), > - &pol->nodes); > - polnid = zone_to_nid(z->zone); > - break; IMO, the above deletion should be put in another patch? -- Best Regards, Huang, Ying > - > - case MPOL_PREFERRED_MANY: > - if (pol->flags & MPOL_F_MORON) { > - if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol)) > - goto out; > - break; > - } > - > /* > * use current page if in policy nodemask, > * else select nearest allowed node, if any. > > [back] > .
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 73d698e21dae..8c4c92b10371 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
 	if (*flags & MPOL_F_NUMA_BALANCING) {
-		if (*mode != MPOL_BIND)
+		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
+			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
+		else
 			return -EINVAL;
-		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
 	}
 	return 0;
 }
@@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
+						      struct mempolicy *pol)
+{
+	/* if the executing node is in the policy node mask, migrate */
+	if (node_isset(exec_node, pol->nodes))
+		return true;
+
+	/* If the folio node is in policy node mask, don't migrate */
+	if (node_isset(folio_node, pol->nodes))
+		return false;
+	/*
+	 * both the folio node and executing node are outside the policy nodemask,
+	 * migrate as normal numa fault migration.
+	 */
+	return true;
+}
+
 /**
  * mpol_misplaced - check whether current folio node is valid in policy
  *
@@ -2526,6 +2544,12 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		break;
 
 	case MPOL_PREFERRED_MANY:
+		if (pol->flags & MPOL_F_MORON) {
+			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
+				goto out;
+			break;
+		}
+
 		/*
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
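The three-way decision in mpol_preferred_should_numa_migrate() from this version of the patch can be tried out in userspace with the sketch below. It is only a model under stated assumptions: a 64-bit bitmask replaces nodemask_t, and model_nodemask_t, MODEL_NODE() and the model_* functions are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is in the mask. */
typedef uint64_t model_nodemask_t;
#define MODEL_NODE(n)	((model_nodemask_t)1 << (n))

static bool model_node_isset(int node, model_nodemask_t mask)
{
	return ((mask >> node) & 1u) != 0;
}

/* Userspace model of the helper added by this version of the patch. */
static bool model_preferred_should_numa_migrate(int exec_node, int folio_node,
						model_nodemask_t policy_nodes)
{
	/* Executing node is in the policy node mask: migrate. */
	if (model_node_isset(exec_node, policy_nodes))
		return true;

	/* Folio already sits on a preferred node: leave it there. */
	if (model_node_isset(folio_node, policy_nodes))
		return false;

	/* Both nodes are outside the mask: plain NUMA fault migration. */
	return true;
}
```

With N0 as the executing node, a policy of {N0, N1, N6} and a folio on N6 gives true (the reported Scenario 1), a policy of {N1, N6} and a folio on N1 gives false (Scenario 2), and a policy of {N1} and a folio on N6 gives true (Scenario 3).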