Message ID | a71a478ce404e93683023dbb7248dd95f11554f4.1699872019.git.baolin.wang@linux.alibaba.com |
---|---|
State | New |
Headers |
From: Baolin Wang <baolin.wang@linux.alibaba.com> To: akpm@linux-foundation.org Cc: david@redhat.com, ying.huang@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org, baolin.wang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH] mm: support large folio numa balancing Date: Mon, 13 Nov 2023 18:45:31 +0800 Message-Id: <a71a478ce404e93683023dbb7248dd95f11554f4.1699872019.git.baolin.wang@linux.alibaba.com> |
Series | [RFC] mm: support large folio numa balancing |
Commit Message
Baolin Wang
Nov. 13, 2023, 10:45 a.m. UTC
Currently, file pages already support large folios, and support for
anonymous pages is also under discussion[1]. Moreover, the NUMA balancing
code has been converted to use folios by a previous series[2], and
migrate_pages() already supports large folio migration.

So I no longer see any reason to keep restricting NUMA balancing for
large folios.
[1] https://lkml.org/lkml/2023/9/29/342
[2] https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/memory.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
Comments
On 13.11.23 11:45, Baolin Wang wrote: > Currently, the file pages already support large folio, and supporting for > anonymous pages is also under discussion[1]. Moreover, the numa balancing > code are converted to use a folio by previous thread[2], and the migrate_pages > function also already supports the large folio migration. > > So now I did not see any reason to continue restricting NUMA balancing for > large folio. I recall John wanted to look into that. CCing him. I'll note that the "head page mapcount" heuristic to detect sharers will now strike on the PTE path and make us believe that a large folios is exclusive, although it isn't. As spelled out in the commit you are referencing: commit 6695cf68b15c215d33b8add64c33e01e3cbe236c Author: Kefeng Wang <wangkefeng.wang@huawei.com> Date: Thu Sep 21 15:44:14 2023 +0800 mm: memory: use a folio in do_numa_page() Numa balancing only try to migrate non-compound page in do_numa_page(), use a folio in it to save several compound_head calls, note we use folio_estimated_sharers(), it is enough to check the folio sharers since only normal page is handled, if large folio numa balancing is supported, a precise folio sharers check would be used, no functional change intended. I'll send WIP patches for one approach that can improve the situation soonish.
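For reference, the "head page mapcount" heuristic mentioned here boils down to estimating sharers from the folio's first page only. Below is a simplified sketch of folio_estimated_sharers(); it approximates the mm/internal.h helper of that kernel generation and is meant purely as an illustration, not the exact upstream definition:

```c
/*
 * Simplified sketch of folio_estimated_sharers(): the sharer estimate only
 * looks at the mapcount of the folio's first (head) page. A large folio
 * whose first subpage happens to be unmapped can therefore be reported as
 * "exclusive" even though other subpages are still mapped by other processes.
 */
static inline int folio_estimated_sharers(struct folio *folio)
{
	return page_mapcount(folio_page(folio, 0));
}
```

This is why the estimate only distinguishes "maybe exclusive" from "certainly shared", as discussed further down the thread.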
On 2023/11/13 18:53, David Hildenbrand wrote: > On 13.11.23 11:45, Baolin Wang wrote: >> Currently, the file pages already support large folio, and supporting for >> anonymous pages is also under discussion[1]. Moreover, the numa balancing >> code are converted to use a folio by previous thread[2], and the >> migrate_pages >> function also already supports the large folio migration. >> >> So now I did not see any reason to continue restricting NUMA balancing >> for >> large folio. > > I recall John wanted to look into that. CCing him. > > I'll note that the "head page mapcount" heuristic to detect sharers will > now strike on the PTE path and make us believe that a large folios is > exclusive, although it isn't. > > As spelled out in the commit you are referencing: > > commit 6695cf68b15c215d33b8add64c33e01e3cbe236c > Author: Kefeng Wang <wangkefeng.wang@huawei.com> > Date: Thu Sep 21 15:44:14 2023 +0800 > > mm: memory: use a folio in do_numa_page() > Numa balancing only try to migrate non-compound page in > do_numa_page(), > use a folio in it to save several compound_head calls, note we use > folio_estimated_sharers(), it is enough to check the folio sharers > since > only normal page is handled, if large folio numa balancing is > supported, a > precise folio sharers check would be used, no functional change > intended. > > > I'll send WIP patches for one approach that can improve the situation > soonish. When convert numa balance to use folio, I make similar change, it works with large anon folio(test with v5), but David's precise folio sharers should be merged firstly, also if a large folio shared by many process, we maybe split it, don't sure about it, this need some evaluation.
On 11/13/2023 6:53 PM, David Hildenbrand wrote: > On 13.11.23 11:45, Baolin Wang wrote: >> Currently, the file pages already support large folio, and supporting for >> anonymous pages is also under discussion[1]. Moreover, the numa balancing >> code are converted to use a folio by previous thread[2], and the >> migrate_pages >> function also already supports the large folio migration. >> >> So now I did not see any reason to continue restricting NUMA balancing >> for >> large folio. > > I recall John wanted to look into that. CCing him. > > I'll note that the "head page mapcount" heuristic to detect sharers will > now strike on the PTE path and make us believe that a large folios is > exclusive, although it isn't. > > As spelled out in the commit you are referencing: > > commit 6695cf68b15c215d33b8add64c33e01e3cbe236c > Author: Kefeng Wang <wangkefeng.wang@huawei.com> > Date: Thu Sep 21 15:44:14 2023 +0800 > > mm: memory: use a folio in do_numa_page() > Numa balancing only try to migrate non-compound page in > do_numa_page(), > use a folio in it to save several compound_head calls, note we use > folio_estimated_sharers(), it is enough to check the folio sharers > since > only normal page is handled, if large folio numa balancing is > supported, a > precise folio sharers check would be used, no functional change > intended. Thanks for pointing out the part I missed. I saw the migrate_pages() syscall is also using folio_estimated_sharers() to check if the folio is shared, and I wonder it will bring about any significant issues? > I'll send WIP patches for one approach that can improve the situation > soonish. Great. Look forward to seeing this:)
On 11/13/2023 8:10 PM, Kefeng Wang wrote: > > > On 2023/11/13 18:53, David Hildenbrand wrote: >> On 13.11.23 11:45, Baolin Wang wrote: >>> Currently, the file pages already support large folio, and supporting >>> for >>> anonymous pages is also under discussion[1]. Moreover, the numa >>> balancing >>> code are converted to use a folio by previous thread[2], and the >>> migrate_pages >>> function also already supports the large folio migration. >>> >>> So now I did not see any reason to continue restricting NUMA >>> balancing for >>> large folio. >> >> I recall John wanted to look into that. CCing him. >> >> I'll note that the "head page mapcount" heuristic to detect sharers will >> now strike on the PTE path and make us believe that a large folios is >> exclusive, although it isn't. >> >> As spelled out in the commit you are referencing: >> >> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c >> Author: Kefeng Wang <wangkefeng.wang@huawei.com> >> Date: Thu Sep 21 15:44:14 2023 +0800 >> >> mm: memory: use a folio in do_numa_page() >> Numa balancing only try to migrate non-compound page in >> do_numa_page(), >> use a folio in it to save several compound_head calls, note we use >> folio_estimated_sharers(), it is enough to check the folio >> sharers since >> only normal page is handled, if large folio numa balancing is >> supported, a >> precise folio sharers check would be used, no functional change >> intended. >> >> >> I'll send WIP patches for one approach that can improve the situation >> soonish. > > When convert numa balance to use folio, I make similar change, it works > with large anon folio(test with v5), but David's precise folio sharers > should be merged firstly, also if a large folio shared by many process, > we maybe split it, don't sure about it, this need some evaluation. IIUC, numa balancing will not split the large folio.
On 13.11.23 13:59, Baolin Wang wrote: > > > On 11/13/2023 6:53 PM, David Hildenbrand wrote: >> On 13.11.23 11:45, Baolin Wang wrote: >>> Currently, the file pages already support large folio, and supporting for >>> anonymous pages is also under discussion[1]. Moreover, the numa balancing >>> code are converted to use a folio by previous thread[2], and the >>> migrate_pages >>> function also already supports the large folio migration. >>> >>> So now I did not see any reason to continue restricting NUMA balancing >>> for >>> large folio. >> >> I recall John wanted to look into that. CCing him. >> >> I'll note that the "head page mapcount" heuristic to detect sharers will >> now strike on the PTE path and make us believe that a large folios is >> exclusive, although it isn't. >> >> As spelled out in the commit you are referencing: >> >> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c >> Author: Kefeng Wang <wangkefeng.wang@huawei.com> >> Date: Thu Sep 21 15:44:14 2023 +0800 >> >> mm: memory: use a folio in do_numa_page() >> Numa balancing only try to migrate non-compound page in >> do_numa_page(), >> use a folio in it to save several compound_head calls, note we use >> folio_estimated_sharers(), it is enough to check the folio sharers >> since >> only normal page is handled, if large folio numa balancing is >> supported, a >> precise folio sharers check would be used, no functional change >> intended. > > Thanks for pointing out the part I missed. > > I saw the migrate_pages() syscall is also using > folio_estimated_sharers() to check if the folio is shared, and I wonder > it will bring about any significant issues? It's now used all over the place, in some places for making manual decisions (e.g., MADV_PAGEOUT works although it shouldn't) and more and more automatic places (e.g., the system ends up migrating a folio although it shouldn't). The nasty thing about it is that it doesn't give you "certainly exclusive" vs. "maybe shared" but "maybe exclusive" vs. "certainly shared". IIUC, the side effect could be that we migrate folios because we assume they are exclusive even though they are actually shared. Right now, it's sufficient to not have the first page of the folio mapped anymore for that to happen. Anyhow, it's worth mentioning that in the commit message as long as we have no better solution for that. For many cases it might be just tolerable. > >> I'll send WIP patches for one approach that can improve the situation >> soonish. > > Great. Look forward to seeing this:) I'm still trying to evaluate the performance hit of the additional tracking ... turns out there is no such thing as free food ;)
On 11/13/23 5:01 AM, Baolin Wang wrote: > > > On 11/13/2023 8:10 PM, Kefeng Wang wrote: >> >> >> On 2023/11/13 18:53, David Hildenbrand wrote: >>> On 13.11.23 11:45, Baolin Wang wrote: >>>> Currently, the file pages already support large folio, and >>>> supporting for >>>> anonymous pages is also under discussion[1]. Moreover, the numa >>>> balancing >>>> code are converted to use a folio by previous thread[2], and the >>>> migrate_pages >>>> function also already supports the large folio migration. >>>> >>>> So now I did not see any reason to continue restricting NUMA >>>> balancing for >>>> large folio. >>> >>> I recall John wanted to look into that. CCing him. >>> >>> I'll note that the "head page mapcount" heuristic to detect sharers will >>> now strike on the PTE path and make us believe that a large folios is >>> exclusive, although it isn't. >>> >>> As spelled out in the commit you are referencing: >>> >>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c >>> Author: Kefeng Wang <wangkefeng.wang@huawei.com> >>> Date: Thu Sep 21 15:44:14 2023 +0800 >>> >>> mm: memory: use a folio in do_numa_page() >>> Numa balancing only try to migrate non-compound page in >>> do_numa_page(), >>> use a folio in it to save several compound_head calls, note we use >>> folio_estimated_sharers(), it is enough to check the folio >>> sharers since >>> only normal page is handled, if large folio numa balancing is >>> supported, a >>> precise folio sharers check would be used, no functional change >>> intended. >>> >>> >>> I'll send WIP patches for one approach that can improve the situation >>> soonish. To be honest, I'm still catching up on the approximate vs. exact sharers case. It wasn't clear to me why a precise sharers count is needed in order to do this. Perhaps the cost of making a wrong decision is considered just too high? >> >> When convert numa balance to use folio, I make similar change, it works >> with large anon folio(test with v5), but David's precise folio sharers >> should be merged firstly, also if a large folio shared by many process, >> we maybe split it, don't sure about it, this need some evaluation. > > IIUC, numa balancing will not split the large folio. That matches my reading of the code: normal (PMD-based) THPs are split, but the code does not yet split PTE-THPs (also known as small-size THPs). thanks,
David Hildenbrand <david@redhat.com> writes: > On 13.11.23 11:45, Baolin Wang wrote: >> Currently, the file pages already support large folio, and supporting for >> anonymous pages is also under discussion[1]. Moreover, the numa balancing >> code are converted to use a folio by previous thread[2], and the migrate_pages >> function also already supports the large folio migration. >> So now I did not see any reason to continue restricting NUMA >> balancing for >> large folio. > > I recall John wanted to look into that. CCing him. > > I'll note that the "head page mapcount" heuristic to detect sharers will > now strike on the PTE path and make us believe that a large folios is > exclusive, although it isn't. Even 4k folio may be shared by multiple processes/threads. So, numa balancing uses a multi-stage node selection algorithm (mostly implemented in should_numa_migrate_memory()) to identify shared folios. I think that the algorithm needs to be adjusted for PTE mapped large folio for shared folios. And, as a performance improvement patch, some performance data needs to be provided. And, the effect of shared folio detection needs to be tested too. -- Best Regards, Huang, Ying
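To make the "multi-stage node selection" mentioned above more concrete, the sketch below illustrates the shape of the filter in should_numa_migrate_memory(). The helper names marked as illustrative are placeholders, not the real kernel API, and the group-statistics step is only stubbed out:

```c
/*
 * Illustrative sketch of the two-stage filter in should_numa_migrate_memory().
 * Helpers marked "illustrative" are placeholders, not real kernel functions.
 */
static bool should_migrate(struct task_struct *p, struct folio *folio,
			   int dst_nid, int dst_cpu)
{
	/* Record who is faulting now and fetch the previous CPU/PID stamp. */
	int this_cpupid = make_cpupid(dst_cpu, task_pid_nr(current));	/* illustrative */
	int last_cpupid = folio_xchg_cpupid(folio, this_cpupid);	/* illustrative */

	/* Stage 1: only migrate after consecutive faults from the destination node. */
	if (cpupid_is_set(last_cpupid) && cpupid_nid(last_cpupid) != dst_nid)	/* illustrative */
		return false;

	/* Stage 2: a private fault (same task as last time) migrates eagerly... */
	if (cpupid_pid(last_cpupid) == task_pid_nr(p))	/* illustrative */
		return true;

	/* ...while shared faults fall back to the per-group fault statistics. */
	return group_fault_stats_favor(p, dst_nid);	/* illustrative */
}
```

The concern raised here is that for a PTE-mapped large folio, a single last-CPU/PID stamp and a single fault record now cover many base pages, so the shared-folio detection may need adjusting rather than being reused as-is.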
On 11/13/2023 10:49 PM, David Hildenbrand wrote: > On 13.11.23 13:59, Baolin Wang wrote: >> >> >> On 11/13/2023 6:53 PM, David Hildenbrand wrote: >>> On 13.11.23 11:45, Baolin Wang wrote: >>>> Currently, the file pages already support large folio, and >>>> supporting for >>>> anonymous pages is also under discussion[1]. Moreover, the numa >>>> balancing >>>> code are converted to use a folio by previous thread[2], and the >>>> migrate_pages >>>> function also already supports the large folio migration. >>>> >>>> So now I did not see any reason to continue restricting NUMA balancing >>>> for >>>> large folio. >>> >>> I recall John wanted to look into that. CCing him. >>> >>> I'll note that the "head page mapcount" heuristic to detect sharers will >>> now strike on the PTE path and make us believe that a large folios is >>> exclusive, although it isn't. >>> >>> As spelled out in the commit you are referencing: >>> >>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c >>> Author: Kefeng Wang <wangkefeng.wang@huawei.com> >>> Date: Thu Sep 21 15:44:14 2023 +0800 >>> >>> mm: memory: use a folio in do_numa_page() >>> Numa balancing only try to migrate non-compound page in >>> do_numa_page(), >>> use a folio in it to save several compound_head calls, note we use >>> folio_estimated_sharers(), it is enough to check the folio sharers >>> since >>> only normal page is handled, if large folio numa balancing is >>> supported, a >>> precise folio sharers check would be used, no functional change >>> intended. >> >> Thanks for pointing out the part I missed. >> >> I saw the migrate_pages() syscall is also using >> folio_estimated_sharers() to check if the folio is shared, and I wonder >> it will bring about any significant issues? > > It's now used all over the place, in some places for making manual > decisions (e.g., MADV_PAGEOUT works although it shouldn't) and more and > more automatic places (e.g., the system ends up migrating a folio > although it shouldn't). The nasty thing about it is that it doesn't give > you "certainly exclusive" vs. "maybe shared" but "maybe exclusive" vs. > "certainly shared". > > IIUC, the side effect could be that we migrate folios because we assume > they are exclusive even though they are actually shared. Right now, it's > sufficient to not have the first page of the folio mapped anymore for > that to happen. Yes. > Anyhow, it's worth mentioning that in the commit message as long as we > have no better solution for that. For many cases it might be just > tolerable. Agree. The 'maybe shared' folio may affect the numa group statistics, which is used to accumulate the numa faults in one group to choose a prefered node for the tasks. For this case, it may be tolerable too, but I have no performance numbers now. Let me think about it. >>> I'll send WIP patches for one approach that can improve the situation >>> soonish. >> >> Great. Look forward to seeing this:) > > I'm still trying to evaluate the performance hit of the additional > tracking ... turns out there is no such thing as free food ;) Make sense.
On 11/14/2023 9:12 AM, Huang, Ying wrote: > David Hildenbrand <david@redhat.com> writes: > >> On 13.11.23 11:45, Baolin Wang wrote: >>> Currently, the file pages already support large folio, and supporting for >>> anonymous pages is also under discussion[1]. Moreover, the numa balancing >>> code are converted to use a folio by previous thread[2], and the migrate_pages >>> function also already supports the large folio migration. >>> So now I did not see any reason to continue restricting NUMA >>> balancing for >>> large folio. >> >> I recall John wanted to look into that. CCing him. >> >> I'll note that the "head page mapcount" heuristic to detect sharers will >> now strike on the PTE path and make us believe that a large folios is >> exclusive, although it isn't. > > Even 4k folio may be shared by multiple processes/threads. So, numa > balancing uses a multi-stage node selection algorithm (mostly > implemented in should_numa_migrate_memory()) to identify shared folios. > I think that the algorithm needs to be adjusted for PTE mapped large > folio for shared folios. Not sure I get you here. In should_numa_migrate_memory(), it will use last CPU id, last PID and group numa faults to determine if this page can be migrated to the target node. So for large folio, a precise folio sharers check can make the numa faults of a group more accurate, which is enough for should_numa_migrate_memory() to make a decision? Could you provide a more detailed description of the algorithm you would like to change for large folio? Thanks. > And, as a performance improvement patch, some performance data needs to Do you have some benchmark recommendation? I know the the autonuma can not support large folio now. > be provided. And, the effect of shared folio detection needs to be > tested too
On 13.11.23 23:15, John Hubbard wrote: > On 11/13/23 5:01 AM, Baolin Wang wrote: >> >> >> On 11/13/2023 8:10 PM, Kefeng Wang wrote: >>> >>> >>> On 2023/11/13 18:53, David Hildenbrand wrote: >>>> On 13.11.23 11:45, Baolin Wang wrote: >>>>> Currently, the file pages already support large folio, and >>>>> supporting for >>>>> anonymous pages is also under discussion[1]. Moreover, the numa >>>>> balancing >>>>> code are converted to use a folio by previous thread[2], and the >>>>> migrate_pages >>>>> function also already supports the large folio migration. >>>>> >>>>> So now I did not see any reason to continue restricting NUMA >>>>> balancing for >>>>> large folio. >>>> >>>> I recall John wanted to look into that. CCing him. >>>> >>>> I'll note that the "head page mapcount" heuristic to detect sharers will >>>> now strike on the PTE path and make us believe that a large folios is >>>> exclusive, although it isn't. >>>> >>>> As spelled out in the commit you are referencing: >>>> >>>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c >>>> Author: Kefeng Wang <wangkefeng.wang@huawei.com> >>>> Date: Thu Sep 21 15:44:14 2023 +0800 >>>> >>>> mm: memory: use a folio in do_numa_page() >>>> Numa balancing only try to migrate non-compound page in >>>> do_numa_page(), >>>> use a folio in it to save several compound_head calls, note we use >>>> folio_estimated_sharers(), it is enough to check the folio >>>> sharers since >>>> only normal page is handled, if large folio numa balancing is >>>> supported, a >>>> precise folio sharers check would be used, no functional change >>>> intended. >>>> >>>> >>>> I'll send WIP patches for one approach that can improve the situation >>>> soonish. > > To be honest, I'm still catching up on the approximate vs. exact > sharers case. It wasn't clear to me why a precise sharers count > is needed in order to do this. Perhaps the cost of making a wrong > decision is considered just too high? Good question, I didn't really look into the impact for the NUMA hinting case where we might end up not setting TNF_SHARED although it is shared. For other folio_estimate_sharers() users it's more obvious. As a side note, it could have happened already in corner cases (e.g., concurrent page migration of a small folio). If precision as documented in that commit is really required remains to be seen -- just wanted to spell it out.
On 2023/11/14 19:35, David Hildenbrand wrote: > On 13.11.23 23:15, John Hubbard wrote: >> On 11/13/23 5:01 AM, Baolin Wang wrote: >>> >>> >>> On 11/13/2023 8:10 PM, Kefeng Wang wrote: >>>> >>>> >>>> On 2023/11/13 18:53, David Hildenbrand wrote: >>>>> On 13.11.23 11:45, Baolin Wang wrote: >>>>>> Currently, the file pages already support large folio, and >>>>>> supporting for >>>>>> anonymous pages is also under discussion[1]. Moreover, the numa >>>>>> balancing >>>>>> code are converted to use a folio by previous thread[2], and the >>>>>> migrate_pages >>>>>> function also already supports the large folio migration. >>>>>> >>>>>> So now I did not see any reason to continue restricting NUMA >>>>>> balancing for >>>>>> large folio. >>>>> >>>>> I recall John wanted to look into that. CCing him. >>>>> >>>>> I'll note that the "head page mapcount" heuristic to detect sharers >>>>> will >>>>> now strike on the PTE path and make us believe that a large folios is >>>>> exclusive, although it isn't. >>>>> >>>>> As spelled out in the commit you are referencing: >>>>> >>>>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c >>>>> Author: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>> Date: Thu Sep 21 15:44:14 2023 +0800 >>>>> >>>>> mm: memory: use a folio in do_numa_page() >>>>> Numa balancing only try to migrate non-compound page in >>>>> do_numa_page(), >>>>> use a folio in it to save several compound_head calls, note >>>>> we use >>>>> folio_estimated_sharers(), it is enough to check the folio >>>>> sharers since >>>>> only normal page is handled, if large folio numa balancing is >>>>> supported, a >>>>> precise folio sharers check would be used, no functional change >>>>> intended. >>>>> >>>>> >>>>> I'll send WIP patches for one approach that can improve the situation >>>>> soonish. >> >> To be honest, I'm still catching up on the approximate vs. exact >> sharers case. It wasn't clear to me why a precise sharers count >> is needed in order to do this. Perhaps the cost of making a wrong >> decision is considered just too high? > > Good question, I didn't really look into the impact for the NUMA hinting > case where we might end up not setting TNF_SHARED although it is shared. > For other folio_estimate_sharers() users it's more obvious. The task_numa_group() will check the TNF_SHARED, if processes share same page/folio, they will be packed into a single numa group, and the numa group fault statistic will be used in should_numa_migrate_memory() to decide whether to migrate or not, if not setting TNF_SHARED, maybe be lead to more page/folio migration. > > As a side note, it could have happened already in corner cases (e.g., > concurrent page migration of a small folio). > > If precision as documented in that commit is really required remains to > be seen -- just wanted to spell it out. >
Baolin Wang <baolin.wang@linux.alibaba.com> writes: > On 11/14/2023 9:12 AM, Huang, Ying wrote: >> David Hildenbrand <david@redhat.com> writes: >> >>> On 13.11.23 11:45, Baolin Wang wrote: >>>> Currently, the file pages already support large folio, and supporting for >>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing >>>> code are converted to use a folio by previous thread[2], and the migrate_pages >>>> function also already supports the large folio migration. >>>> So now I did not see any reason to continue restricting NUMA >>>> balancing for >>>> large folio. >>> >>> I recall John wanted to look into that. CCing him. >>> >>> I'll note that the "head page mapcount" heuristic to detect sharers will >>> now strike on the PTE path and make us believe that a large folios is >>> exclusive, although it isn't. >> Even 4k folio may be shared by multiple processes/threads. So, numa >> balancing uses a multi-stage node selection algorithm (mostly >> implemented in should_numa_migrate_memory()) to identify shared folios. >> I think that the algorithm needs to be adjusted for PTE mapped large >> folio for shared folios. > > Not sure I get you here. In should_numa_migrate_memory(), it will use > last CPU id, last PID and group numa faults to determine if this page > can be migrated to the target node. So for large folio, a precise > folio sharers check can make the numa faults of a group more accurate, > which is enough for should_numa_migrate_memory() to make a decision? A large folio that is mapped by multiple process may be accessed by one remote NUMA node, so we still want to migrate it. A large folio that is mapped by one process but accessed by multiple threads on multiple NUMA node may be not migrated. > Could you provide a more detailed description of the algorithm you > would like to change for large folio? Thanks. I haven't thought about that thoroughly. So, please evaluate the algorithm by yourself. For example, the 2 sub-pages of a shared PTE-mapped large folio may be accessed together by a task. This may cause the folio be migrated wrongly. One possible solution is to restore all other PTE mappings of the large folio in do_numa_page() as the first step. This resembles the PMD-mapped THP behavior. >> And, as a performance improvement patch, some performance data needs to > > Do you have some benchmark recommendation? I know the the autonuma can > not support large folio now. There are autonuma-benchmark, and specjbb is used by someone before. >> be provided. And, the effect of shared folio detection needs to be >> tested too -- Best Regards, Huang, Ying
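As a rough illustration of the "restore all other PTE mappings of the large folio in do_numa_page()" idea suggested above, a minimal, untested sketch might look like the following. The helper name is made up for illustration, and locking, TLB/MMU-cache maintenance, and write-permission restoration are deliberately omitted:

```c
/*
 * Untested illustration only: when one PTE of a PTE-mapped large folio takes
 * a NUMA hinting fault, also restore the sibling prot-none PTEs mapping the
 * same folio, so the folio is handled as one unit (similar in spirit to the
 * PMD-mapped THP case). The helper name is hypothetical; locking and TLB
 * maintenance are omitted.
 */
static void numa_restore_folio_ptes(struct vm_area_struct *vma, pte_t *ptep,
				    unsigned long addr, int nr_pages)
{
	int i;

	for (i = 0; i < nr_pages; i++, ptep++, addr += PAGE_SIZE) {
		pte_t old_pte = ptep_get(ptep);
		pte_t pte;

		if (!pte_protnone(old_pte))
			continue;

		/* Drop the prot-none protection and restore the VMA's protection. */
		old_pte = ptep_modify_prot_start(vma, addr, ptep);
		pte = pte_modify(old_pte, vma->vm_page_prot);
		pte = pte_mkyoung(pte);
		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
	}
}
```

Treating the folio as one unit this way resembles the PMD-mapped THP behavior, at the cost of losing per-subpage hinting granularity.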
On 13.11.23 11:45, Baolin Wang wrote:
> Currently, the file pages already support large folio, and supporting for
> anonymous pages is also under discussion[1]. Moreover, the numa balancing
> code are converted to use a folio by previous thread[2], and the migrate_pages
> function also already supports the large folio migration.
>
> So now I did not see any reason to continue restricting NUMA balancing for
> large folio.
>
> [1] https://lkml.org/lkml/2023/9/29/342
> [2] https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

I'll note that another piece is missing, and I'd be curious how you
tested your patch set or what I am missing. (no anonymous pages?)

change_pte_range() contains:

	if (prot_numa) {
		...
		/* Also skip shared copy-on-write pages */
		if (is_cow_mapping(vma->vm_flags) &&
		    folio_ref_count(folio) != 1)
			continue;

So we'll never end up mapping an anon PTE-mapped THP prot-none (well, unless a
single PTE remains) and consequently never trigger NUMA hinting faults.

Now, that change has some history [1], but the original problem has been
sorted out in the meantime. But we should consider Linus' original feedback.

For pte-mapped THP, we might want to do something like the following
(completely untested):

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 81991102f785..c4e6b9032e40 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,

 				/* Also skip shared copy-on-write pages */
 				if (is_cow_mapping(vma->vm_flags) &&
-				    folio_ref_count(folio) != 1)
+				    (folio_maybe_dma_pinned(folio) ||
+				     folio_estimated_sharers(folio) != 1))
 					continue;

Another note about the possible imprecision that might or might not be
tolerable ;)

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215616
On 15.11.23 11:46, David Hildenbrand wrote: > On 13.11.23 11:45, Baolin Wang wrote: >> Currently, the file pages already support large folio, and supporting for >> anonymous pages is also under discussion[1]. Moreover, the numa balancing >> code are converted to use a folio by previous thread[2], and the migrate_pages >> function also already supports the large folio migration. >> >> So now I did not see any reason to continue restricting NUMA balancing for >> large folio. >> >> [1] https://lkml.org/lkml/2023/9/29/342 >> [2] https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >> --- > > I'll note that another piece is missing, and I'd be curious how you > tested your patch set or what I am missing. (no anonymous pages?) > > change_pte_range() contains: > > if (prot_numa) { > ... > /* Also skip shared copy-on-write pages */ > if (is_cow_mapping(vma->vm_flags) && > folio_ref_count(folio) != 1) > continue; > > So we'll never end up mapping an anon PTE-mapped THP prot-none (well, unless a > single PTE remains) and consequently never trigger NUMA hinting faults. > > Now, that change has some history [1], but the original problem has been > sorted out in the meantime. But we should consider Linus' original feedback. > > For pte-mapped THP, we might want to do something like the following > (completely untested): > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 81991102f785..c4e6b9032e40 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb, > > /* Also skip shared copy-on-write pages */ > if (is_cow_mapping(vma->vm_flags) && > - folio_ref_count(folio) != 1) > + (folio_maybe_dma_pinned(folio) || > + folio_estimated_sharers(folio) != 1)) Actually, > 1 might be better if the first subpage is not mapped; it's a mess.
On Wed, Nov 15, 2023 at 10:58:32AM +0800, Huang, Ying wrote: > Baolin Wang <baolin.wang@linux.alibaba.com> writes: > > > On 11/14/2023 9:12 AM, Huang, Ying wrote: > >> David Hildenbrand <david@redhat.com> writes: > >> > >>> On 13.11.23 11:45, Baolin Wang wrote: > >>>> Currently, the file pages already support large folio, and supporting for > >>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing > >>>> code are converted to use a folio by previous thread[2], and the migrate_pages > >>>> function also already supports the large folio migration. > >>>> So now I did not see any reason to continue restricting NUMA > >>>> balancing for > >>>> large folio. > >>> > >>> I recall John wanted to look into that. CCing him. > >>> > >>> I'll note that the "head page mapcount" heuristic to detect sharers will > >>> now strike on the PTE path and make us believe that a large folios is > >>> exclusive, although it isn't. > >> Even 4k folio may be shared by multiple processes/threads. So, numa > >> balancing uses a multi-stage node selection algorithm (mostly > >> implemented in should_numa_migrate_memory()) to identify shared folios. > >> I think that the algorithm needs to be adjusted for PTE mapped large > >> folio for shared folios. > > > > Not sure I get you here. In should_numa_migrate_memory(), it will use > > last CPU id, last PID and group numa faults to determine if this page > > can be migrated to the target node. So for large folio, a precise > > folio sharers check can make the numa faults of a group more accurate, > > which is enough for should_numa_migrate_memory() to make a decision? > > A large folio that is mapped by multiple process may be accessed by one > remote NUMA node, so we still want to migrate it. A large folio that is > mapped by one process but accessed by multiple threads on multiple NUMA > node may be not migrated. > This leads into a generic problem with large anything with NUMA balancing -- false sharing. As it stands, THP can be false shared by threads if thread-local data is split within a THP range. In this case, the ideal would be the THP is migrated to the hottest node but such support doesn't exist. The same applies for folios. If not handled properly, a large folio of any type can ping-pong between nodes so just migrating because we can is not necessarily a good idea. The patch should cover a realistic case why this matters, why splitting the folio is not better and supporting data.
On Fri, Nov 17, 2023 at 10:07:45AM +0000, Mel Gorman wrote: > This leads into a generic problem with large anything with NUMA > balancing -- false sharing. As it stands, THP can be false shared by > threads if thread-local data is split within a THP range. In this case, > the ideal would be the THP is migrated to the hottest node but such > support doesn't exist. The same applies for folios. If not handled > properly, a large folio of any type can ping-pong between nodes so just > migrating because we can is not necessarily a good idea. The patch > should cover a realistic case why this matters, why splitting the folio > is not better and supporting data. Would it make sense to have THP merging conditional on all (most?) pages having the same node?
On Fri, Nov 17, 2023 at 11:13:43AM +0100, Peter Zijlstra wrote: > On Fri, Nov 17, 2023 at 10:07:45AM +0000, Mel Gorman wrote: > > > This leads into a generic problem with large anything with NUMA > > balancing -- false sharing. As it stands, THP can be false shared by > > threads if thread-local data is split within a THP range. In this case, > > the ideal would be the THP is migrated to the hottest node but such > > support doesn't exist. The same applies for folios. If not handled > > properly, a large folio of any type can ping-pong between nodes so just > > migrating because we can is not necessarily a good idea. The patch > > should cover a realistic case why this matters, why splitting the folio > > is not better and supporting data. > > Would it make sense to have THP merging conditional on all (most?) pages > having the same node? Potentially yes, maybe with something similar to max_ptes_none, but it has corner cases of it's own. THP can be allocated up-front so we don't get the per-base-page hints unless the page is first split. I experimented with this once upon a time but cost post-splitting was not offset by the smarter NUMA placement. While we could always allocate small pages and promote later (originally known as the promotion threshold), that was known to have significant penalties of it's own so we still eagerly allocate THP. Part of that is that KVM was the main load to benefit from THP and always preferred eager promotion. Even if we always started with base pages, sparse addressing within the THP range may mean the threshold for collapsing can never be reached. Both THP and folios have the same false sharing problem but at least we knew about the false sharing problem for THP and NUMA balancing. It was found initially that THP false sharing is mostly an edge-case issue mitigated by the fact that large anonymous buffers tended to be either 2M aligned or only affected the boundaries. Later glibc and ABI changes made it even more likely that private buffers were THP-aligned. The same is not true of folios and it is a new problem so I'm uncomfortable with a patch that essentially says "migrate folios because it's possible" without considering any of the corner cases or measuring them.
On 11/15/2023 6:47 PM, David Hildenbrand wrote: > On 15.11.23 11:46, David Hildenbrand wrote: >> On 13.11.23 11:45, Baolin Wang wrote: >>> Currently, the file pages already support large folio, and supporting >>> for >>> anonymous pages is also under discussion[1]. Moreover, the numa >>> balancing >>> code are converted to use a folio by previous thread[2], and the >>> migrate_pages >>> function also already supports the large folio migration. >>> >>> So now I did not see any reason to continue restricting NUMA >>> balancing for >>> large folio. >>> >>> [1] https://lkml.org/lkml/2023/9/29/342 >>> [2] >>> https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e >>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>> --- >> >> I'll note that another piece is missing, and I'd be curious how you >> tested your patch set or what I am missing. (no anonymous pages?) I tested it with file large folio (order = 4) created by XFS filesystem. >> change_pte_range() contains: >> >> if (prot_numa) { >> ... >> /* Also skip shared copy-on-write pages */ >> if (is_cow_mapping(vma->vm_flags) && >> folio_ref_count(folio) != 1) >> continue; >> >> So we'll never end up mapping an anon PTE-mapped THP prot-none (well, >> unless a >> single PTE remains) and consequently never trigger NUMA hinting faults. >> >> Now, that change has some history [1], but the original problem has been >> sorted out in the meantime. But we should consider Linus' original >> feedback. >> >> For pte-mapped THP, we might want to do something like the following >> (completely untested): Thanks for pointing out. I have not tried pte-mapped THP yet, and will look at it in detail. >> diff --git a/mm/mprotect.c b/mm/mprotect.c >> index 81991102f785..c4e6b9032e40 100644 >> --- a/mm/mprotect.c >> +++ b/mm/mprotect.c >> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb, >> /* Also skip shared copy-on-write >> pages */ >> if (is_cow_mapping(vma->vm_flags) && >> - folio_ref_count(folio) != 1) >> + (folio_maybe_dma_pinned(folio) || >> + folio_estimated_sharers(folio) != >> 1)) > > Actually, > 1 might be better if the first subpage is not mapped; it's a > mess. >
On 11/17/2023 6:07 PM, Mel Gorman wrote: > On Wed, Nov 15, 2023 at 10:58:32AM +0800, Huang, Ying wrote: >> Baolin Wang <baolin.wang@linux.alibaba.com> writes: >> >>> On 11/14/2023 9:12 AM, Huang, Ying wrote: >>>> David Hildenbrand <david@redhat.com> writes: >>>> >>>>> On 13.11.23 11:45, Baolin Wang wrote: >>>>>> Currently, the file pages already support large folio, and supporting for >>>>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing >>>>>> code are converted to use a folio by previous thread[2], and the migrate_pages >>>>>> function also already supports the large folio migration. >>>>>> So now I did not see any reason to continue restricting NUMA >>>>>> balancing for >>>>>> large folio. >>>>> >>>>> I recall John wanted to look into that. CCing him. >>>>> >>>>> I'll note that the "head page mapcount" heuristic to detect sharers will >>>>> now strike on the PTE path and make us believe that a large folios is >>>>> exclusive, although it isn't. >>>> Even 4k folio may be shared by multiple processes/threads. So, numa >>>> balancing uses a multi-stage node selection algorithm (mostly >>>> implemented in should_numa_migrate_memory()) to identify shared folios. >>>> I think that the algorithm needs to be adjusted for PTE mapped large >>>> folio for shared folios. >>> >>> Not sure I get you here. In should_numa_migrate_memory(), it will use >>> last CPU id, last PID and group numa faults to determine if this page >>> can be migrated to the target node. So for large folio, a precise >>> folio sharers check can make the numa faults of a group more accurate, >>> which is enough for should_numa_migrate_memory() to make a decision? >> >> A large folio that is mapped by multiple process may be accessed by one >> remote NUMA node, so we still want to migrate it. A large folio that is >> mapped by one process but accessed by multiple threads on multiple NUMA >> node may be not migrated. >> > > This leads into a generic problem with large anything with NUMA > balancing -- false sharing. As it stands, THP can be false shared by > threads if thread-local data is split within a THP range. In this case, > the ideal would be the THP is migrated to the hottest node but such > support doesn't exist. The same applies for folios. If not handled So below check in should_numa_migrate_memory() can not avoid the false sharing of large folio you mentioned? Please correct me if I missed anything. /* * Destination node is much more heavily used than the source * node? Allow migration. */ if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) * ACTIVE_NODE_FRACTION) return true; /* * Distribute memory according to CPU & memory use on each node, * with 3/4 hysteresis to avoid unnecessary memory migrations: * * faults_cpu(dst) 3 faults_cpu(src) * --------------- * - > --------------- * faults_mem(dst) 4 faults_mem(src) */ return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 > group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; > properly, a large folio of any type can ping-pong between nodes so just > migrating because we can is not necessarily a good idea. The patch > should cover a realistic case why this matters, why splitting the folio > is not better and supporting data. Sure. For a private mapping, we should always migrate the large folio. The tricky part is the shared mapping as you and Ying said, which can have different scenarios, and I'm thinking about how to validate it. Do you have any suggestion? Thanks.
diff --git a/mm/memory.c b/mm/memory.c
index c32954e16b28..8ca21eff294c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4804,7 +4804,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
-	int flags = 0;
+	int flags = 0, nr_pages = 0;
 
 	/*
 	 * The "pte" at this point cannot be used safely without
@@ -4834,10 +4834,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (!folio || folio_is_zone_device(folio))
 		goto out_map;
 
-	/* TODO: handle PTE-mapped THP */
-	if (folio_test_large(folio))
-		goto out_map;
-
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
 	 * much anyway since they can be in shared cache state. This misses
@@ -4857,6 +4853,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_SHARED;
 
 	nid = folio_nid(folio);
+	nr_pages = folio_nr_pages(folio);
 	/*
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time. So use default value.
@@ -4893,7 +4890,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 out:
 	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, 1, flags);
+		task_numa_fault(last_cpupid, nid, nr_pages, flags);
 	return 0;
 out_map:
 	/*