Message ID | 20230105101844.1893104-22-jthoughton@google.com |
---|---|
State | New |
Headers | Subject: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range; Date: Thu, 5 Jan 2023 10:18:19 +0000; From: James Houghton <jthoughton@google.com>; To: Mike Kravetz <mike.kravetz@oracle.com>, Muchun Song <songmuchun@bytedance.com>, Peter Xu <peterx@redhat.com>; Cc: David Hildenbrand <david@redhat.com>, David Rientjes <rientjes@google.com>, Axel Rasmussen <axelrasmussen@google.com>, Mina Almasry <almasrymina@google.com>, "Zach O'Keefe" <zokeefe@google.com>, Manish Mishra <manish.mishra@nutanix.com>, Naoya Horiguchi <naoya.horiguchi@nec.com>, "Dr . David Alan Gilbert" <dgilbert@redhat.com>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, Baolin Wang <baolin.wang@linux.alibaba.com>, Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>, Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton <jthoughton@google.com>; In-Reply-To: <20230105101844.1893104-1-jthoughton@google.com> |
Series | Based on latest mm-unstable (85b44c25cd1e). |
Commit Message
James Houghton
Jan. 5, 2023, 10:18 a.m. UTC
The main change in this commit is to update walk_hugetlb_range to support
walking HGM mappings; all walk_hugetlb_range callers must also be updated
to use the new API and take the correct action.
Listing all the changes to the callers:
For the s390 changes, we simply ignore HGM PTEs (we don't support s390 yet).
For smaps, shared_hugetlb (and private_hugetlb, although private
mappings don't support HGM) may no longer be divisible by the hugepage
size. The appropriate changes have been made to support analyzing HGM
PTEs.
For pagemap, we ignore non-leaf PTEs by treating them as if they were
none PTEs. We can only end up with non-leaf PTEs if they had just been
updated from a none PTE.
For show_numa_map, the challenge is that, if any part of a hugepage is
mapped, we have to count that entire page exactly once, as the results
are given in units of hugepages. To support HGM mappings, we keep track
of the last page that we looked at. If the hugepage we are currently
looking at is the same as the last one, then it must be mapped at high
granularity and we have already accounted for it (a short sketch of this
bookkeeping follows the list below).
For DAMON, we treat non-leaf PTEs as if they were blank, for the same
reason as pagemap.
For hwpoison, we proactively update the logic to support the case when
hpte is pointing to a subpage within the poisoned hugepage.
For queue_pages_hugetlb/migration, we ignore all HGM-enabled VMAs for
now.
For mincore, we ignore non-leaf PTEs for the same reason as pagemap.
For mprotect/prot_none_hugetlb_entry, we retry the walk when we get a
non-leaf PTE.
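To illustrate the show_numa_map bookkeeping described above, here is a
minimal userspace model. The names are hypothetical; the real callback
tracks the last struct page seen while walking hugetlb_pte entries.

    #include <stdio.h>

    /*
     * Toy model of the show_numa_map accounting: each walk step reports the
     * head page (here just an id) that the current hugetlb_pte maps.  A
     * hugepage mapped at high granularity shows up many times in a row but
     * must still be counted exactly once.
     */
    static long count_hugepages(const long *head_ids, int n)
    {
            long count = 0;
            long last = -1;                 /* no page seen yet */
            int i;

            for (i = 0; i < n; i++) {
                    if (head_ids[i] == last)
                            continue;       /* same hugepage as last time */
                    last = head_ids[i];
                    count++;
            }
            return count;
    }

    int main(void)
    {
            /* one 1G page mapped by four small PTEs, then one mapped normally */
            long walk[] = { 100, 100, 100, 100, 200 };

            printf("%ld hugepages\n", count_hugepages(walk, 5));    /* prints 2 */
            return 0;
    }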
Signed-off-by: James Houghton <jthoughton@google.com>
---
arch/s390/mm/gmap.c | 20 ++++++++--
fs/proc/task_mmu.c | 83 +++++++++++++++++++++++++++++-----------
include/linux/pagewalk.h | 10 +++--
mm/damon/vaddr.c | 42 +++++++++++++-------
mm/hmm.c | 20 +++++++---
mm/memory-failure.c | 17 ++++----
mm/mempolicy.c | 12 ++++--
mm/mincore.c | 17 ++++++--
mm/mprotect.c | 18 ++++++---
mm/pagewalk.c | 20 +++++-----
10 files changed, 180 insertions(+), 79 deletions(-)
Comments
On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> -static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
> +static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
> +				struct mm_struct *mm,
>  				struct vm_area_struct *vma, unsigned long addr)
>  {
>  	bool referenced = false;
> -	pte_t entry = huge_ptep_get(pte);
> +	pte_t entry = huge_ptep_get(hpte->ptep);

My compiler throws me:

mm/damon/vaddr.c: In function ‘damon_hugetlb_mkold’:
mm/damon/vaddr.c:338:15: error: ‘entry’ redeclared as different kind of symbol
  338 |	pte_t entry = huge_ptep_get(hpte->ptep);
      |	      ^~~~~

I guess this line can just be dropped.

>  	struct folio *folio = pfn_folio(pte_pfn(entry));
>
>  	folio_get(folio);
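For reference, the suggested fix is simply to drop the now-duplicate
local, since 'entry' already arrives as a parameter. The start of the
function would then look roughly like this (a sketch reconstructed from
the hunk quoted above, not an actual resend):

    static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
                                    struct mm_struct *mm,
                                    struct vm_area_struct *vma, unsigned long addr)
    {
            bool referenced = false;
            /* no local 'pte_t entry' here anymore: it is now a parameter */
            struct folio *folio = pfn_folio(pte_pfn(entry));

            folio_get(folio);
            ...
    }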
James, On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote: > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, > int mapcount = page_mapcount(page); > > if (mapcount >= 2) > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); > + mss->shared_hugetlb += hugetlb_pte_size(hpte); > else > - mss->private_hugetlb += huge_page_size(hstate_vma(vma)); > + mss->private_hugetlb += hugetlb_pte_size(hpte); > } > return 0; One thing interesting I found with hgm right now is mostly everything will be counted as "shared" here, I think it's because mapcount is accounted always to the huge page even if mapped in smaller sizes, so page_mapcount() to a small page should be huge too because the head page mapcount should be huge. I'm curious the reasons behind the mapcount decision. For example, would that risk overflow with head_compound_mapcount? One 1G page mapping all 4K takes 0.25M counts, while the limit should be 2G for atomic_t. Looks like it's possible. Btw, are the small page* pointers still needed in the latest HGM design? Is there code taking care of disabling of hugetlb vmemmap optimization for HGM? Or maybe it's not needed anymore for the current design? Thanks,
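The arithmetic behind that overflow concern, as a quick standalone check
(assuming a 4K base page size and a signed 32-bit atomic_t, as in the
mail above):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            long ptes_per_1g = (1L << 30) / (1L << 12); /* 4K mappings in a 1G page */

            /* ~0.25M counts per fully-split 1G mapping */
            printf("PTE mappings per fully-split 1G page: %ld\n", ptes_per_1g);
            /* how many such mappings push a 2^31-1 counter over */
            printf("fully-split mappings needed to overflow: %ld\n",
                   INT_MAX / ptes_per_1g + 1);
            return 0;
    }

This matches the 8192-mapping figure discussed later in the thread.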
On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote: > > James, > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote: > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, > > int mapcount = page_mapcount(page); > > > > if (mapcount >= 2) > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); > > + mss->shared_hugetlb += hugetlb_pte_size(hpte); > > else > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma)); > > + mss->private_hugetlb += hugetlb_pte_size(hpte); > > } > > return 0; > > One thing interesting I found with hgm right now is mostly everything will > be counted as "shared" here, I think it's because mapcount is accounted > always to the huge page even if mapped in smaller sizes, so page_mapcount() > to a small page should be huge too because the head page mapcount should be > huge. I'm curious the reasons behind the mapcount decision. > > For example, would that risk overflow with head_compound_mapcount? One 1G > page mapping all 4K takes 0.25M counts, while the limit should be 2G for > atomic_t. Looks like it's possible. The original mapcount approach was "if the hstate-level PTE is present, increment the compound_mapcount by 1" (basically "if any of the hugepage is mapped, increment the compound_mapcount by 1"), but this was painful to implement, so I changed it to what it is now (each new PT mapping increments the compound_mapcount by 1). I think you're right, there is absolutely an overflow risk. :( I'm not sure what the best solution is. I could just go back to the old approach. Regarding when things are accounted in private_hugetlb vs. shared_hugetlb, HGM shouldn't change that, because it only applies to shared mappings (right now anyway). It seems like "private_hugetlb" can include cases where the page is shared but has only one mapping, in which case HGM will change it from "private" to "shared". > > Btw, are the small page* pointers still needed in the latest HGM design? > Is there code taking care of disabling of hugetlb vmemmap optimization for > HGM? Or maybe it's not needed anymore for the current design? The hugetlb vmemmap optimization can still be used with HGM, so there is no code to disable it. We don't need small page* pointers either, except for when we're dealing with mapping subpages, like in hugetlb_no_page. Everything else can handle the hugetlb page as a folio. I hope we can keep compatibility with the vmemmap optimization while solving the mapcount overflow risk. Thanks Peter. - James
On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote: > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote: > > > > James, > > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote: > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, > > > int mapcount = page_mapcount(page); > > > > > > if (mapcount >= 2) > > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); > > > + mss->shared_hugetlb += hugetlb_pte_size(hpte); > > > else > > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma)); > > > + mss->private_hugetlb += hugetlb_pte_size(hpte); > > > } > > > return 0; > > > > One thing interesting I found with hgm right now is mostly everything will > > be counted as "shared" here, I think it's because mapcount is accounted > > always to the huge page even if mapped in smaller sizes, so page_mapcount() > > to a small page should be huge too because the head page mapcount should be > > huge. I'm curious the reasons behind the mapcount decision. > > > > For example, would that risk overflow with head_compound_mapcount? One 1G > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for > > atomic_t. Looks like it's possible. > > The original mapcount approach was "if the hstate-level PTE is > present, increment the compound_mapcount by 1" (basically "if any of > the hugepage is mapped, increment the compound_mapcount by 1"), but > this was painful to implement, Any more info here on why it was painful? What is the major blocker? > so I changed it to what it is now (each new PT mapping increments the > compound_mapcount by 1). I think you're right, there is absolutely an > overflow risk. :( I'm not sure what the best solution is. I could just go > back to the old approach. No rush on that; let's discuss it thoroughly before doing anything. We have more context than when it was discussed initially in the calls, so maybe a good time to revisit. > > Regarding when things are accounted in private_hugetlb vs. > shared_hugetlb, HGM shouldn't change that, because it only applies to > shared mappings (right now anyway). It seems like "private_hugetlb" > can include cases where the page is shared but has only one mapping, > in which case HGM will change it from "private" to "shared". The two fields are not defined against VM_SHARED, it seems. At least not with current code base. Let me quote the code again just to be clear: int mapcount = page_mapcount(page); <------------- [1] if (mapcount >= 2) mss->shared_hugetlb += hugetlb_pte_size(hpte); else mss->private_hugetlb += hugetlb_pte_size(hpte); smaps_hugetlb_hgm_account(mss, hpte); So that information (for some reason) is only relevant to how many mapcount is there. If we have one 1G page and only two 4K mapped, with the existing logic we should see 8K private_hugetlb while in fact I think it should be 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings (as they all goes back to the head). I have no idea whether violating that will be a problem or not, I suppose at least it needs justification if it will be violated, or hopefully it can be kept as-is. > > > > > Btw, are the small page* pointers still needed in the latest HGM design? > > Is there code taking care of disabling of hugetlb vmemmap optimization for > > HGM? Or maybe it's not needed anymore for the current design? > > The hugetlb vmemmap optimization can still be used with HGM, so there > is no code to disable it. 
We don't need small page* pointers either, > except for when we're dealing with mapping subpages, like in > hugetlb_no_page. Everything else can handle the hugetlb page as a > folio. > > I hope we can keep compatibility with the vmemmap optimization while > solving the mapcount overflow risk. Yeh that'll be perfect if it works. But afaiu even with your current design (ignoring all the issues on either smaps accounting or overflow risks), we already referenced the small pages, aren't we? See: static inline int page_mapcount(struct page *page) { int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here if (likely(!PageCompound(page))) return mapcount; page = compound_head(page); return head_compound_mapcount(page) + mapcount; } Even if we assume small page->_mapcount should always be zero in this case, we may need to take special care of hugetlb pages in page_mapcount() to not reference it at all. Or I think it's reading random values and some days it can be non-zero. The other approach is probably using the thp approach. After Hugh's rework on the thp accounting I assumed it would be even cleaner (at least no DoubleMap complexity anymore.. even though I can't say I fully digested the whole history of that). It's all about what's the major challenges of using the same approach there with thp. You may have more knowledge on that aspect so I'd be willing to know. Thanks,
On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote: > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > James, > > > > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote: > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, > > > > int mapcount = page_mapcount(page); > > > > > > > > if (mapcount >= 2) > > > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); > > > > + mss->shared_hugetlb += hugetlb_pte_size(hpte); > > > > else > > > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma)); > > > > + mss->private_hugetlb += hugetlb_pte_size(hpte); > > > > } > > > > return 0; > > > > > > One thing interesting I found with hgm right now is mostly everything will > > > be counted as "shared" here, I think it's because mapcount is accounted > > > always to the huge page even if mapped in smaller sizes, so page_mapcount() > > > to a small page should be huge too because the head page mapcount should be > > > huge. I'm curious the reasons behind the mapcount decision. > > > > > > For example, would that risk overflow with head_compound_mapcount? One 1G > > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for > > > atomic_t. Looks like it's possible. > > > > The original mapcount approach was "if the hstate-level PTE is > > present, increment the compound_mapcount by 1" (basically "if any of > > the hugepage is mapped, increment the compound_mapcount by 1"), but > > this was painful to implement, > > Any more info here on why it was painful? What is the major blocker? The original approach was implemented in RFC v1, but the implementation was broken: the way refcount was handled was wrong; it was incremented once for each new page table mapping. (How? find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE would increment refcount and we wouldn't drop it, and in __unmap_hugepage_range(), the mmu_gather bits would decrement the refcount once per mapping.) At the time, I figured the complexity of handling mapcount AND refcount correctly in the original approach would be quite complex, so I switched to the new one. 1. In places that already change the mapcount, check that we're installing the hstate-level PTE, not a high-granularity PTE. Adjust mapcount AND refcount appropriately. 2. In the HGM walking bits, to the caller if we made the hstate-level PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to keep track of this until we figure out which page we're allocating PTEs for, then change mapcount/refcount appropriately. 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only once per hugepage. (This is probably the hardest of these three things to get right.) > > > so I changed it to what it is now (each new PT mapping increments the > > compound_mapcount by 1). I think you're right, there is absolutely an > > overflow risk. :( I'm not sure what the best solution is. I could just go > > back to the old approach. > > No rush on that; let's discuss it thoroughly before doing anything. We > have more context than when it was discussed initially in the calls, so > maybe a good time to revisit. > > > > > Regarding when things are accounted in private_hugetlb vs. > > shared_hugetlb, HGM shouldn't change that, because it only applies to > > shared mappings (right now anyway). 
It seems like "private_hugetlb" > > can include cases where the page is shared but has only one mapping, > > in which case HGM will change it from "private" to "shared". > > The two fields are not defined against VM_SHARED, it seems. At least not > with current code base. > > Let me quote the code again just to be clear: > > int mapcount = page_mapcount(page); <------------- [1] > > if (mapcount >= 2) > mss->shared_hugetlb += hugetlb_pte_size(hpte); > else > mss->private_hugetlb += hugetlb_pte_size(hpte); > > smaps_hugetlb_hgm_account(mss, hpte); > > So that information (for some reason) is only relevant to how many mapcount > is there. If we have one 1G page and only two 4K mapped, with the existing > logic we should see 8K private_hugetlb while in fact I think it should be > 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings > (as they all goes back to the head). > > I have no idea whether violating that will be a problem or not, I suppose > at least it needs justification if it will be violated, or hopefully it can > be kept as-is. Agreed that this is a problem. I'm not sure what should be done here. It seems like the current upstream implementation is incorrect (surely MAP_SHARED with only one mapping should still be shared_hugetlb not private_hugetlb); the check should really be `if (vma->vm_flags & VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken, we don't have a problem here. > > > > > > > > > Btw, are the small page* pointers still needed in the latest HGM design? > > > Is there code taking care of disabling of hugetlb vmemmap optimization for > > > HGM? Or maybe it's not needed anymore for the current design? > > > > The hugetlb vmemmap optimization can still be used with HGM, so there > > is no code to disable it. We don't need small page* pointers either, > > except for when we're dealing with mapping subpages, like in > > hugetlb_no_page. Everything else can handle the hugetlb page as a > > folio. > > > > I hope we can keep compatibility with the vmemmap optimization while > > solving the mapcount overflow risk. > > Yeh that'll be perfect if it works. But afaiu even with your current > design (ignoring all the issues on either smaps accounting or overflow > risks), we already referenced the small pages, aren't we? See: > > static inline int page_mapcount(struct page *page) > { > int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here > > if (likely(!PageCompound(page))) > return mapcount; > page = compound_head(page); > return head_compound_mapcount(page) + mapcount; > } > > Even if we assume small page->_mapcount should always be zero in this case, > we may need to take special care of hugetlb pages in page_mapcount() to not > reference it at all. Or I think it's reading random values and some days > it can be non-zero. IIUC, it's ok to read from all the hugetlb subpage structs, you just can't *write* to them (except the first few). The first page of page structs is mapped RW; all the others are mapped RO to a single physical page. > > The other approach is probably using the thp approach. After Hugh's rework > on the thp accounting I assumed it would be even cleaner (at least no > DoubleMap complexity anymore.. even though I can't say I fully digested the > whole history of that). It's all about what's the major challenges of > using the same approach there with thp. You may have more knowledge on > that aspect so I'd be willing to know. I need to take a closer look at Hugh's approach to see if we can do it the same way. 
(I wonder if the 1G THP series has some ideas too.) A really simple solution could be just to prevent userspace from doing MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the poison) if it could result in a mapcount overflow. For 1G pages, userspace would need 8192 mappings to overflow mapcount/refcount.
> The original approach was implemented in RFC v1, but the > implementation was broken: the way refcount was handled was wrong; it > was incremented once for each new page table mapping. (How? > find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE > would increment refcount and we wouldn't drop it, and in > __unmap_hugepage_range(), the mmu_gather bits would decrement the > refcount once per mapping.) > > At the time, I figured the complexity of handling mapcount AND > refcount correctly in the original approach would be quite complex, so > I switched to the new one. Sorry I didn't make this clear... the following steps are how we could correctly implement the original approach. > 1. In places that already change the mapcount, check that we're > installing the hstate-level PTE, not a high-granularity PTE. Adjust > mapcount AND refcount appropriately. > 2. In the HGM walking bits, to the caller if we made the hstate-level > PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to > keep track of this until we figure out which page we're allocating > PTEs for, then change mapcount/refcount appropriately. > 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only > once per hugepage. (This is probably the hardest of these three things > to get right.)
On Thu, Jan 12, 2023 at 11:45:40AM -0500, James Houghton wrote: > On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote: > > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > James, > > > > > > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote: > > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, > > > > > int mapcount = page_mapcount(page); > > > > > > > > > > if (mapcount >= 2) > > > > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); > > > > > + mss->shared_hugetlb += hugetlb_pte_size(hpte); > > > > > else > > > > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma)); > > > > > + mss->private_hugetlb += hugetlb_pte_size(hpte); > > > > > } > > > > > return 0; > > > > > > > > One thing interesting I found with hgm right now is mostly everything will > > > > be counted as "shared" here, I think it's because mapcount is accounted > > > > always to the huge page even if mapped in smaller sizes, so page_mapcount() > > > > to a small page should be huge too because the head page mapcount should be > > > > huge. I'm curious the reasons behind the mapcount decision. > > > > > > > > For example, would that risk overflow with head_compound_mapcount? One 1G > > > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for > > > > atomic_t. Looks like it's possible. > > > > > > The original mapcount approach was "if the hstate-level PTE is > > > present, increment the compound_mapcount by 1" (basically "if any of > > > the hugepage is mapped, increment the compound_mapcount by 1"), but > > > this was painful to implement, > > > > Any more info here on why it was painful? What is the major blocker? > > The original approach was implemented in RFC v1, but the > implementation was broken: the way refcount was handled was wrong; it > was incremented once for each new page table mapping. (How? > find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE > would increment refcount and we wouldn't drop it, and in > __unmap_hugepage_range(), the mmu_gather bits would decrement the > refcount once per mapping.) I'm not sure I fully get the point here, but perhaps it's mostly about how hugetlb page cache is managed (in hstate size not PAGE_SIZE)? static inline struct page *folio_file_page(struct folio *folio, pgoff_t index) { /* HugeTLBfs indexes the page cache in units of hpage_size */ if (folio_test_hugetlb(folio)) return &folio->page; return folio_page(folio, index & (folio_nr_pages(folio) - 1)); } I haven't thought through on that either. Is it possible that we switche the pgcache layout to be in PAGE_SIZE granule too when HGM enabled (e.g. a simple scheme is we can fail MADV_SPLIT if hugetlb pgcache contains any page already). If keep using the same pgcache scheme (hpage_size stepped indexes), find_lock_page() will also easily content on the head page lock, so we may not be able to handle concurrent page faults on small mappings on the same page as efficient as thp. > > At the time, I figured the complexity of handling mapcount AND > refcount correctly in the original approach would be quite complex, so > I switched to the new one. > > 1. In places that already change the mapcount, check that we're > installing the hstate-level PTE, not a high-granularity PTE. Adjust > mapcount AND refcount appropriately. > 2. In the HGM walking bits, to the caller if we made the hstate-level > PTE present. 
(hugetlb_[pmd,pte]_alloc is the source of truth.) Need to > keep track of this until we figure out which page we're allocating > PTEs for, then change mapcount/refcount appropriately. > 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only > once per hugepage. (This is probably the hardest of these three things > to get right.) > > > > > > so I changed it to what it is now (each new PT mapping increments the > > > compound_mapcount by 1). I think you're right, there is absolutely an > > > overflow risk. :( I'm not sure what the best solution is. I could just go > > > back to the old approach. > > > > No rush on that; let's discuss it thoroughly before doing anything. We > > have more context than when it was discussed initially in the calls, so > > maybe a good time to revisit. > > > > > > > > Regarding when things are accounted in private_hugetlb vs. > > > shared_hugetlb, HGM shouldn't change that, because it only applies to > > > shared mappings (right now anyway). It seems like "private_hugetlb" > > > can include cases where the page is shared but has only one mapping, > > > in which case HGM will change it from "private" to "shared". > > > > The two fields are not defined against VM_SHARED, it seems. At least not > > with current code base. > > > > Let me quote the code again just to be clear: > > > > int mapcount = page_mapcount(page); <------------- [1] > > > > if (mapcount >= 2) > > mss->shared_hugetlb += hugetlb_pte_size(hpte); > > else > > mss->private_hugetlb += hugetlb_pte_size(hpte); > > > > smaps_hugetlb_hgm_account(mss, hpte); > > > > So that information (for some reason) is only relevant to how many mapcount > > is there. If we have one 1G page and only two 4K mapped, with the existing > > logic we should see 8K private_hugetlb while in fact I think it should be > > 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings > > (as they all goes back to the head). > > > > I have no idea whether violating that will be a problem or not, I suppose > > at least it needs justification if it will be violated, or hopefully it can > > be kept as-is. > > Agreed that this is a problem. I'm not sure what should be done here. > It seems like the current upstream implementation is incorrect (surely > MAP_SHARED with only one mapping should still be shared_hugetlb not > private_hugetlb); the check should really be `if (vma->vm_flags & > VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken, > we don't have a problem here. Naoya added relevant code. Not sure whether he'll chim in. commit 25ee01a2fca02dfb5a3ce316e77910c468108199 Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Date: Thu Nov 5 18:47:11 2015 -0800 mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps In all cases, it'll still be a slight ABI change, so worth considering the effects to existing users. > > > > > > > > > > > > > > Btw, are the small page* pointers still needed in the latest HGM design? > > > > Is there code taking care of disabling of hugetlb vmemmap optimization for > > > > HGM? Or maybe it's not needed anymore for the current design? > > > > > > The hugetlb vmemmap optimization can still be used with HGM, so there > > > is no code to disable it. We don't need small page* pointers either, > > > except for when we're dealing with mapping subpages, like in > > > hugetlb_no_page. Everything else can handle the hugetlb page as a > > > folio. 
> > > > > > I hope we can keep compatibility with the vmemmap optimization while > > > solving the mapcount overflow risk. > > > > Yeh that'll be perfect if it works. But afaiu even with your current > > design (ignoring all the issues on either smaps accounting or overflow > > risks), we already referenced the small pages, aren't we? See: > > > > static inline int page_mapcount(struct page *page) > > { > > int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here > > > > if (likely(!PageCompound(page))) > > return mapcount; > > page = compound_head(page); > > return head_compound_mapcount(page) + mapcount; > > } > > > > Even if we assume small page->_mapcount should always be zero in this case, > > we may need to take special care of hugetlb pages in page_mapcount() to not > > reference it at all. Or I think it's reading random values and some days > > it can be non-zero. > > IIUC, it's ok to read from all the hugetlb subpage structs, you just > can't *write* to them (except the first few). The first page of page > structs is mapped RW; all the others are mapped RO to a single > physical page. I'm not familiar with vmemmap work, but I saw this: /* * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end) * to the page which @vmemmap_reuse is mapped to, then free the pages * which the range [@vmemmap_start, @vmemmap_end] is mapped to. */ if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse)) It seems it'll reuse the 1st page of the huge page* rather than backing the rest vmemmap with zero pages. Would that be a problem (as I think some small page* can actually points to the e.g. head page* if referenced)? > > > > > The other approach is probably using the thp approach. After Hugh's rework > > on the thp accounting I assumed it would be even cleaner (at least no > > DoubleMap complexity anymore.. even though I can't say I fully digested the > > whole history of that). It's all about what's the major challenges of > > using the same approach there with thp. You may have more knowledge on > > that aspect so I'd be willing to know. > > I need to take a closer look at Hugh's approach to see if we can do it > the same way. (I wonder if the 1G THP series has some ideas too.) https://lore.kernel.org/all/47ad693-717-79c8-e1ba-46c3a6602e48@google.com/ It's already in the mainline. I think it's mostly internally implemented under the rmap code APIs. For the HGM effort, I'd start with simply passing around compound=false when using the rmap APIs, and see what'll start to fail. > > A really simple solution could be just to prevent userspace from doing > MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the > poison) if it could result in a mapcount overflow. For 1G pages, > userspace would need 8192 mappings to overflow mapcount/refcount. I'm not sure you can calculate it; consider fork()s along the way when pmd sharing disabled, or whatever reason when the 1G pages mapped at multiple places with more than one mmap()s. Thanks,
On Thu, Jan 12, 2023 at 3:27 PM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Jan 12, 2023 at 11:45:40AM -0500, James Houghton wrote: > > On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote: > > > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > James, > > > > > > > > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote: > > > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, > > > > > > int mapcount = page_mapcount(page); > > > > > > > > > > > > if (mapcount >= 2) > > > > > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); > > > > > > + mss->shared_hugetlb += hugetlb_pte_size(hpte); > > > > > > else > > > > > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma)); > > > > > > + mss->private_hugetlb += hugetlb_pte_size(hpte); > > > > > > } > > > > > > return 0; > > > > > > > > > > One thing interesting I found with hgm right now is mostly everything will > > > > > be counted as "shared" here, I think it's because mapcount is accounted > > > > > always to the huge page even if mapped in smaller sizes, so page_mapcount() > > > > > to a small page should be huge too because the head page mapcount should be > > > > > huge. I'm curious the reasons behind the mapcount decision. > > > > > > > > > > For example, would that risk overflow with head_compound_mapcount? One 1G > > > > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for > > > > > atomic_t. Looks like it's possible. > > > > > > > > The original mapcount approach was "if the hstate-level PTE is > > > > present, increment the compound_mapcount by 1" (basically "if any of > > > > the hugepage is mapped, increment the compound_mapcount by 1"), but > > > > this was painful to implement, > > > > > > Any more info here on why it was painful? What is the major blocker? > > > > The original approach was implemented in RFC v1, but the > > implementation was broken: the way refcount was handled was wrong; it > > was incremented once for each new page table mapping. (How? > > find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE > > would increment refcount and we wouldn't drop it, and in > > __unmap_hugepage_range(), the mmu_gather bits would decrement the > > refcount once per mapping.) > > I'm not sure I fully get the point here, but perhaps it's mostly about how > hugetlb page cache is managed (in hstate size not PAGE_SIZE)? > > static inline struct page *folio_file_page(struct folio *folio, pgoff_t index) > { > /* HugeTLBfs indexes the page cache in units of hpage_size */ > if (folio_test_hugetlb(folio)) > return &folio->page; > return folio_page(folio, index & (folio_nr_pages(folio) - 1)); > } > > I haven't thought through on that either. Is it possible that we switche > the pgcache layout to be in PAGE_SIZE granule too when HGM enabled (e.g. a > simple scheme is we can fail MADV_SPLIT if hugetlb pgcache contains any > page already). > > If keep using the same pgcache scheme (hpage_size stepped indexes), > find_lock_page() will also easily content on the head page lock, so we may > not be able to handle concurrent page faults on small mappings on the same > page as efficient as thp. We keep the pgcache scheme the same: hpage_size stepped indexes. The refcount and mapcount are both stored in the head page. 
The problems with the implementation in RFC v1 were (1) refcount became much much larger than mapcount, and (2) refcount would have the same overflow problem we're discussing. You're right -- find_lock_page() in hugetlb_no_page() and hugetlb_mcopy_atomic_pte() will contend on the head page lock, but I think it's possible to improve things here (i.e., replace them with find_get_page() somehow). > > > > > At the time, I figured the complexity of handling mapcount AND > > refcount correctly in the original approach would be quite complex, so > > I switched to the new one. > > > > 1. In places that already change the mapcount, check that we're > > installing the hstate-level PTE, not a high-granularity PTE. Adjust > > mapcount AND refcount appropriately. > > 2. In the HGM walking bits, to the caller if we made the hstate-level > > PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to > > keep track of this until we figure out which page we're allocating > > PTEs for, then change mapcount/refcount appropriately. > > 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only > > once per hugepage. (This is probably the hardest of these three things > > to get right.) > > > > > > > > > so I changed it to what it is now (each new PT mapping increments the > > > > compound_mapcount by 1). I think you're right, there is absolutely an > > > > overflow risk. :( I'm not sure what the best solution is. I could just go > > > > back to the old approach. > > > > > > No rush on that; let's discuss it thoroughly before doing anything. We > > > have more context than when it was discussed initially in the calls, so > > > maybe a good time to revisit. > > > > > > > > > > > Regarding when things are accounted in private_hugetlb vs. > > > > shared_hugetlb, HGM shouldn't change that, because it only applies to > > > > shared mappings (right now anyway). It seems like "private_hugetlb" > > > > can include cases where the page is shared but has only one mapping, > > > > in which case HGM will change it from "private" to "shared". > > > > > > The two fields are not defined against VM_SHARED, it seems. At least not > > > with current code base. > > > > > > Let me quote the code again just to be clear: > > > > > > int mapcount = page_mapcount(page); <------------- [1] > > > > > > if (mapcount >= 2) > > > mss->shared_hugetlb += hugetlb_pte_size(hpte); > > > else > > > mss->private_hugetlb += hugetlb_pte_size(hpte); > > > > > > smaps_hugetlb_hgm_account(mss, hpte); > > > > > > So that information (for some reason) is only relevant to how many mapcount > > > is there. If we have one 1G page and only two 4K mapped, with the existing > > > logic we should see 8K private_hugetlb while in fact I think it should be > > > 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings > > > (as they all goes back to the head). > > > > > > I have no idea whether violating that will be a problem or not, I suppose > > > at least it needs justification if it will be violated, or hopefully it can > > > be kept as-is. > > > > Agreed that this is a problem. I'm not sure what should be done here. > > It seems like the current upstream implementation is incorrect (surely > > MAP_SHARED with only one mapping should still be shared_hugetlb not > > private_hugetlb); the check should really be `if (vma->vm_flags & > > VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken, > > we don't have a problem here. > > Naoya added relevant code. Not sure whether he'll chim in. 
> > commit 25ee01a2fca02dfb5a3ce316e77910c468108199 > Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Date: Thu Nov 5 18:47:11 2015 -0800 > > mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps > > In all cases, it'll still be a slight ABI change, so worth considering the > effects to existing users. > > > > > > > > > > > > > > > > > > > > Btw, are the small page* pointers still needed in the latest HGM design? > > > > > Is there code taking care of disabling of hugetlb vmemmap optimization for > > > > > HGM? Or maybe it's not needed anymore for the current design? > > > > > > > > The hugetlb vmemmap optimization can still be used with HGM, so there > > > > is no code to disable it. We don't need small page* pointers either, > > > > except for when we're dealing with mapping subpages, like in > > > > hugetlb_no_page. Everything else can handle the hugetlb page as a > > > > folio. > > > > > > > > I hope we can keep compatibility with the vmemmap optimization while > > > > solving the mapcount overflow risk. > > > > > > Yeh that'll be perfect if it works. But afaiu even with your current > > > design (ignoring all the issues on either smaps accounting or overflow > > > risks), we already referenced the small pages, aren't we? See: > > > > > > static inline int page_mapcount(struct page *page) > > > { > > > int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here > > > > > > if (likely(!PageCompound(page))) > > > return mapcount; > > > page = compound_head(page); > > > return head_compound_mapcount(page) + mapcount; > > > } > > > > > > Even if we assume small page->_mapcount should always be zero in this case, > > > we may need to take special care of hugetlb pages in page_mapcount() to not > > > reference it at all. Or I think it's reading random values and some days > > > it can be non-zero. > > > > IIUC, it's ok to read from all the hugetlb subpage structs, you just > > can't *write* to them (except the first few). The first page of page > > structs is mapped RW; all the others are mapped RO to a single > > physical page. > > I'm not familiar with vmemmap work, but I saw this: > > /* > * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end) > * to the page which @vmemmap_reuse is mapped to, then free the pages > * which the range [@vmemmap_start, @vmemmap_end] is mapped to. > */ > if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse)) > > It seems it'll reuse the 1st page of the huge page* rather than backing the > rest vmemmap with zero pages. Would that be a problem (as I think some > small page* can actually points to the e.g. head page* if referenced)? It shouldn't be a problem if we don't use _mapcount and always use compound_mapcount, which is how mapcount is handled in this version of this series. > > > > > > > > > The other approach is probably using the thp approach. After Hugh's rework > > > on the thp accounting I assumed it would be even cleaner (at least no > > > DoubleMap complexity anymore.. even though I can't say I fully digested the > > > whole history of that). It's all about what's the major challenges of > > > using the same approach there with thp. You may have more knowledge on > > > that aspect so I'd be willing to know. > > > > I need to take a closer look at Hugh's approach to see if we can do it > > the same way. (I wonder if the 1G THP series has some ideas too.) > > https://lore.kernel.org/all/47ad693-717-79c8-e1ba-46c3a6602e48@google.com/ > > It's already in the mainline. 
I think it's mostly internally implemented > under the rmap code APIs. For the HGM effort, I'd start with simply > passing around compound=false when using the rmap APIs, and see what'll > start to fail. I'll look into it, but doing it this way will use _mapcount, so we won't be able to use the vmemmap optimization. I think even if we do use Hugh's approach, refcount is still being kept on the head page, so there's still an overflow risk there (but maybe I am misunderstanding). > > > > > A really simple solution could be just to prevent userspace from doing > > MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the > > poison) if it could result in a mapcount overflow. For 1G pages, > > userspace would need 8192 mappings to overflow mapcount/refcount. > > I'm not sure you can calculate it; consider fork()s along the way when pmd > sharing disabled, or whatever reason when the 1G pages mapped at multiple > places with more than one mmap()s. Yeah you're right. :(
On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> I'll look into it, but doing it this way will use _mapcount, so we
> won't be able to use the vmemmap optimization. I think even if we do
> use Hugh's approach, refcount is still being kept on the head page, so
> there's still an overflow risk there (but maybe I am
> misunderstanding).

Could you remind me what's the issue if using refcount on the small pages
rather than the head (assuming vmemmap still can be disabled)?

Thanks,
On 12.01.23 22:33, Peter Xu wrote:
> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
>> I'll look into it, but doing it this way will use _mapcount, so we
>> won't be able to use the vmemmap optimization. I think even if we do
>> use Hugh's approach, refcount is still being kept on the head page, so
>> there's still an overflow risk there (but maybe I am
>> misunderstanding).
>
> Could you remind me what's the issue if using refcount on the small pages
> rather than the head (assuming vmemmap still can be disabled)?

The THP-way of doing things is refcounting on the head page. All folios
use a single refcount on the head.

There has to be a pretty good reason to do it differently.
On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote: > > On 12.01.23 22:33, Peter Xu wrote: > > On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote: > >> I'll look into it, but doing it this way will use _mapcount, so we > >> won't be able to use the vmemmap optimization. I think even if we do > >> use Hugh's approach, refcount is still being kept on the head page, so > >> there's still an overflow risk there (but maybe I am > >> misunderstanding). > > > > Could you remind me what's the issue if using refcount on the small pages > > rather than the head (assuming vmemmap still can be disabled)? > > The THP-way of doing things is refcounting on the head page. All folios > use a single refcount on the head. > > There has to be a pretty good reason to do it differently. Peter and I have discussed this a lot offline. There are two main problems here: 1. Refcount overflow Refcount is always kept on the head page (before and after this series). IIUC, this means that if THPs could be 1G in size, they too would be susceptible to the same potential overflow. How easy is the overflow? [1] To deal with this, the best solution we've been able to come up with is to check if refcount is > INT_MAX/2 (similar to try_get_page()), and if it is, stop the operation (UFFDIO_CONTINUE or a page fault) from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't want to kill a random process). (So David, I think this answers your question. Refcount should be handled just like THPs.) 2. page_mapcount() API differences In this series, page_mapcount() returns the total number of page table references for the compound page. For example, if you have a PTE-mapped 2M page (with no other mappings), page_mapcount() for each 4K page will be 512. This is not the same as a THP: page_mapcount() would return 1 for each page. Because of the difference in page_mapcount(), we have 4 problems: i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is "private_hugetlb" or "shared_hugetlb". ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if the hugepage is shared or not. Pages that would otherwise be migrated now require MPOL_MF_MOVE_ALL to be migrated. [Really both of the above are checking how many VMAs are mapping our hugepage.] iii. CoW. This isn't a problem right now because CoW is only possible with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs. iv. The hwpoison handling code will check if it successfully unmapped the poisoned page. This isn't a problem right now, as hwpoison will unmap all the mappings for the hugepage, not just the 4K where the poison was found. Doing it this way allows HGM to remain compatible with the hugetlb vmemmap optimization. None of the above problems strike me as particularly major, but it's unclear to me how important it is to have page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb. The other way page_mapcount() (let's say the "THP-like way") could be done is like this: increment compound mapcount if we're mapping a hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at high-granularity, increment the mapcount for each 4K page that is getting mapped (e.g., PMD within a 1G page: increment the mapcount for the 512 pages that are now mapped). This yields the same page_mapcount() API we had before, but we lose the hugetlb vmemmap optimization. 
We could introduce an API like hugetlb_vma_mapcount() that would, for hugetlb, give us the number of VMAs that map a hugepage, but I don't think people would like this. I'm curious what others think (Mike, Matthew?). I'm guessing the THP-like way is probably what most people would want, though it would be a real shame to lose the vmemmap optimization. - James [1] INT_MAX is 2^31. We increment the refcount once for each 4K mapping in 1G: that's 512 * 512 (2^18). That means we only need 8192 (2^13) VMAs for the same 1G page to overflow refcount. I think it's safe to say that if userspace is doing this, they are attempting to overflow refcount.
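A userspace model of the guard proposed above (the INT_MAX/2 threshold
and the ENOMEM/VM_FAULT_SIGBUS outcomes are taken from the mail; the
helper name is made up):

    #include <limits.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct head_page {
            int refcount;   /* stand-in for the head page's refcount */
    };

    /*
     * Refuse to take another reference once we are halfway to overflow,
     * mirroring the proposed "refcount > INT_MAX / 2" check; the caller
     * would then fail UFFDIO_CONTINUE with ENOMEM, or the page fault with
     * VM_FAULT_SIGBUS, instead of proceeding.
     */
    static bool try_get_hgm_ref(struct head_page *hp)
    {
            if (hp->refcount > INT_MAX / 2)
                    return false;
            hp->refcount++;
            return true;
    }

    int main(void)
    {
            struct head_page hp = { .refcount = INT_MAX / 2 + 1 };

            printf("%s\n", try_get_hgm_ref(&hp) ? "ref taken" : "refused");
            return 0;
    }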
On 18.01.23 00:11, James Houghton wrote: > On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 12.01.23 22:33, Peter Xu wrote: >>> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote: >>>> I'll look into it, but doing it this way will use _mapcount, so we >>>> won't be able to use the vmemmap optimization. I think even if we do >>>> use Hugh's approach, refcount is still being kept on the head page, so >>>> there's still an overflow risk there (but maybe I am >>>> misunderstanding). >>> >>> Could you remind me what's the issue if using refcount on the small pages >>> rather than the head (assuming vmemmap still can be disabled)? >> >> The THP-way of doing things is refcounting on the head page. All folios >> use a single refcount on the head. >> >> There has to be a pretty good reason to do it differently. > > Peter and I have discussed this a lot offline. There are two main problems here: > > 1. Refcount overflow > > Refcount is always kept on the head page (before and after this > series). IIUC, this means that if THPs could be 1G in size, they too > would be susceptible to the same potential overflow. How easy is the > overflow? [1] Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64 processes with 64k VMAs. Not impossible to achieve if one really wants to break the system ... Side note: a long long time ago, we used to have sub-page refcounts for THP. IIRC, that was even before we had sub-page mapcounts and was used to make COW decisions. > > To deal with this, the best solution we've been able to come up with > is to check if refcount is > INT_MAX/2 (similar to try_get_page()), > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault) > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't > want to kill a random process). You'd have to also make sure that fork() won't do the same. At least with uffd-wp, Peter also added page table copying during fork() for MAP_SHARED mappings, which would have to be handled. Of course, one can just disallow fork() with any HGM right from the start and keep it all simpler to not open up a can of worms there. Is it reasonable, to have more than one (or a handful) of VMAs mapping a huge page via a HGM? Restricting it to a single one, would make handling much easier. If there is ever demand for more HGM mappings, that whole problem (and complexity) could be dealt with later. ... but I assume it will already be a requirement for VMs (e.g., under QEMU) that share memory with other processes (virtiofsd and friends?) > > (So David, I think this answers your question. Refcount should be > handled just like THPs.) > > 2. page_mapcount() API differences > > In this series, page_mapcount() returns the total number of page table > references for the compound page. For example, if you have a > PTE-mapped 2M page (with no other mappings), page_mapcount() for each > 4K page will be 512. This is not the same as a THP: page_mapcount() > would return 1 for each page. Because of the difference in > page_mapcount(), we have 4 problems: IMHO, it would actually be great to just be able to remove the sub-page mapcounts for THP and make it all simpler. Right now, the sub-page mapcount is mostly required for making COW decisions, but only for accounting purposes IIRC (NR_ANON_THPS, NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See page_remove_rmap(). 
If we can avoid that complexity right from the start for hugetlb, great, .. > > i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is > "private_hugetlb" or "shared_hugetlb". > ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if > the hugepage is shared or not. Pages that would otherwise be migrated > now require MPOL_MF_MOVE_ALL to be migrated. > [Really both of the above are checking how many VMAs are mapping our hugepage.] > iii. CoW. This isn't a problem right now because CoW is only possible > with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs. > iv. The hwpoison handling code will check if it successfully unmapped > the poisoned page. This isn't a problem right now, as hwpoison will > unmap all the mappings for the hugepage, not just the 4K where the > poison was found. > > Doing it this way allows HGM to remain compatible with the hugetlb > vmemmap optimization. None of the above problems strike me as > particularly major, but it's unclear to me how important it is to have > page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb. See below, maybe we should tackle HGM from a different direction. > > The other way page_mapcount() (let's say the "THP-like way") could be > done is like this: increment compound mapcount if we're mapping a > hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at > high-granularity, increment the mapcount for each 4K page that is > getting mapped (e.g., PMD within a 1G page: increment the mapcount for > the 512 pages that are now mapped). This yields the same > page_mapcount() API we had before, but we lose the hugetlb vmemmap > optimization. > > We could introduce an API like hugetlb_vma_mapcount() that would, for > hugetlb, give us the number of VMAs that map a hugepage, but I don't > think people would like this. > > I'm curious what others think (Mike, Matthew?). I'm guessing the > THP-like way is probably what most people would want, though it would > be a real shame to lose the vmemmap optimization. Heh, not me :) Having a single mapcount is certainly much cleaner. ... and if we're dealing with refcount overflows already, mapcount overflows are not an issue. I wonder if the following crazy idea has already been discussed: treat the whole mapping as a single large logical mapping. One reference and one mapping, no matter how the individual parts are mapped into the assigned page table sub-tree. Because for hugetlb with MAP_SHARED, we know that the complete assigned sub-tree of page tables can only map the given hugetlb page, no fragments of something else. That's very different to THP in private mappings ... So as soon as the first piece gets mapped, we increment refcount+mapcount. Other pieces in the same subtree don't do that. Once the last piece is unmapped (or simpler: once the complete subtree of page tables is gone), we decrement refcount+mapcount. Might require some brain power to do this tracking, but I wouldn't call it impossible right from the start. Would such a design violate other design aspects that are important?
On Wed, Jan 18, 2023 at 10:43:47AM +0100, David Hildenbrand wrote: > On 18.01.23 00:11, James Houghton wrote: > > On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote: > > > > > > On 12.01.23 22:33, Peter Xu wrote: > > > > On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote: > > > > > I'll look into it, but doing it this way will use _mapcount, so we > > > > > won't be able to use the vmemmap optimization. I think even if we do > > > > > use Hugh's approach, refcount is still being kept on the head page, so > > > > > there's still an overflow risk there (but maybe I am > > > > > misunderstanding). > > > > > > > > Could you remind me what's the issue if using refcount on the small pages > > > > rather than the head (assuming vmemmap still can be disabled)? > > > > > > The THP-way of doing things is refcounting on the head page. All folios > > > use a single refcount on the head. > > > > > > There has to be a pretty good reason to do it differently. > > > > Peter and I have discussed this a lot offline. There are two main problems here: > > > > 1. Refcount overflow > > > > Refcount is always kept on the head page (before and after this > > series). IIUC, this means that if THPs could be 1G in size, they too > > would be susceptible to the same potential overflow. How easy is the > > overflow? [1] > > Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64 > processes with 64k VMAs. Not impossible to achieve if one really wants to > break the system ... > > Side note: a long long time ago, we used to have sub-page refcounts for THP. > IIRC, that was even before we had sub-page mapcounts and was used to make > COW decisions. > > > > > To deal with this, the best solution we've been able to come up with > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()), > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault) > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the > > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't > > want to kill a random process). > > You'd have to also make sure that fork() won't do the same. At least with > uffd-wp, Peter also added page table copying during fork() for MAP_SHARED > mappings, which would have to be handled. If we want such a check to make a real difference, IIUC we may want to consider having similar check in: page_ref_add page_ref_inc page_ref_inc_return page_ref_add_unless But it's unfortunate that mostly all the callers to these functions (especially the first two) do not have a retval yet at all. Considering the low possibility so far on having it overflow, maybe it can also be done for later (and I think checking negative as try_get_page would suffice too). > > Of course, one can just disallow fork() with any HGM right from the start > and keep it all simpler to not open up a can of worms there. > > Is it reasonable, to have more than one (or a handful) of VMAs mapping a > huge page via a HGM? Restricting it to a single one, would make handling > much easier. > > If there is ever demand for more HGM mappings, that whole problem (and > complexity) could be dealt with later. ... but I assume it will already be a > requirement for VMs (e.g., under QEMU) that share memory with other > processes (virtiofsd and friends?) Yes, any form of multi-proc QEMU will need that for supporting HGM postcopy. > > > > > > (So David, I think this answers your question. Refcount should be > > handled just like THPs.) > > > > 2. 
page_mapcount() API differences > > > > In this series, page_mapcount() returns the total number of page table > > references for the compound page. For example, if you have a > > PTE-mapped 2M page (with no other mappings), page_mapcount() for each > > 4K page will be 512. This is not the same as a THP: page_mapcount() > > would return 1 for each page. Because of the difference in > > page_mapcount(), we have 4 problems: > > IMHO, it would actually be great to just be able to remove the sub-page > mapcounts for THP and make it all simpler. > > Right now, the sub-page mapcount is mostly required for making COW > decisions, but only for accounting purposes IIRC (NR_ANON_THPS, > NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See > page_remove_rmap(). > > If we can avoid that complexity right from the start for hugetlb, great, .. > > > > > i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is > > "private_hugetlb" or "shared_hugetlb". > > ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if > > the hugepage is shared or not. Pages that would otherwise be migrated > > now require MPOL_MF_MOVE_ALL to be migrated. > > [Really both of the above are checking how many VMAs are mapping our hugepage.] > > iii. CoW. This isn't a problem right now because CoW is only possible > > with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs. > > iv. The hwpoison handling code will check if it successfully unmapped > > the poisoned page. This isn't a problem right now, as hwpoison will > > unmap all the mappings for the hugepage, not just the 4K where the > > poison was found. > > > > Doing it this way allows HGM to remain compatible with the hugetlb > > vmemmap optimization. None of the above problems strike me as > > particularly major, but it's unclear to me how important it is to have > > page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb. > > See below, maybe we should tackle HGM from a different direction. > > > > > The other way page_mapcount() (let's say the "THP-like way") could be > > done is like this: increment compound mapcount if we're mapping a > > hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at > > high-granularity, increment the mapcount for each 4K page that is > > getting mapped (e.g., PMD within a 1G page: increment the mapcount for > > the 512 pages that are now mapped). This yields the same > > page_mapcount() API we had before, but we lose the hugetlb vmemmap > > optimization. > > > > We could introduce an API like hugetlb_vma_mapcount() that would, for > > hugetlb, give us the number of VMAs that map a hugepage, but I don't > > think people would like this. > > > > I'm curious what others think (Mike, Matthew?). I'm guessing the > > THP-like way is probably what most people would want, though it would > > be a real shame to lose the vmemmap optimization. > > Heh, not me :) Having a single mapcount is certainly much cleaner. ... and > if we're dealing with refcount overflows already, mapcount overflows are not > an issue. > > > I wonder if the following crazy idea has already been discussed: treat the > whole mapping as a single large logical mapping. One reference and one > mapping, no matter how the individual parts are mapped into the assigned > page table sub-tree. > > Because for hugetlb with MAP_SHARED, we know that the complete assigned > sub-tree of page tables can only map the given hugetlb page, no fragments of > something else. 
> That's very different to THP in private mappings ...
>
> So as soon as the first piece gets mapped, we increment refcount+mapcount.
> Other pieces in the same subtree don't do that.
>
> Once the last piece is unmapped (or simpler: once the complete subtree of
> page tables is gone), we decrement refcount+mapcount. Might require some
> brain power to do this tracking, but I wouldn't call it impossible right
> from the start.
>
> Would such a design violate other design aspects that are important?

The question is how to maintain the above information.

It needs to be per-map (so one page mapped multiple times can be accounted differently), and per-page (so one mapping/vma can contain multiple pages). So far I think that's exactly the pgtable. If we can squeeze the information into the pgtable it'll work out, but that's definitely not trivial. Or we can maintain separate allocations for such information, but that can be extra overhead too.

So far I'd still consider going with reusing the THP mapcounts, which will mostly be what James mentioned above. The only difference is I'm not sure whether we should allow mapping e.g. 2M ranges of 1G pages. THP mapcount doesn't have an intermediate layer to maintain mapcount information at sizes like 2M, so to me it's easier if we start with only mapping either the hpage size or PAGE_SIZE, with no intermediate sizes allowed.

Allowing intermediate-sized mappings can at least be error prone, to me. One example: if some pgtable walker finds a 2M entry, it may easily fetch the PFN out of it, assuming it's a compound page satisfying PageHead(pfn)==true, but that starts to break here, because the 2M PFN will only be a small-page PFN within the 1G huge page in this case.

To me, intermediate-sized mappings are good to have but not required to resolve the HGM problems, at least so far. That said, I'm fine with looking at what it'll look like if James would like to keep pursuing that direction.

Thanks,
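Concretely, the PageHead hazard above is roughly the following (a sketch, not code from the series):

	/* A 2M-level leaf under HGM may point into the middle of a 1G page: */
	struct page *page = pmd_page(*pmd);
	/*
	 * PageHead(page) can legitimately be false here, so a walker must
	 * always go through the head instead of assuming it already has it.
	 */
	struct page *head = compound_head(page);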
> > > To deal with this, the best solution we've been able to come up with > > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()), > > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault) > > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the > > > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't > > > want to kill a random process). > > > > You'd have to also make sure that fork() won't do the same. At least with > > uffd-wp, Peter also added page table copying during fork() for MAP_SHARED > > mappings, which would have to be handled. Indeed, thank you! copy_hugetlb_page_range() (and therefore fork() in some cases) would need this check too. > > If we want such a check to make a real difference, IIUC we may want to > consider having similar check in: > > page_ref_add > page_ref_inc > page_ref_inc_return > page_ref_add_unless > > But it's unfortunate that mostly all the callers to these functions > (especially the first two) do not have a retval yet at all. Considering > the low possibility so far on having it overflow, maybe it can also be done > for later (and I think checking negative as try_get_page would suffice too). I think as long as we annotate the few cases that userspace can exploit to overflow refcount, we should be ok. I think this was the same idea with try_get_page(): it is supposed to be used in places that userspace could reasonably exploit to overflow refcount. > > > > > Of course, one can just disallow fork() with any HGM right from the start > > and keep it all simpler to not open up a can of worms there. > > > > Is it reasonable, to have more than one (or a handful) of VMAs mapping a > > huge page via a HGM? Restricting it to a single one, would make handling > > much easier. > > > > If there is ever demand for more HGM mappings, that whole problem (and > > complexity) could be dealt with later. ... but I assume it will already be a > > requirement for VMs (e.g., under QEMU) that share memory with other > > processes (virtiofsd and friends?) > > Yes, any form of multi-proc QEMU will need that for supporting HGM > postcopy. +1. Disallowing fork() entirely is quite a restrictive limitation. [snip] > > > I'm curious what others think (Mike, Matthew?). I'm guessing the > > > THP-like way is probably what most people would want, though it would > > > be a real shame to lose the vmemmap optimization. > > > > Heh, not me :) Having a single mapcount is certainly much cleaner. ... and > > if we're dealing with refcount overflows already, mapcount overflows are not > > an issue. > > > > > > I wonder if the following crazy idea has already been discussed: treat the > > whole mapping as a single large logical mapping. One reference and one > > mapping, no matter how the individual parts are mapped into the assigned > > page table sub-tree. > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned > > sub-tree of page tables can only map the given hugetlb page, no fragments of > > something else. That's very different to THP in private mappings ... > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount. > > Other pieces in the same subtree don't do that. > > > > Once the last piece is unmapped (or simpler: once the complete subtree of > > page tables is gone), we decrement refcount+mapcount. Might require some > > brain power to do this tracking, but I wouldn't call it impossible right > > from the start. 
> > > > Would such a design violate other design aspects that are important? This is actually how mapcount was treated in HGM RFC v1 (though not refcount); it is doable for both [2]. One caveat here: if a page is unmapped in small pieces, it is difficult to know if the page is legitimately completely unmapped (we would have to check all the PTEs in the page table). In RFC v1, I sidestepped this caveat by saying that "page_mapcount() is incremented if the hstate-level PTE is present". A single unmap on the whole hugepage will clear the hstate-level PTE, thus decrementing the mapcount. On a related note, there still exists an (albeit minor) API difference vs. THPs: a piece of a page that is legitimately unmapped can still have a positive page_mapcount(). Given that this approach allows us to retain the hugetlb vmemmap optimization (and it wouldn't require a horrible amount of complexity), I prefer this approach over the THP-like approach. > > The question is how to maintaining above information. > > It needs to be per-map (so one page mapped multiple times can be accounted > differently), and per-page (so one mapping/vma can contain multiple pages). > So far I think that's exactly the pgtable. If we can squeeze information > into the pgtable it'll work out, but definitely not trivial. Or we can > maintain seperate allocates for such information, but that can be extra > overheads too. I don't think we necessarily need to check the page table if we allow for the limitations stated above. > > So far I'd still consider going with reusing thp mapcounts, which will > mostly be what James mentioned above. The only difference is I'm not sure > whether we should allow mapping e.g. 2M ranges for 1G pages. THP mapcount > doesn't have intermediate layer to maintain mapcount information like 2M, > so to me it's easier we start with only mapping either the hpage size or > PAGE_SIZE, not any intermediate size allowed. > > Having intermediate size mapping allowed can at least be error prone to > me. One example is if some pgtable walker found a 2M page, it may easily > fetch the PFN out of it, assuming it's a compound page and it should > satisfy PageHead(pfn)==true but it'll start to break here, because the 2M > PFN will only be a small page pfn for the 1G huge page in this case. I assume you mean PageHuge(page). There are cases where assumptions are made about hugetlb pages and VMAs; they need to be corrected. It sounds like you're really talking about errors wrt missing a compound_head()/page_folio(), but it can be even more general than that. For example, the arm64 KVM MMU assumes that hugetlb HGM doesn't exist, and so it needs to be fixed before HGM can be enabled for arm64. If a page is HGM-mapped, I think it's more error-prone to somehow make PageHuge() come back with false. So the only solution I see here is careful auditing and testing. I don't really see how allow/disallowing intermediate sizes changes this problem either. > > To me, intermediate sized mappings are good to have but not required to > resolve HGM problems, at least so far. Said that, I'm fine with looking at > what it'll look like if James would like to keep persuing that direction. I've mentioned this to Peter already, but I don't think discarding intermediate mapping levels really reduces complexity all that much. [2]: Using the names of functions as of v1 (Peter has since suggested a name change): hugetlb_alloc_{pte,pmd} will tell us if we really allocated the PTE. 
That info can be passed up through hugetlb_walk_step()=>hugetlb_hgm_walk() (where we only care if the *hstate-level* PTE got allocated) and hugetlb_full_walk*() for the caller to determine if refcount/mapcount should be incremented. Thank you both for your thoughts so far. :) - James
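A sketch of how that information could be threaded upward (the signature and out-parameter are illustrative; only the function names quoted above come from RFC v1):

	/*
	 * Illustrative only: the walk reports whether it had to install the
	 * hstate-level entry, so exactly one caller -- the one that mapped
	 * the first piece of this hugepage -- charges ref/mapcount.
	 */
	int hugetlb_hgm_walk_alloc(struct vm_area_struct *vma,
				   struct hugetlb_pte *hpte, unsigned long addr,
				   unsigned long target_sz,
				   bool *new_hstate_pte);

	/* caller, after a successful walk: */
	if (new_hstate_pte) {
		get_page(hpage);
		/* ...and increment the compound mapcount once as well */
	}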
On 18.01.23 16:35, Peter Xu wrote: > On Wed, Jan 18, 2023 at 10:43:47AM +0100, David Hildenbrand wrote: >> On 18.01.23 00:11, James Houghton wrote: >>> On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 12.01.23 22:33, Peter Xu wrote: >>>>> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote: >>>>>> I'll look into it, but doing it this way will use _mapcount, so we >>>>>> won't be able to use the vmemmap optimization. I think even if we do >>>>>> use Hugh's approach, refcount is still being kept on the head page, so >>>>>> there's still an overflow risk there (but maybe I am >>>>>> misunderstanding). >>>>> >>>>> Could you remind me what's the issue if using refcount on the small pages >>>>> rather than the head (assuming vmemmap still can be disabled)? >>>> >>>> The THP-way of doing things is refcounting on the head page. All folios >>>> use a single refcount on the head. >>>> >>>> There has to be a pretty good reason to do it differently. >>> >>> Peter and I have discussed this a lot offline. There are two main problems here: >>> >>> 1. Refcount overflow >>> >>> Refcount is always kept on the head page (before and after this >>> series). IIUC, this means that if THPs could be 1G in size, they too >>> would be susceptible to the same potential overflow. How easy is the >>> overflow? [1] >> >> Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64 >> processes with 64k VMAs. Not impossible to achieve if one really wants to >> break the system ... >> >> Side note: a long long time ago, we used to have sub-page refcounts for THP. >> IIRC, that was even before we had sub-page mapcounts and was used to make >> COW decisions. >> >>> >>> To deal with this, the best solution we've been able to come up with >>> is to check if refcount is > INT_MAX/2 (similar to try_get_page()), >>> and if it is, stop the operation (UFFDIO_CONTINUE or a page fault) >>> from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the >>> page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't >>> want to kill a random process). >> >> You'd have to also make sure that fork() won't do the same. At least with >> uffd-wp, Peter also added page table copying during fork() for MAP_SHARED >> mappings, which would have to be handled. > > If we want such a check to make a real difference, IIUC we may want to > consider having similar check in: > > page_ref_add > page_ref_inc > page_ref_inc_return > page_ref_add_unless > > But it's unfortunate that mostly all the callers to these functions > (especially the first two) do not have a retval yet at all. Considering > the low possibility so far on having it overflow, maybe it can also be done > for later (and I think checking negative as try_get_page would suffice too). > >> >> Of course, one can just disallow fork() with any HGM right from the start >> and keep it all simpler to not open up a can of worms there. >> >> Is it reasonable, to have more than one (or a handful) of VMAs mapping a >> huge page via a HGM? Restricting it to a single one, would make handling >> much easier. >> >> If there is ever demand for more HGM mappings, that whole problem (and >> complexity) could be dealt with later. ... but I assume it will already be a >> requirement for VMs (e.g., under QEMU) that share memory with other >> processes (virtiofsd and friends?) > > Yes, any form of multi-proc QEMU will need that for supporting HGM > postcopy. > >> >> >>> >>> (So David, I think this answers your question. 
Refcount should be >>> handled just like THPs.) >>> >>> 2. page_mapcount() API differences >>> >>> In this series, page_mapcount() returns the total number of page table >>> references for the compound page. For example, if you have a >>> PTE-mapped 2M page (with no other mappings), page_mapcount() for each >>> 4K page will be 512. This is not the same as a THP: page_mapcount() >>> would return 1 for each page. Because of the difference in >>> page_mapcount(), we have 4 problems: >> >> IMHO, it would actually be great to just be able to remove the sub-page >> mapcounts for THP and make it all simpler. >> >> Right now, the sub-page mapcount is mostly required for making COW >> decisions, but only for accounting purposes IIRC (NR_ANON_THPS, >> NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See >> page_remove_rmap(). >> >> If we can avoid that complexity right from the start for hugetlb, great, .. >> >>> >>> i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is >>> "private_hugetlb" or "shared_hugetlb". >>> ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if >>> the hugepage is shared or not. Pages that would otherwise be migrated >>> now require MPOL_MF_MOVE_ALL to be migrated. >>> [Really both of the above are checking how many VMAs are mapping our hugepage.] >>> iii. CoW. This isn't a problem right now because CoW is only possible >>> with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs. >>> iv. The hwpoison handling code will check if it successfully unmapped >>> the poisoned page. This isn't a problem right now, as hwpoison will >>> unmap all the mappings for the hugepage, not just the 4K where the >>> poison was found. >>> >>> Doing it this way allows HGM to remain compatible with the hugetlb >>> vmemmap optimization. None of the above problems strike me as >>> particularly major, but it's unclear to me how important it is to have >>> page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb. >> >> See below, maybe we should tackle HGM from a different direction. >> >>> >>> The other way page_mapcount() (let's say the "THP-like way") could be >>> done is like this: increment compound mapcount if we're mapping a >>> hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at >>> high-granularity, increment the mapcount for each 4K page that is >>> getting mapped (e.g., PMD within a 1G page: increment the mapcount for >>> the 512 pages that are now mapped). This yields the same >>> page_mapcount() API we had before, but we lose the hugetlb vmemmap >>> optimization. >>> >>> We could introduce an API like hugetlb_vma_mapcount() that would, for >>> hugetlb, give us the number of VMAs that map a hugepage, but I don't >>> think people would like this. >>> >>> I'm curious what others think (Mike, Matthew?). I'm guessing the >>> THP-like way is probably what most people would want, though it would >>> be a real shame to lose the vmemmap optimization. >> >> Heh, not me :) Having a single mapcount is certainly much cleaner. ... and >> if we're dealing with refcount overflows already, mapcount overflows are not >> an issue. >> >> >> I wonder if the following crazy idea has already been discussed: treat the >> whole mapping as a single large logical mapping. One reference and one >> mapping, no matter how the individual parts are mapped into the assigned >> page table sub-tree. 
>> Because for hugetlb with MAP_SHARED, we know that the complete assigned
>> sub-tree of page tables can only map the given hugetlb page, no fragments of
>> something else. That's very different to THP in private mappings ...
>>
>> So as soon as the first piece gets mapped, we increment refcount+mapcount.
>> Other pieces in the same subtree don't do that.
>>
>> Once the last piece is unmapped (or simpler: once the complete subtree of
>> page tables is gone), we decrement refcount+mapcount. Might require some
>> brain power to do this tracking, but I wouldn't call it impossible right
>> from the start.
>>
>> Would such a design violate other design aspects that are important?
>
> The question is how to maintaining above information.

Right.

> It needs to be per-map (so one page mapped multiple times can be accounted
> differently), and per-page (so one mapping/vma can contain multiple pages).
> So far I think that's exactly the pgtable. If we can squeeze information
> into the pgtable it'll work out, but definitely not trivial. Or we can
> maintain seperate allocates for such information, but that can be extra
> overheads too.

If there is no sub-pgtable level, there is certainly no HGM. If there is a sub-pgtable, we can most probably store that information in that pgtable's memmap. Maybe simply a pointer to the hugetlb page. As long as the pointer is there, we increment the mapcount/refcount. Either directly, or via some additional metadata.

Metadata should be small and most probably "nothing relevant in size" compared to the actual 1 GiB page or the 2 MiB+ of page tables needed to cover 1 GiB.

We could even teach most pgtable walkers to just assume that "logically" there is simply a hugetlb page mapped, without traversing the actual sub-pgtables. IIUC, only pgtable walkers that actually want to access page content (page faults, pinning) or change PTEs (mprotect, uffd) would really care. Maybe stuff like smaps could just say "well, there is a hugetlb page mapped" and continue. Just a thought.

> So far I'd still consider going with reusing thp mapcounts, which will
> mostly be what James mentioned above. The only difference is I'm not sure
> whether we should allow mapping e.g. 2M ranges for 1G pages. THP mapcount
> doesn't have intermediate layer to maintain mapcount information like 2M,
> so to me it's easier we start with only mapping either the hpage size or
> PAGE_SIZE, not any intermediate size allowed.
>
> Having intermediate size mapping allowed can at least be error prone to
> me. One example is if some pgtable walker found a 2M page, it may easily
> fetch the PFN out of it, assuming it's a compound page and it should
> satisfy PageHead(pfn)==true but it'll start to break here, because the 2M
> PFN will only be a small page pfn for the 1G huge page in this case.
>
> To me, intermediate sized mappings are good to have but not required to
> resolve HGM problems, at least so far. Said that, I'm fine with looking at
> what it'll look like if James would like to keep persuing that direction.

Yeah, that was just an idea from my side to avoid most of the refcount and mapcount issues -- in theory :)

Let me think about that all a bit more ...
>>> Once the last piece is unmapped (or simpler: once the complete subtree of
>>> page tables is gone), we decrement refcount+mapcount. Might require some
>>> brain power to do this tracking, but I wouldn't call it impossible right
>>> from the start.
>>>
>>> Would such a design violate other design aspects that are important?
>
> This is actually how mapcount was treated in HGM RFC v1 (though not
> refcount); it is doable for both [2].
>
> One caveat here: if a page is unmapped in small pieces, it is
> difficult to know if the page is legitimately completely unmapped (we
> would have to check all the PTEs in the page table). In RFC v1, I
> sidestepped this caveat by saying that "page_mapcount() is incremented
> if the hstate-level PTE is present". A single unmap on the whole
> hugepage will clear the hstate-level PTE, thus decrementing the
> mapcount.
>
> On a related note, there still exists an (albeit minor) API difference
> vs. THPs: a piece of a page that is legitimately unmapped can still
> have a positive page_mapcount().
>
> Given that this approach allows us to retain the hugetlb vmemmap
> optimization (and it wouldn't require a horrible amount of
> complexity), I prefer this approach over the THP-like approach.

If we can store (directly/indirectly) metadata in the highest pgtable that HGM-maps a hugetlb page, I guess what would be reasonable is:

* hugetlb page pointer
* mapped size

Whenever mapping/unmapping sub-parts, we'd have to update that information. Once "mapped size" drops to 0, we know that the hugetlb page was completely unmapped and we can drop the refcount+mapcount and clear the metadata (including the hugetlb page pointer) [+ remove the page tables?]. Similarly, once "mapped size" corresponds to the hugetlb size, we can immediately spot that everything is mapped.

Again, just a high-level idea.
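A rough sketch of what that per-subtree metadata could look like (the names are made up for illustration; where it would actually live -- e.g. in the struct page of the page-table page -- is exactly the open question here):

	struct hgm_subtree_meta {
		struct page   *hugetlb_page;	/* the hugepage this subtree maps   */
		unsigned long  mapped_bytes;	/* how much of it is mapped in here */
	};

	/*
	 * mapped_bytes going 0 -> nonzero: take the single refcount and
	 * compound mapcount on hugetlb_page.
	 * mapped_bytes dropping back to 0: release them and clear the
	 * metadata (and possibly free the sub-page-tables).
	 * mapped_bytes == huge_page_size(h): the whole hugepage is mapped
	 * again, i.e. the "everything is mapped" case mentioned above.
	 */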
On 01/18/23 08:39, James Houghton wrote: > > > > To deal with this, the best solution we've been able to come up with > > > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()), > > > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault) > > > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the > > > > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't > > > > want to kill a random process). > > > > > > You'd have to also make sure that fork() won't do the same. At least with > > > uffd-wp, Peter also added page table copying during fork() for MAP_SHARED > > > mappings, which would have to be handled. > > Indeed, thank you! copy_hugetlb_page_range() (and therefore fork() in > some cases) would need this check too. > > > > > If we want such a check to make a real difference, IIUC we may want to > > consider having similar check in: > > > > page_ref_add > > page_ref_inc > > page_ref_inc_return > > page_ref_add_unless > > > > But it's unfortunate that mostly all the callers to these functions > > (especially the first two) do not have a retval yet at all. Considering > > the low possibility so far on having it overflow, maybe it can also be done > > for later (and I think checking negative as try_get_page would suffice too). > > I think as long as we annotate the few cases that userspace can > exploit to overflow refcount, we should be ok. I think this was the > same idea with try_get_page(): it is supposed to be used in places > that userspace could reasonably exploit to overflow refcount. > > > > > > > > > Of course, one can just disallow fork() with any HGM right from the start > > > and keep it all simpler to not open up a can of worms there. > > > > > > Is it reasonable, to have more than one (or a handful) of VMAs mapping a > > > huge page via a HGM? Restricting it to a single one, would make handling > > > much easier. > > > > > > If there is ever demand for more HGM mappings, that whole problem (and > > > complexity) could be dealt with later. ... but I assume it will already be a > > > requirement for VMs (e.g., under QEMU) that share memory with other > > > processes (virtiofsd and friends?) > > > > Yes, any form of multi-proc QEMU will need that for supporting HGM > > postcopy. > > +1. Disallowing fork() entirely is quite a restrictive limitation. > > [snip] > > > > I'm curious what others think (Mike, Matthew?). I'm guessing the > > > > THP-like way is probably what most people would want, though it would > > > > be a real shame to lose the vmemmap optimization. > > > > > > Heh, not me :) Having a single mapcount is certainly much cleaner. ... and > > > if we're dealing with refcount overflows already, mapcount overflows are not > > > an issue. > > > > > > > > > I wonder if the following crazy idea has already been discussed: treat the > > > whole mapping as a single large logical mapping. One reference and one > > > mapping, no matter how the individual parts are mapped into the assigned > > > page table sub-tree. > > > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned > > > sub-tree of page tables can only map the given hugetlb page, no fragments of > > > something else. That's very different to THP in private mappings ... > > > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount. > > > Other pieces in the same subtree don't do that. 
> > > > > > Once the last piece is unmapped (or simpler: once the complete subtree of > > > page tables is gone), we decrement refcount+mapcount. Might require some > > > brain power to do this tracking, but I wouldn't call it impossible right > > > from the start. > > > > > > Would such a design violate other design aspects that are important? > > This is actually how mapcount was treated in HGM RFC v1 (though not > refcount); it is doable for both [2]. My apologies for being late to the party :) When Peter first brought up the issue with ref/map_count overflows I was thinking that we should use a scheme like David describes above. As James points out, this was the approach taken in the first RFC. > One caveat here: if a page is unmapped in small pieces, it is > difficult to know if the page is legitimately completely unmapped (we > would have to check all the PTEs in the page table). Are we allowing unmapping of small (non-huge page sized) areas with HGM? We must be if you are concerned with it. What API would cause this? I just do not remember this discussion. > would have to check all the PTEs in the page table). In RFC v1, I > sidestepped this caveat by saying that "page_mapcount() is incremented > if the hstate-level PTE is present". A single unmap on the whole > hugepage will clear the hstate-level PTE, thus decrementing the > mapcount. > > On a related note, there still exists an (albeit minor) API difference > vs. THPs: a piece of a page that is legitimately unmapped can still > have a positive page_mapcount(). > > Given that this approach allows us to retain the hugetlb vmemmap > optimization (and it wouldn't require a horrible amount of > complexity), I prefer this approach over the THP-like approach. Me too. > > > > The question is how to maintaining above information. > > > > It needs to be per-map (so one page mapped multiple times can be accounted > > differently), and per-page (so one mapping/vma can contain multiple pages). > > So far I think that's exactly the pgtable. If we can squeeze information > > into the pgtable it'll work out, but definitely not trivial. Or we can > > maintain seperate allocates for such information, but that can be extra > > overheads too. > > I don't think we necessarily need to check the page table if we allow > for the limitations stated above. > When I was thinking about this I was a bit concerned about having enough information to know exactly when to inc or dec counts. I was actually worried about knowing to do the increment. I don't recall how it was done in the first RFC, but from a high level it would need to be done when the first hstate level PTE is allocated/added to the page table. Right? My concern was with all the places where we could 'error out' after allocating the PTE, but before initializing it. I was just thinking that we might need to scan the page table or keep metadata for better or easier accounting. I think Peter mentioned it elsewhere, we should come up with a workable scheme for HGM ref/map counting. This can be done somewhat independently.
> > > > I wonder if the following crazy idea has already been discussed: treat the > > > > whole mapping as a single large logical mapping. One reference and one > > > > mapping, no matter how the individual parts are mapped into the assigned > > > > page table sub-tree. > > > > > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of > > > > something else. That's very different to THP in private mappings ... > > > > > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount. > > > > Other pieces in the same subtree don't do that. > > > > > > > > Once the last piece is unmapped (or simpler: once the complete subtree of > > > > page tables is gone), we decrement refcount+mapcount. Might require some > > > > brain power to do this tracking, but I wouldn't call it impossible right > > > > from the start. > > > > > > > > Would such a design violate other design aspects that are important? > > > > This is actually how mapcount was treated in HGM RFC v1 (though not > > refcount); it is doable for both [2]. > > My apologies for being late to the party :) > > When Peter first brought up the issue with ref/map_count overflows I was > thinking that we should use a scheme like David describes above. As > James points out, this was the approach taken in the first RFC. > > > One caveat here: if a page is unmapped in small pieces, it is > > difficult to know if the page is legitimately completely unmapped (we > > would have to check all the PTEs in the page table). > > Are we allowing unmapping of small (non-huge page sized) areas with HGM? > We must be if you are concerned with it. What API would cause this? > I just do not remember this discussion. There was some discussion about allowing MADV_DONTNEED on less-than-hugepage pieces [3] (it actually motivated the switch from UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented in this series, but it could be implemented in the future. In the more immediate future, we want hwpoison to unmap at 4K, so MADV_HWPOISON would be a mechanism that userspace is granted to do this. > > > would have to check all the PTEs in the page table). In RFC v1, I > > sidestepped this caveat by saying that "page_mapcount() is incremented > > if the hstate-level PTE is present". A single unmap on the whole > > hugepage will clear the hstate-level PTE, thus decrementing the > > mapcount. > > > > On a related note, there still exists an (albeit minor) API difference > > vs. THPs: a piece of a page that is legitimately unmapped can still > > have a positive page_mapcount(). > > > > Given that this approach allows us to retain the hugetlb vmemmap > > optimization (and it wouldn't require a horrible amount of > > complexity), I prefer this approach over the THP-like approach. > > Me too. > > > > > > > The question is how to maintaining above information. > > > > > > It needs to be per-map (so one page mapped multiple times can be accounted > > > differently), and per-page (so one mapping/vma can contain multiple pages). > > > So far I think that's exactly the pgtable. If we can squeeze information > > > into the pgtable it'll work out, but definitely not trivial. Or we can > > > maintain seperate allocates for such information, but that can be extra > > > overheads too. > > > > I don't think we necessarily need to check the page table if we allow > > for the limitations stated above. 
> > > > When I was thinking about this I was a bit concerned about having enough > information to know exactly when to inc or dec counts. I was actually > worried about knowing to do the increment. I don't recall how it was > done in the first RFC, but from a high level it would need to be done > when the first hstate level PTE is allocated/added to the page table. > Right? My concern was with all the places where we could 'error out' > after allocating the PTE, but before initializing it. I was just thinking > that we might need to scan the page table or keep metadata for better > or easier accounting. The only two places where we can *create* a high-granularity page table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and copy_hugetlb_page_range. RFC v1 did not properly deal with the cases where we error out. To correctly handle these cases, we basically have to do the pagecache lookup before touching the page table. 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do so immediately. We can easily pass the page into hugetlb_mcopy_atomic_pte() (via 'pagep') . 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the lookup before we do the page table walk. I'm not sure how to support non-shared HGM mappings with this scheme (in this series, we also don't support non-shared; we return -EINVAL). NB: The only case where high-granularity mappings for !VM_MAYSHARE VMAs would come up is as a result of hwpoison. So we can avoid keeping additional metadata for what this series is trying to accomplish, but if the above isn't acceptable, then I/we can try to come up with a scheme that would be acceptable. There is also the possibility that the scheme implemented in this version of the series is acceptable (i.e., the page_mapcount() API difference, which results in slightly modified page migration behavior and smaps output, is ok... assuming we have the refcount overflow check). > > I think Peter mentioned it elsewhere, we should come up with a workable > scheme for HGM ref/map counting. This can be done somewhat independently. FWIW, what makes the most sense to me right now is to implement the THP-like scheme and mark HGM as mutually exclusive with the vmemmap optimization. We can later come up with a scheme that lets us retain compatibility. (Is that what you mean by "this can be done somewhat independently", Mike?) [3]: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/T/#m9a1090108b61d32c04b68a1f3f2577644823a999 - James
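To make the ordering in point 1 concrete, a sketch (most helper names here are hypothetical; hugetlb_mcopy_atomic_pte() and 'pagep' are the ones referred to above):

	/* 1. pagecache lookup first, before touching the page table */
	page = find_get_page(mapping, idx);
	if (!page)
		return -EFAULT;

	/* 2. walk/allocate the HGM page tables; nothing is accounted yet */
	ret = hugetlb_hgm_walk_alloc(dst_vma, &hpte, dst_addr, sz, &new_hstate_pte);
	if (ret) {
		put_page(page);		/* error out with nothing to undo */
		return ret;
	}

	/* 3. charge ref/mapcount once, only if we installed the hstate PTE */
	if (new_hstate_pte)
		account_hgm_mapping(page);

	/* 4. hand 'page' down via 'pagep' into hugetlb_mcopy_atomic_pte() */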
On 01/19/23 08:57, James Houghton wrote:
> > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > page table sub-tree.
> > > > >
> > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > something else. That's very different to THP in private mappings ...
> > > > >
> > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > Other pieces in the same subtree don't do that.
> > > > >
> > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > from the start.
> > > > >
> > > > > Would such a design violate other design aspects that are important?
> > >
> > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > refcount); it is doable for both [2].
> >
> > My apologies for being late to the party :)
> >
> > When Peter first brought up the issue with ref/map_count overflows I was
> > thinking that we should use a scheme like David describes above. As
> > James points out, this was the approach taken in the first RFC.
> >
> > > One caveat here: if a page is unmapped in small pieces, it is
> > > difficult to know if the page is legitimately completely unmapped (we
> > > would have to check all the PTEs in the page table).
> >
> > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > We must be if you are concerned with it. What API would cause this?
> > I just do not remember this discussion.
>
> There was some discussion about allowing MADV_DONTNEED on
> less-than-hugepage pieces [3] (it actually motivated the switch from
> UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> in this series, but it could be implemented in the future.

OK, so we do not actually create HGM mappings until a uffd operation is done at a less than huge page size granularity. MADV_SPLIT just says that HGM mappings are 'possible' for this vma. Hopefully, my understanding is correct.

I was concerned about things like the page fault path, but in that case we have already 'entered HGM mode' via a uffd operation.

Both David and Peter have asked whether eliminating intermediate mapping levels would be a simplification. I trust your response that it would not help much in the current design/implementation. But, it did get me thinking about something else.

Perhaps we have discussed this before, and perhaps it does not meet all user needs, but one way to possibly simplify this is:

- 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate) granularity.
- MADV_SPLIT implicitly unmaps everything within the range.
- MADV_SPLIT says all mappings for this vma will now be done at a base (4K) page size granularity. The vma would be marked some way.
- I think this eliminates the need for hugetlb_pte's as we KNOW the mapping size.
- We still use huge pages to back 4K mappings, and we still have to deal with the ref/map_count issues.
- Code touching hugetlb page tables would KNOW the mapping size up front.

Again, apologies if we talked about and previously dismissed this type of approach.
> > When I was thinking about this I was a bit concerned about having enough > > information to know exactly when to inc or dec counts. I was actually > > worried about knowing to do the increment. I don't recall how it was > > done in the first RFC, but from a high level it would need to be done > > when the first hstate level PTE is allocated/added to the page table. > > Right? My concern was with all the places where we could 'error out' > > after allocating the PTE, but before initializing it. I was just thinking > > that we might need to scan the page table or keep metadata for better > > or easier accounting. > > The only two places where we can *create* a high-granularity page > table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and > copy_hugetlb_page_range. RFC v1 did not properly deal with the cases > where we error out. To correctly handle these cases, we basically have > to do the pagecache lookup before touching the page table. > > 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the > PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do > so immediately. We can easily pass the page into > hugetlb_mcopy_atomic_pte() (via 'pagep') . > > 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the > lookup before we do the page table walk. I'm not sure how to support > non-shared HGM mappings with this scheme (in this series, we also > don't support non-shared; we return -EINVAL). > NB: The only case where high-granularity mappings for !VM_MAYSHARE > VMAs would come up is as a result of hwpoison. > > So we can avoid keeping additional metadata for what this series is > trying to accomplish, but if the above isn't acceptable, then I/we can > try to come up with a scheme that would be acceptable. Ok, I was thinking we had to deal with other code paths such as page fault. But, now I understand that is not the case with this design. > There is also the possibility that the scheme implemented in this > version of the series is acceptable (i.e., the page_mapcount() API > difference, which results in slightly modified page migration behavior > and smaps output, is ok... assuming we have the refcount overflow > check). > > > > > I think Peter mentioned it elsewhere, we should come up with a workable > > scheme for HGM ref/map counting. This can be done somewhat independently. > > FWIW, what makes the most sense to me right now is to implement the > THP-like scheme and mark HGM as mutually exclusive with the vmemmap > optimization. We can later come up with a scheme that lets us retain > compatibility. (Is that what you mean by "this can be done somewhat > independently", Mike?) Sort of, I was only saying that getting the ref/map counting right seems like a task than can be independently worked. Using the THP-like scheme is good.
On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 01/19/23 08:57, James Houghton wrote: > > > > > > I wonder if the following crazy idea has already been discussed: treat the > > > > > > whole mapping as a single large logical mapping. One reference and one > > > > > > mapping, no matter how the individual parts are mapped into the assigned > > > > > > page table sub-tree. > > > > > > > > > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned > > > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of > > > > > > something else. That's very different to THP in private mappings ... > > > > > > > > > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount. > > > > > > Other pieces in the same subtree don't do that. > > > > > > > > > > > > Once the last piece is unmapped (or simpler: once the complete subtree of > > > > > > page tables is gone), we decrement refcount+mapcount. Might require some > > > > > > brain power to do this tracking, but I wouldn't call it impossible right > > > > > > from the start. > > > > > > > > > > > > Would such a design violate other design aspects that are important? > > > > > > > > This is actually how mapcount was treated in HGM RFC v1 (though not > > > > refcount); it is doable for both [2]. > > > > > > My apologies for being late to the party :) > > > > > > When Peter first brought up the issue with ref/map_count overflows I was > > > thinking that we should use a scheme like David describes above. As > > > James points out, this was the approach taken in the first RFC. > > > > > > > One caveat here: if a page is unmapped in small pieces, it is > > > > difficult to know if the page is legitimately completely unmapped (we > > > > would have to check all the PTEs in the page table). > > > > > > Are we allowing unmapping of small (non-huge page sized) areas with HGM? > > > We must be if you are concerned with it. What API would cause this? > > > I just do not remember this discussion. > > > > There was some discussion about allowing MADV_DONTNEED on > > less-than-hugepage pieces [3] (it actually motivated the switch from > > UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented > > in this series, but it could be implemented in the future. > > OK, so we do not actually create HGM mappings until a uffd operation is > done at a less than huge page size granularity. MADV_SPLIT just says > that HGM mappings are 'possible' for this vma. Hopefully, my understanding > is correct. Right, that's the current meaning of MADV_SPLIT for hugetlb. > I was concerned about things like the page fault path, but in that case > we have already 'entered HGM mode' via a uffd operation. > > Both David and Peter have asked whether eliminating intermediate mapping > levels would be a simplification. I trust your response that it would > not help much in the current design/implementation. But, it did get me > thinking about something else. > > Perhaps we have discussed this before, and perhaps it does not meet all > user needs, but one way possibly simplify this is: > > - 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate) > granularity. > - MADV_SPLIT implicitly unmaps everything with in the range. > - MADV_SPLIT says all mappings for this vma will now be done at a base > (4K) page size granularity. vma would be marked some way. > - I think this eliminates the need for hugetlb_pte's as we KNOW the > mapping size. 
> - We still use huge pages to back 4K mappings, and we still have to deal > with the ref/map_count issues. > - Code touching hugetlb page tables would KNOW the mapping size up front. > > Again, apologies if we talked about and previously dismissed this type > of approach. I think Peter was the one who originally suggested an approach like this, and it meets my needs. However, I still think the way that things are currently implemented is the right way to go. Assuming we want decent performance, we can't get away with the same strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE should be based on the PMD above the PTE, so we need to either pass around the PMD above our PTE, or we need to pass around the PTL. This is something that hugetlb_pte does for us, so, in some sense, even going with this simpler approach, we still need a hugetlb_pte-like construct. Although most of the other complexity that HGM introduces would have to be introduced either way (like having to deal with putting compound_head()/page_folio() in more places and doing some per-architecture updates), there are some complexities that the simpler approach avoids: - We avoid problems related to compound PTEs (the problem being: two threads racing to populate a contiguous and non-contiguous PTE that take up the same space could lead to user-detectable incorrect behavior. This isn't hard to fix; it will be when I send the arm64 patches up.) - We don't need to check if PTEs get split from under us in PT walks. (In a lot of cases, the appropriate action is just to treat the PTE as if it were pte_none().) In the same vein, we don't need hugetlb_pte_present_leaf() at all, because PTEs we find will always be leaves. - We don't have to deal with sorting hstates or implementing for_each_hgm_shift()/hugetlb_alloc_largest_pte(). None of these complexities are particularly major in my opinion. This might seem kind of contrived, but let's say you have a VM with 1T of memory, and you find 100 memory errors all in different 1G pages over the life of this VM (years, potentially). Having 10% of your memory be 4K-mapped is definitely worse than having 10% be 2M-mapped (lost performance and increased memory overhead). There might be other cases in the future where being able to have intermediate mapping sizes could be helpful. > > > When I was thinking about this I was a bit concerned about having enough > > > information to know exactly when to inc or dec counts. I was actually > > > worried about knowing to do the increment. I don't recall how it was > > > done in the first RFC, but from a high level it would need to be done > > > when the first hstate level PTE is allocated/added to the page table. > > > Right? My concern was with all the places where we could 'error out' > > > after allocating the PTE, but before initializing it. I was just thinking > > > that we might need to scan the page table or keep metadata for better > > > or easier accounting. > > > > The only two places where we can *create* a high-granularity page > > table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and > > copy_hugetlb_page_range. RFC v1 did not properly deal with the cases > > where we error out. To correctly handle these cases, we basically have > > to do the pagecache lookup before touching the page table. > > > > 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the > > PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do > > so immediately. 
We can easily pass the page into > > hugetlb_mcopy_atomic_pte() (via 'pagep') . > > > > 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the > > lookup before we do the page table walk. I'm not sure how to support > > non-shared HGM mappings with this scheme (in this series, we also > > don't support non-shared; we return -EINVAL). > > NB: The only case where high-granularity mappings for !VM_MAYSHARE > > VMAs would come up is as a result of hwpoison. > > > > So we can avoid keeping additional metadata for what this series is > > trying to accomplish, but if the above isn't acceptable, then I/we can > > try to come up with a scheme that would be acceptable. > > Ok, I was thinking we had to deal with other code paths such as page > fault. But, now I understand that is not the case with this design. > > > There is also the possibility that the scheme implemented in this > > version of the series is acceptable (i.e., the page_mapcount() API > > difference, which results in slightly modified page migration behavior > > and smaps output, is ok... assuming we have the refcount overflow > > check). > > > > > > > > I think Peter mentioned it elsewhere, we should come up with a workable > > > scheme for HGM ref/map counting. This can be done somewhat independently. > > > > FWIW, what makes the most sense to me right now is to implement the > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap > > optimization. We can later come up with a scheme that lets us retain > > compatibility. (Is that what you mean by "this can be done somewhat > > independently", Mike?) > > Sort of, I was only saying that getting the ref/map counting right seems > like a task than can be independently worked. Using the THP-like scheme > is good. Ok! So if you're ok with the intermediate mapping sizes, it sounds like I should go ahead and implement the THP-like scheme. - James
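A minimal sketch of the construct described above, with illustrative names rather than the exact definitions from the series: the point is simply that the PTE pointer travels together with its mapping size and the lock that protects it, so a 4K leaf under a hugetlb VMA never needs the caller to separately locate the PMD above it.

#include <linux/mm.h>
#include <linux/spinlock.h>

/*
 * Sketch only: bundle the entry with its size and its lock so callers of
 * a hugetlb page table walk do not have to re-derive the split PTL for
 * 4K leaves.
 */
struct hgm_pte_sketch {
	pte_t *ptep;		/* entry of any level, not only hstate-sized */
	unsigned int shift;	/* mapping size: PAGE_SHIFT, PMD_SHIFT, ... */
	spinlock_t *ptl;	/* lock protecting *ptep, captured during the walk */
};

static inline spinlock_t *hgm_pte_lockptr(struct hgm_pte_sketch *hpte)
{
	return hpte->ptl;
}

static inline unsigned long hgm_pte_size(struct hgm_pte_sketch *hpte)
{
	return 1UL << hpte->shift;
}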
On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote: > - We avoid problems related to compound PTEs (the problem being: two > threads racing to populate a contiguous and non-contiguous PTE that > take up the same space could lead to user-detectable incorrect > behavior. This isn't hard to fix; it will be when I send the arm64 > patches up.) Could you elaborate this one a bit more? > This might seem kind of contrived, but let's say you have a VM with 1T > of memory, and you find 100 memory errors all in different 1G pages > over the life of this VM (years, potentially). Having 10% of your > memory be 4K-mapped is definitely worse than having 10% be 2M-mapped > (lost performance and increased memory overhead). There might be other > cases in the future where being able to have intermediate mapping > sizes could be helpful. This is not the norm, or is it? How the possibility of bad pages can distribute over hosts over years? This can definitely affect how we should target the intermediate level mappings. Thanks,
On 01/19/23 11:42, James Houghton wrote: > On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 01/19/23 08:57, James Houghton wrote: > > > > OK, so we do not actually create HGM mappings until a uffd operation is > > done at a less than huge page size granularity. MADV_SPLIT just says > > that HGM mappings are 'possible' for this vma. Hopefully, my understanding > > is correct. > > Right, that's the current meaning of MADV_SPLIT for hugetlb. > > > I was concerned about things like the page fault path, but in that case > > we have already 'entered HGM mode' via a uffd operation. > > > > Both David and Peter have asked whether eliminating intermediate mapping > > levels would be a simplification. I trust your response that it would > > not help much in the current design/implementation. But, it did get me > > thinking about something else. > > > > Perhaps we have discussed this before, and perhaps it does not meet all > > user needs, but one way possibly simplify this is: > > > > - 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate) > > granularity. > > - MADV_SPLIT implicitly unmaps everything with in the range. > > - MADV_SPLIT says all mappings for this vma will now be done at a base > > (4K) page size granularity. vma would be marked some way. > > - I think this eliminates the need for hugetlb_pte's as we KNOW the > > mapping size. > > - We still use huge pages to back 4K mappings, and we still have to deal > > with the ref/map_count issues. > > - Code touching hugetlb page tables would KNOW the mapping size up front. > > > > Again, apologies if we talked about and previously dismissed this type > > of approach. > > I think Peter was the one who originally suggested an approach like > this, and it meets my needs. However, I still think the way that > things are currently implemented is the right way to go. > > Assuming we want decent performance, we can't get away with the same > strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE > should be based on the PMD above the PTE, so we need to either pass > around the PMD above our PTE, or we need to pass around the PTL. This > is something that hugetlb_pte does for us, so, in some sense, even > going with this simpler approach, we still need a hugetlb_pte-like > construct. Agree there is this performance hit. However, the 'simplest' approach would be to just use the page table lock as is done by default for 4K PTEs. I do not know much about the (primary) live migration use case. My guess is that page table lock contention may be an issue? In this use case, HGM is only enabled for the duration the live migration operation, then a MADV_COLLAPSE is performed. If contention is likely to be an issue during this time, then yes we would need to pass around with something like hugetlb_pte. > Although most of the other complexity that HGM introduces would have > to be introduced either way (like having to deal with putting > compound_head()/page_folio() in more places and doing some > per-architecture updates), there are some complexities that the > simpler approach avoids: > > - We avoid problems related to compound PTEs (the problem being: two > threads racing to populate a contiguous and non-contiguous PTE that > take up the same space could lead to user-detectable incorrect > behavior. This isn't hard to fix; it will be when I send the arm64 > patches up.) > > - We don't need to check if PTEs get split from under us in PT walks. 
> (In a lot of cases, the appropriate action is just to treat the PTE as > if it were pte_none().) In the same vein, we don't need > hugetlb_pte_present_leaf() at all, because PTEs we find will always be > leaves. > > - We don't have to deal with sorting hstates or implementing > for_each_hgm_shift()/hugetlb_alloc_largest_pte(). > > None of these complexities are particularly major in my opinion. Perhaps not. I was just thinking about the overall complexity of the hugetlb code after HGM. Currently, it is 'relatively simple' with fixed huge page sizes. IMO, much simpler than THP with two possible mapping sizes. With HGM and intermediate mapping sizes, it seems things could get more complicated than THP. Perhaps it is just me. I am just too familiar with the current code and a bit anxious about added complexity. But, I felt the same about vmemmap optimizations. :) > This might seem kind of contrived, but let's say you have a VM with 1T > of memory, and you find 100 memory errors all in different 1G pages > over the life of this VM (years, potentially). Having 10% of your > memory be 4K-mapped is definitely worse than having 10% be 2M-mapped > (lost performance and increased memory overhead). There might be other > cases in the future where being able to have intermediate mapping > sizes could be helpful. That may be a bit contrived. We know memory error handling is a future use case, but I believe there is work outside of HGM than needs to be done to handle such situations. For example, HGM will allow the 1G mapping to isolate the 4K page with error. This prevents errors if you fault almost anywhere within the 1G page. But, there still remains the possibility of accessing that 4K page page with error. IMO, it will require user space/application intervention to address this as only the application knows about the potentially lost data. This is still something that needs to be designed. It would then makes sense for the application to also determine how it wants to proceed WRT mapping the 1G area. Perhaps they will want (and there will exist a mechanism) to migrate the data to a new 1G page without error. > > > > I think Peter mentioned it elsewhere, we should come up with a workable > > > > scheme for HGM ref/map counting. This can be done somewhat independently. > > > > > > FWIW, what makes the most sense to me right now is to implement the > > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap > > > optimization. We can later come up with a scheme that lets us retain > > > compatibility. (Is that what you mean by "this can be done somewhat > > > independently", Mike?) > > > > Sort of, I was only saying that getting the ref/map counting right seems > > like a task than can be independently worked. Using the THP-like scheme > > is good. > > Ok! So if you're ok with the intermediate mapping sizes, it sounds > like I should go ahead and implement the THP-like scheme. Yes, I am OK with it. Just expressed a bit of concern above.
On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote: > I do not know much about the (primary) live migration use case. My > guess is that page table lock contention may be an issue? In this use > case, HGM is only enabled for the duration the live migration operation, > then a MADV_COLLAPSE is performed. If contention is likely to be an > issue during this time, then yes we would need to pass around with > something like hugetlb_pte. I'm not aware of any such contention issue. IMHO the migration problem is majorly about being too slow transferring a page being so large. Shrinking the page size should resolve the major problem already here IIUC. AFAIU 4K-only solution should only reduce any lock contention because locks will always be pte-level if VM_HUGETLB_HGM set. When walking and creating the intermediate pgtable entries we can use atomic ops just like generic mm, so no lock needed at all. With uncertainty on the size of mappings, we'll need to take any of the multiple layers of locks. [...] > > None of these complexities are particularly major in my opinion. > > Perhaps not. I was just thinking about the overall complexity of the > hugetlb code after HGM. Currently, it is 'relatively simple' with > fixed huge page sizes. IMO, much simpler than THP with two possible > mapping sizes. With HGM and intermediate mapping sizes, it seems > things could get more complicated than THP. Perhaps it is just me. Please count me in. :) I'm just still happy to see what it'll look like if James think having that complexity doesn't greatly affect the whole design. Thanks,
On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote: > > I do not know much about the (primary) live migration use case. My > > guess is that page table lock contention may be an issue? In this use > > case, HGM is only enabled for the duration the live migration operation, > > then a MADV_COLLAPSE is performed. If contention is likely to be an > > issue during this time, then yes we would need to pass around with > > something like hugetlb_pte. > > I'm not aware of any such contention issue. IMHO the migration problem is > majorly about being too slow transferring a page being so large. Shrinking > the page size should resolve the major problem already here IIUC. This will be problematic if you scale up VMs to be quite large. Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take the MMU lock for writing in the EPT violation path. We found that this change is required for VMs >200 or so vCPUs to consistently avoid CPU soft lockups in the guest. Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on the same PTL would be problematic in the same way. > > AFAIU 4K-only solution should only reduce any lock contention because locks > will always be pte-level if VM_HUGETLB_HGM set. When walking and creating > the intermediate pgtable entries we can use atomic ops just like generic > mm, so no lock needed at all. With uncertainty on the size of mappings, > we'll need to take any of the multiple layers of locks. > Other than taking the HugeTLB VMA lock for reading, walking/allocating page tables won't need any additional locking. We take the PTL to allocate the next level down, but so does generic mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am misunderstanding. - James
On Thu, Jan 19, 2023 at 12:53 PM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote: > > - We avoid problems related to compound PTEs (the problem being: two > > threads racing to populate a contiguous and non-contiguous PTE that > > take up the same space could lead to user-detectable incorrect > > behavior. This isn't hard to fix; it will be when I send the arm64 > > patches up.) > > Could you elaborate this one a bit more? In hugetlb_mcopy_atomic_pte(), we check that the PTE we're about to overwrite is pte_none() before overwriting it. For contiguous PTEs, this only checks the first PTE in the bunch. If someone came around and populated one of the PTEs that lied in the middle of a potentially contiguous group of PTEs, we could end up overwriting that PTE if we later UFFDIO_CONTINUEd in such a way to create a contiguous PTE. We would expect to get EEXIST here, but in this case the operation would succeed. To fix this, we can just check that ALL the PTEs in the contiguous bunch have the value that we're expecting, not just the first one. hugetlb_no_page() has the same problem, but it's not immediately clear to me how it would result in incorrect behavior. > > > This might seem kind of contrived, but let's say you have a VM with 1T > > of memory, and you find 100 memory errors all in different 1G pages > > over the life of this VM (years, potentially). Having 10% of your > > memory be 4K-mapped is definitely worse than having 10% be 2M-mapped > > (lost performance and increased memory overhead). There might be other > > cases in the future where being able to have intermediate mapping > > sizes could be helpful. > > This is not the norm, or is it? How the possibility of bad pages can > distribute over hosts over years? This can definitely affect how we should > target the intermediate level mappings. I can't really speak for norms generally, but I can try to speak for Google Cloud. Google Cloud hasn't had memory error virtualization for very long (only about a year), but we've seen cases where VMs can pick up several memory errors over a few days/weeks. IMO, 100 errors in separate 1G pages over a few years isn't completely nonsensical, especially if the memory that you're using isn't so reliable or was damaged in shipping (like if it was flown over the poles or something!). Now there is the concern about how an application would handle it. In a VMM's case, we can virtualize the error for the guest. In the guest, it's possible that a good chunk of the errors lie in unused pages and so can be easily marked as poisoned. It's possible that recovery is much more difficult. It's not unreasonable for an application to recover from a lot of memory errors. - James
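A sketch of the check James describes, assuming a hypothetical helper name and that the caller knows how many entries make up the contiguous group (CONT_PTES on arm64): every entry is verified rather than only the first, so a racing UFFDIO_CONTINUE that populated a middle entry is reported as -EEXIST instead of being overwritten.

#include <linux/pgtable.h>

static bool cont_group_is_none(pte_t *ptep, unsigned int nr_entries)
{
	unsigned int i;

	/* Any populated entry in the group means we must not install over it. */
	for (i = 0; i < nr_entries; i++)
		if (!pte_none(ptep_get(ptep + i)))
			return false;
	return true;
}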
On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote: > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote: > > > I do not know much about the (primary) live migration use case. My > > > guess is that page table lock contention may be an issue? In this use > > > case, HGM is only enabled for the duration the live migration operation, > > > then a MADV_COLLAPSE is performed. If contention is likely to be an > > > issue during this time, then yes we would need to pass around with > > > something like hugetlb_pte. > > > > I'm not aware of any such contention issue. IMHO the migration problem is > > majorly about being too slow transferring a page being so large. Shrinking > > the page size should resolve the major problem already here IIUC. > > This will be problematic if you scale up VMs to be quite large. Do you mean that for the postcopy use case one can leverage e.g. 2M mappings (over 1G) to avoid lock contentions when VM is large? I agree it should be more efficient than having 512 4K page installed, but I think it'll make the page fault resolution slower too if some thead is only looking for a 4k portion of it. > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take > the MMU lock for writing in the EPT violation path. We found that this > change is required for VMs >200 or so vCPUs to consistently avoid CPU > soft lockups in the guest. After the kvm mmu rwlock convertion, it'll allow concurrent page faults even if only 4K pages are used, so it seems not directly relevant to what we're discussing here, no? > > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on > the same PTL would be problematic in the same way. Pte-level pgtable lock only covers 2M range, so I think it depends on which is the address that the vcpu is faulted on? IIUC the major case should be that the faulted threads are not falling upon the same 2M range. > > > > > AFAIU 4K-only solution should only reduce any lock contention because locks > > will always be pte-level if VM_HUGETLB_HGM set. When walking and creating > > the intermediate pgtable entries we can use atomic ops just like generic > > mm, so no lock needed at all. With uncertainty on the size of mappings, > > we'll need to take any of the multiple layers of locks. > > > > Other than taking the HugeTLB VMA lock for reading, walking/allocating > page tables won't need any additional locking. Actually when revisiting the locks I'm getting a bit confused on whether the vma lock is needed if pmd sharing is anyway forbidden for HGM. I raised a question in the other patch of MADV_COLLAPSE, maybe they're related questions so we can keep it there. > > We take the PTL to allocate the next level down, but so does generic > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am > misunderstanding. Sorry you're right, please ignore that. I don't know why I had that impression that spinlocks are not needed in that process. Actually I am also curious why atomics won't work (by holding mmap read lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries). I think it's possible I just missed something else.
On Thu, Jan 19, 2023 at 3:07 PM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote: > > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote: > > > > I do not know much about the (primary) live migration use case. My > > > > guess is that page table lock contention may be an issue? In this use > > > > case, HGM is only enabled for the duration the live migration operation, > > > > then a MADV_COLLAPSE is performed. If contention is likely to be an > > > > issue during this time, then yes we would need to pass around with > > > > something like hugetlb_pte. > > > > > > I'm not aware of any such contention issue. IMHO the migration problem is > > > majorly about being too slow transferring a page being so large. Shrinking > > > the page size should resolve the major problem already here IIUC. > > > > This will be problematic if you scale up VMs to be quite large. > > Do you mean that for the postcopy use case one can leverage e.g. 2M > mappings (over 1G) to avoid lock contentions when VM is large I agree it > should be more efficient than having 512 4K page installed, but I think > it'll make the page fault resolution slower too if some thead is only > looking for a 4k portion of it. No, that's not what I meant. Sorry. If you can use the PTL that is normally used for 4K PTEs, then you're right, there is no contention problem. However, this PTL is determined by the value of the PMD, so you need a pointer to the PMD to determine what the PTL should be (or a pointer to the PTL itself). In hugetlb, we only ever pass around the PTE pointer, and we rely on huge_pte_lockptr() to find the PTL for us (and it does so appropriately for everything except 4K PTEs). We would need to add the complexity of passing around a PMD or PTL everywhere, and that's what hugetlb_pte does for us. So that complexity is basically unavoidable, unless you're ok with 4K PTEs with taking mm->page_table_lock (I'm not). > > > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take > > the MMU lock for writing in the EPT violation path. We found that this > > change is required for VMs >200 or so vCPUs to consistently avoid CPU > > soft lockups in the guest. > > After the kvm mmu rwlock convertion, it'll allow concurrent page faults > even if only 4K pages are used, so it seems not directly relevant to what > we're discussing here, no? Right. I was just bringing it up to say that if 4K PTLs were mm->page_table_lock, we would have a problem. > > > > > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on > > the same PTL would be problematic in the same way. > > Pte-level pgtable lock only covers 2M range, so I think it depends on which > is the address that the vcpu is faulted on? IIUC the major case should be > that the faulted threads are not falling upon the same 2M range. Right. I think my comment should make more sense with the above clarification. > > > > > > > > > AFAIU 4K-only solution should only reduce any lock contention because locks > > > will always be pte-level if VM_HUGETLB_HGM set. When walking and creating > > > the intermediate pgtable entries we can use atomic ops just like generic > > > mm, so no lock needed at all. With uncertainty on the size of mappings, > > > we'll need to take any of the multiple layers of locks. 
> > > > > > > Other than taking the HugeTLB VMA lock for reading, walking/allocating > > page tables won't need any additional locking. > > Actually when revisiting the locks I'm getting a bit confused on whether > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I > raised a question in the other patch of MADV_COLLAPSE, maybe they're > related questions so we can keep it there. We can discuss there. :) I take both the VMA lock and mapping lock so that it can stay in sync with huge_pmd_unshare(), and so HGM walks have the same synchronization as regular hugetlb PT walks. > > > > > We take the PTL to allocate the next level down, but so does generic > > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am > > misunderstanding. > > Sorry you're right, please ignore that. I don't know why I had that > impression that spinlocks are not needed in that process. > > Actually I am also curious why atomics won't work (by holding mmap read > lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries). I > think it's possible I just missed something else. I think there are cases where we need to make sure the value of a PTE isn't going to change from under us while we're doing some kind of other operation, and so a compare-and-swap won't really be what we need.
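To illustrate the asymmetry James is pointing at, a sketch under the assumption that huge_pte_lockptr() and pte_lockptr() keep the signatures they had in kernels of this era: hstate-sized entries can find their PTL from the mm and the entry alone, but a 4K leaf wants the split PTL that generic mm keys off the PMD above it, which is exactly the extra context hugetlb_pte carries.

#include <linux/hugetlb.h>
#include <linux/mm.h>

/* Hypothetical helper, not from the series. */
static spinlock_t *hgm_lockptr_sketch(struct hstate *h, struct mm_struct *mm,
				      pmd_t *pmd, pte_t *ptep,
				      unsigned int shift)
{
	if (shift == PAGE_SHIFT)
		return pte_lockptr(mm, pmd);	/* needs the PMD above the PTE */
	return huge_pte_lockptr(h, mm, ptep);	/* hstate-sized entries */
}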
On 01/19/23 18:07, Peter Xu wrote: > > Actually when revisiting the locks I'm getting a bit confused on whether > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I > raised a question in the other patch of MADV_COLLAPSE, maybe they're > related questions so we can keep it there. I can quickly answer that. Yes. The vma lock is also being used for fault/truncation synchronization. Commit e700898fa075 makes sure it is even used on architectures that do not support PMD sharing. I had come up with a rather ugly method of using the fault mutex for fault/truncation synchronization, but using the vma lock was more elegant.
On Thu, Jan 19, 2023 at 03:26:14PM -0800, James Houghton wrote: > On Thu, Jan 19, 2023 at 3:07 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote: > > > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote: > > > > > I do not know much about the (primary) live migration use case. My > > > > > guess is that page table lock contention may be an issue? In this use > > > > > case, HGM is only enabled for the duration the live migration operation, > > > > > then a MADV_COLLAPSE is performed. If contention is likely to be an > > > > > issue during this time, then yes we would need to pass around with > > > > > something like hugetlb_pte. > > > > > > > > I'm not aware of any such contention issue. IMHO the migration problem is > > > > majorly about being too slow transferring a page being so large. Shrinking > > > > the page size should resolve the major problem already here IIUC. > > > > > > This will be problematic if you scale up VMs to be quite large. > > > > Do you mean that for the postcopy use case one can leverage e.g. 2M > > mappings (over 1G) to avoid lock contentions when VM is large I agree it > > should be more efficient than having 512 4K page installed, but I think > > it'll make the page fault resolution slower too if some thead is only > > looking for a 4k portion of it. > > No, that's not what I meant. Sorry. If you can use the PTL that is > normally used for 4K PTEs, then you're right, there is no contention > problem. However, this PTL is determined by the value of the PMD, so > you need a pointer to the PMD to determine what the PTL should be (or > a pointer to the PTL itself). > > In hugetlb, we only ever pass around the PTE pointer, and we rely on > huge_pte_lockptr() to find the PTL for us (and it does so > appropriately for everything except 4K PTEs). We would need to add the > complexity of passing around a PMD or PTL everywhere, and that's what > hugetlb_pte does for us. So that complexity is basically unavoidable, > unless you're ok with 4K PTEs with taking mm->page_table_lock (I'm > not). > > > > > > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take > > > the MMU lock for writing in the EPT violation path. We found that this > > > change is required for VMs >200 or so vCPUs to consistently avoid CPU > > > soft lockups in the guest. > > > > After the kvm mmu rwlock convertion, it'll allow concurrent page faults > > even if only 4K pages are used, so it seems not directly relevant to what > > we're discussing here, no? > > Right. I was just bringing it up to say that if 4K PTLs were > mm->page_table_lock, we would have a problem. Ah I see what you meant. We definitely don't want to use the page_table_lock for sure. So if it's about keeping hugetlb_pte I'm fine with it, no matter what the final version will look like. > > > > > > > > > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on > > > the same PTL would be problematic in the same way. > > > > Pte-level pgtable lock only covers 2M range, so I think it depends on which > > is the address that the vcpu is faulted on? IIUC the major case should be > > that the faulted threads are not falling upon the same 2M range. > > Right. I think my comment should make more sense with the above clarification. 
> > > > > > > > > > > > > > AFAIU 4K-only solution should only reduce any lock contention because locks > > > > will always be pte-level if VM_HUGETLB_HGM set. When walking and creating > > > > the intermediate pgtable entries we can use atomic ops just like generic > > > > mm, so no lock needed at all. With uncertainty on the size of mappings, > > > > we'll need to take any of the multiple layers of locks. > > > > > > > > > > Other than taking the HugeTLB VMA lock for reading, walking/allocating > > > page tables won't need any additional locking. > > > > Actually when revisiting the locks I'm getting a bit confused on whether > > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I > > raised a question in the other patch of MADV_COLLAPSE, maybe they're > > related questions so we can keep it there. > > We can discuss there. :) I take both the VMA lock and mapping lock so > that it can stay in sync with huge_pmd_unshare(), and so HGM walks > have the same synchronization as regular hugetlb PT walks. Sure. :) Now after a 2nd thought I don't think it's unsafe to take the vma write lock here, especially for VM_SHARED. I can't think of anything that will go wrong. It's because we need the vma lock anywhere we'll be walking the pgtables when having mmap_sem read I think, being afraid of having pmd sharing being possible. But I'm not sure whether this is the cleanest way to do it. IMHO the major special part of hugetlb comparing to generic mm on pgtable thread safety. I worry that complicating this lock can potentially make the hugetlb code even more specific, which is not good for the long term if we still have a hope of merging more hugetlb codes with the generic paths. Here since pmd sharing is impossible for HGM, the original vma lock is not needed here. Meanwhile, what we want to guard is the pgtable walkers. They're logically being protected by either mmap lock or the mapping lock (for rmap walkers). Fast-gup is another thing but so far I think it's all safe when you're following the mmu gather facilities. Somehow I had a feeling that the hugetlb vma lock (along with the pgtable sharing explorations in the hugetlb world keeping going..) can keep evolving in the future, and it should be helpful to keep its semantics simple too. So to summarize: I wonder whether we can use mmap write lock and i_mmap_rwsem write lock to protect collapsing for hugetlb, just like what we do with THP collapsing (after Jann's fix). madvise_need_mmap_write() is not easily feasible because it's before the vma scanning so we can't take conditional write lock only for hugetlb, but that's the next question to ask only if we can reach a consensus on the lock scheme first for HGM in general. > > > > > > > > > We take the PTL to allocate the next level down, but so does generic > > > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am > > > misunderstanding. > > > > Sorry you're right, please ignore that. I don't know why I had that > > impression that spinlocks are not needed in that process. > > > > Actually I am also curious why atomics won't work (by holding mmap read > > lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries). I > > think it's possible I just missed something else. > > I think there are cases where we need to make sure the value of a PTE > isn't going to change from under us while we're doing some kind of > other operation, and so a compare-and-swap won't really be what we > need. Currently the pgtable spinlock is only taken during populating the pgtables. 
If it can happen, then it can also happen right after we release the spinlock in e.g. __pmd_alloc(). One thing I can think of is that we may need more done than just installing the pgtable entries, in which case atomics would stop working. E.g. on x86 we have paravirt_alloc_pmd(). But I'm not sure whether that's the only reason.
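For reference, a rough paraphrase (not a verbatim copy) of the pattern __pmd_alloc() in mm/memory.c follows: the new table is allocated outside the lock, the PUD-level lock is held only around the check-and-populate, and the loser of a race frees its allocation. Arch hooks such as x86's paravirt_alloc_pmd() run as part of populating the entry, which is one reason a bare cmpxchg on the entry may not be enough by itself.

#include <linux/mm.h>

static int pmd_alloc_sketch(struct mm_struct *mm, pud_t *pud, unsigned long addr)
{
	spinlock_t *ptl;
	pmd_t *new = pmd_alloc_one(mm, addr);

	if (!new)
		return -ENOMEM;

	ptl = pud_lock(mm, pud);
	if (!pud_present(*pud)) {
		mm_inc_nr_pmds(mm);
		smp_wmb();	/* make the new table visible before the pud entry */
		pud_populate(mm, pud, new);
	} else {
		pmd_free(mm, new);	/* somebody else populated it first */
	}
	spin_unlock(ptl);
	return 0;
}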
Hi, Mike, On Thu, Jan 19, 2023 at 03:44:25PM -0800, Mike Kravetz wrote: > On 01/19/23 18:07, Peter Xu wrote: > > > > Actually when revisiting the locks I'm getting a bit confused on whether > > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I > > raised a question in the other patch of MADV_COLLAPSE, maybe they're > > related questions so we can keep it there. > > I can quickly answer that. Yes. The vma lock is also being used for > fault/truncation synchronization. Commit e700898fa075 make sure it is > even used on architectures that do not support PMD sharing. > > I had come up with a rather ugly method of using the fault mutex for > fault/truncation synchronization, but using the vma lock was more > elegant. Thanks for answering, I'll need to read some more on truncation later. Before that, since COLLAPSE will already require the i_mmap_rwsem write lock already, does it mean it is naturally race-free against truncation even without vma lock?
On 01/23/23 10:19, Peter Xu wrote: > Hi, Mike, > > On Thu, Jan 19, 2023 at 03:44:25PM -0800, Mike Kravetz wrote: > > On 01/19/23 18:07, Peter Xu wrote: > > > > > > Actually when revisiting the locks I'm getting a bit confused on whether > > > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I > > > raised a question in the other patch of MADV_COLLAPSE, maybe they're > > > related questions so we can keep it there. > > > > I can quickly answer that. Yes. The vma lock is also being used for > > fault/truncation synchronization. Commit e700898fa075 make sure it is > > even used on architectures that do not support PMD sharing. > > > > I had come up with a rather ugly method of using the fault mutex for > > fault/truncation synchronization, but using the vma lock was more > > elegant. > > Thanks for answering, I'll need to read some more on truncation later. > Before that, since COLLAPSE will already require the i_mmap_rwsem write > lock already, does it mean it is naturally race-free against truncation > even without vma lock? Yes, and thanks for making me take a closer look at COLLAPSE. :)
On Thu, Jan 19, 2023 at 11:42 AM James Houghton <jthoughton@google.com> wrote: > > On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 01/19/23 08:57, James Houghton wrote: > > > FWIW, what makes the most sense to me right now is to implement the > > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap > > > optimization. We can later come up with a scheme that lets us retain > > > compatibility. (Is that what you mean by "this can be done somewhat > > > independently", Mike?) > > > > Sort of, I was only saying that getting the ref/map counting right seems > > like a task than can be independently worked. Using the THP-like scheme > > is good. > > Ok! So if you're ok with the intermediate mapping sizes, it sounds > like I should go ahead and implement the THP-like scheme. It turns out that the THP-like scheme significantly slows down MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes the vast majority of the time spent in MADV_COLLAPSE when collapsing 1G mappings. It is doing 262k atomic decrements, so this makes sense. This is only really a problem because this is done between mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to access any of the 1G page while we're doing this (and it can take like ~1 second for each 1G, at least on the x86 server I was testing on). - James
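For reference, the arithmetic behind the 262k figure: a 1G hugetlb page holds 2^30 / 2^12 = 2^18 = 262,144 base (4K) pages, so collapsing one 1G mapping under the THP-like scheme means 262,144 atomic subpage mapcount decrements, all of them inside the mmu_notifier invalidate window where vCPU access to the page is blocked.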
James, On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote: > It turns out that the THP-like scheme significantly slows down > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes > the vast majority of the time spent in MADV_COLLAPSE when collapsing > 1G mappings. It is doing 262k atomic decrements, so this makes sense. > > This is only really a problem because this is done between > mmu_notifier_invalidate_range_start() and > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to > access any of the 1G page while we're doing this (and it can take like > ~1 second for each 1G, at least on the x86 server I was testing on). Did you try to measure the time, or it's a quick observation from perf? IIRC I used to measure some atomic ops, it is not as drastic as I thought. But maybe it depends on many things. I'm curious how the 1sec is provisioned between the procedures. E.g., I would expect mmu_notifier_invalidate_range_start() to also take some time too as it should walk the smally mapped EPT pgtables. Since we'll still keep the intermediate levels around - from application POV, one other thing to remedy this is further shrink the size of COLLAPSE so potentially for a very large page we can start with building 2M layers. But then collapse will need to be run at least two rounds.
On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote: > > James, > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote: > > It turns out that the THP-like scheme significantly slows down > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes > > the vast majority of the time spent in MADV_COLLAPSE when collapsing > > 1G mappings. It is doing 262k atomic decrements, so this makes sense. > > > > This is only really a problem because this is done between > > mmu_notifier_invalidate_range_start() and > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to > > access any of the 1G page while we're doing this (and it can take like > > ~1 second for each 1G, at least on the x86 server I was testing on). > > Did you try to measure the time, or it's a quick observation from perf? I put some ktime_get()s in. > > IIRC I used to measure some atomic ops, it is not as drastic as I thought. > But maybe it depends on many things. > > I'm curious how the 1sec is provisioned between the procedures. E.g., I > would expect mmu_notifier_invalidate_range_start() to also take some time > too as it should walk the smally mapped EPT pgtables. Somehow this doesn't take all that long (only like 10-30ms when collapsing from 4K -> 1G) compared to hugetlb_collapse(). > > Since we'll still keep the intermediate levels around - from application > POV, one other thing to remedy this is further shrink the size of COLLAPSE > so potentially for a very large page we can start with building 2M layers. > But then collapse will need to be run at least two rounds. That's exactly what I thought to do. :) I realized, too, that this is actually how userspace *should* collapse things to avoid holding up vCPUs too long. I think this is a good reason to keep intermediate page sizes. When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a huge difference: the THP-like scheme is about 30% slower overall. When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE difference. For the THP-like scheme, collapsing 4K -> 2M requires decrementing and then re-incrementing subpage->_mapcount, and then from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For the head-only scheme, for each 2M in the 4K -> 2M collapse, we decrement the compound_mapcount 512 times (once per PTE), then increment it once. And then for 2M -> 1G, for each 1G, we decrement mapcount again by 512 (once per PMD), incrementing it once. The mapcount decrements are about on par with how long it takes to do other things, like updating page tables. The main problem is, with the THP-like scheme (implemented like this [1]), there isn't a way to avoid the 262k decrements when collapsing 1G. So if we want MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API, then I think something more clever needs to be implemented. [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178 - James
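Putting numbers on the per-step costs described above, per 1G region: the 2M -> 1G step touches 262,144 subpage _mapcounts under the THP-like scheme (one per 4K page) versus 512 compound_mapcount decrements plus one increment under the head-only scheme (one per PMD), roughly a 512x reduction in mapcount traffic for that step. The 4K -> 2M step is on the order of 512 mapcount operations per 2M chunk under either scheme, so the divergence is entirely in how the final collapse to the hstate size is accounted.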
On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote: > > > > James, > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote: > > > It turns out that the THP-like scheme significantly slows down > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense. > > > > > > This is only really a problem because this is done between > > > mmu_notifier_invalidate_range_start() and > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to > > > access any of the 1G page while we're doing this (and it can take like > > > ~1 second for each 1G, at least on the x86 server I was testing on). > > > > Did you try to measure the time, or it's a quick observation from perf? > > I put some ktime_get()s in. > > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought. > > But maybe it depends on many things. > > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I > > would expect mmu_notifier_invalidate_range_start() to also take some time > > too as it should walk the smally mapped EPT pgtables. > > Somehow this doesn't take all that long (only like 10-30ms when > collapsing from 4K -> 1G) compared to hugetlb_collapse(). Did you populate as much the EPT pgtable when measuring this? IIUC this number should be pretty much relevant to how many pages are shadowed to the kvm pgtables. If the EPT table is mostly empty it should be super fast, but OTOH it can be much slower if when it's populated, because tdp mmu should need to handle the pgtable leaves one by one. E.g. it should be fully populated if you have a program busy dirtying most of the guest pages during test migration. Write op should be the worst here case since it'll require the atomic op being applied; see kvm_tdp_mmu_write_spte(). > > > > > Since we'll still keep the intermediate levels around - from application > > POV, one other thing to remedy this is further shrink the size of COLLAPSE > > so potentially for a very large page we can start with building 2M layers. > > But then collapse will need to be run at least two rounds. > > That's exactly what I thought to do. :) I realized, too, that this is > actually how userspace *should* collapse things to avoid holding up > vCPUs too long. I think this is a good reason to keep intermediate > page sizes. > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a > huge difference: the THP-like scheme is about 30% slower overall. > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE > difference. For the THP-like scheme, collapsing 4K -> 2M requires > decrementing and then re-incrementing subpage->_mapcount, and then > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For > the head-only scheme, for each 2M in the 4K -> 2M collapse, we > decrement the compound_mapcount 512 times (once per PTE), then > increment it once. And then for 2M -> 1G, for each 1G, we decrement > mapcount again by 512 (once per PMD), incrementing it once. Did you have quantified numbers (with your ktime treak) to compare these? If we want to go the other route, I think these will be materials to justify any other approach on mapcount handling. > > The mapcount decrements are about on par with how long it takes to do > other things, like updating page tables. 
The main problem is, with the > THP-like scheme (implemented like this [1]), there isn't a way to > avoid the 262k decrements when collapsing 1G. So if we want > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API, > then I think something more clever needs to be implemented. > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178 I believe the whole goal of HGM is trying to face the same challenge if we'll allow 1G THP exist and being able to split for anon. I don't remember whether we discussed below, maybe we did? Anyway... Another way to not use thp mapcount, nor break smaps and similar calls to page_mapcount() on small page, is to only increase the hpage mapcount only when hstate pXd (in case of 1G it's PUD) entry being populated (no matter as leaf or a non-leaf), and the mapcount can be decreased when the pXd entry is removed (for leaf, it's the same as for now; for HGM, it's when freeing pgtable of the PUD entry). Again, in all cases I think some solid measurements would definitely be helpful (as commented above) to see how much overhead will there be and whether that'll start to become a problem at least for the current motivations of the whole HGM idea. Thanks,
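A sketch of the accounting Peter suggests, using hypothetical helpers and a bare atomic_t as a stand-in for the hugepage's mapcount (nothing here is from the series): the count moves only on hstate-level entry transitions (PUD for 1G), whether the entry becomes a leaf or a pointer to a lower page table, so collapsing a 1G region touches the mapcount a constant number of times instead of once per 4K subpage. Callers are assumed to hold the PUD-level lock.

#include <linux/mm.h>

static void hgm_pud_populated(pud_t *pud, pud_t newval, atomic_t *hpage_mapcount)
{
	bool was_none = pud_none(READ_ONCE(*pud));

	set_pud(pud, newval);			/* leaf or table entry */
	if (was_none)
		atomic_inc(hpage_mapcount);	/* first mapping at the hstate level */
}

static void hgm_pud_removed(pud_t *pud, atomic_t *hpage_mapcount)
{
	if (pud_none(READ_ONCE(*pud)))
		return;
	pud_clear(pud);				/* leaf zapped or HGM subtree freed */
	atomic_dec(hpage_mapcount);
}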
On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > James, > > > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote: > > > > It turns out that the THP-like scheme significantly slows down > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense. > > > > > > > > This is only really a problem because this is done between > > > > mmu_notifier_invalidate_range_start() and > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to > > > > access any of the 1G page while we're doing this (and it can take like > > > > ~1 second for each 1G, at least on the x86 server I was testing on). > > > > > > Did you try to measure the time, or it's a quick observation from perf? > > > > I put some ktime_get()s in. > > > > > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought. > > > But maybe it depends on many things. > > > > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I > > > would expect mmu_notifier_invalidate_range_start() to also take some time > > > too as it should walk the smally mapped EPT pgtables. > > > > Somehow this doesn't take all that long (only like 10-30ms when > > collapsing from 4K -> 1G) compared to hugetlb_collapse(). > > Did you populate as much the EPT pgtable when measuring this? > > IIUC this number should be pretty much relevant to how many pages are > shadowed to the kvm pgtables. If the EPT table is mostly empty it should > be super fast, but OTOH it can be much slower if when it's populated, > because tdp mmu should need to handle the pgtable leaves one by one. > > E.g. it should be fully populated if you have a program busy dirtying most > of the guest pages during test migration. That's what I was doing. I was running a workload in the guest that just writes 8 bytes to a page and jumps ahead a few pages on all vCPUs, touching most of its memory. But there is more to understand; I'll collect more results. I'm not sure why the EPT can be unmapped/collapsed so quickly. > > Write op should be the worst here case since it'll require the atomic op > being applied; see kvm_tdp_mmu_write_spte(). > > > > > > > > > Since we'll still keep the intermediate levels around - from application > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE > > > so potentially for a very large page we can start with building 2M layers. > > > But then collapse will need to be run at least two rounds. > > > > That's exactly what I thought to do. :) I realized, too, that this is > > actually how userspace *should* collapse things to avoid holding up > > vCPUs too long. I think this is a good reason to keep intermediate > > page sizes. > > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a > > huge difference: the THP-like scheme is about 30% slower overall. > > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE > > difference. For the THP-like scheme, collapsing 4K -> 2M requires > > decrementing and then re-incrementing subpage->_mapcount, and then > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. 
For > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we > > decrement the compound_mapcount 512 times (once per PTE), then > > increment it once. And then for 2M -> 1G, for each 1G, we decrement > > mapcount again by 512 (once per PMD), incrementing it once. > > Did you have quantified numbers (with your ktime treak) to compare these? > If we want to go the other route, I think these will be materials to > justify any other approach on mapcount handling. Ok, I can do that. GIve me a couple days to collect more results and organize them in a helpful way. (If it's helpful at all, here are some results I collected last week: [2]. Please ignore it if it's not helpful.) > > > > > The mapcount decrements are about on par with how long it takes to do > > other things, like updating page tables. The main problem is, with the > > THP-like scheme (implemented like this [1]), there isn't a way to > > avoid the 262k decrements when collapsing 1G. So if we want > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API, > > then I think something more clever needs to be implemented. > > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178 > > I believe the whole goal of HGM is trying to face the same challenge if > we'll allow 1G THP exist and being able to split for anon. > > I don't remember whether we discussed below, maybe we did? Anyway... > > Another way to not use thp mapcount, nor break smaps and similar calls to > page_mapcount() on small page, is to only increase the hpage mapcount only > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > entry is removed (for leaf, it's the same as for now; for HGM, it's when > freeing pgtable of the PUD entry). Right, and this is doable. Also it seems like this is pretty close to the direction Matthew Wilcox wants to go with THPs. Something I noticed though, from the implementation of folio_referenced()/folio_referenced_one(), is that folio_mapcount() ought to report the total number of PTEs that are pointing on the page (or the number of times page_vma_mapped_walk returns true). FWIW, folio_referenced() is never called for hugetlb folios. > > Again, in all cases I think some solid measurements would definitely be > helpful (as commented above) to see how much overhead will there be and > whether that'll start to become a problem at least for the current > motivations of the whole HGM idea. > > Thanks, > > -- > Peter Xu > Thanks, Peter! [2]: https://pastebin.com/raw/DVfNFi2m - James
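A userspace sketch of the guest dirtying workload mentioned earlier in this message, with made-up buffer size, stride, and thread count (error handling omitted): each thread writes 8 bytes to a page, jumps ahead a few pages, and loops so that most of the buffer stays dirty.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_THREADS	8
#define STRIDE_PAGES	4
#define BUF_BYTES	(1UL << 30)	/* 1G per thread, illustrative */

static void *dirty_loop(void *arg)
{
	uint8_t *buf = arg;
	size_t page = sysconf(_SC_PAGESIZE);
	size_t off;

	for (;;)
		for (off = 0; off < BUF_BYTES; off += STRIDE_PAGES * page)
			*(volatile uint64_t *)(buf + off) = off;	/* 8-byte store */
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, dirty_loop,
			       aligned_alloc(sysconf(_SC_PAGESIZE), BUF_BYTES));
	pthread_join(threads[0], NULL);	/* threads never exit; keep running */
	return 0;
}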
On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > James, > > > > > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote: > > > > > It turns out that the THP-like scheme significantly slows down > > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes > > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing > > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense. > > > > > > > > > > This is only really a problem because this is done between > > > > > mmu_notifier_invalidate_range_start() and > > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to > > > > > access any of the 1G page while we're doing this (and it can take like > > > > > ~1 second for each 1G, at least on the x86 server I was testing on). > > > > > > > > Did you try to measure the time, or it's a quick observation from perf? > > > > > > I put some ktime_get()s in. > > > > > > > > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought. > > > > But maybe it depends on many things. > > > > > > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I > > > > would expect mmu_notifier_invalidate_range_start() to also take some time > > > > too as it should walk the smally mapped EPT pgtables. > > > > > > Somehow this doesn't take all that long (only like 10-30ms when > > > collapsing from 4K -> 1G) compared to hugetlb_collapse(). > > > > Did you populate as much the EPT pgtable when measuring this? > > > > IIUC this number should be pretty much relevant to how many pages are > > shadowed to the kvm pgtables. If the EPT table is mostly empty it should > > be super fast, but OTOH it can be much slower if when it's populated, > > because tdp mmu should need to handle the pgtable leaves one by one. > > > > E.g. it should be fully populated if you have a program busy dirtying most > > of the guest pages during test migration. > > That's what I was doing. I was running a workload in the guest that > just writes 8 bytes to a page and jumps ahead a few pages on all > vCPUs, touching most of its memory. > > But there is more to understand; I'll collect more results. I'm not > sure why the EPT can be unmapped/collapsed so quickly. Maybe something smart done by the hypervisor? > > > > > Write op should be the worst here case since it'll require the atomic op > > being applied; see kvm_tdp_mmu_write_spte(). > > > > > > > > > > > > > Since we'll still keep the intermediate levels around - from application > > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE > > > > so potentially for a very large page we can start with building 2M layers. > > > > But then collapse will need to be run at least two rounds. > > > > > > That's exactly what I thought to do. :) I realized, too, that this is > > > actually how userspace *should* collapse things to avoid holding up > > > vCPUs too long. I think this is a good reason to keep intermediate > > > page sizes. > > > > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a > > > huge difference: the THP-like scheme is about 30% slower overall. > > > > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE > > > difference. 
For the THP-like scheme, collapsing 4K -> 2M requires > > > decrementing and then re-incrementing subpage->_mapcount, and then > > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For > > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we > > > decrement the compound_mapcount 512 times (once per PTE), then > > > increment it once. And then for 2M -> 1G, for each 1G, we decrement > > > mapcount again by 512 (once per PMD), incrementing it once. > > > > Did you have quantified numbers (with your ktime treak) to compare these? > > If we want to go the other route, I think these will be materials to > > justify any other approach on mapcount handling. > > Ok, I can do that. GIve me a couple days to collect more results and > organize them in a helpful way. > > (If it's helpful at all, here are some results I collected last week: > [2]. Please ignore it if it's not helpful.) It's helpful already at least to me, thanks. Yes the change is drastic. > > > > > > > > > The mapcount decrements are about on par with how long it takes to do > > > other things, like updating page tables. The main problem is, with the > > > THP-like scheme (implemented like this [1]), there isn't a way to > > > avoid the 262k decrements when collapsing 1G. So if we want > > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API, > > > then I think something more clever needs to be implemented. > > > > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178 > > > > I believe the whole goal of HGM is trying to face the same challenge if > > we'll allow 1G THP exist and being able to split for anon. > > > > I don't remember whether we discussed below, maybe we did? Anyway... > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > page_mapcount() on small page, is to only increase the hpage mapcount only > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > freeing pgtable of the PUD entry). > > Right, and this is doable. Also it seems like this is pretty close to > the direction Matthew Wilcox wants to go with THPs. I may not be familiar with it, do you mean this one? https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ For hugetlb I think it should be easier to maintain rather than any-sized folios, because there's the pgtable non-leaf entry to track rmap information and the folio size being static to hpage size. It'll be different to folios where it can be random sized pages chunk, so it needs to be managed by batching the ptes when install/zap. > > Something I noticed though, from the implementation of > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > ought to report the total number of PTEs that are pointing on the page > (or the number of times page_vma_mapped_walk returns true). FWIW, > folio_referenced() is never called for hugetlb folios. FWIU folio_mapcount is the thing it needs for now to do the rmap walks - it'll walk every leaf page being mapped, big or small, so IIUC that number should match with what it expects to see later, more or less. But I agree the mapcount/referenced value itself is debatable to me, just like what you raised in the other thread on page migration. Meanwhile, I am not certain whether the mapcount is accurate either because AFAICT the mapcount can be modified if e.g. 
a new page mapping is established before the page lock is taken later in folio_referenced(). It's just that I don't see any severe issue due to any of the above, as long as that information is only used as a hint for next steps, e.g., deciding which page to swap out.
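[ Aside for readers: to make the counts above concrete, a back-of-the-envelope sketch, plain arithmetic on the figures quoted in this thread, assuming x86 4K/2M/1G page sizes; the macros are local stand-ins, not code from the series. ]

/* Collapse-cost arithmetic for one 1G hugetlb page (x86 sizes assumed). */
#define SZ_4K  (1UL << 12)
#define SZ_2M  (1UL << 21)
#define SZ_1G  (1UL << 30)

#define PTES_PER_1G  (SZ_1G / SZ_4K)  /* 262144: the "262k" atomic ops */
#define PTES_PER_2M  (SZ_2M / SZ_4K)  /* 512 4K PTEs per 2M */
#define PMDS_PER_1G  (SZ_1G / SZ_2M)  /* 512 2M PMDs per 1G */

/*
 * THP-like scheme, collapsing 4K -> 1G in one step: PTES_PER_1G (262144)
 * subpage _mapcount decrements inside a single mmu-notifier invalidate
 * window.  Head-only scheme, collapsing 4K -> 2M -> 1G: PTES_PER_2M
 * decrements per 2M in the first round (512 * 512 = 262144 total, but
 * split across many short invalidate windows, so vCPUs are held up for
 * much shorter stretches), then only PMDS_PER_1G (512) more for the
 * 2M -> 1G round.
 */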
On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > James, > > > > > > > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote: > > > > > > It turns out that the THP-like scheme significantly slows down > > > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes > > > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing > > > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense. > > > > > > > > > > > > This is only really a problem because this is done between > > > > > > mmu_notifier_invalidate_range_start() and > > > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to > > > > > > access any of the 1G page while we're doing this (and it can take like > > > > > > ~1 second for each 1G, at least on the x86 server I was testing on). > > > > > > > > > > Did you try to measure the time, or it's a quick observation from perf? > > > > > > > > I put some ktime_get()s in. > > > > > > > > > > > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought. > > > > > But maybe it depends on many things. > > > > > > > > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I > > > > > would expect mmu_notifier_invalidate_range_start() to also take some time > > > > > too as it should walk the smally mapped EPT pgtables. > > > > > > > > Somehow this doesn't take all that long (only like 10-30ms when > > > > collapsing from 4K -> 1G) compared to hugetlb_collapse(). > > > > > > Did you populate as much the EPT pgtable when measuring this? > > > > > > IIUC this number should be pretty much relevant to how many pages are > > > shadowed to the kvm pgtables. If the EPT table is mostly empty it should > > > be super fast, but OTOH it can be much slower if when it's populated, > > > because tdp mmu should need to handle the pgtable leaves one by one. > > > > > > E.g. it should be fully populated if you have a program busy dirtying most > > > of the guest pages during test migration. > > > > That's what I was doing. I was running a workload in the guest that > > just writes 8 bytes to a page and jumps ahead a few pages on all > > vCPUs, touching most of its memory. > > > > But there is more to understand; I'll collect more results. I'm not > > sure why the EPT can be unmapped/collapsed so quickly. > > Maybe something smart done by the hypervisor? Doing a little bit more digging, it looks like the invalidate_range_start notifier clears the sptes, and then later on (on the next EPT violation), the page tables are freed. I still need to look at how they end up being so much faster still, but I thought that was interesting. > > > > > > > > > Write op should be the worst here case since it'll require the atomic op > > > being applied; see kvm_tdp_mmu_write_spte(). > > > > > > > > > > > > > > > > > Since we'll still keep the intermediate levels around - from application > > > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE > > > > > so potentially for a very large page we can start with building 2M layers. > > > > > But then collapse will need to be run at least two rounds. > > > > > > > > That's exactly what I thought to do. 
:) I realized, too, that this is > > > > actually how userspace *should* collapse things to avoid holding up > > > > vCPUs too long. I think this is a good reason to keep intermediate > > > > page sizes. > > > > > > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a > > > > huge difference: the THP-like scheme is about 30% slower overall. > > > > > > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE > > > > difference. For the THP-like scheme, collapsing 4K -> 2M requires > > > > decrementing and then re-incrementing subpage->_mapcount, and then > > > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For > > > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we > > > > decrement the compound_mapcount 512 times (once per PTE), then > > > > increment it once. And then for 2M -> 1G, for each 1G, we decrement > > > > mapcount again by 512 (once per PMD), incrementing it once. > > > > > > Did you have quantified numbers (with your ktime treak) to compare these? > > > If we want to go the other route, I think these will be materials to > > > justify any other approach on mapcount handling. > > > > Ok, I can do that. GIve me a couple days to collect more results and > > organize them in a helpful way. > > > > (If it's helpful at all, here are some results I collected last week: > > [2]. Please ignore it if it's not helpful.) > > It's helpful already at least to me, thanks. Yes the change is drastic. That data only contains THP-like mapcount performance, no performance for the head-only way. But the head-only scheme makes the 2M -> 1G very good ("56" comes down to about the same everything else, instead of being ~100-500x bigger). > > > > > > > > > > > > > > The mapcount decrements are about on par with how long it takes to do > > > > other things, like updating page tables. The main problem is, with the > > > > THP-like scheme (implemented like this [1]), there isn't a way to > > > > avoid the 262k decrements when collapsing 1G. So if we want > > > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API, > > > > then I think something more clever needs to be implemented. > > > > > > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178 > > > > > > I believe the whole goal of HGM is trying to face the same challenge if > > > we'll allow 1G THP exist and being able to split for anon. > > > > > > I don't remember whether we discussed below, maybe we did? Anyway... > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > freeing pgtable of the PUD entry). > > > > Right, and this is doable. Also it seems like this is pretty close to > > the direction Matthew Wilcox wants to go with THPs. > > I may not be familiar with it, do you mean this one? > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ Yep that's it. > > For hugetlb I think it should be easier to maintain rather than any-sized > folios, because there's the pgtable non-leaf entry to track rmap > information and the folio size being static to hpage size. 
> > It'll be different to folios where it can be random sized pages chunk, so > it needs to be managed by batching the ptes when install/zap. Agreed. It's probably easier for HugeTLB because they're always "naturally aligned" and yeah they can't change sizes. > > > > > Something I noticed though, from the implementation of > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > ought to report the total number of PTEs that are pointing on the page > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > folio_referenced() is never called for hugetlb folios. > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > it'll walk every leaf page being mapped, big or small, so IIUC that number > should match with what it expects to see later, more or less. I don't fully understand what you mean here. > > But I agree the mapcount/referenced value itself is debatable to me, just > like what you raised in the other thread on page migration. Meanwhile, I > am not certain whether the mapcount is accurate either because AFAICT the > mapcount can be modified if e.g. new page mapping established as long as > before taking the page lock later in folio_referenced(). > > It's just that I don't see any severe issue either due to any of above, as > long as that information is only used as a hint for next steps, e.g., to > swap which page out. I also don't see a big problem with folio_referenced() (and you're right that folio_mapcount() can be stale by the time it takes the folio lock). It still seems like folio_mapcount() should return the total number of PTEs that map the page though. Are you saying that breaking this would be ok?
On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > > > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > James, > > > > > > > > > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote: > > > > > > > It turns out that the THP-like scheme significantly slows down > > > > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes > > > > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing > > > > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense. > > > > > > > > > > > > > > This is only really a problem because this is done between > > > > > > > mmu_notifier_invalidate_range_start() and > > > > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to > > > > > > > access any of the 1G page while we're doing this (and it can take like > > > > > > > ~1 second for each 1G, at least on the x86 server I was testing on). > > > > > > > > > > > > Did you try to measure the time, or it's a quick observation from perf? > > > > > > > > > > I put some ktime_get()s in. > > > > > > > > > > > > > > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought. > > > > > > But maybe it depends on many things. > > > > > > > > > > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I > > > > > > would expect mmu_notifier_invalidate_range_start() to also take some time > > > > > > too as it should walk the smally mapped EPT pgtables. > > > > > > > > > > Somehow this doesn't take all that long (only like 10-30ms when > > > > > collapsing from 4K -> 1G) compared to hugetlb_collapse(). > > > > > > > > Did you populate as much the EPT pgtable when measuring this? > > > > > > > > IIUC this number should be pretty much relevant to how many pages are > > > > shadowed to the kvm pgtables. If the EPT table is mostly empty it should > > > > be super fast, but OTOH it can be much slower if when it's populated, > > > > because tdp mmu should need to handle the pgtable leaves one by one. > > > > > > > > E.g. it should be fully populated if you have a program busy dirtying most > > > > of the guest pages during test migration. > > > > > > That's what I was doing. I was running a workload in the guest that > > > just writes 8 bytes to a page and jumps ahead a few pages on all > > > vCPUs, touching most of its memory. > > > > > > But there is more to understand; I'll collect more results. I'm not > > > sure why the EPT can be unmapped/collapsed so quickly. > > > > Maybe something smart done by the hypervisor? > > Doing a little bit more digging, it looks like the > invalidate_range_start notifier clears the sptes, and then later on > (on the next EPT violation), the page tables are freed. I still need > to look at how they end up being so much faster still, but I thought > that was interesting. > > > > > > > > > > > > > > Write op should be the worst here case since it'll require the atomic op > > > > being applied; see kvm_tdp_mmu_write_spte(). 
> > > > > > > > > > > > > > > > > > > > > Since we'll still keep the intermediate levels around - from application > > > > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE > > > > > > so potentially for a very large page we can start with building 2M layers. > > > > > > But then collapse will need to be run at least two rounds. > > > > > > > > > > That's exactly what I thought to do. :) I realized, too, that this is > > > > > actually how userspace *should* collapse things to avoid holding up > > > > > vCPUs too long. I think this is a good reason to keep intermediate > > > > > page sizes. > > > > > > > > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a > > > > > huge difference: the THP-like scheme is about 30% slower overall. > > > > > > > > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE > > > > > difference. For the THP-like scheme, collapsing 4K -> 2M requires > > > > > decrementing and then re-incrementing subpage->_mapcount, and then > > > > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For > > > > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we > > > > > decrement the compound_mapcount 512 times (once per PTE), then > > > > > increment it once. And then for 2M -> 1G, for each 1G, we decrement > > > > > mapcount again by 512 (once per PMD), incrementing it once. > > > > > > > > Did you have quantified numbers (with your ktime treak) to compare these? > > > > If we want to go the other route, I think these will be materials to > > > > justify any other approach on mapcount handling. > > > > > > Ok, I can do that. GIve me a couple days to collect more results and > > > organize them in a helpful way. > > > > > > (If it's helpful at all, here are some results I collected last week: > > > [2]. Please ignore it if it's not helpful.) > > > > It's helpful already at least to me, thanks. Yes the change is drastic. > > That data only contains THP-like mapcount performance, no performance > for the head-only way. But the head-only scheme makes the 2M -> 1G > very good ("56" comes down to about the same everything else, instead > of being ~100-500x bigger). Oops, I think I misread those. Yeah please keep sharing information if you come up with any. > > > > > > > > > > > > > > > > > > > > The mapcount decrements are about on par with how long it takes to do > > > > > other things, like updating page tables. The main problem is, with the > > > > > THP-like scheme (implemented like this [1]), there isn't a way to > > > > > avoid the 262k decrements when collapsing 1G. So if we want > > > > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API, > > > > > then I think something more clever needs to be implemented. > > > > > > > > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178 > > > > > > > > I believe the whole goal of HGM is trying to face the same challenge if > > > > we'll allow 1G THP exist and being able to split for anon. > > > > > > > > I don't remember whether we discussed below, maybe we did? Anyway... 
> > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > > freeing pgtable of the PUD entry). > > > > > > Right, and this is doable. Also it seems like this is pretty close to > > > the direction Matthew Wilcox wants to go with THPs. > > > > I may not be familiar with it, do you mean this one? > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > > Yep that's it. > > > > > For hugetlb I think it should be easier to maintain rather than any-sized > > folios, because there's the pgtable non-leaf entry to track rmap > > information and the folio size being static to hpage size. > > > > It'll be different to folios where it can be random sized pages chunk, so > > it needs to be managed by batching the ptes when install/zap. > > Agreed. It's probably easier for HugeTLB because they're always > "naturally aligned" and yeah they can't change sizes. > > > > > > > > > Something I noticed though, from the implementation of > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > > ought to report the total number of PTEs that are pointing on the page > > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > > folio_referenced() is never called for hugetlb folios. > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > > it'll walk every leaf page being mapped, big or small, so IIUC that number > > should match with what it expects to see later, more or less. > > I don't fully understand what you mean here. I meant the rmap_walk pairing with folio_referenced_one() will walk all the leaves for the folio, big or small. I think that will match the number with what got returned from folio_mapcount(). > > > > > But I agree the mapcount/referenced value itself is debatable to me, just > > like what you raised in the other thread on page migration. Meanwhile, I > > am not certain whether the mapcount is accurate either because AFAICT the > > mapcount can be modified if e.g. new page mapping established as long as > > before taking the page lock later in folio_referenced(). > > > > It's just that I don't see any severe issue either due to any of above, as > > long as that information is only used as a hint for next steps, e.g., to > > swap which page out. > > I also don't see a big problem with folio_referenced() (and you're > right that folio_mapcount() can be stale by the time it takes the > folio lock). It still seems like folio_mapcount() should return the > total number of PTEs that map the page though. Are you saying that > breaking this would be ok? I didn't quite follow - isn't that already doing so? folio_mapcount() is total_compound_mapcount() here, IIUC it is an accumulated value of all possible PTEs or PMDs being mapped as long as it's all or part of the folio being mapped.
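[ Aside: the "accumulated" view described here is roughly the following, a simplified conceptual sketch rather than the exact mm/ implementation; it assumes folio_entire_mapcount(), folio_nr_pages() and folio_page() are available and that subpage _mapcount carries the usual -1 bias. ]

#include <linux/mm.h>

/*
 * Conceptual total mapcount for a large (hugetlb/THP) folio: the
 * entire-folio mapcount plus every subpage _mapcount, so each PTE or
 * PMD mapping of any part of the folio contributes one.
 */
static int sketch_total_mapcount(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);
	int total = folio_entire_mapcount(folio);

	for (i = 0; i < nr; i++)
		total += atomic_read(&folio_page(folio, i)->_mapcount) + 1;

	return total;
}

[ With this shape, a fully PTE-mapped 2M page comes out at 512 and a PMD-mapped one at 1, which is the count the rmap walk paired with folio_referenced_one() would reach one leaf at a time. ]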
On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: [snip] > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > > > freeing pgtable of the PUD entry). > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to > > > > the direction Matthew Wilcox wants to go with THPs. > > > > > > I may not be familiar with it, do you mean this one? > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > > > > Yep that's it. > > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized > > > folios, because there's the pgtable non-leaf entry to track rmap > > > information and the folio size being static to hpage size. > > > > > > It'll be different to folios where it can be random sized pages chunk, so > > > it needs to be managed by batching the ptes when install/zap. > > > > Agreed. It's probably easier for HugeTLB because they're always > > "naturally aligned" and yeah they can't change sizes. > > > > > > > > > > > > > Something I noticed though, from the implementation of > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > > > ought to report the total number of PTEs that are pointing on the page > > > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > > > folio_referenced() is never called for hugetlb folios. > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > > > it'll walk every leaf page being mapped, big or small, so IIUC that number > > > should match with what it expects to see later, more or less. > > > > I don't fully understand what you mean here. > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the > leaves for the folio, big or small. I think that will match the number > with what got returned from folio_mapcount(). See below. > > > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just > > > like what you raised in the other thread on page migration. Meanwhile, I > > > am not certain whether the mapcount is accurate either because AFAICT the > > > mapcount can be modified if e.g. new page mapping established as long as > > > before taking the page lock later in folio_referenced(). > > > > > > It's just that I don't see any severe issue either due to any of above, as > > > long as that information is only used as a hint for next steps, e.g., to > > > swap which page out. > > > > I also don't see a big problem with folio_referenced() (and you're > > right that folio_mapcount() can be stale by the time it takes the > > folio lock). It still seems like folio_mapcount() should return the > > total number of PTEs that map the page though. Are you saying that > > breaking this would be ok? 
> > I didn't quite follow - isn't that already doing so? > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an > accumulated value of all possible PTEs or PMDs being mapped as long as it's > all or part of the folio being mapped. We've talked about 3 ways of handling mapcount: 1. The RFC v2 way, which is head-only, and we increment the compound mapcount for each PT mapping we have. So a PTE-mapped 2M page, compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). 2. The THP-like way. If we are fully mapping the hugetlb page with the hstate-level PTE, we increment the compound mapcount, otherwise we increment subpage->_mapcount. 3. The RFC v1 way (the way you have suggested above), which is head-only, and we increment the compound mapcount if the hstate-level PTE is made present. With #1 and #2, there is no concern with folio_mapcount(). But with #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA would yield 1 instead of 512 (right?). That's what I mean. #1 has problems wrt smaps and migration (though there were other problems with those anyway that Mike has fixed), and #2 makes MADV_COLLAPSE slow to the point of being unusable for some applications. It seems like the least bad option is #1, but maybe we should have a face-to-face discussion about it? I'm still collecting some more performance numbers. - James
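[ Aside: for readers following along, a purely illustrative sketch of where each of the three schemes bumps a counter when one mapping of a hugetlb page is installed; the function, enum and parameter names are invented for illustration and none of this is code from either RFC. ]

#include <linux/atomic.h>
#include <linux/types.h>

enum hgm_mapcount_scheme {
	SCHEME_1_HEAD_PER_MAPPING,	/* RFC v2: head-only, once per PT mapping */
	SCHEME_2_THP_LIKE,		/* head at hstate level, else subpage */
	SCHEME_3_HEAD_PER_HSTATE,	/* RFC v1: head-only, once per hstate entry */
};

/*
 * Conceptually called when one mapping (e.g. a single 4K PTE) of a
 * hugetlb page is installed.  @compound_mapcount and @subpage_mapcount
 * stand in for the head page's compound mapcount and the subpage's
 * _mapcount; @mapping_is_hstate_leaf means the new mapping is the full
 * hstate-level leaf; @hstate_entry_newly_populated is true only on the
 * none -> present transition of the hstate-level entry (leaf or not).
 */
static void sketch_add_mapcount(atomic_t *compound_mapcount,
				atomic_t *subpage_mapcount,
				bool mapping_is_hstate_leaf,
				bool hstate_entry_newly_populated,
				enum hgm_mapcount_scheme scheme)
{
	switch (scheme) {
	case SCHEME_1_HEAD_PER_MAPPING:
		/* #1: a fully PTE-mapped 2M page ends at compound=512, so
		 * collapse has to undo 512 (or 262144 for 1G) of these. */
		atomic_inc(compound_mapcount);
		break;
	case SCHEME_2_THP_LIKE:
		/* #2: per-subpage counts are what make the 262k decrements
		 * unavoidable when collapsing straight to 1G. */
		if (mapping_is_hstate_leaf)
			atomic_inc(compound_mapcount);
		else
			atomic_inc(subpage_mapcount);
		break;
	case SCHEME_3_HEAD_PER_HSTATE:
		/* #3: the count only moves on the none -> present transition
		 * of the hstate-level entry, so folio_mapcount() stays at 1
		 * per page-table hierarchy, the concern raised above. */
		if (hstate_entry_newly_populated)
			atomic_inc(compound_mapcount);
		break;
	}
}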
On 01.02.23 16:45, James Houghton wrote: > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: >> >> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: >>> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: >>>> >>>> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: >>>>> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: >>>>>> >>>>>> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > [snip] >>>>>> Another way to not use thp mapcount, nor break smaps and similar calls to >>>>>> page_mapcount() on small page, is to only increase the hpage mapcount only >>>>>> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter >>>>>> as leaf or a non-leaf), and the mapcount can be decreased when the pXd >>>>>> entry is removed (for leaf, it's the same as for now; for HGM, it's when >>>>>> freeing pgtable of the PUD entry). >>>>> >>>>> Right, and this is doable. Also it seems like this is pretty close to >>>>> the direction Matthew Wilcox wants to go with THPs. >>>> >>>> I may not be familiar with it, do you mean this one? >>>> >>>> https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ >>> >>> Yep that's it. >>> >>>> >>>> For hugetlb I think it should be easier to maintain rather than any-sized >>>> folios, because there's the pgtable non-leaf entry to track rmap >>>> information and the folio size being static to hpage size. >>>> >>>> It'll be different to folios where it can be random sized pages chunk, so >>>> it needs to be managed by batching the ptes when install/zap. >>> >>> Agreed. It's probably easier for HugeTLB because they're always >>> "naturally aligned" and yeah they can't change sizes. >>> >>>> >>>>> >>>>> Something I noticed though, from the implementation of >>>>> folio_referenced()/folio_referenced_one(), is that folio_mapcount() >>>>> ought to report the total number of PTEs that are pointing on the page >>>>> (or the number of times page_vma_mapped_walk returns true). FWIW, >>>>> folio_referenced() is never called for hugetlb folios. >>>> >>>> FWIU folio_mapcount is the thing it needs for now to do the rmap walks - >>>> it'll walk every leaf page being mapped, big or small, so IIUC that number >>>> should match with what it expects to see later, more or less. >>> >>> I don't fully understand what you mean here. >> >> I meant the rmap_walk pairing with folio_referenced_one() will walk all the >> leaves for the folio, big or small. I think that will match the number >> with what got returned from folio_mapcount(). > > See below. > >> >>> >>>> >>>> But I agree the mapcount/referenced value itself is debatable to me, just >>>> like what you raised in the other thread on page migration. Meanwhile, I >>>> am not certain whether the mapcount is accurate either because AFAICT the >>>> mapcount can be modified if e.g. new page mapping established as long as >>>> before taking the page lock later in folio_referenced(). >>>> >>>> It's just that I don't see any severe issue either due to any of above, as >>>> long as that information is only used as a hint for next steps, e.g., to >>>> swap which page out. >>> >>> I also don't see a big problem with folio_referenced() (and you're >>> right that folio_mapcount() can be stale by the time it takes the >>> folio lock). It still seems like folio_mapcount() should return the >>> total number of PTEs that map the page though. Are you saying that >>> breaking this would be ok? 
>> >> I didn't quite follow - isn't that already doing so? >> >> folio_mapcount() is total_compound_mapcount() here, IIUC it is an >> accumulated value of all possible PTEs or PMDs being mapped as long as it's >> all or part of the folio being mapped. > > We've talked about 3 ways of handling mapcount: > > 1. The RFC v2 way, which is head-only, and we increment the compound > mapcount for each PT mapping we have. So a PTE-mapped 2M page, > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). > 2. The THP-like way. If we are fully mapping the hugetlb page with the > hstate-level PTE, we increment the compound mapcount, otherwise we > increment subpage->_mapcount. > 3. The RFC v1 way (the way you have suggested above), which is > head-only, and we increment the compound mapcount if the hstate-level > PTE is made present. > > With #1 and #2, there is no concern with folio_mapcount(). But with > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA > would yield 1 instead of 512 (right?). That's what I mean. My 2 cents: The mapcount is primarily used (in hugetlb context) to (a) Detect if a page might be shared. mapcount > 1 implies that two independent page table hierarchies are mapping the page. We care about mapcount == 1 vs. mapcount != 1. (b) Detect if unmapping was sucessfull. We care about mapcount == 0 vs. mapcount != 0. For hugetlb, I don't see why we should care about the subpage mapcount at all. For (a) it's even good to count "somehow mapped into a single page table structure" as "mapcount == 1" For (b), we don't care as long as "still mapped" implies "mapcount != 0".
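[ Aside: in code terms the two questions read roughly as below; folio_mapcount() and folio_mapped() are the real helpers, the rest is a simplified sketch rather than anything lifted from mm/hugetlb.c. ]

#include <linux/mm.h>

/* (a) "might the page be shared?": mapcount == 1 vs. mapcount != 1,
 * e.g. whether a CoW-style private write may simply reuse the page. */
static bool sketch_sole_mapper(struct folio *folio)
{
	return folio_mapcount(folio) == 1;
}

/* (b) "did unmapping succeed?": mapcount == 0 vs. mapcount != 0,
 * e.g. before completing migration or poison handling. */
static bool sketch_fully_unmapped(struct folio *folio)
{
	return !folio_mapped(folio);	/* folio_mapped() is mapcount != 0 */
}

[ Under scheme #3 both answers seem to stay usable: any mapping, HGM or not, keeps the hstate-level entry populated and hence the count non-zero, which appears to be the sense in which the subpage counts don't matter here. ]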
On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote: > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > [snip] > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > > > > freeing pgtable of the PUD entry). > > > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to > > > > > the direction Matthew Wilcox wants to go with THPs. > > > > > > > > I may not be familiar with it, do you mean this one? > > > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > > > > > > Yep that's it. > > > > > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized > > > > folios, because there's the pgtable non-leaf entry to track rmap > > > > information and the folio size being static to hpage size. > > > > > > > > It'll be different to folios where it can be random sized pages chunk, so > > > > it needs to be managed by batching the ptes when install/zap. > > > > > > Agreed. It's probably easier for HugeTLB because they're always > > > "naturally aligned" and yeah they can't change sizes. > > > > > > > > > > > > > > > > > Something I noticed though, from the implementation of > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > > > > ought to report the total number of PTEs that are pointing on the page > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > > > > folio_referenced() is never called for hugetlb folios. > > > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number > > > > should match with what it expects to see later, more or less. > > > > > > I don't fully understand what you mean here. > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the > > leaves for the folio, big or small. I think that will match the number > > with what got returned from folio_mapcount(). > > See below. > > > > > > > > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just > > > > like what you raised in the other thread on page migration. Meanwhile, I > > > > am not certain whether the mapcount is accurate either because AFAICT the > > > > mapcount can be modified if e.g. new page mapping established as long as > > > > before taking the page lock later in folio_referenced(). > > > > > > > > It's just that I don't see any severe issue either due to any of above, as > > > > long as that information is only used as a hint for next steps, e.g., to > > > > swap which page out. 
> > > > > > I also don't see a big problem with folio_referenced() (and you're > > > right that folio_mapcount() can be stale by the time it takes the > > > folio lock). It still seems like folio_mapcount() should return the > > > total number of PTEs that map the page though. Are you saying that > > > breaking this would be ok? > > > > I didn't quite follow - isn't that already doing so? > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an > > accumulated value of all possible PTEs or PMDs being mapped as long as it's > > all or part of the folio being mapped. > > We've talked about 3 ways of handling mapcount: > > 1. The RFC v2 way, which is head-only, and we increment the compound > mapcount for each PT mapping we have. So a PTE-mapped 2M page, > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). > 2. The THP-like way. If we are fully mapping the hugetlb page with the > hstate-level PTE, we increment the compound mapcount, otherwise we > increment subpage->_mapcount. > 3. The RFC v1 way (the way you have suggested above), which is > head-only, and we increment the compound mapcount if the hstate-level > PTE is made present. Oh that's where it come from! It took quite some months going through all these, I can hardly remember the details. > > With #1 and #2, there is no concern with folio_mapcount(). But with > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA > would yield 1 instead of 512 (right?). That's what I mean. > > #1 has problems wrt smaps and migration (though there were other > problems with those anyway that Mike has fixed), and #2 makes > MADV_COLLAPSE slow to the point of being unusable for some > applications. Ah so you're talking about after HGM being applied.. while I was only talking about THPs. If to apply the logic here with idea 3), the worst case is we'll need to have special care of HGM hugetlb in folio_referenced_one(), so the default page_vma_mapped_walk() may not apply anymore - the resource is always in hstate sized, so counting small ptes do not help too - we can just walk until the hstate entry and do referenced++ if it's not none, at the entrance of folio_referenced_one(). But I'm not sure whether that'll be necessary at all, as I'm not sure whether that path can be triggered at all in any form (where from the top it should always be shrink_page_list()). In that sense maybe we can also consider adding a WARN_ON_ONCE() in folio_referenced() where it is a hugetlb page that got passed in? Meanwhile, adding a TODO comment explaining that current walk won't work easily for HGM only, so when it will be applicable to hugetlb we need to rework? I confess that's not pretty, though. But that'll make 3) with no major defect from function-wise. Side note: did we finish folio conversion on hugetlb at all? I think at least we need some helper like folio_test_huge(). It seems still missing. Maybe it's another clue that hugetlb is not important to folio_referenced() because it's already fully converted? > > It seems like the least bad option is #1, but maybe we should have a > face-to-face discussion about it? I'm still collecting some more > performance numbers. Let's see how it goes.. Thanks,
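[ Aside: a sketch of the suggested guard, assuming folio_test_hugetlb() (or whatever the helper from the side note ends up being called) as the predicate; not a real patch. ]

#include <linux/mm.h>
#include <linux/bug.h>

/*
 * Suggested shape: warn once and bail out if a hugetlb folio ever
 * reaches folio_referenced(), since the per-PTE walk in
 * folio_referenced_one() would need rework for HGM-mapped hugetlb.
 */
static bool sketch_reject_hugetlb_referenced(struct folio *folio)
{
	/* TODO: rework the rmap walk before letting hugetlb take this path. */
	return WARN_ON_ONCE(folio_test_hugetlb(folio));
}

[ folio_referenced() would then just return 0 whenever this fires. ]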
On Wed, Feb 1, 2023 at 7:56 AM David Hildenbrand <david@redhat.com> wrote: > > On 01.02.23 16:45, James Houghton wrote: > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: > >> > >> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > >>> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > >>>> > >>>> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > >>>>> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > >>>>>> > >>>>>> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > [snip] > >>>>>> Another way to not use thp mapcount, nor break smaps and similar calls to > >>>>>> page_mapcount() on small page, is to only increase the hpage mapcount only > >>>>>> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > >>>>>> as leaf or a non-leaf), and the mapcount can be decreased when the pXd > >>>>>> entry is removed (for leaf, it's the same as for now; for HGM, it's when > >>>>>> freeing pgtable of the PUD entry). > >>>>> > >>>>> Right, and this is doable. Also it seems like this is pretty close to > >>>>> the direction Matthew Wilcox wants to go with THPs. > >>>> > >>>> I may not be familiar with it, do you mean this one? > >>>> > >>>> https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > >>> > >>> Yep that's it. > >>> > >>>> > >>>> For hugetlb I think it should be easier to maintain rather than any-sized > >>>> folios, because there's the pgtable non-leaf entry to track rmap > >>>> information and the folio size being static to hpage size. > >>>> > >>>> It'll be different to folios where it can be random sized pages chunk, so > >>>> it needs to be managed by batching the ptes when install/zap. > >>> > >>> Agreed. It's probably easier for HugeTLB because they're always > >>> "naturally aligned" and yeah they can't change sizes. > >>> > >>>> > >>>>> > >>>>> Something I noticed though, from the implementation of > >>>>> folio_referenced()/folio_referenced_one(), is that folio_mapcount() > >>>>> ought to report the total number of PTEs that are pointing on the page > >>>>> (or the number of times page_vma_mapped_walk returns true). FWIW, > >>>>> folio_referenced() is never called for hugetlb folios. > >>>> > >>>> FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > >>>> it'll walk every leaf page being mapped, big or small, so IIUC that number > >>>> should match with what it expects to see later, more or less. > >>> > >>> I don't fully understand what you mean here. > >> > >> I meant the rmap_walk pairing with folio_referenced_one() will walk all the > >> leaves for the folio, big or small. I think that will match the number > >> with what got returned from folio_mapcount(). > > > > See below. > > > >> > >>> > >>>> > >>>> But I agree the mapcount/referenced value itself is debatable to me, just > >>>> like what you raised in the other thread on page migration. Meanwhile, I > >>>> am not certain whether the mapcount is accurate either because AFAICT the > >>>> mapcount can be modified if e.g. new page mapping established as long as > >>>> before taking the page lock later in folio_referenced(). > >>>> > >>>> It's just that I don't see any severe issue either due to any of above, as > >>>> long as that information is only used as a hint for next steps, e.g., to > >>>> swap which page out. 
> >>> > >>> I also don't see a big problem with folio_referenced() (and you're > >>> right that folio_mapcount() can be stale by the time it takes the > >>> folio lock). It still seems like folio_mapcount() should return the > >>> total number of PTEs that map the page though. Are you saying that > >>> breaking this would be ok? > >> > >> I didn't quite follow - isn't that already doing so? > >> > >> folio_mapcount() is total_compound_mapcount() here, IIUC it is an > >> accumulated value of all possible PTEs or PMDs being mapped as long as it's > >> all or part of the folio being mapped. > > > > We've talked about 3 ways of handling mapcount: > > > > 1. The RFC v2 way, which is head-only, and we increment the compound > > mapcount for each PT mapping we have. So a PTE-mapped 2M page, > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). > > 2. The THP-like way. If we are fully mapping the hugetlb page with the > > hstate-level PTE, we increment the compound mapcount, otherwise we > > increment subpage->_mapcount. > > 3. The RFC v1 way (the way you have suggested above), which is > > head-only, and we increment the compound mapcount if the hstate-level > > PTE is made present. > > > > With #1 and #2, there is no concern with folio_mapcount(). But with > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA > > would yield 1 instead of 512 (right?). That's what I mean. > > My 2 cents: > > The mapcount is primarily used (in hugetlb context) to > > (a) Detect if a page might be shared. mapcount > 1 implies that two > independent page table hierarchies are mapping the page. We care about > mapcount == 1 vs. mapcount != 1. > > (b) Detect if unmapping was sucessfull. We care about mapcount == 0 vs. > mapcount != 0. > > For hugetlb, I don't see why we should care about the subpage mapcount > at all. Agreed -- it shouldn't really matter all that much. > > For (a) it's even good to count "somehow mapped into a single page table > structure" as "mapcount == 1" For (b), we don't care as long as "still > mapped" implies "mapcount != 0". Thanks for your thoughts, David. So it sounds like you're still squarely in the #3 camp. :)
>>> 1. The RFC v2 way, which is head-only, and we increment the compound >>> mapcount for each PT mapping we have. So a PTE-mapped 2M page, >>> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). >>> 2. The THP-like way. If we are fully mapping the hugetlb page with the >>> hstate-level PTE, we increment the compound mapcount, otherwise we >>> increment subpage->_mapcount. >>> 3. The RFC v1 way (the way you have suggested above), which is >>> head-only, and we increment the compound mapcount if the hstate-level >>> PTE is made present. >>> >>> With #1 and #2, there is no concern with folio_mapcount(). But with >>> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA >>> would yield 1 instead of 512 (right?). That's what I mean. >> >> My 2 cents: >> >> The mapcount is primarily used (in hugetlb context) to >> >> (a) Detect if a page might be shared. mapcount > 1 implies that two >> independent page table hierarchies are mapping the page. We care about >> mapcount == 1 vs. mapcount != 1. >> >> (b) Detect if unmapping was sucessfull. We care about mapcount == 0 vs. >> mapcount != 0. >> >> For hugetlb, I don't see why we should care about the subpage mapcount >> at all. > > Agreed -- it shouldn't really matter all that much. > >> >> For (a) it's even good to count "somehow mapped into a single page table >> structure" as "mapcount == 1" For (b), we don't care as long as "still >> mapped" implies "mapcount != 0". > > Thanks for your thoughts, David. So it sounds like you're still > squarely in the #3 camp. :) Well, yes. As long as we can get it implemented in a clean way ... :)
On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote: > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > [snip] > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > > > > > freeing pgtable of the PUD entry). > > > > > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to > > > > > > the direction Matthew Wilcox wants to go with THPs. > > > > > > > > > > I may not be familiar with it, do you mean this one? > > > > > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > > > > > > > > Yep that's it. > > > > > > > > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized > > > > > folios, because there's the pgtable non-leaf entry to track rmap > > > > > information and the folio size being static to hpage size. > > > > > > > > > > It'll be different to folios where it can be random sized pages chunk, so > > > > > it needs to be managed by batching the ptes when install/zap. > > > > > > > > Agreed. It's probably easier for HugeTLB because they're always > > > > "naturally aligned" and yeah they can't change sizes. > > > > > > > > > > > > > > > > > > > > > Something I noticed though, from the implementation of > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > > > > > ought to report the total number of PTEs that are pointing on the page > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > > > > > folio_referenced() is never called for hugetlb folios. > > > > > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number > > > > > should match with what it expects to see later, more or less. > > > > > > > > I don't fully understand what you mean here. > > > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the > > > leaves for the folio, big or small. I think that will match the number > > > with what got returned from folio_mapcount(). > > > > See below. > > > > > > > > > > > > > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just > > > > > like what you raised in the other thread on page migration. Meanwhile, I > > > > > am not certain whether the mapcount is accurate either because AFAICT the > > > > > mapcount can be modified if e.g. new page mapping established as long as > > > > > before taking the page lock later in folio_referenced(). 
> > > > > > > > > > It's just that I don't see any severe issue either due to any of above, as > > > > > long as that information is only used as a hint for next steps, e.g., to > > > > > swap which page out. > > > > > > > > I also don't see a big problem with folio_referenced() (and you're > > > > right that folio_mapcount() can be stale by the time it takes the > > > > folio lock). It still seems like folio_mapcount() should return the > > > > total number of PTEs that map the page though. Are you saying that > > > > breaking this would be ok? > > > > > > I didn't quite follow - isn't that already doing so? > > > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's > > > all or part of the folio being mapped. > > > > We've talked about 3 ways of handling mapcount: > > > > 1. The RFC v2 way, which is head-only, and we increment the compound > > mapcount for each PT mapping we have. So a PTE-mapped 2M page, > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). > > 2. The THP-like way. If we are fully mapping the hugetlb page with the > > hstate-level PTE, we increment the compound mapcount, otherwise we > > increment subpage->_mapcount. > > 3. The RFC v1 way (the way you have suggested above), which is > > head-only, and we increment the compound mapcount if the hstate-level > > PTE is made present. > > Oh that's where it come from! It took quite some months going through all > these, I can hardly remember the details. > > > > > With #1 and #2, there is no concern with folio_mapcount(). But with > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA > > would yield 1 instead of 512 (right?). That's what I mean. > > > > #1 has problems wrt smaps and migration (though there were other > > problems with those anyway that Mike has fixed), and #2 makes > > MADV_COLLAPSE slow to the point of being unusable for some > > applications. > > Ah so you're talking about after HGM being applied.. while I was only > talking about THPs. > > If to apply the logic here with idea 3), the worst case is we'll need to > have special care of HGM hugetlb in folio_referenced_one(), so the default > page_vma_mapped_walk() may not apply anymore - the resource is always in > hstate sized, so counting small ptes do not help too - we can just walk > until the hstate entry and do referenced++ if it's not none, at the > entrance of folio_referenced_one(). > > But I'm not sure whether that'll be necessary at all, as I'm not sure > whether that path can be triggered at all in any form (where from the top > it should always be shrink_page_list()). In that sense maybe we can also > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a > hugetlb page that got passed in? Meanwhile, adding a TODO comment > explaining that current walk won't work easily for HGM only, so when it > will be applicable to hugetlb we need to rework? > > I confess that's not pretty, though. But that'll make 3) with no major > defect from function-wise. Another potential idea would be to add something like page_vmacount(). For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for HugeTLB pages, we could keep a separate count (in one of the tail pages, I guess). And then in the places that matter (so smaps, migration, and maybe CoW and hwpoison), potentially change their calls to page_vmacount() instead of page_mapcount(). 
Then to implement page_vmacount(), we do the RFC v1 mapcount approach (but like.... correctly this time). And then for page_mapcount(), we do the RFC v2 mapcount approach (head-only, once per PTE). Then we fix folio_referenced() without needing to special-case it for HugeTLB. :) Or we could just special-case it. *shrug* Does that sound reasonable? We still have the problem where a series of partially unmaps could leave page_vmacount() incremented, but I don't think that's a big problem. > > Side note: did we finish folio conversion on hugetlb at all? I think at > least we need some helper like folio_test_huge(). It seems still missing. > Maybe it's another clue that hugetlb is not important to folio_referenced() > because it's already fully converted? I'm not sure. A lot of work was done very pretty recently, so I bet there's probably some work left to do.
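[ Aside: a purely hypothetical sketch of the split being proposed; the _vmacount field and every other name below are invented for illustration, nothing here is proposed code. ]

#include <linux/atomic.h>
#include <linux/types.h>

/* Stand-in for per-folio state that would live in a tail page. */
struct sketch_hugetlb_counts {
	atomic_t _vmacount;	/* RFC-v1-style: one per hstate-level entry */
	atomic_t _mapcount;	/* RFC-v2-style: one per PTE/PMD mapping */
};

/* Install one mapping (e.g. a 4K PTE) of the hugetlb page. */
static void sketch_add_mapping(struct sketch_hugetlb_counts *c,
			       bool hstate_entry_newly_populated)
{
	/* mapcount: bumped for every mapping installed; feeds
	 * folio_referenced()-style walks that expect a per-mapping total. */
	atomic_inc(&c->_mapcount);
	/* vmacount: bumped only when the hstate-level entry first becomes
	 * present; feeds smaps/migration/CoW/hwpoison-style checks. */
	if (hstate_entry_newly_populated)
		atomic_inc(&c->_vmacount);
}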
On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote: > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote: > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > > [snip] > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > > > > > > freeing pgtable of the PUD entry). > > > > > > > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to > > > > > > > the direction Matthew Wilcox wants to go with THPs. > > > > > > > > > > > > I may not be familiar with it, do you mean this one? > > > > > > > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > > > > > > > > > > Yep that's it. > > > > > > > > > > > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized > > > > > > folios, because there's the pgtable non-leaf entry to track rmap > > > > > > information and the folio size being static to hpage size. > > > > > > > > > > > > It'll be different to folios where it can be random sized pages chunk, so > > > > > > it needs to be managed by batching the ptes when install/zap. > > > > > > > > > > Agreed. It's probably easier for HugeTLB because they're always > > > > > "naturally aligned" and yeah they can't change sizes. > > > > > > > > > > > > > > > > > > > > > > > > > Something I noticed though, from the implementation of > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > > > > > > ought to report the total number of PTEs that are pointing on the page > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > > > > > > folio_referenced() is never called for hugetlb folios. > > > > > > > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number > > > > > > should match with what it expects to see later, more or less. > > > > > > > > > > I don't fully understand what you mean here. > > > > > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the > > > > leaves for the folio, big or small. I think that will match the number > > > > with what got returned from folio_mapcount(). > > > > > > See below. > > > > > > > > > > > > > > > > > > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just > > > > > > like what you raised in the other thread on page migration. Meanwhile, I > > > > > > am not certain whether the mapcount is accurate either because AFAICT the > > > > > > mapcount can be modified if e.g. 
new page mapping established as long as > > > > > > before taking the page lock later in folio_referenced(). > > > > > > > > > > > > It's just that I don't see any severe issue either due to any of above, as > > > > > > long as that information is only used as a hint for next steps, e.g., to > > > > > > swap which page out. > > > > > > > > > > I also don't see a big problem with folio_referenced() (and you're > > > > > right that folio_mapcount() can be stale by the time it takes the > > > > > folio lock). It still seems like folio_mapcount() should return the > > > > > total number of PTEs that map the page though. Are you saying that > > > > > breaking this would be ok? > > > > > > > > I didn't quite follow - isn't that already doing so? > > > > > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's > > > > all or part of the folio being mapped. > > > > > > We've talked about 3 ways of handling mapcount: > > > > > > 1. The RFC v2 way, which is head-only, and we increment the compound > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page, > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the > > > hstate-level PTE, we increment the compound mapcount, otherwise we > > > increment subpage->_mapcount. > > > 3. The RFC v1 way (the way you have suggested above), which is > > > head-only, and we increment the compound mapcount if the hstate-level > > > PTE is made present. > > > > Oh that's where it come from! It took quite some months going through all > > these, I can hardly remember the details. > > > > > > > > With #1 and #2, there is no concern with folio_mapcount(). But with > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA > > > would yield 1 instead of 512 (right?). That's what I mean. > > > > > > #1 has problems wrt smaps and migration (though there were other > > > problems with those anyway that Mike has fixed), and #2 makes > > > MADV_COLLAPSE slow to the point of being unusable for some > > > applications. > > > > Ah so you're talking about after HGM being applied.. while I was only > > talking about THPs. > > > > If to apply the logic here with idea 3), the worst case is we'll need to > > have special care of HGM hugetlb in folio_referenced_one(), so the default > > page_vma_mapped_walk() may not apply anymore - the resource is always in > > hstate sized, so counting small ptes do not help too - we can just walk > > until the hstate entry and do referenced++ if it's not none, at the > > entrance of folio_referenced_one(). > > > > But I'm not sure whether that'll be necessary at all, as I'm not sure > > whether that path can be triggered at all in any form (where from the top > > it should always be shrink_page_list()). In that sense maybe we can also > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a > > hugetlb page that got passed in? Meanwhile, adding a TODO comment > > explaining that current walk won't work easily for HGM only, so when it > > will be applicable to hugetlb we need to rework? > > > > I confess that's not pretty, though. But that'll make 3) with no major > > defect from function-wise. > > Another potential idea would be to add something like page_vmacount(). > For non-HugeTLB pages, page_vmacount() == page_mapcount(). 
Then for > HugeTLB pages, we could keep a separate count (in one of the tail > pages, I guess). And then in the places that matter (so smaps, > migration, and maybe CoW and hwpoison), potentially change their calls > to page_vmacount() instead of page_mapcount(). > > Then to implement page_vmacount(), we do the RFC v1 mapcount approach > (but like.... correctly this time). And then for page_mapcount(), we > do the RFC v2 mapcount approach (head-only, once per PTE). > > Then we fix folio_referenced() without needing to special-case it for > HugeTLB. :) Or we could just special-case it. *shrug* > > Does that sound reasonable? We still have the problem where a series > of partially unmaps could leave page_vmacount() incremented, but I > don't think that's a big problem. I'm afraid someone will stop you from introducing yet another definition of mapcount, where others are trying to remove it. :) Or, can we just drop folio_referenced_arg.mapcount? We need to keep: if (!pra.mapcount) return 0; By replacing it with folio_mapcount which is definitely something worthwhile, but what about the rest? If it can be dropped, afaict it'll naturally work with HGM again. IIUC that's an optimization where we want to stop the rmap walk as long as we found all the pages, however (1) IIUC it's not required to function, and (2) it's not guaranteed to work as solid anyway.. As we've discussed before: right after it reads mapcount (but before taking the page lock), the mapcount can get decreased by 1, then it'll still need to loop over all the vmas just to find that there's one "misterious" mapcount lost. Personally I really have no idea on how much that optimization can help.
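For reference, a minimal sketch of the suggestion above, i.e. dropping folio_referenced_arg.mapcount and keeping only the cheap up-front check; this is an illustration against mm/rmap.c of roughly this era, not an actual patch, and the elided parts are assumed unchanged:

int folio_referenced(struct folio *folio, int is_locked,
		     struct mem_cgroup *memcg, unsigned long *vm_flags)
{
	struct folio_referenced_arg pra = {
		/*
		 * No .mapcount member: folio_referenced_one() loses its
		 * early termination of the rmap walk, which is exactly the
		 * optimization being questioned here.
		 */
		.memcg = memcg,
	};

	*vm_flags = 0;
	if (!folio_mapcount(folio))	/* replaces the !pra.mapcount check */
		return 0;

	/* ... set up rmap_walk_control and do the walk as before ... */
}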
On Wed, Feb 1, 2023 at 1:51 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote: > > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote: > > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > > > [snip] > > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > > > > > > > freeing pgtable of the PUD entry). > > > > > > > > > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to > > > > > > > > the direction Matthew Wilcox wants to go with THPs. > > > > > > > > > > > > > > I may not be familiar with it, do you mean this one? > > > > > > > > > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > > > > > > > > > > > > Yep that's it. > > > > > > > > > > > > > > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized > > > > > > > folios, because there's the pgtable non-leaf entry to track rmap > > > > > > > information and the folio size being static to hpage size. > > > > > > > > > > > > > > It'll be different to folios where it can be random sized pages chunk, so > > > > > > > it needs to be managed by batching the ptes when install/zap. > > > > > > > > > > > > Agreed. It's probably easier for HugeTLB because they're always > > > > > > "naturally aligned" and yeah they can't change sizes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Something I noticed though, from the implementation of > > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > > > > > > > ought to report the total number of PTEs that are pointing on the page > > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > > > > > > > folio_referenced() is never called for hugetlb folios. > > > > > > > > > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number > > > > > > > should match with what it expects to see later, more or less. > > > > > > > > > > > > I don't fully understand what you mean here. > > > > > > > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the > > > > > leaves for the folio, big or small. I think that will match the number > > > > > with what got returned from folio_mapcount(). > > > > > > > > See below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just > > > > > > > like what you raised in the other thread on page migration. 
Meanwhile, I > > > > > > > am not certain whether the mapcount is accurate either because AFAICT the > > > > > > > mapcount can be modified if e.g. new page mapping established as long as > > > > > > > before taking the page lock later in folio_referenced(). > > > > > > > > > > > > > > It's just that I don't see any severe issue either due to any of above, as > > > > > > > long as that information is only used as a hint for next steps, e.g., to > > > > > > > swap which page out. > > > > > > > > > > > > I also don't see a big problem with folio_referenced() (and you're > > > > > > right that folio_mapcount() can be stale by the time it takes the > > > > > > folio lock). It still seems like folio_mapcount() should return the > > > > > > total number of PTEs that map the page though. Are you saying that > > > > > > breaking this would be ok? > > > > > > > > > > I didn't quite follow - isn't that already doing so? > > > > > > > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an > > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's > > > > > all or part of the folio being mapped. > > > > > > > > We've talked about 3 ways of handling mapcount: > > > > > > > > 1. The RFC v2 way, which is head-only, and we increment the compound > > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page, > > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). > > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the > > > > hstate-level PTE, we increment the compound mapcount, otherwise we > > > > increment subpage->_mapcount. > > > > 3. The RFC v1 way (the way you have suggested above), which is > > > > head-only, and we increment the compound mapcount if the hstate-level > > > > PTE is made present. > > > > > > Oh that's where it come from! It took quite some months going through all > > > these, I can hardly remember the details. > > > > > > > > > > > With #1 and #2, there is no concern with folio_mapcount(). But with > > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA > > > > would yield 1 instead of 512 (right?). That's what I mean. > > > > > > > > #1 has problems wrt smaps and migration (though there were other > > > > problems with those anyway that Mike has fixed), and #2 makes > > > > MADV_COLLAPSE slow to the point of being unusable for some > > > > applications. > > > > > > Ah so you're talking about after HGM being applied.. while I was only > > > talking about THPs. > > > > > > If to apply the logic here with idea 3), the worst case is we'll need to > > > have special care of HGM hugetlb in folio_referenced_one(), so the default > > > page_vma_mapped_walk() may not apply anymore - the resource is always in > > > hstate sized, so counting small ptes do not help too - we can just walk > > > until the hstate entry and do referenced++ if it's not none, at the > > > entrance of folio_referenced_one(). > > > > > > But I'm not sure whether that'll be necessary at all, as I'm not sure > > > whether that path can be triggered at all in any form (where from the top > > > it should always be shrink_page_list()). In that sense maybe we can also > > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a > > > hugetlb page that got passed in? Meanwhile, adding a TODO comment > > > explaining that current walk won't work easily for HGM only, so when it > > > will be applicable to hugetlb we need to rework? > > > > > > I confess that's not pretty, though. 
But that'll make 3) with no major > > > defect from function-wise. > > > > Another potential idea would be to add something like page_vmacount(). > > For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for > > HugeTLB pages, we could keep a separate count (in one of the tail > > pages, I guess). And then in the places that matter (so smaps, > > migration, and maybe CoW and hwpoison), potentially change their calls > > to page_vmacount() instead of page_mapcount(). > > > > Then to implement page_vmacount(), we do the RFC v1 mapcount approach > > (but like.... correctly this time). And then for page_mapcount(), we > > do the RFC v2 mapcount approach (head-only, once per PTE). > > > > Then we fix folio_referenced() without needing to special-case it for > > HugeTLB. :) Or we could just special-case it. *shrug* > > > > Does that sound reasonable? We still have the problem where a series > > of partially unmaps could leave page_vmacount() incremented, but I > > don't think that's a big problem. > > I'm afraid someone will stop you from introducing yet another definition of > mapcount, where others are trying to remove it. :) > > Or, can we just drop folio_referenced_arg.mapcount? We need to keep: > > if (!pra.mapcount) > return 0; > > By replacing it with folio_mapcount which is definitely something > worthwhile, but what about the rest? > > If it can be dropped, afaict it'll naturally work with HGM again. > > IIUC that's an optimization where we want to stop the rmap walk as long as > we found all the pages, however (1) IIUC it's not required to function, and > (2) it's not guaranteed to work as solid anyway.. As we've discussed > before: right after it reads mapcount (but before taking the page lock), > the mapcount can get decreased by 1, then it'll still need to loop over all > the vmas just to find that there's one "misterious" mapcount lost. > > Personally I really have no idea on how much that optimization can help. Ok, yeah, I think pra.mapcount can be removed too. (And we replace !pra.mapcount with !folio_mapcount().) I don't see any other existing users of folio_mapcount() and total_mapcount() that are problematic. We do need to make sure to keep refcount and mapcount in sync though; it can be done. So I'll compare this "RFC v1" way with the THP-like way and get you a performance comparison. - James
On Wed, Feb 1, 2023 at 4:24 PM James Houghton <jthoughton@google.com> wrote: > > On Wed, Feb 1, 2023 at 1:51 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote: > > > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote: > > > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote: > > > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote: > > > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote: > > > > > [snip] > > > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to > > > > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only > > > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter > > > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd > > > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when > > > > > > > > > > freeing pgtable of the PUD entry). > > > > > > > > > > > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to > > > > > > > > > the direction Matthew Wilcox wants to go with THPs. > > > > > > > > > > > > > > > > I may not be familiar with it, do you mean this one? > > > > > > > > > > > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/ > > > > > > > > > > > > > > Yep that's it. > > > > > > > > > > > > > > > > > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized > > > > > > > > folios, because there's the pgtable non-leaf entry to track rmap > > > > > > > > information and the folio size being static to hpage size. > > > > > > > > > > > > > > > > It'll be different to folios where it can be random sized pages chunk, so > > > > > > > > it needs to be managed by batching the ptes when install/zap. > > > > > > > > > > > > > > Agreed. It's probably easier for HugeTLB because they're always > > > > > > > "naturally aligned" and yeah they can't change sizes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Something I noticed though, from the implementation of > > > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount() > > > > > > > > > ought to report the total number of PTEs that are pointing on the page > > > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW, > > > > > > > > > folio_referenced() is never called for hugetlb folios. > > > > > > > > > > > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks - > > > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number > > > > > > > > should match with what it expects to see later, more or less. > > > > > > > > > > > > > > I don't fully understand what you mean here. > > > > > > > > > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the > > > > > > leaves for the folio, big or small. I think that will match the number > > > > > > with what got returned from folio_mapcount(). > > > > > > > > > > See below. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just > > > > > > > > like what you raised in the other thread on page migration. Meanwhile, I > > > > > > > > am not certain whether the mapcount is accurate either because AFAICT the > > > > > > > > mapcount can be modified if e.g. new page mapping established as long as > > > > > > > > before taking the page lock later in folio_referenced(). > > > > > > > > > > > > > > > > It's just that I don't see any severe issue either due to any of above, as > > > > > > > > long as that information is only used as a hint for next steps, e.g., to > > > > > > > > swap which page out. > > > > > > > > > > > > > > I also don't see a big problem with folio_referenced() (and you're > > > > > > > right that folio_mapcount() can be stale by the time it takes the > > > > > > > folio lock). It still seems like folio_mapcount() should return the > > > > > > > total number of PTEs that map the page though. Are you saying that > > > > > > > breaking this would be ok? > > > > > > > > > > > > I didn't quite follow - isn't that already doing so? > > > > > > > > > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an > > > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's > > > > > > all or part of the folio being mapped. > > > > > > > > > > We've talked about 3 ways of handling mapcount: > > > > > > > > > > 1. The RFC v2 way, which is head-only, and we increment the compound > > > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page, > > > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias). > > > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the > > > > > hstate-level PTE, we increment the compound mapcount, otherwise we > > > > > increment subpage->_mapcount. > > > > > 3. The RFC v1 way (the way you have suggested above), which is > > > > > head-only, and we increment the compound mapcount if the hstate-level > > > > > PTE is made present. > > > > > > > > Oh that's where it come from! It took quite some months going through all > > > > these, I can hardly remember the details. > > > > > > > > > > > > > > With #1 and #2, there is no concern with folio_mapcount(). But with > > > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA > > > > > would yield 1 instead of 512 (right?). That's what I mean. > > > > > > > > > > #1 has problems wrt smaps and migration (though there were other > > > > > problems with those anyway that Mike has fixed), and #2 makes > > > > > MADV_COLLAPSE slow to the point of being unusable for some > > > > > applications. > > > > > > > > Ah so you're talking about after HGM being applied.. while I was only > > > > talking about THPs. > > > > > > > > If to apply the logic here with idea 3), the worst case is we'll need to > > > > have special care of HGM hugetlb in folio_referenced_one(), so the default > > > > page_vma_mapped_walk() may not apply anymore - the resource is always in > > > > hstate sized, so counting small ptes do not help too - we can just walk > > > > until the hstate entry and do referenced++ if it's not none, at the > > > > entrance of folio_referenced_one(). > > > > > > > > But I'm not sure whether that'll be necessary at all, as I'm not sure > > > > whether that path can be triggered at all in any form (where from the top > > > > it should always be shrink_page_list()). 
In that sense maybe we can also > > > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a > > > > hugetlb page that got passed in? Meanwhile, adding a TODO comment > > > > explaining that current walk won't work easily for HGM only, so when it > > > > will be applicable to hugetlb we need to rework? > > > > > > > > I confess that's not pretty, though. But that'll make 3) with no major > > > > defect from function-wise. > > > > > > Another potential idea would be to add something like page_vmacount(). > > > For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for > > > HugeTLB pages, we could keep a separate count (in one of the tail > > > pages, I guess). And then in the places that matter (so smaps, > > > migration, and maybe CoW and hwpoison), potentially change their calls > > > to page_vmacount() instead of page_mapcount(). > > > > > > Then to implement page_vmacount(), we do the RFC v1 mapcount approach > > > (but like.... correctly this time). And then for page_mapcount(), we > > > do the RFC v2 mapcount approach (head-only, once per PTE). > > > > > > Then we fix folio_referenced() without needing to special-case it for > > > HugeTLB. :) Or we could just special-case it. *shrug* > > > > > > Does that sound reasonable? We still have the problem where a series > > > of partially unmaps could leave page_vmacount() incremented, but I > > > don't think that's a big problem. > > > > I'm afraid someone will stop you from introducing yet another definition of > > mapcount, where others are trying to remove it. :) > > > > Or, can we just drop folio_referenced_arg.mapcount? We need to keep: > > > > if (!pra.mapcount) > > return 0; > > > > By replacing it with folio_mapcount which is definitely something > > worthwhile, but what about the rest? > > > > If it can be dropped, afaict it'll naturally work with HGM again. > > > > IIUC that's an optimization where we want to stop the rmap walk as long as > > we found all the pages, however (1) IIUC it's not required to function, and > > (2) it's not guaranteed to work as solid anyway.. As we've discussed > > before: right after it reads mapcount (but before taking the page lock), > > the mapcount can get decreased by 1, then it'll still need to loop over all > > the vmas just to find that there's one "misterious" mapcount lost. > > > > Personally I really have no idea on how much that optimization can help. > > Ok, yeah, I think pra.mapcount can be removed too. (And we replace > !pra.mapcount with !folio_mapcount().) > > I don't see any other existing users of folio_mapcount() and > total_mapcount() that are problematic. We do need to make sure to keep > refcount and mapcount in sync though; it can be done. > > So I'll compare this "RFC v1" way with the THP-like way and get you a > performance comparison. Here is the result: [1] (sorry it took a little while heh). The implementation of the "RFC v1" way is pretty horrible[2] (and this implementation probably has bugs anyway; it doesn't account for the folio_referenced() problem). Matthew is trying to solve the same problem with THPs right now: [3]. I haven't figured out how we can apply Matthews's approach to HGM right now, but there probably is a way. (If we left the mapcount increment bits in the same place, we couldn't just check the hstate-level PTE; it would have already been made present.) 
We could: - use the THP-like way and tolerate ~1 second collapses - use the (non-RFC) v1 way and tolerate the migration/smaps differences - use the RFC v1 way and tolerate the complicated mapcount accounting - flesh out [3] and see if it can be applied to HGM nicely I'm happy to go with any of these approaches. [1]: https://pastebin.com/raw/hJzFJHiD [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976 [3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/ - James
> Here is the result: [1] (sorry it took a little while heh). The > implementation of the "RFC v1" way is pretty horrible[2] (and this > implementation probably has bugs anyway; it doesn't account for the > folio_referenced() problem). > > Matthew is trying to solve the same problem with THPs right now: [3]. > I haven't figured out how we can apply Matthews's approach to HGM > right now, but there probably is a way. (If we left the mapcount > increment bits in the same place, we couldn't just check the > hstate-level PTE; it would have already been made present.) > > We could: > - use the THP-like way and tolerate ~1 second collapses Another thought here. We don't necessarily *need* to collapse the page table mappings in between mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end(), as the pfns aren't changing, we aren't punching any holes, and we aren't changing permission bits. If we had an MMU notifier that simply informed KVM that we collapsed the page tables *after* we finished collapsing, then it would be ok for hugetlb_collapse() to be slow. If this MMU notifier is something that makes sense, it probably applies to MADV_COLLAPSE for THPs as well. > - use the (non-RFC) v1 way and tolerate the migration/smaps differences > - use the RFC v1 way and tolerate the complicated mapcount accounting > - flesh out [3] and see if it can be applied to HGM nicely > > I'm happy to go with any of these approaches. > > [1]: https://pastebin.com/raw/hJzFJHiD > [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976 > [3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/ - James
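A minimal sketch of the notifier idea above, with two invented names: mmu_notifier_pgtables_collapsed() (the proposed "collapse finished" notification, which does not exist upstream) and collapse_hgm_page_tables() (a placeholder for the actual collapse work):

static int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
			    unsigned long start, unsigned long end)
{
	int ret;

	mmap_assert_write_locked(mm);

	/*
	 * No invalidate_range_start()/end() window around the (slow) page
	 * table surgery: the pfns, permissions, and contents seen by
	 * secondary MMUs stay valid the whole time.
	 */
	ret = collapse_hgm_page_tables(mm, vma, start, end);

	/*
	 * Only tell KVM (and friends) afterwards that the range became
	 * collapsible, so they can rebuild larger second-stage mappings at
	 * their leisure.
	 */
	mmu_notifier_pgtables_collapsed(mm, start, end);
	return ret;
}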
James, On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote: > > Here is the result: [1] (sorry it took a little while heh). The Thanks. From what I can tell, that number shows that it'll be great we start with your rfcv1 mapcount approach, which mimics what's proposed by Matthew for generic folio. > > implementation of the "RFC v1" way is pretty horrible[2] (and this Any more information on why it's horrible? :) A quick comment is I'm wondering whether that "whether we should boost the mapcount" value can be hidden in hugetlb_pte* so you don't need to pass over a lot of bool* deep into the hgm walk routines. > > implementation probably has bugs anyway; it doesn't account for the > > folio_referenced() problem). I thought we reached a consensus on the resolution, by a proposal to remove folio_referenced_arg.mapcount. Is it not working for some reason? > > > > Matthew is trying to solve the same problem with THPs right now: [3]. > > I haven't figured out how we can apply Matthews's approach to HGM > > right now, but there probably is a way. (If we left the mapcount > > increment bits in the same place, we couldn't just check the > > hstate-level PTE; it would have already been made present.) I'm just worried that (1) this may add yet another dependency to your work which is still during discussion phase, and (2) whether the folio approach is easily applicable here, e.g., we may not want to populate all the ptes for hugetlb HGMs by default. > > > > We could: > > - use the THP-like way and tolerate ~1 second collapses > > Another thought here. We don't necessarily *need* to collapse the page > table mappings in between mmu_notifier_invalidate_range_start() and > mmu_notifier_invalidate_range_end(), as the pfns aren't changing, > we aren't punching any holes, and we aren't changing permission bits. > If we had an MMU notifier that simply informed KVM that we collapsed > the page tables *after* we finished collapsing, then it would be ok > for hugetlb_collapse() to be slow. That's a great point! It'll definitely apply to either approach. > > If this MMU notifier is something that makes sense, it probably > applies to MADV_COLLAPSE for THPs as well. THPs are definitely different, mmu notifiers should be required there, afaict. Isn't that what the current code does? See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon. > > > > - use the (non-RFC) v1 way and tolerate the migration/smaps differences > > - use the RFC v1 way and tolerate the complicated mapcount accounting > > - flesh out [3] and see if it can be applied to HGM nicely > > > > I'm happy to go with any of these approaches. > > > > [1]: https://pastebin.com/raw/hJzFJHiD > > [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976 > > [3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/ > > - James >
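A small sketch of hiding that value in the walk context, assuming the field layout of struct hugetlb_pte in this series (the existing members shown are only an approximation and the new member name is invented):

struct hugetlb_pte {
	pte_t *ptep;
	unsigned int shift;
	enum hugetlb_level level;
	spinlock_t *ptl;
	/*
	 * Set by the HGM walk when it instantiates the hstate-level entry,
	 * so callers know to take the compound mapcount without a bool*
	 * being threaded through every walk routine.
	 */
	bool boost_mapcount;
};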
On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote: > > James, > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote: > > > Here is the result: [1] (sorry it took a little while heh). The > > Thanks. From what I can tell, that number shows that it'll be great we > start with your rfcv1 mapcount approach, which mimics what's proposed by > Matthew for generic folio. Do you think the RFC v1 way is better than doing the THP-like way *with the additional MMU notifier*? > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this > > Any more information on why it's horrible? :) I figured the code would speak for itself, heh. It's quite complicated. I really didn't like: 1. The 'inc' business in copy_hugetlb_page_range. 2. How/where I call put_page()/folio_put() to keep the refcount and mapcount synced up. 3. Having to check the page cache in UFFDIO_CONTINUE. > > A quick comment is I'm wondering whether that "whether we should boost the > mapcount" value can be hidden in hugetlb_pte* so you don't need to pass > over a lot of bool* deep into the hgm walk routines. Oh yeah, that's a great idea. > > > > implementation probably has bugs anyway; it doesn't account for the > > > folio_referenced() problem). > > I thought we reached a consensus on the resolution, by a proposal to remove > folio_referenced_arg.mapcount. Is it not working for some reason? I think that works, I just didn't bother here. I just wanted to show you approximately what it would look like to implement the RFC v1 approach. > > > > > > > Matthew is trying to solve the same problem with THPs right now: [3]. > > > I haven't figured out how we can apply Matthews's approach to HGM > > > right now, but there probably is a way. (If we left the mapcount > > > increment bits in the same place, we couldn't just check the > > > hstate-level PTE; it would have already been made present.) > > I'm just worried that (1) this may add yet another dependency to your work > which is still during discussion phase, and (2) whether the folio approach > is easily applicable here, e.g., we may not want to populate all the ptes > for hugetlb HGMs by default. That's true. I definitely don't want to wait for this either. It seems like Matthew's approach won't work very well for us -- when doing a lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all the PTEs to see if any of them are mapped would get really slow. > > > > > > > We could: > > > - use the THP-like way and tolerate ~1 second collapses > > > > Another thought here. We don't necessarily *need* to collapse the page > > table mappings in between mmu_notifier_invalidate_range_start() and > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing, > > we aren't punching any holes, and we aren't changing permission bits. > > If we had an MMU notifier that simply informed KVM that we collapsed > > the page tables *after* we finished collapsing, then it would be ok > > for hugetlb_collapse() to be slow. > > That's a great point! It'll definitely apply to either approach. > > > > > If this MMU notifier is something that makes sense, it probably > > applies to MADV_COLLAPSE for THPs as well. > > THPs are definitely different, mmu notifiers should be required there, > afaict. Isn't that what the current code does? > > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon. Oh, yes, of course, MADV_COLLAPSE can actually move things around and properly make THPs. Thanks. 
But it would apply if we were only collapsing PTE-mapped THPs, I think? - James
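For scale, assuming 4 KiB base pages on x86-64, the number of leaf entries that would have to be scanned per 1 GiB page is:

	1 GiB / 4 KiB = 262,144 PTEs (512 PMDs x 512 PTEs each)

so doing that scan on every high-granularity UFFDIO_CONTINUE adds up quickly.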
On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote: > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote: > > > > James, > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote: > > > > Here is the result: [1] (sorry it took a little while heh). The > > > > Thanks. From what I can tell, that number shows that it'll be great we > > start with your rfcv1 mapcount approach, which mimics what's proposed by > > Matthew for generic folio. > > Do you think the RFC v1 way is better than doing the THP-like way > *with the additional MMU notifier*? What's the additional MMU notifier you're referring? > > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this > > > > Any more information on why it's horrible? :) > > I figured the code would speak for itself, heh. It's quite complicated. > > I really didn't like: > 1. The 'inc' business in copy_hugetlb_page_range. > 2. How/where I call put_page()/folio_put() to keep the refcount and > mapcount synced up. > 3. Having to check the page cache in UFFDIO_CONTINUE. I think the complexity is one thing which I'm fine with so far. However when I think again about the things behind that complexity, I noticed there may be at least one flaw that may not be trivial to work around. It's about truncation. The problem is now we use the pgtable entry to represent the mapcount, but the pgtable entry cannot be zapped easily, unless vma unmapped or collapsed. It means e.g. truncate_inode_folio() may stop working for hugetlb (of course, with page lock held). The mappings will be removed for real, but not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps the pgtable leaves, not the ones that we used to account for mapcounts. So the kernel may see weird things, like mapcount>0 after truncate_inode_folio() being finished completely. For HGM to do the right thing, we may want to also remove the non-leaf entries when truncating or doing similar things like a rmap walk to drop any mappings for a page/folio. Though that's not doable for now because the locks that truncate_inode_folio() is weaker than what we need to free the pgtable non-leaf entries - we'll need mmap write lock for that, the same as when we unmap or collapse. Matthew's design doesn't have such issue if the ptes need to be populated, because mapcount is still with the leaves; not the case for us here. If that's the case, _maybe_ we still need to start with the stupid but working approach of subpage mapcounts. [...] > > > > Matthew is trying to solve the same problem with THPs right now: [3]. > > > > I haven't figured out how we can apply Matthews's approach to HGM > > > > right now, but there probably is a way. (If we left the mapcount > > > > increment bits in the same place, we couldn't just check the > > > > hstate-level PTE; it would have already been made present.) > > > > I'm just worried that (1) this may add yet another dependency to your work > > which is still during discussion phase, and (2) whether the folio approach > > is easily applicable here, e.g., we may not want to populate all the ptes > > for hugetlb HGMs by default. > > That's true. I definitely don't want to wait for this either. It seems > like Matthew's approach won't work very well for us -- when doing a > lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all > the PTEs to see if any of them are mapped would get really slow. I think it'll be a common problem to userfaultfd when it comes, e.g., userfaultfd by design is PAGE_SIZE based so far. 
It needs page size granule on pgtable manipulations, unless we extend the userfaultfd protocol to support folios, iiuc. > > > > > > > > > > > We could: > > > > - use the THP-like way and tolerate ~1 second collapses > > > > > > Another thought here. We don't necessarily *need* to collapse the page > > > table mappings in between mmu_notifier_invalidate_range_start() and > > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing, > > > we aren't punching any holes, and we aren't changing permission bits. > > > If we had an MMU notifier that simply informed KVM that we collapsed > > > the page tables *after* we finished collapsing, then it would be ok > > > for hugetlb_collapse() to be slow. > > > > That's a great point! It'll definitely apply to either approach. > > > > > > > > If this MMU notifier is something that makes sense, it probably > > > applies to MADV_COLLAPSE for THPs as well. > > > > THPs are definitely different, mmu notifiers should be required there, > > afaict. Isn't that what the current code does? > > > > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon. > > Oh, yes, of course, MADV_COLLAPSE can actually move things around and > properly make THPs. Thanks. But it would apply if we were only > collapsing PTE-mapped THPs, I think? Yes it applies I think. And if I'm not wrong it's also doing so. :) See collapse_pte_mapped_thp(). While for anon we always allocate a new page, hence not applicable.
On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote: > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote: > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > James, > > > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote: > > > > > Here is the result: [1] (sorry it took a little while heh). The > > > > > > Thanks. From what I can tell, that number shows that it'll be great we > > > start with your rfcv1 mapcount approach, which mimics what's proposed by > > > Matthew for generic folio. > > > > Do you think the RFC v1 way is better than doing the THP-like way > > *with the additional MMU notifier*? > > What's the additional MMU notifier you're referring? An MMU notifier that informs KVM that a collapse has happened without having to invalidate_range_start() and invalidate_range_end(), the one you're replying to lower down in the email. :) [ see below... ] > > > > > > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this > > > > > > Any more information on why it's horrible? :) > > > > I figured the code would speak for itself, heh. It's quite complicated. > > > > I really didn't like: > > 1. The 'inc' business in copy_hugetlb_page_range. > > 2. How/where I call put_page()/folio_put() to keep the refcount and > > mapcount synced up. > > 3. Having to check the page cache in UFFDIO_CONTINUE. > > I think the complexity is one thing which I'm fine with so far. However > when I think again about the things behind that complexity, I noticed there > may be at least one flaw that may not be trivial to work around. > > It's about truncation. The problem is now we use the pgtable entry to > represent the mapcount, but the pgtable entry cannot be zapped easily, > unless vma unmapped or collapsed. > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of > course, with page lock held). The mappings will be removed for real, but > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps > the pgtable leaves, not the ones that we used to account for mapcounts. > > So the kernel may see weird things, like mapcount>0 after > truncate_inode_folio() being finished completely. > > For HGM to do the right thing, we may want to also remove the non-leaf > entries when truncating or doing similar things like a rmap walk to drop > any mappings for a page/folio. Though that's not doable for now because > the locks that truncate_inode_folio() is weaker than what we need to free > the pgtable non-leaf entries - we'll need mmap write lock for that, the > same as when we unmap or collapse. > > Matthew's design doesn't have such issue if the ptes need to be populated, > because mapcount is still with the leaves; not the case for us here. > > If that's the case, _maybe_ we still need to start with the stupid but > working approach of subpage mapcounts. Good point. I can't immediately think of a solution. I would prefer to go with the subpage mapcount approach to simplify HGM for now; optimizing mapcount for HugeTLB can then be handled separately. If you're ok with this, I'll go ahead and send v2. One way that might be possible: using the PAGE_SPECIAL bit on the hstate-level PTE to indicate if mapcount has been incremented or not (if the PTE is pointing to page tables). As far as I can tell, PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would need to be careful with existing PTE examination code as to not misinterpret these PTEs. > > [...] 
> > > > > > Matthew is trying to solve the same problem with THPs right now: [3]. > > > > > I haven't figured out how we can apply Matthews's approach to HGM > > > > > right now, but there probably is a way. (If we left the mapcount > > > > > increment bits in the same place, we couldn't just check the > > > > > hstate-level PTE; it would have already been made present.) > > > > > > I'm just worried that (1) this may add yet another dependency to your work > > > which is still during discussion phase, and (2) whether the folio approach > > > is easily applicable here, e.g., we may not want to populate all the ptes > > > for hugetlb HGMs by default. > > > > That's true. I definitely don't want to wait for this either. It seems > > like Matthew's approach won't work very well for us -- when doing a > > lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all > > the PTEs to see if any of them are mapped would get really slow. > > I think it'll be a common problem to userfaultfd when it comes, e.g., > userfaultfd by design is PAGE_SIZE based so far. It needs page size > granule on pgtable manipulations, unless we extend the userfaultfd protocol > to support folios, iiuc. > > > > > > > > > > > > > > > > We could: > > > > > - use the THP-like way and tolerate ~1 second collapses > > > > > > > > Another thought here. We don't necessarily *need* to collapse the page > > > > table mappings in between mmu_notifier_invalidate_range_start() and > > > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing, > > > > we aren't punching any holes, and we aren't changing permission bits. > > > > If we had an MMU notifier that simply informed KVM that we collapsed > > > > the page tables *after* we finished collapsing, then it would be ok > > > > for hugetlb_collapse() to be slow. [ from above... ] This MMU notifier. :) > > > > > > That's a great point! It'll definitely apply to either approach. > > > > > > > > > > > If this MMU notifier is something that makes sense, it probably > > > > applies to MADV_COLLAPSE for THPs as well. > > > > > > THPs are definitely different, mmu notifiers should be required there, > > > afaict. Isn't that what the current code does? > > > > > > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon. > > > > Oh, yes, of course, MADV_COLLAPSE can actually move things around and > > properly make THPs. Thanks. But it would apply if we were only > > collapsing PTE-mapped THPs, I think? > > Yes it applies I think. And if I'm not wrong it's also doing so. :) See > collapse_pte_mapped_thp(). > > While for anon we always allocate a new page, hence not applicable. > > -- > Peter Xu Thanks Peter! - James
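A hedged sketch of the PAGE_SPECIAL idea above; the helper names are invented, and it assumes pte_special()/pte_mkspecial() (CONFIG_ARCH_HAS_PTE_SPECIAL) can be repurposed on a non-leaf hstate-level entry without confusing existing PTE-examination code, which is exactly the open question:

/* Has the compound mapcount already been taken for this VMA's mapping? */
static bool hugetlb_hgm_mapcounted(pte_t hstate_entry)
{
	/* The entry points to lower page tables, not to a page. */
	return pte_special(hstate_entry);
}

/* Record that the compound mapcount has been taken. */
static pte_t hugetlb_hgm_set_mapcounted(pte_t hstate_entry)
{
	return pte_mkspecial(hstate_entry);
}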
On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote: > On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote: > > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > James, > > > > > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote: > > > > > > Here is the result: [1] (sorry it took a little while heh). The > > > > > > > > Thanks. From what I can tell, that number shows that it'll be great we > > > > start with your rfcv1 mapcount approach, which mimics what's proposed by > > > > Matthew for generic folio. > > > > > > Do you think the RFC v1 way is better than doing the THP-like way > > > *with the additional MMU notifier*? > > > > What's the additional MMU notifier you're referring? > > An MMU notifier that informs KVM that a collapse has happened without > having to invalidate_range_start() and invalidate_range_end(), the one > you're replying to lower down in the email. :) [ see below... ] Isn't that something that is needed no matter what mapcount approach we'll go for? Did I miss something? > > > > > > > > > > > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this > > > > > > > > Any more information on why it's horrible? :) > > > > > > I figured the code would speak for itself, heh. It's quite complicated. > > > > > > I really didn't like: > > > 1. The 'inc' business in copy_hugetlb_page_range. > > > 2. How/where I call put_page()/folio_put() to keep the refcount and > > > mapcount synced up. > > > 3. Having to check the page cache in UFFDIO_CONTINUE. > > > > I think the complexity is one thing which I'm fine with so far. However > > when I think again about the things behind that complexity, I noticed there > > may be at least one flaw that may not be trivial to work around. > > > > It's about truncation. The problem is now we use the pgtable entry to > > represent the mapcount, but the pgtable entry cannot be zapped easily, > > unless vma unmapped or collapsed. > > > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of > > course, with page lock held). The mappings will be removed for real, but > > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps > > the pgtable leaves, not the ones that we used to account for mapcounts. > > > > So the kernel may see weird things, like mapcount>0 after > > truncate_inode_folio() being finished completely. > > > > For HGM to do the right thing, we may want to also remove the non-leaf > > entries when truncating or doing similar things like a rmap walk to drop > > any mappings for a page/folio. Though that's not doable for now because > > the locks that truncate_inode_folio() is weaker than what we need to free > > the pgtable non-leaf entries - we'll need mmap write lock for that, the > > same as when we unmap or collapse. > > > > Matthew's design doesn't have such issue if the ptes need to be populated, > > because mapcount is still with the leaves; not the case for us here. > > > > If that's the case, _maybe_ we still need to start with the stupid but > > working approach of subpage mapcounts. > > Good point. I can't immediately think of a solution. I would prefer to > go with the subpage mapcount approach to simplify HGM for now; > optimizing mapcount for HugeTLB can then be handled separately. If > you're ok with this, I'll go ahead and send v2. 
I'm okay with it, but I suggest wait for at least another one day or two to see whether Mike or others have any comments. > > One way that might be possible: using the PAGE_SPECIAL bit on the > hstate-level PTE to indicate if mapcount has been incremented or not > (if the PTE is pointing to page tables). As far as I can tell, > PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would > need to be careful with existing PTE examination code as to not > misinterpret these PTEs. This is an interesting idea. :) Yes I don't see it being used at all in any pgtable non-leaves. Then it's about how to let the zap code know when to remove the special bit, hence the mapcount, because not all of them should. Maybe it can be passed over as a new zap_flags_t bit? Thanks,
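A sketch of the zap_flags_t idea, assuming the existing ZAP_FLAG_* definitions occupy the low bits of the __bitwise zap_flags_t in include/linux/mm.h (the flag name and bit are invented):

#define ZAP_FLAG_HGM_DROP_MAPCOUNT	((__force zap_flags_t) BIT(2))

/*
 * Callers removing the whole hstate-sized mapping (truncation, hole punch,
 * full unmap) would pass the flag so the hugetlb zap path knows it may
 * clear the marker on the hstate-level entry and drop the compound
 * mapcount; partial zaps would leave both alone.
 */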
On Thu, Feb 9, 2023 at 11:11 AM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote: > > On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote: > > > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > James, > > > > > > > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote: > > > > > > > Here is the result: [1] (sorry it took a little while heh). The > > > > > > > > > > Thanks. From what I can tell, that number shows that it'll be great we > > > > > start with your rfcv1 mapcount approach, which mimics what's proposed by > > > > > Matthew for generic folio. > > > > > > > > Do you think the RFC v1 way is better than doing the THP-like way > > > > *with the additional MMU notifier*? > > > > > > What's the additional MMU notifier you're referring? > > > > An MMU notifier that informs KVM that a collapse has happened without > > having to invalidate_range_start() and invalidate_range_end(), the one > > you're replying to lower down in the email. :) [ see below... ] > > Isn't that something that is needed no matter what mapcount approach we'll > go for? Did I miss something? It's not really needed for anything, but it could be an optimization for both approaches. However, for the subpage-mapcount approach, it would have a *huge* impact. That's what I mean. > > > > > > > > > > > > > > > > > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this > > > > > > > > > > Any more information on why it's horrible? :) > > > > > > > > I figured the code would speak for itself, heh. It's quite complicated. > > > > > > > > I really didn't like: > > > > 1. The 'inc' business in copy_hugetlb_page_range. > > > > 2. How/where I call put_page()/folio_put() to keep the refcount and > > > > mapcount synced up. > > > > 3. Having to check the page cache in UFFDIO_CONTINUE. > > > > > > I think the complexity is one thing which I'm fine with so far. However > > > when I think again about the things behind that complexity, I noticed there > > > may be at least one flaw that may not be trivial to work around. > > > > > > It's about truncation. The problem is now we use the pgtable entry to > > > represent the mapcount, but the pgtable entry cannot be zapped easily, > > > unless vma unmapped or collapsed. > > > > > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of > > > course, with page lock held). The mappings will be removed for real, but > > > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps > > > the pgtable leaves, not the ones that we used to account for mapcounts. > > > > > > So the kernel may see weird things, like mapcount>0 after > > > truncate_inode_folio() being finished completely. > > > > > > For HGM to do the right thing, we may want to also remove the non-leaf > > > entries when truncating or doing similar things like a rmap walk to drop > > > any mappings for a page/folio. Though that's not doable for now because > > > the locks that truncate_inode_folio() is weaker than what we need to free > > > the pgtable non-leaf entries - we'll need mmap write lock for that, the > > > same as when we unmap or collapse. > > > > > > Matthew's design doesn't have such issue if the ptes need to be populated, > > > because mapcount is still with the leaves; not the case for us here. 
> > > > > > If that's the case, _maybe_ we still need to start with the stupid but > > > working approach of subpage mapcounts. > > > > Good point. I can't immediately think of a solution. I would prefer to > > go with the subpage mapcount approach to simplify HGM for now; > > optimizing mapcount for HugeTLB can then be handled separately. If > > you're ok with this, I'll go ahead and send v2. > > I'm okay with it, but I suggest wait for at least another one day or two to > see whether Mike or others have any comments. Ok. :) > > > > > One way that might be possible: using the PAGE_SPECIAL bit on the > > hstate-level PTE to indicate if mapcount has been incremented or not > > (if the PTE is pointing to page tables). As far as I can tell, > > PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would > > need to be careful with existing PTE examination code as to not > > misinterpret these PTEs. > > This is an interesting idea. :) Yes I don't see it being used at all in any > pgtable non-leaves. > > Then it's about how to let the zap code know when to remove the special > bit, hence the mapcount, because not all of them should. > > Maybe it can be passed over as a new zap_flags_t bit? Here[1] is one way it could be done (it doesn't work 100% correctly, it's just approximately what we could do). Basically we pass in the entire range that we are unmapping ("floor" and "ceil"), and if hugetlb_remove_rmap finds that we're doing the final removal of a page that we are entirely unmapping (i.e., floor <= addr & huge_page_mask(h)). Having a zap flag would probably work too. I think something like [1] ought to go in its own series. :) [1]: https://github.com/48ca/linux/commit/de884eaaadf61b8dcfb1defd99bbf487667e46f4 - James
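A hedged sketch of the floor/ceil check described above; the signature and the page_remove_rmap() call are assumptions for illustration, not the linked commit:

static void hugetlb_remove_rmap(struct page *hpage, unsigned long addr,
				unsigned long floor, unsigned long ceil,
				struct hstate *h, struct vm_area_struct *vma)
{
	unsigned long hstart = addr & huge_page_mask(h);
	unsigned long hend = hstart + huge_page_size(h);

	/*
	 * Only the zap that covers the entire hstate-sized region drops the
	 * compound mapcount; a partial zap (e.g. MADV_DONTNEED on part of
	 * the page) leaves it in place, which is the leftover-mapcount case
	 * discussed in the reply below.
	 */
	if (floor <= hstart && ceil >= hend)
		page_remove_rmap(hpage, vma, true);
}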
On Thu, Feb 09, 2023 at 11:49:25AM -0800, James Houghton wrote: > On Thu, Feb 9, 2023 at 11:11 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote: > > > On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote: > > > > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > > > > James, > > > > > > > > > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote: > > > > > > > > Here is the result: [1] (sorry it took a little while heh). The > > > > > > > > > > > > Thanks. From what I can tell, that number shows that it'll be great we > > > > > > start with your rfcv1 mapcount approach, which mimics what's proposed by > > > > > > Matthew for generic folio. > > > > > > > > > > Do you think the RFC v1 way is better than doing the THP-like way > > > > > *with the additional MMU notifier*? > > > > > > > > What's the additional MMU notifier you're referring? > > > > > > An MMU notifier that informs KVM that a collapse has happened without > > > having to invalidate_range_start() and invalidate_range_end(), the one > > > you're replying to lower down in the email. :) [ see below... ] > > > > Isn't that something that is needed no matter what mapcount approach we'll > > go for? Did I miss something? > > It's not really needed for anything, but it could be an optimization > for both approaches. However, for the subpage-mapcount approach, it > would have a *huge* impact. That's what I mean. Ah, okay. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this > > > > > > > > > > > > Any more information on why it's horrible? :) > > > > > > > > > > I figured the code would speak for itself, heh. It's quite complicated. > > > > > > > > > > I really didn't like: > > > > > 1. The 'inc' business in copy_hugetlb_page_range. > > > > > 2. How/where I call put_page()/folio_put() to keep the refcount and > > > > > mapcount synced up. > > > > > 3. Having to check the page cache in UFFDIO_CONTINUE. > > > > > > > > I think the complexity is one thing which I'm fine with so far. However > > > > when I think again about the things behind that complexity, I noticed there > > > > may be at least one flaw that may not be trivial to work around. > > > > > > > > It's about truncation. The problem is now we use the pgtable entry to > > > > represent the mapcount, but the pgtable entry cannot be zapped easily, > > > > unless vma unmapped or collapsed. > > > > > > > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of > > > > course, with page lock held). The mappings will be removed for real, but > > > > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps > > > > the pgtable leaves, not the ones that we used to account for mapcounts. > > > > > > > > So the kernel may see weird things, like mapcount>0 after > > > > truncate_inode_folio() being finished completely. > > > > > > > > For HGM to do the right thing, we may want to also remove the non-leaf > > > > entries when truncating or doing similar things like a rmap walk to drop > > > > any mappings for a page/folio. Though that's not doable for now because > > > > the locks that truncate_inode_folio() is weaker than what we need to free > > > > the pgtable non-leaf entries - we'll need mmap write lock for that, the > > > > same as when we unmap or collapse. 
> > > > > > > > Matthew's design doesn't have such issue if the ptes need to be populated, > > > > because mapcount is still with the leaves; not the case for us here. > > > > > > > > If that's the case, _maybe_ we still need to start with the stupid but > > > > working approach of subpage mapcounts. > > > > > > Good point. I can't immediately think of a solution. I would prefer to > > > go with the subpage mapcount approach to simplify HGM for now; > > > optimizing mapcount for HugeTLB can then be handled separately. If > > > you're ok with this, I'll go ahead and send v2. > > > > I'm okay with it, but I suggest wait for at least another one day or two to > > see whether Mike or others have any comments. > > Ok. :) > > > > > > > > > One way that might be possible: using the PAGE_SPECIAL bit on the > > > hstate-level PTE to indicate if mapcount has been incremented or not > > > (if the PTE is pointing to page tables). As far as I can tell, > > > PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would > > > need to be careful with existing PTE examination code as to not > > > misinterpret these PTEs. > > > > This is an interesting idea. :) Yes I don't see it being used at all in any > > pgtable non-leaves. > > > > Then it's about how to let the zap code know when to remove the special > > bit, hence the mapcount, because not all of them should. > > > > Maybe it can be passed over as a new zap_flags_t bit? > > Here[1] is one way it could be done (it doesn't work 100% correctly, > it's just approximately what we could do). Basically we pass in the > entire range that we are unmapping ("floor" and "ceil"), and if > hugetlb_remove_rmap finds that we're doing the final removal of a page > that we are entirely unmapping (i.e., floor <= addr & > huge_page_mask(h)). Having a zap flag would probably work too. Yeah maybe flags are not needed at all. I had a quick glance, looks good in general. I think the trick is when it's not unmapped in a single shot. Consider someone zaps the first half of HGM-mapped hpage then the other half. The range may not always tell the whole story so rmap might be left over in some cases. But maybe it is not a big deal. The only thing I think of so far is the partial DONTNEED. but I think maybe it's fine to leave it there until another more serious request to either truncate or unmap it. At least all rmap walks should work as expected. > > I think something like [1] ought to go in its own series. :) > > [1]: https://github.com/48ca/linux/commit/de884eaaadf61b8dcfb1defd99bbf487667e46f4 Yes I agree it can be worked on top.
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 74e1d873dce0..284466bf4f25 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2626,13 +2626,25 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
-				      unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+				      unsigned long addr,
 				      struct mm_walk *walk)
 {
-	pmd_t *pmd = (pmd_t *)pte;
+	struct hstate *h = hstate_vma(walk->vma);
+	pmd_t *pmd;
 	unsigned long start, end;
-	struct page *page = pmd_page(*pmd);
+	struct page *page;
+
+	if (huge_page_size(h) != hugetlb_pte_size(hpte))
+		/* Ignore high-granularity PTEs. */
+		return 0;
+
+	if (!pte_present(huge_ptep_get(hpte->ptep)))
+		/* Ignore non-present PTEs. */
+		return 0;
+
+	pmd = (pmd_t *)hpte->ptep;
+	page = pmd_page(*pmd);
 
 	/*
 	 * The write check makes sure we do not set a key on shared
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 41b5509bde0e..c353cab11eee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -731,18 +731,28 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
-			       unsigned long addr, unsigned long end,
-			       struct mm_walk *walk)
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
+			       unsigned long addr,
+			       struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	struct page *page = NULL;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
-	} else if (is_swap_pte(*pte)) {
-		swp_entry_t swpent = pte_to_swp_entry(*pte);
+	if (pte_present(pte)) {
+		/* We only care about leaf-level PTEs. */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			/*
+			 * The only case where hpte is not a leaf is that
+			 * it was originally none, but it was split from
+			 * under us. It was originally none, so exclude it.
+			 */
+			return 0;
+
+		page = vm_normal_page(vma, addr, pte);
+	} else if (is_swap_pte(pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(pte);
 
 		if (is_pfn_swap_entry(swpent))
 			page = pfn_swap_entry_to_page(swpent);
@@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 		int mapcount = page_mapcount(page);
 
 		if (mapcount >= 2)
-			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->shared_hugetlb += hugetlb_pte_size(hpte);
 		else
-			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->private_hugetlb += hugetlb_pte_size(hpte);
 	}
 	return 0;
 }
@@ -1572,22 +1582,31 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
+				 unsigned long addr,
 				 struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	u64 flags = 0, frame = 0;
 	int err = 0;
-	pte_t pte;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned long end = addr + hugetlb_pte_size(hpte);
+	pte_t pte = huge_ptep_get(hpte->ptep);
+	struct page *page;
 
 	if (vma->vm_flags & VM_SOFTDIRTY)
 		flags |= PM_SOFT_DIRTY;
 
-	pte = huge_ptep_get(ptep);
 	if (pte_present(pte)) {
-		struct page *page = pte_page(pte);
+		/*
+		 * We raced with this PTE being split, which can only happen if
+		 * it was blank before. Treat it is as if it were blank.
+		 */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			return 0;
+
+		page = pte_page(pte);
 
 		if (!PageAnon(page))
 			flags |= PM_FILE;
@@ -1868,10 +1887,16 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 }
 #endif
 
+struct show_numa_map_private {
+	struct numa_maps *md;
+	struct page *last_page;
+};
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
-	struct numa_maps *md = walk->private;
+	struct show_numa_map_private *priv = walk->private;
+	struct numa_maps *md = priv->md;
 	struct vm_area_struct *vma = walk->vma;
 	spinlock_t *ptl;
 	pte_t *orig_pte;
@@ -1883,6 +1908,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		struct page *page;
 
 		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		priv->last_page = page;
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1896,6 +1922,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		priv->last_page = page;
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1906,19 +1933,25 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+				struct mm_walk *walk)
 {
-	pte_t huge_pte = huge_ptep_get(pte);
+	struct show_numa_map_private *priv = walk->private;
+	pte_t huge_pte = huge_ptep_get(hpte->ptep);
 	struct numa_maps *md;
 	struct page *page;
 
-	if (!pte_present(huge_pte))
+	if (!hugetlb_pte_present_leaf(hpte, huge_pte))
+		return 0;
+
+	page = compound_head(pte_page(huge_pte));
+	if (priv->last_page == page)
+		/* we've already accounted for this page */
 		return 0;
 
-	page = pte_page(huge_pte);
-	md = walk->private;
+	priv->last_page = page;
+	md = priv->md;
 	gather_stats(page, md, pte_dirty(huge_pte), 1);
 	return 0;
 }
@@ -1948,9 +1981,15 @@ static int show_numa_map(struct seq_file *m, void *v)
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mempolicy *pol;
 	char buffer[64];
 	int nid;
+	struct show_numa_map_private numa_map_private;
+
+	numa_map_private.md = md;
+	numa_map_private.last_page = NULL;
+
 	if (!mm)
 		return 0;
@@ -1980,7 +2019,7 @@ static int show_numa_map(struct seq_file *m, void *v)
 		seq_puts(m, " huge");
 
 	/* mmap_lock is held by m_start */
-	walk_page_vma(vma, &show_numa_ops, md);
+	walk_page_vma(vma, &show_numa_ops, &numa_map_private);
 
 	if (!md->pages)
 		goto out;
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27a6df448ee5..f4bddad615c2 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@
 #define _LINUX_PAGEWALK_H
 
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 
 struct mm_walk;
 
@@ -31,6 +32,10 @@ struct mm_walk;
  *			ptl after dropping the vma lock, or else revalidate
  *			those items after re-acquiring the vma lock and before
  *			accessing them.
+ *			In the presence of high-granularity hugetlb entries,
+ *			@hugetlb_entry is called only for leaf-level entries
+ *			(hstate-level entries are ignored if they are not
+ *			leaves).
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
@@ -58,9 +63,8 @@ struct mm_walk_ops {
 			     unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
 			int depth, struct mm_walk *walk);
-	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
-			     unsigned long addr, unsigned long next,
-			     struct mm_walk *walk);
+	int (*hugetlb_entry)(struct hugetlb_pte *hpte,
+			     unsigned long addr, struct mm_walk *walk);
 	int (*test_walk)(unsigned long addr, unsigned long next,
 			struct mm_walk *walk);
 	int (*pre_vma)(unsigned long start, unsigned long end,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 9d92c5eb3a1f..2383f647f202 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -330,11 +330,12 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
+				struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
 {
 	bool referenced = false;
-	pte_t entry = huge_ptep_get(pte);
+	pte_t entry = huge_ptep_get(hpte->ptep);
 	struct folio *folio = pfn_folio(pte_pfn(entry));
 
 	folio_get(folio);
@@ -342,12 +343,12 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 	if (pte_young(entry)) {
 		referenced = true;
 		entry = pte_mkold(entry);
-		set_huge_pte_at(mm, addr, pte, entry);
+		set_huge_pte_at(mm, addr, hpte->ptep, entry);
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
 	if (mmu_notifier_clear_young(mm, addr,
-				     addr + huge_page_size(hstate_vma(vma))))
+				     addr + hugetlb_pte_size(hpte)))
 		referenced = true;
 #endif /* CONFIG_MMU_NOTIFIER */
 
@@ -358,20 +359,26 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 	folio_put(folio);
 }
 
-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
-	struct hstate *h = hstate_vma(walk->vma);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
-	damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
+	damon_hugetlb_mkold(hpte, entry, walk->mm, walk->vma, addr);
 
 out:
 	spin_unlock(ptl);
@@ -484,8 +491,8 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
 	struct damon_young_walk_private *priv = walk->private;
@@ -494,11 +501,18 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
 	folio = pfn_folio(pte_pfn(entry));
 	folio_get(folio);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 6a151c09de5e..d3e40cfdd4cb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -468,8 +468,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				      unsigned long start, unsigned long end,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
+				      unsigned long start,
 				      struct mm_walk *walk)
 {
 	unsigned long addr = start, i, pfn;
@@ -479,16 +479,24 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	unsigned int required_fault;
 	unsigned long pfn_req_flags;
 	unsigned long cpu_flags;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned int order = hpte->shift - PAGE_SHIFT;
+	unsigned long end = start + hugetlb_pte_size(hpte);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, entry)) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
 
 	i = (start - range->start) >> PAGE_SHIFT;
 	pfn_req_flags = range->hmm_pfns[i];
 	cpu_flags = pte_to_hmm_pfn_flags(range, entry) |
-			hmm_pfn_flags_order(huge_page_order(hstate_vma(vma)));
+			hmm_pfn_flags_order(order);
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
 	if (required_fault) {
@@ -605,7 +613,7 @@ int hmm_range_fault(struct hmm_range *range)
 		 * in pfns. All entries < last in the pfn array are set to their
 		 * output, and all >= are still at their input values.
 		 */
-	} while (ret == -EBUSY);
+	} while (ret == -EBUSY || ret == -EAGAIN);
 	return ret;
 }
 EXPORT_SYMBOL(hmm_range_fault);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c77a9e37e27e..e7e56298d305 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -641,6 +641,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 		unsigned long poisoned_pfn, struct to_kill *tk)
 {
 	unsigned long pfn = 0;
+	unsigned long base_pages_poisoned = (1UL << shift) / PAGE_SIZE;
 
 	if (pte_present(pte)) {
 		pfn = pte_pfn(pte);
@@ -651,7 +652,8 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 		pfn = swp_offset_pfn(swp);
 	}
 
-	if (!pfn || pfn != poisoned_pfn)
+	if (!pfn || pfn < poisoned_pfn ||
+	    pfn >= poisoned_pfn + base_pages_poisoned)
 		return 0;
 
 	set_to_kill(tk, addr, shift);
@@ -717,16 +719,15 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
-				  unsigned long addr, unsigned long end,
-				  struct mm_walk *walk)
+static int hwpoison_hugetlb_range(struct hugetlb_pte *hpte,
+				  unsigned long addr,
+				  struct mm_walk *walk)
 {
 	struct hwp_walk *hwp = walk->private;
-	pte_t pte = huge_ptep_get(ptep);
-	struct hstate *h = hstate_vma(walk->vma);
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
-				      hwp->pfn, &hwp->tk);
+	return check_hwpoisoned_entry(pte, addr & hugetlb_pte_mask(hpte),
+				      hpte->shift, hwp->pfn, &hwp->tk);
 }
 #else
 #define hwpoison_hugetlb_range NULL
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d3558248a0f0..e5859ed34e90 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -558,8 +558,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	return addr != end ? -EIO : 0;
 }
 
-static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
-			       unsigned long addr, unsigned long end,
+static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
+			       unsigned long addr,
 			       struct mm_walk *walk)
 {
 	int ret = 0;
@@ -570,8 +570,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	/* We don't migrate high-granularity HugeTLB mappings for now. */
+	if (hugetlb_hgm_enabled(walk->vma))
+		return -EINVAL;
+
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto unlock;
 	page = pte_page(entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index a085a2aeabd8..0894965b3944 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,18 +22,29 @@
 #include <linux/uaccess.h>
 #include "swap.h"
 
-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
-			unsigned long end, struct mm_walk *walk)
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
+			struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
 	unsigned char present;
+	unsigned long end = addr + hugetlb_pte_size(hpte);
 	unsigned char *vec = walk->private;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none(huge_ptep_get(pte));
+	present = !huge_pte_none(pte);
+
+	/*
+	 * If the pte is present but not a leaf, we raced with someone
+	 * splitting it. For someone to have split it, it must have been
+	 * huge_pte_none before, so treat it as such.
+	 */
+	if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
+		present = false;
+
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 71358e45a742..62d8c5f7bc92 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -543,12 +543,16 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 		0 : -EACCES;
 }
 
-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				   unsigned long addr, unsigned long next,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
+				   unsigned long addr,
 				   struct mm_walk *walk)
 {
-	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
+	pte_t pte = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, pte))
+		return -EAGAIN;
+	return pfn_modify_allowed(pte_pfn(pte),
+			*(pgprot_t *)(walk->private)) ? 0 : -EACCES;
 }
 
 static int prot_none_test(unsigned long addr, unsigned long next,
@@ -591,8 +595,10 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	    (newflags & VM_ACCESS_FLAGS) == 0) {
 		pgprot_t new_pgprot = vm_get_page_prot(newflags);
 
-		error = walk_page_range(current->mm, start, end,
-				&prot_none_walk_ops, &new_pgprot);
+		do {
+			error = walk_page_range(current->mm, start, end,
+					&prot_none_walk_ops, &new_pgprot);
+		} while (error == -EAGAIN);
 		if (error)
 			return error;
 	}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cb23f8a15c13..05ce242f8b7e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/minmax.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -296,20 +297,21 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma = walk->vma;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long next;
-	unsigned long hmask = huge_page_mask(h);
-	unsigned long sz = huge_page_size(h);
-	pte_t *pte;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	struct hugetlb_pte hpte;
 
 	hugetlb_vma_lock_read(vma);
 	do {
-		next = hugetlb_entry_end(h, addr, end);
-		pte = hugetlb_walk(vma, addr & hmask, sz);
-		if (pte)
-			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
-		else if (ops->pte_hole)
-			err = ops->pte_hole(addr, next, -1, walk);
+		if (hugetlb_full_walk(&hpte, vma, addr)) {
+			next = hugetlb_entry_end(h, addr, end);
+			if (ops->pte_hole)
+				err = ops->pte_hole(addr, next, -1, walk);
+		} else {
+			err = ops->hugetlb_entry(
+				&hpte, addr, walk);
+			next = min(addr + hugetlb_pte_size(&hpte), end);
+		}
 		if (err)
 			break;
 	} while (addr = next, addr != end);
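To make the new callback contract concrete, here is a small, self-contained
walker written against the hugetlb_entry() signature this patch introduces.
It is a hypothetical example, not part of the patch: count_hugetlb_pages()
and count_hugetlb_walk_ops are invented names, while huge_ptep_get(),
hugetlb_pte_present_leaf() and hugetlb_pte_size() are the helpers used
throughout the diff above.

#include <linux/pagewalk.h>
#include <linux/hugetlb.h>

/* Count base pages covered by present leaf hugetlb mappings in a VMA. */
static int count_hugetlb_pages(struct hugetlb_pte *hpte, unsigned long addr,
                               struct mm_walk *walk)
{
        unsigned long *nr_pages = walk->private;
        pte_t pte = huge_ptep_get(hpte->ptep);

        /*
         * With HGM, hpte may map less than a full hstate-sized page, and a
         * non-leaf entry can show up if it was split from under us; such an
         * entry was none before, so simply skip it.
         */
        if (!pte_present(pte) || !hugetlb_pte_present_leaf(hpte, pte))
                return 0;

        *nr_pages += hugetlb_pte_size(hpte) >> PAGE_SHIFT;
        return 0;
}

static const struct mm_walk_ops count_hugetlb_walk_ops = {
        .hugetlb_entry  = count_hugetlb_pages,
};

/* Usage (with mmap_lock held): walk_page_vma(vma, &count_hugetlb_walk_ops, &nr); */

Note that, unlike the old callback, the walker is now invoked once per
hugetlb_pte, which may be smaller than huge_page_size(), so per-entry
accounting has to use hugetlb_pte_size() rather than the hstate size,
exactly as the smaps and pagemap conversions above do.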