Message ID | 20221021163703.3218176-11-jthoughton@google.com |
---|---|
State | New |
Subject | [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries |
From | James Houghton <jthoughton@google.com> |
To | Mike Kravetz <mike.kravetz@oracle.com>, Muchun Song <songmuchun@bytedance.com>, Peter Xu <peterx@redhat.com> |
Date | Fri, 21 Oct 2022 16:36:26 +0000 |
Series | hugetlb: introduce HugeTLB high-granularity mapping |
Commit Message
James Houghton
Oct. 21, 2022, 4:36 p.m. UTC
After high-granularity mapping, page table entries for HugeTLB pages can
be of any size/type. (For example, we can have a 1G page mapped with a
mix of PMDs and PTEs.) This struct helps keep track of a HugeTLB PTE
after we have done a page table walk.
Without this, we'd have to pass around the "size" of the PTE everywhere.
We effectively did this before; it could be fetched from the hstate,
which we pass around pretty much everywhere.
hugetlb_pte_present_leaf is included here as a helper function that will
be used frequently later on.
Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb.c            | 29 ++++++++++++++
 2 files changed, 117 insertions(+)
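[Editor's note: to make the size/mask arithmetic this patch introduces concrete, here is a small, self-contained userspace model of the hugetlb_pte_size()/hugetlb_pte_mask() helpers. The shift constants are illustrative x86-64 values (an assumption), not taken from kernel headers.]

#include <assert.h>
#include <stdio.h>

/* Userspace model of the hugetlb_pte size/mask arithmetic. */
#define MODEL_PAGE_SHIFT 12u	/* 4K PTE (assumed x86-64 value) */
#define MODEL_PMD_SHIFT  21u	/* 2M PMD */
#define MODEL_PUD_SHIFT  30u	/* 1G PUD */

static unsigned long model_pte_size(unsigned int shift)
{
	return 1UL << shift;			/* mirrors hugetlb_pte_size() */
}

static unsigned long model_pte_mask(unsigned int shift)
{
	return ~(model_pte_size(shift) - 1);	/* mirrors hugetlb_pte_mask() */
}

int main(void)
{
	unsigned long addr = 0x7f0040201234UL;

	/* A PMD-level entry covers [addr & mask, (addr & mask) + 2M). */
	printf("2M region base: %#lx\n", addr & model_pte_mask(MODEL_PMD_SHIFT));
	printf("1G region base: %#lx\n", addr & model_pte_mask(MODEL_PUD_SHIFT));
	assert(model_pte_size(MODEL_PAGE_SHIFT) == 4096);
	return 0;
}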
Comments
On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> +struct hugetlb_pte {
> +	pte_t *ptep;
> +	unsigned int shift;
> +	enum hugetlb_level level;
> +	spinlock_t *ptl;
> +};

Do we need both shift + level? Maybe it's only meaningful for ARM, where the shift may not be directly calculated from level?

I'm wondering whether we can just maintain "shift" and then calculate "level" in realtime. It just reads a bit weird to have these two fields, and it is also a burden to most of the call sites, where shift and level exactly match..

> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +			  unsigned int shift, enum hugetlb_level level)

I'd think it's nicer to replace "populate" with something else, as populate is definitely a meaningful word in the vm world for "making something appear if it wasn't". Maybe hugetlb_pte_setup()?

Even one step back, on the naming of hugetlb_pte.. Sorry to comment on namings, especially on this one, I really don't like to do that normally.. but here hugetlb_pte only walks the sub-page level of pgtables, meanwhile it's not really a pte but an iterator. How about hugetlb_hgm_iter? "hgm" tells that it only walks sub-level, and "iter" tells that it is an iterator, being updated for each step downwards.

Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().

Take these comments with a grain of salt, and it never hurts to wait for a 2nd opinion before anything.

> +{
> +	WARN_ON_ONCE(!ptep);
> +	hpte->ptep = ptep;
> +	hpte->shift = shift;
> +	hpte->level = level;
> +	hpte->ptl = NULL;
> +}
[... size/mask/shift helpers elided ...]
> +static inline
> +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);

There're definitely a bunch of hpte->ptep WARN_ON_ONCE()s.. AFAIK the hugetlb_pte* will be set up once with a valid ptep, and then it should always be valid. I remember someone commented that these helpers look not useful, and I must confess I had the same feeling. But besides that, I'd rather drop all these WARN_ON_ONCE()s and only check the ptep when we init() the iterator/pte.

> +	return hpte->level;
> +}
[... copy helper and present_leaf declaration elided ...]
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +	BUG_ON(!hpte->ptep);

Another BUG_ON(); better be dropped too.

> +	if (hpte->ptl)
> +		return hpte->ptl;
> +	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

I'm curious whether we can always have hpte->ptl set for a valid hugetlb_pte. I think that means we'll need to also init the ptl in the init() fn of the iterator. Then it'll be clear which lock to take for each valid hugetlb_pte.

> +}
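[Editor's note: a minimal sketch of the "calculate level from shift" direction Peter suggests above. The numeric thresholds are assumed 4K-base-page values (PGDIR_SHIFT=39, PUD_SHIFT=30, PMD_SHIFT=21) and are not from this series; using ranges rather than exact matches would absorb arm64's CONT_PTE/CONT_PMD shifts into the PTE/PMD levels. As James notes below, folded levels make this mapping ambiguous in general.]

/* Hypothetical shift -> level mapping (assumption, not from the patch). */
static enum hugetlb_level shift_to_level(unsigned int shift)
{
	if (shift >= 39)	/* assumed PGDIR_SHIFT */
		return HUGETLB_LEVEL_PGD;
	if (shift >= 30)	/* assumed PUD_SHIFT */
		return HUGETLB_LEVEL_PUD;
	if (shift >= 21)	/* assumed PMD_SHIFT; also covers CONT_PMD */
		return HUGETLB_LEVEL_PMD;
	return HUGETLB_LEVEL_PTE;	/* PAGE_SHIFT up through CONT_PTE */
}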
On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> Do we need both shift + level? Maybe it's only meaningful for ARM, where the shift may not be directly calculated from level?
>
> I'm wondering whether we can just maintain "shift" and then calculate "level" in realtime. It just reads a bit weird to have these two fields, and it is also a burden to most of the call sites, where shift and level exactly match..

My main concern is interaction with folded levels. For example, if PUD_SIZE and PMD_SIZE are the same, we want to do something like this:

	pud = pud_offset(p4d, addr)
	pmd = pmd_offset(pud, addr)	/* this is just pmd = (pmd_t *) pud */
	pte = pte_offset(pmd, addr)

and I think we should avoid quietly skipping the folded level, which could happen:

	pud = pud_offset(p4d, addr)
	/*
	 * Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
	 * it is impossible to know that `pud` came from `pud_offset` and not
	 * `pmd_offset`. We must assume the deeper level so that we don't get
	 * stuck in a loop.
	 */
	pte = pte_offset(pud, addr)	/* pud is cast from (pud_t * -> pte_t * -> pmd_t *) */

Quietly dropping p*d_offset for folded levels is safe; it's just a cast that we're doing anyway. If you think this is fine, then I can remove `level`. It might also be that this is a non-issue and that there will never be a folded level underneath a hugepage level.

We could also change `ptep` to a union eventually (to clean up the "hugetlb casts everything to pte_t *" messiness), and having an explicit `level` as a tag for the union would be a nice help. In the same way: I like having `level` explicitly so that we know for sure where `ptep` came from.

I can try to reduce the burden at the callsites while keeping `level`: hpage_size_to_level() is really annoying to have everywhere.

> I'd think it's nicer to replace "populate" with something else [...] Maybe hugetlb_pte_setup()? [...] How about hugetlb_hgm_iter? [...] Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().

I think this is a great idea. :) Thank you! I'll make this change for v1 unless someone has a better suggestion.

> There're definitely a bunch of hpte->ptep WARN_ON_ONCE()s.. [...] I'd rather drop all these WARN_ON_ONCE()s and only check the ptep when we init() the iterator/pte.

The idea with these WARN_ON_ONCE()s is that it WARNs for the case that `hpte` was never populated/initialized, but I realize that we can't even rely on hpte->ptep == NULL. I'll remove the WARN_ON_ONCE()s, and I'll drop hugetlb_pte_shift and hugetlb_pte_level entirely. I'll keep the hugetlb_pte_{size,mask,copy,present_leaf} helpers as they are legitimately helpful.

> > +	BUG_ON(!hpte->ptep);
>
> Another BUG_ON(); better be dropped too.

Can do.

> > +	if (hpte->ptl)
> > +		return hpte->ptl;
> > +	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
>
> I'm curious whether we can always have hpte->ptl set for a valid hugetlb_pte. I think that means we'll need to also init the ptl in the init() fn of the iterator. Then it'll be clear which lock to take for each valid hugetlb_pte.

I can work on this for v1. Right now it's not very good: for 4K PTEs, we manually set ->ptl while walking. I'll make it so that ->ptl is always populated so the code is easier to read.

- James
On Wed, Nov 16, 2022 at 05:00:08PM -0800, James Houghton wrote:
> My main concern is interaction with folded levels. [...]
>
> We could also change `ptep` to a union eventually (to clean up the "hugetlb casts everything to pte_t *" messiness), and having an explicit `level` as a tag for the union would be a nice help. In the same way: I like having `level` explicitly so that we know for sure where `ptep` came from.

Makes sense.

> I can try to reduce the burden at the callsites while keeping `level`: hpage_size_to_level() is really annoying to have everywhere.

Yeah, this would be nice. From what I see, most callers have paired level/shift, so maybe we can make hugetlb_hgm_iter_init() take only one of them and calculate the other. Then there can also be a __hugetlb_hgm_iter_init() which takes both, used only in the few places where an explicit level/shift is necessary. hugetlb_hgm_iter_init() calls __hugetlb_hgm_iter_init().

Thanks,
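[Editor's note: a rough sketch of the init split Peter proposes, using the names from the thread. The bodies, the ptl parameter (per the locking discussion above), and the use of hpage_size_to_level() — a helper from elsewhere in this series — are assumptions, not the eventual v1 code.]

/* Explicit variant: takes both shift and level. */
static inline
void __hugetlb_hgm_iter_init(struct hugetlb_pte *hpte, pte_t *ptep,
			     unsigned int shift, enum hugetlb_level level,
			     spinlock_t *ptl)
{
	hpte->ptep = ptep;
	hpte->shift = shift;
	hpte->level = level;
	hpte->ptl = ptl;	/* always valid, per the discussion above */
}

/* Common case: shift and level agree, so derive level from shift. */
static inline
void hugetlb_hgm_iter_init(struct hugetlb_pte *hpte, pte_t *ptep,
			   unsigned int shift, spinlock_t *ptl)
{
	__hugetlb_hgm_iter_init(hpte, ptep, shift,
				hpage_size_to_level(1UL << shift), ptl);
}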
On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> After high-granularity mapping, page table entries for HugeTLB pages can be of any size/type. (For example, we can have a 1G page mapped with a mix of PMDs and PTEs.) This struct helps keep track of a HugeTLB PTE after we have done a page table walk.
[...]
> +enum hugetlb_level {
> +	HUGETLB_LEVEL_PTE = 1,
> +	/*
> +	 * We always include PMD, PUD, and P4D in this enum definition so that,
> +	 * when logged as an integer, we can easily tell which level it is.
> +	 */
> +	HUGETLB_LEVEL_PMD,
> +	HUGETLB_LEVEL_PUD,
> +	HUGETLB_LEVEL_P4D,
> +	HUGETLB_LEVEL_PGD,
> +};

Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

> +struct hugetlb_pte {
> +	pte_t *ptep;
> +	unsigned int shift;
> +	enum hugetlb_level level;

Is shift + level redundant? When would those diverge?

[...]
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +	BUG_ON(!hpte->ptep);

I think BUG_ON()s will be frowned upon. This function also doesn't really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if necessary.

> +	if (hpte->ptl)
> +		return hpte->ptl;
> +	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

I don't know if this fallback to huge_pte_lockptr() should be obvious to the reader. If not, a comment would help.

[...]
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)

I also don't know if this is obvious to other readers, but I'm quite confused that we pass both hugetlb_pte and pte_t here, especially when hpte has a pte_t inside of it. Maybe a comment would help.

[...]
On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <almasrymina@google.com> wrote:
>
> Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

Yeah, which is why shift and level aren't quite the same thing. Contiguous PMDs would be HUGETLB_LEVEL_PMD but have shift = CONT_PMD_SHIFT, whereas regular PMDs would have shift = PMD_SHIFT.

> Is shift + level redundant? When would those diverge?

Peter asked a very similar question. `shift` can be used to determine `level` if no levels are being folded. In the case of folded levels, you might have a single shift that corresponds to multiple "levels". That isn't necessarily a problem, as folding a level just means casting your p?d_t* differently, but I think it's good to be able to *know* that if the hugetlb_pte was populated with a pud_t*, we treat it like a pud_t* always. If `ptep` were instead a union, then `level` would be the tag. Perhaps it should be written that way.

> I think BUG_ON()s will be frowned upon. This function also doesn't really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if necessary.

Right. I'll remove this (and others that aren't really necessary). Peter's suggestion to just let the kernel take a #pf and crash (thereby logging more info) SGTM.

> I don't know if this fallback to huge_pte_lockptr() should be obvious to the reader. If not, a comment would help.

I'll clean this up a little for the next version. If something like this branch stays, I'll add a comment.

> I also don't know if this is obvious to other readers, but I'm quite confused that we pass both hugetlb_pte and pte_t here, especially when hpte has a pte_t inside of it. Maybe a comment would help.

It's possible for the value of the pte to change if we haven't locked the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

Thinking about this... we *do* store `shift`, which technically depends on the value of the PTE. If the PTE is pte_none, the true `shift` of the PTE is ambiguous, so we just provide what the user asked for. That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) and then UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed, because we merely check if the *first* PTE in the contiguous bunch is none/has changed.

So, in the case of a contiguous PTE where we *think* we're overwriting a bunch of none PTEs, we need to check that each PTE we're overwriting is still none while holding the PTL. That means that the PTL we use for cont PTEs and non-cont PTEs of the same level must be the same.

So for the next version, I'll:
- add some requirement that contiguous and non-contiguous PTEs on the same level must use the same PTL
- think up some kind of API like all_contig_ptes_none(), but it only really applies to arm64, so actually putting it in can wait. I'll at least put a comment in hugetlb_mcopy_atomic_pte and hugetlb_no_page (near the final huge_pte_none() and pte_same() checks).
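[Editor's note: James names all_contig_ptes_none() without defining it. A minimal sketch of what such an arm64-side check might look like is below; CONT_PTES and ptep_get() are existing arm64/kernel facilities, but the function itself is an assumption, not code from this series. Per the discussion, the caller would hold the (shared) PTL so the result cannot go stale.]

/* Hypothetical: with the PTL held, verify every PTE in a contiguous
 * run is still none before overwriting the whole run. */
static bool all_contig_ptes_none(pte_t *ptep)
{
	unsigned int i;

	for (i = 0; i < CONT_PTES; i++, ptep++)
		if (!pte_none(ptep_get(ptep)))
			return false;
	return true;
}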
On 12/09/22 11:02, James Houghton wrote:
> On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <almasrymina@google.com> wrote:
> > I also don't know if this is obvious to other readers, but I'm quite confused that we pass both hugetlb_pte and pte_t here, especially when hpte has a pte_t inside of it. Maybe a comment would help.
>
> It's possible for the value of the pte to change if we haven't locked the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

I had comments similar to Mina's and Peter's on other parts of this patch. Calling this without some type of locking is 'interesting'. I have not yet looked at the callers (without locking), but I assume such callers can handle stale results.

> Thinking about this... we *do* store `shift`, which technically depends on the value of the PTE. If the PTE is pte_none, the true `shift` of the PTE is ambiguous, so we just provide what the user asked for. That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) and then UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed, because we merely check if the *first* PTE in the contiguous bunch is none/has changed.

Right. Yuck!

> So, in the case of a contiguous PTE where we *think* we're overwriting a bunch of none PTEs, we need to check that each PTE we're overwriting is still none while holding the PTL. That means that the PTL we use for cont PTEs and non-cont PTEs of the same level must be the same.
>
> So for the next version, I'll:
> - add some requirement that contiguous and non-contiguous PTEs on the same level must use the same PTL
> - think up some kind of API like all_contig_ptes_none(), but it only really applies to arm64, so actually putting it in can wait. I'll at least put a comment in hugetlb_mcopy_atomic_pte and hugetlb_no_page (near the final huge_pte_none() and pte_same() checks).
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index db3ed6095b1c..d30322108b34 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -50,6 +50,75 @@ enum {
 	__NR_USED_SUBPAGE,
 };
 
+enum hugetlb_level {
+	HUGETLB_LEVEL_PTE = 1,
+	/*
+	 * We always include PMD, PUD, and P4D in this enum definition so that,
+	 * when logged as an integer, we can easily tell which level it is.
+	 */
+	HUGETLB_LEVEL_PMD,
+	HUGETLB_LEVEL_PUD,
+	HUGETLB_LEVEL_P4D,
+	HUGETLB_LEVEL_PGD,
+};
+
+struct hugetlb_pte {
+	pte_t *ptep;
+	unsigned int shift;
+	enum hugetlb_level level;
+	spinlock_t *ptl;
+};
+
+static inline
+void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
+			  unsigned int shift, enum hugetlb_level level)
+{
+	WARN_ON_ONCE(!ptep);
+	hpte->ptep = ptep;
+	hpte->shift = shift;
+	hpte->level = level;
+	hpte->ptl = NULL;
+}
+
+static inline
+unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return 1UL << hpte->shift;
+}
+
+static inline
+unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return ~(hugetlb_pte_size(hpte) - 1);
+}
+
+static inline
+unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return hpte->shift;
+}
+
+static inline
+enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return hpte->level;
+}
+
+static inline
+void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
+{
+	dest->ptep = src->ptep;
+	dest->shift = src->shift;
+	dest->level = src->level;
+	dest->ptl = src->ptl;
+}
+
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
@@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 	return ptl;
 }
 
+static inline
+spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+
+	BUG_ON(!hpte->ptep);
+	if (hpte->ptl)
+		return hpte->ptl;
+	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
+}
+
+static inline
+spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
 extern void __init hugetlb_cma_reserve(int order);
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef7662bd0068..a0e46d35dabc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
+{
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+
+	WARN_ON_ONCE(!hpte->ptep);
+	switch (hugetlb_pte_level(hpte)) {
+	case HUGETLB_LEVEL_PGD:
+		pgd = __pgd(pte_val(pte));
+		return pgd_present(pgd) && pgd_leaf(pgd);
+	case HUGETLB_LEVEL_P4D:
+		p4d = __p4d(pte_val(pte));
+		return p4d_present(p4d) && p4d_leaf(p4d);
+	case HUGETLB_LEVEL_PUD:
+		pud = __pud(pte_val(pte));
+		return pud_present(pud) && pud_leaf(pud);
+	case HUGETLB_LEVEL_PMD:
+		pmd = __pmd(pte_val(pte));
+		return pmd_present(pmd) && pmd_leaf(pmd);
+	case HUGETLB_LEVEL_PTE:
+		return pte_present(pte);
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
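[Editor's note: for context on how these helpers compose, here is a hypothetical caller (not part of the patch) following the lock-then-recheck pattern the thread discusses. It assumes some earlier walk initialized hpte; the function name and structure are illustrative only.]

/* Hypothetical caller: lock the hugetlb_pte, re-read it, and test
 * whether it is a present leaf at its level. */
static int example_read(struct mm_struct *mm, struct hugetlb_pte *hpte,
			unsigned long addr)
{
	spinlock_t *ptl = hugetlb_pte_lock(mm, hpte);
	pte_t pte = huge_ptep_get(hpte->ptep);	/* stable while ptl is held */
	int ret = 0;

	if (hugetlb_pte_present_leaf(hpte, pte))
		/* pte maps hugetlb_pte_size(hpte) bytes starting at
		 * addr & hugetlb_pte_mask(hpte). */
		ret = 1;

	spin_unlock(ptl);
	return ret;
}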