Message ID | 20230125015738.912924-2-zokeefe@google.com |
---|---|
State | New |
Headers |
Date: Tue, 24 Jan 2023 17:57:38 -0800
In-Reply-To: <20230125015738.912924-1-zokeefe@google.com>
References: <20230125015738.912924-1-zokeefe@google.com>
Message-ID: <20230125015738.912924-2-zokeefe@google.com>
Subject: [PATCH 2/2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
From: "Zach O'Keefe" <zokeefe@google.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>, Hugh Dickins <hughd@google.com>, Yang Shi <shy828301@gmail.com>, "Zach O'Keefe" <zokeefe@google.com>, stable@vger.kernel.org
X-Mailer: git-send-email 2.39.1.405.gd4c25cc71f-goog
List-ID: <linux-kernel.vger.kernel.org> |
Series |
[1/2] mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount
|
|
Commit Message
Zach O'Keefe
Jan. 25, 2023, 1:57 a.m. UTC
In commit 34488399fa08 ("mm/madvise: add file and shmem support to
MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
-       if (!pmd_present(pmde))
-               return SCAN_PMD_NULL;
+       if (pmd_none(pmde))
+               return SCAN_PMD_NONE;
This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE
might identify a pte-mapped hugepage, only to have khugepaged race in, free
the pte table, and clear the pmd. Such codepaths include:
A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
   already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
   mm/address.
In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to suitably identify that case separate
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc).
Regardless, the code is prepared to install a huge-pmd only when the existing
pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.
However, the commit introduces a logical hole; namely, that we've allowed
!none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds. One example that could leak through is a swap
entry. The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.
We want to put back the !pmd_present() check (below the pmd_none() check),
but need to be careful to deal with subtleties in pmd transitions and
treatments by various arch.
The issue is that __split_huge_pmd_locked() temporarily clears the present
bit (or otherwise marks the entry as invalid), but pmd_present()
and pmd_trans_huge() still need to return true while the pmd is in this
transitory state. For example, x86's pmd_present() also checks the
_PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64 also
checks a PMD_PRESENT_INVALID bit.
Covering all 4 cases for x86 (all checks done on the same pmd value):
1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
      is set.
      => huge-pmd
   b) We are currently racing with __split_huge_page(). The danger here
      is that we proceed as-if we have a huge-pmd, but really we are
      looking at a pte-mapping-pmd. So, what is the risk of this
      danger?
      The only relevant path is:
        madvise_collapse() -> collapse_pte_mapped_thp()
      Where we might just incorrectly report back "success", when really
      the memory isn't pmd-backed. This is fine, since split could
      happen immediately after (actually) successful madvise_collapse().
      So, it should be safe to just assume huge-pmd here.
2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is.
      => pte-table-mapping pmd (or PROT_NONE)
   b) devmap. This routine can be called immediately after
      unlocking/locking mmap_lock -- or called with no locks held (see
      khugepaged_scan_mm_slot()), so previous VMA checks have since been
      invalidated.
3) !pmd_present() && pmd_trans_huge()
   Not possible.
4) !pmd_present() && !pmd_trans_huge()
   Neither PRESENT nor PROTNONE set
   => not present
I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, loongarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since !pmd_present()
always takes failure path).
Also, add a comment above find_pmd_or_thp_or_none() to help future
travelers reason about the validity of the code; namely, the possible
mutations that might happen out from under us, depending on how
mmap_lock is held (if at all).
Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: stable@vger.kernel.org
---
Request that this be pulled into stable since it's theoretically
possible (though I have no reproducer) that while mmap_lock is dropped,
racing thp migration installs a pmd migration entry which then has a path to
be consumed, unchecked, by pte_offset_map().
---
mm/khugepaged.c | 8 ++++++++
1 file changed, 8 insertions(+)
Comments
Hi Zach,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230125015738.912924-2-zokeefe%40google.com
patch subject: [PATCH 2/2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
config: x86_64-randconfig-r025-20230123 (https://download.01.org/0day-ci/archive/20230125/202301252033.HoFIRXm4-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/6001eb9a8f1687a1d0b72831d991886106cac37b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
        git checkout 6001eb9a8f1687a1d0b72831d991886106cac37b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> mm/khugepaged.c:972:17: error: passing 'pmd_t **' to parameter of incompatible type 'pmd_t'
           if (pmd_devmap(pmd))
                          ^~~
   arch/x86/include/asm/pgtable.h:254:36: note: passing argument to parameter 'pmd' here
   static inline int pmd_devmap(pmd_t pmd)
                                      ^
   1 error generated.

vim +972 mm/khugepaged.c

   945
   946  /*
   947   * See pmd_trans_unstable() for how the result may change out from
   948   * underneath us, even if we hold mmap_lock in read.
   949   */
   950  static int find_pmd_or_thp_or_none(struct mm_struct *mm,
   951                                     unsigned long address,
   952                                     pmd_t **pmd)
   953  {
   954          pmd_t pmde;
   955
   956          *pmd = mm_find_pmd(mm, address);
   957          if (!*pmd)
   958                  return SCAN_PMD_NULL;
   959
   960          pmde = pmdp_get_lockless(*pmd);
   961
   962  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
   963          /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
   964          barrier();
   965  #endif
   966          if (pmd_none(pmde))
   967                  return SCAN_PMD_NONE;
   968          if (!pmd_present(pmde))
   969                  return SCAN_PMD_NULL;
   970          if (pmd_trans_huge(pmde))
   971                  return SCAN_PMD_MAPPED;
 > 972          if (pmd_devmap(pmd))
   973                  return SCAN_PMD_NULL;
   974          if (pmd_bad(pmde))
   975                  return SCAN_PMD_NULL;
   976          return SCAN_SUCCEED;
   977  }
   978
Hi Zach,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230125015738.912924-2-zokeefe%40google.com
patch subject: [PATCH 2/2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
config: x86_64-rhel-8.3-kselftests (https://download.01.org/0day-ci/archive/20230125/202301252110.hFYRsbrm-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/6001eb9a8f1687a1d0b72831d991886106cac37b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
        git checkout 6001eb9a8f1687a1d0b72831d991886106cac37b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/khugepaged.c: In function 'find_pmd_or_thp_or_none':
>> mm/khugepaged.c:972:24: error: incompatible type for argument 1 of 'pmd_devmap'
     972 |         if (pmd_devmap(pmd))
         |                        ^~~
         |                        |
         |                        pmd_t **
   In file included from include/linux/pgtable.h:6,
                    from include/linux/mm.h:29,
                    from mm/khugepaged.c:4:
   arch/x86/include/asm/pgtable.h:254:36: note: expected 'pmd_t' but argument is of type 'pmd_t **'
     254 | static inline int pmd_devmap(pmd_t pmd)
         |                              ~~~~~~^~~

vim +/pmd_devmap +972 mm/khugepaged.c

   945
   946  /*
   947   * See pmd_trans_unstable() for how the result may change out from
   948   * underneath us, even if we hold mmap_lock in read.
   949   */
   950  static int find_pmd_or_thp_or_none(struct mm_struct *mm,
   951                                     unsigned long address,
   952                                     pmd_t **pmd)
   953  {
   954          pmd_t pmde;
   955
   956          *pmd = mm_find_pmd(mm, address);
   957          if (!*pmd)
   958                  return SCAN_PMD_NULL;
   959
   960          pmde = pmdp_get_lockless(*pmd);
   961
   962  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
   963          /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
   964          barrier();
   965  #endif
   966          if (pmd_none(pmde))
   967                  return SCAN_PMD_NONE;
   968          if (!pmd_present(pmde))
   969                  return SCAN_PMD_NULL;
   970          if (pmd_trans_huge(pmde))
   971                  return SCAN_PMD_MAPPED;
 > 972          if (pmd_devmap(pmd))
   973                  return SCAN_PMD_NULL;
   974          if (pmd_bad(pmde))
   975                  return SCAN_PMD_NULL;
   976          return SCAN_SUCCEED;
   977  }
   978
Apologies here; shouldn't have overlooked the 4 line change. Will follow-up with a v2 here in a second.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa38cae240b9..7ea668bbea70 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -941,6 +941,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return SCAN_SUCCEED;
 }
 
+/*
+ * See pmd_trans_unstable() for how the result may change out from
+ * underneath us, even if we hold mmap_lock in read.
+ */
 static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 				   unsigned long address,
 				   pmd_t **pmd)
@@ -959,8 +963,12 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 #endif
 	if (pmd_none(pmde))
 		return SCAN_PMD_NONE;
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
+	if (pmd_devmap(pmd))
+		return SCAN_PMD_NULL;
 	if (pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;