Message ID | 20230602230552.350731-5-peterx@redhat.com |
---|---|
State | New |
Headers |
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: David Hildenbrand <david@redhat.com>, Alistair Popple <apopple@nvidia.com>, Andrew Morton <akpm@linux-foundation.org>, Andrea Arcangeli <aarcange@redhat.com>, "Kirill A . Shutemov" <kirill@shutemov.name>, Johannes Weiner <hannes@cmpxchg.org>, John Hubbard <jhubbard@nvidia.com>, Naoya Horiguchi <naoya.horiguchi@nec.com>, peterx@redhat.com, Muhammad Usama Anjum <usama.anjum@collabora.com>, Hugh Dickins <hughd@google.com>, Mike Rapoport <rppt@kernel.org>
Subject: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry
Date: Fri, 2 Jun 2023 19:05:52 -0400
Message-Id: <20230602230552.350731-5-peterx@redhat.com>
In-Reply-To: <20230602230552.350731-1-peterx@redhat.com>
References: <20230602230552.350731-1-peterx@redhat.com> |
Series | mm: Fix pmd_trans_unstable() call sites on retry |
Commit Message
Peter Xu
June 2, 2023, 11:05 p.m. UTC
For most of the page walk paths, it is logically better to retry the pmd
when the pmd_trans_unstable() race is hit. We can treat an unstable pmd as
a none pmd (per the comment above pmd_trans_unstable()), but in most cases
we are not even treating it as a none pmd. If we are going to fix these
call sites anyway, a retry is the most accurate fix.

I've gone over all the pmd_trans_unstable() special cases, and this patch
should cover all the remaining places where we should properly retry on an
unstable pmd. With ACTION_AGAIN, introduced in 2020, we can achieve that
easily.
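For reference, the retry itself is implemented by the generic page walker:
when a pmd_entry() callback sets walk->action = ACTION_AGAIN,
walk_pmd_range() in mm/pagewalk.c re-reads the same pmd entry instead of
advancing. A simplified sketch of that loop follows (error paths and hole
handling elided; not verbatim kernel code):

static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
                          struct mm_walk *walk)
{
        pmd_t *pmd = pmd_offset(pud, addr);
        unsigned long next;
        int err = 0;

        do {
again:
                next = pmd_addr_end(addr, end);
                if (pmd_none(*pmd))
                        continue;       /* treat as a hole */

                walk->action = ACTION_SUBTREE;
                if (walk->ops->pmd_entry)
                        err = walk->ops->pmd_entry(pmd, addr, next, walk);
                if (err)
                        break;

                /* The callback saw an unstable pmd: re-read this entry. */
                if (walk->action == ACTION_AGAIN)
                        goto again;

                /* ... otherwise descend to the pte level as requested ... */
        } while (pmd++, addr = next, addr != end);

        return err;
}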
These are the call sites that I think should be fixed with it:
*** fs/proc/task_mmu.c:
smaps_pte_range[634] if (pmd_trans_unstable(pmd))
clear_refs_pte_range[1194] if (pmd_trans_unstable(pmd))
pagemap_pmd_range[1542] if (pmd_trans_unstable(pmdp))
gather_pte_stats[1891] if (pmd_trans_unstable(pmd))
*** mm/memcontrol.c:
mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
*** mm/memory-failure.c:
hwpoison_pte_range[794] if (pmd_trans_unstable(pmdp))
*** mm/mempolicy.c:
queue_folios_pte_range[517] if (pmd_trans_unstable(pmd))
*** mm/madvise.c:
madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
madvise_free_pte_range[625] if (pmd_trans_unstable(pmd))
IIUC, most of these may not be a big issue even without a retry, because
they are already not strict (smaps, pte_stats, MADV_COLD, ...): the worst
case is an inaccurate statistic, or one fewer 2M chunk being colded. But
some of them can have a functional error without the retry, afaiu. E.g.,
in pagemap we can have the output buffer shifted over the unstable pmd
range, so IIUC the pagemap result can be wrong.
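To make the pagemap case concrete: /proc/pid/pagemap is a flat array of
one 64-bit entry per virtual page, and userspace derives the file offset
directly from the virtual address, so the kernel must emit exactly one
entry per page walked. A hypothetical userspace reader (illustrative only,
not part of the patch):

/* Read pagemap entries for [start, start + n * page_size). The kernel
 * fills buf[] sequentially while walking the range; if the walker skips
 * an unstable pmd without emitting entries, every later entry in buf[]
 * is shifted and attributed to the wrong virtual address. */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t read_pagemap_range(int fd, unsigned long start,
                                  uint64_t *buf, size_t n)
{
        off_t off = (off_t)(start / getpagesize()) * sizeof(*buf);

        return pread(fd, buf, n * sizeof(*buf), off);
}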
Meanwhile, these call sites all look fine and don't need any change
(mincore, for one, already falls back to treating the range as unmapped;
see the abridged snippet after this list):
*** include/linux/pgtable.h:
pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
*** mm/gup.c:
follow_pmd_mask[695] if (pmd_trans_unstable(pmd))
*** mm/mapping_dirty_helpers.c:
wp_clean_pmd_entry[131] if (!pmd_trans_unstable(&pmdval))
*** mm/memory.c:
do_anonymous_page[4060] if (unlikely(pmd_trans_unstable(vmf->pmd)))
*** mm/migrate_device.c:
migrate_vma_insert_page[616] if (unlikely(pmd_trans_unstable(pmdp)))
*** mm/mincore.c:
mincore_pte_range[116] if (pmd_trans_unstable(pmd)) {
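As an example of why this second group needs no change, mincore_pte_range()
(abridged from mm/mincore.c of this era) already reports an unstable pmd as
an unmapped range instead of walking it, so there is nothing to retry:

        /* mm/mincore.c, abridged: an unstable pmd is reported as
         * unmapped instead of being walked at the pte level. */
        if (pmd_trans_unstable(pmd)) {
                __mincore_unmapped_range(addr, end, vma, vec);
                goto out;
        }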
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/proc/task_mmu.c  | 17 +++++++++++++----
 mm/madvise.c        |  8 ++++++--
 mm/memcontrol.c     |  8 ++++++--
 mm/memory-failure.c |  4 +++-
 mm/mempolicy.c      |  4 +++-
 5 files changed, 31 insertions(+), 10 deletions(-)
Comments
On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>
> [...]
>
> @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>                 return err;
>         }
>
> -       if (pmd_trans_unstable(pmdp))
> +       if (pmd_trans_unstable(pmdp)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }

Had a quick look at the pagemap code, and I agree with your analysis:
"returning 0" may mess up pagemap, and a retry should be fine. But I'm
wondering whether we should just fill in empty entries. Anyway, I don't
have a strong opinion on this, I'm just a little concerned about a
potential indefinite retry.

> [...]
>
> @@ -6021,8 +6021,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>                 return 0;
>         }
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }

Either retrying or keeping it as is is fine with me. I'm not aware of
anyone complaining about noticeably inaccurate charges due to this. But we
may have a potential indefinite retry anyway, so if you don't want to risk
that, it may be better to just keep it as is.

> [...]
>
> @@ -791,8 +791,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
>                 goto out;
>         }
>
> -       if (pmd_trans_unstable(pmdp))
> +       if (pmd_trans_unstable(pmdp)) {
> +               walk->action = ACTION_AGAIN;
>                 goto out;
> +       }

This one may be worth retrying, IMHO.

> [...]
>
> @@ -514,8 +514,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>         if (ptl)
>                 return queue_folios_pmd(pmd, ptl, addr, end, walk);
>
> -       if (pmd_trans_unstable(pmd))
> +       if (pmd_trans_unstable(pmd)) {
> +               walk->action = ACTION_AGAIN;
>                 return 0;
> +       }

Either retrying or keeping it as is is fine.
On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> Had a quick look at the pagemap code, and I agree with your analysis:
> "returning 0" may mess up pagemap, and a retry should be fine. But I'm
> wondering whether we should just fill in empty entries. Anyway, I don't
> have a strong opinion on this, I'm just a little concerned about a
> potential indefinite retry.

Yes, a none pte is still an option. But if we're going to fix this anyway,
it seems better to fix it with the accurate new mechanism that popped up,
and it's even less change (just apply walk->action rather than doing
different things in different call sites).

I see that you worry about a deadloop over this, so I hope to discuss it
all together here.

Unlike normal checks, a pmd_trans_unstable() hit means something must have
changed in the very recent past, and it should never fire if nothing
changed concurrently under us. So it's not an "if (flag == true)" style
check, which would be much more likely to loop.

If we look at the places I didn't touch, most of them suggest a retry in
one form or another. So if there's a worry, this would also not be the
first place to do a retry (and for such an "unstable" API, retrying until
it's stable is really the most natural thing to do).

So in general, it seems to me that if we deadloop over
pmd_trans_unstable() for whatever reason, then something more seriously
wrong must have happened elsewhere.

Thanks,
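For context on why an unstable pmd implies a recent concurrent change:
pmd_trans_unstable() (include/linux/pgtable.h in kernels of this era)
reduces to roughly the following; the exact helper names differ slightly
across versions:

static inline int pmd_trans_unstable(pmd_t *pmd)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        /* Read the pmd once, locklessly; a racing THP collapse/split
         * can change it under us at any time. */
        pmd_t pmdval = pmdp_get_lockless(pmd);

        /* None, huge, or non-present (e.g. a THP migration entry):
         * not a stable pte table, so the caller must not walk ptes
         * under it. */
        if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
            !pmd_present(pmdval))
                return 1;
        if (unlikely(pmd_bad(pmdval))) {
                pmd_clear_bad(pmd);
                return 1;
        }
        return 0;
#else
        return 0;       /* without THP a populated pmd cannot change under us */
#endif
}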
On Mon, Jun 5, 2023 at 12:20 PM Peter Xu <peterx@redhat.com> wrote:
>
> [...]
>
> If we look at the places I didn't touch, most of them suggest a retry in
> one form or another. So if there's a worry, this would also not be the
> first place to do a retry (and for such an "unstable" API, retrying until
> it's stable is really the most natural thing to do).

IIUC, other than do_anonymous_page(), which does suggest a retry (of the
page fault), the others may not. For example:
- follow_pmd_mask: returns -EBUSY
- wp_clean_pmd_entry: actually just retries the pmd_none case, but the
  pagewalk code handles pmd_none by skipping it, so it basically retries
  once
- mincore_pte_range: treats it as an unmapped range by calling
  __mincore_unmapped_range

Anyway, I really don't have a strong opinion on this. I may just be
over-concerned. I just thought: if nobody cares whether the result is
accurate or not, why do we bother fixing those cases?
On Tue, Jun 06, 2023 at 12:12:03PM -0700, Yang Shi wrote:
> IIUC, other than do_anonymous_page(), which does suggest a retry (of the
> page fault), the others may not. For example:
> - follow_pmd_mask: returns -EBUSY

I assumed -EBUSY means that if the caller still needs the page, it'll just
need to retry. It's actually a very rare errno to return from follow
page... E.g., some gup callers may not even be able to handle -EBUSY,
afaiu, and neither does the gup core (__get_user_pages); afaiu it'll just
be forwarded upwards. My bet is that it's just very rare, and only
possible with FOLL_SPLIT_PMD for now.

I had a quick look: at least kvm handles -EBUSY as a real fault
(hva_to_pfn, where it translates that -EBUSY into KVM_PFN_ERR_FAULT), so I
think it would crash the hypervisor directly if it were ever returned from
gup one day (which cannot happen today without FOLL_SPLIT_PMD)..

> - wp_clean_pmd_entry: actually just retries the pmd_none case, but the
>   pagewalk code handles pmd_none by skipping it, so it basically retries
>   once

This one is very special, IMHO. pmd_trans_unstable() should in most cases
be used after someone has checked the pmd value before walking the ptes. I
have a feeling it's kind of abused there, to check whether it's a huge pmd
in whatever format.. IMHO it should just use the other pmd_*() helpers,
but I won't touch it in this series.

> - mincore_pte_range: treats it as an unmapped range by calling
>   __mincore_unmapped_range

Correct. There is also pmd_devmap_trans_unstable(), called in 3 call
sites:

pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
filemap_map_pmd[3423]           if (pmd_devmap_trans_unstable(vmf->pmd)) {
finish_fault[4390]              if (pmd_devmap_trans_unstable(vmf->pmd))
handle_pte_fault[4922]          if (pmd_devmap_trans_unstable(vmf->pmd))

They all suggest a retry on the page fault, afaiu.

So indeed not all of them retry, but I doubt whether the ones that don't
are doing the best thing. Again, I'll leave that out of this series.

> Anyway, I really don't have a strong opinion on this. I may just be
> over-concerned. I just thought: if nobody cares whether the result is
> accurate or not, why do we bother fixing those cases?

Because we're already at it, and it's easier and cleaner to do so? :)

I would say that if we're fixing this, the most important ones are the
pagemap and memcg paths so far. But since I'm at it and checked all of
them anyway, I figured I should just make everything do the right thing,
and in the same pattern, when handling an unstable pmd. Especially since
everything this patch touches uses the walk_page*() routines (I left the
special ones to the first patches), it's quite straightforward, IMHO, to
switch them all to ACTION_AGAIN.

Thanks,
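The "retry on the page fault" shape mentioned above is worth spelling out:
in handle_pte_fault() and friends (mm/memory.c of this era, abridged),
returning 0 on an unstable pmd reports the fault as handled without
installing a pte, so the faulting instruction simply re-executes and the
whole fault is retried:

        /* mm/memory.c, abridged: "succeed" without installing a pte;
         * the CPU re-faults on the same address, which is effectively
         * a retry. */
        if (pmd_devmap_trans_unstable(vmf->pmd))
                return 0;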
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6259dd432eeb..823eaba5c6bf 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		goto out;
+	}
+
 	/*
 	 * The mmap_lock held all the way back in m_start() is what
 	 * keeps khugepaged out of here and from collapsing things
@@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		return err;
 	}
 
-	if (pmd_trans_unstable(pmdp))
+	if (pmd_trans_unstable(pmdp)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 	/*
@@ -1888,8 +1895,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 #endif
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
diff --git a/mm/madvise.c b/mm/madvise.c
index 78cd12581628..0fd81712022c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -424,8 +424,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 
 regular_folio:
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 #endif
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -626,8 +628,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 		goto next;
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ee433be4c3b..15e50f033e41 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6021,8 +6021,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (get_mctgt_type(vma, addr, *pte, NULL))
@@ -6241,8 +6243,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 retry:
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 004a02f44271..c97fb2b7ab4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -791,8 +791,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmdp))
+	if (pmd_trans_unstable(pmdp)) {
+		walk->action = ACTION_AGAIN;
 		goto out;
+	}
 
 	mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
 						addr, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f06ca8c18e62..af8907b4aad1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -514,8 +514,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 	if (ptl)
 		return queue_folios_pmd(pmd, ptl, addr, end, walk);
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd)) {
+		walk->action = ACTION_AGAIN;
 		return 0;
+	}
 
 	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {