Message ID | 20230517160948.811355-2-jiaqiyan@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp1255460vqo; Wed, 17 May 2023 09:14:51 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ43MpoCHiSa23lYAi1KiwLuOo2a6PzfwNhrbqzz41+gk2an1xWfdVUz4hCKAqM7DakLooU3 X-Received: by 2002:a05:6a20:4411:b0:101:2ad0:134c with SMTP id ce17-20020a056a20441100b001012ad0134cmr40691956pzb.45.1684340091556; Wed, 17 May 2023 09:14:51 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684340091; cv=none; d=google.com; s=arc-20160816; b=eCRHvmFuBoYjgMVfVkYGTTdin6bE+dksLEQPFuuqiOvqV4ylnvmFmzZOP//4nhFZLy u4DbAsN9RgwDu7ej10eF0/swXbJmxBWuR9/ddhS8aYqrwtQmsGNclO/O9PcXULWx2WbB roiRbl3FZOYfJ7BEor/GGRRikkG8YP8rrbOhJKqCez/+E36VDQVGrLuVH8i7+D2Ps8dE WTPPTJuMPMTRHnvLbShH+3iUCCzPwy7EhLCY134qb/2lvlmLGKnqX2KxzSHUehy4xtUg F2yHR6r68EnjI2mSiaG/ZVAUG2ZI9+AnpsbIk3qrWHwPhofBu/uTbuSbkqUaAANXAu1H IX6w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=woRAdaWYV8FD8vab19hJA4MPyBeuK73rImQoSJfr/I0=; b=HS6WYS77LvXVbJJssDPEaA01160yBD4Q+9X/Ckl0gLYhtpIdVwO99TRk8MafPhZq/8 eWc9NYdUeiJccb1K54oI+gA7Gooosgq//cXWVlwyAl7T8Yu15OM4z2P9k1G3uQDXC8p4 gs71yVD/Ch9FJjTHHW3Y6HWfKh4NNpDnlh4PYvPwi+hQybeNqG1/5P3VRtKHPAxp5Grg G5h0jhjzQvzZuyqlQ9tc+L1A5rcBTw3cQXLob5qxWQK6MGnOZ1yYu0H7jiQH4RkJ0zxL zSGeLBf3PuvNa1/bkf/oqo+6ZDr0whS3p3V1LLTKmT5mqlzFJSNz2Fp3+rOnRtPsnt99 UBzw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b="YE1b/zym"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. 
[2620:137:e000::1:20]) by mx.google.com with ESMTP id f26-20020a63755a000000b0051b48085aaasi21779356pgn.858.2023.05.17.09.14.39; Wed, 17 May 2023 09:14:51 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b="YE1b/zym"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231891AbjEQQKH (ORCPT <rfc822;abdi.embedded@gmail.com> + 99 others); Wed, 17 May 2023 12:10:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37876 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231331AbjEQQKC (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 17 May 2023 12:10:02 -0400 Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F1D24268C for <linux-kernel@vger.kernel.org>; Wed, 17 May 2023 09:10:01 -0700 (PDT) Received: by mail-pl1-x64a.google.com with SMTP id d9443c01a7336-1ae515ff6a9so976625ad.0 for <linux-kernel@vger.kernel.org>; Wed, 17 May 2023 09:10:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1684339801; x=1686931801; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=woRAdaWYV8FD8vab19hJA4MPyBeuK73rImQoSJfr/I0=; b=YE1b/zymZ9vc5qJrU2yZ6Y6Sk4upJCBRZPNMxGSUkSJ4NmR9re5sc7FtxyFCA6op5Z NYFItdPqItmJDGSzZeWAH3tLZk1Wm8Afz5r3LdakqUg7OhhCWIIdGn7UMn0FZjneF7a+ bisIrHgG+Aut0UuT0TMLzqNQqD7aRu51rTrV0uDZzw1l+caeFdr0GilMHPWzBABPv15M X4eJsAFnhsiuhrdvQNPjyZTVUft46X+7oX5ruBdVoJ1jwWqEudjOGdVLWk0A5g6E9K6h emxl51IemdopSKTESt1Ud6k+QmEevHfSkIuOBN+uVG3cweipz2hw8QLMUNEJwvHVbkK9 kVdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1684339801; x=1686931801; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=woRAdaWYV8FD8vab19hJA4MPyBeuK73rImQoSJfr/I0=; b=V7I39NxV/JDfIKRP7L5B8O6aKctKzU03tqiUZpReGw4Ttpwm2YMXKTDPvwReIitHAn bnvsvjKZITUUyDRtw4fJ1lZeD5KyzqWpN1Y2IQMU6zEY3XeLDmswGwfuzj5dGWdnWU/8 FKyolGnQB9LprcjQifwYdFwr3/xBYGt2/G24K76ysGTYw2LCyqEr5g3ZQlu2NnqwooMY C0lC7E4VUL5fHmXrvgbVo2LVdE4lscbihOgN0IgTkPe/tQTYEYWS+VyFVPQAIploebub ZzKMX1egvfOoxNc6+hdkkmQGVmRFUpMUmkO5+fsJPsVnskziBaY9bENrWvIyqyhFVez+ iTbw== X-Gm-Message-State: AC+VfDw8L3KXjrqtayPVq7bob6eEa65tn6rXr+QCVPFtMIuj+xFvPuas 7wDz9d3DP+kfa5f7jOkFi/92digLXIcCvA== X-Received: from yjq3.c.googlers.com ([fda3:e722:ac3:cc00:24:72f4:c0a8:272f]) (user=jiaqiyan job=sendgmr) by 2002:a17:903:334f:b0:1ae:531f:366a with SMTP id ka15-20020a170903334f00b001ae531f366amr637380plb.5.1684339801269; Wed, 17 May 2023 09:10:01 -0700 (PDT) Date: Wed, 17 May 2023 16:09:46 +0000 In-Reply-To: <20230517160948.811355-1-jiaqiyan@google.com> Mime-Version: 1.0 References: <20230517160948.811355-1-jiaqiyan@google.com> X-Mailer: git-send-email 2.40.1.606.ga4b1b128d6-goog Message-ID: <20230517160948.811355-2-jiaqiyan@google.com> Subject: [PATCH v1 1/3] mm/hwpoison: find subpage in hugetlb HWPOISON list From: Jiaqi Yan <jiaqiyan@google.com> To: 
mike.kravetz@oracle.com, songmuchun@bytedance.com, naoya.horiguchi@nec.com, shy828301@gmail.com, linmiaohe@huawei.com Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, duenwen@google.com, axelrasmussen@google.com, jthoughton@google.com, Jiaqi Yan <jiaqiyan@google.com> Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1766158595691867385?= X-GMAIL-MSGID: =?utf-8?q?1766158595691867385?= |
Series | Improve hugetlbfs read on HWPOISON hugepages |
Commit Message
Jiaqi Yan
May 17, 2023, 4:09 p.m. UTC
Adds the functionality to search for a subpage's corresponding raw_hwp_page
in a hugetlb page's HWPOISON list. This functionality can also tell whether a
subpage is a raw HWPOISON page.
Exports this functionality to be used immediately in the read operation
for hugetlbfs.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
include/linux/mm.h | 23 +++++++++++++++++++++++
mm/memory-failure.c | 26 ++++++++++++++++----------
2 files changed, 39 insertions(+), 10 deletions(-)
Comments
On 05/17/23 16:09, Jiaqi Yan wrote: > Adds the functionality to search a subpage's corresponding raw_hwp_page > in hugetlb page's HWPOISON list. This functionality can also tell if a > subpage is a raw HWPOISON page. > > Exports this functionality to be immediately used in the read operation > for hugetlbfs. > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > --- > include/linux/mm.h | 23 +++++++++++++++++++++++ > mm/memory-failure.c | 26 ++++++++++++++++---------- > 2 files changed, 39 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 27ce77080c79..f191a4119719 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h Any reason why you decided to add the following to linux/mm.h instead of linux/hugetlb.h? Since it is hugetlb specific I would have thought hugetlb.h was more appropriate. > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > */ > extern const struct attribute_group memory_failure_attr_group; > > +#ifdef CONFIG_HUGETLB_PAGE > +/* > + * Struct raw_hwp_page represents information about "raw error page", > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > + */ > +struct raw_hwp_page { > + struct llist_node node; > + struct page *page; > +}; > + > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > +{ > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > +} > + > +/* > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > + */ > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > + struct page *subpage); > +#endif > + > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > extern void clear_huge_page(struct page *page, > unsigned long addr_hint, > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 5b663eca1f29..c49e6c2d1f07 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > #endif /* CONFIG_FS_DAX */ > > #ifdef CONFIG_HUGETLB_PAGE > -/* > - * Struct raw_hwp_page represents information about "raw error page", > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > - */ > -struct raw_hwp_page { > - struct llist_node node; > - struct page *page; > -}; > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > + struct page *subpage) > { > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > + struct llist_node *t, *tnode; > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > + struct raw_hwp_page *hwp_page = NULL; > + struct raw_hwp_page *p; > + > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a raw_hwp_list. This is indicated by the hugetlb page specific flag RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). Looks like this routine does not consider that case. Seems like it should always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() is true?
On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 05/17/23 16:09, Jiaqi Yan wrote: > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > subpage is a raw HWPOISON page. > > > > Exports this functionality to be immediately used in the read operation > > for hugetlbfs. > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > --- > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 27ce77080c79..f191a4119719 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > Any reason why you decided to add the following to linux/mm.h instead of > linux/hugetlb.h? Since it is hugetlb specific I would have thought > hugetlb.h was more appropriate. > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > */ > > extern const struct attribute_group memory_failure_attr_group; > > > > +#ifdef CONFIG_HUGETLB_PAGE > > +/* > > + * Struct raw_hwp_page represents information about "raw error page", > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > + */ > > +struct raw_hwp_page { > > + struct llist_node node; > > + struct page *page; > > +}; > > + > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > +{ > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > +} > > + > > +/* > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > + */ > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > + struct page *subpage); > > +#endif > > + > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > extern void clear_huge_page(struct page *page, > > unsigned long addr_hint, > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index 5b663eca1f29..c49e6c2d1f07 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > #endif /* CONFIG_FS_DAX */ > > > > #ifdef CONFIG_HUGETLB_PAGE > > -/* > > - * Struct raw_hwp_page represents information about "raw error page", > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > - */ > > -struct raw_hwp_page { > > - struct llist_node node; > > - struct page *page; > > -}; > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > + struct page *subpage) > > { > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > + struct llist_node *t, *tnode; > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > + struct raw_hwp_page *hwp_page = NULL; > > + struct raw_hwp_page *p; > > + > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > raw_hwp_list. This is indicated by the hugetlb page specific flag > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > Looks like this routine does not consider that case. Seems like it should > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > is true? Thanks for catching this. I wonder should this routine consider RawHwpUnreliable or should the caller do. 
find_raw_hwp_page now returns a raw_hwp_page* in the llist entry to the caller (valid at the moment), but once RawHwpUnreliable is set, all the raw_hwp_page entries in the llist will be kfree()d, and the returned value becomes a dangling pointer to the caller (if the caller holds that pointer long enough). Maybe returning a bool would be safer to the caller? If the routine returns bool, then checking RawHwpUnreliable can definitely be within the routine. Another option is, this routine simply does one thing: find a raw_hwp_page in raw_hwp_list for a subpage. But the caller needs to 1) test RawHwpUnreliable before calling into the routine, and 2) test RawHwpUnreliable before accessing the returned raw_hwp_page*. I think the 2nd option will be error-prone and the 1st option is a better one. Maybe I am over-thinking. What do you think? > -- > Mike Kravetz > > > + p = container_of(tnode, struct raw_hwp_page, node); > > + if (subpage == p->page) { > > + hwp_page = p; > > + break; > > + } > > + } > > + > > + return hwp_page; > > } > > > > static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) > > -- > > 2.40.1.606.ga4b1b128d6-goog > >
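For reference, the bool-returning option floated above could look roughly like the sketch below. It is only an illustration: the helper name is_raw_hwp_subpage() is made up here, the llist walk is taken from the patch, and no synchronization is shown (that question is taken up in the following messages).

```c
/*
 * Sketch of the bool-returning variant (hypothetical helper name):
 * callers never receive a raw_hwp_page pointer that may later be
 * kfree()d, and the RawHwpUnreliable check lives inside the routine.
 */
static bool is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
{
	struct llist_node *t, *tnode;
	struct raw_hwp_page *p;

	/*
	 * If the raw error list could not be maintained, conservatively
	 * report every subpage of this hugetlb folio as poisoned.
	 */
	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
		return true;

	llist_for_each_safe(tnode, t, raw_hwp_list_head(folio)->first) {
		p = container_of(tnode, struct raw_hwp_page, node);
		if (subpage == p->page)
			return true;
	}

	return false;
}
```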
On 05/19/23 13:54, Jiaqi Yan wrote: > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > subpage is a raw HWPOISON page. > > > > > > Exports this functionality to be immediately used in the read operation > > > for hugetlbfs. > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > --- > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 27ce77080c79..f191a4119719 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > > Any reason why you decided to add the following to linux/mm.h instead of > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > hugetlb.h was more appropriate. > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > */ > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > +/* > > > + * Struct raw_hwp_page represents information about "raw error page", > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > + */ > > > +struct raw_hwp_page { > > > + struct llist_node node; > > > + struct page *page; > > > +}; > > > + > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > +{ > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > +} > > > + > > > +/* > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > + */ > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > + struct page *subpage); > > > +#endif > > > + > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > extern void clear_huge_page(struct page *page, > > > unsigned long addr_hint, > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > --- a/mm/memory-failure.c > > > +++ b/mm/memory-failure.c > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > #endif /* CONFIG_FS_DAX */ > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > -/* > > > - * Struct raw_hwp_page represents information about "raw error page", > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > - */ > > > -struct raw_hwp_page { > > > - struct llist_node node; > > > - struct page *page; > > > -}; > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > + struct page *subpage) > > > { > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > + struct llist_node *t, *tnode; > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > + struct raw_hwp_page *hwp_page = NULL; > > > + struct raw_hwp_page *p; > > > + > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > Looks like this routine does not consider that case. 
Seems like it should > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > is true? > > Thanks for catching this. I wonder should this routine consider > RawHwpUnreliable or should the caller do. > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > caller (valid one at the moment), but once RawHwpUnreliable is set, > all the raw_hwp_page in the llist will be kfree(), and the returned > value becomes dangling pointer to caller (if the caller holds that > caller long enough). Maybe returning a bool would be safer to the > caller? If the routine returns bool, then checking RawHwpUnreliable > can definitely be within the routine. I think the check for RawHwpUnreliable should be within this routine. Looking closer at the code, I do not see any way to synchronize this. It looks like manipulation in the memory-failure code would be synchronized via the mf_mutex. However, I do not see how traversal and freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio is synchronized against memory-failure code modifying the list. Naoya, can you provide some thoughts? > > Another option is, this routine simply doesn one thing: find a > raw_hwp_page in raw_hwp_list for a subpage. But the caller needs to 1) > test RawHwpUnreliable before calls into the routine, and 2) test > RawHwpUnreliable before access returned raw_hwp_page*. I think 2nd > option will be error-prone and the 1st option is a better one. > > Maybe I am over-thinking. What do you think? I think racing code accessing the raw_hwp_list is very unlikely. However, it is possible and should be considered.
On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > On 05/19/23 13:54, Jiaqi Yan wrote: > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > subpage is a raw HWPOISON page. > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > for hugetlbfs. > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > --- > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > index 27ce77080c79..f191a4119719 100644 > > > > --- a/include/linux/mm.h > > > > +++ b/include/linux/mm.h > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > hugetlb.h was more appropriate. > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > */ > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > +/* > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > + */ > > > > +struct raw_hwp_page { > > > > + struct llist_node node; > > > > + struct page *page; > > > > +}; > > > > + > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > +{ > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > +} > > > > + > > > > +/* > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > + */ > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > + struct page *subpage); > > > > +#endif > > > > + > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > extern void clear_huge_page(struct page *page, > > > > unsigned long addr_hint, > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > --- a/mm/memory-failure.c > > > > +++ b/mm/memory-failure.c > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > -/* > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > - */ > > > > -struct raw_hwp_page { > > > > - struct llist_node node; > > > > - struct page *page; > > > > -}; > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > + struct page *subpage) > > > > { > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > + struct llist_node *t, *tnode; > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > + struct raw_hwp_page *p; > > > > + > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > Looks like this routine does not consider that case. Seems like it should > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > is true? > > > > Thanks for catching this. I wonder should this routine consider > > RawHwpUnreliable or should the caller do. > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > all the raw_hwp_page in the llist will be kfree(), and the returned > > value becomes dangling pointer to caller (if the caller holds that > > caller long enough). Maybe returning a bool would be safer to the > > caller? If the routine returns bool, then checking RawHwpUnreliable > > can definitely be within the routine. > > I think the check for RawHwpUnreliable should be within this routine. > Looking closer at the code, I do not see any way to synchronize this. > It looks like manipulation in the memory-failure code would be > synchronized via the mf_mutex. However, I do not see how traversal and > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > is synchronized against memory-failure code modifying the list. > > Naoya, can you provide some thoughts? Thanks for elaborating the issue. I think that making find_raw_hwp_page() and folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), already calls it within mf_mutex, so some wrapper might be needed to implement calling path from __update_and_free_hugetlb_folio() to take mf_mutex. It might be a concern that mf_mutex is a big lock to cover overall hwpoison subsystem, but I think that the impact is not so big if the changed code paths take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock to synchronize accesses to the raw_hwp_list could be possible, but currently __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without folio_lock, so this approach needs update on locking rule and it sounds more error-prone to me. Thanks, Naoya Horiguchi > > > > > Another option is, this routine simply doesn one thing: find a > > raw_hwp_page in raw_hwp_list for a subpage. But the caller needs to 1) > > test RawHwpUnreliable before calls into the routine, and 2) test > > RawHwpUnreliable before access returned raw_hwp_page*. I think 2nd > > option will be error-prone and the 1st option is a better one. > > > > Maybe I am over-thinking. What do you think? > > I think racing code accessing the raw_hwp_list is very unlikely. > However, it is possible and should be considered. 
> -- > Mike Kravetz > > > > > > -- > > > Mike Kravetz > > > > > > > + p = container_of(tnode, struct raw_hwp_page, node); > > > > + if (subpage == p->page) { > > > > + hwp_page = p; > > > > + break; > > > > + } > > > > + } > > > > + > > > > + return hwp_page; > > > > } > > > > > > > > static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) > > > > -- > > > > 2.40.1.606.ga4b1b128d6-goog > > > >
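A rough sketch of the wrapper Naoya describes, assuming it lives in mm/memory-failure.c next to mf_mutex; the function name and the hwpoison pre-check are invented here. The only point is that the raw_hwp_list teardown reached from __update_and_free_hugetlb_folio() would run under the same mf_mutex that try_memory_failure_hugetlb() already takes.

```c
/*
 * Hypothetical wrapper in mm/memory-failure.c: lets the hugetlb freeing
 * path clear the raw_hwp_list under mf_mutex, serializing it against the
 * memory-failure code that modifies the same list.
 */
void folio_clear_hugetlb_hwpoison_locked(struct folio *folio)
{
	/* Cheap check first so frees of healthy hugepages skip mf_mutex. */
	if (!folio_test_hwpoison(folio))
		return;

	mutex_lock(&mf_mutex);
	folio_clear_hugetlb_hwpoison(folio);
	mutex_unlock(&mf_mutex);
}
```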
On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > for hugetlbfs. > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > --- > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > --- a/include/linux/mm.h > > > > > +++ b/include/linux/mm.h > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > hugetlb.h was more appropriate. > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > */ > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > +/* > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > + */ > > > > > +struct raw_hwp_page { > > > > > + struct llist_node node; > > > > > + struct page *page; > > > > > +}; > > > > > + > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > +{ > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > +} > > > > > + > > > > > +/* > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > > + */ > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > + struct page *subpage); > > > > > +#endif > > > > > + > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > extern void clear_huge_page(struct page *page, > > > > > unsigned long addr_hint, > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > --- a/mm/memory-failure.c > > > > > +++ b/mm/memory-failure.c > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > -/* > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > > - */ > > > > > -struct raw_hwp_page { > > > > > - struct llist_node node; > > > > > - struct page *page; > > > > > -}; > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > + struct page *subpage) > > > > > { > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > + struct llist_node *t, *tnode; > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > + struct raw_hwp_page *p; > > > > > + > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > is true? > > > > > > Thanks for catching this. I wonder should this routine consider > > > RawHwpUnreliable or should the caller do. > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > value becomes dangling pointer to caller (if the caller holds that > > > caller long enough). Maybe returning a bool would be safer to the > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > can definitely be within the routine. > > > > I think the check for RawHwpUnreliable should be within this routine. > > Looking closer at the code, I do not see any way to synchronize this. > > It looks like manipulation in the memory-failure code would be > > synchronized via the mf_mutex. However, I do not see how traversal and > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > is synchronized against memory-failure code modifying the list. > > > > Naoya, can you provide some thoughts? > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > already calls it within mf_mutex, so some wrapper might be needed to implement > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > subsystem, but I think that the impact is not so big if the changed code paths > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > to synchronize accesses to the raw_hwp_list could be possible, but currently > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > folio_lock, so this approach needs update on locking rule and it sounds more > error-prone to me. Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I agree mf_mutex could help to sync its two del_all operations (one from try_memory_failure_hugetlb and one from __update_and_free_hugetlb_folio). I still want to ask a perhaps stupid question, somewhat related to how to implement find_raw_hwp_page() correctly. It seems llist_for_each_safe should only be used to traverse after list entries already *deleted* via llist_del_all. 
But the llist_for_each_safe calls in memory_failure today is used *directly* on the raw_hwp_list. This is quite different from other users of llist_for_each_safe (for example, kernel/bpf/memalloc.c). Why is it correct? I guess mostly because they are sync-ed under mf_mutex (except the missing coverage on __update_and_free_hugetlb_folio)? > > Thanks, > Naoya Horiguchi > > > > > > > > > Another option is, this routine simply doesn one thing: find a > > > raw_hwp_page in raw_hwp_list for a subpage. But the caller needs to 1) > > > test RawHwpUnreliable before calls into the routine, and 2) test > > > RawHwpUnreliable before access returned raw_hwp_page*. I think 2nd > > > option will be error-prone and the 1st option is a better one. > > > > > > Maybe I am over-thinking. What do you think? > > > > I think racing code accessing the raw_hwp_list is very unlikely. > > However, it is possible and should be considered. > > -- > > Mike Kravetz > > > > > > > > > -- > > > > Mike Kravetz > > > > > > > > > + p = container_of(tnode, struct raw_hwp_page, node); > > > > > + if (subpage == p->page) { > > > > > + hwp_page = p; > > > > > + break; > > > > > + } > > > > > + } > > > > > + > > > > > + return hwp_page; > > > > > } > > > > > > > > > > static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) > > > > > -- > > > > > 2.40.1.606.ga4b1b128d6-goog > > > > >
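The two traversal patterns being contrasted in the question above, sketched side by side (the wrapper function names are made up; only the llist calls are from the existing code):

```c
/* Pattern the lock-free llist API is designed for: detach the whole
 * list with llist_del_all(), then traverse the detached entries. */
static void walk_detached(struct folio *folio)
{
	struct llist_node *t, *tnode, *head;
	struct raw_hwp_page *p;

	head = llist_del_all(raw_hwp_list_head(folio));
	llist_for_each_safe(tnode, t, head) {
		p = container_of(tnode, struct raw_hwp_page, node);
		/* entries are now private to this caller */
	}
}

/* Pattern memory-failure uses today: traverse the live list in place,
 * relying on external serialization (mf_mutex) rather than on the
 * llist API itself. */
static void walk_in_place(struct folio *folio)
{
	struct llist_node *t, *tnode;
	struct raw_hwp_page *p;

	llist_for_each_safe(tnode, t, raw_hwp_list_head(folio)->first) {
		p = container_of(tnode, struct raw_hwp_page, node);
		/* entries may still be reachable by other code */
	}
}
```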
On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > <naoya.horiguchi@nec.com> wrote: > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > for hugetlbfs. > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > --- > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > --- a/include/linux/mm.h > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > */ > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > +/* > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > + */ > > > > > > +struct raw_hwp_page { > > > > > > + struct llist_node node; > > > > > > + struct page *page; > > > > > > +}; > > > > > > + > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > +{ > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > +} > > > > > > + > > > > > > +/* > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > > > + */ > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > + struct page *subpage); > > > > > > +#endif > > > > > > + > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > extern void clear_huge_page(struct page *page, > > > > > > unsigned long addr_hint, > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > --- a/mm/memory-failure.c > > > > > > +++ b/mm/memory-failure.c > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > -/* > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > > > - */ > > > > > > -struct raw_hwp_page { > > > > > > - struct llist_node node; > > > > > > - struct page *page; > > > > > > -}; > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > + struct page *subpage) > > > > > > { > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > + struct llist_node *t, *tnode; > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > + struct raw_hwp_page *p; > > > > > > + > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > is true? > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > RawHwpUnreliable or should the caller do. > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > value becomes dangling pointer to caller (if the caller holds that > > > > caller long enough). Maybe returning a bool would be safer to the > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > can definitely be within the routine. > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > Looking closer at the code, I do not see any way to synchronize this. > > > It looks like manipulation in the memory-failure code would be > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > is synchronized against memory-failure code modifying the list. > > > > > > Naoya, can you provide some thoughts? > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > already calls it within mf_mutex, so some wrapper might be needed to implement > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > subsystem, but I think that the impact is not so big if the changed code paths > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > folio_lock, so this approach needs update on locking rule and it sounds more > > error-prone to me. > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > agree mf_mutex could help to sync its two del_all operations (one from > try_memory_failure_hugetlb and one from > __update_and_free_hugetlb_folio). 
> > I still want to ask a perhaps stupid question, somewhat related to how > to implement find_raw_hwp_page() correctly. It seems > llist_for_each_safe should only be used to traverse after list entries > already *deleted* via llist_del_all. But the llist_for_each_safe calls > in memory_failure today is used *directly* on the raw_hwp_list. This > is quite different from other users of llist_for_each_safe (for > example, kernel/bpf/memalloc.c). Oh, I didn't notice that when writing the original code. (I just chose llist_for_each_safe because I just wanted struct llist_node as a singly linked list.) > Why is it correct? I guess mostly > because they are sync-ed under mf_mutex (except the missing coverage > on __update_and_free_hugetlb_folio)? Yes, and there seems to be no good reason to use the macro llist_for_each_safe here. I think it's OK to switch to a simpler one like list_for_each, which is supposed to be called directly. To do this, struct raw_hwp_page needs to have @node typed as struct list_head instead of struct llist_node. Thanks, Naoya Horiguchi
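As a sketch, the list_head-based layout suggested above might look like the following. It deliberately leaves open where the per-folio list head itself would be stored, which is exactly the point raised in the follow-up further down the thread.

```c
/* Sketch only: raw_hwp_page switched to a doubly linked list node and
 * traversed with list_for_each_entry(); where the per-folio list head
 * lives is left unanswered here (see the size concern in the reply
 * below). */
struct raw_hwp_page {
	struct list_head node;
	struct page *page;
};

static struct raw_hwp_page *find_raw_hwp_page(struct list_head *raw_hwp_head,
					       struct page *subpage)
{
	struct raw_hwp_page *p;

	list_for_each_entry(p, raw_hwp_head, node) {
		if (subpage == p->page)
			return p;
	}

	return NULL;
}
```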
On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > <naoya.horiguchi@nec.com> wrote: > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > --- > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > --- a/include/linux/mm.h > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > */ > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > +/* > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > + */ > > > > > > > +struct raw_hwp_page { > > > > > > > + struct llist_node node; > > > > > > > + struct page *page; > > > > > > > +}; > > > > > > > + > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > +{ > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > +} > > > > > > > + > > > > > > > +/* > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > > > > + */ > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > + struct page *subpage); > > > > > > > +#endif > > > > > > > + > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > unsigned long addr_hint, > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > --- a/mm/memory-failure.c > > > > > > > +++ b/mm/memory-failure.c > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > -/* > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > > > > - */ > > > > > > > -struct raw_hwp_page { > > > > > > > - struct llist_node node; > > > > > > > - struct page *page; > > > > > > > -}; > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > + struct page *subpage) > > > > > > > { > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > + struct llist_node *t, *tnode; > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > + struct raw_hwp_page *p; > > > > > > > + > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > is true? > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > can definitely be within the routine. > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > It looks like manipulation in the memory-failure code would be > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > Naoya, can you provide some thoughts? > > > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > subsystem, but I think that the impact is not so big if the changed code paths > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > error-prone to me. 
> > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > agree mf_mutex could help to sync its two del_all operations (one from > > try_memory_failure_hugetlb and one from > > __update_and_free_hugetlb_folio). > > > > I still want to ask a perhaps stupid question, somewhat related to how > > to implement find_raw_hwp_page() correctly. It seems > > llist_for_each_safe should only be used to traverse after list entries > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > in memory_failure today is used *directly* on the raw_hwp_list. This > > is quite different from other users of llist_for_each_safe (for > > example, kernel/bpf/memalloc.c). > > Oh, I don't noticed that when writing the original code. (I just chose > llist_for_each_safe because I just wanted struct llist_node as a singly > linked list.) And maybe because you can avoid doing INIT_LIST_HEAD (which seems doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for the first time)? > > > Why is it correct? I guess mostly > > because they are sync-ed under mf_mutex (except the missing coverage > > on __update_and_free_hugetlb_folio)? > > Yes, and there seems no good reason to use the macro llist_for_each_safe > here. I think it's OK to switch to simpler one like list_for_each which > should be supposed to be called directly. To do this, struct raw_hwp_page > need to have @node (typed struct list_head instead of struct llist_node). I will start to work on a separate patch to switch to list_head, and make sure operations from __update_and_free_hugetlb_folio and memory_failure are serialized (hopefully without intro new locks and just use mf_mutex). > > Thanks, > Naoya Horiguchi
On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > <naoya.horiguchi@nec.com> wrote: > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > --- > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > --- a/include/linux/mm.h > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > */ > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > +/* > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > + */ > > > > > > > > +struct raw_hwp_page { > > > > > > > > + struct llist_node node; > > > > > > > > + struct page *page; > > > > > > > > +}; > > > > > > > > + > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > +{ > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > +} > > > > > > > > + > > > > > > > > +/* > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > + */ > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > + struct page *subpage); > > > > > > > > +#endif > > > > > > > > + > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > unsigned long addr_hint, > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > -/* > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > - */ > > > > > > > > -struct raw_hwp_page { > > > > > > > > - struct llist_node node; > > > > > > > > - struct page *page; > > > > > > > > -}; > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > + struct page *subpage) > > > > > > > > { > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > + > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > is true? > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > can definitely be within the routine. > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > It looks like manipulation in the memory-failure code would be > > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > Naoya, can you provide some thoughts? > > > > > > > > Thanks for elaborating the issue. 
I think that making find_raw_hwp_page() and > > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > > subsystem, but I think that the impact is not so big if the changed code paths > > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > > error-prone to me. > > > > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > > agree mf_mutex could help to sync its two del_all operations (one from > > > try_memory_failure_hugetlb and one from > > > __update_and_free_hugetlb_folio). > > > > > > I still want to ask a perhaps stupid question, somewhat related to how > > > to implement find_raw_hwp_page() correctly. It seems > > > llist_for_each_safe should only be used to traverse after list entries > > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > > in memory_failure today is used *directly* on the raw_hwp_list. This > > > is quite different from other users of llist_for_each_safe (for > > > example, kernel/bpf/memalloc.c). > > > > Oh, I don't noticed that when writing the original code. (I just chose > > llist_for_each_safe because I just wanted struct llist_node as a singly > > linked list.) > > And maybe because you can avoid doing INIT_LIST_HEAD (which seems > doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for > the first time)? > > > > > > Why is it correct? I guess mostly > > > because they are sync-ed under mf_mutex (except the missing coverage > > > on __update_and_free_hugetlb_folio)? > > > > Yes, and there seems no good reason to use the macro llist_for_each_safe > > here. I think it's OK to switch to simpler one like list_for_each which > > should be supposed to be called directly. To do this, struct raw_hwp_page > > need to have @node (typed struct list_head instead of struct llist_node). Hi Naoya, a maybe-stupid question on list vs llist: _hugetlb_hwpoison in folio is a void*. struct list_head is composed of two pointers to list_node (prev and next) so folio just can't hold a list_head in the _hugetlb_hwpoison field, right? llist_head on the other hand only contains one pointer to llist_node first. I wonder if this is one of the reason you picked llist instead of list in the first place. The reason I ask is while I was testing my refactor draft, I constantly see the refcount of the 3rd subpage in the folio got corrupted. I am not sure about the exact reason but it feels to me related to the above reason. > > I will start to work on a separate patch to switch to list_head, and > make sure operations from __update_and_free_hugetlb_folio and > memory_failure are serialized (hopefully without intro new locks and > just use mf_mutex). > > > > > Thanks, > > Naoya Horiguchi
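A side note for readers on the detach-then-walk pattern raised here: llist_del_all() atomically takes the whole chain, and only the detached chain is walked, so traversal can never race with a concurrent llist_add(). A minimal sketch using the names from this thread (illustrative only, not the posted code; the move_flag behaviour is an assumption about what the real __folio_free_raw_hwp would keep doing):

	/*
	 * Sketch: detach the raw_hwp list first, then walk and free the
	 * detached entries.  llist_for_each_entry_safe() then only ever sees
	 * nodes that no other CPU can still be pushing to.
	 */
	static void folio_free_raw_hwp_sketch(struct folio *folio, bool move_flag)
	{
		struct llist_node *head = llist_del_all(raw_hwp_list_head(folio));
		struct raw_hwp_page *p, *next;

		llist_for_each_entry_safe(p, next, head, node) {
			if (move_flag)
				SetPageHWPoison(p->page);	/* keep per-page poison info */
			kfree(p);
		}
	}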
On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > <naoya.horiguchi@nec.com> wrote: > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > --- > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > */ > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > +/* > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > + */ > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > + struct llist_node node; > > > > > > > > > + struct page *page; > > > > > > > > > +}; > > > > > > > > > + > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > +{ > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > +} > > > > > > > > > + > > > > > > > > > +/* > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > + */ > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > + struct page *subpage); > > > > > > > > > +#endif > > > > > > > > > + > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > unsigned long addr_hint, > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > -/* > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > - */ > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > - struct llist_node node; > > > > > > > > > - struct page *page; > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > + struct page *subpage) > > > > > > > > > { > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > + > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > is true? > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > can definitely be within the routine. > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > is synchronized against memory-failure code modifying the list. 
> > > > > > > > > > > > Naoya, can you provide some thoughts? > > > > > > > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > > > subsystem, but I think that the impact is not so big if the changed code paths > > > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > > > error-prone to me. > > > > > > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > > > agree mf_mutex could help to sync its two del_all operations (one from > > > > try_memory_failure_hugetlb and one from > > > > __update_and_free_hugetlb_folio). > > > > > > > > I still want to ask a perhaps stupid question, somewhat related to how > > > > to implement find_raw_hwp_page() correctly. It seems > > > > llist_for_each_safe should only be used to traverse after list entries > > > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > > > in memory_failure today is used *directly* on the raw_hwp_list. This > > > > is quite different from other users of llist_for_each_safe (for > > > > example, kernel/bpf/memalloc.c). > > > > > > Oh, I don't noticed that when writing the original code. (I just chose > > > llist_for_each_safe because I just wanted struct llist_node as a singly > > > linked list.) > > > > And maybe because you can avoid doing INIT_LIST_HEAD (which seems > > doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for > > the first time)? > > > > > > > > > Why is it correct? I guess mostly > > > > because they are sync-ed under mf_mutex (except the missing coverage > > > > on __update_and_free_hugetlb_folio)? > > > > > > Yes, and there seems no good reason to use the macro llist_for_each_safe > > > here. I think it's OK to switch to simpler one like list_for_each which > > > should be supposed to be called directly. To do this, struct raw_hwp_page > > > need to have @node (typed struct list_head instead of struct llist_node). > > Hi Naoya, a maybe-stupid question on list vs llist: _hugetlb_hwpoison > in folio is a void*. struct list_head is composed of two pointers to > list_node (prev and next) so folio just can't hold a list_head in the > _hugetlb_hwpoison field, right? llist_head on the other hand only > contains one pointer to llist_node first. I wonder if this is one of > the reason you picked llist instead of list in the first place. Yes, that's one reason to use llist_head, and another (minor) reason is that we don't need doubly-linked list here. Thanks, Naoya Horiguchi > > The reason I ask is while I was testing my refactor draft, I > constantly see the refcount of the 3rd subpage in the folio got > corrupted. I am not sure about the exact reason but it feels to me > related to the above reason. 
> > > > > I will start to work on a separate patch to switch to list_head, and > > make sure operations from __update_and_free_hugetlb_folio and > > memory_failure are serialized (hopefully without intro new locks and > > just use mf_mutex). > > > > > > > > Thanks, > > > Naoya Horiguchi > >
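To spell out the size constraint Naoya confirms here, a small illustrative check (not part of any posted patch) could be:

	#include <linux/build_bug.h>
	#include <linux/list.h>
	#include <linux/llist.h>

	static inline void raw_hwp_head_size_check(void)
	{
		/*
		 * struct llist_head is a single pointer (->first), so it can be
		 * overlaid on the folio's void *_hugetlb_hwpoison field the way
		 * raw_hwp_list_head() does.
		 */
		BUILD_BUG_ON(sizeof(struct llist_head) != sizeof(void *));

		/*
		 * struct list_head carries two pointers (->next and ->prev) and
		 * therefore cannot fit in that single-pointer field.
		 */
		BUILD_BUG_ON(sizeof(struct list_head) != 2 * sizeof(void *));
	}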
On Sun, Jun 11, 2023 at 9:19 PM Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote: > > On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > > --- > > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > > */ > > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > +/* > > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > + */ > > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > > + struct llist_node node; > > > > > > > > > > + struct page *page; > > > > > > > > > > +}; > > > > > > > > > > + > > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > +{ > > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > +} > > > > > > > > > > + > > > > > > > > > > +/* > > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > > + */ > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > + struct page *subpage); > > > > > > > > > > +#endif > > > > > > > > > > + > > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > > unsigned long addr_hint, > > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > -/* > > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > - */ > > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > > - struct llist_node node; > > > > > > > > > > - struct page *page; > > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > + struct page *subpage) > > > > > > > > > > { > > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > > + > > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > > is true? > > > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > > can definitely be within the routine. > > > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > > synchronized via the mf_mutex. 
However, I do not see how traversal and > > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > > > > > Naoya, can you provide some thoughts? Hi Mike, Now looking again this, I think concurrent adding and deleting are fine with each other and with themselves, because raw_hwp_list is lock-less llist. As for synchronizing traversal with adding and deleting, I wonder is it a good idea to make __update_and_free_hugetlb_folio hold hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is missing the lock. > > > > > > > > > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > > > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > > > > subsystem, but I think that the impact is not so big if the changed code paths > > > > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > > > > error-prone to me. > > > > > > > > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > > > > agree mf_mutex could help to sync its two del_all operations (one from > > > > > try_memory_failure_hugetlb and one from > > > > > __update_and_free_hugetlb_folio). > > > > > > > > > > I still want to ask a perhaps stupid question, somewhat related to how > > > > > to implement find_raw_hwp_page() correctly. It seems > > > > > llist_for_each_safe should only be used to traverse after list entries > > > > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > > > > in memory_failure today is used *directly* on the raw_hwp_list. This > > > > > is quite different from other users of llist_for_each_safe (for > > > > > example, kernel/bpf/memalloc.c). > > > > > > > > Oh, I don't noticed that when writing the original code. (I just chose > > > > llist_for_each_safe because I just wanted struct llist_node as a singly > > > > linked list.) > > > > > > And maybe because you can avoid doing INIT_LIST_HEAD (which seems > > > doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for > > > the first time)? > > > > > > > > > > > > Why is it correct? I guess mostly > > > > > because they are sync-ed under mf_mutex (except the missing coverage > > > > > on __update_and_free_hugetlb_folio)? > > > > > > > > Yes, and there seems no good reason to use the macro llist_for_each_safe > > > > here. I think it's OK to switch to simpler one like list_for_each which > > > > should be supposed to be called directly. To do this, struct raw_hwp_page > > > > need to have @node (typed struct list_head instead of struct llist_node). 
> > Hi Naoya, a maybe-stupid question on list vs llist: _hugetlb_hwpoison
> > in folio is a void*. struct list_head is composed of two pointers to
> > list_node (prev and next) so folio just can't hold a list_head in the
> > _hugetlb_hwpoison field, right? llist_head on the other hand only
> > contains one pointer to llist_node first. I wonder if this is one of
> > the reason you picked llist instead of list in the first place.
>
> Yes, that's one reason to use llist_head, and another (minor) reason is that
> we don't need doubly-linked list here.

Hi Naoya,

Even with hugetlb_lock, I think we should still fix __folio_free_raw_hwp: call llist_del_all first, then traverse the list and free the raw_hwp_page entries. Then folio_clear_hugetlb_hwpoison from both memory_failure and hugetlb will be safe, given that llist_del_all on an llist is safe against itself.

In my v2, I tried both (1. taking hugetlb_lock in __update_and_free_hugetlb_folio, 2. calling llist_del_all first in __folio_free_raw_hwp). I also changed find_raw_hwp_page to is_raw_hwp_subpage (it only returns a bool, and takes hugetlb_lock while traversing raw_hwp_list). So far I haven't run into any problems with my selftest.

>
> Thanks,
> Naoya Horiguchi
>
> >
> > The reason I ask is while I was testing my refactor draft, I
> > constantly see the refcount of the 3rd subpage in the folio got
> > corrupted. I am not sure about the exact reason but it feels to me
> > related to the above reason.
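A sketch of the v2 shape described above, based only on this description (hypothetical helper names, not the patch that was eventually posted):

	/* Return only a bool so no raw_hwp_page pointer can dangle after the
	 * list is freed, and hold hugetlb_lock around the traversal. */
	static bool __is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
	{
		struct llist_head *head;
		struct llist_node *t, *tnode;

		if (folio_test_hugetlb_raw_hwp_unreliable(folio))
			return true;	/* per-page info was discarded; assume poisoned */

		head = raw_hwp_list_head(folio);
		llist_for_each_safe(tnode, t, head->first) {
			struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);

			if (p->page == subpage)
				return true;
		}
		return false;
	}

	bool is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
	{
		bool ret;

		spin_lock_irq(&hugetlb_lock);
		ret = __is_raw_hwp_subpage(folio, subpage);
		spin_unlock_irq(&hugetlb_lock);
		return ret;
	}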
On 06/16/23 14:19, Jiaqi Yan wrote: > On Sun, Jun 11, 2023 at 9:19 PM Naoya Horiguchi > <naoya.horiguchi@linux.dev> wrote: > > > > On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > > > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > > > --- > > > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > > > */ > > > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > +/* > > > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > + */ > > > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > > > + struct llist_node node; > > > > > > > > > > > + struct page *page; > > > > > > > > > > > +}; > > > > > > > > > > > + > > > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > +{ > > > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > +} > > > > > > > > > > > + > > > > > > > > > > > +/* > > > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > > > + */ > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > + struct page *subpage); > > > > > > > > > > > +#endif > > > > > > > > > > > + > > > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > > > unsigned long addr_hint, > > > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > -/* > > > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > - */ > > > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > > > - struct llist_node node; > > > > > > > > > > > - struct page *page; > > > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > + struct page *subpage) > > > > > > > > > > > { > > > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > > > + > > > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > > > is true? > > > > > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > > > can definitely be within the routine. > > > > > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > > > synchronized via the mf_mutex. 
However, I do not see how traversal and > > > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > > > > > > > Naoya, can you provide some thoughts? > > Hi Mike, > > Now looking again this, I think concurrent adding and deleting are > fine with each other and with themselves, because raw_hwp_list is > lock-less llist. Correct. > As for synchronizing traversal with adding and deleting, I wonder is > it a good idea to make __update_and_free_hugetlb_folio hold > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > missing the lock. I do not think the lock is needed. However, while looking more closely at this I think I discovered another issue. This is VERY subtle. Perhaps Naoya can help verify if my reasoning below is correct. In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. Why is this? Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. The purpose of remove_hugetlb_folio is to remove the huge page from the list AND compound page destructor indicating this is a hugetlb page is changed. This is all done while holding the hugetlb lock. So, the test for folio_test_hugetlb(folio) is false. We have technically a compound non-hugetlb page with a non-null raw_hwp_list. Important note: at this time we have not reallocated vmemmap pages if hugetlb page was vmemmap optimized. That is done later in __update_and_free_hugetlb_folio. The 'good news' is that after this point get_huge_page_for_hwpoison will not recognize this as a hugetlb page, so nothing will be added to the list. There is no need to worry about entries being added to the list during traversal. The 'bad news' is that if we get a memory error at this time we will treat it as a memory error on a regular compound page. So, TestSetPageHWPoison(p) in memory_failure() may try to write a read only struct page. :(
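An outline of the window Mike describes, using the call names mentioned above (a reading aid only, not the real mm/hugetlb.c code paths):

	static void dissolve_path_outline(struct hstate *h, struct folio *folio)
	{
		spin_lock_irq(&hugetlb_lock);
		remove_hugetlb_folio(h, folio, false);	/* dtor cleared: folio_test_hugetlb() now false */
		spin_unlock_irq(&hugetlb_lock);

		/*
		 * Window: no longer "hugetlb", but vmemmap may still be optimized
		 * away.  A memory_failure() hitting a tail pfn here treats it as a
		 * normal compound page, and TestSetPageHWPoison() can end up
		 * writing to a read-only struct page.
		 */

		update_and_free_hugetlb_folio(h, folio, false);	/* restores vmemmap, frees pages */
	}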
On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/16/23 14:19, Jiaqi Yan wrote: > > On Sun, Jun 11, 2023 at 9:19 PM Naoya Horiguchi > > <naoya.horiguchi@linux.dev> wrote: > > > > > > On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > > > > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > > > > --- > > > > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > > > > */ > > > > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > > +/* > > > > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > > + */ > > > > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > > > > + struct llist_node node; > > > > > > > > > > > > + struct page *page; > > > > > > > > > > > > +}; > > > > > > > > > > > > + > > > > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > > +{ > > > > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > > +} > > > > > > > > > > > > + > > > > > > > > > > > > +/* > > > > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > > > > + */ > > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > > + struct page *subpage); > > > > > > > > > > > > +#endif > > > > > > > > > > > > + > > > > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > > > > unsigned long addr_hint, > > > > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > > -/* > > > > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > > - */ > > > > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > > > > - struct llist_node node; > > > > > > > > > > > > - struct page *page; > > > > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > > + struct page *subpage) > > > > > > > > > > > > { > > > > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > > > > + > > > > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > > > > is true? > > > > > > > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > > > > can definitely be within the routine. > > > > > > > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > > > > Looking closer at the code, I do not see any way to synchronize this. 
> > > > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > > > > > > > > > Naoya, can you provide some thoughts? > > > > Hi Mike, > > > > Now looking again this, I think concurrent adding and deleting are > > fine with each other and with themselves, because raw_hwp_list is > > lock-less llist. > > Correct. > > > As for synchronizing traversal with adding and deleting, I wonder is > > it a good idea to make __update_and_free_hugetlb_folio hold > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > missing the lock. > > I do not think the lock is needed. However, while looking more closely > at this I think I discovered another issue. > This is VERY subtle. > Perhaps Naoya can help verify if my reasoning below is correct. > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > Why is this? > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > The purpose of remove_hugetlb_folio is to remove the huge page from the > list AND compound page destructor indicating this is a hugetlb page is changed. > This is all done while holding the hugetlb lock. So, the test for > folio_test_hugetlb(folio) is false. > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > Important note: at this time we have not reallocated vmemmap pages if > hugetlb page was vmemmap optimized. That is done later in > __update_and_free_hugetlb_folio. > > The 'good news' is that after this point get_huge_page_for_hwpoison will > not recognize this as a hugetlb page, so nothing will be added to the > list. There is no need to worry about entries being added to the list > during traversal. > > The 'bad news' is that if we get a memory error at this time we will > treat it as a memory error on a regular compound page. So, > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > struct page. :( At least I think this is an issue. Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock until update_and_free_hugetlb_folio is done, or basically until dissolve_free_huge_page is done? TestSetPageHWPoison in memory_failure is called after try_memory_failure_hugetlb, and folio_test_hugetlb is tested within __get_huge_page_for_hwpoison, which is wrapped by the hugetlb_lock. So by the time dissolve_free_huge_page returns, subpages already go through hugetlb_vmemmap_restore and __destroy_compound_gigantic_folio and become non-compound raw pages (folios). Now folio_test_hugetlb(p)=false will be correct for memory_failure, and it can recover p as a dissolved non-hugetlb page. > -- > Mike Kravetz
On 06/16/23 19:18, Jiaqi Yan wrote: > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/16/23 14:19, Jiaqi Yan wrote: > > > > > > Now looking again this, I think concurrent adding and deleting are > > > fine with each other and with themselves, because raw_hwp_list is > > > lock-less llist. > > > > Correct. > > > > > As for synchronizing traversal with adding and deleting, I wonder is > > > it a good idea to make __update_and_free_hugetlb_folio hold > > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > > missing the lock. > > > > I do not think the lock is needed. However, while looking more closely > > at this I think I discovered another issue. > > This is VERY subtle. > > Perhaps Naoya can help verify if my reasoning below is correct. > > > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > > Why is this? > > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > > The purpose of remove_hugetlb_folio is to remove the huge page from the > > list AND compound page destructor indicating this is a hugetlb page is changed. > > This is all done while holding the hugetlb lock. So, the test for > > folio_test_hugetlb(folio) is false. > > > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > > > Important note: at this time we have not reallocated vmemmap pages if > > hugetlb page was vmemmap optimized. That is done later in > > __update_and_free_hugetlb_folio. > > > > > > The 'good news' is that after this point get_huge_page_for_hwpoison will > > not recognize this as a hugetlb page, so nothing will be added to the > > list. There is no need to worry about entries being added to the list > > during traversal. > > > > The 'bad news' is that if we get a memory error at this time we will > > treat it as a memory error on a regular compound page. So, > > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > > struct page. :( > > At least I think this is an issue. > > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock > until update_and_free_hugetlb_folio is done, or basically until > dissolve_free_huge_page is done? Unfortunately, update_and_free_hugetlb_folio is designed to be called without locks held. This is because we can not hold any locks while allocating vmemmap pages. I'll try to think of some way to restructure the code. IIUC, this is a potential general issue, not just isolated to memory error handling.
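For readers, the reason the lock cannot simply be widened is that restoring vmemmap allocates memory and may sleep, which is not allowed under the hugetlb_lock spinlock; a tiny illustration (hypothetical function, not from the kernel):

	static int restore_vmemmap_illustration(void)
	{
		struct page *page;

		might_sleep();			/* would warn if called under a spinlock */
		page = alloc_page(GFP_KERNEL);	/* may block waiting for reclaim */
		if (!page)
			return -ENOMEM;
		__free_page(page);
		return 0;
	}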
On Sat, Jun 17, 2023 at 03:59:27PM -0700, Mike Kravetz wrote: > On 06/16/23 19:18, Jiaqi Yan wrote: > > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > On 06/16/23 14:19, Jiaqi Yan wrote: > > > > > > > > Now looking again this, I think concurrent adding and deleting are > > > > fine with each other and with themselves, because raw_hwp_list is > > > > lock-less llist. > > > > > > Correct. > > > > > > > As for synchronizing traversal with adding and deleting, I wonder is > > > > it a good idea to make __update_and_free_hugetlb_folio hold > > > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > > > missing the lock. > > > > > > I do not think the lock is needed. However, while looking more closely > > > at this I think I discovered another issue. > > > This is VERY subtle. > > > Perhaps Naoya can help verify if my reasoning below is correct. > > > > > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > > > Why is this? > > > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > > > The purpose of remove_hugetlb_folio is to remove the huge page from the > > > list AND compound page destructor indicating this is a hugetlb page is changed. > > > This is all done while holding the hugetlb lock. So, the test for > > > folio_test_hugetlb(folio) is false. > > > > > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > > > > > Important note: at this time we have not reallocated vmemmap pages if > > > hugetlb page was vmemmap optimized. That is done later in > > > __update_and_free_hugetlb_folio. > > > > > > > > > > The 'good news' is that after this point get_huge_page_for_hwpoison will > > > not recognize this as a hugetlb page, so nothing will be added to the > > > list. There is no need to worry about entries being added to the list > > > during traversal. > > > > > > The 'bad news' is that if we get a memory error at this time we will > > > treat it as a memory error on a regular compound page. So, > > > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > > > struct page. :( > > > > At least I think this is an issue. > > > > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock > > until update_and_free_hugetlb_folio is done, or basically until > > dissolve_free_huge_page is done? > > Unfortunately, update_and_free_hugetlb_folio is designed to be called > without locks held. This is because we can not hold any locks while > allocating vmemmap pages. > > I'll try to think of some way to restructure the code. IIUC, this is a > potential general issue, not just isolated to memory error handling. Considering this issue as one specific to memory error handling, checking HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to detect the race. Then, an idea like the below diff (not tested) can make try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait for complete the allocation of vmemmap pages. 
@@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	int ret = 2;	/* fallback to normal page handling */
 	bool count_increased = false;
 
-	if (!folio_test_hugetlb(folio))
+	if (!folio_test_hugetlb(folio)) {
+		if (folio_test_hugetlb_vmemmap_optimized(folio))
+			ret = -EBUSY;
 		goto out;
+	}
 
 	if (flags & MF_COUNT_INCREASED) {
 		ret = 1;

Thanks,
Naoya Horiguchi

> --
> Mike Kravetz
>
> >
> > TestSetPageHWPoison in memory_failure is called after
> > try_memory_failure_hugetlb, and folio_test_hugetlb is tested within
> > __get_huge_page_for_hwpoison, which is wrapped by the hugetlb_lock. So
> > by the time dissolve_free_huge_page returns, subpages already go
> > through hugetlb_vmemmap_restore and __destroy_compound_gigantic_folio
> > and become non-compound raw pages (folios). Now
> > folio_test_hugetlb(p)=false will be correct for memory_failure, and it
> > can recover p as a dissolved non-hugetlb page.
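For context, the caller could consume that -EBUSY as a bounded retry, modeled loosely on the retry handling already present in try_memory_failure_hugetlb() (a sketch, not verbatim upstream code; MF_NO_RETRY is the existing retry-once flag):

	static int hugetlb_hwpoison_retry_sketch(unsigned long pfn, int *flags)
	{
		bool migratable_cleared = false;
		int res;

	retry:
		res = get_huge_page_for_hwpoison(pfn, *flags, &migratable_cleared);
		if (res == -EBUSY && !(*flags & MF_NO_RETRY)) {
			/* vmemmap still being restored: take one more look at the page */
			*flags |= MF_NO_RETRY;
			goto retry;
		}
		return res;
	}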
On 06/19/23 17:23, Naoya Horiguchi wrote: > On Sat, Jun 17, 2023 at 03:59:27PM -0700, Mike Kravetz wrote: > > On 06/16/23 19:18, Jiaqi Yan wrote: > > > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 06/16/23 14:19, Jiaqi Yan wrote: > > > > > > > > > > Now looking again this, I think concurrent adding and deleting are > > > > > fine with each other and with themselves, because raw_hwp_list is > > > > > lock-less llist. > > > > > > > > Correct. > > > > > > > > > As for synchronizing traversal with adding and deleting, I wonder is > > > > > it a good idea to make __update_and_free_hugetlb_folio hold > > > > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > > > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > > > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > > > > missing the lock. > > > > > > > > I do not think the lock is needed. However, while looking more closely > > > > at this I think I discovered another issue. > > > > This is VERY subtle. > > > > Perhaps Naoya can help verify if my reasoning below is correct. > > > > > > > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > > > > Why is this? > > > > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > > > > The purpose of remove_hugetlb_folio is to remove the huge page from the > > > > list AND compound page destructor indicating this is a hugetlb page is changed. > > > > This is all done while holding the hugetlb lock. So, the test for > > > > folio_test_hugetlb(folio) is false. > > > > > > > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > > > > > > > Important note: at this time we have not reallocated vmemmap pages if > > > > hugetlb page was vmemmap optimized. That is done later in > > > > __update_and_free_hugetlb_folio. > > > > > > > > > > > > > > The 'good news' is that after this point get_huge_page_for_hwpoison will > > > > not recognize this as a hugetlb page, so nothing will be added to the > > > > list. There is no need to worry about entries being added to the list > > > > during traversal. > > > > > > > > The 'bad news' is that if we get a memory error at this time we will > > > > treat it as a memory error on a regular compound page. So, > > > > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > > > > struct page. :( > > > > > > At least I think this is an issue. > > > > > > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock > > > until update_and_free_hugetlb_folio is done, or basically until > > > dissolve_free_huge_page is done? > > > > Unfortunately, update_and_free_hugetlb_folio is designed to be called > > without locks held. This is because we can not hold any locks while > > allocating vmemmap pages. > > > > I'll try to think of some way to restructure the code. IIUC, this is a > > potential general issue, not just isolated to memory error handling. > > Considering this issue as one specific to memory error handling, checking > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > detect the race. Then, an idea like the below diff (not tested) can make > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > for complete the allocation of vmemmap pages. 
>
> @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
>  	int ret = 2;	/* fallback to normal page handling */
>  	bool count_increased = false;
>
> -	if (!folio_test_hugetlb(folio))
> +	if (!folio_test_hugetlb(folio)) {
> +		if (folio_test_hugetlb_vmemmap_optimized(folio))
> +			ret = -EBUSY;

The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in the folio->private field. In the case where the folio is a non-hugetlb folio, the folio->private field could be any arbitrary value. As such, the test for vmemmap_optimized may return a false positive. We could end up retrying for an arbitrarily long time.

I am looking at how to restructure the code which removes and frees hugetlb pages so that folio_test_hugetlb() would remain true until vmemmap pages are allocated. The easiest way to do this would introduce another hugetlb lock/unlock cycle in the page freeing path. This would undo some of the speedups in the series:
https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab
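To make the false-positive concern concrete, a small illustration (hypothetical helper; the one fact it relies on, stated above, is that the HPG_* bits live in folio->private):

	static bool hpg_vmemmap_optimized_is_trustworthy(struct folio *folio)
	{
		if (folio_test_hugetlb(folio))
			return true;	/* flag bits in folio->private are meaningful */

		/*
		 * Not (or no longer) a hugetlb folio: folio->private may hold
		 * buffer_heads, a swap entry, driver data, etc.  Testing
		 * HPG_vmemmap_optimized here just samples an arbitrary bit, so
		 * the -EBUSY retry above could trigger (or be missed) at random.
		 */
		return false;
	}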
On 06/20/23 11:05, Mike Kravetz wrote: > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > Considering this issue as one specific to memory error handling, checking > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > detect the race. Then, an idea like the below diff (not tested) can make > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > for complete the allocation of vmemmap pages. > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > int ret = 2; /* fallback to normal page handling */ > > bool count_increased = false; > > > > - if (!folio_test_hugetlb(folio)) > > + if (!folio_test_hugetlb(folio)) { > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > + ret = -EBUSY; > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > the folio->private field. > > In the case where the folio is a non-hugetlb folio, the folio->private field > could be any arbitrary value. As such, the test for vmemmap_optimized may > return a false positive. We could end up retrying for an arbitrarily > long time. > > I am looking at how to restructure the code which removes and frees > hugetlb pages so that folio_test_hugetlb() would remain true until > vmemmap pages are allocated. The easiest way to do this would introduce > another hugetlb lock/unlock cycle in the page freeing path. This would > undo some of the speedups in the series: > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > Perhaps something like this? Minimal testing. From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 From: Mike Kravetz <mike.kravetz@oracle.com> Date: Tue, 20 Jun 2023 14:48:39 -0700 Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap Freeing a hugetlb page and releasing base pages back to the underlying allocator such as buddy or cma is performed in two steps: - remove_hugetlb_folio() is called to remove the folio from hugetlb lists, get a ref on the page and remove hugetlb destructor. This all must be done under the hugetlb lock. After this call, the page can be treated as a normal compound page or a collection of base size pages. - update_and_free_hugetlb_folio() is called to allocate vmemmap if needed and the free routine of the underlying allocator is called on the resulting page. We can not hold the hugetlb lock here. One issue with this scheme is that a memory error could occur between these two steps. In this case, the memory error handling code treats the old hugetlb page as a normal compound page or collection of base pages. It will then try to SetPageHWPoison(page) on the page with an error. If the page with error is a tail page without vmemmap, a write error will occur when trying to set the flag. Address this issue by modifying remove_hugetlb_folio() and update_and_free_hugetlb_folio() such that the hugetlb destructor is not cleared until after allocating vmemmap. Since clearing the destructor required holding the hugetlb lock, the clearing is done in remove_hugetlb_folio() if the vmemmap is present. This saves a lock/unlock cycle. Otherwise, destructor is cleared in update_and_free_hugetlb_folio() after allocating vmemmap. Note that this will leave hugetlb pages in a state where they are marked free (by hugetlb specific page flag) and have a ref count. This is not a normal state. 
The only code that would notice is the memory error code, and it is set up to retry in such a case. A subsequent patch will create a routine to do bulk processing of vmemmap allocation. This will eliminate a lock/unlock cycle for each hugetlb page in the case where we are freeing a bunch of pages. Fixes: ??? Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- 1 file changed, 51 insertions(+), 24 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d76574425da3..f7f64470aee0 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, unsigned int order) { } #endif +static inline void __clear_hugetlb_destructor(struct hstate *h, + struct folio *folio) +{ + lockdep_assert_held(&hugetlb_lock); + + /* + * Very subtle + * + * For non-gigantic pages set the destructor to the normal compound + * page dtor. This is needed in case someone takes an additional + * temporary ref to the page, and freeing is delayed until they drop + * their reference. + * + * For gigantic pages set the destructor to the null dtor. This + * destructor will never be called. Before freeing the gigantic + * page destroy_compound_gigantic_folio will turn the folio into a + * simple group of pages. After this the destructor does not + * apply. + * + */ + if (hstate_is_gigantic(h)) + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); + else + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); +} + /* - * Remove hugetlb folio from lists, and update dtor so that the folio appears - * as just a compound page. + * Remove hugetlb folio from lists. + * If vmemmap exists for the folio, update dtor so that the folio appears + * as just a compound page. Otherwise, wait until after allocating vmemmap + * to update dtor. * * A reference is held on the folio, except in the case of demote. * @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, } /* - * Very subtle - * - * For non-gigantic pages set the destructor to the normal compound - * page dtor. This is needed in case someone takes an additional - * temporary ref to the page, and freeing is delayed until they drop - * their reference. - * - * For gigantic pages set the destructor to the null dtor. This - * destructor will never be called. Before freeing the gigantic - * page destroy_compound_gigantic_folio will turn the folio into a - * simple group of pages. After this the destructor does not - * apply. - * - * This handles the case where more than one ref is held when and - * after update_and_free_hugetlb_folio is called. - * - * In the case of demote we do not ref count the page as it will soon - * be turned into a page of smaller size. + * We can only clear the hugetlb destructor after allocating vmemmap + * pages. Otherwise, someone (memory error handling) may try to write + * to tail struct pages. + */ + if (!folio_test_hugetlb_vmemmap_optimized(folio)) + __clear_hugetlb_destructor(h, folio); + + /* + * In the case of demote we do not ref count the page as it will soon + * be turned into a page of smaller size. 
*/ if (!demote) folio_ref_unfreeze(folio, 1); - if (hstate_is_gigantic(h)) - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); - else - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); h->nr_huge_pages--; h->nr_huge_pages_node[nid]--; @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, { int i; struct page *subpage; + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return; @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, if (unlikely(folio_test_hwpoison(folio))) folio_clear_hugetlb_hwpoison(folio); + /* + * If vmemmap pages were allocated above, then we need to clear the + * hugetlb destructor under the hugetlb lock. + */ + if (clear_dtor) { + spin_lock_irq(&hugetlb_lock); + __clear_hugetlb_destructor(h, folio); + spin_unlock_irq(&hugetlb_lock); + } + for (i = 0; i < pages_per_huge_page(h); i++) { subpage = folio_page(folio, i); subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
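A single-threaded sketch of the ordering the patch aims for (not kernel code; all names are simplified stand-ins): a memory-error check racing between the two freeing steps still sees a hugetlb folio while the vmemmap is optimized, so it takes the hugetlb/retry path instead of writing to tail struct pages that are not backed by memory.

/*
 * Model of the intended ordering: the "hugetlb destructor" survives
 * remove_hugetlb_folio() whenever the folio is vmemmap optimized, and is
 * only dropped once update_and_free_hugetlb_folio() has restored vmemmap.
 */
#include <stdbool.h>
#include <stdio.h>

struct folio_model {
	bool hugetlb_dtor;	/* models folio_test_hugetlb() */
	bool vmemmap_optimized;	/* tail struct pages not writable yet */
};

static void remove_hugetlb_folio(struct folio_model *f)
{
	/* Patched behavior: only drop the dtor now if tails are writable. */
	if (!f->vmemmap_optimized)
		f->hugetlb_dtor = false;
}

static void memory_error_handler(const struct folio_model *f)
{
	if (f->hugetlb_dtor)
		printf("handler: still hugetlb -> hugetlb path / retry\n");
	else
		printf("handler: normal compound page -> may touch tails\n");
}

static void update_and_free_hugetlb_folio(struct folio_model *f)
{
	if (f->vmemmap_optimized) {
		f->vmemmap_optimized = false;	/* vmemmap pages allocated */
		f->hugetlb_dtor = false;	/* now safe to clear dtor */
	}
}

int main(void)
{
	struct folio_model f = { .hugetlb_dtor = true, .vmemmap_optimized = true };

	remove_hugetlb_folio(&f);
	memory_error_handler(&f);		/* error races in between */
	update_and_free_hugetlb_folio(&f);
	printf("after free: hugetlb_dtor=%d\n", f.hugetlb_dtor);
	return 0;
}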
On Tue, Jun 20, 2023 at 3:39 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/20/23 11:05, Mike Kravetz wrote: > > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > > > Considering this issue as one specific to memory error handling, checking > > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > > detect the race. Then, an idea like the below diff (not tested) can make > > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > > for complete the allocation of vmemmap pages. > > > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > > int ret = 2; /* fallback to normal page handling */ > > > bool count_increased = false; > > > > > > - if (!folio_test_hugetlb(folio)) > > > + if (!folio_test_hugetlb(folio)) { > > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > > + ret = -EBUSY; > > > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > > the folio->private field. > > > > In the case where the folio is a non-hugetlb folio, the folio->private field > > could be any arbitrary value. As such, the test for vmemmap_optimized may > > return a false positive. We could end up retrying for an arbitrarily > > long time. > > > > I am looking at how to restructure the code which removes and frees > > hugetlb pages so that folio_test_hugetlb() would remain true until > > vmemmap pages are allocated. The easiest way to do this would introduce > > another hugetlb lock/unlock cycle in the page freeing path. This would > > undo some of the speedups in the series: > > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > > > > Perhaps something like this? Minimal testing. Thanks for putting up a fix, Mike! > > From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 > From: Mike Kravetz <mike.kravetz@oracle.com> > Date: Tue, 20 Jun 2023 14:48:39 -0700 > Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap > > Freeing a hugetlb page and releasing base pages back to the underlying > allocator such as buddy or cma is performed in two steps: > - remove_hugetlb_folio() is called to remove the folio from hugetlb > lists, get a ref on the page and remove hugetlb destructor. This > all must be done under the hugetlb lock. After this call, the page > can be treated as a normal compound page or a collection of base > size pages. > - update_and_free_hugetlb_folio() is called to allocate vmemmap if > needed and the free routine of the underlying allocator is called > on the resulting page. We can not hold the hugetlb lock here. > > One issue with this scheme is that a memory error could occur between > these two steps. In this case, the memory error handling code treats > the old hugetlb page as a normal compound page or collection of base > pages. It will then try to SetPageHWPoison(page) on the page with an > error. If the page with error is a tail page without vmemmap, a write > error will occur when trying to set the flag. > > Address this issue by modifying remove_hugetlb_folio() and > update_and_free_hugetlb_folio() such that the hugetlb destructor is not > cleared until after allocating vmemmap. Since clearing the destructor > required holding the hugetlb lock, the clearing is done in > remove_hugetlb_folio() if the vmemmap is present. This saves a > lock/unlock cycle. Otherwise, destructor is cleared in > update_and_free_hugetlb_folio() after allocating vmemmap. 
> > Note that this will leave hugetlb pages in a state where they are marked > free (by hugetlb specific page flag) and have a ref count. This is not > a normal state. The only code that would notice is the memory error > code, and it is set up to retry in such a case. > > A subsequent patch will create a routine to do bulk processing of > vmemmap allocation. This will eliminate a lock/unlock cycle for each > hugetlb page in the case where we are freeing a bunch of pages. > > Fixes: ??? > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > --- > mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 51 insertions(+), 24 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index d76574425da3..f7f64470aee0 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, > unsigned int order) { } > #endif > > +static inline void __clear_hugetlb_destructor(struct hstate *h, > + struct folio *folio) > +{ > + lockdep_assert_held(&hugetlb_lock); > + > + /* > + * Very subtle > + * > + * For non-gigantic pages set the destructor to the normal compound > + * page dtor. This is needed in case someone takes an additional > + * temporary ref to the page, and freeing is delayed until they drop > + * their reference. > + * > + * For gigantic pages set the destructor to the null dtor. This > + * destructor will never be called. Before freeing the gigantic > + * page destroy_compound_gigantic_folio will turn the folio into a > + * simple group of pages. After this the destructor does not > + * apply. > + * > + */ > + if (hstate_is_gigantic(h)) > + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > + else > + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > +} > + > /* > - * Remove hugetlb folio from lists, and update dtor so that the folio appears > - * as just a compound page. > + * Remove hugetlb folio from lists. > + * If vmemmap exists for the folio, update dtor so that the folio appears > + * as just a compound page. Otherwise, wait until after allocating vmemmap > + * to update dtor. > * > * A reference is held on the folio, except in the case of demote. > * > @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, > } > > /* > - * Very subtle > - * > - * For non-gigantic pages set the destructor to the normal compound > - * page dtor. This is needed in case someone takes an additional > - * temporary ref to the page, and freeing is delayed until they drop > - * their reference. > - * > - * For gigantic pages set the destructor to the null dtor. This > - * destructor will never be called. Before freeing the gigantic > - * page destroy_compound_gigantic_folio will turn the folio into a > - * simple group of pages. After this the destructor does not > - * apply. > - * > - * This handles the case where more than one ref is held when and > - * after update_and_free_hugetlb_folio is called. > - * > - * In the case of demote we do not ref count the page as it will soon > - * be turned into a page of smaller size. > + * We can only clear the hugetlb destructor after allocating vmemmap > + * pages. Otherwise, someone (memory error handling) may try to write > + * to tail struct pages. > + */ > + if (!folio_test_hugetlb_vmemmap_optimized(folio)) > + __clear_hugetlb_destructor(h, folio); > + > + /* > + * In the case of demote we do not ref count the page as it will soon > + * be turned into a page of smaller size. 
> */ > if (!demote) > folio_ref_unfreeze(folio, 1); > - if (hstate_is_gigantic(h)) > - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > - else > - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > h->nr_huge_pages--; > h->nr_huge_pages_node[nid]--; > @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > { > int i; > struct page *subpage; > + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); Can this test on vmemmap_optimized still tell us if we should __clear_hugetlb_destructor? From my reading: 1. If a hugetlb folio is still vmemmap optimized in __remove_hugetlb_folio, __remove_hugetlb_folio won't __clear_hugetlb_destructor. 2. Then hugetlb_vmemmap_restore in dissolve_free_huge_page will clear HPG_vmemmap_optimized if it succeeds. 3. Now when dissolve_free_huge_page gets into __update_and_free_hugetlb_folio, we will see clear_dtor to be false and __clear_hugetlb_destructor won't be called. Or maybe I misunderstood, and what you really want to do is never __clear_hugetlb_destructor so that folio_test_hugetlb is always true? > > if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) > return; > @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > if (unlikely(folio_test_hwpoison(folio))) > folio_clear_hugetlb_hwpoison(folio); > > + /* > + * If vmemmap pages were allocated above, then we need to clear the > + * hugetlb destructor under the hugetlb lock. > + */ > + if (clear_dtor) { > + spin_lock_irq(&hugetlb_lock); > + __clear_hugetlb_destructor(h, folio); > + spin_unlock_irq(&hugetlb_lock); > + } > + > for (i = 0; i < pages_per_huge_page(h); i++) { > subpage = folio_page(folio, i); > subpage->flags &= ~(1 << PG_locked | 1 << PG_error | > -- > 2.41.0 >
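A sketch of the sequence raised in the question above (not kernel code; simplified stand-ins), showing how the clear_dtor snapshot ends up false on the dissolve path and the destructor is never cleared:

/*
 * Model of the dissolve_free_huge_page() ordering: the vmemmap-optimized
 * flag is cleared by the vmemmap restore step before
 * __update_and_free_hugetlb_folio() samples it.
 */
#include <stdbool.h>
#include <stdio.h>

struct folio_model {
	bool hugetlb_dtor;
	bool vmemmap_optimized;
};

int main(void)
{
	struct folio_model f = { .hugetlb_dtor = true, .vmemmap_optimized = true };

	/* __remove_hugetlb_folio(): still vmemmap optimized, so the
	 * destructor is intentionally left in place. */
	if (!f.vmemmap_optimized)
		f.hugetlb_dtor = false;

	/* dissolve_free_huge_page() -> hugetlb_vmemmap_restore():
	 * vmemmap pages are reallocated and the flag is cleared here. */
	f.vmemmap_optimized = false;

	/* __update_and_free_hugetlb_folio(): the snapshot comes too late. */
	bool clear_dtor = f.vmemmap_optimized;	/* false */
	if (clear_dtor)
		f.hugetlb_dtor = false;

	printf("hugetlb_dtor still set: %d (destructor never cleared)\n",
	       f.hugetlb_dtor);
	return 0;
}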
On 06/22/23 17:45, Jiaqi Yan wrote: > On Tue, Jun 20, 2023 at 3:39 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 06/20/23 11:05, Mike Kravetz wrote: > > > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > > > > > Considering this issue as one specific to memory error handling, checking > > > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > > > detect the race. Then, an idea like the below diff (not tested) can make > > > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > > > for complete the allocation of vmemmap pages. > > > > > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > > > int ret = 2; /* fallback to normal page handling */ > > > > bool count_increased = false; > > > > > > > > - if (!folio_test_hugetlb(folio)) > > > > + if (!folio_test_hugetlb(folio)) { > > > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > > > + ret = -EBUSY; > > > > > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > > > the folio->private field. > > > > > > In the case where the folio is a non-hugetlb folio, the folio->private field > > > could be any arbitrary value. As such, the test for vmemmap_optimized may > > > return a false positive. We could end up retrying for an arbitrarily > > > long time. > > > > > > I am looking at how to restructure the code which removes and frees > > > hugetlb pages so that folio_test_hugetlb() would remain true until > > > vmemmap pages are allocated. The easiest way to do this would introduce > > > another hugetlb lock/unlock cycle in the page freeing path. This would > > > undo some of the speedups in the series: > > > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > > > > > > > Perhaps something like this? Minimal testing. > > Thanks for putting up a fix, Mike! > > > > > From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 > > From: Mike Kravetz <mike.kravetz@oracle.com> > > Date: Tue, 20 Jun 2023 14:48:39 -0700 > > Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap > > > > Freeing a hugetlb page and releasing base pages back to the underlying > > allocator such as buddy or cma is performed in two steps: > > - remove_hugetlb_folio() is called to remove the folio from hugetlb > > lists, get a ref on the page and remove hugetlb destructor. This > > all must be done under the hugetlb lock. After this call, the page > > can be treated as a normal compound page or a collection of base > > size pages. > > - update_and_free_hugetlb_folio() is called to allocate vmemmap if > > needed and the free routine of the underlying allocator is called > > on the resulting page. We can not hold the hugetlb lock here. > > > > One issue with this scheme is that a memory error could occur between > > these two steps. In this case, the memory error handling code treats > > the old hugetlb page as a normal compound page or collection of base > > pages. It will then try to SetPageHWPoison(page) on the page with an > > error. If the page with error is a tail page without vmemmap, a write > > error will occur when trying to set the flag. > > > > Address this issue by modifying remove_hugetlb_folio() and > > update_and_free_hugetlb_folio() such that the hugetlb destructor is not > > cleared until after allocating vmemmap. 
Since clearing the destructor > > required holding the hugetlb lock, the clearing is done in > > remove_hugetlb_folio() if the vmemmap is present. This saves a > > lock/unlock cycle. Otherwise, destructor is cleared in > > update_and_free_hugetlb_folio() after allocating vmemmap. > > > > Note that this will leave hugetlb pages in a state where they are marked > > free (by hugetlb specific page flag) and have a ref count. This is not > > a normal state. The only code that would notice is the memory error > > code, and it is set up to retry in such a case. > > > > A subsequent patch will create a routine to do bulk processing of > > vmemmap allocation. This will eliminate a lock/unlock cycle for each > > hugetlb page in the case where we are freeing a bunch of pages. > > > > Fixes: ??? > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > --- > > mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- > > 1 file changed, 51 insertions(+), 24 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index d76574425da3..f7f64470aee0 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, > > unsigned int order) { } > > #endif > > > > +static inline void __clear_hugetlb_destructor(struct hstate *h, > > + struct folio *folio) > > +{ > > + lockdep_assert_held(&hugetlb_lock); > > + > > + /* > > + * Very subtle > > + * > > + * For non-gigantic pages set the destructor to the normal compound > > + * page dtor. This is needed in case someone takes an additional > > + * temporary ref to the page, and freeing is delayed until they drop > > + * their reference. > > + * > > + * For gigantic pages set the destructor to the null dtor. This > > + * destructor will never be called. Before freeing the gigantic > > + * page destroy_compound_gigantic_folio will turn the folio into a > > + * simple group of pages. After this the destructor does not > > + * apply. > > + * > > + */ > > + if (hstate_is_gigantic(h)) > > + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > > + else > > + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > +} > > + > > /* > > - * Remove hugetlb folio from lists, and update dtor so that the folio appears > > - * as just a compound page. > > + * Remove hugetlb folio from lists. > > + * If vmemmap exists for the folio, update dtor so that the folio appears > > + * as just a compound page. Otherwise, wait until after allocating vmemmap > > + * to update dtor. > > * > > * A reference is held on the folio, except in the case of demote. > > * > > @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, > > } > > > > /* > > - * Very subtle > > - * > > - * For non-gigantic pages set the destructor to the normal compound > > - * page dtor. This is needed in case someone takes an additional > > - * temporary ref to the page, and freeing is delayed until they drop > > - * their reference. > > - * > > - * For gigantic pages set the destructor to the null dtor. This > > - * destructor will never be called. Before freeing the gigantic > > - * page destroy_compound_gigantic_folio will turn the folio into a > > - * simple group of pages. After this the destructor does not > > - * apply. > > - * > > - * This handles the case where more than one ref is held when and > > - * after update_and_free_hugetlb_folio is called. 
> > - * > > - * In the case of demote we do not ref count the page as it will soon > > - * be turned into a page of smaller size. > > + * We can only clear the hugetlb destructor after allocating vmemmap > > + * pages. Otherwise, someone (memory error handling) may try to write > > + * to tail struct pages. > > + */ > > + if (!folio_test_hugetlb_vmemmap_optimized(folio)) > > + __clear_hugetlb_destructor(h, folio); > > + > > + /* > > + * In the case of demote we do not ref count the page as it will soon > > + * be turned into a page of smaller size. > > */ > > if (!demote) > > folio_ref_unfreeze(folio, 1); > > - if (hstate_is_gigantic(h)) > > - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > > - else > > - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > > > h->nr_huge_pages--; > > h->nr_huge_pages_node[nid]--; > > @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > > { > > int i; > > struct page *subpage; > > + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); > > Can this test on vmemmap_optimized still tell us if we should > __clear_hugetlb_destructor? From my reading: > 1. If a hugetlb folio is still vmemmap optimized in > __remove_hugetlb_folio, __remove_hugetlb_folio won't > __clear_hugetlb_destructor. > 2. Then hugetlb_vmemmap_restore in dissolve_free_huge_page will clear > HPG_vmemmap_optimized if it succeeds. > 3. Now when dissolve_free_huge_page gets into > __update_and_free_hugetlb_folio, we will see clear_dtor to be false > and __clear_hugetlb_destructor won't be called. Good catch! That is indeed a problem with this patch. > > Or maybe I misunderstood, and what you really want to do is never > __clear_hugetlb_destructor so that folio_test_hugetlb is always true? No, that was a bug with this patch. We could ALWAYS wait until __update_and_free_hugetlb_folio to clear the hugetlb destructor. However, we have to take hugetlb lock to clear it. If the page does not have vmemmap optimized, the we can clear the destructor earlier in __remove_hugetlb_folio and avoid the lock/unlock cycle. In the past, we have had complaints about the time required to allocate and free a large quantity of hugetlb pages. Most of that time is spent in the low level allocators. However, I do not want to add something like an extra lock/unlock cycle unless absolutely necessary. I'll try to think of a cleaner and more fool proof way to address this. IIUC, this is an existing issue. Your patch series does not depend this being fixed.
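A rough sketch of the trade-off described above (not kernel code; the lock and helpers are stand-ins): clearing the destructor late costs one extra hugetlb_lock cycle per freed folio, which the early clear in __remove_hugetlb_folio() avoids whenever the folio is not vmemmap optimized.

/*
 * Model of the lock/unlock cost: the early clear happens under a lock the
 * caller already holds; the late clear needs an extra lock cycle.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long extra_lock_cycles;

static void free_one_folio(bool vmemmap_optimized)
{
	bool dtor_cleared = false;

	/* __remove_hugetlb_folio(): runs under hugetlb_lock anyway. */
	pthread_mutex_lock(&hugetlb_lock);
	if (!vmemmap_optimized)
		dtor_cleared = true;		/* free early clear */
	pthread_mutex_unlock(&hugetlb_lock);

	/* __update_and_free_hugetlb_folio(): lock was dropped to allocate
	 * vmemmap; clearing the dtor now costs an extra lock cycle. */
	if (!dtor_cleared) {
		pthread_mutex_lock(&hugetlb_lock);
		dtor_cleared = true;
		extra_lock_cycles++;
		pthread_mutex_unlock(&hugetlb_lock);
	}
}

int main(void)
{
	for (int i = 0; i < 1024; i++)
		free_one_folio(i % 2);		/* half optimized, half not */
	printf("extra lock/unlock cycles: %lu of 1024 frees\n",
	       extra_lock_cycles);
	return 0;
}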
On Thu, Jun 22, 2023 at 9:19 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/22/23 17:45, Jiaqi Yan wrote: > > On Tue, Jun 20, 2023 at 3:39 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > On 06/20/23 11:05, Mike Kravetz wrote: > > > > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > > > > > > > Considering this issue as one specific to memory error handling, checking > > > > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > > > > detect the race. Then, an idea like the below diff (not tested) can make > > > > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > > > > for complete the allocation of vmemmap pages. > > > > > > > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > > > > int ret = 2; /* fallback to normal page handling */ > > > > > bool count_increased = false; > > > > > > > > > > - if (!folio_test_hugetlb(folio)) > > > > > + if (!folio_test_hugetlb(folio)) { > > > > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > > > > + ret = -EBUSY; > > > > > > > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > > > > the folio->private field. > > > > > > > > In the case where the folio is a non-hugetlb folio, the folio->private field > > > > could be any arbitrary value. As such, the test for vmemmap_optimized may > > > > return a false positive. We could end up retrying for an arbitrarily > > > > long time. > > > > > > > > I am looking at how to restructure the code which removes and frees > > > > hugetlb pages so that folio_test_hugetlb() would remain true until > > > > vmemmap pages are allocated. The easiest way to do this would introduce > > > > another hugetlb lock/unlock cycle in the page freeing path. This would > > > > undo some of the speedups in the series: > > > > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > > > > > > > > > > Perhaps something like this? Minimal testing. > > > > Thanks for putting up a fix, Mike! > > > > > > > > From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 > > > From: Mike Kravetz <mike.kravetz@oracle.com> > > > Date: Tue, 20 Jun 2023 14:48:39 -0700 > > > Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap > > > > > > Freeing a hugetlb page and releasing base pages back to the underlying > > > allocator such as buddy or cma is performed in two steps: > > > - remove_hugetlb_folio() is called to remove the folio from hugetlb > > > lists, get a ref on the page and remove hugetlb destructor. This > > > all must be done under the hugetlb lock. After this call, the page > > > can be treated as a normal compound page or a collection of base > > > size pages. > > > - update_and_free_hugetlb_folio() is called to allocate vmemmap if > > > needed and the free routine of the underlying allocator is called > > > on the resulting page. We can not hold the hugetlb lock here. > > > > > > One issue with this scheme is that a memory error could occur between > > > these two steps. In this case, the memory error handling code treats > > > the old hugetlb page as a normal compound page or collection of base > > > pages. It will then try to SetPageHWPoison(page) on the page with an > > > error. If the page with error is a tail page without vmemmap, a write > > > error will occur when trying to set the flag. 
> > > > > > Address this issue by modifying remove_hugetlb_folio() and > > > update_and_free_hugetlb_folio() such that the hugetlb destructor is not > > > cleared until after allocating vmemmap. Since clearing the destructor > > > required holding the hugetlb lock, the clearing is done in > > > remove_hugetlb_folio() if the vmemmap is present. This saves a > > > lock/unlock cycle. Otherwise, destructor is cleared in > > > update_and_free_hugetlb_folio() after allocating vmemmap. > > > > > > Note that this will leave hugetlb pages in a state where they are marked > > > free (by hugetlb specific page flag) and have a ref count. This is not > > > a normal state. The only code that would notice is the memory error > > > code, and it is set up to retry in such a case. > > > > > > A subsequent patch will create a routine to do bulk processing of > > > vmemmap allocation. This will eliminate a lock/unlock cycle for each > > > hugetlb page in the case where we are freeing a bunch of pages. > > > > > > Fixes: ??? > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > > --- > > > mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- > > > 1 file changed, 51 insertions(+), 24 deletions(-) > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index d76574425da3..f7f64470aee0 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, > > > unsigned int order) { } > > > #endif > > > > > > +static inline void __clear_hugetlb_destructor(struct hstate *h, > > > + struct folio *folio) > > > +{ > > > + lockdep_assert_held(&hugetlb_lock); > > > + > > > + /* > > > + * Very subtle > > > + * > > > + * For non-gigantic pages set the destructor to the normal compound > > > + * page dtor. This is needed in case someone takes an additional > > > + * temporary ref to the page, and freeing is delayed until they drop > > > + * their reference. > > > + * > > > + * For gigantic pages set the destructor to the null dtor. This > > > + * destructor will never be called. Before freeing the gigantic > > > + * page destroy_compound_gigantic_folio will turn the folio into a > > > + * simple group of pages. After this the destructor does not > > > + * apply. > > > + * > > > + */ > > > + if (hstate_is_gigantic(h)) > > > + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > > > + else > > > + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > > +} > > > + > > > /* > > > - * Remove hugetlb folio from lists, and update dtor so that the folio appears > > > - * as just a compound page. > > > + * Remove hugetlb folio from lists. > > > + * If vmemmap exists for the folio, update dtor so that the folio appears > > > + * as just a compound page. Otherwise, wait until after allocating vmemmap > > > + * to update dtor. > > > * > > > * A reference is held on the folio, except in the case of demote. > > > * > > > @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, > > > } > > > > > > /* > > > - * Very subtle > > > - * > > > - * For non-gigantic pages set the destructor to the normal compound > > > - * page dtor. This is needed in case someone takes an additional > > > - * temporary ref to the page, and freeing is delayed until they drop > > > - * their reference. > > > - * > > > - * For gigantic pages set the destructor to the null dtor. This > > > - * destructor will never be called. 
Before freeing the gigantic > > > - * page destroy_compound_gigantic_folio will turn the folio into a > > > - * simple group of pages. After this the destructor does not > > > - * apply. > > > - * > > > - * This handles the case where more than one ref is held when and > > > - * after update_and_free_hugetlb_folio is called. > > > - * > > > - * In the case of demote we do not ref count the page as it will soon > > > - * be turned into a page of smaller size. > > > + * We can only clear the hugetlb destructor after allocating vmemmap > > > + * pages. Otherwise, someone (memory error handling) may try to write > > > + * to tail struct pages. > > > + */ > > > + if (!folio_test_hugetlb_vmemmap_optimized(folio)) > > > + __clear_hugetlb_destructor(h, folio); > > > + > > > + /* > > > + * In the case of demote we do not ref count the page as it will soon > > > + * be turned into a page of smaller size. > > > */ > > > if (!demote) > > > folio_ref_unfreeze(folio, 1); > > > - if (hstate_is_gigantic(h)) > > > - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > > > - else > > > - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > > > > > h->nr_huge_pages--; > > > h->nr_huge_pages_node[nid]--; > > > @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > > > { > > > int i; > > > struct page *subpage; > > > + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); > > > > Can this test on vmemmap_optimized still tell us if we should > > __clear_hugetlb_destructor? From my reading: > > 1. If a hugetlb folio is still vmemmap optimized in > > __remove_hugetlb_folio, __remove_hugetlb_folio won't > > __clear_hugetlb_destructor. > > 2. Then hugetlb_vmemmap_restore in dissolve_free_huge_page will clear > > HPG_vmemmap_optimized if it succeeds. > > 3. Now when dissolve_free_huge_page gets into > > __update_and_free_hugetlb_folio, we will see clear_dtor to be false > > and __clear_hugetlb_destructor won't be called. > > Good catch! That is indeed a problem with this patch. Glad that I could help. > > > > > Or maybe I misunderstood, and what you really want to do is never > > __clear_hugetlb_destructor so that folio_test_hugetlb is always true? > > No, that was a bug with this patch. > > We could ALWAYS wait until __update_and_free_hugetlb_folio to clear the > hugetlb destructor. However, we have to take hugetlb lock to clear it. > If the page does not have vmemmap optimized, the we can clear the > destructor earlier in __remove_hugetlb_folio and avoid the lock/unlock > cycle. In the past, we have had complaints about the time required to > allocate and free a large quantity of hugetlb pages. Most of that time > is spent in the low level allocators. However, I do not want to add > something like an extra lock/unlock cycle unless absolutely necessary. > > I'll try to think of a cleaner and more fool proof way to address this. > > IIUC, this is an existing issue. Your patch series does not depend > this being fixed. Thanks Mike, I was about to send out V2 today. > -- > Mike Kravetz > > > > > > > > > if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) > > > return; > > > @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > > > if (unlikely(folio_test_hwpoison(folio))) > > > folio_clear_hugetlb_hwpoison(folio); > > > > > > + /* > > > + * If vmemmap pages were allocated above, then we need to clear the > > > + * hugetlb destructor under the hugetlb lock. 
> > > + */ > > > + if (clear_dtor) { > > > + spin_lock_irq(&hugetlb_lock); > > > + __clear_hugetlb_destructor(h, folio); > > > + spin_unlock_irq(&hugetlb_lock); > > > + } > > > + > > > for (i = 0; i < pages_per_huge_page(h); i++) { > > > subpage = folio_page(folio, i); > > > subpage->flags &= ~(1 << PG_locked | 1 << PG_error | > > > -- > > > 2.41.0 > > >
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..f191a4119719 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3683,6 +3683,29 @@ enum mf_action_page_type {
  */
 extern const struct attribute_group memory_failure_attr_group;
 
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Struct raw_hwp_page represents information about "raw error page",
+ * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
+ */
+struct raw_hwp_page {
+	struct llist_node node;
+	struct page *page;
+};
+
+static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+{
+	return (struct llist_head *)&folio->_hugetlb_hwpoison;
+}
+
+/*
+ * Given @subpage, a raw page in a hugepage, find its location in @folio's
+ * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list.
+ */
+struct raw_hwp_page *find_raw_hwp_page(struct folio *folio,
+				       struct page *subpage);
+#endif
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5b663eca1f29..c49e6c2d1f07 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
 #endif /* CONFIG_FS_DAX */
 
 #ifdef CONFIG_HUGETLB_PAGE
-/*
- * Struct raw_hwp_page represents information about "raw error page",
- * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
- */
-struct raw_hwp_page {
-	struct llist_node node;
-	struct page *page;
-};
 
-static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+struct raw_hwp_page *find_raw_hwp_page(struct folio *folio,
+				       struct page *subpage)
 {
-	return (struct llist_head *)&folio->_hugetlb_hwpoison;
+	struct llist_node *t, *tnode;
+	struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
+	struct raw_hwp_page *hwp_page = NULL;
+	struct raw_hwp_page *p;
+
+	llist_for_each_safe(tnode, t, raw_hwp_head->first) {
+		p = container_of(tnode, struct raw_hwp_page, node);
+		if (subpage == p->page) {
+			hwp_page = p;
+			break;
+		}
+	}
+
+	return hwp_page;
+}
 
 static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
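For illustration, a user-space model of what find_raw_hwp_page() does (not kernel code; the llist, container_of, and struct page below are simplified stand-ins): walk the raw_hwp_page list attached to the folio and return the entry that records the given subpage, or NULL if it was never recorded.

/*
 * User-space model of the raw HWPOISON lookup: a singly linked list of
 * raw_hwp_page entries is searched for one whose ->page matches.
 */
#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct llist_node { struct llist_node *next; };
struct page { unsigned long pfn; };

struct raw_hwp_page {
	struct llist_node node;
	struct page *page;
};

static struct raw_hwp_page *find_raw_hwp_page(struct llist_node *head,
					       struct page *subpage)
{
	for (struct llist_node *n = head; n; n = n->next) {
		struct raw_hwp_page *p = container_of(n, struct raw_hwp_page, node);

		if (p->page == subpage)
			return p;
	}
	return NULL;
}

int main(void)
{
	struct page pages[3] = { { .pfn = 100 }, { .pfn = 101 }, { .pfn = 102 } };
	struct raw_hwp_page a = { .page = &pages[0] };
	struct raw_hwp_page b = { .page = &pages[2] };

	/* Two recorded raw error pages, newest first as with llist_add(). */
	b.node.next = NULL;
	a.node.next = &b.node;

	printf("pfn 102 %s recorded as raw HWPOISON\n",
	       find_raw_hwp_page(&a.node, &pages[2]) ? "is" : "is not");
	printf("pfn 101 %s recorded as raw HWPOISON\n",
	       find_raw_hwp_page(&a.node, &pages[1]) ? "is" : "is not");
	return 0;
}

In the kernel the list head lives in folio->_hugetlb_hwpoison and entries are added with llist_add() when a raw error page is recorded, so the walk above mirrors the lookup the patch exposes through the new declaration in mm.h.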