Message ID | f15f57def8e69cf288f0646f819b784fe15fabe2.1683069252.git.ackerleytng@google.com |
---|---|
State | New |
Headers |
Date: Tue, 2 May 2023 23:56:03 +0000
In-Reply-To: <cover.1683069252.git.ackerleytng@google.com>
References: <cover.1683069252.git.ackerleytng@google.com>
Message-ID: <f15f57def8e69cf288f0646f819b784fe15fabe2.1683069252.git.ackerleytng@google.com>
Subject: [PATCH 2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache
From: Ackerley Tng <ackerleytng@google.com>
To: akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mike.kravetz@oracle.com, muchun.song@linux.dev, willy@infradead.org, sidhartha.kumar@oracle.com, jhubbard@nvidia.com
Cc: vannapurve@google.com, erdemaktas@google.com, Ackerley Tng <ackerleytng@google.com> |
Series | Fix fallocate error in hugetlbfs when fallocating again |
Commit Message
Ackerley Tng
May 2, 2023, 11:56 p.m. UTC
When fallocate() is called twice on the same offset in the file, the
second fallocate() should succeed.
page_cache_next_miss() always advances index before returning, so even
on a page cache hit, the check would set present to false.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
fs/hugetlbfs/inode.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
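As an illustration of the failure mode described in the commit message, a minimal userspace reproducer sketch (not part of the original posting) could look like the following. The /mnt/huge mount point and the 2 MB huge page size are assumptions about the test setup; the exact errno of the failing second call is also not stated in this message, only that the second fallocate() reported an error.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define HPAGE_SIZE (2UL << 20)	/* assumed 2 MB huge pages */

    int main(void)
    {
    	/* Assumes a hugetlbfs mount at /mnt/huge. */
    	int fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0600);

    	if (fd < 0) {
    		perror("open");
    		return 1;
    	}

    	/* First call allocates a huge page for [0, HPAGE_SIZE). */
    	if (fallocate(fd, 0, 0, HPAGE_SIZE))
    		perror("first fallocate");

    	/*
    	 * Second call over the same range should find the page already
    	 * in the page cache and succeed as a no-op; before this fix the
    	 * hit was not detected and the call could fail.
    	 */
    	if (fallocate(fd, 0, 0, HPAGE_SIZE))
    		perror("second fallocate");

    	close(fd);
    	return 0;
    }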
Comments
On 05/02/23 23:56, Ackerley Tng wrote:
> When fallocate() is called twice on the same offset in the file, the
> second fallocate() should succeed.
>
> page_cache_next_miss() always advances index before returning, so even
> on a page cache hit, the check would set present to false.

Thank you Ackerley for finding this!

When I read the description of page_cache_next_miss(), I assumed

	present = page_cache_next_miss(mapping, index, 1) != index;

would tell us if there was a page at index in the cache.

However, when looking closer at the code it does not check for a page
at index, but rather starts looking at index+1.  Perhaps that is why
it is named next?

Matthew, I think the use of the above statement was your suggestion.
And you know the xarray code better than anyone.  I just want to make
sure page_cache_next_miss is operating as designed/expected.  If so,
then the changes suggested here make sense.

In addition, the same code is in hugetlbfs_pagecache_present and will
have this same issue.
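For reference, the check Mike mentions in hugetlbfs_pagecache_present() follows the same pattern as the fallocate path. The sketch below is paraphrased from mm/hugetlb.c of that era rather than quoted from this thread, so names and details may differ slightly from the actual tree:

    /* Paraphrased sketch of the analogous check in mm/hugetlb.c. */
    static bool hugetlbfs_pagecache_present(struct hstate *h,
    					struct vm_area_struct *vma,
    					unsigned long address)
    {
    	struct address_space *mapping = vma->vm_file->f_mapping;
    	pgoff_t idx = vma_hugecache_offset(h, vma, address);
    	bool present;

    	rcu_read_lock();
    	/*
    	 * Same pattern as hugetlbfs_fallocate(): a hit at idx is missed
    	 * because page_cache_next_miss() with max_scan == 1 never
    	 * reports an index other than idx itself.
    	 */
    	present = page_cache_next_miss(mapping, idx, 1) != idx;
    	rcu_read_unlock();

    	return present;
    }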
On 05/02/23 20:05, Mike Kravetz wrote:
> On 05/02/23 23:56, Ackerley Tng wrote:
> > When fallocate() is called twice on the same offset in the file, the
> > second fallocate() should succeed.
> >
> > page_cache_next_miss() always advances index before returning, so even
> > on a page cache hit, the check would set present to false.
>
> Thank you Ackerley for finding this!
>
> When I read the description of page_cache_next_miss(), I assumed
>
> 	present = page_cache_next_miss(mapping, index, 1) != index;
>
> would tell us if there was a page at index in the cache.
>
> However, when looking closer at the code it does not check for a page
> at index, but rather starts looking at index+1.  Perhaps that is why
> it is named next?
>
> Matthew, I think the use of the above statement was your suggestion.
> And you know the xarray code better than anyone.  I just want to make
> sure page_cache_next_miss is operating as designed/expected.  If so,
> then the changes suggested here make sense.

I took a closer look at the code today.  page_cache_next_miss has a
'special case' for index 0.  The function description says:

 * Return: The index of the gap if found, otherwise an index outside the
 * range specified (in which case 'return - index >= max_scan' will be true).
 * In the rare case of index wrap-around, 0 will be returned.

And, the loop in the routine does:

	while (max_scan--) {
		void *entry = xas_next(&xas);
		if (!entry || xa_is_value(entry))
			break;
		if (xas.xa_index == 0)
			break;
	}

At first glance, I thought xas_next always went to the next entry but
now see that is not the case here because this is a new state with
xa_node = XAS_RESTART.  So, xas_next is effectively a xas_load.

This means in the case where index == 0,
page_cache_next_miss(mapping, index, 1) will ALWAYS return zero even if
a page is present.

I need to look at the xarray code and this rare index wrap-around case
to see if we can somehow modify that check for xas.xa_index == 0 in
page_cache_next_miss.
I wrote a selftest to verify the behavior of page_cache_next_miss() and indeed you are right about the special case. I'll continue to look into this to figure out why it isn't hitting in the page cache. Thanks Mike!
I figured it out.  The bug is still in the use of page_cache_next_miss(),
but my earlier suggestion that xas_next always advances the pointer is
wrong.

Mike is right in that xas_next() is effectively xas_load() for the first
call after an XA_STATE() initialization.

However, when max_scan is set to 1, xas.xa_index has no chance to be
advanced, since the loop condition while (max_scan--) terminates the
loop after 1 iteration.

Hence, after the load happens with xas_next(), regardless of the checks
within the loop (!entry, xa_is_value(entry), etc.), xas.xa_index is not
advanced, and the index returned is always == the index passed in to
page_cache_next_miss().

Hence, in hugetlbfs_fallocate(), it always appears to be a page cache
miss.

Here's code from a selftest that can be added to lib/test_xarray.c
(lightly cleaned up here: the xarray pointer parameter is used directly
and the stored pointer comes from kmalloc() rather than malloc()):

	/* Modified from page_cache_next_miss() */
	static unsigned long xa_next_empty(struct xarray *xa, unsigned long start,
					   unsigned long max_scan)
	{
		XA_STATE(xas, xa, start);

		while (max_scan--) {
			void *entry = xas_next(&xas);

			if (!entry) {
				printk("entry not present");
				break;
			}
			if (xa_is_value(entry))
				printk("xa_is_value instead of pointer");
			if (xas.xa_index == 0) {
				printk("wraparound");
				break;
			}
		}

		if (max_scan == -1)
			printk("exceeded max_scan");

		return xas.xa_index;
	}

	/* Replace this function in lib/test_xarray.c to run */
	static noinline void check_find(struct xarray *xa)
	{
		unsigned long max_scan;
		int i;

		xa_init(xa);
		xa_store_range(xa, 3, 5, kmalloc(10, GFP_KERNEL), GFP_KERNEL);

		max_scan = 1;
		for (i = 1; i < 8; i++)
			printk(" => xa_next_empty(xa, %d, %ld): %ld\n",
			       i, max_scan, xa_next_empty(xa, i, max_scan));
		printk("\n");

		max_scan = 2;
		for (i = 1; i < 8; i++)
			printk(" => xa_next_empty(xa, %d, %ld): %ld\n",
			       i, max_scan, xa_next_empty(xa, i, max_scan));
	}

Result:

	entry not present => xa_next_empty(xa, 1, 1): 1
	entry not present => xa_next_empty(xa, 2, 1): 2
	exceeded max_scan => xa_next_empty(xa, 3, 1): 3
	exceeded max_scan => xa_next_empty(xa, 4, 1): 4
	exceeded max_scan => xa_next_empty(xa, 5, 1): 5
	entry not present => xa_next_empty(xa, 6, 1): 6
	entry not present => xa_next_empty(xa, 7, 1): 7

	entry not present => xa_next_empty(xa, 1, 2): 1
	entry not present => xa_next_empty(xa, 2, 2): 2
	exceeded max_scan => xa_next_empty(xa, 3, 2): 4
	exceeded max_scan => xa_next_empty(xa, 4, 2): 5
	exceeded max_scan => xa_next_empty(xa, 5, 2): 6
	entry not present => xa_next_empty(xa, 6, 2): 6
	entry not present => xa_next_empty(xa, 7, 2): 7

Since the xarray was set up with pointers in indices 3, 4 and 5, we
expect xa_next_empty() or page_cache_next_miss() to return the next
index (4, 5 and 6 respectively), but when used with a max_scan of 1, we
just get the index passed in.

While max_scan could be increased to fix this bug, I feel that having a
separate function like filemap_has_folio() makes the intent more
explicit and is less reliant on internal values of struct xa_state.
xas.xa_index could take other values to indicate wraparound or sibling
nodes, and I think it is better to use a higher level abstraction like
xa_load() (used in filemap_has_folio()).  In addition, xa_load() also
takes the locks it needs, which helps :).

I could refactor the other usage of page_cache_next_miss() in
hugetlbfs_pagecache_present() if you'd like!

On this note, the docstring of page_cache_next_miss() is also
inaccurate.  The return value is not always outside the range specified
if max_scan is too small.
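The filemap_has_folio() helper referenced above is introduced by patch 1/2 of this series, which is not part of this message. A rough sketch of what such an xa_load()-based helper could look like is given below; this is an assumption for illustration, not the actual patch, and the real definition may differ:

	/*
	 * Sketch of a filemap_has_folio()-style helper as discussed above.
	 * xa_load() handles the RCU locking internally, which is part of
	 * the appeal over open-coding page_cache_next_miss().
	 */
	static inline bool filemap_has_folio(struct address_space *mapping,
					     pgoff_t index)
	{
		void *entry = xa_load(&mapping->i_pages, index);

		/* Value (shadow) entries are not folios, so don't count them. */
		return entry && !xa_is_value(entry);
	}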
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ecfdfb2529a3..f640cff1bbce 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -821,7 +821,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 */
 		struct folio *folio;
 		unsigned long addr;
-		bool present;
 
 		cond_resched();
 
@@ -845,10 +844,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		/* See if already present in mapping to avoid alloc/free */
-		rcu_read_lock();
-		present = page_cache_next_miss(mapping, index, 1) != index;
-		rcu_read_unlock();
-		if (present) {
+		if (filemap_has_folio(mapping, index)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			hugetlb_drop_vma_policy(&pseudo_vma);
 			continue;