[2/2] fs: hugetlbfs: Fix logic to skip allocation on hit in page cache

Message ID f15f57def8e69cf288f0646f819b784fe15fabe2.1683069252.git.ackerleytng@google.com
State New
Headers
Series Fix fallocate error in hugetlbfs when fallocating again |

Commit Message

Ackerley Tng May 2, 2023, 11:56 p.m. UTC
  When fallocate() is called twice on the same offset in the file, the
second fallocate() should succeed.

page_cache_next_miss() always advances index before returning, so even
on a page cache hit, the check would set present to false.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
  

Comments

Mike Kravetz May 3, 2023, 3:05 a.m. UTC | #1
On 05/02/23 23:56, Ackerley Tng wrote:
> When fallocate() is called twice on the same offset in the file, the
> second fallocate() should succeed.
> 
> page_cache_next_miss() always advances index before returning, so even
> on a page cache hit, the check would set present to false.

Thank you Ackerley for finding this!

When I read the description of page_cache_next_miss(), I assumed

	present = page_cache_next_miss(mapping, index, 1) != index;

would tell us if there was a page at index in the cache.

However, when looking closer at the code it does not check for a page
at index, but rather starts looking at index+1.  Perhaps that is why
it is named next?

Matthew, I think the use of the above statement was your suggestion.
And you know the xarray code better than anyone.  I just want to make
sure page_cache_next_miss is operating as designed/expected.  If so,
then the changes suggested here make sense.

In addition, the same code is in hugetlbfs_pagecache_present and will
have this same issue.
  
Mike Kravetz May 4, 2023, 12:14 a.m. UTC | #2
On 05/02/23 20:05, Mike Kravetz wrote:
> On 05/02/23 23:56, Ackerley Tng wrote:
> > When fallocate() is called twice on the same offset in the file, the
> > second fallocate() should succeed.
> > 
> > page_cache_next_miss() always advances index before returning, so even
> > on a page cache hit, the check would set present to false.
> 
> Thank you Ackerley for finding this!
> 
> When I read the description of page_cache_next_miss(), I assumed
> 
> 	present = page_cache_next_miss(mapping, index, 1) != index;
> 
> would tell us if there was a page at index in the cache.
> 
> However, when looking closer at the code it does not check for a page
> at index, but rather starts looking at index+1.  Perhaps that is why
> it is named next?
> 
> Matthew, I think the use of the above statement was your suggestion.
> And you know the xarray code better than anyone.  I just want to make
> sure page_cache_next_miss is operating as designed/expected.  If so,
> then the changes suggested here make sense.

I took a closer look at the code today.

page_cache_next_miss has a 'special case' for index 0.  The function
description says:

 * Return: The index of the gap if found, otherwise an index outside the
 * range specified (in which case 'return - index >= max_scan' will be true).
 * In the rare case of index wrap-around, 0 will be returned.

And, the loop in the routine does:

	while (max_scan--) {
		void *entry = xas_next(&xas);
		if (!entry || xa_is_value(entry))
			break;
		if (xas.xa_index == 0)
			break;
	}

At first glance, I thought xas_next always went to the next entry but
now see that is not the case here because this is a new state with
xa_node = XAS_RESTART.  So, xas_next is effectively a xas_load.

This means in the case were index == 0,

	page_cache_next_miss(mapping, index, 1)

will ALWAYS return zero even if a page is present.

I need to look at the xarray code and this rare index wrap-around case
to see if we can somehow modify that check for xas.xa_index == 0 in
page_cache_next_miss.
  
Ackerley Tng May 8, 2023, 4:29 p.m. UTC | #3
I wrote a selftest to verify the behavior of page_cache_next_miss() and
indeed you are right about the special case.

I'll continue to look into this to figure out why it isn't hitting in
the page cache.

Thanks Mike!
  
Ackerley Tng May 8, 2023, 8:40 p.m. UTC | #4
I figured it out, the bug is still in the use of page_cache_next_miss(),
but my earlier suggestion that xas_next always advances the pointer is
wrong.

Mike is right in that xas_next() is effectively xas_load() for the first
call after an XA_STATE() initialiation.

However, when max_scan is set to 1, xas.xa_index has no chance to be
advanced since the loop condition while(max_scan--) terminates the loop
after 1 iteration.

Hence, after loading happens with xas_next() regardless of the checks
within the loop (!entry, or xa_is_value(entry), etc), xa.xas_index is
not advanced, and the index returned always == the index passed in to
page_cache_next_miss().

Hence, in hugetlb_fallocate(), it always appears to be a page cache
miss.

Here's code from a selftest that can be added to lib/test_xarray.c:

/* Modified from page_cache_next_miss() */
static unsigned long xa_next_empty(struct xarray *xa, unsigned long start,  
unsigned long max_scan)
{
	XA_STATE(xas, xa, start);

	while (max_scan--) {
		void *entry = xas_next(&xas);
		if (!entry) {
			printk("entry not present");
			break;
		}
		if (xa_is_value(entry)) {
			printk("xa_is_value instead of pointer");
		}
		if (xas.xa_index == 0) {
			printk("wraparound");
			break;
		}
	}

	if (max_scan == -1)
		printk("exceeded max_scan");

	return xas.xa_index;
}

/* Replace this function in lib/test_xarray.c to run */
static noinline void check_find(struct xarray *xa)
{
	unsigned long max_scan;

	xa_init(&xa);
	xa_store_range(&xa, 3, 5, malloc(10), GFP_KERNEL);

	max_scan = 1;
	for (int i = 1; i < 8; i++)
		printk(" => xa_next_empty(xa, %d, %ld): %ld\n", i, max_scan,  
xa_next_empty(&xa, i, max_scan));

	printk("\n");

	max_scan = 2;
	for (int i = 1; i < 8; i++)
		printk(" => xa_next_empty(xa, %d, %ld): %ld\n", i, max_scan,  
xa_next_empty(&xa, i, max_scan));
}

Result:

entry not present => xa_next_empty(xa, 1, 1): 1
entry not present => xa_next_empty(xa, 2, 1): 2
exceeded max_scan => xa_next_empty(xa, 3, 1): 3
exceeded max_scan => xa_next_empty(xa, 4, 1): 4
exceeded max_scan => xa_next_empty(xa, 5, 1): 5
entry not present => xa_next_empty(xa, 6, 1): 6
entry not present => xa_next_empty(xa, 7, 1): 7

entry not present => xa_next_empty(xa, 1, 2): 1
entry not present => xa_next_empty(xa, 2, 2): 2
exceeded max_scan => xa_next_empty(xa, 3, 2): 4
exceeded max_scan => xa_next_empty(xa, 4, 2): 5
exceeded max_scan => xa_next_empty(xa, 5, 2): 6
entry not present => xa_next_empty(xa, 6, 2): 6
entry not present => xa_next_empty(xa, 7, 2): 7

Since the xarray was set up with pointers in indices 3, 4 and 5, we
expect xa_next_empty() or page_cache_next_miss() to return the next
index (4, 5 and 6 respectively), but when used with a max_scan of 1, we
just get the index passed in.

While max_scan could be increased to fix this bug, I feel that having a
separate function like filemap_has_folio() makes the intent more
explicit and is less reliant on internal values of struct xa_state.

xas.xa_index could take other values to indicate wraparound or sibling
nodes and I think it is better to use a higher level abstraction like
xa_load() (used in filemap_has_folio()). In addition xa_load() also
takes the locks it needs, which helps :).

I could refactor the other usage of page_cache_next_miss() in
hugetlbfs_pagecache_present() if you'd like!

On this note, the docstring of page_cache_next_miss() is also
inaccurate. The return value is not always outside the range specified
if max_scan is too small.
  

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ecfdfb2529a3..f640cff1bbce 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -821,7 +821,6 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		 */
 		struct folio *folio;
 		unsigned long addr;
-		bool present;
 
 		cond_resched();
 
@@ -845,10 +844,7 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		/* See if already present in mapping to avoid alloc/free */
-		rcu_read_lock();
-		present = page_cache_next_miss(mapping, index, 1) != index;
-		rcu_read_unlock();
-		if (present) {
+		if (filemap_has_folio(mapping, index)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			hugetlb_drop_vma_policy(&pseudo_vma);
 			continue;