[v3] mm, netfs, fscache: Stop read optimisation when folio removed from pagecache
Commit Message
Hi Linus,
Is the attached patch too heavy to be applied this late in the merge cycle?
Or would you prefer it to wait for the merge window?
Thanks,
David
---
Fscache has an optimisation by which reads from the cache are skipped until
we know that (a) there's data there to be read and (b) that data isn't
entirely covered by pages resident in the netfs pagecache. This is done
with two flags manipulated by fscache_note_page_release():
	if (...
	    test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) &&
	    test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags))
		clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to indicate
that netfslib should download from the server or clear the page instead.
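For reference, the read-side check amounts to something like this (a sketch
of the cachefiles_prepare_read() logic, not the verbatim code):

	/* Sketch: skip the cache lookup when we know there's nothing to read */
	if (test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags))
		return NETFS_DOWNLOAD_FROM_SERVER;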
The fscache_note_page_release() function is intended to be called from
->releasepage() - but that only gets called if PG_private or PG_private_2
is set - and currently the former is at the discretion of the network
filesystem and the latter is only set whilst a page is being written to the
cache, so sometimes we miss clearing the optimisation.
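To illustrate the path that gets missed, a netfs notes the release roughly
like this (a sketch modelled on the afs implementation, not verbatim):

	static bool afs_release_folio(struct folio *folio, gfp_t gfp)
	{
		struct afs_vnode *vnode = AFS_FS_I(folio_inode(folio));

		/* Only ever reached if PG_private or PG_private_2 was set */
		fscache_note_page_release(afs_vnode_cache(vnode));
		folio_detach_private(folio);
		return true;
	}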
Fix this by following Willy's suggestion[1] and adding an address_space
flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call
->release_folio() if it's set, even if PG_private or PG_private_2 aren't
set.
Note that this would require folio_test_private() and page_has_private() to
become more complicated. To avoid that, in the places[*] where these are
used to conditionalise calls to filemap_release_folio() and
try_to_release_page(), the tests are removed, those functions are now
called unconditionally and the test is performed inside them instead.
[*] There are some exceptions in vmscan.c where the check guards more than
just a call to the releaser. I've added a function, folio_needs_release(),
to wrap all the checks for that.
AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from
fscache and cleared in ->evict_inode() before truncate_inode_pages_final()
is called.
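The pattern each filesystem follows (spelled out in full in the diff below)
is thus approximately this - the field names here are schematic:

	/* On cookie acquisition */
	if (inode->netfs.cache)
		mapping_set_release_always(inode->i_mapping);

	/* In ->evict_inode(), before truncate_inode_pages_final() */
	mapping_clear_release_always(inode->i_mapping);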
Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared
and the optimisation cancelled if a cachefiles object already contains data
when we open it.
Changes:
========
ver #3)
- Fixed mapping_clear_release_always() to use clear_bit() not set_bit().
- Moved a '&&' to the correct line.
ver #2)
- Rewrote entirely according to Willy's suggestion[1].
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Dave Wysochanski <dwysocha@redhat.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-afs@lists.infradead.org
cc: v9fs-developer@lists.sourceforge.net
cc: ceph-devel@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/Yk9V/03wgdYi65Lb@casper.infradead.org/ [1]
Link: https://lore.kernel.org/r/164928630577.457102.8519251179327601178.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166844174069.1124521.10890506360974169994.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166869495238.3720468.4878151409085146764.stgit@warthog.procyon.org.uk/ # v3
---
fs/9p/cache.c | 2 ++
fs/9p/vfs_inode.c | 1 +
fs/afs/inode.c | 1 +
fs/afs/internal.h | 2 ++
fs/cachefiles/namei.c | 2 ++
fs/ceph/cache.c | 2 ++
fs/ceph/inode.c | 1 +
fs/cifs/cifsfs.c | 1 +
fs/cifs/fscache.c | 2 ++
fs/splice.c | 3 +--
include/linux/pagemap.h | 16 ++++++++++++++++
mm/filemap.c | 4 ++++
mm/huge_memory.c | 3 +--
mm/khugepaged.c | 3 +--
mm/memory-failure.c | 3 +--
mm/migrate.c | 3 +--
mm/truncate.c | 6 ++----
mm/vmscan.c | 15 +++++++++++----
18 files changed, 52 insertions(+), 18 deletions(-)
Comments
On Wed, Nov 23, 2022 at 5:02 AM David Howells <dhowells@redhat.com> wrote:
>
> Is the attached patch too heavy to be applied this late in the merge cycle?
> Or would you prefer it to wait for the merge window?
This patch is much too much for this point in the release.
But I also think it's strange in another way, with that odd placement of
mapping_clear_release_always(inode->i_mapping);
at inode eviction time. That just feels very random.
Similarly, that change to shrink_folio_list() looks strange, with the
nasty folio_needs_release() helper. It seems entirely pointless, with
the use then being
	if (folio_needs_release(folio)) {
		if (!filemap_release_folio(folio, sc->gfp_mask))
			goto activate_locked;
when everybody else is just using filemap_release_folio() and checking
its return value. I like how you changed other cases of
	if (folio_has_private(folio) && !filemap_release_folio(folio, 0))
		return 0;
to just use "!filemap_release_folio()" directly, and that felt like a
cleanup, but the shrink_folio_list() changes look like a step
backwards.
And the change to mm/filemap.c is completely unacceptable in all
forms, and this added test
+ if ((!mapping || !mapping_release_always(mapping)) &&
+ !folio_test_private(folio) &&
+ !folio_test_private_2(folio))
+ return true;
will not be accepted even during the merge window. That code makes no
sense what-so-ever, and is in no way acceptable.
That code makes no sense what-so-ever. Why isn't it using
"folio_has_private()"? Why is it using its own illegible version of
that, one that doesn't match any other case? Why is this done as an
open-coded - and *badly* so - version of !folio_needs_release() that
you for some reason made private to mm/vmscan.c?
So no, this patch is too ugly to apply as-is *ever*, much less during
the late rc series.
Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> But I also think it's strange in another way, with that odd placement of
>
> mapping_clear_release_always(inode->i_mapping);
>
> at inode eviction time. That just feels very random.
I was under the impression that a warning got splashed if unexpected
address_space flags were set when ->evict_inode() returned. I may be thinking
of page flags. If it doesn't, fine, this isn't required.
> Similarly, that change to shrink_folio_list() looks strange, with the
> nasty folio_needs_release() helper. It seems entirely pointless, with
> the use then being
>
> if (folio_needs_release(folio)) {
> if (!filemap_release_folio(folio, sc->gfp_mask))
> goto activate_locked;
Unfortunately, that can't simply be folded down. It actually does something
extra in the case where folio_has_private() was set and
filemap_release_folio() succeeds, but there was no mapping:
	 * Rarely, folios can have buffers and no ->mapping.
	 * These are the folios which were not successfully
	 * invalidated in truncate_cleanup_folio(). We try to
	 * drop those buffers here and if that worked, and the
	 * folio is no longer mapped into process address space
	 * (refcount == 1) it can be freed. Otherwise, leave
	 * the folio on the LRU so it is swappable.
Possibly I could split the if-statement and make it two separate cases:
	/*
	 * If the folio has buffers, try to free the buffer
	 * mappings associated with this folio. If we succeed
	 * we try to free the folio as well.
	 *
	 * We do this even if the folio is dirty.
	 * filemap_release_folio() does not perform I/O, but it
	 * is possible for a folio to have the dirty flag set,
	 * but it is actually clean (all its buffers are clean).
	 * This happens if the buffers were written out directly,
	 * with submit_bh(). ext3 will do this, as well as
	 * the blockdev mapping. filemap_release_folio() will
	 * discover that cleanness and will drop the buffers
	 * and mark the folio clean - it can be freed.
	 */
	if (!filemap_release_folio(folio, sc->gfp_mask))
		goto activate_locked;
filemap_release_folio() will return true if folio_has_private() is false,
which would allow us to reach the next part - which we would then skip,
since folio_has_private() would be false there too:
	/*
	 * Rarely, folios can have buffers and no ->mapping.
	 * These are the folios which were not successfully
	 * invalidated in truncate_cleanup_folio(). We try to
	 * drop those buffers here and if that worked, and the
	 * folio is no longer mapped into process address space
	 * (refcount == 1) it can be freed. Otherwise, leave
	 * the folio on the LRU so it is swappable.
	 */
	if (!mapping && folio_has_private(folio) &&
	    folio_ref_count(folio) == 1) {
		folio_unlock(folio);
		if (folio_put_testzero(folio))
			goto free_it;
		/*
		 * rare race with speculative reference.
		 * the speculative reference will free
		 * this folio shortly, so we may
		 * increment nr_reclaimed here (and
		 * leave it off the LRU).
		 */
		nr_reclaimed += nr_pages;
		continue;
	}
But that will malfunction if try_to_free_buffers(), as called from
filemap_release_folio(), manages to clear the private bits. I wonder if it
might be possible to fold this bit into filemap_release_folio() somehow.
I really need a three-state return from filemap_release_folio() - maybe:
	0 - couldn't release
	1 - released
	2 - there was no private
The ordinary "if (filemap_release_folio()) { ... }" would work as expected.
shrink_folio_list() could do something different between case 1 and case 2.
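Something along these lines, say (the names here are invented purely for
illustration):

	/* Hypothetical tristate; 0 stays false so boolean callers still work */
	enum filemap_release_result {
		FILEMAP_CANT_RELEASE	= 0,	/* couldn't release */
		FILEMAP_RELEASED	= 1,	/* private data was released */
		FILEMAP_NO_PRIVATE	= 2,	/* there was no private */
	};

	/* shrink_folio_list() could then tell the two success cases apart */
	switch (filemap_release_folio(folio, sc->gfp_mask)) {
	case FILEMAP_CANT_RELEASE:
		goto activate_locked;
	case FILEMAP_RELEASED:
		break;	/* buffers dropped; try to free the folio too */
	case FILEMAP_NO_PRIVATE:
		break;	/* nothing attached; skip the buffer-freeing path */
	}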
> And the change to mm/filemap.c is completely unacceptable in all
> forms, and this added test
>
> + if ((!mapping || !mapping_release_always(mapping)) &&
> + !folio_test_private(folio) &&
> + !folio_test_private_2(folio))
> + return true;
>
> will not be accepted even during the merge window. That code makes no
> sense what-so-ever, and is in no way acceptable.
>
> That code makes no sense what-so-ever. Why isn't it using
> "folio_has_private()"?
It should be, yes.
> Why is this done as an open-coded - and *badly* so - version of
> !folio_needs_release() that you for some reason made private to mm/vmscan.c?
Yeah, in retrospect, I should have put that in mm/internal.h.
David
On Wed, Nov 23, 2022 at 12:03 PM David Howells <dhowells@redhat.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > But I also think it's strange in another way, with that odd placement of
> >
> > mapping_clear_release_always(inode->i_mapping);
> >
> > at inode eviction time. That just feels very random.
>
> I was under the impression that a warning got splashed if unexpected
> address_space flags were set when ->evict_inode() returned. I may be thinking
> of page flags. If it doesn't, fine, this isn't required.
I don't know if the warning happens or not, but the thing I reacted to
was just how *random* this was. There was no logic to it, nor any
explanation.
I *suspect* that if we add this kind of new generic address space
flag, then that flag should just be cleared by generic code when the
address space is released.
But I'm not saying it has to be done that way - I'm just saying that
however it is done, please don't make it this random mess with no
explanation.
The *setting* of the flag was at least fairly obvious. I didn't find
code like this odd:
+ if (v9inode->netfs.cache)
+ mapping_set_release_always(inode->i_mapping);
and it makes all kinds of sense (ie I can read it as a "if I use netfs
caching for this inode, then I want to be informed when a folio is
released from this mapping").
It's just the clearing that looked very random to me.
Maybe just a comment would have helped, but I get the feeling that it
might as well just be cleared in "clear_inode()" or something like
that.
> > That code makes no sense what-so-ever. Why isn't it using
> > "folio_has_private()"?
>
> It should be, yes.
>
> > Why is this done as an open-coded - and *badly* so - version of
> > !folio_needs_release() that you for some reason made private to mm/vmscan.c?
>
> Yeah, in retrospect, I should have put that in mm/internal.h.
So if folio_needs_release() is in mm/internal.h, and then mm/filemap.c
uses it in filemap_release_folio() instead of the odd open-coding, I
think that would clear up my worries about both mm/filemap.c and
mm/vmscan.c.
Linus
@@ -68,6 +68,8 @@ void v9fs_cache_inode_get_cookie(struct inode *inode)
&path, sizeof(path),
&version, sizeof(version),
i_size_read(&v9inode->netfs.inode));
+ if (v9inode->netfs.cache)
+ mapping_set_release_always(inode->i_mapping);
p9_debug(P9_DEBUG_FSC, "inode %p get cookie %p\n",
inode, v9fs_inode_cookie(v9inode));
@@ -394,6 +394,7 @@ void v9fs_evict_inode(struct inode *inode)
version = cpu_to_le32(v9inode->qid.version);
fscache_clear_inode_writeback(v9fs_inode_cookie(v9inode), inode,
&version);
+ mapping_clear_release_always(inode->i_mapping);
clear_inode(inode);
filemap_fdatawrite(&inode->i_data);
@@ -805,6 +805,7 @@ void afs_evict_inode(struct inode *inode)
afs_set_cache_aux(vnode, &aux);
fscache_clear_inode_writeback(afs_vnode_cache(vnode), inode, &aux);
+ mapping_clear_release_always(inode->i_mapping);
clear_inode(inode);
while (!list_empty(&vnode->wb_keys)) {
@@ -680,6 +680,8 @@ static inline void afs_vnode_set_cache(struct afs_vnode *vnode,
{
#ifdef CONFIG_AFS_FSCACHE
vnode->netfs.cache = cookie;
+ if (cookie)
+ mapping_set_release_always(vnode->netfs.inode.i_mapping);
#endif
}
@@ -584,6 +584,8 @@ static bool cachefiles_open_file(struct cachefiles_object *object,
if (ret < 0)
goto check_failed;
+ clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &object->cookie->flags);
+
object->file = file;
/* Always update the atime on an object we've just looked up (this is
@@ -36,6 +36,8 @@ void ceph_fscache_register_inode_cookie(struct inode *inode)
&ci->i_vino, sizeof(ci->i_vino),
&ci->i_version, sizeof(ci->i_version),
i_size_read(inode));
+ if (ci->netfs.cache)
+ mapping_set_release_always(inode->i_mapping);
}
void ceph_fscache_unregister_inode_cookie(struct ceph_inode_info *ci)
@@ -572,6 +572,7 @@ void ceph_evict_inode(struct inode *inode)
truncate_inode_pages_final(&inode->i_data);
if (inode->i_state & I_PINNING_FSCACHE_WB)
ceph_fscache_unuse_cookie(inode, true);
+ mapping_clear_release_always(inode->i_mapping);
clear_inode(inode);
ceph_fscache_unregister_inode_cookie(ci);
@@ -423,6 +423,7 @@ cifs_free_inode(struct inode *inode)
static void
cifs_evict_inode(struct inode *inode)
{
+ mapping_clear_release_always(inode->i_mapping);
truncate_inode_pages_final(&inode->i_data);
if (inode->i_state & I_PINNING_FSCACHE_WB)
cifs_fscache_unuse_inode_cookie(inode, true);
@@ -108,6 +108,8 @@ void cifs_fscache_get_inode_cookie(struct inode *inode)
&cifsi->uniqueid, sizeof(cifsi->uniqueid),
&cd, sizeof(cd),
i_size_read(&cifsi->netfs.inode));
+ if (cifsi->netfs.cache)
+ mapping_set_release_always(inode->i_mapping);
}
void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update)
@@ -65,8 +65,7 @@ static bool page_cache_pipe_buf_try_steal(struct pipe_inode_info *pipe,
*/
folio_wait_writeback(folio);
- if (folio_has_private(folio) &&
- !filemap_release_folio(folio, GFP_KERNEL))
+ if (!filemap_release_folio(folio, GFP_KERNEL))
goto out_unlock;
/*
@@ -199,6 +199,7 @@ enum mapping_flags {
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
+ AS_RELEASE_ALWAYS, /* Call ->release_folio(), even if no private data */
};
/**
@@ -269,6 +270,21 @@ static inline int mapping_use_writeback_tags(struct address_space *mapping)
return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
}
+static inline bool mapping_release_always(const struct address_space *mapping)
+{
+ return test_bit(AS_RELEASE_ALWAYS, &mapping->flags);
+}
+
+static inline void mapping_set_release_always(struct address_space *mapping)
+{
+ set_bit(AS_RELEASE_ALWAYS, &mapping->flags);
+}
+
+static inline void mapping_clear_release_always(struct address_space *mapping)
+{
+ clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
+}
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return mapping->gfp_mask;
@@ -3941,6 +3941,10 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
struct address_space * const mapping = folio->mapping;
BUG_ON(!folio_test_locked(folio));
+ if ((!mapping || !mapping_release_always(mapping)) &&
+ !folio_test_private(folio) &&
+ !folio_test_private_2(folio))
+ return true;
if (folio_test_writeback(folio))
return false;
@@ -2683,8 +2683,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
gfp = current_gfp_context(mapping_gfp_mask(mapping) &
GFP_RECLAIM_MASK);
- if (folio_test_private(folio) &&
- !filemap_release_folio(folio, gfp)) {
+ if (!filemap_release_folio(folio, gfp)) {
ret = -EBUSY;
goto out;
}
@@ -1883,8 +1883,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
goto out_unlock;
}
- if (page_has_private(page) &&
- !try_to_release_page(page, GFP_KERNEL)) {
+ if (!try_to_release_page(page, GFP_KERNEL)) {
result = SCAN_PAGE_HAS_PRIVATE;
putback_lru_page(page);
goto out_unlock;
@@ -831,8 +831,7 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
if (err != 0) {
pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
- } else if (page_has_private(p) &&
- !try_to_release_page(p, GFP_NOIO)) {
+ } else if (!try_to_release_page(p, GFP_NOIO)) {
pr_info("%#lx: failed to release buffers\n", pfn);
} else {
ret = MF_RECOVERED;
@@ -905,8 +905,7 @@ static int fallback_migrate_folio(struct address_space *mapping,
* Buffers may be managed in a filesystem specific way.
* We must have no buffers or drop them.
*/
- if (folio_test_private(src) &&
- !filemap_release_folio(src, GFP_KERNEL))
+ if (!filemap_release_folio(src, GFP_KERNEL))
return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY;
return migrate_folio(mapping, dst, src, mode);
@@ -19,7 +19,6 @@
#include <linux/highmem.h>
#include <linux/pagevec.h>
#include <linux/task_io_accounting_ops.h>
-#include <linux/buffer_head.h> /* grr. try_to_release_page */
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
#include "internal.h"
@@ -276,7 +275,7 @@ static long mapping_evict_folio(struct address_space *mapping,
if (folio_ref_count(folio) >
folio_nr_pages(folio) + folio_has_private(folio) + 1)
return 0;
- if (folio_has_private(folio) && !filemap_release_folio(folio, 0))
+ if (!filemap_release_folio(folio, 0))
return 0;
return remove_mapping(mapping, folio);
@@ -581,8 +580,7 @@ static int invalidate_complete_folio2(struct address_space *mapping,
if (folio->mapping != mapping)
return 0;
- if (folio_has_private(folio) &&
- !filemap_release_folio(folio, GFP_KERNEL))
+ if (!filemap_release_folio(folio, GFP_KERNEL))
return 0;
spin_lock(&mapping->host->i_lock);
@@ -186,6 +186,14 @@ struct scan_control {
#define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0)
#endif
+static bool folio_needs_release(struct folio *folio)
+{
+ struct address_space *mapping = folio->mapping;
+
+ return folio_has_private(folio) ||
+ (mapping && mapping_release_always(mapping));
+}
+
/*
* From 0 .. 200. Higher means more swappy.
*/
@@ -1978,7 +1986,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* (refcount == 1) it can be freed. Otherwise, leave
* the folio on the LRU so it is swappable.
*/
- if (folio_has_private(folio)) {
+ if (folio_needs_release(folio)) {
if (!filemap_release_folio(folio, sc->gfp_mask))
goto activate_locked;
if (!mapping && folio_ref_count(folio) == 1) {
@@ -2592,9 +2600,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
}
if (unlikely(buffer_heads_over_limit)) {
- if (folio_test_private(folio) && folio_trylock(folio)) {
- if (folio_test_private(folio))
- filemap_release_folio(folio, 0);
+ if (folio_needs_release(folio) && folio_trylock(folio)) {
+ filemap_release_folio(folio, 0);
folio_unlock(folio);
}
}