[v2,6/6] shmem: add large folios support to the write path
Commit Message
Add large folio support for shmem write path matching the same high
order preference mechanism used for iomap buffered IO path as used in
__filemap_get_folio() with a difference on the max order permitted
(being PMD_ORDER-1) to respect the huge mount option when large folio
is supported.
Use the __folio_get_max_order to get a hint for the order of the folio
based on file size which takes care of the mapping requirements.
Swap does not support high order folios for now, so make it order 0 in
case swap is enabled.
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
---
mm/shmem.c | 66 ++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 52 insertions(+), 14 deletions(-)
Comments
On Tue, Sep 19, 2023 at 01:55:54PM +0000, Daniel Gomez wrote:
> Add large folio support for shmem write path matching the same high
> order preference mechanism used for iomap buffered IO path as used in
> __filemap_get_folio() with a difference on the max order permitted
> (being PMD_ORDER-1) to respect the huge mount option when large folio
> is supported.
I'm strongly opposed to "respecting the huge mount option". We're
determining the best order to use for the folios. Artificially limiting
the size because the sysadmin read an article from 2005 that said to
use this option is STUPID.
> else
> - folio = shmem_alloc_folio(gfp, info, index, *order);
> + folio = shmem_alloc_folio(gfp, info, index, order);
Why did you introduce it as *order, only to change it back to order
in this patch? It feels like you just fixed up patch 6 rather than
percolating the changes all the way back to where they should have
been done. This makes the reviewer's life hard.
On Tue, Sep 19, 2023 at 04:01:19PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 19, 2023 at 01:55:54PM +0000, Daniel Gomez wrote:
> > Add large folio support for shmem write path matching the same high
> > order preference mechanism used for iomap buffered IO path as used in
> > __filemap_get_folio() with a difference on the max order permitted
> > (being PMD_ORDER-1) to respect the huge mount option when large folio
> > is supported.
>
> I'm strongly opposed to "respecting the huge mount option". We're
> determining the best order to use for the folios. Artificially limiting
> the size because the sysadmin read an article from 2005 that said to
> use this option is STUPID.
Then, I would still have the conflict on what to do when the order is
same as huge. I guess huge does not make sense in this new scenario?
unless we add large folios controls as proposal in linux-MM meeting
notes [1]. But I'm missing a bit of context so it's not clear to me
what to do next.
[1] https://lore.kernel.org/all/4966f496-9f71-460c-b2ab-8661384ce626@arm.com/T/#u
In that sense, I wanted to have a big picture of what was this new
strategy implying in terms of folio order when adding to page cache,
so I added tracing for it (same as in readahead). With bpftrace I
can see the following (notes added to explain each field) after running
fsx up to 119M:
@c: 363049108 /* total folio order being traced */
@order[8]: 2 /* order 8 being used 2 times (add_to_page_cache) */
@order[5]: 3249587 */ order 5 being used 3249587 times
(add_to_page_cache) */
@order[4]: 5972205
@order[3]: 8890418
@order[2]: 10380055
@order[0]: 334556841
@order_2: /* linear histogram of folio order */
[0, 1) 334556841 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 0 | |
[2, 3) 10380055 |@ |
[3, 4) 8890418 |@ |
[4, 5) 5972205 | |
[5, 6) 3249587 | |
[6, 7) 0 | |
[7, 8) 0 | |
[8, 9) 2 | |
I guess that's not te best workload to see this but would tracing be also
interesting to add to the series?
>
> > else
> > - folio = shmem_alloc_folio(gfp, info, index, *order);
> > + folio = shmem_alloc_folio(gfp, info, index, order);
>
> Why did you introduce it as *order, only to change it back to order
> in this patch? It feels like you just fixed up patch 6 rather than
> percolating the changes all the way back to where they should have
> been done. This makes the reviewer's life hard.
>
Sorry about that. I missed it in my changes.
@@ -95,6 +95,9 @@ static struct vfsmount *shm_mnt;
/* Symlink up to this size is kmalloc'ed instead of using a swappable page */
#define SHORT_SYMLINK_LEN 128
+/* Like MAX_PAGECACHE_ORDER but respecting huge option */
+#define MAX_SHMEM_ORDER HPAGE_PMD_ORDER - 1
+
/*
* shmem_fallocate communicates with shmem_fault or shmem_writepage via
* inode->i_private (with i_rwsem making sure that it has only one user at
@@ -1680,26 +1683,58 @@ static struct folio *shmem_alloc_folio(gfp_t gfp,
return folio;
}
+/**
+ * shmem_mapping_size_order - Get maximum folio order for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @size: The suggested size of the folio to create.
+ *
+ * This returns a high order for folios (when supported) based on the file size
+ * which the mapping currently allows at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned order
+ * may be less than the size passed.
+ *
+ * Like __filemap_get_folio order calculation.
+ *
+ * Return: The order.
+ */
+static inline unsigned int
+shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
+ size_t size, struct shmem_sb_info *sbinfo)
+{
+ unsigned int order = ilog2(size);
+
+ if ((order <= PAGE_SHIFT) ||
+ (!mapping_large_folio_support(mapping) || !sbinfo->noswap))
+ return 0;
+ else
+ order = order - PAGE_SHIFT;
+
+ /* If we're not aligned, allocate a smaller folio */
+ if (index & ((1UL << order) - 1))
+ order = __ffs(index);
+
+ order = min_t(size_t, order, MAX_SHMEM_ORDER);
+
+ /* Order-1 not supported due to THP dependency */
+ return (order == 1) ? 0 : order;
+}
+
static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode,
- pgoff_t index, bool huge, unsigned int *order)
+ pgoff_t index, unsigned int order)
{
struct shmem_inode_info *info = SHMEM_I(inode);
struct folio *folio;
- int nr;
- int err;
-
- if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
- huge = false;
- nr = huge ? HPAGE_PMD_NR : 1U << *order;
+ int nr = 1U << order;
+ int err = shmem_inode_acct_block(inode, nr);
- err = shmem_inode_acct_block(inode, nr);
if (err)
goto failed;
- if (huge)
+ if (order == HPAGE_PMD_ORDER)
folio = shmem_alloc_hugefolio(gfp, info, index);
else
- folio = shmem_alloc_folio(gfp, info, index, *order);
+ folio = shmem_alloc_folio(gfp, info, index, order);
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
@@ -2030,18 +2065,19 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
return 0;
}
+ order = shmem_mapping_size_order(inode->i_mapping, index, len, sbinfo);
+
if (!shmem_is_huge(inode, index, false,
vma ? vma->vm_mm : NULL, vma ? vma->vm_flags : 0))
goto alloc_nohuge;
huge_gfp = vma_thp_gfp_mask(vma);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
- folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index, true,
- &order);
+ folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index,
+ HPAGE_PMD_ORDER);
if (IS_ERR(folio)) {
alloc_nohuge:
- folio = shmem_alloc_and_acct_folio(gfp, inode, index, false,
- &order);
+ folio = shmem_alloc_and_acct_folio(gfp, inode, index, order);
}
if (IS_ERR(folio)) {
int retry = 5;
@@ -2145,6 +2181,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (folio_test_large(folio)) {
folio_unlock(folio);
folio_put(folio);
+ if (--order == 1)
+ order = 0;
goto alloc_nohuge;
}
unlock: