[v4,2/3] mm: Defer TLB flush by keeping both src and dst folios at migration
Commit Message
Implementation of the MIGRC mechanism, which stands for 'Migration Read Copy'.
We always face migration overhead at either promotion or demotion while
working with tiered memory, e.g. CXL memory, and found that TLB
shootdown is quite a big overhead that needs to be eliminated if possible.
Fortunately, the TLB flush can be deferred if both the source and
destination folios are kept during migration until all the required TLB
flushes have been done, but of course only if the target PTE entries
have read-only permission, more precisely speaking, don't have write
permission. Otherwise, no doubt the folio might get corrupted.
To achieve that:
1. For folios that map only to non-writable TLB entries, prevent the
TLB flush at migration by keeping both the source and destination
folios, which will be handled later at a better time.
2. When any non-writable TLB entry changes to writable, e.g. through
the fault handler, give up the migrc mechanism and perform the
required TLB flush right away.
The measurement result:
Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
run 'perf stat' using events:
1) itlb.itlb_flush
2) tlb_flush.dtlb_thread
3) tlb_flush.stlb_any
4) dTLB-load-misses
5) dTLB-store-misses
6) iTLB-load-misses
run 'cat /proc/vmstat' and pick:
1) numa_pages_migrated
2) pgmigrate_success
3) nr_tlb_remote_flush
4) nr_tlb_remote_flush_received
5) nr_tlb_local_flush_all
6) nr_tlb_local_flush_one
BEFORE - mainline v6.6-rc5
------------------------------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 50000000
Performance counter stats for 'system wide':
20953405 itlb.itlb_flush
114886593 tlb_flush.dtlb_thread
88267015 tlb_flush.stlb_any
115304095543 dTLB-load-misses
163904743 dTLB-store-misses
608486259 iTLB-load-misses
556.787113849 seconds time elapsed
$ cat /proc/vmstat
...
numa_pages_migrated 3378748
pgmigrate_success 7720310
nr_tlb_remote_flush 751464
nr_tlb_remote_flush_received 10742115
nr_tlb_local_flush_all 21899
nr_tlb_local_flush_one 740157
...
AFTER - mainline v6.6-rc5 + migrc
------------------------------------------
$ perf stat -a \
-e itlb.itlb_flush \
-e tlb_flush.dtlb_thread \
-e tlb_flush.stlb_any \
-e dTLB-load-misses \
-e dTLB-store-misses \
-e iTLB-load-misses \
./XSBench -p 50000000
Performance counter stats for 'system wide':
4353555 itlb.itlb_flush
72482780 tlb_flush.dtlb_thread
68226458 tlb_flush.stlb_any
114331610808 dTLB-load-misses
116084771 dTLB-store-misses
377180518 iTLB-load-misses
552.667718220 seconds time elapsed
$ cat /proc/vmstat
...
numa_pages_migrated 3339325
pgmigrate_success 7642363
nr_tlb_remote_flush 192913
nr_tlb_remote_flush_received 2327426
nr_tlb_local_flush_all 25759
nr_tlb_local_flush_one 740454
...
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm_types.h | 21 ++++
include/linux/mmzone.h | 9 ++
include/linux/page-flags.h | 4 +
include/linux/sched.h | 6 +
include/trace/events/mmflags.h | 3 +-
mm/internal.h | 57 +++++++++
mm/memory.c | 11 ++
mm/migrate.c | 215 +++++++++++++++++++++++++++++++++
mm/page_alloc.c | 17 ++-
mm/rmap.c | 11 +-
10 files changed, 349 insertions(+), 5 deletions(-)
Comments
On Thu, Nov 09, 2023 at 01:59:07PM +0900, Byungchul Park wrote:
> +++ b/include/linux/page-flags.h
> @@ -136,6 +136,7 @@ enum pageflags {
> PG_arch_2,
> PG_arch_3,
> #endif
> + PG_migrc, /* Page is under migrc's control */
> __NR_PAGEFLAGS,
Yeah; no. We're out of page flags. And CXL is insufficiently
compelling to add more. If you use CXL, you don't care about
performance, by definition.
> @@ -589,6 +590,9 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> PAGEFLAG(Idle, idle, PF_ANY)
> #endif
>
> +TESTCLEARFLAG(Migrc, migrc, PF_ANY)
> +__PAGEFLAG(Migrc, migrc, PF_ANY)
Why PF_ANY?
> +/*
> + * Initialize the page when allocated from buddy allocator.
> + */
> +static inline void migrc_init_page(struct page *p)
> +{
> + __ClearPageMigrc(p);
> +}
This flag should already be clear ... ?
> +/*
> + * Check if the folio is pending for TLB flush and then clear the flag.
> + */
> +static inline bool migrc_unpend_if_pending(struct folio *f)
> +{
> + return folio_test_clear_migrc(f);
> +}
If you named the flag better, you wouldn't need this wrapper.
> +static void migrc_mark_pending(struct folio *fsrc, struct folio *fdst)
> +{
> + folio_get(fsrc);
> + __folio_set_migrc(fsrc);
> + __folio_set_migrc(fdst);
> +}
This is almost certainly unsafe. By using the non-atomic bit ops, you
stand the risk of losing a simultaneous update to any other bit in this
word. Like, say, someone trying to lock the folio?
> +++ b/mm/page_alloc.c
> @@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>
> set_page_owner(page, order, gfp_flags);
> page_table_check_alloc(page, order);
> +
> + for (i = 0; i != 1 << order; ++i)
> + migrc_init_page(page + i);
No.
Hi Byungchul,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/x86/core tip/x86/mm v6.6]
[cannot apply to akpm-mm/mm-everything linus/master next-20231109]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Byungchul-Park/mm-rmap-Recognize-read-only-TLB-entries-during-batched-TLB-flush/20231109-163706
base: tip/sched/core
patch link: https://lore.kernel.org/r/20231109045908.54996-3-byungchul%40sk.com
patch subject: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20231109/202311092356.XzY1aBHX-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231109/202311092356.XzY1aBHX-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311092356.XzY1aBHX-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from include/linux/mmzone.h:22,
from include/linux/topology.h:33,
from include/linux/irq.h:19,
from include/asm-generic/hardirq.h:17,
from ./arch/m68k/include/generated/asm/hardirq.h:1,
from include/linux/hardirq.h:11,
from include/linux/interrupt.h:11,
from include/linux/kernel_stat.h:9,
from arch/m68k/kernel/asm-offsets.c:16:
>> include/linux/mm_types.h:1416:42: error: field 'arch' has incomplete type
1416 | struct arch_tlbflush_unmap_batch arch;
| ^~~~
make[3]: *** [scripts/Makefile.build:116: arch/m68k/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1202: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:234: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:234: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +/arch +1416 include/linux/mm_types.h
1401
1402 struct migrc_req {
1403 /*
1404 * folios pending for TLB flush
1405 */
1406 struct list_head folios;
1407
1408 /*
1409 * for hanging to the associated numa node
1410 */
1411 struct llist_node llnode;
1412
1413 /*
1414 * architecture specific data for batched TLB flush
1415 */
> 1416 struct arch_tlbflush_unmap_batch arch;
Hi Byungchul,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/x86/core tip/x86/mm v6.6]
[cannot apply to akpm-mm/mm-everything linus/master next-20231109]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Byungchul-Park/mm-rmap-Recognize-read-only-TLB-entries-during-batched-TLB-flush/20231109-163706
base: tip/sched/core
patch link: https://lore.kernel.org/r/20231109045908.54996-3-byungchul%40sk.com
patch subject: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20231110/202311100211.UAqu6dj7-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231110/202311100211.UAqu6dj7-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311100211.UAqu6dj7-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/um/kernel/asm-offsets.c:1:
In file included from arch/x86/um/shared/sysdep/kernel-offsets.h:5:
In file included from include/linux/crypto.h:17:
In file included from include/linux/slab.h:16:
In file included from include/linux/gfp.h:7:
In file included from include/linux/mmzone.h:22:
>> include/linux/mm_types.h:1416:35: error: field has incomplete type 'struct arch_tlbflush_unmap_batch'
1416 | struct arch_tlbflush_unmap_batch arch;
| ^
include/linux/mm_types.h:1416:9: note: forward declaration of 'struct arch_tlbflush_unmap_batch'
1416 | struct arch_tlbflush_unmap_batch arch;
| ^
In file included from arch/um/kernel/asm-offsets.c:1:
arch/x86/um/shared/sysdep/kernel-offsets.h:9:6: warning: no previous prototype for function 'foo' [-Wmissing-prototypes]
9 | void foo(void)
| ^
arch/x86/um/shared/sysdep/kernel-offsets.h:9:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
9 | void foo(void)
| ^
| static
1 warning and 1 error generated.
make[3]: *** [scripts/Makefile.build:116: arch/um/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1202: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:234: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:234: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +1416 include/linux/mm_types.h
1401
1402 struct migrc_req {
1403 /*
1404 * folios pending for TLB flush
1405 */
1406 struct list_head folios;
1407
1408 /*
1409 * for hanging to the associated numa node
1410 */
1411 struct llist_node llnode;
1412
1413 /*
1414 * architecture specific data for batched TLB flush
1415 */
> 1416 struct arch_tlbflush_unmap_batch arch;
On Thu, Nov 09, 2023 at 02:36:01PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 09, 2023 at 01:59:07PM +0900, Byungchul Park wrote:
> > +++ b/include/linux/page-flags.h
> > @@ -136,6 +136,7 @@ enum pageflags {
> > PG_arch_2,
> > PG_arch_3,
> > #endif
> > + PG_migrc, /* Page is under migrc's control */
> > __NR_PAGEFLAGS,
>
> Yeah; no. We're out of page flags. And CXL is insufficiently
I should've forced migrc to work only for 64-bit arch. I missed it while
I removed the kconfig for it. However, lemme try to avoid the additional
page flag anyway if possible.
> compelling to add more. If you use CXL, you don't care about
> performance, by definition.
>
> > @@ -589,6 +590,9 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> > PAGEFLAG(Idle, idle, PF_ANY)
> > #endif
> >
> > +TESTCLEARFLAG(Migrc, migrc, PF_ANY)
> > +__PAGEFLAG(Migrc, migrc, PF_ANY)
>
> Why PF_ANY?
PF_HEAD looks like a better fit for the purpose. I will change it to PF_HEAD.
> > +/*
> > + * Initialize the page when allocated from buddy allocator.
> > + */
> > +static inline void migrc_init_page(struct page *p)
> > +{
> > + __ClearPageMigrc(p);
> > +}
>
> This flag should already be clear ... ?
That should be initialized either on allocation or on free.
> > +/*
> > + * Check if the folio is pending for TLB flush and then clear the flag.
> > + */
> > +static inline bool migrc_unpend_if_pending(struct folio *f)
> > +{
> > + return folio_test_clear_migrc(f);
> > +}
>
> If you named the flag better, you wouldn't need this wrapper.
I will.
> > +static void migrc_mark_pending(struct folio *fsrc, struct folio *fdst)
> > +{
> > + folio_get(fsrc);
> > + __folio_set_migrc(fsrc);
> > + __folio_set_migrc(fdst);
> > +}
>
> This is almost certainly unsafe. By using the non-atomic bit ops, you
> stand the risk of losing a simultaneous update to any other bit in this
> word. Like, say, someone trying to lock the folio?
fdst is not exposed yet, so it's safe to use non-atomic ops here IMHO.
However, fsrc's PG_locked is owned by the migration context and the
folio has been successfully unmapped, so I thought it'd be safe, but
yeah, I'm not convinced that fsrc is safe here for real. I will change
it to atomic.
> > +++ b/mm/page_alloc.c
> > @@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >
> > set_page_owner(page, order, gfp_flags);
> > page_table_check_alloc(page, order);
> > +
> > + for (i = 0; i != 1 << order; ++i)
> > + migrc_init_page(page + i);
>
> No.
I will change it.
Appreciate your feedback.
Byungchul
On Thu, Nov 09, 2023 at 02:36:01PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 09, 2023 at 01:59:07PM +0900, Byungchul Park wrote:
> > +++ b/include/linux/page-flags.h
> > @@ -136,6 +136,7 @@ enum pageflags {
> > PG_arch_2,
> > PG_arch_3,
> > #endif
> > + PG_migrc, /* Page is under migrc's control */
> > __NR_PAGEFLAGS,
>
> Yeah; no. We're out of page flags. And CXL is insufficiently
I won't use an additional page flag any more.
Thanks.
Byungchul
@@ -1372,4 +1372,25 @@ enum {
/* See also internal only FOLL flags in mm/internal.h */
};
+struct migrc_req {
+ /*
+ * folios pending for TLB flush
+ */
+ struct list_head folios;
+
+ /*
+ * for hanging to the associated numa node
+ */
+ struct llist_node llnode;
+
+ /*
+ * architecture specific data for batched TLB flush
+ */
+ struct arch_tlbflush_unmap_batch arch;
+
+ /*
+ * associated numa node
+ */
+ int nid;
+};
#endif /* _LINUX_MM_TYPES_H */
@@ -980,6 +980,11 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+
+ /*
+ * the number of folios pending for TLB flush in the zone
+ */
+ atomic_t migrc_pending_nr;
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
@@ -1398,6 +1403,10 @@ typedef struct pglist_data {
#ifdef CONFIG_MEMORY_FAILURE
struct memory_failure_stats mf_stats;
#endif
+ /*
+ * migrc requests including folios pending for TLB flush
+ */
+ struct llist_head migrc_reqs;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -136,6 +136,7 @@ enum pageflags {
PG_arch_2,
PG_arch_3,
#endif
+ PG_migrc, /* Page is under migrc's control */
__NR_PAGEFLAGS,
PG_readahead = PG_reclaim,
@@ -589,6 +590,9 @@ TESTCLEARFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif
+TESTCLEARFLAG(Migrc, migrc, PF_ANY)
+__PAGEFLAG(Migrc, migrc, PF_ANY)
+
/*
* PageReported() is used to track reported free pages within the Buddy
* allocator. We can use the non-atomic version of the test and set
@@ -1326,6 +1326,12 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_ro;
+ /*
+ * if all the mappings of a folio during unmap are RO so that
+ * migrc can work on it
+ */
+ bool can_migrc;
+
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
@@ -118,7 +118,8 @@
DEF_PAGEFLAG_NAME(mappedtodisk), \
DEF_PAGEFLAG_NAME(reclaim), \
DEF_PAGEFLAG_NAME(swapbacked), \
- DEF_PAGEFLAG_NAME(unevictable) \
+ DEF_PAGEFLAG_NAME(unevictable), \
+ DEF_PAGEFLAG_NAME(migrc) \
IF_HAVE_PG_MLOCK(mlocked) \
IF_HAVE_PG_UNCACHED(uncached) \
IF_HAVE_PG_HWPOISON(hwpoison) \
@@ -1158,4 +1158,61 @@ struct vma_prepare {
struct vm_area_struct *remove;
struct vm_area_struct *remove2;
};
+
+/*
+ * Initialize the page when allocated from buddy allocator.
+ */
+static inline void migrc_init_page(struct page *p)
+{
+ __ClearPageMigrc(p);
+}
+
+/*
+ * Check if the folio is pending for TLB flush and then clear the flag.
+ */
+static inline bool migrc_unpend_if_pending(struct folio *f)
+{
+ return folio_test_clear_migrc(f);
+}
+
+/*
+ * Reset the indicator, meaning no writable mapping has been found so
+ * far, at the beginning of every rmap traverse for unmap. Migrc can
+ * work only when all the mappings are RO.
+ */
+static inline void can_migrc_init(void)
+{
+ current->can_migrc = true;
+}
+
+/*
+ * Mark the folio as not applicable to migrc once a writable or
+ * dirty pte has been found during the rmap traverse for unmap.
+ */
+static inline void can_migrc_fail(void)
+{
+ current->can_migrc = false;
+}
+
+/*
+ * Check that all the mappings are RO and at least one RO mapping exists.
+ */
+static inline bool can_migrc_test(void)
+{
+ return current->can_migrc && current->tlb_ubc_ro.flush_required;
+}
+
+/*
+ * Return the number of folios pending TLB flush that have yet to get
+ * freed in the zone.
+ */
+static inline int migrc_pending_nr_in_zone(struct zone *z)
+{
+ return atomic_read(&z->migrc_pending_nr);
+}
+
+/*
+ * Perform TLB flush needed and free the folios in the node.
+ */
+bool migrc_flush_free_folios(nodemask_t *nodes);
#endif /* __MM_INTERNAL_H */
@@ -3359,6 +3359,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (vmf->page)
folio = page_folio(vmf->page);
+ /*
+ * This folio has its read copy to prevent inconsistency while
+ * deferring TLB flushes. However, a problem might arise if
+ * it's going to become writable.
+ *
+ * To prevent it, give up the deferring TLB flushes and perform
+ * TLB flush right away.
+ */
+ if (folio && migrc_unpend_if_pending(folio))
+ migrc_flush_free_folios(NULL);
+
/*
* Shared mapping: we are guaranteed to have VM_WRITE and
* FAULT_FLAG_WRITE set at this point.
@@ -57,6 +57,162 @@
#include "internal.h"
+/*
+ * Marks the folios as pending for TLB flush.
+ */
+static void migrc_mark_pending(struct folio *fsrc, struct folio *fdst)
+{
+ folio_get(fsrc);
+ __folio_set_migrc(fsrc);
+ __folio_set_migrc(fdst);
+}
+
+static bool migrc_under_processing(struct folio *fsrc, struct folio *fdst)
+{
+ /*
+ * case1. folio_test_migrc(fsrc) && !folio_test_migrc(fdst):
+ *
+ * fsrc was already under migrc's control even before the
+ * current migration. Migrc doesn't work with it this time.
+ *
+ * case2. !folio_test_migrc(fsrc) && !folio_test_migrc(fdst):
+ *
+ * This is the normal case that is not migrc's interest.
+ *
+ * case3. folio_test_migrc(fsrc) && folio_test_migrc(fdst):
+ *
+ * Only the case that migrc works on.
+ */
+ return folio_test_migrc(fsrc) && folio_test_migrc(fdst);
+}
+
+static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst)
+{
+ /*
+ * TLB flushes needed are already done at this moment so the
+ * flag doesn't have to be kept.
+ */
+ __folio_clear_migrc(fsrc);
+ __folio_clear_migrc(fdst);
+ folio_put(fsrc);
+}
+
+static void migrc_expand_req(struct folio *fsrc, struct folio *fdst,
+ struct migrc_req *req)
+{
+ if (req->nid == -1)
+ req->nid = folio_nid(fsrc);
+
+ /*
+ * All the nids in a req should be the same.
+ */
+ WARN_ON(req->nid != folio_nid(fsrc));
+
+ list_add(&fsrc->lru, &req->folios);
+ atomic_inc(&folio_zone(fsrc)->migrc_pending_nr);
+}
+
+/*
+ * Prepare for gathering folios pending TLB flush: try to allocate
+ * the objects needed, initialize them and make them ready.
+ */
+static struct migrc_req *migrc_req_start(void)
+{
+ struct migrc_req *req;
+
+ req = kmalloc(sizeof(struct migrc_req), GFP_KERNEL);
+ if (!req)
+ return NULL;
+
+ arch_tlbbatch_clear(&req->arch);
+ INIT_LIST_HEAD(&req->folios);
+ req->nid = -1;
+
+ return req;
+}
+
+/*
+ * Hang the request with the collected folios to the corresponding node.
+ */
+static void migrc_req_end(struct migrc_req *req)
+{
+ if (!req)
+ return;
+
+ if (list_empty(&req->folios)) {
+ kfree(req);
+ return;
+ }
+
+ llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs);
+}
+
+/*
+ * Gather folios and architecture specific data to handle.
+ */
+static void migrc_gather(struct list_head *folios,
+ struct arch_tlbflush_unmap_batch *arch,
+ struct llist_head *reqs)
+{
+ struct llist_node *nodes;
+ struct migrc_req *req;
+ struct migrc_req *req2;
+
+ nodes = llist_del_all(reqs);
+ if (!nodes)
+ return;
+
+ llist_for_each_entry_safe(req, req2, nodes, llnode) {
+ arch_tlbbatch_fold(arch, &req->arch);
+ list_splice(&req->folios, folios);
+ kfree(req);
+ }
+}
+
+bool migrc_flush_free_folios(nodemask_t *nodes)
+{
+ struct folio *f, *f2;
+ int nid;
+ struct arch_tlbflush_unmap_batch arch;
+ LIST_HEAD(folios);
+
+ if (!nodes)
+ nodes = &node_possible_map;
+ arch_tlbbatch_clear(&arch);
+
+ for_each_node_mask(nid, *nodes)
+ migrc_gather(&folios, &arch, &NODE_DATA(nid)->migrc_reqs);
+
+ if (list_empty(&folios))
+ return false;
+
+ arch_tlbbatch_flush(&arch);
+ list_for_each_entry_safe(f, f2, &folios, lru) {
+ atomic_dec(&folio_zone(f)->migrc_pending_nr);
+ folio_put(f);
+ }
+ return true;
+}
+
+static void fold_ubc_ro_to_migrc(struct migrc_req *req)
+{
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+
+ if (!tlb_ubc_ro->flush_required)
+ return;
+
+ /*
+ * Fold tlb_ubc_ro's data to the request.
+ */
+ arch_tlbbatch_fold(&req->arch, &tlb_ubc_ro->arch);
+
+ /*
+ * Reset tlb_ubc_ro's data.
+ */
+ arch_tlbbatch_clear(&tlb_ubc_ro->arch);
+ tlb_ubc_ro->flush_required = false;
+}
+
bool isolate_movable_page(struct page *page, isolate_mode_t mode)
{
struct folio *folio = folio_get_nontail_page(page);
@@ -379,6 +535,7 @@ static int folio_expected_refs(struct address_space *mapping,
struct folio *folio)
{
int refs = 1;
+
if (!mapping)
return refs;
@@ -406,6 +563,12 @@ int folio_migrate_mapping(struct address_space *mapping,
int expected_count = folio_expected_refs(mapping, folio) + extra_count;
long nr = folio_nr_pages(folio);
+ /*
+ * Migrc mechanism increased the reference count.
+ */
+ if (migrc_under_processing(folio, newfolio))
+ expected_count++;
+
if (!mapping) {
/* Anonymous page without mapping */
if (folio_ref_count(folio) != expected_count)
@@ -1620,16 +1783,25 @@ static int migrate_pages_batch(struct list_head *from,
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
bool nosplit = (reason == MR_NUMA_MISPLACED);
+ struct migrc_req *mreq = NULL;
VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
!list_empty(from) && !list_is_singular(from));
+ /*
+ * Apply migrc only to numa migration for now.
+ */
+ if (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED)
+ mreq = migrc_req_start();
+
for (pass = 0; pass < nr_pass && retry; pass++) {
retry = 0;
thp_retry = 0;
nr_retry_pages = 0;
list_for_each_entry_safe(folio, folio2, from, lru) {
+ bool can_migrc;
+
is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
nr_pages = folio_nr_pages(folio);
@@ -1657,9 +1829,21 @@ static int migrate_pages_batch(struct list_head *from,
continue;
}
+ can_migrc_init();
rc = migrate_folio_unmap(get_new_folio, put_new_folio,
private, folio, &dst, mode, reason,
ret_folios);
+ /*
+ * can_migrc is true only if:
+ *
+ * 1. struct migrc_req has been allocated &&
+ * 2. There's no writable mapping at all &&
+ * 3. There's read-only mapping found &&
+ * 4. Not under migrc's control already
+ */
+ can_migrc = mreq && can_migrc_test() &&
+ !folio_test_migrc(folio);
+
/*
* The rules are:
* Success: folio will be freed
@@ -1720,6 +1904,19 @@ static int migrate_pages_batch(struct list_head *from,
case MIGRATEPAGE_UNMAP:
list_move_tail(&folio->lru, &unmap_folios);
list_add_tail(&dst->lru, &dst_folios);
+
+ if (can_migrc) {
+ /*
+ * To use ->lru exclusively, just
+ * mark the page flag for now.
+ *
+ * The folio will be queued to
+ * the current migrc request on
+ * move success below.
+ */
+ migrc_mark_pending(folio, dst);
+ fold_ubc_ro_to_migrc(mreq);
+ }
break;
default:
/*
@@ -1733,6 +1930,11 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_failed_pages += nr_pages;
break;
}
+ /*
+ * Done with the current folio. Fold the ro
+ * batch data gathered, to the normal batch.
+ */
+ fold_ubc_ro();
}
}
nr_failed += retry;
@@ -1774,6 +1976,14 @@ static int migrate_pages_batch(struct list_head *from,
case MIGRATEPAGE_SUCCESS:
stats->nr_succeeded += nr_pages;
stats->nr_thp_succeeded += is_thp;
+
+ /*
+ * Now that it's safe to use ->lru,
+ * queue the folio to the current migrc
+ * request.
+ */
+ if (migrc_under_processing(folio, dst))
+ migrc_expand_req(folio, dst, mreq);
break;
default:
nr_failed++;
@@ -1791,6 +2001,8 @@ static int migrate_pages_batch(struct list_head *from,
rc = rc_saved ? : nr_failed;
out:
+ migrc_req_end(mreq);
+
/* Cleanup remaining folios */
dst = list_first_entry(&dst_folios, struct folio, lru);
dst2 = list_next_entry(dst, lru);
@@ -1798,6 +2010,9 @@ static int migrate_pages_batch(struct list_head *from,
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
+ if (migrc_under_processing(folio, dst))
+ migrc_undo_folios(folio, dst);
+
__migrate_folio_extract(dst, &page_was_mapped, &anon_vma);
migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
true, ret_folios);
@@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+
+ for (i = 0; i != 1 << order; ++i)
+ migrc_init_page(page + i);
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -2839,6 +2842,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
long min = mark;
int o;
+ free_pages += migrc_pending_nr_in_zone(z);
+
/* free_pages may go negative - that's OK */
free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
@@ -2933,7 +2938,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
long usable_free;
long reserved;
- usable_free = free_pages;
+ usable_free = free_pages + migrc_pending_nr_in_zone(z);
reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
/* reserved may over estimate high-atomic reserves. */
@@ -3121,6 +3126,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
gfp_mask)) {
int ret;
+ /*
+ * Free the pending folios so that the remaining
+ * code can use the updated vmstats and check
+ * zone_watermark_fast() again.
+ */
+ migrc_flush_free_folios(ac->nodemask);
+ if (zone_watermark_fast(zone, order, mark,
+ ac->highest_zoneidx, alloc_flags, gfp_mask))
+ goto try_this_zone;
+
if (has_unaccepted_memory()) {
if (try_to_accept_memory(zone, order))
goto try_this_zone;
@@ -605,7 +605,6 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
}
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-
void fold_ubc_ro(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
@@ -675,9 +674,15 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!pte_accessible(mm, pteval))
return;
- if (pte_write(pteval) || writable)
+ if (pte_write(pteval) || writable) {
tlb_ubc = &current->tlb_ubc;
- else
+
+ /*
+ * migrc cannot work with the folio once a writable or
+ * dirty mapping of it has been found.
+ */
+ can_migrc_fail();
+ } else
tlb_ubc = &current->tlb_ubc_ro;
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);