[v3] fs/buffer.c: update per-CPU bh_lru cache via RCU
Commit Message
umount calls invalidate_bh_lrus(), which IPIs each
CPU that has a non-empty per-CPU buffer_head cache:
on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
This interrupts CPUs that might be executing code sensitive
to interference.
To avoid the IPI, free the per-CPU caches remotely via RCU.
Two bh_lrus structures are allocated for each CPU: one is in use
(the per-CPU bh_lrup pointer is assigned to it), and the other is
being freed (or idle).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
v3: fix CPU hotplug
v2: fix sparse warnings (kernel test robot)
Comments
On Wed, Feb 01, 2023 at 01:01:47PM -0300, Marcelo Tosatti wrote:
>
> umount calls invalidate_bh_lrus which IPIs each
via invalidate_bdev(). So this is only triggered on unmount of
filesystems that use the block device mapping directly, right?
Or is the problem that userspace is polling the block device (e.g.
udisks, blkid, etc) whilst the filesystem is mounted and populating
the block device mapping with cached pages so invalidate_bdev()
always does work even when the filesystem doesn't actually use the
bdev mapping?
> CPU that has non empty per-CPU buffer_head cache:
>
> on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
>
> This interrupts CPUs which might be executing code sensitive
> to interferences.
>
> To avoid the IPI, free the per-CPU caches remotely via RCU.
> Two bh_lrus structures for each CPU are allocated: one is being
> used (assigned to per-CPU bh_lru pointer), and the other is
> being freed (or idle).
Rather than adding more complexity to the legacy bufferhead code,
wouldn't it be better to switch the block device mapping to use
iomap+folios and get rid of the use of bufferheads altogether?
Cheers,
Dave.
On Fri, Feb 03, 2023 at 09:36:53AM +1100, Dave Chinner wrote:
> On Wed, Feb 01, 2023 at 01:01:47PM -0300, Marcelo Tosatti wrote:
> >
> > umount calls invalidate_bh_lrus which IPIs each
>
> via invalidate_bdev(). So this is only triggered on unmount of
> filesystems that use the block device mapping directly, right?
>
> Or is the problem that userspace is polling the block device (e.g.
> udisks, blkid, etc) whilst the filesystem is mounted and populating
> the block device mapping with cached pages so invalidate_bdev()
> always does work even when the filesystem doesn't actually use the
> bdev mapping?
>
> > CPU that has non empty per-CPU buffer_head cache:
> >
> > on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
> >
> > This interrupts CPUs which might be executing code sensitive
> > to interferences.
> >
> > To avoid the IPI, free the per-CPU caches remotely via RCU.
> > Two bh_lrus structures for each CPU are allocated: one is being
> > used (assigned to per-CPU bh_lru pointer), and the other is
> > being freed (or idle).
>
> Rather than adding more complexity to the legacy bufferhead code,
> wouldn't it be better to switch the block device mapping to use
> iomap+folios and get rid of the use of bufferheads altogether?
Pretty sure ext4's journalling relies on the blockdev using
buffer_heads. At least, I did a conversion of blockdev to use
mpage_readahead() and ext4 stopped working.
On Thu, Feb 02, 2023 at 10:51:27PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 03, 2023 at 09:36:53AM +1100, Dave Chinner wrote:
> > On Wed, Feb 01, 2023 at 01:01:47PM -0300, Marcelo Tosatti wrote:
> > >
> > > umount calls invalidate_bh_lrus which IPIs each
> >
> > via invalidate_bdev(). So this is only triggered on unmount of
> > filesystems that use the block device mapping directly, right?
While executing:
mount -o loop alpine-standard-3.17.1-x86_64.iso /mnt/loop/
mount-170027 [004] ...1 53852.213367: invalidate_bdev <-__invalidate_device
mount-170027 [004] ...1 53852.213468: invalidate_bdev <-bdev_disk_changed.part.0
mount-170027 [000] ...1 53852.222326: invalidate_bh_lrus <-set_blocksize
mount-170027 [000] ...1 53852.222398: invalidate_bh_lrus <-set_blocksize
systemd-udevd-170031 [011] ...1 53852.239794: invalidate_bh_lrus <-blkdev_flush_mapping
systemd-udevd-170029 [004] ...1 53852.240947: invalidate_bh_lrus <-blkdev_flush_mapping
> >
> > Or is the problem that userspace is polling the block device (e.g.
> > udisks, blkid, etc) whilst the filesystem is mounted and populating
> > the block device mapping with cached pages so invalidate_bdev()
> > always does work even when the filesystem doesn't actually use the
> > bdev mapping?
> >
> > > CPU that has non empty per-CPU buffer_head cache:
> > >
> > > on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
> > >
> > > This interrupts CPUs which might be executing code sensitive
> > > to interferences.
> > >
> > > To avoid the IPI, free the per-CPU caches remotely via RCU.
> > > Two bh_lrus structures for each CPU are allocated: one is being
> > > used (assigned to per-CPU bh_lru pointer), and the other is
> > > being freed (or idle).
> >
> > Rather than adding more complexity to the legacy bufferhead code,
> > wouldn't it be better to switch the block device mapping to use
> > iomap+folios and get rid of the use of bufferheads altogether?
>
> Pretty sure ext4's journalling relies on the blockdev using
> buffer_heads. At least, I did a conversion of blockdev to use
> mpage_readahead() and ext4 stopped working.
And it's actually pretty simple: the new invalidate_bh_lrus() should be a
straightforward use of RCU (a sketch of the idea follows the steps below):
1. for_each_online_cpu(cpu)
       cpu->bh_lrup = bh_lrus[1] (or 0)
2. synchronize_rcu_expedited() (wait for all previous users of the
   bh_lrup pointer to stop referencing it).
3. free bh's in bh_lrus[0] (or 1)
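Not part of the patch, but for illustration, here is a minimal userspace
sketch of that flip-and-wait sequence on top of liburcu (assuming its
classic <urcu.h> API; the lru/active/invalidate() names are made up and
only stand in for bh_lrus/bh_lrup/invalidate_bh_lrus()):

/*
 * Userspace analogue of the flip-and-wait scheme above, using liburcu.
 * Build with: gcc sketch.c -lurcu
 */
#include <stdio.h>
#include <urcu.h>		/* rcu_read_lock(), synchronize_rcu(), ... */

#define LRU_SIZE 8

struct lru {
	void *slots[LRU_SIZE];
};

static struct lru lrus[2];	/* the two copies (per CPU in the kernel)  */
static struct lru *active;	/* RCU-protected pointer, cf. bh_lrup      */
static unsigned int lru_idx;	/* index of the copy currently in use      */

/* Reader side: only ever touches the currently active copy. */
static void reader(void)
{
	struct lru *l;

	rcu_read_lock();
	l = rcu_dereference(active);
	/* ... scan l->slots[] here ... */
	(void)l;
	rcu_read_unlock();
}

/*
 * Update side: 1. point readers at the idle copy, 2. wait a grace
 * period so no reader can still hold a reference to the old copy,
 * 3. empty the old copy without racing against anyone.
 */
static void invalidate(void)
{
	unsigned int old = lru_idx;
	int i;

	lru_idx ^= 1;
	rcu_assign_pointer(active, &lrus[lru_idx]);	/* 1. flip */
	synchronize_rcu();				/* 2. wait */
	for (i = 0; i < LRU_SIZE; i++)			/* 3. free */
		lrus[old].slots[i] = NULL;

	printf("now using copy %u\n", lru_idx);
}

int main(void)
{
	rcu_register_thread();
	active = &lrus[0];
	reader();
	invalidate();
	reader();
	rcu_unregister_thread();
	return 0;
}

The patch below follows the same three steps, with the flip done for
every online CPU under cpus_read_lock() and serialized against
concurrent callers by bh_lru_invalidate_mutex.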
On Fri, Feb 03, 2023 at 09:36:53AM +1100, Dave Chinner wrote:
> Rather than adding more complexity to the legacy bufferhead code,
> wouldn't it be better to switch the block device mapping to use
> iomap+folios and get rid of the use of bufferheads altogether?
It's not that simple unfortunately :( All file systems that use
buffer heads also use them on the block device for metadata. I have
a WIP block device iomap conversion, but it still has to offer
buffer_heads similar to the legacy path in iomap to make all that
work.
@@ -1203,7 +1203,21 @@ struct bh_lru {
struct buffer_head *bhs[BH_LRU_SIZE];
};
-static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
+
+/*
+ * Allocate two bh_lrus structures for each CPU. bh_lrup points to the
+ * one that is currently in use. Assuming cpu->bh_lrup currently points
+ * to bh_lrus[0], the update path does:
+ *
+ * cpu->bh_lrup = bh_lrus[1]
+ * synchronize_rcu()
+ * free bh's in bh_lrus[0]
+ */
+static unsigned int bh_lru_idx;
+static DEFINE_PER_CPU(struct bh_lru, bh_lrus[2]) = {{{ NULL }}, {{NULL}}};
+static DEFINE_PER_CPU(struct bh_lru __rcu *, bh_lrup);
+
+static DEFINE_MUTEX(bh_lru_invalidate_mutex);
#ifdef CONFIG_SMP
#define bh_lru_lock() local_irq_disable()
@@ -1245,16 +1259,19 @@ static void bh_lru_install(struct buffer_head *bh)
return;
}
- b = this_cpu_ptr(&bh_lrus);
+ rcu_read_lock();
+ b = rcu_dereference(per_cpu(bh_lrup, smp_processor_id()));
for (i = 0; i < BH_LRU_SIZE; i++) {
swap(evictee, b->bhs[i]);
if (evictee == bh) {
+ rcu_read_unlock();
bh_lru_unlock();
return;
}
}
get_bh(bh);
+ rcu_read_unlock();
bh_lru_unlock();
brelse(evictee);
}
@@ -1266,28 +1283,32 @@ static struct buffer_head *
lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
{
struct buffer_head *ret = NULL;
+ struct bh_lru *lru;
unsigned int i;
check_irqs_on();
bh_lru_lock();
+ rcu_read_lock();
+
+ lru = rcu_dereference(per_cpu(bh_lrup, smp_processor_id()));
for (i = 0; i < BH_LRU_SIZE; i++) {
- struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
+ struct buffer_head *bh = lru->bhs[i];
if (bh && bh->b_blocknr == block && bh->b_bdev == bdev &&
bh->b_size == size) {
if (i) {
while (i) {
- __this_cpu_write(bh_lrus.bhs[i],
- __this_cpu_read(bh_lrus.bhs[i - 1]));
+ lru->bhs[i] = lru->bhs[i - 1];
i--;
}
- __this_cpu_write(bh_lrus.bhs[0], bh);
+ lru->bhs[0] = bh;
}
get_bh(bh);
ret = bh;
break;
}
}
+ rcu_read_unlock();
bh_lru_unlock();
return ret;
}
@@ -1381,35 +1402,54 @@ static void __invalidate_bh_lrus(struct bh_lru *b)
b->bhs[i] = NULL;
}
}
-/*
- * invalidate_bh_lrus() is called rarely - but not only at unmount.
- * This doesn't race because it runs in each cpu either in irq
- * or with preempt disabled.
- */
-static void invalidate_bh_lru(void *arg)
-{
- struct bh_lru *b = &get_cpu_var(bh_lrus);
-
- __invalidate_bh_lrus(b);
- put_cpu_var(bh_lrus);
-}
bool has_bh_in_lru(int cpu, void *dummy)
{
- struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+ struct bh_lru *b;
int i;
-
+
+ rcu_read_lock();
+ b = rcu_dereference(per_cpu(bh_lrup, cpu));
for (i = 0; i < BH_LRU_SIZE; i++) {
- if (b->bhs[i])
+ if (b->bhs[i]) {
+ rcu_read_unlock();
return true;
+ }
}
+ rcu_read_unlock();
return false;
}
+/*
+ * invalidate_bh_lrus() is called rarely - but not only at unmount.
+ */
void invalidate_bh_lrus(void)
{
- on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
+ int cpu, oidx;
+
+ mutex_lock(&bh_lru_invalidate_mutex);
+ cpus_read_lock();
+ oidx = bh_lru_idx;
+ bh_lru_idx++;
+ if (bh_lru_idx >= 2)
+ bh_lru_idx = 0;
+
+ /* Assign the per-CPU bh_lru pointer */
+ for_each_online_cpu(cpu)
+ rcu_assign_pointer(per_cpu(bh_lrup, cpu),
+ per_cpu_ptr(&bh_lrus[bh_lru_idx], cpu));
+ synchronize_rcu_expedited();
+
+ for_each_online_cpu(cpu) {
+ struct bh_lru *b = per_cpu_ptr(&bh_lrus[oidx], cpu);
+
+ bh_lru_lock();
+ __invalidate_bh_lrus(b);
+ bh_lru_unlock();
+ }
+ cpus_read_unlock();
+ mutex_unlock(&bh_lru_invalidate_mutex);
}
EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
@@ -1422,8 +1462,10 @@ void invalidate_bh_lrus_cpu(void)
struct bh_lru *b;
bh_lru_lock();
- b = this_cpu_ptr(&bh_lrus);
+ rcu_read_lock();
+ b = rcu_dereference(per_cpu(bh_lrup, smp_processor_id()));
__invalidate_bh_lrus(b);
+ rcu_read_unlock();
bh_lru_unlock();
}
@@ -2920,15 +2962,25 @@ void free_buffer_head(struct buffer_head *bh)
}
EXPORT_SYMBOL(free_buffer_head);
+static int buffer_cpu_online(unsigned int cpu)
+{
+ rcu_assign_pointer(per_cpu(bh_lrup, cpu),
+ per_cpu_ptr(&bh_lrus[bh_lru_idx], cpu));
+ return 0;
+}
+
static int buffer_exit_cpu_dead(unsigned int cpu)
{
int i;
- struct bh_lru *b = &per_cpu(bh_lrus, cpu);
+ struct bh_lru *b;
+ rcu_read_lock();
+ b = rcu_dereference(per_cpu(bh_lrup, cpu));
for (i = 0; i < BH_LRU_SIZE; i++) {
brelse(b->bhs[i]);
b->bhs[i] = NULL;
}
+ rcu_read_unlock();
this_cpu_add(bh_accounting.nr, per_cpu(bh_accounting, cpu).nr);
per_cpu(bh_accounting, cpu).nr = 0;
return 0;
@@ -3021,7 +3073,7 @@ EXPORT_SYMBOL(__bh_read_batch);
void __init buffer_init(void)
{
unsigned long nrpages;
- int ret;
+ int ret, cpu;
bh_cachep = kmem_cache_create("buffer_head",
sizeof(struct buffer_head), 0,
@@ -3029,6 +3081,11 @@ void __init buffer_init(void)
SLAB_MEM_SPREAD),
NULL);
+ cpus_read_lock();
+ for_each_online_cpu(cpu)
+ rcu_assign_pointer(per_cpu(bh_lrup, cpu), per_cpu_ptr(&bh_lrus[0], cpu));
+ cpus_read_unlock();
+
/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
*/
@@ -3037,4 +3094,7 @@ void __init buffer_init(void)
ret = cpuhp_setup_state_nocalls(CPUHP_FS_BUFF_DEAD, "fs/buffer:dead",
NULL, buffer_exit_cpu_dead);
WARN_ON(ret < 0);
+ ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "fs/buffer:online",
+ NULL, buffer_cpu_online);
+ WARN_ON(ret < 0);
}