[2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

Message ID 20240123183332.876854-3-shikemeng@huaweicloud.com
State New
Headers
Series Fix and cleanups to page-writeback |

Commit Message

Kemeng Shi Jan. 23, 2024, 6:33 p.m. UTC
  The wb_calc_thresh will calculate wb's share in global wb domain. We need
to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh
instead of wb_calc_thresh to fix this.

Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Tejun Heo Jan. 23, 2024, 8:43 p.m. UTC | #1
On Wed, Jan 24, 2024 at 02:33:29AM +0800, Kemeng Shi wrote:
> The wb_calc_thresh will calculate wb's share in global wb domain. We need
> to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh
> instead of wb_calc_thresh to fix this.

That function is calculating the wb's portion of wb portion in the whole
system so that threshold can be distributed accordingly. So, it has to be
compared in the global domain. If you look at the comment on top of struct
wb_domain, it says:

/*
 * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
 * and are measured against each other in.  There always is one global
 * domain, global_wb_domain, that every wb in the system is a member of.
 * This allows measuring the relative bandwidth of each wb to distribute
 * dirtyable memory accordingly.
 */

Also, how is this tested? Was there a case where the existing code
misbehaved that's improved by this patch? Or is this just from reading code?

Thanks.
  
Kemeng Shi Jan. 24, 2024, 2:01 a.m. UTC | #2
on 1/24/2024 4:43 AM, Tejun Heo wrote:
> On Wed, Jan 24, 2024 at 02:33:29AM +0800, Kemeng Shi wrote:
>> The wb_calc_thresh will calculate wb's share in global wb domain. We need
>> to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh
>> instead of wb_calc_thresh to fix this.
> 
> That function is calculating the wb's portion of wb portion in the whole
> system so that threshold can be distributed accordingly. So, it has to be
> compared in the global domain. If you look at the comment on top of struct
> wb_domain, it says:
> 
> /*
>  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
>  * and are measured against each other in.  There always is one global
>  * domain, global_wb_domain, that every wb in the system is a member of.
>  * This allows measuring the relative bandwidth of each wb to distribute
>  * dirtyable memory accordingly.
>  */
> 
Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb
domain and a cgroup domain. I agree the way how we calculate wb's threshold
in global domain as you described above. This patch tries to fix calculation
of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb,
mdtc->bg_thresh)), means:
(wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*)
The cgroup domain threshold is
(memory of cgroup domain) / (memory of system) * (system threshold).
Then the wb's threshold in cgroup will be smaller than expected.

Consider following domain hierarchy:
                global domain (100G)
                /                 \
        cgroup domain1(50G)     cgroup domain2(50G)
                |                 |
bdi            wb1               wb2
Assume wb1 and wb2 has the same bandwidth.
We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G.
Then we have:
wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
= 10G * 1/2 = 5G
wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
= 5G * 1/2 = 2.5G
At last, wb1 and wb2 will be limited at 2.5G, the system will be limited
at 5G which is less than global domain bg_thresh 10G.

After the fix, threshold in cgroup domain will be:
(wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
The wb1 and wb2 will be limited at 5G, the system will be limited at
10G which equals to global domain bg_thresh 10G.

As I didn't take a deep look into memory cgroup, please correct me if
anything is wrong. Thanks!
> Also, how is this tested? Was there a case where the existing code
> misbehaved that's improved by this patch? Or is this just from reading code?
This is jut from reading code. Would the case showed above convince you
a bit. Look forward to your reply, thanks!.
> 
> Thanks.
>
  
Tejun Heo Jan. 29, 2024, 9 p.m. UTC | #3
Hello,

On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote:
> Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb
> domain and a cgroup domain. I agree the way how we calculate wb's threshold
> in global domain as you described above. This patch tries to fix calculation
> of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb,
> mdtc->bg_thresh)), means:
> (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*)
> The cgroup domain threshold is
> (memory of cgroup domain) / (memory of system) * (system threshold).
> Then the wb's threshold in cgroup will be smaller than expected.
> 
> Consider following domain hierarchy:
>                 global domain (100G)
>                 /                 \
>         cgroup domain1(50G)     cgroup domain2(50G)
>                 |                 |
> bdi            wb1               wb2
> Assume wb1 and wb2 has the same bandwidth.
> We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G.
> Then we have:
> wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
> = 10G * 1/2 = 5G
> wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
> = 5G * 1/2 = 2.5G
> At last, wb1 and wb2 will be limited at 2.5G, the system will be limited
> at 5G which is less than global domain bg_thresh 10G.
> 
> After the fix, threshold in cgroup domain will be:
> (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
> The wb1 and wb2 will be limited at 5G, the system will be limited at
> 10G which equals to global domain bg_thresh 10G.
> 
> As I didn't take a deep look into memory cgroup, please correct me if
> anything is wrong. Thanks!
> > Also, how is this tested? Was there a case where the existing code
> > misbehaved that's improved by this patch? Or is this just from reading code?
>
> This is jut from reading code. Would the case showed above convince you
> a bit. Look forward to your reply, thanks!.

So, the explanation makes some sense to me but can you please construct a
case to actually demonstrate the problem and fix? I don't think it'd be wise
to apply the change without actually observing the code change does what it
says it does.

Thanks.
  
Kemeng Shi Feb. 8, 2024, 9:26 a.m. UTC | #4
on 1/30/2024 5:00 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote:
>> Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb
>> domain and a cgroup domain. I agree the way how we calculate wb's threshold
>> in global domain as you described above. This patch tries to fix calculation
>> of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb,
>> mdtc->bg_thresh)), means:
>> (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*)
>> The cgroup domain threshold is
>> (memory of cgroup domain) / (memory of system) * (system threshold).
>> Then the wb's threshold in cgroup will be smaller than expected.
>>
>> Consider following domain hierarchy:
>>                 global domain (100G)
>>                 /                 \
>>         cgroup domain1(50G)     cgroup domain2(50G)
>>                 |                 |
>> bdi            wb1               wb2
>> Assume wb1 and wb2 has the same bandwidth.
>> We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G.
>> Then we have:
>> wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
>> = 10G * 1/2 = 5G
>> wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
>> = 5G * 1/2 = 2.5G
>> At last, wb1 and wb2 will be limited at 2.5G, the system will be limited
>> at 5G which is less than global domain bg_thresh 10G.
>>
>> After the fix, threshold in cgroup domain will be:
>> (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
>> The wb1 and wb2 will be limited at 5G, the system will be limited at
>> 10G which equals to global domain bg_thresh 10G.
>>
>> As I didn't take a deep look into memory cgroup, please correct me if
>> anything is wrong. Thanks!
>>> Also, how is this tested? Was there a case where the existing code
>>> misbehaved that's improved by this patch? Or is this just from reading code?
>>
>> This is jut from reading code. Would the case showed above convince you
>> a bit. Look forward to your reply, thanks!.
> 
> So, the explanation makes some sense to me but can you please construct a
> case to actually demonstrate the problem and fix? I don't think it'd be wise
> to apply the change without actually observing the code change does what it
> says it does.
Hi Tejun, sorry for the delay as I found there is a issue that keep triggering
writeback even the dirty page is under dirty background threshold. The issue
make it difficult to observe the expected improvment from this patch. I try to
fix it in [1] and test this patch based on the fix patches.
Run test as following:

/* make background writeback easier to observe */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
/* enable memory and io cgroup */
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control

/* run fio in group1 with shell */
cd /sys/fs/cgroup
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

/* run another fio in group2 with another shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

Before the fix we got (result of three tests):
fio1
  WRITE: bw=1304MiB/s (1367MB/s), 1304MiB/s-1304MiB/s (1367MB/s-1367MB/s), io=76.4GiB (82.0GB), run=60001-60001msec
  WRITE: bw=1351MiB/s (1417MB/s), 1351MiB/s-1351MiB/s (1417MB/s-1417MB/s), io=79.2GiB (85.0GB), run=60001-60001msec
  WRITE: bw=1373MiB/s (1440MB/s), 1373MiB/s-1373MiB/s (1440MB/s-1440MB/s), io=80.5GiB (86.4GB), run=60001-60001msec
fio2
  WRITE: bw=1134MiB/s (1190MB/s), 1134MiB/s-1134MiB/s (1190MB/s-1190MB/s), io=66.5GiB (71.4GB), run=60001-60001msec
  WRITE: bw=1414MiB/s (1483MB/s), 1414MiB/s-1414MiB/s (1483MB/s-1483MB/s), io=82.8GiB (88.0GB), run=60001-60001msec
  WRITE: bw=1469MiB/s (1540MB/s), 1469MiB/s-1469MiB/s (1540MB/s-1540MB/s), io=86.0GiB (92.4GB), run=60001-60001msec

After the fix we got (result of three tests):
fio1
  WRITE: bw=1719MiB/s (1802MB/s), 1719MiB/s-1719MiB/s (1802MB/s-1802MB/s), io=101GiB (108GB), run=60001-60001msec
  WRITE: bw=1723MiB/s (1806MB/s), 1723MiB/s-1723MiB/s (1806MB/s-1806MB/s), io=101GiB (108GB), run=60001-60001msec
  WRITE: bw=1691MiB/s (1774MB/s), 1691MiB/s-1691MiB/s (1774MB/s-1774MB/s), io=99.2GiB (106GB), run=60036-60036msec
fio2
  WRITE: bw=1692MiB/s (1774MB/s), 1692MiB/s-1692MiB/s (1774MB/s-1774MB/s), io=99.1GiB (106GB), run=60001-60001msec
  WRITE: bw=1681MiB/s (1763MB/s), 1681MiB/s-1681MiB/s (1763MB/s-1763MB/s), io=98.5GiB (106GB), run=60001-60001msec
  WRITE: bw=1671MiB/s (1752MB/s), 1671MiB/s-1671MiB/s (1752MB/s-1752MB/s), io=97.9GiB (105GB), run=60001-60001msec
I also add code to print the pages written in writeback and pages written in
writeback reduce a lot and are rare after this fix.

[1] https://lore.kernel.org/linux-fsdevel/20240208172024.23625-2-shikemeng@huaweicloud.com/T/#u
> 
> Thanks.
>
  
Tejun Heo Feb. 8, 2024, 7:32 p.m. UTC | #5
Hello, Kemeng.

On Thu, Feb 08, 2024 at 05:26:10PM +0800, Kemeng Shi wrote:
> Hi Tejun, sorry for the delay as I found there is a issue that keep triggering
> writeback even the dirty page is under dirty background threshold. The issue
> make it difficult to observe the expected improvment from this patch. I try to
> fix it in [1] and test this patch based on the fix patches.
> Run test as following:

Ah, that looks promising and thanks a lot for looking into this. It's great
to have someone actually poring over the code and behavior. Understanding
the wb and cgroup wb behaviors have always been challenging because the only
thing we have is the tracepoints and it's really tedious and difficult to
build an overall understanding from the trace outputs. Can I persuade you
into writing a drgn monitoring script similar to e.g.
tools/workqueues/wq_monitor.py? I think there's a pretty good chance the
visibility can be improved substantially.

Thanks.
  
Kemeng Shi Feb. 18, 2024, 2:35 a.m. UTC | #6
on 2/9/2024 3:32 AM, Tejun Heo wrote:
> Hello, Kemeng.
> 
> On Thu, Feb 08, 2024 at 05:26:10PM +0800, Kemeng Shi wrote:
>> Hi Tejun, sorry for the delay as I found there is a issue that keep triggering
>> writeback even the dirty page is under dirty background threshold. The issue
>> make it difficult to observe the expected improvment from this patch. I try to
>> fix it in [1] and test this patch based on the fix patches.
>> Run test as following:
> 
> Ah, that looks promising and thanks a lot for looking into this. It's great
> to have someone actually poring over the code and behavior. Understanding
> the wb and cgroup wb behaviors have always been challenging because the only
> thing we have is the tracepoints and it's really tedious and difficult to
> build an overall understanding from the trace outputs. Can I persuade you
> into writing a drgn monitoring script similar to e.g.
> tools/workqueues/wq_monitor.py? I think there's a pretty good chance the
> visibility can be improved substantially.
Hi Tejun, sorry for the late reply as I was on vacation these days.
I agree that visibility is poor so I have to add some printks to debug.
Actually, I have added per wb stats to improve the visibility as we
only have per bdi stats (/sys/kernel/debug/bdi/xxx:x/stats) now. And I
plan to submit it in a new series.
I'd like to add a script to improve visibility more but I can't ensure
the time to do it. I would submit the monitoring script with the per wb
stats if the time problem does not bother you.
Thanks.
> 
> Thanks.
>
  
Tejun Heo Feb. 20, 2024, 5:34 p.m. UTC | #7
Hello,

On Sun, Feb 18, 2024 at 10:35:41AM +0800, Kemeng Shi wrote:
> Hi Tejun, sorry for the late reply as I was on vacation these days.
> I agree that visibility is poor so I have to add some printks to debug.
> Actually, I have added per wb stats to improve the visibility as we
> only have per bdi stats (/sys/kernel/debug/bdi/xxx:x/stats) now. And I
> plan to submit it in a new series.
> I'd like to add a script to improve visibility more but I can't ensure
> the time to do it. I would submit the monitoring script with the per wb
> stats if the time problem does not bother you.

It has had poor visiblity for many many years, I don't think we're in any
hurry.

Thanks.
  

Patch

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9268859722c4..f6c7f3b0f495 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2118,7 +2118,7 @@  bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		if (mdtc->dirty > mdtc->bg_thresh)
 			return true;
 
-		thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
+		thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
 		if (thresh < 2 * wb_stat_error())
 			reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
 		else