Message ID | 20230413182313.RFC.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid |
---|---|
State | New |
Headers |
From: Douglas Anderson <dianders@chromium.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: Yu Zhao <yuzhao@google.com>, Douglas Anderson <dianders@chromium.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [RFC PATCH] migrate_pages: Never block waiting for the page lock Date: Thu, 13 Apr 2023 18:23:15 -0700 Message-ID: <20230413182313.RFC.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid> |
Series | [RFC] migrate_pages: Never block waiting for the page lock |
Commit Message
Doug Anderson
April 14, 2023, 1:23 a.m. UTC
Currently when we try to do page migration and we're in "synchronous"
mode (and not doing direct compaction) then we'll wait an infinite
amount of time for a page lock. This does not appear to be a great
idea.
One issue can be seen when I put a device under extreme memory
pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram
swap). I ran the browser along with Android (which runs from a
loopback mounted 128K block-size squashfs "disk"). I then manually ran
the mmm_donut memory pressure tool [1]. The system is completely
unusable both with and without this patch since there are 8 processes
completely thrashing memory, but it was still interesting to look at
how migration was behaving. I put some timing code in and I could see
that we sometimes waited over 25 seconds (in the context of
kcompactd0) for a page lock to become available. Although the 25
seconds was the high mark, it was easy to see tens, hundreds, or
thousands of milliseconds spent waiting on the lock.
Instead of waiting, if I bailed out right away (as this patch does), I
could see kcompactd0 move forward to successfully migrate other
pages instead. This seems like a better use of kcompactd's time.
Thus, even though this didn't make the system any more usable in my
absurd test case, it still seemed to make migration behave better and
that feels like a win. It also makes the code simpler since we have
one fewer special case.
The second issue that this patch attempts to fix is one that I haven't
managed to reproduce yet. We have crash reports from the field that
report that kcompactd0 was blocked for more than ~120 seconds on this
exact lock. These crash reports are on devices running older kernels
(5.10 mostly). In most of these crash reports the device is under
memory pressure and many tasks were blocked in squashfs code, ext4
code, or memory allocation code. While I don't know if unblocking
kcompactd would actually have helped relieve the memory pressure, at
least there was a chance that it could have helped a little bit.
[1] https://chromium.googlesource.com/chromiumos/platform/microbenchmarks/+/refs/heads/main/mmm_donut.py
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
mm/migrate.c | 25 +++++++------------------
1 file changed, 7 insertions(+), 18 deletions(-)
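For readers who don't want to dig the change out of the quoted replies below, the whole patch boils down to the folio-lock handling in migrate_folio_unmap(). Condensed from the diff quoted in the first reply (simplified, not the literal hunks):

```c
/* Before this RFC (simplified): */
if (!folio_trylock(src)) {
	if (mode == MIGRATE_ASYNC)
		goto out;
	/*
	 * Direct compaction must not call lock_page(): readahead adds
	 * pages to the LRU locked, and an allocation made while building
	 * that bio could lock the same page twice and deadlock.
	 */
	if (current->flags & PF_MEMALLOC)
		goto out;
	folio_lock(src);		/* may block for a very long time */
}
locked = true;

/* After this RFC: never block here, whatever the migrate mode. */
if (!folio_trylock(src)) {
	/* Bail so the caller can move on to uncontended pages. */
	goto out;
}
locked = true;
```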
Comments
Douglas Anderson <dianders@chromium.org> writes: > Currently when we try to do page migration and we're in "synchronous" > mode (and not doing direct compaction) then we'll wait an infinite > amount of time for a page lock. This does not appear to be a great > idea. > > One issue can be seen when I put a device under extreme memory > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram > swap). I ran the browser along with Android (which runs from a > loopback mounted 128K block-size squashfs "disk"). I then manually ran > the mmm_donut memory pressure tool [1]. The system is completely > unusable both with and without this patch since there are 8 processes > completely thrashing memory, but it was still interesting to look at > how migration was behaving. I put some timing code in and I could see > that we sometimes waited over 25 seconds (in the context of > kcompactd0) for a page lock to become available. Although the 25 > seconds was the high mark, it was easy to see tens, hundreds, or > thousands of milliseconds spent waiting on the lock. > > Instead of waiting, if I bailed out right away (as this patch does), I > could see kcompactd0 move forward to successfully to migrate other > pages instead. This seems like a better use of kcompactd's time. > > Thus, even though this didn't make the system any more usable in my > absurd test case, it still seemed to make migration behave better and > that feels like a win. It also makes the code simpler since we have > one fewer special case. TBH, the test case is too extreme for me. And, we have multiple "sync" mode to deal with latency requirement, for example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long latency. If you have latency requirement for some users, you may consider to add new "sync" mode. Best Regards, Huang, Ying > The second issue that this patch attempts to fix is one that I haven't > managed to reproduce yet. We have crash reports from the field that > report that kcompactd0 was blocked for more than ~120 seconds on this > exact lock. These crash reports are on devices running older kernels > (5.10 mostly). In most of these crash reports the device is under > memory pressure and many tasks were blocked in squashfs code, ext4 > code, or memory allocation code. While I don't know if unblocking > kcompactd would actually have helped relieve the memory pressure, at > least there was a chance that it could have helped a little bit. > > [1] https://chromium.googlesource.com/chromiumos/platform/microbenchmarks/+/refs/heads/main/mmm_donut.py > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > > mm/migrate.c | 25 +++++++------------------ > 1 file changed, 7 insertions(+), 18 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index db3f154446af..dfb0a6944181 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1143,26 +1143,15 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page > dst->private = NULL; > > if (!folio_trylock(src)) { > - if (mode == MIGRATE_ASYNC) > - goto out; > - > /* > - * It's not safe for direct compaction to call lock_page. > - * For example, during page readahead pages are added locked > - * to the LRU. Later, when the IO completes the pages are > - * marked uptodate and unlocked. However, the queueing > - * could be merging multiple pages for one bio (e.g. > - * mpage_readahead). If an allocation happens for the > - * second or third page, the process can end up locking > - * the same page twice and deadlocking. 
Rather than > - * trying to be clever about what pages can be locked, > - * avoid the use of lock_page for direct compaction > - * altogether. > + * While there are some modes we could be running in where we > + * could block here waiting for the lock (specifically > + * modes other than MIGRATE_ASYNC and when we're not in > + * direct compaction), it's not worth the wait. Instead of > + * waiting, we'll bail. This will let the caller try to > + * migrate some other pages that aren't contended. > */ > - if (current->flags & PF_MEMALLOC) > - goto out; > - > - folio_lock(src); > + goto out; > } > locked = true;
Hi, On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote: > > Douglas Anderson <dianders@chromium.org> writes: > > > Currently when we try to do page migration and we're in "synchronous" > > mode (and not doing direct compaction) then we'll wait an infinite > > amount of time for a page lock. This does not appear to be a great > > idea. > > > > One issue can be seen when I put a device under extreme memory > > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram > > swap). I ran the browser along with Android (which runs from a > > loopback mounted 128K block-size squashfs "disk"). I then manually ran > > the mmm_donut memory pressure tool [1]. The system is completely > > unusable both with and without this patch since there are 8 processes > > completely thrashing memory, but it was still interesting to look at > > how migration was behaving. I put some timing code in and I could see > > that we sometimes waited over 25 seconds (in the context of > > kcompactd0) for a page lock to become available. Although the 25 > > seconds was the high mark, it was easy to see tens, hundreds, or > > thousands of milliseconds spent waiting on the lock. > > > > Instead of waiting, if I bailed out right away (as this patch does), I > > could see kcompactd0 move forward to successfully to migrate other > > pages instead. This seems like a better use of kcompactd's time. > > > > Thus, even though this didn't make the system any more usable in my > > absurd test case, it still seemed to make migration behave better and > > that feels like a win. It also makes the code simpler since we have > > one fewer special case. > > TBH, the test case is too extreme for me. That's fair. That being said, I guess the point I was trying to make is that waiting for this lock could take an unbounded amount of time. Other parts of the system sometimes hold a page lock and then do a blocking operation. At least in the case of kcompactd there are better uses of its time than waiting for any given page. > And, we have multiple "sync" mode to deal with latency requirement, for > example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long > latency. If you have latency requirement for some users, you may > consider to add new "sync" mode. Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I guess my first thought would be to avoid adding a new mode and make MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to wait for all the pages to be migrated can use the heavier sync modes. It seems to me like the current users of MIGRATE_SYNC_LIGHT would not want to block for an unbounded amount of time here. What do you think? -Doug
Doug Anderson <dianders@chromium.org> writes: > Hi, > > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Douglas Anderson <dianders@chromium.org> writes: >> >> > Currently when we try to do page migration and we're in "synchronous" >> > mode (and not doing direct compaction) then we'll wait an infinite >> > amount of time for a page lock. This does not appear to be a great >> > idea. >> > >> > One issue can be seen when I put a device under extreme memory >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram >> > swap). I ran the browser along with Android (which runs from a >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran >> > the mmm_donut memory pressure tool [1]. The system is completely >> > unusable both with and without this patch since there are 8 processes >> > completely thrashing memory, but it was still interesting to look at >> > how migration was behaving. I put some timing code in and I could see >> > that we sometimes waited over 25 seconds (in the context of >> > kcompactd0) for a page lock to become available. Although the 25 >> > seconds was the high mark, it was easy to see tens, hundreds, or >> > thousands of milliseconds spent waiting on the lock. >> > >> > Instead of waiting, if I bailed out right away (as this patch does), I >> > could see kcompactd0 move forward to successfully to migrate other >> > pages instead. This seems like a better use of kcompactd's time. >> > >> > Thus, even though this didn't make the system any more usable in my >> > absurd test case, it still seemed to make migration behave better and >> > that feels like a win. It also makes the code simpler since we have >> > one fewer special case. >> >> TBH, the test case is too extreme for me. > > That's fair. That being said, I guess the point I was trying to make > is that waiting for this lock could take an unbounded amount of time. > Other parts of the system sometimes hold a page lock and then do a > blocking operation. At least in the case of kcompactd there are better > uses of its time than waiting for any given page. > >> And, we have multiple "sync" mode to deal with latency requirement, for >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long >> latency. If you have latency requirement for some users, you may >> consider to add new "sync" mode. > > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I > guess my first thought would be to avoid adding a new mode and make > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to > wait for all the pages to be migrated can use the heavier sync modes. > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not > want to block for an unbounded amount of time here. What do you think? It appears that you can just use MIGRATE_ASYNC if you think the correct behavior is "NOT block at all". I found that there are more fine-grained controls on this in compaction code, please take a look at "enum compact_priority" and its comments. Best Regards, Huang, Ying
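For context on the fine-grained control Ying points to, the compaction priority levels map onto migrate modes roughly as follows. This is paraphrased from memory of include/linux/compaction.h and compact_zone_order() in mm/compaction.c around that time; treat the field names as approximate rather than authoritative:

```c
/* include/linux/compaction.h (paraphrased) */
enum compact_priority {
	COMPACT_PRIO_SYNC_FULL,		/* most aggressive direct compaction */
	COMPACT_PRIO_SYNC_LIGHT,	/* default for direct compaction */
	COMPACT_PRIO_ASYNC,		/* first, cheapest attempt */
};

/* mm/compaction.c: compact_zone_order() derives the migrate mode and
 * the aggressiveness knobs from the priority (paraphrased): */
struct compact_control cc = {
	.order		  = order,
	.gfp_mask	  = gfp_mask,
	.zone		  = zone,
	.mode		  = (prio == COMPACT_PRIO_ASYNC) ?
				MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
	/* the lowest priority also scans more thoroughly */
	.ignore_skip_hint = (prio == COMPACT_PRIO_SYNC_FULL),
	.whole_zone	  = (prio == COMPACT_PRIO_SYNC_FULL),
};
```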
Hi, On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote: > > Doug Anderson <dianders@chromium.org> writes: > > > Hi, > > > > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Douglas Anderson <dianders@chromium.org> writes: > >> > >> > Currently when we try to do page migration and we're in "synchronous" > >> > mode (and not doing direct compaction) then we'll wait an infinite > >> > amount of time for a page lock. This does not appear to be a great > >> > idea. > >> > > >> > One issue can be seen when I put a device under extreme memory > >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram > >> > swap). I ran the browser along with Android (which runs from a > >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran > >> > the mmm_donut memory pressure tool [1]. The system is completely > >> > unusable both with and without this patch since there are 8 processes > >> > completely thrashing memory, but it was still interesting to look at > >> > how migration was behaving. I put some timing code in and I could see > >> > that we sometimes waited over 25 seconds (in the context of > >> > kcompactd0) for a page lock to become available. Although the 25 > >> > seconds was the high mark, it was easy to see tens, hundreds, or > >> > thousands of milliseconds spent waiting on the lock. > >> > > >> > Instead of waiting, if I bailed out right away (as this patch does), I > >> > could see kcompactd0 move forward to successfully to migrate other > >> > pages instead. This seems like a better use of kcompactd's time. > >> > > >> > Thus, even though this didn't make the system any more usable in my > >> > absurd test case, it still seemed to make migration behave better and > >> > that feels like a win. It also makes the code simpler since we have > >> > one fewer special case. > >> > >> TBH, the test case is too extreme for me. > > > > That's fair. That being said, I guess the point I was trying to make > > is that waiting for this lock could take an unbounded amount of time. > > Other parts of the system sometimes hold a page lock and then do a > > blocking operation. At least in the case of kcompactd there are better > > uses of its time than waiting for any given page. > > > >> And, we have multiple "sync" mode to deal with latency requirement, for > >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long > >> latency. If you have latency requirement for some users, you may > >> consider to add new "sync" mode. > > > > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I > > guess my first thought would be to avoid adding a new mode and make > > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to > > wait for all the pages to be migrated can use the heavier sync modes. > > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not > > want to block for an unbounded amount of time here. What do you think? > > It appears that you can just use MIGRATE_ASYNC if you think the correct > behavior is "NOT block at all". I found that there are more > fine-grained controls on this in compaction code, please take a look at > "enum compact_priority" and its comments. Actually, the more I think about it the more I think the right answer is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept some blocking but we don't want long / unbounded blocking. 
Reading the comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is too much. It's entirely plausible that someone else holding the lock is doing something as slow as writepage() and thus waiting on the lock can be just as bad for latency. I'll try to send out a v2 with this approach today and we can see what people think. -Doug
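A minimal sketch of that v2 direction in migrate_folio_unmap() — an illustration of the idea being discussed, not the actual v2 patch; only full MIGRATE_SYNC would still wait for the lock:

```c
if (!folio_trylock(src)) {
	/* ASYNC never blocked here; the idea is that SYNC_LIGHT won't either. */
	if (mode == MIGRATE_ASYNC || mode == MIGRATE_SYNC_LIGHT)
		goto out;
	/* Direct compaction still must not sleep on the lock (deadlock risk). */
	if (current->flags & PF_MEMALLOC)
		goto out;
	folio_lock(src);		/* full MIGRATE_SYNC keeps waiting */
}
locked = true;
```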
Doug Anderson <dianders@chromium.org> writes: > Hi, > > On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Doug Anderson <dianders@chromium.org> writes: >> >> > Hi, >> > >> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Douglas Anderson <dianders@chromium.org> writes: >> >> >> >> > Currently when we try to do page migration and we're in "synchronous" >> >> > mode (and not doing direct compaction) then we'll wait an infinite >> >> > amount of time for a page lock. This does not appear to be a great >> >> > idea. >> >> > >> >> > One issue can be seen when I put a device under extreme memory >> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram >> >> > swap). I ran the browser along with Android (which runs from a >> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran >> >> > the mmm_donut memory pressure tool [1]. The system is completely >> >> > unusable both with and without this patch since there are 8 processes >> >> > completely thrashing memory, but it was still interesting to look at >> >> > how migration was behaving. I put some timing code in and I could see >> >> > that we sometimes waited over 25 seconds (in the context of >> >> > kcompactd0) for a page lock to become available. Although the 25 >> >> > seconds was the high mark, it was easy to see tens, hundreds, or >> >> > thousands of milliseconds spent waiting on the lock. >> >> > >> >> > Instead of waiting, if I bailed out right away (as this patch does), I >> >> > could see kcompactd0 move forward to successfully to migrate other >> >> > pages instead. This seems like a better use of kcompactd's time. >> >> > >> >> > Thus, even though this didn't make the system any more usable in my >> >> > absurd test case, it still seemed to make migration behave better and >> >> > that feels like a win. It also makes the code simpler since we have >> >> > one fewer special case. >> >> >> >> TBH, the test case is too extreme for me. >> > >> > That's fair. That being said, I guess the point I was trying to make >> > is that waiting for this lock could take an unbounded amount of time. >> > Other parts of the system sometimes hold a page lock and then do a >> > blocking operation. At least in the case of kcompactd there are better >> > uses of its time than waiting for any given page. >> > >> >> And, we have multiple "sync" mode to deal with latency requirement, for >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long >> >> latency. If you have latency requirement for some users, you may >> >> consider to add new "sync" mode. >> > >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I >> > guess my first thought would be to avoid adding a new mode and make >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to >> > wait for all the pages to be migrated can use the heavier sync modes. >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not >> > want to block for an unbounded amount of time here. What do you think? >> >> It appears that you can just use MIGRATE_ASYNC if you think the correct >> behavior is "NOT block at all". I found that there are more >> fine-grained controls on this in compaction code, please take a look at >> "enum compact_priority" and its comments. > > Actually, the more I think about it the more I think the right answer > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make > MIGRATE_SYNC_LIGHT not block on the folio lock. 
Then, what is the difference between MIGRATE_SYNC_LIGHT and MIGRATE_ASYNC? > kcompactd can accept some blocking but we don't want long / unbounded > blocking. Reading the comments for MIGRATE_SYNC_LIGHT, this also seems > like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time > of writepage() is too much. It's entirely plausible that someone else > holding the lock is doing something as slow as writepage() and thus > waiting on the lock can be just as bad for latency. IIUC, during writepage(), the page/folio will be unlocked. But, during page reading, the page/folio will be locked. I don't really understand why we can wait for page reading but cannot wait for page writeback. Best Regards, Huang, Ying > I'll try to send out a v2 with this approach today and we can see what > people think. > > -Doug
On 4/17/23 16:28, Doug Anderson wrote: > Hi, > > On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Doug Anderson <dianders@chromium.org> writes: >> >> > Hi, >> > >> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Douglas Anderson <dianders@chromium.org> writes: >> >> >> >> > Currently when we try to do page migration and we're in "synchronous" >> >> > mode (and not doing direct compaction) then we'll wait an infinite >> >> > amount of time for a page lock. This does not appear to be a great >> >> > idea. >> >> > >> >> > One issue can be seen when I put a device under extreme memory >> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram >> >> > swap). I ran the browser along with Android (which runs from a >> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran >> >> > the mmm_donut memory pressure tool [1]. The system is completely >> >> > unusable both with and without this patch since there are 8 processes >> >> > completely thrashing memory, but it was still interesting to look at >> >> > how migration was behaving. I put some timing code in and I could see >> >> > that we sometimes waited over 25 seconds (in the context of >> >> > kcompactd0) for a page lock to become available. Although the 25 >> >> > seconds was the high mark, it was easy to see tens, hundreds, or >> >> > thousands of milliseconds spent waiting on the lock. >> >> > >> >> > Instead of waiting, if I bailed out right away (as this patch does), I >> >> > could see kcompactd0 move forward to successfully to migrate other >> >> > pages instead. This seems like a better use of kcompactd's time. >> >> > >> >> > Thus, even though this didn't make the system any more usable in my >> >> > absurd test case, it still seemed to make migration behave better and >> >> > that feels like a win. It also makes the code simpler since we have >> >> > one fewer special case. >> >> >> >> TBH, the test case is too extreme for me. >> > >> > That's fair. That being said, I guess the point I was trying to make >> > is that waiting for this lock could take an unbounded amount of time. >> > Other parts of the system sometimes hold a page lock and then do a >> > blocking operation. At least in the case of kcompactd there are better >> > uses of its time than waiting for any given page. >> > >> >> And, we have multiple "sync" mode to deal with latency requirement, for >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long >> >> latency. If you have latency requirement for some users, you may >> >> consider to add new "sync" mode. >> > >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I >> > guess my first thought would be to avoid adding a new mode and make >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to >> > wait for all the pages to be migrated can use the heavier sync modes. >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not >> > want to block for an unbounded amount of time here. What do you think? >> >> It appears that you can just use MIGRATE_ASYNC if you think the correct >> behavior is "NOT block at all". I found that there are more >> fine-grained controls on this in compaction code, please take a look at >> "enum compact_priority" and its comments. > > Actually, the more I think about it the more I think the right answer > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make > MIGRATE_SYNC_LIGHT not block on the folio lock. 
kcompactd can accept > some blocking but we don't want long / unbounded blocking. Reading the > comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty > well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is > too much. It's entirely plausible that someone else holding the lock > is doing something as slow as writepage() and thus waiting on the lock > can be just as bad for latency. +Cc Mel for potential insights. Sounds like a good compromise at first glance, but it's a tricky area. Also there are other callers of migration than compaction, and we should make sure we are not breaking them unexpectedly. > I'll try to send out a v2 with this approach today and we can see what > people think. Please Cc Mel and myself for further versions. > -Doug >
Hi, On Mon, Apr 17, 2023 at 8:18 PM Huang, Ying <ying.huang@intel.com> wrote: > > Doug Anderson <dianders@chromium.org> writes: > > > Hi, > > > > On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Doug Anderson <dianders@chromium.org> writes: > >> > >> > Hi, > >> > > >> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Douglas Anderson <dianders@chromium.org> writes: > >> >> > >> >> > Currently when we try to do page migration and we're in "synchronous" > >> >> > mode (and not doing direct compaction) then we'll wait an infinite > >> >> > amount of time for a page lock. This does not appear to be a great > >> >> > idea. > >> >> > > >> >> > One issue can be seen when I put a device under extreme memory > >> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram > >> >> > swap). I ran the browser along with Android (which runs from a > >> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran > >> >> > the mmm_donut memory pressure tool [1]. The system is completely > >> >> > unusable both with and without this patch since there are 8 processes > >> >> > completely thrashing memory, but it was still interesting to look at > >> >> > how migration was behaving. I put some timing code in and I could see > >> >> > that we sometimes waited over 25 seconds (in the context of > >> >> > kcompactd0) for a page lock to become available. Although the 25 > >> >> > seconds was the high mark, it was easy to see tens, hundreds, or > >> >> > thousands of milliseconds spent waiting on the lock. > >> >> > > >> >> > Instead of waiting, if I bailed out right away (as this patch does), I > >> >> > could see kcompactd0 move forward to successfully to migrate other > >> >> > pages instead. This seems like a better use of kcompactd's time. > >> >> > > >> >> > Thus, even though this didn't make the system any more usable in my > >> >> > absurd test case, it still seemed to make migration behave better and > >> >> > that feels like a win. It also makes the code simpler since we have > >> >> > one fewer special case. > >> >> > >> >> TBH, the test case is too extreme for me. > >> > > >> > That's fair. That being said, I guess the point I was trying to make > >> > is that waiting for this lock could take an unbounded amount of time. > >> > Other parts of the system sometimes hold a page lock and then do a > >> > blocking operation. At least in the case of kcompactd there are better > >> > uses of its time than waiting for any given page. > >> > > >> >> And, we have multiple "sync" mode to deal with latency requirement, for > >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long > >> >> latency. If you have latency requirement for some users, you may > >> >> consider to add new "sync" mode. > >> > > >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I > >> > guess my first thought would be to avoid adding a new mode and make > >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to > >> > wait for all the pages to be migrated can use the heavier sync modes. > >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not > >> > want to block for an unbounded amount of time here. What do you think? > >> > >> It appears that you can just use MIGRATE_ASYNC if you think the correct > >> behavior is "NOT block at all". 
I found that there are more > >> fine-grained controls on this in compaction code, please take a look at > >> "enum compact_priority" and its comments. > > > > Actually, the more I think about it the more I think the right answer > > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make > > MIGRATE_SYNC_LIGHT not block on the folio lock. > > Then, what is the difference between MIGRATE_SYNC_LIGHT and > MIGRATE_ASYNC? Aren't there still some differences even if we remove blocking this one lock? ...or maybe your point is that maybe the other differences have similar properties? OK, so let's think about just using MIGRATE_ASYNC and either leaving MIGRATE_SYNC_LIGHT alone or deleting it (if there are no users left). The nice thing is that the only users of MIGRATE_SYNC_LIGHT are in "compaction.c" and there are only 3 places where it's specified. 1. kcompactd_do_work() - This is what I was analyzing and where I argued that indefinite blocking is less useful than simply trying to compact a different page. So sure, moving this to MIGRATE_ASYNC seems like it would be OK? 2. proactive_compact_node() - Just like kcompactd_do_work(), this is called from kcompactd and thus probably should have the same mode. 3. compact_zone_order() - This explicitly chooses between MIGRATE_SYNC_LIGHT and MIGRATE_ASYNC, so I guess I'd keep MIGRATE_SYNC_LIGHT just for this use case. It looks as if compact_zone_order() is called for direct compaction and thus making it synchronous can make sense, especially since it seems to go to the synchronous case after it failed with the async case. Ironically, though, the exact lock I was proposing to not wait on _isn't_ ever waited on in direct reclaim (see the comment in migrate_folio_unmap() about deadlock), but the other differences between SYNC_LIGHT and ASYNC come into play. If the above sounds correct then I'm OK w/ moving #1 and #2 to MIGRATE_ASYNC and leaving #3 as the sole user or MIGRATE_SYNC_LIGHT. > > kcompactd can accept some blocking but we don't want long / unbounded > > blocking. Reading the comments for MIGRATE_SYNC_LIGHT, this also seems > > like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time > > of writepage() is too much. It's entirely plausible that someone else > > holding the lock is doing something as slow as writepage() and thus > > waiting on the lock can be just as bad for latency. > > IIUC, during writepage(), the page/folio will be unlocked. > > But, during page reading, the page/folio will be locked. I don't really > understand why we can wait for page reading but cannot wait for page > writeback. I'm not sure I totally got your point here. It sorta sounds as if you're making the same point that I was? IIUC by waiting on the lock we may be implicitly waiting for someone to finish reading which seems as bad as waiting for writing. That was why I was arguing that with MIGRATE_SYNC_LIGHT (which says that waiting for the write was too slow) that we shouldn't wait for the lock (which may be blocking on a read). -Doug
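For reference, the three call sites enumerated above pick the mode through struct compact_control. Paraphrased from mm/compaction.c of that era, with field lists trimmed and names approximate:

```c
/* 1) kcompactd_do_work(): background compaction after kswapd wakeups */
struct compact_control cc = {
	.order		 = pgdat->kcompactd_max_order,
	.highest_zoneidx = pgdat->kcompactd_highest_zoneidx,
	.mode		 = MIGRATE_SYNC_LIGHT,
	/* ... */
};

/* 2) proactive_compact_node(): periodic proactive compaction */
struct compact_control cc = {
	.order		 = -1,
	.mode		 = MIGRATE_SYNC_LIGHT,
	.whole_zone	 = true,
	/* ... */
};

/* 3) compact_zone_order(): direct compaction; as sketched earlier, the
 *    priority selects MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT. */
```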
Doug Anderson <dianders@chromium.org> writes: > Hi, > > On Mon, Apr 17, 2023 at 8:18 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Doug Anderson <dianders@chromium.org> writes: >> >> > Hi, >> > >> > On Sun, Apr 16, 2023 at 6:15 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Doug Anderson <dianders@chromium.org> writes: >> >> >> >> > Hi, >> >> > >> >> > On Thu, Apr 13, 2023 at 8:10 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Douglas Anderson <dianders@chromium.org> writes: >> >> >> >> >> >> > Currently when we try to do page migration and we're in "synchronous" >> >> >> > mode (and not doing direct compaction) then we'll wait an infinite >> >> >> > amount of time for a page lock. This does not appear to be a great >> >> >> > idea. >> >> >> > >> >> >> > One issue can be seen when I put a device under extreme memory >> >> >> > pressure. I took a sc7180-trogdor Chromebook (4GB RAM, 8GB zram >> >> >> > swap). I ran the browser along with Android (which runs from a >> >> >> > loopback mounted 128K block-size squashfs "disk"). I then manually ran >> >> >> > the mmm_donut memory pressure tool [1]. The system is completely >> >> >> > unusable both with and without this patch since there are 8 processes >> >> >> > completely thrashing memory, but it was still interesting to look at >> >> >> > how migration was behaving. I put some timing code in and I could see >> >> >> > that we sometimes waited over 25 seconds (in the context of >> >> >> > kcompactd0) for a page lock to become available. Although the 25 >> >> >> > seconds was the high mark, it was easy to see tens, hundreds, or >> >> >> > thousands of milliseconds spent waiting on the lock. >> >> >> > >> >> >> > Instead of waiting, if I bailed out right away (as this patch does), I >> >> >> > could see kcompactd0 move forward to successfully to migrate other >> >> >> > pages instead. This seems like a better use of kcompactd's time. >> >> >> > >> >> >> > Thus, even though this didn't make the system any more usable in my >> >> >> > absurd test case, it still seemed to make migration behave better and >> >> >> > that feels like a win. It also makes the code simpler since we have >> >> >> > one fewer special case. >> >> >> >> >> >> TBH, the test case is too extreme for me. >> >> > >> >> > That's fair. That being said, I guess the point I was trying to make >> >> > is that waiting for this lock could take an unbounded amount of time. >> >> > Other parts of the system sometimes hold a page lock and then do a >> >> > blocking operation. At least in the case of kcompactd there are better >> >> > uses of its time than waiting for any given page. >> >> > >> >> >> And, we have multiple "sync" mode to deal with latency requirement, for >> >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long >> >> >> latency. If you have latency requirement for some users, you may >> >> >> consider to add new "sync" mode. >> >> > >> >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I >> >> > guess my first thought would be to avoid adding a new mode and make >> >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to >> >> > wait for all the pages to be migrated can use the heavier sync modes. >> >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not >> >> > want to block for an unbounded amount of time here. What do you think? >> >> >> >> It appears that you can just use MIGRATE_ASYNC if you think the correct >> >> behavior is "NOT block at all". 
I found that there are more >> >> fine-grained controls on this in compaction code, please take a look at >> >> "enum compact_priority" and its comments. >> > >> > Actually, the more I think about it the more I think the right answer >> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make >> > MIGRATE_SYNC_LIGHT not block on the folio lock. >> >> Then, what is the difference between MIGRATE_SYNC_LIGHT and >> MIGRATE_ASYNC? > > Aren't there still some differences even if we remove blocking this > one lock? ...or maybe your point is that maybe the other differences > have similar properties? Sorry for confusing words. Here, I asked you to list the implementation difference between MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT after your proposed changes. Which are waited in MIGRATE_SYNC_LIGHT but not in MIGRATE_ASYNC? > OK, so let's think about just using MIGRATE_ASYNC and either leaving > MIGRATE_SYNC_LIGHT alone or deleting it (if there are no users left). > The nice thing is that the only users of MIGRATE_SYNC_LIGHT are in > "compaction.c" and there are only 3 places where it's specified. > > 1. kcompactd_do_work() - This is what I was analyzing and where I > argued that indefinite blocking is less useful than simply trying to > compact a different page. So sure, moving this to MIGRATE_ASYNC seems > like it would be OK? > > 2. proactive_compact_node() - Just like kcompactd_do_work(), this is > called from kcompactd and thus probably should have the same mode. > > 3. compact_zone_order() - This explicitly chooses between > MIGRATE_SYNC_LIGHT and MIGRATE_ASYNC, so I guess I'd keep > MIGRATE_SYNC_LIGHT just for this use case. It looks as if > compact_zone_order() is called for direct compaction and thus making > it synchronous can make sense, especially since it seems to go to the > synchronous case after it failed with the async case. Ironically, > though, the exact lock I was proposing to not wait on _isn't_ ever > waited on in direct reclaim (see the comment in migrate_folio_unmap() > about deadlock), but the other differences between SYNC_LIGHT and > ASYNC come into play. > > If the above sounds correct then I'm OK w/ moving #1 and #2 to > MIGRATE_ASYNC and leaving #3 as the sole user or MIGRATE_SYNC_LIGHT. > >> > kcompactd can accept some blocking but we don't want long / unbounded >> > blocking. Reading the comments for MIGRATE_SYNC_LIGHT, this also seems >> > like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time >> > of writepage() is too much. It's entirely plausible that someone else >> > holding the lock is doing something as slow as writepage() and thus >> > waiting on the lock can be just as bad for latency. >> >> IIUC, during writepage(), the page/folio will be unlocked. >> >> But, during page reading, the page/folio will be locked. I don't really >> understand why we can wait for page reading but cannot wait for page >> writeback. > > I'm not sure I totally got your point here. It sorta sounds as if > you're making the same point that I was? Yes, kind of. It is a question, not conclusion. > IIUC by waiting on the lock > we may be implicitly waiting for someone to finish reading which seems > as bad as waiting for writing. That was why I was arguing that with > MIGRATE_SYNC_LIGHT (which says that waiting for the write was too > slow) that we shouldn't wait for the lock (which may be blocking on a > read). Best Regards, Huang, Ying
Hi, On Tue, Apr 18, 2023 at 5:34 PM Huang, Ying <ying.huang@intel.com> wrote: > > >> >> >> TBH, the test case is too extreme for me. > >> >> > > >> >> > That's fair. That being said, I guess the point I was trying to make > >> >> > is that waiting for this lock could take an unbounded amount of time. > >> >> > Other parts of the system sometimes hold a page lock and then do a > >> >> > blocking operation. At least in the case of kcompactd there are better > >> >> > uses of its time than waiting for any given page. > >> >> > > >> >> >> And, we have multiple "sync" mode to deal with latency requirement, for > >> >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long > >> >> >> latency. If you have latency requirement for some users, you may > >> >> >> consider to add new "sync" mode. > >> >> > > >> >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I > >> >> > guess my first thought would be to avoid adding a new mode and make > >> >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to > >> >> > wait for all the pages to be migrated can use the heavier sync modes. > >> >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not > >> >> > want to block for an unbounded amount of time here. What do you think? > >> >> > >> >> It appears that you can just use MIGRATE_ASYNC if you think the correct > >> >> behavior is "NOT block at all". I found that there are more > >> >> fine-grained controls on this in compaction code, please take a look at > >> >> "enum compact_priority" and its comments. > >> > > >> > Actually, the more I think about it the more I think the right answer > >> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make > >> > MIGRATE_SYNC_LIGHT not block on the folio lock. > >> > >> Then, what is the difference between MIGRATE_SYNC_LIGHT and > >> MIGRATE_ASYNC? > > > > Aren't there still some differences even if we remove blocking this > > one lock? ...or maybe your point is that maybe the other differences > > have similar properties? > > Sorry for confusing words. Here, I asked you to list the implementation > difference between MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT after your > proposed changes. Which are waited in MIGRATE_SYNC_LIGHT but not in > MIGRATE_ASYNC? Ah, got it! It's not always the easiest to follow all the code paths, but let's see what I can find. I guess to start with, though, I will assert that someone seems to have believed that there was an important difference between MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT besides waiting on the lock in migrate_folio_unmapt() since (as I found in my previous digging) the "direct reclaim" path never grabs this lock but explicitly sometimes chooses MIGRATE_ASYNC some times and MIGRATE_SYNC_LIGHT other times. OK, so looking at mainline Linux and comparing differences in behavior between SYNC_LIGHT and ASYNC and thoughts about which one should be used for kcompactd. Note that I won't go _too_ deep into all the differences... -- In nfs.c: 1. We will wait for the fscache if SYNC_LIGHT but not ASYNC. No idea what would be the most ideal for calls from kcompactd. In compaction.c: 2. We will update the non-async "compact_cached_migrate_pfn" for SYNC_LIGHT but not ASYNC since we keep track of sync and async progress separately. 3. compact_lock_irqsave() note contentions for ASYNC but not SYNC_LIGHT and cause an earlier abort. Seems like kcompactd would want the SYNC_LIGHT behavior since this isn't about things indefinitely blocking. 4. 
isolate_migratepages_block() will bail if too_many_isolated() for ASYNC but not SYNC_LIGHT. My hunch is that kcompactd wants the SYNC_LIGHT behavior for kcompact. 5. If in direct compaction, isolate_migratepages_block() sets "skip_on_failure" for ASYNC but not SYNC_LIGHT. My hunch is that kcompactd wants the SYNC_LIGHT behavior for kcompact. 6. suitable_migration_source() considers more things suitable migration sources when SYNC_LIGHT but not (ASYNC+direct_compaction). Doesn't matter since kcompactd isn't direct compaction and non-direct compaction is the same. 7. fast_isolate_around() does less scanning when SYNC_LIGHT but not (ASYNC+direct_compaction). Again, it doesn't matter for kcompactd. 8. isolate_freepages() uses a different stride with SYNC_LIGHT vs. ASYNC. My hunch is that kcompactd wants the SYNC_LIGHT behavior for kcompact. In migrate.c: 9. buffer_migrate_lock_buffers() will block waiting to lock buffers with SYNC_LIGHT but not ASYNC. I don't know for sure, but this feels like something we _wouldn't_ want to block on in kcompactd and instead should look for easier pickings. 10. migrate_folio_unmap() has the case we've already talked about 11. migrate_folio_unmap() does batch flushes for async because it doesn't need to worry about a class of deadlock. Somewhat recent code actually ends up running this code first even for sync modes to get the batch. 12. We'll retry a few more times for SYNC_LIGHT than ASYNC. Seems like the more retries won't really hurt for kcompactd. -- So from looking at all the above, I'll say that kcompactd should stick with SYNC_LIGHT and we should fix #10. In other words, like my original patch except that we keep blocking on the lock in the full SYNC modes. It's possible that we should also change case #9 I listed above. Do you know if locking buffers is likely to block on something as slow as page reading/writing? -Doug
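On item 9, the buffer-head path in question looks roughly like this; paraphrased and simplified from buffer_migrate_lock_buffers() in mm/migrate.c, so take the details with a grain of salt:

```c
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
					enum migrate_mode mode)
{
	struct buffer_head *bh = head;

	/* Any sync mode, including SYNC_LIGHT: sleep until every buffer
	 * lock on the page is acquired, however long that takes. */
	if (mode != MIGRATE_ASYNC) {
		do {
			lock_buffer(bh);
			bh = bh->b_this_page;
		} while (bh != head);
		return true;
	}

	/* MIGRATE_ASYNC: trylock only; on contention, back out and fail. */
	do {
		if (!trylock_buffer(bh)) {
			struct buffer_head *failed = bh;

			for (bh = head; bh != failed; bh = bh->b_this_page)
				unlock_buffer(bh);
			return false;
		}
		bh = bh->b_this_page;
	} while (bh != head);
	return true;
}
```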
Doug Anderson <dianders@chromium.org> writes: > Hi, > > On Tue, Apr 18, 2023 at 5:34 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> TBH, the test case is too extreme for me. >> >> >> > >> >> >> > That's fair. That being said, I guess the point I was trying to make >> >> >> > is that waiting for this lock could take an unbounded amount of time. >> >> >> > Other parts of the system sometimes hold a page lock and then do a >> >> >> > blocking operation. At least in the case of kcompactd there are better >> >> >> > uses of its time than waiting for any given page. >> >> >> > >> >> >> >> And, we have multiple "sync" mode to deal with latency requirement, for >> >> >> >> example, we use MIGRATE_SYNC_LIGHT for compaction to avoid too long >> >> >> >> latency. If you have latency requirement for some users, you may >> >> >> >> consider to add new "sync" mode. >> >> >> > >> >> >> > Sure. kcompactd_do_work() is currently using MIGRATE_SYNC_LIGHT. I >> >> >> > guess my first thought would be to avoid adding a new mode and make >> >> >> > MIGRATE_SYNC_LIGHT not block here. Then anyone that truly needs to >> >> >> > wait for all the pages to be migrated can use the heavier sync modes. >> >> >> > It seems to me like the current users of MIGRATE_SYNC_LIGHT would not >> >> >> > want to block for an unbounded amount of time here. What do you think? >> >> >> >> >> >> It appears that you can just use MIGRATE_ASYNC if you think the correct >> >> >> behavior is "NOT block at all". I found that there are more >> >> >> fine-grained controls on this in compaction code, please take a look at >> >> >> "enum compact_priority" and its comments. >> >> > >> >> > Actually, the more I think about it the more I think the right answer >> >> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make >> >> > MIGRATE_SYNC_LIGHT not block on the folio lock. >> >> >> >> Then, what is the difference between MIGRATE_SYNC_LIGHT and >> >> MIGRATE_ASYNC? >> > >> > Aren't there still some differences even if we remove blocking this >> > one lock? ...or maybe your point is that maybe the other differences >> > have similar properties? >> >> Sorry for confusing words. Here, I asked you to list the implementation >> difference between MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT after your >> proposed changes. Which are waited in MIGRATE_SYNC_LIGHT but not in >> MIGRATE_ASYNC? > > Ah, got it! It's not always the easiest to follow all the code paths, > but let's see what I can find. > > I guess to start with, though, I will assert that someone seems to > have believed that there was an important difference between > MIGRATE_ASYNC and MIGRATE_SYNC_LIGHT besides waiting on the lock in > migrate_folio_unmapt() since (as I found in my previous digging) the > "direct reclaim" path never grabs this lock but explicitly sometimes > chooses MIGRATE_ASYNC some times and MIGRATE_SYNC_LIGHT other times. > > OK, so looking at mainline Linux and comparing differences in behavior > between SYNC_LIGHT and ASYNC and thoughts about which one should be > used for kcompactd. Note that I won't go _too_ deep into all the > differences... > > -- > > In nfs.c: > > 1. We will wait for the fscache if SYNC_LIGHT but not ASYNC. No idea > what would be the most ideal for calls from kcompactd. This appears like something like disk writing. > In compaction.c: > > 2. We will update the non-async "compact_cached_migrate_pfn" for > SYNC_LIGHT but not ASYNC since we keep track of sync and async > progress separately. > > 3. 
compact_lock_irqsave() note contentions for ASYNC but not > SYNC_LIGHT and cause an earlier abort. Seems like kcompactd would want > the SYNC_LIGHT behavior since this isn't about things indefinitely > blocking. > > 4. isolate_migratepages_block() will bail if too_many_isolated() for > ASYNC but not SYNC_LIGHT. My hunch is that kcompactd wants the > SYNC_LIGHT behavior for kcompact. > > 5. If in direct compaction, isolate_migratepages_block() sets > "skip_on_failure" for ASYNC but not SYNC_LIGHT. My hunch is that > kcompactd wants the SYNC_LIGHT behavior for kcompact. > > 6. suitable_migration_source() considers more things suitable > migration sources when SYNC_LIGHT but not (ASYNC+direct_compaction). > Doesn't matter since kcompactd isn't direct compaction and non-direct > compaction is the same. > > 7. fast_isolate_around() does less scanning when SYNC_LIGHT but not > (ASYNC+direct_compaction). Again, it doesn't matter for kcompactd. > > 8. isolate_freepages() uses a different stride with SYNC_LIGHT vs. > ASYNC. My hunch is that kcompactd wants the SYNC_LIGHT behavior for > kcompact. > > In migrate.c: > > 9. buffer_migrate_lock_buffers() will block waiting to lock buffers > with SYNC_LIGHT but not ASYNC. I don't know for sure, but this feels > like something we _wouldn't_ want to block on in kcompactd and instead > should look for easier pickings. IIUC, this is similar as page lock. > 10. migrate_folio_unmap() has the case we've already talked about > > 11. migrate_folio_unmap() does batch flushes for async because it > doesn't need to worry about a class of deadlock. Somewhat recent code > actually ends up running this code first even for sync modes to get > the batch. > > 12. We'll retry a few more times for SYNC_LIGHT than ASYNC. Seems like > the more retries won't really hurt for kcompactd. > > -- > > So from looking at all the above, I'll say that kcompactd should stick > with SYNC_LIGHT and we should fix #10. In other words, like my > original patch except that we keep blocking on the lock in the full > SYNC modes. > > It's possible that we should also change case #9 I listed above. Do > you know if locking buffers is likely to block on something as slow as > page reading/writing? IIUC, this is related to page reading/writing. Buffer head is used by ext2/4 to read/write. Thank you very much for your research. It looks like ASYNC isn't appropriate for kcompactd. From the comments of SYNC_LIGHT, * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking * on most operations but not ->writepage as the potential stall time * is too significant To make SYNC_LIGHT block on less operations than before, I guess that you need to prove the stall time can be long with the operation with not-so-extreme test cases. Best Regards, Huang, Ying
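The mode comment being quoted above comes from include/linux/migrate_mode.h; roughly, with the header's comments abbreviated:

```c
enum migrate_mode {
	MIGRATE_ASYNC,		/* never block */
	MIGRATE_SYNC_LIGHT,	/* allow blocking on most operations, but not
				 * ->writepage: that stall is too significant */
	MIGRATE_SYNC,		/* block on everything while migrating */
	MIGRATE_SYNC_NO_COPY,	/* like SYNC, but the actual copy happens
				 * outside the core (e.g. device DMA) */
};
```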
On Tue, Apr 18, 2023 at 11:17:23AM +0200, Vlastimil Babka wrote: > > Actually, the more I think about it the more I think the right answer > > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make > > MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept > > some blocking but we don't want long / unbounded blocking. Reading the > > comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty > > well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is > > too much. It's entirely plausible that someone else holding the lock > > is doing something as slow as writepage() and thus waiting on the lock > > can be just as bad for latency. > > +Cc Mel for potential insights. Sounds like a good compromise at first > glance, but it's a tricky area. > Also there are other callers of migration than compaction, and we should > make sure we are not breaking them unexpectedly. > It's tricky because part of the point of SYNC_LIGHT was to block on some operations but not for too long. writepage was considered to be an exception because it can be very slow for a variety of reasons. I think At the time that writeback from reclaim context was possible and it was very inefficient, more more inefficient than reads. Storage can also be very slow (e.g. USB stick plugged in) and there can be large differences between read/write performance (SMR, some ssd etc). Pages read were generally clean and could be migrated, pages for write may be redirtied etc. It was assumed that while both read/write could lock the page for a long time, write had worse hold times and most other users of lock page were transient. A compromise for SYNC_LIGHT or even SYNC on lock page would be to try locking with a timeout. I don't think there is a specific helper but it should be possible to trylock, wait on the folio_waitqueue and attempt once to get the lock again. I didn't look very closely but it would doing something similar to folio_wait_bit_common() with io_schedule_timeout instead of io_schedule. This will have false positives because the folio waitqueue may be woken for unrelated pages and obviously it can race with other wait queues. kcompactd is an out-of-line wait and can afford to wait for a long time without user-visible impact but 120 seconds or any potentially unbounded length of time is ridiculous and unhelpful. I would still be wary about adding new sync modes or making major modifications to kcompactd because detecting application stalls due to a kcompactd modification is difficult. There is another approach -- kcompactd or proactive under heavy memory pressure is probably a waste of CPU time and resources and should avoid or minimise effort when under pressure. While direct compaction can capture a page for immediate use, kcompactd and proactive reclaim are just shuffling memory around for *potential* users and may be making the overall memory pressure even worse. If memory pressure detection was better and proactive/kcompactd reclaim bailed then the unbounded time to lock a page is mitigated or completely avoided. The two options are ortogonal. trylock_page_timeout() would mitigate worst case behaviour while memory pressure detection and bailing on SYNC* compaction side-steps the issue in extreme cases. Both have value as pressure detection would be best-effort and trylock_page_timeout() would limit worst-case behaviour. > > I'll try to send out a v2 with this approach today and we can see what > > people think. > > Please Cc Mel and myself for further versions. > Yes please.
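No such helper existed at the time, so the sketch below is purely hypothetical: it approximates Mel's trylock-with-timeout idea using crude polling rather than the real folio waitqueue plumbing. A proper version would live next to folio_wait_bit_common() in mm/filemap.c and sleep on the hashed waitqueue with io_schedule_timeout():

```c
/* Hypothetical: bounded attempt to take the folio lock (timeout in jiffies). */
static int folio_lock_timeout(struct folio *folio, long timeout)
{
	long step = max(1L, (long)(HZ / 100));	/* roughly 10ms per retry */

	while (!folio_trylock(folio)) {
		if (timeout <= 0)
			return -EAGAIN;	/* still contended; let the caller bail */
		/*
		 * Crude stand-in for sleeping on the folio's waitqueue.
		 * Spurious wakeups and races with other lockers don't
		 * matter; the caller simply gives up once the budget is
		 * spent.
		 */
		schedule_timeout_uninterruptible(min(step, timeout));
		timeout -= step;
	}
	return 0;
}
```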
Hi,

On Thu, Apr 20, 2023 at 3:23 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Apr 18, 2023 at 11:17:23AM +0200, Vlastimil Babka wrote:
> > > Actually, the more I think about it the more I think the right answer is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept some blocking but we don't want long / unbounded blocking. Reading the comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is too much. It's entirely plausible that someone else holding the lock is doing something as slow as writepage() and thus waiting on the lock can be just as bad for latency.
> >
> > +Cc Mel for potential insights. Sounds like a good compromise at first glance, but it's a tricky area.
> > Also there are other callers of migration than compaction, and we should make sure we are not breaking them unexpectedly.
> >
>
> It's tricky because part of the point of SYNC_LIGHT was to block on some operations but not for too long. writepage was considered to be an exception because it can be very slow for a variety of reasons. I think at the time, writeback from reclaim context was possible and it was very inefficient, much more inefficient than reads. Storage can also be very slow (e.g. a USB stick plugged in) and there can be large differences between read/write performance (SMR, some SSDs, etc.). Pages being read were generally clean and could be migrated; pages being written may be redirtied, etc. It was assumed that while both read/write could lock the page for a long time, write had worse hold times and most other users of the page lock were transient.

I think some of the slowness comes from the complex ways that systems like ChromeOS currently work. As mentioned in the commit message of my RFC patch, ChromeOS currently runs Android programs out of a 128K-block, zlib-compressed squashfs disk. That squashfs disk is actually a loopback mounted file on the main ext4 filesystem, which is stored on something like eMMC.

If I understand everything correctly, if we get a page fault on memory backed by this squashfs filesystem, then we can end up holding a page/folio lock and then trying to read a pile of pages (enough to decompress the whole 128K block). ...but we don't read them directly; we instead block on ext4, which might need to allocate memory, and then finally block on the block driver completing the task. This whole sequence of things is not necessarily fast. I think this is responsible for some of the very large numbers that were part of my original patch description.

Without the above squashfs setup, we can still run into slow times, but not quite as bad. I tried running a ChromeOS "memory pressure" test against a mainline kernel plus _just_ the browser (Android disabled). The test eventually opened over 90 tabs on my 4GB system and the system got pretty janky, but still barely usable. While running the test, I saw dozens of cases of folio_lock() taking over 10 ms and quite a few (~10?) of it taking over 100 ms. The peak I saw was ~380 ms. I also happened to time buffer locking. That was much less bad, with the worst case topping out at ~70 ms.

I'm not sure what timeout you'd consider to be bad. 10 ms? 20 ms?

Also as a side note: I ran the same memory pressure test but _with_ Android running (though it doesn't actually stress Android; it's just running in the background).
To do this I had to run a downstream kernel. Here it was easy to see a ~1.7 second wait on the page lock without any ridiculous amount of stressing. ...and a ~1.5 second wait for the buffer lock, too.

> A compromise for SYNC_LIGHT, or even SYNC, on the page lock would be to try locking with a timeout. I don't think there is a specific helper, but it should be possible to trylock, wait on the folio_waitqueue and attempt once to get the lock again. I didn't look very closely, but it would mean doing something similar to folio_wait_bit_common() with io_schedule_timeout() instead of io_schedule(). This will have false positives because the folio waitqueue may be woken for unrelated pages, and obviously it can race with other wait queues.
>
> kcompactd is an out-of-line wait and can afford to wait for a long time without user-visible impact, but 120 seconds or any potentially unbounded length of time is ridiculous and unhelpful. I would still be wary about adding new sync modes or making major modifications to kcompactd because detecting application stalls due to a kcompactd modification is difficult.

OK, I'll give this a shot. It doesn't look too hard, but we'll see.

> There is another approach -- kcompactd or proactive compaction under heavy memory pressure is probably a waste of CPU time and resources and should avoid or minimise effort when under pressure. While direct compaction can capture a page for immediate use, kcompactd and proactive compaction are just shuffling memory around for *potential* users and may be making the overall memory pressure even worse. If memory pressure detection were better and proactive compaction/kcompactd bailed, then the unbounded time to lock a page would be mitigated or completely avoided.

I probably won't try to take this on, though it does sound like a good idea for future research.
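For reference, a bounded-lock helper along the lines sketched earlier in the thread might slot into migrate_folio_unmap() roughly as follows. This is purely illustrative: folio_trylock_timeout() is hypothetical, the 20 ms value is arbitrary, and the actual structure being modified is shown in the original RFC diff below.

	if (!folio_trylock(src)) {
		if (mode == MIGRATE_ASYNC)
			goto out;
		/* Direct compaction must still never sleep on the lock. */
		if (current->flags & PF_MEMALLOC)
			goto out;

		/*
		 * SYNC_LIGHT/SYNC: wait for the lock, but only for a bounded
		 * time, then skip this folio rather than stalling kcompactd
		 * indefinitely.  (folio_trylock_timeout() is hypothetical.)
		 */
		if (!folio_trylock_timeout(src, msecs_to_jiffies(20)))
			goto out;
	}
	locked = true;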
diff --git a/mm/migrate.c b/mm/migrate.c
index db3f154446af..dfb0a6944181 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1143,26 +1143,15 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
 	dst->private = NULL;
 
 	if (!folio_trylock(src)) {
-		if (mode == MIGRATE_ASYNC)
-			goto out;
-
 		/*
-		 * It's not safe for direct compaction to call lock_page.
-		 * For example, during page readahead pages are added locked
-		 * to the LRU. Later, when the IO completes the pages are
-		 * marked uptodate and unlocked. However, the queueing
-		 * could be merging multiple pages for one bio (e.g.
-		 * mpage_readahead). If an allocation happens for the
-		 * second or third page, the process can end up locking
-		 * the same page twice and deadlocking. Rather than
-		 * trying to be clever about what pages can be locked,
-		 * avoid the use of lock_page for direct compaction
-		 * altogether.
+		 * While there are some modes we could be running in where we
+		 * could block here waiting for the lock (specifically
+		 * modes other than MIGRATE_ASYNC and when we're not in
+		 * direct compaction), it's not worth the wait. Instead of
+		 * waiting, we'll bail. This will let the caller try to
+		 * migrate some other pages that aren't contended.
 		 */
-		if (current->flags & PF_MEMALLOC)
-			goto out;
-
-		folio_lock(src);
+		goto out;
 	}
 	locked = true;