Message ID | 20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid |
---|---|
State | New |
Headers |
From: Douglas Anderson <dianders@chromium.org>
To: Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mgorman@techsingularity.net>, Vlastimil Babka <vbabka@suse.cz>, Huang Ying <ying.huang@intel.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Yu Zhao <yuzhao@google.com>, linux-fsdevel@vger.kernel.org, Matthew Wilcox <willy@infradead.org>, Douglas Anderson <dianders@chromium.org>
Subject: [PATCH v2 1/4] mm/filemap: Add folio_lock_timeout()
Date: Fri, 21 Apr 2023 15:12:45 -0700
Message-ID: <20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid>
In-Reply-To: <20230421221249.1616168-1-dianders@chromium.org>
References: <20230421221249.1616168-1-dianders@chromium.org> |
Series | migrate: Avoid unbounded blocks in MIGRATE_SYNC_LIGHT |
Commit Message
Doug Anderson
April 21, 2023, 10:12 p.m. UTC
Add a variant of folio_lock() that can timeout. This is useful to
avoid unbounded waits for the page lock in kcompactd.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Changes in v2:
- "Add folio_lock_timeout()" new for v2.
include/linux/pagemap.h | 16 ++++++++++++++
mm/filemap.c | 47 +++++++++++++++++++++++++++++------------
2 files changed, 50 insertions(+), 13 deletions(-)
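
As a rough sketch of the intended use (the caller and the HZ/4 budget below
are illustrative assumptions, not part of this patch; the real kcompactd
wiring comes later in the series):

	/* Hypothetical caller: give up on a contended folio after ~250 ms. */
	static int example_process_folio(struct folio *folio)
	{
		int ret;

		ret = folio_lock_timeout(folio, HZ / 4);
		if (ret)
			return ret;	/* -ETIMEDOUT or -EINTR: skip this folio */

		/* ... operate on the locked folio ... */
		folio_unlock(folio);
		return 0;
	}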
Comments
Douglas Anderson <dianders@chromium.org> writes:

> Add a variant of folio_lock() that can timeout. This is useful to
> avoid unbounded waits for the page lock in kcompactd.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

[snip]

> +/**
> + * folio_lock_timeout() - Lock this folio, with a timeout.
> + * @folio: The folio to lock.
> + * @timeout: The timeout in jiffies; %MAX_SCHEDULE_TIMEOUT means wait forever.
> + *
> + * Return: 0 upon success; -ETIMEDOUT upon failure.

IIUC, the function may return -EINTR too.

Otherwise looks good to me. Thanks!

Best Regards,
Huang, Ying

> + */
> +static inline int folio_lock_timeout(struct folio *folio, long timeout)
> +{
> +	might_sleep();
> +	if (!folio_trylock(folio))
> +		return __folio_lock_timeout(folio, timeout);
> +	return 0;
> +}

[ rest of the quoted patch snipped ]
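
Presumably the fix for this note is a one-line kernel-doc update; a sketch of
how the Return: description might read (wording assumed, not taken from a
posted revision):

	/**
	 * folio_lock_timeout() - Lock this folio, with a timeout.
	 * @folio: The folio to lock.
	 * @timeout: The timeout in jiffies; %MAX_SCHEDULE_TIMEOUT means wait forever.
	 *
	 * Return: 0 upon success; -ETIMEDOUT if the timeout expired while
	 * waiting; -EINTR if a fatal signal was received while waiting.
	 */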
On Fri, Apr 21, 2023 at 03:12:45PM -0700, Douglas Anderson wrote:
> Add a variant of folio_lock() that can timeout. This is useful to
> avoid unbounded waits for the page lock in kcompactd.
>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

[snip]

> +/**
> + * folio_lock_timeout() - Lock this folio, with a timeout.
> + * @folio: The folio to lock.
> + * @timeout: The timeout in jiffies; %MAX_SCHEDULE_TIMEOUT means wait forever.
> + *
> + * Return: 0 upon success; -ETIMEDOUT upon failure.
> + */

May return -EINTR?

[snip]

> @@ -1295,10 +1296,13 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
>  		/* Loop until we've been woken or interrupted */
>  		flags = smp_load_acquire(&wait->flags);
>  		if (!(flags & WQ_FLAG_WOKEN)) {
> +			if (!timeout)
> +				break;
> +

An io_schedule_timeout of 0 is valid so why the special handling? It's
negative timeouts that cause schedule_timeout() to complain.

[ rest of the quoted patch snipped ]
Hi,

On Mon, Apr 24, 2023 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > @@ -1295,10 +1296,13 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
> > 		/* Loop until we've been woken or interrupted */
> > 		flags = smp_load_acquire(&wait->flags);
> > 		if (!(flags & WQ_FLAG_WOKEN)) {
> > +			if (!timeout)
> > +				break;
> > +
>
> An io_schedule_timeout of 0 is valid so why the special handling? It's
> negative timeouts that cause schedule_timeout() to complain.

It's not expected that the caller passes in a timeout of 0 here. The test
here actually handles the case that the previous call to
io_schedule_timeout() returned 0. In my patch, after the call to
io_schedule_timeout() we unconditionally "continue" and end up back at the
top of the loop. The next time through the loop if we don't see the WOKEN
flag then we'll check for the two "error" conditions (timeout or signal
pending) and break for either of them.

To make it clearer, I'll add this comment for the next version:

	/* Break if the last io_schedule_timeout() said no time left */
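
Putting that together, here is the core of the wait loop as the patch leaves
it, with the proposed comment folded in (a condensed excerpt for reference,
not a standalone function):

	for (;;) {
		/* Loop until we've been woken or interrupted */
		flags = smp_load_acquire(&wait->flags);
		if (!(flags & WQ_FLAG_WOKEN)) {
			/* Break if the last io_schedule_timeout() said no time left */
			if (!timeout)
				break;

			if (signal_pending_state(state, current))
				break;

			/* Returns the remaining budget; 0 once the wait has timed out */
			timeout = io_schedule_timeout(timeout);
			continue;
		}

		/* WQ_FLAG_WOKEN is set; the rest of the loop is unchanged */
	}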
Hi,

On Fri, Apr 21, 2023 at 10:19 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On 21 Apr 2023 15:12:45 -0700 Douglas Anderson <dianders@chromium.org>
> > Add a variant of folio_lock() that can timeout. This is useful to
> > avoid unbounded waits for the page lock in kcompactd.
>
> Given no mutex_lock_timeout() (perhaps because timeout makes no sense for
> spinlock), I suspect your fix lies in the right layer.

I'm not 100% sure I understood the above comment, but I think you're saying
that the approach my patch takes seems OK.

> If waiting for page under IO causes trouble for you, another simpler
> option is make IO faster (perhaps all you can do) for instance.

Yeah, this gets into the discussion about whether our current squashfs
settings actually make sense. I suspect that they don't and that we should
look into EROFS like Gao suggested, or at least choose different squashfs
settings (smaller block sizes, ZSTD instead of zlib). Unfortunately I
believe that the current squashfs settings were chosen because disk space
is a concern.

> If kcompactd is waken up by kswapd, waiting for slow IO is the right
> thing to do.

I don't have enough intuition here, so I'm happy to take others' advice
here. I guess my thought was that kcompactd is explicitly not using the
full "sync" and instead choosing the "sync light". To me that means we
shouldn't block for _too_ long.

-Doug
On Mon, Apr 24, 2023 at 09:22:55AM -0700, Doug Anderson wrote:
> Hi,
>
> On Mon, Apr 24, 2023 at 1:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > > @@ -1295,10 +1296,13 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
> > > 		/* Loop until we've been woken or interrupted */
> > > 		flags = smp_load_acquire(&wait->flags);
> > > 		if (!(flags & WQ_FLAG_WOKEN)) {
> > > +			if (!timeout)
> > > +				break;
> > > +
> >
> > An io_schedule_timeout of 0 is valid so why the special handling? It's
> > negative timeouts that cause schedule_timeout() to complain.
>
> It's not expected that the caller passes in a timeout of 0 here. The test
> here actually handles the case that the previous call to
> io_schedule_timeout() returned 0. In my patch, after the call to
> io_schedule_timeout() we unconditionally "continue" and end up back at
> the top of the loop. The next time through the loop if we don't see
> the WOKEN flag then we'll check for the two "error" conditions
> (timeout or signal pending) and break for either of them.

Ah, I see!

> To make it clearer, I'll add this comment for the next version:
>
> 	/* Break if the last io_schedule_timeout() said no time left */

Yes please.
Hi,

On Mon, Apr 24, 2023 at 6:09 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On 24 Apr 2023 09:56:58 -0700 Douglas Anderson <dianders@chromium.org>
> > On Fri, Apr 21, 2023 at 10:19 PM Hillf Danton <hdanton@sina.com> wrote:
> > > If kcompactd is waken up by kswapd, waiting for slow IO is the right
> > > thing to do.
> >
> > I don't have enough intuition here, so I'm happy to take others'
> > advice here. I guess my thought was that kcompactd is explicitly not
> > using the full "sync" and instead choosing the "sync light". To me
> > that means we shouldn't block for _too_ long.
>
> Take a look at another case of lock wait [1].
>
> [1] https://lore.kernel.org/lkml/CAHk-=wgyL9OujQ72er7oXt_VsMeno4bMKCTydBT1WSaagZ_5CA@mail.gmail.com/

So is this an explicit NAK on this approach, then? It still feels
worthwhile to me given the current kcompactd design where there is a single
thread that's in charge of going through and cleaning up all of memory. Any
single page isn't _that_ important for kcompactd to deal with and it's nice
not to block the whole task's ability to make progress. kcompactd is
already very much designed in this model (which is why SYNC_LIGHT exists in
the first place) and that's why my patch series was relatively
simple/short. That being said, if people really don't think I should pursue
this then I won't send another version and we can drop it.

-Doug
Hi,

On Tue, Apr 25, 2023 at 9:42 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On 25 Apr 2023 07:19:48 -0700 Douglas Anderson <dianders@chromium.org>
> >
> > So is this an explicit NAK on this approach, then?
>
> Ah I see your point. You misunderstood because I dont think NAk is needed
> in 99.999% cases, given the fact that 1) your patch will never be able to
> escape from standing ovation 2) every mutex_trylock() hints the straws in mind.

I'm afraid I'm still super confused about what you're saying. You think I
should abandon this patch series, or that it might be OK to continue with
it?

-Doug
On Tue, Apr 25, 2023 at 07:19:48AM -0700, Doug Anderson wrote:
> Hi,
>
> On Mon, Apr 24, 2023 at 6:09 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On 24 Apr 2023 09:56:58 -0700 Douglas Anderson <dianders@chromium.org>
> > > On Fri, Apr 21, 2023 at 10:19 PM Hillf Danton <hdanton@sina.com> wrote:
> > > > If kcompactd is waken up by kswapd, waiting for slow IO is the right
> > > > thing to do.
> > >
> > > I don't have enough intuition here, so I'm happy to take others'
> > > advice here. I guess my thought was that kcompactd is explicitly not
> > > using the full "sync" and instead choosing the "sync light". To me
> > > that means we shouldn't block for _too_ long.
> >
> > Take a look at another case of lock wait [1].
> >
> > [1] https://lore.kernel.org/lkml/CAHk-=wgyL9OujQ72er7oXt_VsMeno4bMKCTydBT1WSaagZ_5CA@mail.gmail.com/
>
> So is this an explicit NAK on this approach, then? It still feels
> worthwhile to me given the current kcompactd design where there is a
> single thread that's in charge of going through and cleaning up all of
> memory. Any single page isn't _that_ important for kcompactd to deal
> with and it's nice not to block the whole task's ability to make
> progress. kcompactd is already very much designed in this model (which
> is why SYNC_LIGHT exists in the first place) and that's why my patch
> series was relatively simple/short. That being said, if people really
> don't think I should pursue this then I won't send another version and
> we can drop it.

I don't consider it to be an explicit NAK but let's cc Linus because it's a
valid question. Linus, the patch is
https://lore.kernel.org/lkml/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid/
and it's adding folio_lock_timeout() which in older terms is a
lock_page_timeout(). The intended use is kcompactd doing out-of-line
compaction (like kswapd does out-of-line reclaim) to try lock a page in
MIGRATE_SYNC_LIGHT mode but if it cannot be locked quickly then give up and
move on to another migration candidate. MIGRATE_SYNC_LIGHT is expected to
incur some delays while trying to make forward progress and the overall
problem is that kcompactd can sometimes stall for many seconds and
sometimes minutes on one page.

The reason I don't consider this patch a NAK candidate is that this is not
conditional locking as such because no special action is taken if the lock
cannot be acquired. In the referenced mail, I think the context for the IO
NOWAIT stuff is "try lock and if that fails, delegate the work to an async
context". That is not necessarily a universal win and it's potentially
complex. It's not a universal win because it's unknown how long it would
take to acquire the lock and it may be a short enough period to be cheaper
than the setup_for_async+context_switch+completion handler. If that happens
often enough in a short window then delegation may be slower overall than
doing the work synchronously. It's potentially complex because the setup
for async handling and completion needs code that must be maintained.

The kcompactd case using folio_lock_timeout is different. If the lock
fails, it's not being explicitly delegated to another context; the page is
simply ignored and kcompactd moves on. Fair enough, another context may end
up migrating the same page in direct compaction or kcompactd at a later
time but there is no complex setup for that and it's not explicit
delegation. It's vaguely similar to how shrink_folio_list() calls
folio_trylock and if that fails, keeps the page on the LRU for a future
attempt, with the main difference being that some time is spent on trylock.
This is *also* not necessarily a universal win because kcompactd could find
a suitable migration candidate quicker by a plain trylock, but that's what
MIGRATE_ASYNC is for; MIGRATE_SYNC_LIGHT is expected to delay for short
periods of time when MIGRATE_ASYNC fails, and the problem being solved is
the folio lock taking minutes to acquire.
On Wed, Apr 26, 2023 at 11:09:18AM +0100, Mel Gorman wrote:
> On Tue, Apr 25, 2023 at 07:19:48AM -0700, Doug Anderson wrote:
> > So is this an explicit NAK on this approach, then?
>
> I don't consider it to be an explicit NAK but let's cc Linus because
> it's a valid question.

[snip]

> This is *also* not necessarily a universal win because kcompactd could
> find a suitable migration candidate quicker by a plain trylock, but
> that's what MIGRATE_ASYNC is for; MIGRATE_SYNC_LIGHT is expected to
> delay for short periods of time when MIGRATE_ASYNC fails, and the
> problem being solved is the folio lock taking minutes to acquire.

I'm not generally a fan of lock-with-timeout approaches. I think the
rationale for this one makes sense, but we're going to see some people try
to use this for situations where it doesn't make sense. I almost wonder if
we shouldn't spin rather than sleep on this lock, since the window of time
we're willing to wait is so short. I'm certainly not willing to NAK this
patch since it's clearly fixing a real problem.

Hm. If the problem is that we want to wait for the lock unless the lock is
being held for I/O, we can actually tell that in the caller.

	if (folio_test_uptodate(folio))
		folio_lock(folio);
	else
		folio_trylock(folio);

(the folio lock isn't held for writeback, just taken and released; if the
folio is uptodate, the folio lock should only be taken for a short time;
if it's !uptodate then it's probably being read)
On Wed, Apr 26, 2023 at 3:09 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The reason I don't consider this patch a NAK candidate is that this is not
> conditional locking as such because no special action is taken if the lock
> cannot be acquired.

If this comes from my rant against not having conditional locking for the
O_NDELAY (well, the io_uring equivalent) IO path, then no, I don't think
something like lock_page_timeout() is "conditional locking".

Some caller wanting to get the lock, but being willing to just go on and do
something else after a timeout, is fine. Within the context of something
like memory compaction internally for the kernel that is fundamentally
opportunistic anyway, that sounds fine to me. In fact, in contexts like
that, even trylock is fine. We obviously have trylock in lots of places of
the kernel.

My "no conditional locking" is really that I do not think it's sane to have
user IO fail with EAGAIN just because some data structure happened to be
busy. It's a debugging nightmare with unlikely things that happen only in
special conditions. Doing IO is not some "opportunistic" thing.

We've actually had things like that before where people tried to make
O_NDELAY mean "no locking" (I think that was driven at least partly by the
old in-kernel web server patches), and it also causes actual problems with
user space then busy-looping in a select() loop, because the select doesn't
consider some low-level lock to be a waiting event.

(The io_uring case is _slightly_ different from our historical issues in
this area, in that the kernel can fall back to the user worker thread case,
but it's all mixed up in that same IO path and that's why I absolutely
hated that "if (X) trylock else proper_lock" approach).

So an unconditional "lock with timeout" in the context of "we can just skip
this if it times out" is perfectly fine by me.

That said - the kcompactd code is not code I know, so maybe there are
*other* issues with that patch, so this is also not an ACK from me. So
please consider this just a "that is a very different case from the one I
complained about".

              Linus
Hi,

On Wed, Apr 26, 2023 at 8:14 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> I'm not generally a fan of lock-with-timeout approaches. I think the
> rationale for this one makes sense, but we're going to see some people
> try to use this for situations where it doesn't make sense.

Although it won't completely prevent the issue, I could add a comment to
the function (and the similar lock_buffer_timeout() added in patch #2 [1])
at least warning people that it's discouraged to use the function without
very careful consideration.

[1] https://lore.kernel.org/r/20230421151135.v2.2.Ie146eec4d41480ebeb15f0cfdfb3bc9095e4ebd9@changeid/

> I almost wonder if we shouldn't spin rather than sleep on this lock,
> since the window of time we're willing to wait is so short.

It doesn't feel like spinning is the right move in this particular case.
While we want to enable kcompactd to move forward, it's not so urgent that
it needs to take up lots of CPU cycles doing so. ...and, in fact, the cases
I've seen us be blocked are when we're under memory pressure, and spinning
would be counterproductive to getting out of that pressure.

> I'm certainly not willing to NAK this patch since it's clearly fixing a
> real problem.
>
> Hm. If the problem is that we want to wait for the lock unless the
> lock is being held for I/O, we can actually tell that in the caller.
>
> 	if (folio_test_uptodate(folio))
> 		folio_lock(folio);
> 	else
> 		folio_trylock(folio);
>
> (the folio lock isn't held for writeback, just taken and released;
> if the folio is uptodate, the folio lock should only be taken for a
> short time; if it's !uptodate then it's probably being read)

The current place in patch #3 where I'm using folio_lock_timeout() only
calls it if a folio_trylock() already failed [2]. So I guess the idea would
be that if the trylock failed and folio_test_uptodate() returns 0 then we
immediately fail, otherwise we call the unbounded folio_lock()?

I put some traces in and ran my test, and it turns out that in every case
(except one) where the initial folio_trylock() failed I saw
folio_test_uptodate() return 0. Assuming my test case is typical, I think
that means that coding it with folio_test_uptodate() is roughly the same as
just never waiting at all for the folio lock in the SYNC_LIGHT case. In the
original discussion of my v1 patch people didn't like that idea. ...so I
think that for now I'm going to keep it with the timeout flow.

--

After all of the discussion and continued digging into the code that I've
done, I'm still of the opinion that this patch series is worthwhile and in
the spirit of the SYNC_LIGHT mode of migration, but I also don't believe
it's going to be a panacea for any particular case. Presumably even if
kcompactd gets blocked for a second or two under a lot of memory pressure
it won't be the absolute end of the world. In that spirit, I'll plan to
post v3 in a day or two, but I'll also continue to say that if someone
tells me that they really hate it then I can put it on the back burner.

[2] https://lore.kernel.org/r/20230421151135.v2.3.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid/
On Wed, Apr 26, 2023 at 01:46:58PM -0700, Doug Anderson wrote:
> On Wed, Apr 26, 2023 at 8:14 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > I'm not generally a fan of lock-with-timeout approaches. I think the
> > rationale for this one makes sense, but we're going to see some people
> > try to use this for situations where it doesn't make sense.
>
> Although it won't completely prevent the issue, I could add a comment

People don't read comments.

> > Hm. If the problem is that we want to wait for the lock unless the
> > lock is being held for I/O, we can actually tell that in the caller.
> >
> > 	if (folio_test_uptodate(folio))
> > 		folio_lock(folio);
> > 	else
> > 		folio_trylock(folio);
> >
> > (the folio lock isn't held for writeback, just taken and released;
> > if the folio is uptodate, the folio lock should only be taken for a
> > short time; if it's !uptodate then it's probably being read)
>
> The current place in patch #3 where I'm using folio_lock_timeout()
> only calls it if a folio_trylock() already failed [2]. So I guess the
> idea would be that if the trylock failed and folio_test_uptodate()
> returns 0 then we immediately fail, otherwise we call the unbounded
> folio_lock()?

Looking at the actual code, here's what I'd do:

+++ b/mm/migrate.c
@@ -1156,6 +1156,14 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
 		if (current->flags & PF_MEMALLOC)
 			goto out;
 
+		/*
+		 * In "light" mode, we can wait for transient locks (eg
+		 * inserting a page into the page table), but it's not
+		 * worth waiting for I/O.
+		 */
+		if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(folio))
+			goto out;
+
 		folio_lock(src);
 	}
 	locked = true;

> I put some traces in and ran my test, and it turns out that in every
> case (except one) where the initial folio_trylock() failed I saw
> folio_test_uptodate() return 0. Assuming my test case is typical, I
> think that means that coding it with folio_test_uptodate() is roughly
> the same as just never waiting at all for the folio lock in the
> SYNC_LIGHT case. In the original discussion of my v1 patch people
> didn't like that idea. ...so I think that for now I'm going to keep it
> with the timeout flow.

I think that means that your specific test is generally going to exercise
the case where the lock is held because we're waiting for I/O. That's
exactly what you set it up to produce, after all! But it won't affect the
cases where the folio lock is being held for other reasons, which your
testcase is incredibly unlikely to produce.
Hi,

On Wed, Apr 26, 2023 at 2:27 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Apr 26, 2023 at 01:46:58PM -0700, Doug Anderson wrote:
> > Although it won't completely prevent the issue, I could add a comment
>
> People don't read comments.

Agreed, it's just better than nothing...

> Looking at the actual code, here's what I'd do:

[ migrate_folio_unmap() snippet snipped ]

> I think that means that your specific test is generally going to
> exercise the case where the lock is held because we're waiting for I/O.
> That's exactly what you set it up to produce, after all! But it won't
> affect the cases where the folio lock is being held for other reasons,
> which your testcase is incredibly unlikely to produce.

Sure, I'm happy to do it like you say. Do you have any suggestions for the
similar lock_buffer() case, or are you OK w/ the timeout there?

Mel: do you have any comments? In your previous response [1] you seemed to
indicate that you thought that short waits for read were a good idea.

[1] https://lore.kernel.org/r/20230420102304.7wdquge2b7r3xerj@techsingularity.net

-Doug
On Wed, Apr 26, 2023 at 02:39:56PM -0700, Doug Anderson wrote:
> Sure, I'm happy to do it like you say. Do you have any suggestions for
> the similar lock_buffer() case, or are you OK w/ the timeout there?

I'd do that similarly, ie:

+++ b/mm/migrate.c
@@ -691,38 +691,30 @@ EXPORT_SYMBOL(migrate_folio);
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 			enum migrate_mode mode)
 {
-	struct buffer_head *bh = head;
+	struct buffer_head *failed_bh, *bh = head;
 
-	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC) {
-		do {
-			lock_buffer(bh);
-			bh = bh->b_this_page;
-
-		} while (bh != head);
-
-		return true;
-	}
-
-	/* async case, we cannot block on lock_buffer so use trylock_buffer */
 	do {
 		if (!trylock_buffer(bh)) {
-			/*
-			 * We failed to lock the buffer and cannot stall in
-			 * async migration. Release the taken locks
-			 */
-			struct buffer_head *failed_bh = bh;
-			bh = head;
-			while (bh != failed_bh) {
-				unlock_buffer(bh);
-				bh = bh->b_this_page;
-			}
-			return false;
+			if (mode == MIGRATE_ASYNC)
+				goto unlock;
+			if (mode == MIGRATE_SYNC_LIGHT && !buffer_uptodate(bh))
+				goto unlock;
+			lock_buffer(bh);
 		}
 		bh = bh->b_this_page;
 	} while (bh != head);
 
 	return true;
+
+unlock:
+	/* Failed to lock the buffer and cannot stall */
+	failed_bh = bh;
+	bh = head;
+	while (bh != failed_bh) {
+		unlock_buffer(bh);
+		bh = bh->b_this_page;
+	}
+	return false;
 }
 
 static int __buffer_migrate_folio(struct address_space *mapping,
On Wed, Apr 26, 2023 at 02:39:56PM -0700, Doug Anderson wrote:
> Hi,
>
> On Wed, Apr 26, 2023 at 2:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Apr 26, 2023 at 01:46:58PM -0700, Doug Anderson wrote:
> > > Although it won't completely prevent the issue, I could add a comment
> >
> > People don't read comments.
>
> Agreed, it's just better than nothing...

It also acts as something to point to if an ill-advised use of
trylock_timeout was used.

[snip]

> Sure, I'm happy to do it like you say. Do you have any suggestions for
> the similar lock_buffer() case, or are you OK w/ the timeout there?
>
> Mel: do you have any comments? In your previous response [1] you
> seemed to indicate that you thought that short waits for read were a
> good idea.

I have no objection and it's worth trying the patch against your test case,
although note that short-lived locks may still be skipped (but that's ok).
Either kcompactd still waits for minutes on folio lock or it does not. The
potential downside is that kcompactd skips candidates that it shouldn't
have, which will manifest as a drop in compaction efficiency that will be
difficult to detect. With Johannes' compaction series in the mix though,
it's more likely we'll notice a drop in efficiency in the short term.

I prefer the uptodate check a lot more than "spin rather than sleep on this
lock". Spinning is surprisingly difficult to get right (see all the changes
and subsequent fixes that went into mutexes or the changes in "polling on
the way to idle" logic in cpuidle), it also occupies CPU time when a system
is already under stress and it's not a great fit for this specific case.
The timeout is currently specified in ticks and ticking for 1-4ms on common
kernel configs is a *long* time from a scheduler perspective even if it's a
non-issue for MM, which sometimes deals with time frames suitable for slow
storage. It's even worse if there is no preemption point and even if there
was, it hits potential hazards with CFS dealing with the fairness of a
useless spinning task vs tasks trying to do real work. Of course the spin
time could be based on cond_resched or a timeout in nanoseconds, but
cond_resched is a random timeout and picking a specific timeout is
difficult and will attract negative attention eventually. My MM hat says
"ah, it'll be fine" and my scheduler hat says "uhhh, don't do that".

Matthew's suggestion on checking uptodate is a good one. If he's right, the
problem goes away. If the problem still persists, I am willing to bet that
it's mitigated a lot *and* it becomes interesting to find out why a
!uptodate page could take a very long time to lock, because it's possibly
revealing another real bug. If the idea does not work at all, the timeout
patch still exists; add to the changelog why checking uptodate doesn't
work, and Linus spelled out clearly why it should be ok in this specific
case even if we have to watch out for bad users of the new interface.
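
For context on the tick granularity here: one jiffy is 1000/HZ ms (10ms at
HZ=100, 4ms at HZ=250, 1ms at HZ=1000), so a caller would normally derive
the jiffies argument from a wall-clock budget rather than hard-coding
ticks. A sketch, with the 100ms budget being an assumption for illustration
only:

	/* Sketch: convert a millisecond budget into the jiffies argument. */
	static int example_lock_with_budget(struct folio *folio)
	{
		long timeout = msecs_to_jiffies(100);	/* illustrative budget */

		return folio_lock_timeout(folio, timeout);
	}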
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0acb8e1fb7af..0f3ef9f79300 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -892,6 +892,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 }
 
 void __folio_lock(struct folio *folio);
+int __folio_lock_timeout(struct folio *folio, long timeout);
 int __folio_lock_killable(struct folio *folio);
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			   unsigned int flags);
@@ -952,6 +953,21 @@ static inline void folio_lock(struct folio *folio)
 	__folio_lock(folio);
 }
 
+/**
+ * folio_lock_timeout() - Lock this folio, with a timeout.
+ * @folio: The folio to lock.
+ * @timeout: The timeout in jiffies; %MAX_SCHEDULE_TIMEOUT means wait forever.
+ *
+ * Return: 0 upon success; -ETIMEDOUT upon failure.
+ */
+static inline int folio_lock_timeout(struct folio *folio, long timeout)
+{
+	might_sleep();
+	if (!folio_trylock(folio))
+		return __folio_lock_timeout(folio, timeout);
+	return 0;
+}
+
 /**
  * lock_page() - Lock the folio containing this page.
  * @page: The page to lock.
diff --git a/mm/filemap.c b/mm/filemap.c
index 2723104cc06a..c6056ec41284 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1220,7 +1220,7 @@ static inline bool folio_trylock_flag(struct folio *folio, int bit_nr,
 int sysctl_page_lock_unfairness = 5;
 
 static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
-		int state, enum behavior behavior)
+		int state, enum behavior behavior, long timeout)
 {
 	wait_queue_head_t *q = folio_waitqueue(folio);
 	int unfairness = sysctl_page_lock_unfairness;
@@ -1229,6 +1229,7 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
 	bool thrashing = false;
 	unsigned long pflags;
 	bool in_thrashing;
+	int err;
 
 	if (bit_nr == PG_locked &&
 	    !folio_test_uptodate(folio) && folio_test_workingset(folio)) {
@@ -1295,10 +1296,13 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
 		/* Loop until we've been woken or interrupted */
 		flags = smp_load_acquire(&wait->flags);
 		if (!(flags & WQ_FLAG_WOKEN)) {
+			if (!timeout)
+				break;
+
 			if (signal_pending_state(state, current))
 				break;
 
-			io_schedule();
+			timeout = io_schedule_timeout(timeout);
 			continue;
 		}
 
@@ -1324,10 +1328,10 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
 	}
 
 	/*
-	 * If a signal happened, this 'finish_wait()' may remove the last
-	 * waiter from the wait-queues, but the folio waiters bit will remain
-	 * set. That's ok. The next wakeup will take care of it, and trying
-	 * to do it here would be difficult and prone to races.
+	 * If a signal/timeout happened, this 'finish_wait()' may remove the
+	 * last waiter from the wait-queues, but the folio waiters bit will
+	 * remain set. That's ok. The next wakeup will take care of it, and
+	 * trying to do it here would be difficult and prone to races.
 	 */
 	finish_wait(q, wait);
 
@@ -1336,6 +1340,13 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
 		psi_memstall_leave(&pflags);
 	}
 
+	/*
+	 * If we don't meet the success criteria below then we've got an error
+	 * of some sort. Differentiate between the two error cases. If there's
+	 * no time left it must have been a timeout.
+	 */
+	err = !timeout ? -ETIMEDOUT : -EINTR;
+
 	/*
 	 * NOTE! The wait->flags weren't stable until we've done the
 	 * 'finish_wait()', and we could have exited the loop above due
@@ -1350,9 +1361,9 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
 	 * waiter, but an exclusive one requires WQ_FLAG_DONE.
 	 */
 	if (behavior == EXCLUSIVE)
-		return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;
+		return wait->flags & WQ_FLAG_DONE ? 0 : err;
 
-	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
+	return wait->flags & WQ_FLAG_WOKEN ? 0 : err;
 }
 
 #ifdef CONFIG_MIGRATION
@@ -1442,13 +1453,15 @@ void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
 
 void folio_wait_bit(struct folio *folio, int bit_nr)
 {
-	folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
+	folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED,
+			      MAX_SCHEDULE_TIMEOUT);
 }
 EXPORT_SYMBOL(folio_wait_bit);
 
 int folio_wait_bit_killable(struct folio *folio, int bit_nr)
 {
-	return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED);
+	return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED,
+				     MAX_SCHEDULE_TIMEOUT);
 }
 EXPORT_SYMBOL(folio_wait_bit_killable);
 
@@ -1467,7 +1480,8 @@ EXPORT_SYMBOL(folio_wait_bit_killable);
  */
 static int folio_put_wait_locked(struct folio *folio, int state)
 {
-	return folio_wait_bit_common(folio, PG_locked, state, DROP);
+	return folio_wait_bit_common(folio, PG_locked, state, DROP,
+				     MAX_SCHEDULE_TIMEOUT);
 }
 
 /**
@@ -1662,17 +1676,24 @@ EXPORT_SYMBOL_GPL(page_endio);
 void __folio_lock(struct folio *folio)
 {
 	folio_wait_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE,
-			      EXCLUSIVE);
+			      EXCLUSIVE, MAX_SCHEDULE_TIMEOUT);
 }
 EXPORT_SYMBOL(__folio_lock);
 
 int __folio_lock_killable(struct folio *folio)
 {
 	return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE,
-				     EXCLUSIVE);
+				     EXCLUSIVE, MAX_SCHEDULE_TIMEOUT);
 }
 EXPORT_SYMBOL_GPL(__folio_lock_killable);
 
+int __folio_lock_timeout(struct folio *folio, long timeout)
+{
+	return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE,
+				     EXCLUSIVE, timeout);
+}
+EXPORT_SYMBOL_GPL(__folio_lock_timeout);
+
 static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 {
 	struct wait_queue_head *q = folio_waitqueue(folio);
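
Finally, a sketch of how a caller might tell the two failure modes apart,
which is why folio_wait_bit_common() derives err from the leftover timeout
(the HZ/10 budget and the policy choices here are illustrative assumptions,
not code from the series):

	/* Hypothetical helper: lock a folio for migration, or say why not. */
	static bool example_lock_for_migration(struct folio *folio)
	{
		switch (folio_lock_timeout(folio, HZ / 10)) {
		case 0:
			return true;	/* locked; caller must folio_unlock() */
		case -ETIMEDOUT:
			return false;	/* contended too long: skip this folio */
		default:
			return false;	/* -EINTR: a fatal signal is pending */
		}
	}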