Message ID: <20230501175025.36233-1-surenb@google.com>
State: New
Subject: [PATCH 1/3] mm: handle swap page faults under VMA lock if page is uncontended
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: willy@infradead.org, hannes@cmpxchg.org, mhocko@suse.com, josef@toxicpanda.com, jack@suse.cz, ldufour@linux.ibm.com, laurent.dufour@fr.ibm.com, michel@lespinasse.org, liam.howlett@oracle.com, jglisse@google.com, vbabka@suse.cz, minchan@google.com, dave@stgolabs.net, punit.agrawal@bytedance.com, lstoakes@gmail.com, hdanton@sina.com, apopple@nvidia.com, surenb@google.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@android.com
Date: Mon, 1 May 2023 10:50:23 -0700
Series: [1/3] mm: handle swap page faults under VMA lock if page is uncontended
Commit Message
Suren Baghdasaryan
May 1, 2023, 5:50 p.m. UTC
When a page fault is handled under VMA lock protection, all swap page
faults are currently retried under mmap_lock, because the
folio_lock_or_retry() implementation has to drop and reacquire
mmap_lock if the folio cannot be locked immediately.
Instead of retrying all swap page faults, retry only when folio
locking fails.
Drivers implementing ops->migrate_to_ram might still rely on mmap_lock;
therefore, fall back to mmap_lock in this case.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/filemap.c | 6 ++++++
mm/memory.c | 14 +++++++++-----
2 files changed, 15 insertions(+), 5 deletions(-)
Comments
On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> +++ b/mm/memory.c
> @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!pte_unmap_same(vmf))
>  		goto out;
>
> -	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> -		ret = VM_FAULT_RETRY;
> -		goto out;
> -	}
> -
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {

You're missing the necessary fallback in the (!folio) case.
swap_readpage() is synchronous and will sleep.
On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> You're missing the necessary fallback in the (!folio) case.
> swap_readpage() is synchronous and will sleep.

True, but is it unsafe to do that under VMA lock and has to be done
under mmap_lock?
On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> True, but is it unsafe to do that under VMA lock and has to be done
> under mmap_lock?

... you were the one arguing that we didn't want to wait for I/O with
the VMA lock held?
On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> ... you were the one arguing that we didn't want to wait for I/O with
> the VMA lock held?

Well, that discussion was about waiting in folio_lock_or_retry() with
the lock being held. I argued against it because currently we drop
mmap_lock before waiting, so if we don't drop the VMA lock we would be
changing the current behavior, which might introduce new regressions.
In the case of swap_readpage() and swapin_readahead() we already wait
with mmap_lock held, so waiting with the VMA lock held does not
introduce new problems (unless there is a need to hold mmap_lock).

That said, you are absolutely correct that this situation can be
improved by dropping the lock in these cases too. I just didn't want to
attack everything at once. I believe after we agree on the approach
implemented in
https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
for dropping the VMA lock before waiting, these cases can be added more
easily. Does that make sense?
On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> That said, you are absolutely correct that this situation can be
> improved by dropping the lock in these cases too. I just didn't want
> to attack everything at once. [...]

OK, I looked at this path some more, and I think we're fine. This path
is only reached for SWP_SYNCHRONOUS_IO, which is only set for
QUEUE_FLAG_SYNCHRONOUS devices (brd, zram and nvdimms, both btt and
pmem). So the answer is that we don't sleep in this path, and there's
no need to drop the lock.
On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> OK, I looked at this path some more, and I think we're fine. This
> patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> (both btt and pmem). So the answer is that we don't sleep in this
> path, and there's no need to drop the lock.

Yes, but swapin_readahead() does sleep, so I'll have to handle that
case too after this.
On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> Yes but swapin_readahead does sleep, so I'll have to handle that case
> too after this.

Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere
in swapin_readahead()? It all looks like async I/O to me.
On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere
> in swapin_readahead()? It all looks like async I/O to me.

Hmm. I thought that we have synchronous I/O in the following paths:
swapin_readahead()->swap_cluster_readahead()->swap_readpage()
swapin_readahead()->swap_vma_readahead()->swap_readpage()
but just noticed that in both cases swap_readpage() is called with the
synchronous parameter being false. So you are probably right here...
Does that mean swapin_readahead() might return a page which does not
have its content swapped-in yet?
On Tue, May 02, 2023 at 04:04:59PM -0700, Suren Baghdasaryan wrote:
> Does that mean swapin_readahead() might return a page which does not
> have its content swapped-in yet?

That's my understanding. In that case it's !uptodate and still locked.
The folio_lock_or_retry() will wait for the read to complete unless
we've told it we'd rather retry.
On Tue, May 2, 2023 at 4:40 PM Matthew Wilcox <willy@infradead.org> wrote:
> That's my understanding. In that case it's !uptodate and still locked.
> The folio_lock_or_retry() will wait for the read to complete unless
> we've told it we'd rather retry.

Ok, and we already drop the locks in folio_lock_or_retry() when needed.
Sounds like we cover this case with this patchset?
On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > - goto out; > > > > > > > > > - } > > > > > > > > > - > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > under mmap_lock? > > > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > > the VMA lock held? > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > the lock being held. 
I argued against it because currently we drop > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > be changing the current behavior which might introduce new > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > to attack everything at once. I believe after we agree on the approach > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > easier. Does that make sense? > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > path, and there's no need to drop the lock. > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > too after this. > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > in swapin_readahead()? It all looks like async I/O to me. > > Hmm. I thought that we have synchronous I/O in the following paths: > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > swapin_readahead()->swap_vma_readahead()->swap_readpage() > but just noticed that in both cases swap_readpage() is called with the > synchronous parameter being false. So you are probably right here... 
In both swap_cluster_readahead() and swap_vma_readahead() it looks like
if the readahead window is 1 (aka we are not reading ahead), then we
jump to directly calling read_swap_cache_async() passing do_poll =
true, which means we may end up calling swap_readpage() passing
synchronous = true.

I am not familiar with readahead heuristics, so I am not sure how
common this is, but it's something to think about.

> Does that mean swapin_readahead() might return a page which does not
> have its content swapped-in yet?
>
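Yosry's observation can be sketched as a toy userspace model (this is not kernel code; the helper name is made up for illustration): with a readahead window of 1, the readahead paths collapse to a single-page, polled read, and that polling flag is what ultimately drives swap_readpage()'s synchronous argument.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the dispatch described above: when the readahead
 * window is 1 (no actual readahead), the readahead paths call
 * read_swap_cache_async() with do_poll = true, and that value is
 * what swap_readpage() eventually receives as 'synchronous'.
 * Hypothetical helper; the kernel code does not look like this.
 */
static bool model_swapin_is_synchronous(unsigned long ra_window)
{
    bool do_poll = (ra_window == 1); /* single page: poll for completion */
    return do_poll;                  /* becomes swap_readpage(synchronous=...) */
}
```

The point of the model is only that the "may block on I/O" behavior is data-dependent on the readahead heuristic, not on the fault path itself.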
On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > > - goto out; > > > > > > > > > > - } > > > > > > > > > > - > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > > under mmap_lock? > > > > > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > > > the VMA lock held? > > > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > > the lock being held. 
I argued against it because currently we drop > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > > be changing the current behavior which might introduce new > > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > > to attack everything at once. I believe after we agree on the approach > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > > easier. Does that make sense? > > > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > > path, and there's no need to drop the lock. > > > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > > too after this. > > > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > > in swapin_readahead()? It all looks like async I/O to me. > > > > Hmm. I thought that we have synchronous I/O in the following paths: > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > > swapin_readahead()->swap_vma_readahead()->swap_readpage() > > but just noticed that in both cases swap_readpage() is called with the > > synchronous parameter being false. So you are probably right here... 
> In both swap_cluster_readahead() and swap_vma_readahead() it looks
> like if the readahead window is 1 (aka we are not reading ahead), then
> we jump to directly calling read_swap_cache_async() passing do_poll =
> true, which means we may end up calling swap_readpage() passing
> synchronous = true.
>
> I am not familiar with readahead heuristics, so I am not sure how
> common this is, but it's something to think about.

Uh, you are correct. If this branch is common, we could use the same
"drop the lock and retry" pattern inside read_swap_cache_async(). That
would be quite easy to implement. Thanks for checking on it!

> Does that mean swapin_readahead() might return a page which does not
> have its content swapped-in yet?
>
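The "drop the lock and retry" pattern Suren refers to can be sketched as follows (a hedged userspace model, not the actual kernel implementation; the flag values and helper name are illustrative):

```c
#include <assert.h>

#define VM_FAULT_RETRY      0x0400 /* illustrative; mirrors vm_fault_reason */
#define FAULT_FLAG_VMA_LOCK 0x1000 /* illustrative bit value */

/*
 * Model: before doing anything that might block on I/O, a handler
 * running under the per-VMA lock releases that lock and reports
 * VM_FAULT_RETRY so the fault is redone under mmap_lock, where
 * sleeping for I/O is the established behavior.
 */
static unsigned int model_read_swap_cache(unsigned int fault_flags)
{
    if (fault_flags & FAULT_FLAG_VMA_LOCK) {
        /* vma_end_read(vma) would go here: drop the VMA lock first */
        return VM_FAULT_RETRY;   /* caller retries under mmap_lock */
    }
    /* ... safe to issue and wait for the synchronous read ... */
    return 0;
}
```

The design choice mirrored here is the one from the rest of the patchset: never sleep for I/O while only the per-VMA lock is held; fall back to the mmap_lock path instead.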
On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > > > - goto out; > > > > > > > > > > > - } > > > > > > > > > > > - > > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > > > under mmap_lock? > > > > > > > > > > > > > > > > ... 
you were the one arguing that we didn't want to wait for I/O with > > > > > > > > the VMA lock held? > > > > > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > > > the lock being held. I argued against it because currently we drop > > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > > > be changing the current behavior which might introduce new > > > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > > > to attack everything at once. I believe after we agree on the approach > > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > > > easier. Does that make sense? > > > > > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > > > path, and there's no need to drop the lock. > > > > > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > > > too after this. > > > > > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > > > in swapin_readahead()? It all looks like async I/O to me. > > > > > > Hmm. 
I thought that we have synchronous I/O in the following paths: > > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > > > swapin_readahead()->swap_vma_readahead()->swap_readpage() > > > but just noticed that in both cases swap_readpage() is called with the > > > synchronous parameter being false. So you are probably right here... > > > > In both swap_cluster_readahead() and swap_vma_readahead() it looks > > like if the readahead window is 1 (aka we are not reading ahead), then > > we jump to directly calling read_swap_cache_async() passing do_poll = > > true, which means we may end up calling swap_readpage() passing > > synchronous = true. > > > > I am not familiar with readahead heuristics, so I am not sure how > > common this is, but it's something to think about. > > Uh, you are correct. If this branch is common, we could use the same > "drop the lock and retry" pattern inside read_swap_cache_async(). That > would be quite easy to implement. > Thanks for checking on it! I am honestly not sure how common this is. +Ying who might have a better idea. > > > > > > > Does that mean swapin_readahead() might return a page which does not > > > have its content swapped-in yet? > > >
Yosry Ahmed <yosryahmed@google.com> writes: > On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote: >> >> On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> > >> > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: >> > > >> > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: >> > > > >> > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: >> > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: >> > > > > > >> > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: >> > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: >> > > > > > > > >> > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: >> > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: >> > > > > > > > > > >> > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: >> > > > > > > > > > > +++ b/mm/memory.c >> > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > > > > > > > > > > if (!pte_unmap_same(vmf)) >> > > > > > > > > > > goto out; >> > > > > > > > > > > >> > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { >> > > > > > > > > > > - ret = VM_FAULT_RETRY; >> > > > > > > > > > > - goto out; >> > > > > > > > > > > - } >> > > > > > > > > > > - >> > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); >> > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { >> > > > > > > > > > > if (is_migration_entry(entry)) { >> > > > > > > > > > >> > > > > > > > > > You're missing the necessary fallback in the (!folio) case. >> > > > > > > > > > swap_readpage() is synchronous and will sleep. >> > > > > > > > > >> > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done >> > > > > > > > > under mmap_lock? 
>> > > > > > > > >> > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with >> > > > > > > > the VMA lock held? >> > > > > > > >> > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with >> > > > > > > the lock being held. I argued against it because currently we drop >> > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would >> > > > > > > be changing the current behavior which might introduce new >> > > > > > > regressions. In the case of swap_readpage and swapin_readahead we >> > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does >> > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). >> > > > > > > >> > > > > > > That said, you are absolutely correct that this situation can be >> > > > > > > improved by dropping the lock in these cases too. I just didn't want >> > > > > > > to attack everything at once. I believe after we agree on the approach >> > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com >> > > > > > > for dropping the VMA lock before waiting, these cases can be added >> > > > > > > easier. Does that make sense? >> > > > > > >> > > > > > OK, I looked at this path some more, and I think we're fine. This >> > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for >> > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms >> > > > > > (both btt and pmem). So the answer is that we don't sleep in this >> > > > > > path, and there's no need to drop the lock. >> > > > > >> > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case >> > > > > too after this. >> > > > >> > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere >> > > > in swapin_readahead()? It all looks like async I/O to me. >> > > >> > > Hmm. 
I thought that we have synchronous I/O in the following paths: >> > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() >> > > swapin_readahead()->swap_vma_readahead()->swap_readpage() >> > > but just noticed that in both cases swap_readpage() is called with the >> > > synchronous parameter being false. So you are probably right here... >> > >> > In both swap_cluster_readahead() and swap_vma_readahead() it looks >> > like if the readahead window is 1 (aka we are not reading ahead), then >> > we jump to directly calling read_swap_cache_async() passing do_poll = >> > true, which means we may end up calling swap_readpage() passing >> > synchronous = true. >> > >> > I am not familiar with readahead heuristics, so I am not sure how >> > common this is, but it's something to think about. >> >> Uh, you are correct. If this branch is common, we could use the same >> "drop the lock and retry" pattern inside read_swap_cache_async(). That >> would be quite easy to implement. >> Thanks for checking on it! > > > I am honestly not sure how common this is. > > +Ying who might have a better idea. Checked the code and related history. It seems that we can just pass "synchronous = false" to swap_readpage() in read_swap_cache_async(). "synchronous = true" was introduced in commit 23955622ff8d ("swap: add block io poll in swapin path") to reduce swap read latency for block devices that can be polled. But in commit 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio"), the polling is deleted. So, we don't need to pass "synchronous = true" to swap_readpage() during swapin_readahead(), because we will wait the IO to complete in folio_lock_or_retry(). Best Regards, Huang, Ying >> >> >> > >> > > Does that mean swapin_readahead() might return a page which does not >> > > have its content swapped-in yet? >> > >
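Ying's analysis amounts to hard-wiring the asynchronous path. As a sketch (userspace model with hypothetical names, not the proposed patch): read_swap_cache_async() would always hand synchronous = false to swap_readpage(), and completion would be awaited later in folio_lock_or_retry().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the proposed simplification: since block-layer polling for
 * sync dio was removed in commit 9650b453a3d4, the do_poll hint no
 * longer buys anything, so the 'synchronous' argument passed to
 * swap_readpage() from the readahead path can unconditionally be
 * false; the fault path waits for the read when it locks the folio.
 */
static bool model_swap_readpage_sync_arg(bool do_poll)
{
    (void)do_poll;   /* polling hint is dead; ignore it */
    return false;    /* always issue the read asynchronously */
}
```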
On Thu, May 4, 2023 at 10:03 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yosry Ahmed <yosryahmed@google.com> writes: > > > On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote: > >> > >> On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >> > > >> > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > >> > > > >> > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > >> > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > >> > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > > > >> > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > >> > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > > > > > >> > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > >> > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > > > > > > > >> > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > >> > > > > > > > > > > +++ b/mm/memory.c > >> > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >> > > > > > > > > > > if (!pte_unmap_same(vmf)) > >> > > > > > > > > > > goto out; > >> > > > > > > > > > > > >> > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > >> > > > > > > > > > > - ret = VM_FAULT_RETRY; > >> > > > > > > > > > > - goto out; > >> > > > > > > > > > > - } > >> > > > > > > > > > > - > >> > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > >> > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > >> > > > > > > > > > > if (is_migration_entry(entry)) { > >> > > > > > > > > > > >> > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > >> > > > > > > > > > swap_readpage() is synchronous and will sleep. 
> >> > > > > > > > > > >> > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > >> > > > > > > > > under mmap_lock? > >> > > > > > > > > >> > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > >> > > > > > > > the VMA lock held? > >> > > > > > > > >> > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > >> > > > > > > the lock being held. I argued against it because currently we drop > >> > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > >> > > > > > > be changing the current behavior which might introduce new > >> > > > > > > regressions. In the case of swap_readpage and swapin_readahead we > >> > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > >> > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > >> > > > > > > > >> > > > > > > That said, you are absolutely correct that this situation can be > >> > > > > > > improved by dropping the lock in these cases too. I just didn't want > >> > > > > > > to attack everything at once. I believe after we agree on the approach > >> > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > >> > > > > > > for dropping the VMA lock before waiting, these cases can be added > >> > > > > > > easier. Does that make sense? > >> > > > > > > >> > > > > > OK, I looked at this path some more, and I think we're fine. This > >> > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > >> > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > >> > > > > > (both btt and pmem). So the answer is that we don't sleep in this > >> > > > > > path, and there's no need to drop the lock. > >> > > > > > >> > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > >> > > > > too after this. > >> > > > > >> > > > Sleeping is OK, we do that in pXd_alloc()! 
Do we block on I/O anywhere > >> > > > in swapin_readahead()? It all looks like async I/O to me. > >> > > > >> > > Hmm. I thought that we have synchronous I/O in the following paths: > >> > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > >> > > swapin_readahead()->swap_vma_readahead()->swap_readpage() > >> > > but just noticed that in both cases swap_readpage() is called with the > >> > > synchronous parameter being false. So you are probably right here... > >> > > >> > In both swap_cluster_readahead() and swap_vma_readahead() it looks > >> > like if the readahead window is 1 (aka we are not reading ahead), then > >> > we jump to directly calling read_swap_cache_async() passing do_poll = > >> > true, which means we may end up calling swap_readpage() passing > >> > synchronous = true. > >> > > >> > I am not familiar with readahead heuristics, so I am not sure how > >> > common this is, but it's something to think about. > >> > >> Uh, you are correct. If this branch is common, we could use the same > >> "drop the lock and retry" pattern inside read_swap_cache_async(). That > >> would be quite easy to implement. > >> Thanks for checking on it! > > > > > > I am honestly not sure how common this is. > > > > +Ying who might have a better idea. > > Checked the code and related history. It seems that we can just pass > "synchronous = false" to swap_readpage() in read_swap_cache_async(). > "synchronous = true" was introduced in commit 23955622ff8d ("swap: add > block io poll in swapin path") to reduce swap read latency for block > devices that can be polled. But in commit 9650b453a3d4 ("block: ignore > RWF_HIPRI hint for sync dio"), the polling is deleted. So, we don't > need to pass "synchronous = true" to swap_readpage() during > swapin_readahead(), because we will wait the IO to complete in > folio_lock_or_retry(). Thanks for investigating, Ying! It sounds like we can make some simplifications here. 
I'll double-check and if I don't find anything else, will change to
"synchronous = false" in the next version of the patchset.

> Best Regards,
> Huang, Ying
>
> >> > > Does that mean swapin_readahead() might return a page which does not
> >> > > have its content swapped-in yet?
diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..84f39114d4de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * mmap_lock has been released (mmap_read_unlock(), unless flags had both
  * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
  * which case mmap_lock is still held.
+ * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
+ * with VMA lock only, the VMA lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
@@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			 unsigned int flags)
 {
+	/* Can't do this if not holding mmap_lock */
+	if (flags & FAULT_FLAG_VMA_LOCK)
+		return false;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..41f45819a923 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		ret = VM_FAULT_RETRY;
-		goto out;
-	}
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
@@ -3725,6 +3720,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/*
+				 * migrate_to_ram is not yet ready to operate
+				 * under VMA lock.
+				 */
+				ret |= VM_FAULT_RETRY;
+				goto out;
+			}
+
 			vmf->page = pfn_swap_entry_to_page(entry);
 			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 					vmf->address, &vmf->ptl);
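The caller-visible contract of the filemap.c hunk in the patch can be modeled like this (userspace sketch, not kernel code; the flag value is illustrative): under FAULT_FLAG_VMA_LOCK the function refuses to wait and returns false, which do_swap_page() translates into VM_FAULT_RETRY so the fault is redone under mmap_lock.

```c
#include <assert.h>
#include <stdbool.h>

#define FAULT_FLAG_VMA_LOCK 0x1000 /* illustrative bit value */

/*
 * Model of __folio_lock_or_retry() after the patch: when the fault is
 * being handled under the per-VMA lock only, bail out immediately
 * instead of sleeping for the folio lock. The VMA lock is left held,
 * and the caller maps the 'false' return to VM_FAULT_RETRY.
 */
static bool model_folio_lock_or_retry(unsigned int flags)
{
    if (flags & FAULT_FLAG_VMA_LOCK)
        return false;  /* can't wait here without mmap_lock */
    return true;       /* mmap_lock path: wait and take the folio lock */
}
```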