Message ID | 20230621093009.637544-1-yosryahmed@google.com |
---|---|
State | New |
Series | mm: zswap: fix double invalidate with exclusive loads |
Commit Message
Yosry Ahmed
June 21, 2023, 9:30 a.m. UTC
If exclusive loads are enabled for zswap, we invalidate the entry before
returning from zswap_frontswap_load(), after dropping the local
reference. However, the tree lock is dropped during decompression after
the local reference is acquired, so the entry could be invalidated
before we drop the local ref. If this happens, the entry is freed once
we drop the local ref, and zswap_invalidate_entry() tries to invalidate
an already freed entry.
Fix this by:
(a) Making sure zswap_invalidate_entry() is always called with a local
ref held, to avoid being called on a freed entry.
(b) Making sure zswap_invalidate_entry() only drops the ref if the entry
was actually on the rbtree. Otherwise, another invalidation could
have already happened, and the initial ref is already dropped.
With these changes, there is no need to check that the entry still exists in
the tree in zswap_reclaim_entry() before invalidating it, as
zswap_invalidate_entry() will make this check internally.
Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
mm/zswap.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
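
To make the refcount rules this patch establishes easier to follow, here is a small stand-alone model of the fixed logic. It is an illustrative sketch only: `struct entry`, `entry_put()` and `invalidate_entry()` are simplified userspace stand-ins for zswap_entry, zswap_entry_put() and zswap_invalidate_entry(), not the kernel implementations (the real change is the diff at the bottom of this page).

```c
/*
 * Userspace model of the refcount rules after this patch.
 * Simplified stand-ins for the zswap helpers; NOT kernel code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct entry {
	int refcount;   /* starts at 1: the initial ref from entry creation */
	bool on_tree;   /* stands in for !RB_EMPTY_NODE(&entry->rbnode) */
};

static void entry_put(struct entry *e)
{
	if (--e->refcount == 0) {
		printf("entry freed\n");
		free(e);
	}
}

/* The fixed invalidate: drop the initial ref only if the erase succeeded. */
static void invalidate_entry(struct entry *e)
{
	if (e->on_tree) {       /* zswap_rb_erase() returning true */
		e->on_tree = false;
		entry_put(e);   /* initial ref dropped exactly once */
	}
}

int main(void)
{
	struct entry *e = malloc(sizeof(*e));
	e->refcount = 1;
	e->on_tree = true;

	/* Load path: take a local ref before dropping the tree lock. */
	e->refcount++;

	/* Concurrent invalidation while the lock was released. */
	invalidate_entry(e);    /* refcount 2 -> 1, entry taken off the tree */

	/*
	 * Load path resumes. With the fix it invalidates *before* putting
	 * the local ref, and the second invalidate is a harmless no-op.
	 */
	invalidate_entry(e);    /* no-op: already off the tree */
	entry_put(e);           /* refcount 1 -> 0, freed exactly once */
	return 0;
}
```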
Comments
On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > If exclusive loads are enabled for zswap, we invalidate the entry before > returning from zswap_frontswap_load(), after dropping the local > reference. However, the tree lock is dropped during decompression after > the local reference is acquired, so the entry could be invalidated > before we drop the local ref. If this happens, the entry is freed once > we drop the local ref, and zswap_invalidate_entry() tries to invalidate > an already freed entry. > > Fix this by: > (a) Making sure zswap_invalidate_entry() is always called with a local > ref held, to avoid being called on a freed entry. > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry > was actually on the rbtree. Otherwise, another invalidation could > have already happened, and the initial ref is already dropped. > > With these changes, there is no need to check that there is no need to > make sure the entry still exists in the tree in zswap_reclaim_entry() > before invalidating it, as zswap_reclaim_entry() will make this check > internally. > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > mm/zswap.c | 21 ++++++++++++--------- > 1 file changed, 12 insertions(+), 9 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 87b204233115..62195f72bf56 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry, > return 0; > } > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > { > if (!RB_EMPTY_NODE(&entry->rbnode)) { > rb_erase(&entry->rbnode, root); > RB_CLEAR_NODE(&entry->rbnode); > + return true; > } > + return false; > } > > /* > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) > return NULL; > } > > +/* > + * If the entry is still valid in the tree, drop the initial ref and remove it > + * from the tree. This function must be called with an additional ref held, > + * otherwise it may race with another invalidation freeing the entry. > + */ On re-reading this comment there's one thing I'm not sure I get, do we really need to hold an additional local ref to call this? As far as I understood, once we check that the entry was in the tree before putting the initial ref, there's no need for an additional local one. > static void zswap_invalidate_entry(struct zswap_tree *tree, > struct zswap_entry *entry) > { > - /* remove from rbtree */ > - zswap_rb_erase(&tree->rbroot, entry); > - > - /* drop the initial reference from entry creation */ > - zswap_entry_put(tree, entry); > + if (zswap_rb_erase(&tree->rbroot, entry)) > + zswap_entry_put(tree, entry); > } > > static int zswap_reclaim_entry(struct zswap_pool *pool) > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool) > * swapcache. Drop the entry from zswap - unless invalidate already > * took it out while we had the tree->lock released for IO. 
> */ > - if (entry == zswap_rb_search(&tree->rbroot, swpoffset)) > - zswap_invalidate_entry(tree, entry); > + zswap_invalidate_entry(tree, entry); > > put_unlock: > /* Drop local reference */ > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > count_objcg_event(entry->objcg, ZSWPIN); > freeentry: > spin_lock(&tree->lock); > - zswap_entry_put(tree, entry); > if (!ret && zswap_exclusive_loads_enabled) { > zswap_invalidate_entry(tree, entry); > *exclusive = true; > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > list_move(&entry->lru, &entry->pool->lru); > spin_unlock(&entry->pool->lru_lock); > } > + zswap_entry_put(tree, entry); > spin_unlock(&tree->lock); > > return ret; > -- > 2.41.0.162.gfafddb0af9-goog >
On Wed, Jun 21, 2023 at 3:20 AM Domenico Cerasuolo <cerasuolodomenico@gmail.com> wrote: > > On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > If exclusive loads are enabled for zswap, we invalidate the entry before > > returning from zswap_frontswap_load(), after dropping the local > > reference. However, the tree lock is dropped during decompression after > > the local reference is acquired, so the entry could be invalidated > > before we drop the local ref. If this happens, the entry is freed once > > we drop the local ref, and zswap_invalidate_entry() tries to invalidate > > an already freed entry. > > > > Fix this by: > > (a) Making sure zswap_invalidate_entry() is always called with a local > > ref held, to avoid being called on a freed entry. > > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry > > was actually on the rbtree. Otherwise, another invalidation could > > have already happened, and the initial ref is already dropped. > > > > With these changes, there is no need to check that there is no need to > > make sure the entry still exists in the tree in zswap_reclaim_entry() > > before invalidating it, as zswap_reclaim_entry() will make this check > > internally. > > > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") > > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > --- > > mm/zswap.c | 21 ++++++++++++--------- > > 1 file changed, 12 insertions(+), 9 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 87b204233115..62195f72bf56 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry, > > return 0; > > } > > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > { > > if (!RB_EMPTY_NODE(&entry->rbnode)) { > > rb_erase(&entry->rbnode, root); > > RB_CLEAR_NODE(&entry->rbnode); > > + return true; > > } > > + return false; > > } > > > > /* > > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) > > return NULL; > > } > > > > +/* > > + * If the entry is still valid in the tree, drop the initial ref and remove it > > + * from the tree. This function must be called with an additional ref held, > > + * otherwise it may race with another invalidation freeing the entry. > > + */ > > On re-reading this comment there's one thing I'm not sure I get, do we > really need to hold an additional local ref to call this? As far as I > understood, once we check that the entry was in the tree before putting > the initial ref, there's no need for an additional local one. I believe it is, but please correct me if I am wrong. Consider the following scenario: // Initially refcount is at 1 CPU#1: CPU#2: spin_lock(tree_lock) zswap_entry_get() // 2 refs spin_unlock(tree_lock) spin_lock(tree_lock) zswap_invalidate_entry() // 1 ref spin_unlock(tree_lock) zswap_entry_put() // 0 refs zswap_invalidate_entry() // problem That last zswap_invalidate_entry() call in CPU#1 is problematic. The entry would have already been freed. If we check that the entry is on the tree by checking RB_EMPTY_NODE(&entry->rbnode), then we are reading already freed and potentially re-used memory. We would need to search the tree to make sure the same entry still exists in the tree (aka what zswap_reclaim_entry() currently does). 
This is not ideal in the fault path to have to do the lookups twice. Also, in zswap_reclaim_entry(), would it be possible if we call zswap_invalidate_entry() after we drop the local ref that the swap entry has been reused for a different page? I didn't look closely, but if yes, then the slab allocator may have repurposed the zswap_entry and we may find the entry in the tree for the same offset, even though it is referring to a different page now. This sounds practically unlikely but perhaps theoretically possible. I think it's more reliable to call zswap_invalidate_entry() on an entry that we know is valid before dropping the local ref. Especially that it's easy to do today by just moving a few lines around. > > > static void zswap_invalidate_entry(struct zswap_tree *tree, > > struct zswap_entry *entry) > > { > > - /* remove from rbtree */ > > - zswap_rb_erase(&tree->rbroot, entry); > > - > > - /* drop the initial reference from entry creation */ > > - zswap_entry_put(tree, entry); > > + if (zswap_rb_erase(&tree->rbroot, entry)) > > + zswap_entry_put(tree, entry); > > } > > > > static int zswap_reclaim_entry(struct zswap_pool *pool) > > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool) > > * swapcache. Drop the entry from zswap - unless invalidate already > > * took it out while we had the tree->lock released for IO. > > */ > > - if (entry == zswap_rb_search(&tree->rbroot, swpoffset)) > > - zswap_invalidate_entry(tree, entry); > > + zswap_invalidate_entry(tree, entry); > > > > put_unlock: > > /* Drop local reference */ > > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > count_objcg_event(entry->objcg, ZSWPIN); > > freeentry: > > spin_lock(&tree->lock); > > - zswap_entry_put(tree, entry); > > if (!ret && zswap_exclusive_loads_enabled) { > > zswap_invalidate_entry(tree, entry); > > *exclusive = true; > > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > list_move(&entry->lru, &entry->pool->lru); > > spin_unlock(&entry->pool->lru_lock); > > } > > + zswap_entry_put(tree, entry); > > spin_unlock(&tree->lock); > > > > return ret; > > -- > > 2.41.0.162.gfafddb0af9-goog > >
On Wed, Jun 21, 2023 at 7:26 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Jun 21, 2023 at 3:20 AM Domenico Cerasuolo > <cerasuolodomenico@gmail.com> wrote: > > > > On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > If exclusive loads are enabled for zswap, we invalidate the entry before > > > returning from zswap_frontswap_load(), after dropping the local > > > reference. However, the tree lock is dropped during decompression after > > > the local reference is acquired, so the entry could be invalidated > > > before we drop the local ref. If this happens, the entry is freed once > > > we drop the local ref, and zswap_invalidate_entry() tries to invalidate > > > an already freed entry. > > > > > > Fix this by: > > > (a) Making sure zswap_invalidate_entry() is always called with a local > > > ref held, to avoid being called on a freed entry. > > > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry > > > was actually on the rbtree. Otherwise, another invalidation could > > > have already happened, and the initial ref is already dropped. > > > > > > With these changes, there is no need to check that there is no need to > > > make sure the entry still exists in the tree in zswap_reclaim_entry() > > > before invalidating it, as zswap_reclaim_entry() will make this check > > > internally. > > > > > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") > > > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > --- > > > mm/zswap.c | 21 ++++++++++++--------- > > > 1 file changed, 12 insertions(+), 9 deletions(-) > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index 87b204233115..62195f72bf56 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry, > > > return 0; > > > } > > > > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > { > > > if (!RB_EMPTY_NODE(&entry->rbnode)) { > > > rb_erase(&entry->rbnode, root); > > > RB_CLEAR_NODE(&entry->rbnode); > > > + return true; > > > } > > > + return false; > > > } > > > > > > /* > > > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) > > > return NULL; > > > } > > > > > > +/* > > > + * If the entry is still valid in the tree, drop the initial ref and remove it > > > + * from the tree. This function must be called with an additional ref held, > > > + * otherwise it may race with another invalidation freeing the entry. > > > + */ > > > > On re-reading this comment there's one thing I'm not sure I get, do we > > really need to hold an additional local ref to call this? As far as I > > understood, once we check that the entry was in the tree before putting > > the initial ref, there's no need for an additional local one. > > I believe it is, but please correct me if I am wrong. Consider the > following scenario: > > // Initially refcount is at 1 > > CPU#1: CPU#2: > spin_lock(tree_lock) > zswap_entry_get() // 2 refs > spin_unlock(tree_lock) > spin_lock(tree_lock) > zswap_invalidate_entry() // 1 ref > spin_unlock(tree_lock) > zswap_entry_put() // 0 refs > zswap_invalidate_entry() // problem > > That last zswap_invalidate_entry() call in CPU#1 is problematic. The > entry would have already been freed. 
If we check that the entry is on > the tree by checking RB_EMPTY_NODE(&entry->rbnode), then we are > reading already freed and potentially re-used memory. > > We would need to search the tree to make sure the same entry still > exists in the tree (aka what zswap_reclaim_entry() currently does). > This is not ideal in the fault path to have to do the lookups twice. Thanks for the clarification, it is indeed needed in that case. I was just wondering if the wording of the comment is exact, in that before calling zswap_invalidate_entry one has to ensure that the entry has not been freed, not specifically by holding an additional reference, if a lookup can serve the same purpose. > > Also, in zswap_reclaim_entry(), would it be possible if we call > zswap_invalidate_entry() after we drop the local ref that the swap > entry has been reused for a different page? I didn't look closely, but > if yes, then the slab allocator may have repurposed the zswap_entry > and we may find the entry in the tree for the same offset, even though > it is referring to a different page now. This sounds practically > unlikely but perhaps theoretically possible. I'm not sure I understood the scenario, in zswap_reclaim_entry we keep a local reference until the end in order to avoid a free. > > I think it's more reliable to call zswap_invalidate_entry() on an > entry that we know is valid before dropping the local ref. Especially > that it's easy to do today by just moving a few lines around. > > > > > > > > > static void zswap_invalidate_entry(struct zswap_tree *tree, > > > struct zswap_entry *entry) > > > { > > > - /* remove from rbtree */ > > > - zswap_rb_erase(&tree->rbroot, entry); > > > - > > > - /* drop the initial reference from entry creation */ > > > - zswap_entry_put(tree, entry); > > > + if (zswap_rb_erase(&tree->rbroot, entry)) > > > + zswap_entry_put(tree, entry); > > > } > > > > > > static int zswap_reclaim_entry(struct zswap_pool *pool) > > > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool) > > > * swapcache. Drop the entry from zswap - unless invalidate already > > > * took it out while we had the tree->lock released for IO. > > > */ > > > - if (entry == zswap_rb_search(&tree->rbroot, swpoffset)) > > > - zswap_invalidate_entry(tree, entry); > > > + zswap_invalidate_entry(tree, entry); > > > > > > put_unlock: > > > /* Drop local reference */ > > > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > count_objcg_event(entry->objcg, ZSWPIN); > > > freeentry: > > > spin_lock(&tree->lock); > > > - zswap_entry_put(tree, entry); > > > if (!ret && zswap_exclusive_loads_enabled) { > > > zswap_invalidate_entry(tree, entry); > > > *exclusive = true; > > > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > list_move(&entry->lru, &entry->pool->lru); > > > spin_unlock(&entry->pool->lru_lock); > > > } > > > + zswap_entry_put(tree, entry); > > > spin_unlock(&tree->lock); > > > > > > return ret; > > > -- > > > 2.41.0.162.gfafddb0af9-goog > > >
On Wed, Jun 21, 2023 at 12:36 PM Domenico Cerasuolo <cerasuolodomenico@gmail.com> wrote: > > On Wed, Jun 21, 2023 at 7:26 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Wed, Jun 21, 2023 at 3:20 AM Domenico Cerasuolo > > <cerasuolodomenico@gmail.com> wrote: > > > > > > On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > If exclusive loads are enabled for zswap, we invalidate the entry before > > > > returning from zswap_frontswap_load(), after dropping the local > > > > reference. However, the tree lock is dropped during decompression after > > > > the local reference is acquired, so the entry could be invalidated > > > > before we drop the local ref. If this happens, the entry is freed once > > > > we drop the local ref, and zswap_invalidate_entry() tries to invalidate > > > > an already freed entry. > > > > > > > > Fix this by: > > > > (a) Making sure zswap_invalidate_entry() is always called with a local > > > > ref held, to avoid being called on a freed entry. > > > > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry > > > > was actually on the rbtree. Otherwise, another invalidation could > > > > have already happened, and the initial ref is already dropped. > > > > > > > > With these changes, there is no need to check that there is no need to > > > > make sure the entry still exists in the tree in zswap_reclaim_entry() > > > > before invalidating it, as zswap_reclaim_entry() will make this check > > > > internally. > > > > > > > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") > > > > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > --- > > > > mm/zswap.c | 21 ++++++++++++--------- > > > > 1 file changed, 12 insertions(+), 9 deletions(-) > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index 87b204233115..62195f72bf56 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry, > > > > return 0; > > > > } > > > > > > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > > { > > > > if (!RB_EMPTY_NODE(&entry->rbnode)) { > > > > rb_erase(&entry->rbnode, root); > > > > RB_CLEAR_NODE(&entry->rbnode); > > > > + return true; > > > > } > > > > + return false; > > > > } > > > > > > > > /* > > > > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) > > > > return NULL; > > > > } > > > > > > > > +/* > > > > + * If the entry is still valid in the tree, drop the initial ref and remove it > > > > + * from the tree. This function must be called with an additional ref held, > > > > + * otherwise it may race with another invalidation freeing the entry. > > > > + */ > > > > > > On re-reading this comment there's one thing I'm not sure I get, do we > > > really need to hold an additional local ref to call this? As far as I > > > understood, once we check that the entry was in the tree before putting > > > the initial ref, there's no need for an additional local one. > > > > I believe it is, but please correct me if I am wrong. 
Consider the > > following scenario: > > > > // Initially refcount is at 1 > > > > CPU#1: CPU#2: > > spin_lock(tree_lock) > > zswap_entry_get() // 2 refs > > spin_unlock(tree_lock) > > spin_lock(tree_lock) > > zswap_invalidate_entry() // 1 ref > > spin_unlock(tree_lock) > > zswap_entry_put() // 0 refs > > zswap_invalidate_entry() // problem > > > > That last zswap_invalidate_entry() call in CPU#1 is problematic. The > > entry would have already been freed. If we check that the entry is on > > the tree by checking RB_EMPTY_NODE(&entry->rbnode), then we are > > reading already freed and potentially re-used memory. > > > > We would need to search the tree to make sure the same entry still > > exists in the tree (aka what zswap_reclaim_entry() currently does). > > This is not ideal in the fault path to have to do the lookups twice. > > Thanks for the clarification, it is indeed needed in that case. I was just > wondering if the wording of the comment is exact, in that before calling > zswap_invalidate_entry one has to ensure that the entry has not been freed, not > specifically by holding an additional reference, if a lookup can serve the same > purpose. I am wondering if the scenario below is possible, in which case a lookup would not be enough. Let me try to clarify. Let's assume in zswap_reclaim_entry() we drop the local ref early (before we invalidate the entry), and rely on the lookup to ensure that the entry was not freed: - On CPU#1, in zswap_reclaim_entry() we release the lock during IO. Let's assume we drop the local ref here and rely on the lookup to make sure the zswap entry wasn't freed. - On CPU#2, invalidates the swap entry. The zswap entry is freed (returned to the slab allocator). - On CPU#2, we try to reclaim another page, and allocates the same swap slot (same type and offset). - On CPU#2, a zswap entry is allocated, and the slab allocator happens to hand us the same zswap_entry we just freed. - On CPU#1, after IO is done, we lookup the tree to make sure that the zswap entry was not freed. We find the same zswap entry (same address) at the same offset, so we assume it was not freed. - On CPU#1, we invalidate the zswap entry that was actually used by CPU#2. I am not entirely sure if this is possible, perhaps locking in the swap layer will prevent the swap entry reuse, but it seems like relying on the lookup can be fragile, and we should rely on the local ref instead to reliably prevent freeing/reuse of the zswap entry. Please correct me if I missed something. > > > > > > Also, in zswap_reclaim_entry(), would it be possible if we call > > zswap_invalidate_entry() after we drop the local ref that the swap > > entry has been reused for a different page? I didn't look closely, but > > if yes, then the slab allocator may have repurposed the zswap_entry > > and we may find the entry in the tree for the same offset, even though > > it is referring to a different page now. This sounds practically > > unlikely but perhaps theoretically possible. > > I'm not sure I understood the scenario, in zswap_reclaim_entry we keep a local > reference until the end in order to avoid a free. Right, I was just trying to reason about what might happen if we call zswap_invalidate_entry() after dropping the local ref, as I mentioned above. > > > > > > I think it's more reliable to call zswap_invalidate_entry() on an > > entry that we know is valid before dropping the local ref. Especially > > that it's easy to do today by just moving a few lines around. 
> > > > > > > > > > > > > > > static void zswap_invalidate_entry(struct zswap_tree *tree, > > > > struct zswap_entry *entry) > > > > { > > > > - /* remove from rbtree */ > > > > - zswap_rb_erase(&tree->rbroot, entry); > > > > - > > > > - /* drop the initial reference from entry creation */ > > > > - zswap_entry_put(tree, entry); > > > > + if (zswap_rb_erase(&tree->rbroot, entry)) > > > > + zswap_entry_put(tree, entry); > > > > } > > > > > > > > static int zswap_reclaim_entry(struct zswap_pool *pool) > > > > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool) > > > > * swapcache. Drop the entry from zswap - unless invalidate already > > > > * took it out while we had the tree->lock released for IO. > > > > */ > > > > - if (entry == zswap_rb_search(&tree->rbroot, swpoffset)) > > > > - zswap_invalidate_entry(tree, entry); > > > > + zswap_invalidate_entry(tree, entry); > > > > > > > > put_unlock: > > > > /* Drop local reference */ > > > > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > > count_objcg_event(entry->objcg, ZSWPIN); > > > > freeentry: > > > > spin_lock(&tree->lock); > > > > - zswap_entry_put(tree, entry); > > > > if (!ret && zswap_exclusive_loads_enabled) { > > > > zswap_invalidate_entry(tree, entry); > > > > *exclusive = true; > > > > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > > list_move(&entry->lru, &entry->pool->lru); > > > > spin_unlock(&entry->pool->lru_lock); > > > > } > > > > + zswap_entry_put(tree, entry); > > > > spin_unlock(&tree->lock); > > > > > > > > return ret; > > > > -- > > > > 2.41.0.162.gfafddb0af9-goog > > > >
On Wed, Jun 21, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Jun 21, 2023 at 12:36 PM Domenico Cerasuolo > <cerasuolodomenico@gmail.com> wrote: > > > > On Wed, Jun 21, 2023 at 7:26 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Wed, Jun 21, 2023 at 3:20 AM Domenico Cerasuolo > > > <cerasuolodomenico@gmail.com> wrote: > > > > > > > > On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > If exclusive loads are enabled for zswap, we invalidate the entry before > > > > > returning from zswap_frontswap_load(), after dropping the local > > > > > reference. However, the tree lock is dropped during decompression after > > > > > the local reference is acquired, so the entry could be invalidated > > > > > before we drop the local ref. If this happens, the entry is freed once > > > > > we drop the local ref, and zswap_invalidate_entry() tries to invalidate > > > > > an already freed entry. > > > > > > > > > > Fix this by: > > > > > (a) Making sure zswap_invalidate_entry() is always called with a local > > > > > ref held, to avoid being called on a freed entry. > > > > > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry > > > > > was actually on the rbtree. Otherwise, another invalidation could > > > > > have already happened, and the initial ref is already dropped. > > > > > > > > > > With these changes, there is no need to check that there is no need to > > > > > make sure the entry still exists in the tree in zswap_reclaim_entry() > > > > > before invalidating it, as zswap_reclaim_entry() will make this check > > > > > internally. > > > > > > > > > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") > > > > > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > --- > > > > > mm/zswap.c | 21 ++++++++++++--------- > > > > > 1 file changed, 12 insertions(+), 9 deletions(-) > > > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > > index 87b204233115..62195f72bf56 100644 > > > > > --- a/mm/zswap.c > > > > > +++ b/mm/zswap.c > > > > > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry, > > > > > return 0; > > > > > } > > > > > > > > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > > > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > > > { > > > > > if (!RB_EMPTY_NODE(&entry->rbnode)) { > > > > > rb_erase(&entry->rbnode, root); > > > > > RB_CLEAR_NODE(&entry->rbnode); > > > > > + return true; > > > > > } > > > > > + return false; > > > > > } > > > > > > > > > > /* > > > > > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) > > > > > return NULL; > > > > > } > > > > > > > > > > +/* > > > > > + * If the entry is still valid in the tree, drop the initial ref and remove it > > > > > + * from the tree. This function must be called with an additional ref held, > > > > > + * otherwise it may race with another invalidation freeing the entry. > > > > > + */ > > > > > > > > On re-reading this comment there's one thing I'm not sure I get, do we > > > > really need to hold an additional local ref to call this? As far as I > > > > understood, once we check that the entry was in the tree before putting > > > > the initial ref, there's no need for an additional local one. > > > > > > I believe it is, but please correct me if I am wrong. 
Consider the > > > following scenario: > > > > > > // Initially refcount is at 1 > > > > > > CPU#1: CPU#2: > > > spin_lock(tree_lock) > > > zswap_entry_get() // 2 refs > > > spin_unlock(tree_lock) > > > spin_lock(tree_lock) > > > zswap_invalidate_entry() // 1 ref > > > spin_unlock(tree_lock) > > > zswap_entry_put() // 0 refs > > > zswap_invalidate_entry() // problem > > > > > > That last zswap_invalidate_entry() call in CPU#1 is problematic. The > > > entry would have already been freed. If we check that the entry is on > > > the tree by checking RB_EMPTY_NODE(&entry->rbnode), then we are > > > reading already freed and potentially re-used memory. > > > > > > We would need to search the tree to make sure the same entry still > > > exists in the tree (aka what zswap_reclaim_entry() currently does). > > > This is not ideal in the fault path to have to do the lookups twice. > > > > Thanks for the clarification, it is indeed needed in that case. I was just > > wondering if the wording of the comment is exact, in that before calling > > zswap_invalidate_entry one has to ensure that the entry has not been freed, not > > specifically by holding an additional reference, if a lookup can serve the same > > purpose. > > > I am wondering if the scenario below is possible, in which case a > lookup would not be enough. Let me try to clarify. Let's assume in > zswap_reclaim_entry() we drop the local ref early (before we > invalidate the entry), and rely on the lookup to ensure that the entry > was not freed: > > - On CPU#1, in zswap_reclaim_entry() we release the lock during IO. > Let's assume we drop the local ref here and rely on the lookup to make > sure the zswap entry wasn't freed. > - On CPU#2, invalidates the swap entry. The zswap entry is freed > (returned to the slab allocator). > - On CPU#2, we try to reclaim another page, and allocates the same > swap slot (same type and offset). > - On CPU#2, a zswap entry is allocated, and the slab allocator happens > to hand us the same zswap_entry we just freed. > - On CPU#1, after IO is done, we lookup the tree to make sure that the > zswap entry was not freed. We find the same zswap entry (same address) > at the same offset, so we assume it was not freed. > - On CPU#1, we invalidate the zswap entry that was actually used by CPU#2. > > I am not entirely sure if this is possible, perhaps locking in the > swap layer will prevent the swap entry reuse, but it seems like > relying on the lookup can be fragile, and we should rely on the local > ref instead to reliably prevent freeing/reuse of the zswap entry. > > Please correct me if I missed something. I think it is, we definitely need an additional reference to pin down the entry. Sorry if I was being pedantic, my original doubt was only about the wording of the comment, where it says that an additional reference must be held. I was wondering if it was strictly needed, and now I see that it is :) > > > > > > > > > > > Also, in zswap_reclaim_entry(), would it be possible if we call > > > zswap_invalidate_entry() after we drop the local ref that the swap > > > entry has been reused for a different page? I didn't look closely, but > > > if yes, then the slab allocator may have repurposed the zswap_entry > > > and we may find the entry in the tree for the same offset, even though > > > it is referring to a different page now. This sounds practically > > > unlikely but perhaps theoretically possible. 
> > > > I'm not sure I understood the scenario, in zswap_reclaim_entry we keep a local > > reference until the end in order to avoid a free. > > > Right, I was just trying to reason about what might happen if we call > zswap_invalidate_entry() after dropping the local ref, as I mentioned > above. > > > > > > > > > > > > I think it's more reliable to call zswap_invalidate_entry() on an > > > entry that we know is valid before dropping the local ref. Especially > > > that it's easy to do today by just moving a few lines around. > > > > > > > > > > > > > > > > > > > > > static void zswap_invalidate_entry(struct zswap_tree *tree, > > > > > struct zswap_entry *entry) > > > > > { > > > > > - /* remove from rbtree */ > > > > > - zswap_rb_erase(&tree->rbroot, entry); > > > > > - > > > > > - /* drop the initial reference from entry creation */ > > > > > - zswap_entry_put(tree, entry); > > > > > + if (zswap_rb_erase(&tree->rbroot, entry)) > > > > > + zswap_entry_put(tree, entry); > > > > > } > > > > > > > > > > static int zswap_reclaim_entry(struct zswap_pool *pool) > > > > > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool) > > > > > * swapcache. Drop the entry from zswap - unless invalidate already > > > > > * took it out while we had the tree->lock released for IO. > > > > > */ > > > > > - if (entry == zswap_rb_search(&tree->rbroot, swpoffset)) > > > > > - zswap_invalidate_entry(tree, entry); > > > > > + zswap_invalidate_entry(tree, entry); > > > > > > > > > > put_unlock: > > > > > /* Drop local reference */ > > > > > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > > > count_objcg_event(entry->objcg, ZSWPIN); > > > > > freeentry: > > > > > spin_lock(&tree->lock); > > > > > - zswap_entry_put(tree, entry); > > > > > if (!ret && zswap_exclusive_loads_enabled) { > > > > > zswap_invalidate_entry(tree, entry); > > > > > *exclusive = true; > > > > > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > > > list_move(&entry->lru, &entry->pool->lru); > > > > > spin_unlock(&entry->pool->lru_lock); > > > > > } > > > > > + zswap_entry_put(tree, entry); > > > > > spin_unlock(&tree->lock); > > > > > > > > > > return ret; > > > > > -- > > > > > 2.41.0.162.gfafddb0af9-goog > > > > > Reviewed-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
On Wed, Jun 21, 2023 at 11:33 PM Domenico Cerasuolo <cerasuolodomenico@gmail.com> wrote: > > On Wed, Jun 21, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Wed, Jun 21, 2023 at 12:36 PM Domenico Cerasuolo > > <cerasuolodomenico@gmail.com> wrote: > > > > > > On Wed, Jun 21, 2023 at 7:26 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > On Wed, Jun 21, 2023 at 3:20 AM Domenico Cerasuolo > > > > <cerasuolodomenico@gmail.com> wrote: > > > > > > > > > > On Wed, Jun 21, 2023 at 11:30 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > If exclusive loads are enabled for zswap, we invalidate the entry before > > > > > > returning from zswap_frontswap_load(), after dropping the local > > > > > > reference. However, the tree lock is dropped during decompression after > > > > > > the local reference is acquired, so the entry could be invalidated > > > > > > before we drop the local ref. If this happens, the entry is freed once > > > > > > we drop the local ref, and zswap_invalidate_entry() tries to invalidate > > > > > > an already freed entry. > > > > > > > > > > > > Fix this by: > > > > > > (a) Making sure zswap_invalidate_entry() is always called with a local > > > > > > ref held, to avoid being called on a freed entry. > > > > > > (b) Making sure zswap_invalidate_entry() only drops the ref if the entry > > > > > > was actually on the rbtree. Otherwise, another invalidation could > > > > > > have already happened, and the initial ref is already dropped. > > > > > > > > > > > > With these changes, there is no need to check that there is no need to > > > > > > make sure the entry still exists in the tree in zswap_reclaim_entry() > > > > > > before invalidating it, as zswap_reclaim_entry() will make this check > > > > > > internally. > > > > > > > > > > > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads") > > > > > > Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > --- > > > > > > mm/zswap.c | 21 ++++++++++++--------- > > > > > > 1 file changed, 12 insertions(+), 9 deletions(-) > > > > > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > > > index 87b204233115..62195f72bf56 100644 > > > > > > --- a/mm/zswap.c > > > > > > +++ b/mm/zswap.c > > > > > > @@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry, > > > > > > return 0; > > > > > > } > > > > > > > > > > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > > > > +static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > > > > { > > > > > > if (!RB_EMPTY_NODE(&entry->rbnode)) { > > > > > > rb_erase(&entry->rbnode, root); > > > > > > RB_CLEAR_NODE(&entry->rbnode); > > > > > > + return true; > > > > > > } > > > > > > + return false; > > > > > > } > > > > > > > > > > > > /* > > > > > > @@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) > > > > > > return NULL; > > > > > > } > > > > > > > > > > > > +/* > > > > > > + * If the entry is still valid in the tree, drop the initial ref and remove it > > > > > > + * from the tree. This function must be called with an additional ref held, > > > > > > + * otherwise it may race with another invalidation freeing the entry. > > > > > > + */ > > > > > > > > > > On re-reading this comment there's one thing I'm not sure I get, do we > > > > > really need to hold an additional local ref to call this? 
As far as I > > > > > understood, once we check that the entry was in the tree before putting > > > > > the initial ref, there's no need for an additional local one. > > > > > > > > I believe it is, but please correct me if I am wrong. Consider the > > > > following scenario: > > > > > > > > // Initially refcount is at 1 > > > > > > > > CPU#1: CPU#2: > > > > spin_lock(tree_lock) > > > > zswap_entry_get() // 2 refs > > > > spin_unlock(tree_lock) > > > > spin_lock(tree_lock) > > > > zswap_invalidate_entry() // 1 ref > > > > spin_unlock(tree_lock) > > > > zswap_entry_put() // 0 refs > > > > zswap_invalidate_entry() // problem > > > > > > > > That last zswap_invalidate_entry() call in CPU#1 is problematic. The > > > > entry would have already been freed. If we check that the entry is on > > > > the tree by checking RB_EMPTY_NODE(&entry->rbnode), then we are > > > > reading already freed and potentially re-used memory. > > > > > > > > We would need to search the tree to make sure the same entry still > > > > exists in the tree (aka what zswap_reclaim_entry() currently does). > > > > This is not ideal in the fault path to have to do the lookups twice. > > > > > > Thanks for the clarification, it is indeed needed in that case. I was just > > > wondering if the wording of the comment is exact, in that before calling > > > zswap_invalidate_entry one has to ensure that the entry has not been freed, not > > > specifically by holding an additional reference, if a lookup can serve the same > > > purpose. > > > > > > I am wondering if the scenario below is possible, in which case a > > lookup would not be enough. Let me try to clarify. Let's assume in > > zswap_reclaim_entry() we drop the local ref early (before we > > invalidate the entry), and rely on the lookup to ensure that the entry > > was not freed: > > > > - On CPU#1, in zswap_reclaim_entry() we release the lock during IO. > > Let's assume we drop the local ref here and rely on the lookup to make > > sure the zswap entry wasn't freed. > > - On CPU#2, invalidates the swap entry. The zswap entry is freed > > (returned to the slab allocator). > > - On CPU#2, we try to reclaim another page, and allocates the same > > swap slot (same type and offset). > > - On CPU#2, a zswap entry is allocated, and the slab allocator happens > > to hand us the same zswap_entry we just freed. > > - On CPU#1, after IO is done, we lookup the tree to make sure that the > > zswap entry was not freed. We find the same zswap entry (same address) > > at the same offset, so we assume it was not freed. > > - On CPU#1, we invalidate the zswap entry that was actually used by CPU#2. > > > > I am not entirely sure if this is possible, perhaps locking in the > > swap layer will prevent the swap entry reuse, but it seems like > > relying on the lookup can be fragile, and we should rely on the local > > ref instead to reliably prevent freeing/reuse of the zswap entry. > > > > Please correct me if I missed something. > > I think it is, we definitely need an additional reference to pin down the entry. > Sorry if I was being pedantic, my original doubt was only about the wording of > the comment, where it says that an additional reference must be held. I was > wondering if it was strictly needed, and now I see that it is :) Not at all! 
Questions and comments are always welcome (and encouraged), at least for me :) > > > > > > > > > > > > > > > > > Also, in zswap_reclaim_entry(), would it be possible if we call > > > > zswap_invalidate_entry() after we drop the local ref that the swap > > > > entry has been reused for a different page? I didn't look closely, but > > > > if yes, then the slab allocator may have repurposed the zswap_entry > > > > and we may find the entry in the tree for the same offset, even though > > > > it is referring to a different page now. This sounds practically > > > > unlikely but perhaps theoretically possible. > > > > > > I'm not sure I understood the scenario, in zswap_reclaim_entry we keep a local > > > reference until the end in order to avoid a free. > > > > > > Right, I was just trying to reason about what might happen if we call > > zswap_invalidate_entry() after dropping the local ref, as I mentioned > > above. > > > > > > > > > > > > > > > > > > I think it's more reliable to call zswap_invalidate_entry() on an > > > > entry that we know is valid before dropping the local ref. Especially > > > > that it's easy to do today by just moving a few lines around. > > > > > > > > > > > > > > > > > > > > > > > > > > > static void zswap_invalidate_entry(struct zswap_tree *tree, > > > > > > struct zswap_entry *entry) > > > > > > { > > > > > > - /* remove from rbtree */ > > > > > > - zswap_rb_erase(&tree->rbroot, entry); > > > > > > - > > > > > > - /* drop the initial reference from entry creation */ > > > > > > - zswap_entry_put(tree, entry); > > > > > > + if (zswap_rb_erase(&tree->rbroot, entry)) > > > > > > + zswap_entry_put(tree, entry); > > > > > > } > > > > > > > > > > > > static int zswap_reclaim_entry(struct zswap_pool *pool) > > > > > > @@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool) > > > > > > * swapcache. Drop the entry from zswap - unless invalidate already > > > > > > * took it out while we had the tree->lock released for IO. > > > > > > */ > > > > > > - if (entry == zswap_rb_search(&tree->rbroot, swpoffset)) > > > > > > - zswap_invalidate_entry(tree, entry); > > > > > > + zswap_invalidate_entry(tree, entry); > > > > > > > > > > > > put_unlock: > > > > > > /* Drop local reference */ > > > > > > @@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > > > > count_objcg_event(entry->objcg, ZSWPIN); > > > > > > freeentry: > > > > > > spin_lock(&tree->lock); > > > > > > - zswap_entry_put(tree, entry); > > > > > > if (!ret && zswap_exclusive_loads_enabled) { > > > > > > zswap_invalidate_entry(tree, entry); > > > > > > *exclusive = true; > > > > > > @@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > > > > list_move(&entry->lru, &entry->pool->lru); > > > > > > spin_unlock(&entry->pool->lru_lock); > > > > > > } > > > > > > + zswap_entry_put(tree, entry); > > > > > > spin_unlock(&tree->lock); > > > > > > > > > > > > return ret; > > > > > > -- > > > > > > 2.41.0.162.gfafddb0af9-goog > > > > > > > > Reviewed-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Thanks!
diff --git a/mm/zswap.c b/mm/zswap.c
index 87b204233115..62195f72bf56 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -355,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
 	return 0;
 }
 
-static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
+static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
 {
 	if (!RB_EMPTY_NODE(&entry->rbnode)) {
 		rb_erase(&entry->rbnode, root);
 		RB_CLEAR_NODE(&entry->rbnode);
+		return true;
 	}
+	return false;
 }
 
 /*
@@ -599,14 +601,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
 	return NULL;
 }
 
+/*
+ * If the entry is still valid in the tree, drop the initial ref and remove it
+ * from the tree. This function must be called with an additional ref held,
+ * otherwise it may race with another invalidation freeing the entry.
+ */
 static void zswap_invalidate_entry(struct zswap_tree *tree,
 				   struct zswap_entry *entry)
 {
-	/* remove from rbtree */
-	zswap_rb_erase(&tree->rbroot, entry);
-
-	/* drop the initial reference from entry creation */
-	zswap_entry_put(tree, entry);
+	if (zswap_rb_erase(&tree->rbroot, entry))
+		zswap_entry_put(tree, entry);
 }
 
 static int zswap_reclaim_entry(struct zswap_pool *pool)
@@ -659,8 +663,7 @@ static int zswap_reclaim_entry(struct zswap_pool *pool)
 	 * swapcache. Drop the entry from zswap - unless invalidate already
 	 * took it out while we had the tree->lock released for IO.
 	 */
-	if (entry == zswap_rb_search(&tree->rbroot, swpoffset))
-		zswap_invalidate_entry(tree, entry);
+	zswap_invalidate_entry(tree, entry);
 
 put_unlock:
 	/* Drop local reference */
@@ -1466,7 +1469,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 		count_objcg_event(entry->objcg, ZSWPIN);
 freeentry:
 	spin_lock(&tree->lock);
-	zswap_entry_put(tree, entry);
 	if (!ret && zswap_exclusive_loads_enabled) {
 		zswap_invalidate_entry(tree, entry);
 		*exclusive = true;
@@ -1475,6 +1477,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 		list_move(&entry->lru, &entry->pool->lru);
 		spin_unlock(&entry->pool->lru_lock);
 	}
+	zswap_entry_put(tree, entry);
 	spin_unlock(&tree->lock);
 
 	return ret;
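
A footnote on the pointer-reuse concern raised in the discussion above (why a tree lookup alone cannot substitute for holding a reference): once an entry is freed, the allocator may hand back the same address for a newly created entry at the same offset, so finding the same pointer again proves nothing about identity. The sketch below is a hypothetical userspace illustration of address reuse with a generic allocator, not zswap code; reuse is possible, not guaranteed.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* "Old" entry that a concurrent invalidation frees while we slept. */
	void *old = malloc(64);
	uintptr_t old_addr = (uintptr_t)old;   /* remember only the address */
	printf("old entry at %p\n", old);
	free(old);

	/* "New" entry allocated afterwards for the same swap offset. */
	void *reused = malloc(64);
	printf("new entry at %p\n", reused);

	if ((uintptr_t)reused == old_addr)
		printf("same address reused: a pointer-equality check would wrongly match\n");
	else
		printf("different address this time (reuse is possible, not guaranteed)\n");

	free(reused);
	return 0;
}
```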