Message ID | 20240120024007.2850671-3-yosryahmed@google.com |
---|---|
State | New |
Headers | Date: Sat, 20 Jan 2024 02:40:07 +0000; From: Yosry Ahmed <yosryahmed@google.com>; To: Andrew Morton <akpm@linux-foundation.org>; Cc: Johannes Weiner <hannes@cmpxchg.org>, Nhat Pham <nphamcs@gmail.com>, Chris Li <chrisl@kernel.org>, Chengming Zhou <zhouchengming@bytedance.com>, Huang Ying <ying.huang@intel.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org; In-Reply-To: <20240120024007.2850671-1-yosryahmed@google.com>; Subject: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() |
Series | mm: zswap: simplify zswap_swapoff() |
Commit Message
Yosry Ahmed
Jan. 20, 2024, 2:40 a.m. UTC
During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
called for all swap entries before zswap_swapoff() is called. This means
that all zswap entries should already be removed from the tree. Simplify
zswap_swapoff() by removing the tree cleanup loop, and leaving an
assertion in its place.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
Chengming, Chris, I think this should make the tree split and the xarray
conversion patches simpler (especially the former). If others agree,
both changes can be rebased on top of this.
---
mm/zswap.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
Comments
On 2024/1/20 10:40, Yosry Ahmed wrote: > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. Ok. Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Thanks. > --- > mm/zswap.c | 9 ++------- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f8bc9e0892687..9675c3c27f9d1 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1790,17 +1790,12 @@ void zswap_swapon(int type) > void zswap_swapoff(int type) > { > struct zswap_tree *tree = zswap_trees[type]; > - struct zswap_entry *entry, *n; > > if (!tree) > return; > > - /* walk the tree and free everything */ > - spin_lock(&tree->lock); > - rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) > - zswap_free_entry(entry); > - tree->rbroot = RB_ROOT; > - spin_unlock(&tree->lock); > + /* try_to_unuse() invalidated all entries already */ > + WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot)); > kfree(tree); > zswap_trees[type] = NULL; > }
On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> That's a great simplification. Removing the tree->lock made me double take, but at this point the swapfile and its cache should be fully dead and I don't see how any of the zswap operations that take tree->lock could race at this point. > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. The resulting code is definitely simpler, but this patch is not a completely trivial cleanup, either. If you put it before Chengming's patch and it breaks something, it would be difficult to pull out without affecting the tree split.
On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > called for all swap entries before zswap_swapoff() is called. This means > > that all zswap entries should already be removed from the tree. Simplify > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > assertion in its place. > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > That's a great simplification. > > Removing the tree->lock made me double take, but at this point the > swapfile and its cache should be fully dead and I don't see how any of > the zswap operations that take tree->lock could race at this point. It took me a while staring at the code to realize this loop is pointless. However, while I have your attention on the swapoff path, there's a slightly irrelevant problem that I think might be there, but I am not sure. It looks to me like swapoff can race with writeback, and there may be a chance of UAF for the zswap tree. For example, if zswap_swapoff() races with shrink_memcg_cb(), I feel like we may free the tree as it is being used. For example if zswap_swapoff()->kfree(tree) happen right before shrink_memcg_cb()->list_lru_isolate(l, item). Please tell me that I am being paranoid and that there is some protection against zswap writeback racing with swapoff. It feels like we are very careful with zswap entries refcounting, but not with the zswap tree itself. > > > --- > > Chengming, Chris, I think this should make the tree split and the xarray > > conversion patches simpler (especially the former). If others agree, > > both changes can be rebased on top of this. > > The resulting code is definitely simpler, but this patch is not a > completely trivial cleanup, either. If you put it before Chengming's > patch and it breaks something, it would be difficult to pull out > without affecting the tree split. Are you suggesting I rebase this on top of Chengming's patches? I can definitely do this, but the patch will be slightly less straightforward, and if the tree split patches break something it would be difficult to pull out as well. If you feel like this patch is more likely to break things, I can rebase.
On Fri, Jan 19, 2024 at 6:40 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. > --- > mm/zswap.c | 9 ++------- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f8bc9e0892687..9675c3c27f9d1 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1790,17 +1790,12 @@ void zswap_swapon(int type) > void zswap_swapoff(int type) > { > struct zswap_tree *tree = zswap_trees[type]; > - struct zswap_entry *entry, *n; > > if (!tree) > return; > > - /* walk the tree and free everything */ > - spin_lock(&tree->lock); > - rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) > - zswap_free_entry(entry); > - tree->rbroot = RB_ROOT; > - spin_unlock(&tree->lock); > + /* try_to_unuse() invalidated all entries already */ > + WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot)); > kfree(tree); > zswap_trees[type] = NULL; > } > -- > 2.43.0.429.g432eaa2c6b-goog > Oh man this is sweet! FWIW: Acked-by: Nhat Pham <nphamcs@gmail.com>
On Fri, Jan 19, 2024 at 6:40 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > called for all swap entries before zswap_swapoff() is called. This means > that all zswap entries should already be removed from the tree. Simplify > zswap_swapoff() by removing the tree cleanup loop, and leaving an > assertion in its place. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Chengming, Chris, I think this should make the tree split and the xarray > conversion patches simpler (especially the former). If others agree, > both changes can be rebased on top of this. I was wondering why those need to be there if all the zswap entries should have been swapped in already. In my testing I never see this delete of an entry so think this is kind of just in case. Nice clean up and will help simplify my zswap to xarray patch. Thanks for doing this. Acked-by: Chris Li <chrisl@kernel.org> (Google) Chris > --- > mm/zswap.c | 9 ++------- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f8bc9e0892687..9675c3c27f9d1 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1790,17 +1790,12 @@ void zswap_swapon(int type) > void zswap_swapoff(int type) > { > struct zswap_tree *tree = zswap_trees[type]; > - struct zswap_entry *entry, *n; > > if (!tree) > return; > > - /* walk the tree and free everything */ > - spin_lock(&tree->lock); > - rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode) > - zswap_free_entry(entry); > - tree->rbroot = RB_ROOT; > - spin_unlock(&tree->lock); > + /* try_to_unuse() invalidated all entries already */ > + WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot)); > kfree(tree); > zswap_trees[type] = NULL; > } > -- > 2.43.0.429.g432eaa2c6b-goog >
On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > called for all swap entries before zswap_swapoff() is called. This means > > > that all zswap entries should already be removed from the tree. Simplify > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > assertion in its place. > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > That's a great simplification. > > > > Removing the tree->lock made me double take, but at this point the > > swapfile and its cache should be fully dead and I don't see how any of > > the zswap operations that take tree->lock could race at this point. > > It took me a while staring at the code to realize this loop is pointless. > > However, while I have your attention on the swapoff path, there's a > slightly irrelevant problem that I think might be there, but I am not > sure. > > It looks to me like swapoff can race with writeback, and there may be > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > races with shrink_memcg_cb(), I feel like we may free the tree as it > is being used. For example if zswap_swapoff()->kfree(tree) happen > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > Please tell me that I am being paranoid and that there is some > protection against zswap writeback racing with swapoff. It feels like > we are very careful with zswap entries refcounting, but not with the > zswap tree itself. Hm, I don't see how. Writeback operates on entries from the LRU. By the time zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should will have emptied out the LRU and tree. Writeback could have gotten a refcount to the entry and dropped the tree->lock. But then it does __read_swap_cache_async(), and while holding the page lock checks the tree under lock once more; if that finds the entry valid, it means try_to_unuse() hasn't started on this page yet, and would be held up by the page lock/writeback state. > > > Chengming, Chris, I think this should make the tree split and the xarray > > > conversion patches simpler (especially the former). If others agree, > > > both changes can be rebased on top of this. > > > > The resulting code is definitely simpler, but this patch is not a > > completely trivial cleanup, either. If you put it before Chengming's > > patch and it breaks something, it would be difficult to pull out > > without affecting the tree split. > > Are you suggesting I rebase this on top of Chengming's patches? I can > definitely do this, but the patch will be slightly less > straightforward, and if the tree split patches break something it > would be difficult to pull out as well. If you feel like this patch is > more likely to break things, I can rebase. Yeah I think it's more subtle. I'd only ask somebody to rebase an already tested patch on a newer one if the latter were an obvious, low-risk, prep-style patch. Your patch is good, but it doesn't quite fit into this particular category, so I'd say no jumping the queue ;)
On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > that all zswap entries should already be removed from the tree. Simplify > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > assertion in its place. > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > That's a great simplification. > > > > > > Removing the tree->lock made me double take, but at this point the > > > swapfile and its cache should be fully dead and I don't see how any of > > > the zswap operations that take tree->lock could race at this point. > > > > It took me a while staring at the code to realize this loop is pointless. > > > > However, while I have your attention on the swapoff path, there's a > > slightly irrelevant problem that I think might be there, but I am not > > sure. > > > > It looks to me like swapoff can race with writeback, and there may be > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > Please tell me that I am being paranoid and that there is some > > protection against zswap writeback racing with swapoff. It feels like > > we are very careful with zswap entries refcounting, but not with the > > zswap tree itself. > > Hm, I don't see how. > > Writeback operates on entries from the LRU. By the time > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > will have emptied out the LRU and tree. > > Writeback could have gotten a refcount to the entry and dropped the > tree->lock. But then it does __read_swap_cache_async(), and while > holding the page lock checks the tree under lock once more; if that > finds the entry valid, it means try_to_unuse() hasn't started on this > page yet, and would be held up by the page lock/writeback state. Consider the following race: CPU 1 CPU 2 # In shrink_memcg_cb() # In swap_off list_lru_isolate() zswap_invalidate() .. zswap_swapoff() -> kfree(tree) spin_lock(&tree->lock); Isn't this a UAF or am I missing something here? > > > > > Chengming, Chris, I think this should make the tree split and the xarray > > > > conversion patches simpler (especially the former). If others agree, > > > > both changes can be rebased on top of this. > > > > > > The resulting code is definitely simpler, but this patch is not a > > > completely trivial cleanup, either. If you put it before Chengming's > > > patch and it breaks something, it would be difficult to pull out > > > without affecting the tree split. > > > > Are you suggesting I rebase this on top of Chengming's patches? I can > > definitely do this, but the patch will be slightly less > > straightforward, and if the tree split patches break something it > > would be difficult to pull out as well. If you feel like this patch is > > more likely to break things, I can rebase. > > Yeah I think it's more subtle. 
I'd only ask somebody to rebase an > already tested patch on a newer one if the latter were an obvious, > low-risk, prep-style patch. Your patch is good, but it doesn't quite > fit into this particular category, so I'd say no jumping the queue ;) My intention was to reduce the diff in both this patch and the tree split patches, but I do understand this is more subtle. I can rebase on top of Chengming's patches instead.
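The window in the race diagram above is an instance of a general pattern: one thread takes a lock that lives inside an object while another thread frees that object. The standalone userspace sketch below is illustrative only (not kernel code, and all names are invented); it reproduces the same shape with pthreads, and running it under AddressSanitizer will usually report a heap-use-after-free on the mutex.

```c
/*
 * Illustrative userspace sketch only, not kernel code. It mimics the
 * shape of the race above: one thread frees a structure that contains
 * a lock while another thread is about to take that lock, so the
 * locker touches freed memory. All names are invented for the example.
 */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

struct tree {
	pthread_mutex_t lock;	/* stands in for zswap_tree->lock */
};

static void *writeback_like(void *arg)
{
	struct tree *t = arg;		/* pointer obtained before "swapoff" ran */

	usleep(100 * 1000);		/* widen the race window */
	pthread_mutex_lock(&t->lock);	/* locks freed memory if swapoff_like won */
	pthread_mutex_unlock(&t->lock);
	return NULL;
}

static void *swapoff_like(void *arg)
{
	free(arg);			/* the "kfree(tree)" step */
	return NULL;
}

int main(void)
{
	struct tree *t = malloc(sizeof(*t));
	pthread_t a, b;

	pthread_mutex_init(&t->lock, NULL);
	pthread_create(&a, NULL, writeback_like, t);
	pthread_create(&b, NULL, swapoff_like, t);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}
```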
On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote: > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > > that all zswap entries should already be removed from the tree. Simplify > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > > assertion in its place. > > > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > > > That's a great simplification. > > > > > > > > Removing the tree->lock made me double take, but at this point the > > > > swapfile and its cache should be fully dead and I don't see how any of > > > > the zswap operations that take tree->lock could race at this point. > > > > > > It took me a while staring at the code to realize this loop is pointless. > > > > > > However, while I have your attention on the swapoff path, there's a > > > slightly irrelevant problem that I think might be there, but I am not > > > sure. > > > > > > It looks to me like swapoff can race with writeback, and there may be > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > > > Please tell me that I am being paranoid and that there is some > > > protection against zswap writeback racing with swapoff. It feels like > > > we are very careful with zswap entries refcounting, but not with the > > > zswap tree itself. > > > > Hm, I don't see how. > > > > Writeback operates on entries from the LRU. By the time > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > > will have emptied out the LRU and tree. > > > > Writeback could have gotten a refcount to the entry and dropped the > > tree->lock. But then it does __read_swap_cache_async(), and while > > holding the page lock checks the tree under lock once more; if that > > finds the entry valid, it means try_to_unuse() hasn't started on this > > page yet, and would be held up by the page lock/writeback state. > > Consider the following race: > > CPU 1 CPU 2 > # In shrink_memcg_cb() # In swap_off > list_lru_isolate() > zswap_invalidate() > .. > zswap_swapoff() -> kfree(tree) > spin_lock(&tree->lock); > > Isn't this a UAF or am I missing something here? Oof. You're right, it looks like there is a bug. Digging through the history, I think this is actually quite old: the original backend shrinkers would pluck something off their LRU, drop all locks, then try to acquire tree->lock. There is no protection against swapoff. The lock that is supposed to protect this is the LRU lock. That's where reclaim and invalidation should synchronize. But it's not right: 1. We drop the LRU lock before acquiring the tree lock. We should instead trylock the tree while still holding the LRU lock to make sure the tree is safe against swapoff. 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. 
But it must always cycle the LRU lock before freeing the tree for that synchronization to work. Once we're holding a refcount to the entry, it's safe to drop all locks for the next step because we'll then work against the swapcache and entry: __read_swap_cache_async() will try to pin and lock whatever swap entry is at that type+offset. This also pins the type's current tree. HOWEVER, if swapoff + swapon raced, this could be a different tree than what we had in @tree, so 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to look up zswap_trees[] again after __read_swap_cache_async() succeeded to validate the entry. Once it succeeded, we can validate the entry. The entry is valid due to our refcount. The zswap_trees[type] is valid due to the cache pin. However, if validation failed and we have a non-zero writeback_result, there is one last bug: 4. the original entry's tree is no longer valid for the entry put. The current scheme handles invalidation fine (which is good because that's quite common). But it's fundamentally unsynchronized against swapoff (which has probably gone undetected because that's rare). I can't think of an immediate solution to this, but I wanted to put my analysis out for comments.
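The first rule above, taking the tree lock only while the LRU lock is still held and via trylock, can be sketched in userspace. This is not the actual kernel fix; the lock names and the helper are placeholders, and the sketch only shows the locking shape, with the lifetime guarantee described in the comments.

```c
/*
 * Userspace sketch of the "trylock the inner lock while still holding
 * the outer lock" idea, illustrative only. lru_lock stands in for the
 * list_lru lock and tree_lock for tree->lock; both are static here, so
 * only the nesting is demonstrated, not the object lifetime itself.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lru_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true with tree_lock held; false if the tree could not be pinned. */
static bool pin_tree_under_lru(void)
{
	bool pinned = false;

	pthread_mutex_lock(&lru_lock);
	/*
	 * In the scheme described above, the freeing side must cycle the
	 * LRU lock before tearing the tree down, so taking tree_lock here,
	 * while lru_lock is held, cannot race with that free; trylock keeps
	 * the lock nesting one-way.
	 */
	if (pthread_mutex_trylock(&tree_lock) == 0)
		pinned = true;
	pthread_mutex_unlock(&lru_lock);

	return pinned;	/* caller drops tree_lock when done */
}

int main(void)
{
	if (pin_tree_under_lru())
		pthread_mutex_unlock(&tree_lock);
	return 0;
}
```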
On Tue, Jan 23, 2024 at 7:55 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is > > > > > called for all swap entries before zswap_swapoff() is called. This means > > > > > that all zswap entries should already be removed from the tree. Simplify > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an > > > > > assertion in its place. > > > > > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > > > > > > > That's a great simplification. > > > > > > > > Removing the tree->lock made me double take, but at this point the > > > > swapfile and its cache should be fully dead and I don't see how any of > > > > the zswap operations that take tree->lock could race at this point. > > > > > > It took me a while staring at the code to realize this loop is pointless. > > > > > > However, while I have your attention on the swapoff path, there's a > > > slightly irrelevant problem that I think might be there, but I am not > > > sure. > > > > > > It looks to me like swapoff can race with writeback, and there may be > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff() > > > races with shrink_memcg_cb(), I feel like we may free the tree as it > > > is being used. For example if zswap_swapoff()->kfree(tree) happen > > > right before shrink_memcg_cb()->list_lru_isolate(l, item). > > > > > > Please tell me that I am being paranoid and that there is some > > > protection against zswap writeback racing with swapoff. It feels like > > > we are very careful with zswap entries refcounting, but not with the > > > zswap tree itself. > > > > Hm, I don't see how. > > > > Writeback operates on entries from the LRU. By the time > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should > > will have emptied out the LRU and tree. > > > > Writeback could have gotten a refcount to the entry and dropped the > > tree->lock. But then it does __read_swap_cache_async(), and while > > holding the page lock checks the tree under lock once more; if that > > finds the entry valid, it means try_to_unuse() hasn't started on this > > page yet, and would be held up by the page lock/writeback state. > > Consider the following race: > > CPU 1 CPU 2 > # In shrink_memcg_cb() # In swap_off > list_lru_isolate() > zswap_invalidate() > .. > zswap_swapoff() -> kfree(tree) > spin_lock(&tree->lock); > > Isn't this a UAF or am I missing something here? I need to read this code closer. But this smells like a race to me as well. Long term speaking, I think decoupling swap and zswap will fix this, no? We won't need to kfree(tree) inside swapoff. IOW, if we have a single zswap tree that is not tied down to any swapfile, then we can't have this race. There might be other races introduced by the decoupling that I might have not foreseen tho :) Short term, no clue hmm. Let me think a bit more about this. > > > > > > > > Chengming, Chris, I think this should make the tree split and the xarray > > > > > conversion patches simpler (especially the former). If others agree, > > > > > both changes can be rebased on top of this. 
> > > > > > > > The resulting code is definitely simpler, but this patch is not a > > > > completely trivial cleanup, either. If you put it before Chengming's > > > > patch and it breaks something, it would be difficult to pull out > > > > without affecting the tree split. > > > > > > Are you suggesting I rebase this on top of Chengming's patches? I can > > > definitely do this, but the patch will be slightly less > > > straightforward, and if the tree split patches break something it > > > would be difficult to pull out as well. If you feel like this patch is > > > more likely to break things, I can rebase. > > > > Yeah I think it's more subtle. I'd only ask somebody to rebase an > > already tested patch on a newer one if the latter were an obvious, > > low-risk, prep-style patch. Your patch is good, but it doesn't quite > > fit into this particular category, so I'd say no jumping the queue ;) > > My intention was to reduce the diff in both this patch and the tree > split patches, but I do understand this is more subtle. I can rebase > on top of Chengming's patches instead.
On 2024/1/24 05:02, Yosry Ahmed wrote: > On Tue, Jan 23, 2024 at 12:12 PM Johannes Weiner <hannes@cmpxchg.org> wrote: >> >> On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote: >>> On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: >>>> >>>> On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote: >>>>> On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: >>>>>> >>>>>> On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote: >>>>>>> During swapoff, try_to_unuse() makes sure that zswap_invalidate() is >>>>>>> called for all swap entries before zswap_swapoff() is called. This means >>>>>>> that all zswap entries should already be removed from the tree. Simplify >>>>>>> zswap_swapoff() by removing the tree cleanup loop, and leaving an >>>>>>> assertion in its place. >>>>>>> >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >>>>>> >>>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org> >>>>>> >>>>>> That's a great simplification. >>>>>> >>>>>> Removing the tree->lock made me double take, but at this point the >>>>>> swapfile and its cache should be fully dead and I don't see how any of >>>>>> the zswap operations that take tree->lock could race at this point. >>>>> >>>>> It took me a while staring at the code to realize this loop is pointless. >>>>> >>>>> However, while I have your attention on the swapoff path, there's a >>>>> slightly irrelevant problem that I think might be there, but I am not >>>>> sure. >>>>> >>>>> It looks to me like swapoff can race with writeback, and there may be >>>>> a chance of UAF for the zswap tree. For example, if zswap_swapoff() >>>>> races with shrink_memcg_cb(), I feel like we may free the tree as it >>>>> is being used. For example if zswap_swapoff()->kfree(tree) happen >>>>> right before shrink_memcg_cb()->list_lru_isolate(l, item). >>>>> >>>>> Please tell me that I am being paranoid and that there is some >>>>> protection against zswap writeback racing with swapoff. It feels like >>>>> we are very careful with zswap entries refcounting, but not with the >>>>> zswap tree itself. >>>> >>>> Hm, I don't see how. >>>> >>>> Writeback operates on entries from the LRU. By the time >>>> zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should >>>> will have emptied out the LRU and tree. >>>> >>>> Writeback could have gotten a refcount to the entry and dropped the >>>> tree->lock. But then it does __read_swap_cache_async(), and while >>>> holding the page lock checks the tree under lock once more; if that >>>> finds the entry valid, it means try_to_unuse() hasn't started on this >>>> page yet, and would be held up by the page lock/writeback state. >>> >>> Consider the following race: >>> >>> CPU 1 CPU 2 >>> # In shrink_memcg_cb() # In swap_off >>> list_lru_isolate() >>> zswap_invalidate() >>> .. >>> zswap_swapoff() -> kfree(tree) >>> spin_lock(&tree->lock); >>> >>> Isn't this a UAF or am I missing something here? >> >> Oof. You're right, it looks like there is a bug. Digging through the >> history, I think this is actually quite old: the original backend >> shrinkers would pluck something off their LRU, drop all locks, then >> try to acquire tree->lock. There is no protection against swapoff. >> >> The lock that is supposed to protect this is the LRU lock. That's >> where reclaim and invalidation should synchronize. But it's not right: >> >> 1. We drop the LRU lock before acquiring the tree lock. 
We should >> instead trylock the tree while still holding the LRU lock to make >> sure the tree is safe against swapoff. >> >> 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But >> it must always cycle the LRU lock before freeing the tree for that >> synchronization to work. >> >> Once we're holding a refcount to the entry, it's safe to drop all >> locks for the next step because we'll then work against the swapcache >> and entry: __read_swap_cache_async() will try to pin and lock whatever >> swap entry is at that type+offset. This also pins the type's current >> tree. HOWEVER, if swapoff + swapon raced, this could be a different >> tree than what we had in @tree, so >> >> 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to >> look up zswap_trees[] again after __read_swap_cache_async() >> succeeded to validate the entry. >> >> Once it succeeded, we can validate the entry. The entry is valid due >> to our refcount. The zswap_trees[type] is valid due to the cache pin. >> >> However, if validation failed and we have a non-zero writeback_result, >> there is one last bug: >> >> 4. the original entry's tree is no longer valid for the entry put. >> >> The current scheme handles invalidation fine (which is good because >> that's quite common). But it's fundamentally unsynchronized against >> swapoff (which has probably gone undetected because that's rare). >> >> I can't think of an immediate solution to this, but I wanted to put my >> analysis out for comments. > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > The first solution that came to mind for me was refcounting the zswap > tree with RCU with percpu-refcount, similar to how cgroup refs are > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > the percpu-refcount may be an overkill in terms of memory usage > though. I think we can still do our own refcounting with RCU, but it > may be more complicated. Hello, I also thought about this problem for some time, maybe something like below can be changed to fix it? It's likely I missed something, just some thoughts. IMHO, the problem is caused by the different way in which we use zswap entry in the writeback, that should be much like zswap_load(). The zswap_load() comes in with the folio locked in swap cache, so it has stable zswap tree to search and lock... But in writeback case, we don't, shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, then release lru lock to get tree lock, which maybe freed already. So we should change here, we read swpentry from entry with lru list lock held, then release lru lock, to try to lock corresponding folio in swap cache, if we success, the following things is much the same like zswap_load(). We can get tree lock, to recheck the invalidate race, if no race happened, we can make sure the entry is still right and get refcount of it, then release the tree lock. The main differences between this writeback with zswap_load() is the handling of lru entry and the tree lifetime. The whole zswap_load() function has the stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half after __swap_writepage() since we unlock the folio after that. So we can't reference the tree after that. This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early in tree lock, since thereafter writeback can't fail. 
BTW, I think we should also zswap_invalidate_entry() early in zswap_load() and only support the zswap_exclusive_loads_enabled mode, but that's another topic.

The second difference is the handling of the lru entry, which is easy: we just zswap_lru_del() under the tree lock. Then no path can reference the entry from the tree or lru after we release the tree lock, so we can just zswap_free_entry() after writeback.

Thanks!

// lru list lock held
shrink_memcg_cb()
	swpentry = entry->swpentry
	// Don't isolate entry from lru list here, just use list_lru_putback()
	spin_unlock(lru list lock)

	folio = __read_swap_cache_async(swpentry)
	if (!folio)
		return

	if (!folio_was_allocated)
		folio_put(folio)
		return

	// folio is locked, swapcache is secured against swapoff
	tree = get tree from swpentry
	spin_lock(&tree->lock)

	// check invalidate race? No
	if (entry == zswap_rb_search())
		// so we can make sure this entry is still right
		// zswap_invalidate_entry() since the below writeback can't fail
		zswap_entry_get(entry)
		zswap_invalidate_entry(tree, entry)

		// remove from lru list
		zswap_lru_del()

	spin_unlock(&tree->lock)

	__zswap_load()

	__swap_writepage()	// folio unlock
	folio_put(folio)

	// entry is safe to free, since it's removed from tree and lru above
	zswap_free_entry(entry)
Hi Yosry,

On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> >
> > The first solution that came to mind for me was refcounting the zswap
> > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > the percpu-refcount may be an overkill in terms of memory usage
> > though. I think we can still do our own refcounting with RCU, but it
> > may be more complicated.
>
> FWIW, I was able to reproduce the problem in a vm with the following
> kernel diff:

Thanks for the great find. I was worried about the use-after-free situation in this email:
https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/

Glad you were able to find a reproducible case. That is one of the reasons I changed the free to invalidate entries in my xarray patch.

I think the swapoff code should remove the entry from the tree, and just wait for each zswap entry's refcount to drop to zero before freeing it. That way you shouldn't need to refcount the tree. The tree refcount is effectively the combined refcount of all the zswap entries. Having a refcount on the tree would be very high contention.

Chris

> diff --git a/mm/zswap.c b/mm/zswap.c
> index 78df16d307aa8..6580a4be52a18 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -880,6 +880,9 @@ static enum lru_status shrink_memcg_cb(struct
> list_head *item, struct list_lru_o
>  */
>  spin_unlock(lock);
>
> + pr_info("Sleeping in shrink_memcg_cb() before
> spin_lock(&tree->lock)\n");
> + schedule_timeout_uninterruptible(msecs_to_jiffies(10 * 1000));
> +
>  /* Check for invalidate() race */
>  spin_lock(&tree->lock);
>  if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
>
> This basically expands the race window to 10 seconds. I have a
> reproducer script attached that utilizes the zswap shrinker (which
> makes this much easier to reproduce btw). The steps are as follows:
> - Compile the kernel with the above diff, and both ZRAM & KASAN enabled.
> - In one terminal, run zswap_wb_swapoff_race.sh.
> - In a different terminal, once the "Sleeping in shrink_memcg_cb()"
>   message is logged, run "swapoff /dev/zram0".
> - In the first terminal, once the 10 seconds elapse, I get a UAF BUG
>   from KASAN (log attached as well).
On Tue, Jan 23, 2024 at 11:21 PM Chengming Zhou <zhouchengming@bytedance.com> wrote: > > Thanks for the great analysis, I missed the swapoff/swapon race myself :) > > > > The first solution that came to mind for me was refcounting the zswap > > tree with RCU with percpu-refcount, similar to how cgroup refs are > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think > > the percpu-refcount may be an overkill in terms of memory usage > > though. I think we can still do our own refcounting with RCU, but it > > may be more complicated. > Hello, > > I also thought about this problem for some time, maybe something like below > can be changed to fix it? It's likely I missed something, just some thoughts. > > IMHO, the problem is caused by the different way in which we use zswap entry > in the writeback, that should be much like zswap_load(). It is possible to keep the behavior difference and get the tree right. Follow a few rules: 1) Always bump the zswap entry refcount after finding the entry (inside the RCU read lock) 2) Always free entry after refcount drops to zero. 3) Swap off should just wait for the each entry ref count drop to zero (or 1 including the swap off code holding one) then free it. Swap off is the slow path anyway. BTW, the swap off is where swap device locks get a lot of contention. > > The zswap_load() comes in with the folio locked in swap cache, so it has > stable zswap tree to search and lock... But in writeback case, we don't, > shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, > then release lru lock to get tree lock, which maybe freed already. > > So we should change here, we read swpentry from entry with lru list lock held, > then release lru lock, to try to lock corresponding folio in swap cache, > if we success, the following things is much the same like zswap_load(). > We can get tree lock, to recheck the invalidate race, if no race happened, > we can make sure the entry is still right and get refcount of it, then > release the tree lock. > > The main differences between this writeback with zswap_load() is the handling > of lru entry and the tree lifetime. The whole zswap_load() function has the > stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half > after __swap_writepage() since we unlock the folio after that. So we can't > reference the tree after that. > > This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early > in tree lock, since thereafter writeback can't fail. BTW, I think we should > also zswap_invalidate_entry() early in zswap_load() and only support the > zswap_exclusive_loads_enabled mode, but that's another topic. > > The second difference is the handling of lru entry, which is easy that we > just zswap_lru_del() in tree lock. > > So no any path can reference the entry from tree or lru after we release > the tree lock, so we can just zswap_free_entry() after writeback. > > Thanks! > > // lru list lock held > shrink_memcg_cb() > swpentry = entry->swpentry > // Don't isolate entry from lru list here, just use list_lru_putback() > spin_unlock(lru list lock) > > folio = __read_swap_cache_async(swpentry) > if (!folio) > return > > if (!folio_was_allocated) > folio_put(folio) > return > > // folio is locked, swapcache is secured against swapoff > tree = get tree from swpentry > spin_lock(&tree->lock) That will not work well with zswap to xarray change. We want to remove the tree lock and only use the xarray lock. 
The lookup should just hold xarray RCU read lock and return the entry with ref count increased. Chris > > // check invalidate race? No > if (entry == zswap_rb_search()) > > // so we can make sure this entry is still right > // zswap_invalidate_entry() since the below writeback can't fail > zswap_entry_get(entry) > zswap_invalidate_entry(tree, entry) > > // remove from lru list > zswap_lru_del() > > spin_unlock(&tree->lock) > > __zswap_load() > > __swap_writepage() // folio unlock > folio_put(folio) > > // entry is safe to free, since it's removed from tree and lru above > zswap_free_entry(entry) > >
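The rules listed above boil down to the standard "free only when the last reference is dropped" pattern. Below is a minimal userspace sketch of that pattern using C11 atomics; it is illustrative only (the object type and helpers are invented for the example), it is not the proposed zswap change, and it leaves out the RCU-protected lookup side entirely.

```c
/*
 * Minimal illustration of "free only when the refcount drops to zero",
 * using C11 atomics. Not the proposed zswap change; names are invented
 * and the RCU-protected lookup side is intentionally left out.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
	atomic_int refcount;	/* starts at 1: the owner's base reference */
	int payload;
};

static struct obj *obj_create(int payload)
{
	struct obj *o = malloc(sizeof(*o));

	atomic_init(&o->refcount, 1);
	o->payload = payload;
	return o;
}

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);
}

static void obj_put(struct obj *o)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref. */
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		free(o);
}

int main(void)
{
	struct obj *o = obj_create(42);

	obj_get(o);			/* a "writeback"-like user pins the object */
	obj_put(o);			/* owner ("swapoff") drops its base ref;
					 * not freed yet, the user still holds one */
	printf("%d\n", o->payload);	/* safe only because we hold a reference */
	obj_put(o);			/* last reference dropped, object is freed */
	return 0;
}
```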
On 2024/1/25 15:53, Yosry Ahmed wrote: >> Hello, >> >> I also thought about this problem for some time, maybe something like below >> can be changed to fix it? It's likely I missed something, just some thoughts. >> >> IMHO, the problem is caused by the different way in which we use zswap entry >> in the writeback, that should be much like zswap_load(). >> >> The zswap_load() comes in with the folio locked in swap cache, so it has >> stable zswap tree to search and lock... But in writeback case, we don't, >> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, >> then release lru lock to get tree lock, which maybe freed already. >> >> So we should change here, we read swpentry from entry with lru list lock held, >> then release lru lock, to try to lock corresponding folio in swap cache, >> if we success, the following things is much the same like zswap_load(). >> We can get tree lock, to recheck the invalidate race, if no race happened, >> we can make sure the entry is still right and get refcount of it, then >> release the tree lock. > > Hmm I think you may be onto something here. Moving the swap cache > allocation ahead before referencing the tree should give us the same > guarantees as zswap_load() indeed. We can also consolidate the > invalidate race checks (right now we have one in shrink_memcg_cb() and > another one inside zswap_writeback_entry()). Right, if we successfully lock folio in the swap cache, we can get the tree lock and check the invalidate race, only once. > > We will have to be careful about the error handling path to make sure > we delete the folio from the swap cache only after we know the tree > won't be referenced anymore. Anyway, I think this can work. Yes, we can't reference tree if we early return or after unlocking folio, since the reference of zswap entry can't protect the tree. > > On a separate note, I think there is a bug in zswap_writeback_entry() > when we delete a folio from the swap cache. I think we are missing a > folio_unlock() there. Ah, yes, and folio_put(). > >> >> The main differences between this writeback with zswap_load() is the handling >> of lru entry and the tree lifetime. The whole zswap_load() function has the >> stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half >> after __swap_writepage() since we unlock the folio after that. So we can't >> reference the tree after that. >> >> This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early >> in tree lock, since thereafter writeback can't fail. BTW, I think we should >> also zswap_invalidate_entry() early in zswap_load() and only support the >> zswap_exclusive_loads_enabled mode, but that's another topic. > > zswap_invalidate_entry() actually doesn't seem to be using the tree at all. > >> >> The second difference is the handling of lru entry, which is easy that we >> just zswap_lru_del() in tree lock. > > Why do we need zswap_lru_del() at all? We should have already isolated > the entry at that point IIUC. I was thinking how to handle the "zswap_lru_putback()" if not writeback, in which case we can't use the entry actually since we haven't got reference of it. So we can don't isolate at the entry, and only zswap_lru_del() when we are going to writeback actually. Thanks!
On 2024/1/25 16:42, Yosry Ahmed wrote: > On Thu, Jan 25, 2024 at 12:30 AM Chengming Zhou > <zhouchengming@bytedance.com> wrote: >> >> On 2024/1/25 15:53, Yosry Ahmed wrote: >>>> Hello, >>>> >>>> I also thought about this problem for some time, maybe something like below >>>> can be changed to fix it? It's likely I missed something, just some thoughts. >>>> >>>> IMHO, the problem is caused by the different way in which we use zswap entry >>>> in the writeback, that should be much like zswap_load(). >>>> >>>> The zswap_load() comes in with the folio locked in swap cache, so it has >>>> stable zswap tree to search and lock... But in writeback case, we don't, >>>> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, >>>> then release lru lock to get tree lock, which maybe freed already. >>>> >>>> So we should change here, we read swpentry from entry with lru list lock held, >>>> then release lru lock, to try to lock corresponding folio in swap cache, >>>> if we success, the following things is much the same like zswap_load(). >>>> We can get tree lock, to recheck the invalidate race, if no race happened, >>>> we can make sure the entry is still right and get refcount of it, then >>>> release the tree lock. >>> >>> Hmm I think you may be onto something here. Moving the swap cache >>> allocation ahead before referencing the tree should give us the same >>> guarantees as zswap_load() indeed. We can also consolidate the >>> invalidate race checks (right now we have one in shrink_memcg_cb() and >>> another one inside zswap_writeback_entry()). >> >> Right, if we successfully lock folio in the swap cache, we can get the >> tree lock and check the invalidate race, only once. >> >>> >>> We will have to be careful about the error handling path to make sure >>> we delete the folio from the swap cache only after we know the tree >>> won't be referenced anymore. Anyway, I think this can work. >> >> Yes, we can't reference tree if we early return or after unlocking folio, >> since the reference of zswap entry can't protect the tree. >> >>> >>> On a separate note, I think there is a bug in zswap_writeback_entry() >>> when we delete a folio from the swap cache. I think we are missing a >>> folio_unlock() there. >> >> Ah, yes, and folio_put(). > > Yes. I am preparing a fix. > >> >>> >>>> >>>> The main differences between this writeback with zswap_load() is the handling >>>> of lru entry and the tree lifetime. The whole zswap_load() function has the >>>> stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half >>>> after __swap_writepage() since we unlock the folio after that. So we can't >>>> reference the tree after that. >>>> >>>> This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early >>>> in tree lock, since thereafter writeback can't fail. BTW, I think we should >>>> also zswap_invalidate_entry() early in zswap_load() and only support the >>>> zswap_exclusive_loads_enabled mode, but that's another topic. >>> >>> zswap_invalidate_entry() actually doesn't seem to be using the tree at all. >>> >>>> >>>> The second difference is the handling of lru entry, which is easy that we >>>> just zswap_lru_del() in tree lock. >>> >>> Why do we need zswap_lru_del() at all? We should have already isolated >>> the entry at that point IIUC. >> >> I was thinking how to handle the "zswap_lru_putback()" if not writeback, >> in which case we can't use the entry actually since we haven't got reference >> of it. 
So we can don't isolate at the entry, and only zswap_lru_del() when >> we are going to writeback actually. > > Why not just call zswap_lru_putback() before we unlock the folio? When early return because __read_swap_cache_async() return NULL or !folio_was_allocated, we don't have a locked folio yet. The entry maybe invalidated and freed concurrently.
On 2024/1/25 15:53, Yosry Ahmed wrote: >> Hello, >> >> I also thought about this problem for some time, maybe something like below >> can be changed to fix it? It's likely I missed something, just some thoughts. >> >> IMHO, the problem is caused by the different way in which we use zswap entry >> in the writeback, that should be much like zswap_load(). >> >> The zswap_load() comes in with the folio locked in swap cache, so it has >> stable zswap tree to search and lock... But in writeback case, we don't, >> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held, >> then release lru lock to get tree lock, which maybe freed already. >> >> So we should change here, we read swpentry from entry with lru list lock held, >> then release lru lock, to try to lock corresponding folio in swap cache, >> if we success, the following things is much the same like zswap_load(). >> We can get tree lock, to recheck the invalidate race, if no race happened, >> we can make sure the entry is still right and get refcount of it, then >> release the tree lock. > > Hmm I think you may be onto something here. Moving the swap cache > allocation ahead before referencing the tree should give us the same > guarantees as zswap_load() indeed. We can also consolidate the > invalidate race checks (right now we have one in shrink_memcg_cb() and > another one inside zswap_writeback_entry()). > > We will have to be careful about the error handling path to make sure > we delete the folio from the swap cache only after we know the tree > won't be referenced anymore. Anyway, I think this can work. > > On a separate note, I think there is a bug in zswap_writeback_entry() > when we delete a folio from the swap cache. I think we are missing a > folio_unlock() there. > Hi, want to know if you are preparing the fix patch, I would just wait to review if you are. Or I can work on it if you are busy with other thing. Thanks!
diff --git a/mm/zswap.c b/mm/zswap.c
index f8bc9e0892687..9675c3c27f9d1 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1790,17 +1790,12 @@ void zswap_swapon(int type)
 void zswap_swapoff(int type)
 {
 	struct zswap_tree *tree = zswap_trees[type];
-	struct zswap_entry *entry, *n;
 
 	if (!tree)
 		return;
 
-	/* walk the tree and free everything */
-	spin_lock(&tree->lock);
-	rbtree_postorder_for_each_entry_safe(entry, n, &tree->rbroot, rbnode)
-		zswap_free_entry(entry);
-	tree->rbroot = RB_ROOT;
-	spin_unlock(&tree->lock);
+	/* try_to_unuse() invalidated all entries already */
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot));
 	kfree(tree);
 	zswap_trees[type] = NULL;
 }