Message ID | 170786028847.11135.14775608389430603086.stgit@91.116.238.104.host.secureserver.net |
---|---|
State | New |
Headers |
Subject: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
From: Chuck Lever <cel@kernel.org>
To: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, hughd@google.com, akpm@linux-foundation.org, Liam.Howlett@oracle.com, oliver.sang@intel.com, feng.tang@intel.com
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, maple-tree@lists.infradead.org, linux-mm@kvack.org, lkp@intel.com
Date: Tue, 13 Feb 2024 16:38:08 -0500
Message-ID: <170786028847.11135.14775608389430603086.stgit@91.116.238.104.host.secureserver.net>
In-Reply-To: <170785993027.11135.8830043889278631735.stgit@91.116.238.104.host.secureserver.net>
References: <170785993027.11135.8830043889278631735.stgit@91.116.238.104.host.secureserver.net> |
Series | Use Maple Trees for simple_offset utilities |
Commit Message
Chuck Lever
Feb. 13, 2024, 9:38 p.m. UTC
From: Chuck Lever <chuck.lever@oracle.com>

Liam says that, unlike with xarray, once the RCU read lock is
released ma_state is not safe to re-use for the next mas_find() call.
But the RCU read lock has to be released on each loop iteration so
that dput() can be called safely.

Thus we are forced to walk the offset tree with fresh state for each
directory entry. mt_find() can do this for us, though it might be a
little less efficient than maintaining ma_state locally.

Since offset_iterate_dir() doesn't build ma_state locally any more,
there's no longer a strong need for offset_find_next(). Clean up by
rolling these two helpers together.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c | 39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)
Comments
On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Liam says that, unlike with xarray, once the RCU read lock is
> released ma_state is not safe to re-use for the next mas_find() call.
> But the RCU read lock has to be released on each loop iteration so
> that dput() can be called safely.
>
> Thus we are forced to walk the offset tree with fresh state for each
> directory entry. mt_find() can do this for us, though it might be a
> little less efficient than maintaining ma_state locally.
>
> Since offset_iterate_dir() doesn't build ma_state locally any more,
> there's no longer a strong need for offset_find_next(). Clean up by
> rolling these two helpers together.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Well, in general I think even xas_next_entry() is not safe to use how
offset_find_next() was using it. Once you drop rcu_read_lock(),
xas->xa_node could go stale. But since you're holding inode->i_rwsem when
using offset_find_next() you should be protected from concurrent
modifications of the mapping (whatever the underlying data structure is) -
that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
maple tree? Am I missing something?

								Honza

> ---
>  fs/libfs.c | 39 +++++++++++++++++----------------------
>  1 file changed, 17 insertions(+), 22 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index f073e9aeb2bf..6e01fde1cf95 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -436,23 +436,6 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>  	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
>  }
>
> -static struct dentry *offset_find_next(struct ma_state *mas)
> -{
> -	struct dentry *child, *found = NULL;
> -
> -	rcu_read_lock();
> -	child = mas_find(mas, ULONG_MAX);
> -	if (!child)
> -		goto out;
> -	spin_lock(&child->d_lock);
> -	if (simple_positive(child))
> -		found = dget_dlock(child);
> -	spin_unlock(&child->d_lock);
> -out:
> -	rcu_read_unlock();
> -	return found;
> -}
> -
>  static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  {
>  	unsigned long offset = dentry2offset(dentry);
> @@ -465,13 +448,22 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>  {
>  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> -	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
> -	struct dentry *dentry;
> +	struct dentry *dentry, *found;
> +	unsigned long offset;
>
> +	offset = ctx->pos;
>  	while (true) {
> -		dentry = offset_find_next(&mas);
> +		found = mt_find(&octx->mt, &offset, ULONG_MAX);
> +		if (!found)
> +			goto out_noent;
> +
> +		dentry = NULL;
> +		spin_lock(&found->d_lock);
> +		if (simple_positive(found))
> +			dentry = dget_dlock(found);
> +		spin_unlock(&found->d_lock);
>  		if (!dentry)
> -			return ERR_PTR(-ENOENT);
> +			goto out_noent;
>
>  		if (!offset_dir_emit(ctx, dentry)) {
>  			dput(dentry);
> @@ -479,9 +471,12 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>  		}
>
>  		dput(dentry);
> -		ctx->pos = mas.index + 1;
> +		ctx->pos = offset;
>  	}
>  	return NULL;
> +
> +out_noent:
> +	return ERR_PTR(-ENOENT);
>  }
>
>  /**
* Jan Kara <jack@suse.cz> [240215 08:16]:
> On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > [...]
>
> Well, in general I think even xas_next_entry() is not safe to use how
> offset_find_next() was using it. Once you drop rcu_read_lock(),
> xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> using offset_find_next() you should be protected from concurrent
> modifications of the mapping (whatever the underlying data structure is) -
> that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> maple tree? Am I missing something?

If you are stopping, you should be pausing the iteration. Although this
works today, it's not how it should be used because if we make changes
(ie: compaction requires movement of data), then you may end up with a
UAF issue. We'd have no way of knowing you are depending on the tree
structure to remain consistent.

IOW the inode->i_rwsem is protecting writes of data but not the
structure holding the data.

This is true for both xarray and maple tree.

Thanks,
Liam
On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 08:16]:
> > [...]
> >
> > Well, in general I think even xas_next_entry() is not safe to use how
> > offset_find_next() was using it. [...] Isn't that enough for the
> > maple tree? Am I missing something?
>
> If you are stopping, you should be pausing the iteration. Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue. We'd have no way of knowing you are depending on the tree
> structure to remain consistent.

I see. But we have versions of these structures that have locking external
to the structure itself, don't we? Then how do you imagine serializing the
background operations like compaction? As much as I agree your argument is
"theoretically clean", it seems a bit like a trap and there are definitely
xarray users that are going to be broken by this (e.g.
tag_pages_for_writeback())...

								Honza
On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 08:16]:
> > [...]
> > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > maple tree? Am I missing something?
>
> If you are stopping, you should be pausing the iteration. Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue. We'd have no way of knowing you are depending on the tree
> structure to remain consistent.
>
> IOW the inode->i_rwsem is protecting writes of data but not the
> structure holding the data.
>
> This is true for both xarray and maple tree.

Would it be appropriate to reorder this series so 7/7 comes before
the transition to use Maple Tree?
* Jan Kara <jack@suse.cz> [240215 12:16]:
> On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > [...]
> >
> > If you are stopping, you should be pausing the iteration. Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue. We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
>
> I see. But we have versions of these structures that have locking external
> to the structure itself, don't we?

Ah, I do have them - but I don't want to propagate its use as the dream
is that it can be removed.

> Then how do you imagine serializing the
> background operations like compaction? As much as I agree your argument is
> "theoretically clean", it seems a bit like a trap and there are definitely
> xarray users that are going to be broken by this (e.g.
> tag_pages_for_writeback())...

I'm not sure I follow the trap logic. There are locks for the data
structure that need to be followed for reading (rcu) and writing
(spinlock for the maple tree). If you don't correctly lock the data
structure then you really are setting yourself up for potential issues
in the future.

The limitations are outlined in the documentation as to how and when to
lock. I'm not familiar with the xarray users, but it does check for
locking with lockdep, but the way this is written bypasses the lockdep
checking as the locks are taken and dropped without the proper scope.

If you feel like this is a trap, then maybe we need to figure out a new
plan to detect incorrect use?

Looking through tag_pages_for_writeback(), it does what is necessary to
keep a safe state - before it unlocks it calls xas_pause(). We have the
same on maple tree; mas_pause(). This will restart the next operation
from the root of the tree (the root can also change), to ensure that it
is safe.

If you have other examples you think are unsafe then I can have a look
at them as well.

You can make the existing code safe by also calling xas_pause() before
the rcu lock is dropped, but that is essentially what Chuck has done in
the maple tree conversion by using mt_find().

Regarding compaction, I would expect the write lock to be taken to
avoid any writes happening while compaction occurs. Readers would use
rcu locking to ensure they return either the old or new value. During
the write critical section, other writers would be in a
"mas_pause()/xas_pause()" state - so once they continue, they will
re-start the walk to the next element in the new tree.

If external locks are used, then compaction would be sub-optimal as it
may unnecessarily hold up readers, or partial write work (before the
point of a store into the maple tree).

Thanks,
Liam
* Chuck Lever <chuck.lever@oracle.com> [240215 12:40]:
> On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> > [...]
> >
> > IOW the inode->i_rwsem is protecting writes of data but not the
> > structure holding the data.
> >
> > This is true for both xarray and maple tree.
>
> Would it be appropriate to reorder this series so 7/7 comes before
> the transition to use Maple Tree?

I think it would, yes.

Thanks,
Liam
On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 12:16]:
> > [...]
>
> I'm not sure I follow the trap logic. There are locks for the data
> structure that need to be followed for reading (rcu) and writing
> (spinlock for the maple tree). If you don't correctly lock the data
> structure then you really are setting yourself up for potential issues
> in the future.
>
> The limitations are outlined in the documentation as to how and when to
> lock. I'm not familiar with the xarray users, but it does check for
> locking with lockdep, but the way this is written bypasses the lockdep
> checking as the locks are taken and dropped without the proper scope.
>
> If you feel like this is a trap, then maybe we need to figure out a new
> plan to detect incorrect use?

OK, I was a bit imprecise. What I wanted to say is that this is a shift in
the paradigm in the sense that previously, we mostly had (and still have)
data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
guaranteeing that unless you call into the function to mutate the data
structure it stays intact. Now maple trees are shifting more in a direction
of black-box API where you cannot assume what happens inside. Which is fine
but then we have e.g. these iterators which do not quite follow this
black-box design and you have to remember subtle details like calling
"mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
the black-box API shouldn't be exposed to the details of the internal
locking at all (but then the performance suffers so I understand why you do
things this way). Second to this ideal variant would be if we could detect
we unlocked the lock without calling xas_pause() and warn on that. Or maybe
xas_unlock*() should be calling xas_pause() automagically and we'd have
similar helpers for RCU to do the magic for you?

> Looking through tag_pages_for_writeback(), it does what is necessary to
> keep a safe state - before it unlocks it calls xas_pause(). We have the
> same on maple tree; mas_pause(). This will restart the next operation
> from the root of the tree (the root can also change), to ensure that it
> is safe.

OK, I've missed the xas_pause(). Thanks for correcting me.

> If you have other examples you think are unsafe then I can have a look
> at them as well.

I'm currently not aware of any but I'll let you know if I find some.
Missing xas/mas_pause() seems really easy.

								Honza
On Fri, Feb 16, 2024 at 11:15:46AM +0100, Jan Kara wrote:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > The limitations are outlined in the documentation as to how and when to
> > lock. I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> >
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
>
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact.

hm, no. The radix tree never guaranteed that to you; I just documented
that it wasn't guaranteed for the XArray.

> Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?

If you're unlocking the lock that protects a data structure while still
using that data structure, you should always be aware that you're doing
something very dangerous! It's no different from calling inode_unlock()
inside a filesystem. Sure, you can do it, but you'd better be ready to
deal with the consequences.

The question is, do we want to be able to defragment slabs or not? My
thinking is "yes", for objects where we can ensure there are no current
users (at least after an RCU grace period), we want to be able to move
them. That does impose certain costs (and subtleties), but just like
fast-GUP and lockless page-cache, I think it's worth doing.

Of course, we don't have slab defragmentation yet, so we're not getting
any benefit from this. The most recent attempt was in 2019:
https://lore.kernel.org/linux-mm/20190603042637.2018-1-tobin@kernel.org/
but there were earlier attempts in 2017:
https://lore.kernel.org/linux-mm/20171227220636.361857279@linux.com/
and 2008:
https://lore.kernel.org/linux-mm/20080216004526.763643520@sgi.com/
so I have to believe it's something we want, just haven't been able to
push across the "merge this now" line.
* Jan Kara <jack@suse.cz> [240216 05:15]: > On Thu 15-02-24 16:07:42, Liam R. Howlett wrote: > > * Jan Kara <jack@suse.cz> [240215 12:16]: > > > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote: > > > > * Jan Kara <jack@suse.cz> [240215 08:16]: > > > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote: > > > > > > From: Chuck Lever <chuck.lever@oracle.com> > > > > > > > > > > > > Liam says that, unlike with xarray, once the RCU read lock is > > > > > > released ma_state is not safe to re-use for the next mas_find() call. > > > > > > But the RCU read lock has to be released on each loop iteration so > > > > > > that dput() can be called safely. > > > > > > .. > > > > > Then how do you imagine serializing the > > > background operations like compaction? As much as I agree your argument is > > > "theoretically clean", it seems a bit like a trap and there are definitely > > > xarray users that are going to be broken by this (e.g. > > > tag_pages_for_writeback())... > > > > I'm not sure I follow the trap logic. There are locks for the data > > structure that need to be followed for reading (rcu) and writing > > (spinlock for the maple tree). If you don't correctly lock the data > > structure then you really are setting yourself up for potential issues > > in the future. > > > > The limitations are outlined in the documentation as to how and when to > > lock. I'm not familiar with the xarray users, but it does check for > > locking with lockdep, but the way this is written bypasses the lockdep > > checking as the locks are taken and dropped without the proper scope. > > > > If you feel like this is a trap, then maybe we need to figure out a new > > plan to detect incorrect use? > > OK, I was a bit imprecise. 
> What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact. Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?
> ..

Fair enough. You can think of the radix-tree and xarray as black boxes as
well; not everyone knows what's going on in there. The rbtree and list are
embedded in your own data structure, and you have to do a lot of work for
your operations.

Right now, it is the way you have described; it's a data structure API. It
will work this way, but you lose the really neat and performant part of the
tree. If we do this right, we can compact at the same time as reading data.
We cannot do that when depending on the locks you use today.

You don't have to use the mas_pause() functions; there are less error-prone
methods such as mt_find(), which Chuck used here. If you want to do really
neat stuff though, you should be looking at the mas_* interface. And yes,
we totally hand you enough rope to hang yourself. I'm not sure we can have
an advanced interface without doing this.
Using the correct calls will allow us to smoothly transition to a model
where you don't depend on the data remaining in the same place in the tree
(or xarray).

> > If you have other examples you think are unsafe then I can have a look
> > at them as well.
>
> I'm currently not aware of any but I'll let you know if I find some.
> Missing xas/mas_pause() seems really easy.

What if we convert the rcu_read_lock() to a mas_read_lock() or
xas_read_lock() and we can check to ensure the state isn't being locked
without being in the 'parked' state (paused or otherwise)?

mas_read_lock(struct ma_state *mas) {
	assert(!mas_active(mas));
	rcu_read_lock();
}

Would that be a reasonable resolution to your concern? Unfortunately, what
was done with the locking in this case would not be detected with this
change unless the rcu_read_lock() was replaced. IOW, people could still
use the rcu_read_lock() and skip the detection.

Doing the same in the mas_unlock() doesn't make as much sense since that
may be called without the intent to reuse the state at all. So we'd be
doing more work than necessary at the end of each loop or use.

Thanks,
Liam
On Fri 16-02-24 11:33:18, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240216 05:15]:
> > > If you have other examples you think are unsafe then I can have a look
> > > at them as well.
> >
> > I'm currently not aware of any but I'll let you know if I find some.
> > Missing xas/mas_pause() seems really easy.
>
> What if we convert the rcu_read_lock() to a mas_read_lock() or
> xas_read_lock() and we can check to ensure the state isn't being locked
> without being in the 'parked' state (paused or otherwise)?
>
> mas_read_lock(struct ma_state *mas) {
> 	assert(!mas_active(mas));
> 	rcu_read_lock();
> }
>
> Would that be a reasonable resolution to your concern? Unfortunately,
> what was done with the locking in this case would not be detected with
> this change unless the rcu_read_lock() was replaced. IOW, people could
> still use the rcu_read_lock() and skip the detection.

Yes, I guess this is still better than nothing.

> Doing the same in the mas_unlock() doesn't make as much sense since that
> may be called without the intent to reuse the state at all. So we'd be
> doing more work than necessary at the end of each loop or use.

Yes, understood.

								Honza
diff --git a/fs/libfs.c b/fs/libfs.c
index f073e9aeb2bf..6e01fde1cf95 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -436,23 +436,6 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
 }
 
-static struct dentry *offset_find_next(struct ma_state *mas)
-{
-	struct dentry *child, *found = NULL;
-
-	rcu_read_lock();
-	child = mas_find(mas, ULONG_MAX);
-	if (!child)
-		goto out;
-	spin_lock(&child->d_lock);
-	if (simple_positive(child))
-		found = dget_dlock(child);
-	spin_unlock(&child->d_lock);
-out:
-	rcu_read_unlock();
-	return found;
-}
-
 static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 {
 	unsigned long offset = dentry2offset(dentry);
@@ -465,13 +448,22 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 {
 	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
-	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
-	struct dentry *dentry;
+	struct dentry *dentry, *found;
+	unsigned long offset;
 
+	offset = ctx->pos;
 	while (true) {
-		dentry = offset_find_next(&mas);
+		found = mt_find(&octx->mt, &offset, ULONG_MAX);
+		if (!found)
+			goto out_noent;
+
+		dentry = NULL;
+		spin_lock(&found->d_lock);
+		if (simple_positive(found))
+			dentry = dget_dlock(found);
+		spin_unlock(&found->d_lock);
 		if (!dentry)
-			return ERR_PTR(-ENOENT);
+			goto out_noent;
 
 		if (!offset_dir_emit(ctx, dentry)) {
 			dput(dentry);
@@ -479,9 +471,12 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 		}
 
 		dput(dentry);
-		ctx->pos = mas.index + 1;
+		ctx->pos = offset;
 	}
 	return NULL;
+
+out_noent:
+	return ERR_PTR(-ENOENT);
 }
 
 /**