Message ID | 20240131162533.247710-3-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-46785-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:2087:b0:106:209c:c626 with SMTP id gs7csp2005781dyb; Wed, 31 Jan 2024 08:28:15 -0800 (PST) X-Google-Smtp-Source: AGHT+IHD4ffzBALdOW5Aw629R04TmNiMUOLBLbaTZyBCjfAzn8mdheNp+EZKmoA4FJTC/r0po+ri X-Received: by 2002:a05:6a21:3889:b0:19c:7e6f:85f2 with SMTP id yj9-20020a056a21388900b0019c7e6f85f2mr1802285pzb.1.1706718495711; Wed, 31 Jan 2024 08:28:15 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1706718495; cv=pass; d=google.com; s=arc-20160816; b=s6rhq+YWu1qydHJE1kl3L5xAYAClJwVUkXBPSiXa0bnEl8bJMVrueT6BAQT83RwKGa 5smIWY/odvdiG6EL7MHXwB3HHafK9F7OO3IMShc8S3HpssVu1xzZkGyHRBokB22cwfPG egALF2llZNrq3NHxkbUYjyBxU+LNAF5TXtJBYcXiUI22QvpkrvoTbG9Q8sagISl0JLIe ge5IgK63oahsIt5tAAGimTdH2updAX97l5g+9A60JbDVAp2mLHpVsV56n3sRCN0swctF 07a7aNymb8+oDtj7zXa6PUM6toF3G2GpqDEXQdfFZgccEx6vGVP5fMbRZ8F7RJZYFFQl 0C4A== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:in-reply-to:message-id :date:subject:cc:to:from:dkim-signature; bh=6wJ6k+XVGp97QkRoYqu73O7Askxr2TglC0OY8izxSN4=; fh=vGA7OBJ6PGMu4lUSEjVl1i9xgf/WOE3Koa9lX9CH8oc=; b=nqKXWv/zl8TODyLJ/3No1nympDmlsVZkXe9X/ejs7SZa9lzDkymt0d0DiCMJmbm/F7 8srgNojxGFpEWBWXzw5qYnsA0hr8YjHOhW2a441Sm0F64DymXR6US/hPMVz2vX0afs5T ML4/ws5b9u96r0nSRenmZKvBICuFou7Q10FHcaINheFHSU+o/bMxD/KPFDTt2/ytJyeQ fN8KLzAKOcho6TmicLIuygHA5qNBgusuKzXGFAafX7hpofepiavtQKMzfLsDeFSYLiiP QIZblRiXMuvyNnR/Jhj/6ugUqSaNBa8qNa1nkBSnVxNzCRURB01qO8CJj346O4bjJDd7 jsxw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@efficios.com header.s=smtpout1 header.b=tPFOoGUC; arc=pass (i=1 spf=pass spfdomain=efficios.com dkim=pass dkdomain=efficios.com dmarc=pass fromdomain=efficios.com); spf=pass (google.com: domain of linux-kernel+bounces-46785-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-46785-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=efficios.com X-Forwarded-Encrypted: i=1; AJvYcCV6YG87PDEAmWKSCwb29G8n/2Xz+aGmmCbZbYzkS68he9/A3JGZempgEzu/rqLIYl7+mfveLZv2ktOJbKx2HngrjBdSKQ== Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [2604:1380:45e3:2400::1]) by mx.google.com with ESMTPS id x9-20020a17090aca0900b0029014755c2dsi1589495pjt.25.2024.01.31.08.28.15 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 31 Jan 2024 08:28:15 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-46785-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) client-ip=2604:1380:45e3:2400::1; Authentication-Results: mx.google.com; dkim=pass header.i=@efficios.com header.s=smtpout1 header.b=tPFOoGUC; arc=pass (i=1 spf=pass spfdomain=efficios.com dkim=pass dkdomain=efficios.com dmarc=pass fromdomain=efficios.com); spf=pass (google.com: domain of linux-kernel+bounces-46785-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-46785-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=efficios.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 37848287812 for <ouuuleilei@gmail.com>; Wed, 31 Jan 2024 16:27:26 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 6714F12F585; Wed, 31 Jan 2024 16:25:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=efficios.com header.i=@efficios.com header.b="tPFOoGUC" Received: from smtpout.efficios.com (smtpout.efficios.com [167.114.26.122]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 95D3012BE8F; Wed, 31 Jan 2024 16:25:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=167.114.26.122 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706718353; cv=none; b=POAj/54xoax5vn2noySHee6yTnX8jrZGXC6kE/NQNacff6SO/RfR7mwreDl+OQdmipANi7BiPj1y7P2jnYnssdj1M8zdmT15EOYMFTztUDCrosN3XRbJS/ND51r7OEH6QMuDkx5igTfgyj+cVRUhW6nYYeDHT0Kc2z0cGlMil68= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706718353; c=relaxed/simple; bh=QiUQHVE9OuD7A7XRfq4nHC+crkQnZ8+IZ0u7lnayc8c=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=QaeRJdyEg3QatJIHmAPF15Yva1YVB6zXFmYZhM1wTKn60uTJ8B39Eb05Mp17z+RWUgny298zd+nuG89EOlkdfKTAwznvtac18b3dfAAcEEVRnOw7ENFwEwdzvMwkqgP2XF0gVmZq78daz6+yGz2es30YtGgJqtVOoEZ3bKis+n0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=efficios.com; spf=pass smtp.mailfrom=efficios.com; dkim=pass (2048-bit key) header.d=efficios.com header.i=@efficios.com header.b=tPFOoGUC; arc=none smtp.client-ip=167.114.26.122 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=efficios.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=efficios.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=efficios.com; s=smtpout1; t=1706718348; bh=QiUQHVE9OuD7A7XRfq4nHC+crkQnZ8+IZ0u7lnayc8c=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=tPFOoGUCQpG6nbDTeV4EO+3l6CQr96GDzwrgBAkFN3BBXvvuVDa7KEgUQVhRO+9UJ OGBfhp6+IHWdIbnDgzb7rp6f5NoCAVT/MC/jdSqh/SbWXQai0GciSCZ5jGzJAZJFn9 zdbBdntL/XW+dTsVqtKTP0pFMFrIkzMRY2EjvD3nxnNlRo8+GV97CGNQTyzuR04/Bv DnVLu22IZhi/FCEikncgvc+r9uRAQtDTq6OOUCYLLFl+SPLOZdSfK/om2/oxbsUHtn heRyE9JrDmR4Al+5f3ZW4R/iVTnsP1/2fKc1z18/UCX+ap0fXxxnWGMUegUer3Cath BZBJCCPLX8AoA== Received: from thinkos.internal.efficios.com (192-222-143-198.qc.cable.ebox.net [192.222.143.198]) by smtpout.efficios.com (Postfix) with ESMTPSA id 4TQ6ph3nMyzVny; Wed, 31 Jan 2024 11:25:48 -0500 (EST) From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> To: Dan Williams <dan.j.williams@intel.com>, Arnd Bergmann <arnd@arndb.de>, Dave Chinner <david@fromorbit.com> Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org>, linux-mm@kvack.org, linux-arch@vger.kernel.org, Vishal Verma <vishal.l.verma@intel.com>, Dave Jiang <dave.jiang@intel.com>, Matthew Wilcox <willy@infradead.org>, Russell King <linux@armlinux.org.uk>, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, dm-devel@lists.linux.dev Subject: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime Date: Wed, 31 Jan 2024 11:25:31 -0500 Message-Id: <20240131162533.247710-3-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20240131162533.247710-1-mathieu.desnoyers@efficios.com> References: <20240131162533.247710-1-mathieu.desnoyers@efficios.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1789624053180114566 X-GMAIL-MSGID: 1789624053180114566 |
Series |
Introduce cpu_dcache_is_aliasing() to fix DAX regression
|
|
Commit Message
Mathieu Desnoyers
Jan. 31, 2024, 4:25 p.m. UTC
Replace the following fs/Kconfig:FS_DAX dependency:
depends on !(ARM || MIPS || SPARC)
By a runtime check within alloc_dax().
This is done in preparation for its use by each filesystem supporting
the "dax" mount option to validate whether DAX is indeed supported.
This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: nvdimm@lists.linux.dev
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: dm-devel@lists.linux.dev
---
drivers/dax/super.c | 6 ++++++
fs/Kconfig | 1 -
2 files changed, 6 insertions(+), 1 deletion(-)
Comments
On 2024-01-31 16:02, Dan Williams wrote: > Mathieu Desnoyers wrote: >> Replace the following fs/Kconfig:FS_DAX dependency: >> >> depends on !(ARM || MIPS || SPARC) >> >> By a runtime check within alloc_dax(). >> >> This is done in preparation for its use by each filesystem supporting >> the "dax" mount option to validate whether DAX is indeed supported. >> >> This is done in preparation for using cpu_dcache_is_aliasing() in a >> following change which will properly support architectures which detect >> data cache aliasing at runtime. >> >> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: Linus Torvalds <torvalds@linux-foundation.org> >> Cc: linux-mm@kvack.org >> Cc: linux-arch@vger.kernel.org >> Cc: Dan Williams <dan.j.williams@intel.com> >> Cc: Vishal Verma <vishal.l.verma@intel.com> >> Cc: Dave Jiang <dave.jiang@intel.com> >> Cc: Matthew Wilcox <willy@infradead.org> >> Cc: Arnd Bergmann <arnd@arndb.de> >> Cc: Russell King <linux@armlinux.org.uk> >> Cc: nvdimm@lists.linux.dev >> Cc: linux-cxl@vger.kernel.org >> Cc: linux-fsdevel@vger.kernel.org >> Cc: dm-devel@lists.linux.dev >> --- >> drivers/dax/super.c | 6 ++++++ >> fs/Kconfig | 1 - >> 2 files changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/dax/super.c b/drivers/dax/super.c >> index 0da9232ea175..e9f397b8a5a3 100644 >> --- a/drivers/dax/super.c >> +++ b/drivers/dax/super.c >> @@ -445,6 +445,12 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) >> dev_t devt; >> int minor; >> >> + /* Unavailable on architectures with virtually aliased data caches. */ >> + if (IS_ENABLED(CONFIG_ARM) || >> + IS_ENABLED(CONFIG_MIPS) || >> + IS_ENABLED(CONFIG_SPARC)) >> + return NULL; > > This function returns ERR_PTR(), not NULL on failure. Except that it returns NULL in the CONFIG_DAX=n case as you noticed below. > > ...and I notice this mistake is also made in include/linux/dax.h in the > CONFIG_DAX=n case. That function also mentions: > > static inline struct dax_device *alloc_dax(void *private, > const struct dax_operations *ops) > { > /* > * Callers should check IS_ENABLED(CONFIG_DAX) to know if this > * NULL is an error or expected. > */ > return NULL; > } > > ...and none of the callers validate the result, but now runtime > validation is necessary. I.e. it is not enough to check > IS_ENABLED(CONFIG_DAX) it also needs to check cpu_dcache_is_aliasing(). If the callers select DAX in their Kconfig, then they don't have to explicitly check for IS_ENABLED(CONFIG_DAX). Things change for the introduced runtime check though. > > With that, there are a few more fixup places needed, pmem_attach_disk(), > dcssblk_add_store(), and virtio_fs_setup_dax(). Which approach should we take then ? Should we: A) Keep returning NULL from alloc_dax() for both cpu_dcache_is_aliasing() and CONFIG_DAX=n, and use IS_ERR_OR_NULL() in the caller. If we do this, then the callers need to somehow translate this NULL into a negative error value, or B) Replace this NULL return value in both cases by a ERR_PTR() (which error value should we return ?). I would favor approach B) which appears more robust and introduces fewer changes. If we go for that approach do we still need to change the callers ? Thanks, Mathieu
Mathieu Desnoyers wrote: [..] > The change for -EOPNOTSUPP in header and implementation would look like > this (more questions below): > > diff --git a/include/linux/dax.h b/include/linux/dax.h > index b463502b16e1..df2d52b8a245 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev) > static inline struct dax_device *alloc_dax(void *private, > const struct dax_operations *ops) > { > - /* > - * Callers should check IS_ENABLED(CONFIG_DAX) to know if this > - * NULL is an error or expected. > - */ > - return NULL; > + return ERR_PTR(-EOPNOTSUPP); > } > static inline void put_dax(struct dax_device *dax_dev) > { > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index 8c3a6e8e6334..c1cf6f0bbe12 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -448,7 +448,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) > > /* Unavailable on architectures with virtually aliased data caches. */ > if (cpu_dcache_is_aliasing()) > - return NULL; > + return ERR_PTR(-EOPNOTSUPP); > > if (WARN_ON_ONCE(ops && !ops->zero_page_range)) > return ERR_PTR(-EINVAL); > > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > > index f4b635526345..254d3b1e420e 100644 > > --- a/drivers/dax/super.c > > +++ b/drivers/dax/super.c > > @@ -322,7 +322,7 @@ EXPORT_SYMBOL_GPL(dax_alive); > > */ > > void kill_dax(struct dax_device *dax_dev) > > { > > - if (!dax_dev) > > + if (IS_ERR_OR_NULL(dax_dev)) > > I am tempted to just leave the "if (!dax_dev)" check here, because > many other functions of this API are only protected by a NULL > pointer check. I would hate to forget to convert one check in this > change, and I don't think it simplifies anything. It's not that it simplifies anything, but mistakes fail safely as a memory leak rather than a crash. > The alternative route I intend to take is to audit all callers > of alloc_dax() and make sure they all save the alloc_dax() return > value in a struct dax_device * local variable first for the sake > of checking for IS_ERR(). This will leave the xyz->dax_dev pointer > initialized to NULL in the error case and simplify the rest of > error checking. I could maybe get on board with that, but it needs a comment somewhere about the asymmetric subtlety. > > > > return; > > > > if (dax_dev->holder_data != NULL) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > > index 4e8fdcb3f1c8..b69c9e442cf4 100644 > > --- a/drivers/nvdimm/pmem.c > > +++ b/drivers/nvdimm/pmem.c > > @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev, > > dax_dev = alloc_dax(pmem, &pmem_dax_ops); > > if (IS_ERR(dax_dev)) { > > rc = PTR_ERR(dax_dev); > > - goto out; > > + if (rc != -EOPNOTSUPP) > > + goto out; > > If I compare the before / after this change, if previously > pmem_attach_disk() was called in a configuration with FS_DAX=n, it would > result in a NULL pointer dereference. No, alloc_dax() only returns NULL CONFIG_DAX=n case, not the CONFIG_FS_DAX=n case. So that means that pmem devices on ARM have been possible without FS_DAX. So, in order for alloc_dax() returning ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error path needs to be changed. > This would be an error handling fix all by itself. Do we really want > to return successfully if dax is unsupported, or should we return > an error here ? Per above, there is no error handling fix, and pmem block device available should not depend on alloc_dax() succeeding. The real question is what to do about device-dax. I *think* it is not affected by cpu_dcache aliasing because it never accesses user mappings through a kernel alias. I doubt device-dax is in use on these platforms, but we might need another fixup for that if someone screams about the alloc_dax() behavior change making them lose device-dax access. > > + } else { > > + set_dax_nocache(dax_dev); > > + set_dax_nomc(dax_dev); > > + if (is_nvdimm_sync(nd_region)) > > + set_dax_synchronous(dax_dev); > > + rc = dax_add_host(dax_dev, disk); > > + if (rc) > > + goto out_cleanup_dax; > > + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); > > + pmem->dax_dev = dax_dev; > > } > > - set_dax_nocache(dax_dev); > > - set_dax_nomc(dax_dev); > > - if (is_nvdimm_sync(nd_region)) > > - set_dax_synchronous(dax_dev); > > - rc = dax_add_host(dax_dev, disk); > > - if (rc) > > - goto out_cleanup_dax; > > - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); > > - pmem->dax_dev = dax_dev; > > > > rc = device_add_disk(dev, disk, pmem_attribute_groups); > > if (rc) > > diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c > > index 4b7ecd4fd431..f911e58a24dd 100644 > > --- a/drivers/s390/block/dcssblk.c > > +++ b/drivers/s390/block/dcssblk.c > > @@ -681,12 +681,14 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char > > if (IS_ERR(dev_info->dax_dev)) { > > rc = PTR_ERR(dev_info->dax_dev); > > dev_info->dax_dev = NULL; > > - goto put_dev; > > + if (rc != -EOPNOTSUPP) > > + goto put_dev; > > config DCSSBLK selects FS_DAX_LIMITED and DAX. > > I'm not sure what selecting DAX is trying to achieve here, because the > Kconfig option is "FS_DAX". > > So depending on the real motivation behind this select, we may want to > consider failure rather than success in the -EOPNOTSUPP case. Ah, true, s390 is a !cpu_dcache_is_aliasing() arch, so it is ok to fail driver load on alloc_dax() failure as it knows that ERR_PTR(-EOPNOTSUPP) will never be returned. > > > + } else { > > + set_dax_synchronous(dev_info->dax_dev); > > + rc = dax_add_host(dev_info->dax_dev, dev_info->gd); > > + if (rc) > > + goto out_dax; > > } > > - set_dax_synchronous(dev_info->dax_dev); > > - rc = dax_add_host(dev_info->dax_dev, dev_info->gd); > > - if (rc) > > - goto out_dax; > > > > get_device(&dev_info->dev); > > rc = device_add_disk(&dev_info->dev, dev_info->gd, NULL); > > My own changes, if we want failure on -EOPNOTSUPP: > > diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c > index 4b7ecd4fd431..f363c1d51d9a 100644 > --- a/drivers/s390/block/dcssblk.c > +++ b/drivers/s390/block/dcssblk.c > @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char > int rc, i, j, num_of_segments; > struct dcssblk_dev_info *dev_info; > struct segment_info *seg_info, *temp; > + struct dax_device *dax_dev; > char *local_buf; > unsigned long seg_byte_size; > > @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char > if (rc) > goto put_dev; > > - dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops); > - if (IS_ERR(dev_info->dax_dev)) { > - rc = PTR_ERR(dev_info->dax_dev); > - dev_info->dax_dev = NULL; > + dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops); > + if (IS_ERR(dax_dev)) { > + rc = PTR_ERR(dax_dev); > goto put_dev; > } > - set_dax_synchronous(dev_info->dax_dev); > + set_dax_synchronous(dax_dev); > + dev_info->dax_dev = dax_dev; Looks good to me. > rc = dax_add_host(dev_info->dax_dev, dev_info->gd); > if (rc) > goto out_dax; > > > > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c > > index 5f1be1da92ce..11053a70f5ab 100644 > > --- a/fs/fuse/virtio_fs.c > > +++ b/fs/fuse/virtio_fs.c > > @@ -16,6 +16,7 @@ > > #include <linux/fs_context.h> > > #include <linux/fs_parser.h> > > #include <linux/highmem.h> > > +#include <linux/cleanup.h> > > #include <linux/uio.h> > > #include "fuse_i.h" > > > > @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) > > put_dax(dax_dev); > > } > > > > +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) > > + > > static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) > > So either I'm completely missing how ownership works in this function, or > we should be really concerned about the fact that it does no actual > cleanup of anything on any error. > > I would be tempted to first refactor this function without using cleanup.h > so those fixes can be easily backported to older kernels (?) > > Here what I'm seeing so far: > > - devm_release_mem_region() is never called after devm_request_mem_region(). Not > on error, neither on teardown, devm_release_mem_region() is called from virtio_fs_probe() context. That means that when virtio_fs_probe() returns an error the driver core will automatically call devm_request_mem_region(). > - pgmap is never freed on error after devm_kzalloc. That is what the "devm_" in devm_kzalloc() does, free the memory on driver-probe failure, or after the driver remove callback is invoked. > > > { > > + struct dax_device *dax_dev __free(cleanup_dax) = NULL; > > struct virtio_shm_region cache_reg; > > struct dev_pagemap *pgmap; > > bool have_cache; > > @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) > > if (!IS_ENABLED(CONFIG_FUSE_DAX)) > > return 0; > > > > + dax_dev = alloc_dax(fs, &virtio_fs_dax_ops); > > + if (IS_ERR(dax_dev)) { > > + int rc = PTR_ERR(dax_dev); > > + > > + if (rc == -EOPNOTSUPP) > > + return 0; > > + return rc; > > + } > > What is gained by moving this allocation here ? The gain is to fail early in virtio_fs_setup_dax() since the fundamental dependency of alloc_dax() success is not met. For example why let the setup progress to devm_memremap_pages() when alloc_dax() is going to return ERR_PTR(-EOPNOTSUPP). > > > + > > /* Get cache region */ > > have_cache = virtio_get_shm_region(vdev, &cache_reg, > > (u8)VIRTIO_FS_SHMCAP_ID_CACHE); > > @@ -849,10 +862,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) > > dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", > > __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); > > > > - fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops); > > - if (IS_ERR(fs->dax_dev)) > > - return PTR_ERR(fs->dax_dev); > > - > > + fs->dax_dev = no_free_ptr(dax_dev); > > return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, > > fs->dax_dev); > > } > > In addition I have: > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index f90743a94da9..86de91b35f4d 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md) > static struct mapped_device *alloc_dev(int minor) > { > int r, numa_node_id = dm_get_numa_node(); > + struct dax_device *dax_dev; > struct mapped_device *md; > void *old_md; > > @@ -2122,16 +2123,13 @@ static struct mapped_device *alloc_dev(int minor) > md->disk->private_data = md; > sprintf(md->disk->disk_name, "dm-%d", minor); > > - if (IS_ENABLED(CONFIG_FS_DAX)) { > - md->dax_dev = alloc_dax(md, &dm_dax_ops); > - if (IS_ERR(md->dax_dev)) { > - md->dax_dev = NULL; > - } else { > - set_dax_nocache(md->dax_dev); > - set_dax_nomc(md->dax_dev); > - if (dax_add_host(md->dax_dev, md->disk)) > - goto bad; > - } > + dax_dev = alloc_dax(md, &dm_dax_ops); > + if (!IS_ERR(dax_dev)) { > + set_dax_nocache(dax_dev); > + set_dax_nomc(dax_dev); > + md->dax_dev = dax_dev; > + if (dax_add_host(dax_dev, md->disk)) > + goto bad; Looks good.
Mathieu Desnoyers wrote: > On 2024-02-02 12:37, Dan Williams wrote: > > Mathieu Desnoyers wrote: > [...] > >> > > > >> The alternative route I intend to take is to audit all callers > >> of alloc_dax() and make sure they all save the alloc_dax() return > >> value in a struct dax_device * local variable first for the sake > >> of checking for IS_ERR(). This will leave the xyz->dax_dev pointer > >> initialized to NULL in the error case and simplify the rest of > >> error checking. > > > > I could maybe get on board with that, but it needs a comment somewhere > > about the asymmetric subtlety. > > Is this "somewhere" at every alloc_dax() call site, or do you have > something else in mind ? At least kill_dax() should mention the asymmetry I think. > > > > >> > >> > >>> return; > >>> > >>> if (dax_dev->holder_data != NULL) > >>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > >>> index 4e8fdcb3f1c8..b69c9e442cf4 100644 > >>> --- a/drivers/nvdimm/pmem.c > >>> +++ b/drivers/nvdimm/pmem.c > >>> @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev, > >>> dax_dev = alloc_dax(pmem, &pmem_dax_ops); > >>> if (IS_ERR(dax_dev)) { > >>> rc = PTR_ERR(dax_dev); > >>> - goto out; > >>> + if (rc != -EOPNOTSUPP) > >>> + goto out; > >> > >> If I compare the before / after this change, if previously > >> pmem_attach_disk() was called in a configuration with FS_DAX=n, it would > >> result in a NULL pointer dereference. > > > > No, alloc_dax() only returns NULL CONFIG_DAX=n case, not the > > CONFIG_FS_DAX=n case. > > Indeed, I was wrong there. > > > So that means that pmem devices on ARM have been > > possible without FS_DAX. So, in order for alloc_dax() returning > > ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error > > path needs to be changed. > Good point. We're moving the depends on !(ARM || MIPS |PARC) from FS_DAX > Kconfig to a runtime check in alloc_dax(), which is used whenever DAX=y, > which includes configurations that had FS_DAX=n previously. > > I'll change the error path in pmem_attack_disk to treat -EOPNOTSUPP > alloc_dax() return value as non-fatal. > > > > >> This would be an error handling fix all by itself. Do we really want > >> to return successfully if dax is unsupported, or should we return > >> an error here ? > > > > Per above, there is no error handling fix, and pmem block device > > available should not depend on alloc_dax() succeeding. > > I agree on treating alloc_dax() failure as non-fatal. There is > however one error handling fix to nvdimm/pmem which I plan to > introduce as an initial patch before this change: > > nvdimm/pmem: Fix leak on dax_add_host() failure > > Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done > before setting pmem->dax_dev, which therefore issues the two following > calls on NULL pointers: > > out_cleanup_dax: > kill_dax(pmem->dax_dev); > put_dax(pmem->dax_dev); > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 4e8fdcb3f1c8..9fe358090720 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, > set_dax_nomc(dax_dev); > if (is_nvdimm_sync(nd_region)) > set_dax_synchronous(dax_dev); > + pmem->dax_dev = dax_dev; > rc = dax_add_host(dax_dev, disk); > if (rc) > goto out_cleanup_dax; > dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); > - pmem->dax_dev = dax_dev; > - > rc = device_add_disk(dev, disk, pmem_attribute_groups); > if (rc) > goto out_remove_host; Yup, looks good. > > The real question is what to do about device-dax. I *think* it is not > > affected by cpu_dcache aliasing because it never accesses user mappings > > through a kernel alias. I doubt device-dax is in use on these platforms, > > but we might need another fixup for that if someone screams about the > > alloc_dax() behavior change making them lose device-dax access. > > By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX. > > Based on your analysis, is alloc_dax() still the right spot where > to place this runtime check ? Which call sites are responsible > for invoking alloc_dax() for device-dax ? That is in devm_create_dev_dax(). > If we know which call sites do not intend to use the kernel linear > mapping, we could introduce a flag (or a new variant of the alloc_dax() > API) that would either enforce or skip the check. Hmmm, it looks like there is already a natural flag for that. If alloc_dax() is passed a NULL operations pointer it means there are no kernel usages of the aliased mapping. That actually fits rather nicely. [..] > >>> @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) > >>> if (!IS_ENABLED(CONFIG_FUSE_DAX)) > >>> return 0; > >>> > >>> + dax_dev = alloc_dax(fs, &virtio_fs_dax_ops); > >>> + if (IS_ERR(dax_dev)) { > >>> + int rc = PTR_ERR(dax_dev); > >>> + > >>> + if (rc == -EOPNOTSUPP) > >>> + return 0; > >>> + return rc; > >>> + } > >> > >> What is gained by moving this allocation here ? > > > > The gain is to fail early in virtio_fs_setup_dax() since the fundamental > > dependency of alloc_dax() success is not met. For example why let the > > setup progress to devm_memremap_pages() when alloc_dax() is going to > > return ERR_PTR(-EOPNOTSUPP). > > What I don't know is whether there is a dependency requiring to do > devm_request_mem_region(), devm_kzalloc(), devm_memremap_pages() > before calling alloc_dax() ? > > Those 3 calls are used to populate: > > fs->window_phys_addr = (phys_addr_t) cache_reg.addr; > fs->window_len = (phys_addr_t) cache_reg.len; > > and then alloc_dax() takes "fs" as private data parameter. So it's > unclear to me whether we can swap the invocation order. I suspect > that it is not an issue because it is only used to populate > dax_dev->private, but I prefer to confirm this with you just to be > on the safe side. Thanks for that. All of those need to be done before the fs goes live later in virtio_device_ready(), but before that point nothing should be calling into virtio_fs_dax_ops, so as far as I can see it is safe to change the order.
Mathieu Desnoyers wrote: [..] > > Thanks for that. All of those need to be done before the fs goes live > > later in virtio_device_ready(), but before that point nothing should be > > calling into virtio_fs_dax_ops, so as far as I can see it is safe to > > change the order. > > Sounds good, I'll do that. > > I will soon be ready to send out a RFC v4, which is still only > compiled-tested. Do you happen to have some kind of test suite > you can use to automate some of the runtime testing ? There is a test suite for the pmem, dm, and dax changes (https://github.com/pmem/ndctl?tab=readme-ov-file#unit-tests), but not automated unfortunately. The NVDIMM maintainer team will run that before pushing patches out to the fixes branch if you just want to lean on that. For the rest I think we will need to depend on tested-by's from s390 + virtio_fs folks, and / or sufficient soak time in linux-next.
diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0da9232ea175..e9f397b8a5a3 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -445,6 +445,12 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) dev_t devt; int minor; + /* Unavailable on architectures with virtually aliased data caches. */ + if (IS_ENABLED(CONFIG_ARM) || + IS_ENABLED(CONFIG_MIPS) || + IS_ENABLED(CONFIG_SPARC)) + return NULL; + if (WARN_ON_ONCE(ops && !ops->zero_page_range)) return ERR_PTR(-EINVAL); diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX