Message ID: 20231127103208.25748-1-dwagner@suse.de
State: New
From: Daniel Wagner <dwagner@suse.de>
To: linux-nvme@lists.infradead.org
Cc: linux-kernel@vger.kernel.org, Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>, Hannes Reinecke <hare@suse.de>, Daniel Wagner <dwagner@suse.de>
Subject: [RFC v1] nvme: add cse, ds, ms, nsze and nuse to sysfs
Date: Mon, 27 Nov 2023 11:32:08 +0100
Message-ID: <20231127103208.25748-1-dwagner@suse.de>
Series: [RFC,v1] nvme: add cse, ds, ms, nsze and nuse to sysfs
Commit Message
Daniel Wagner
Nov. 27, 2023, 10:32 a.m. UTC
libnvme uses sysfs to enumerate the NVMe resources. However, a few
attributes are missing from sysfs, so libnvme issues commands for them
during discovery.
As the kernel already knows all these attributes and we would like to
avoid having libnvme issue commands all the time, expose the missing
attributes.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
As discussed during ALPSS, these are the missing attributes which libnvme
is still looking up via commands. I've tested this with a modified libnvme and
didn't observe any ioctls anymore.
I'm pretty sure the naming is a bit off for the variables. Not really sure if we
want to stick to the spec naming scheme or have our own one, e.g. 'nsze' vs
'capacity'.
Also getting a pointer to the nvme_ns data structure is a bit strange
(dev_to_nvme_ns). This step is necessary as many of the ns attributes are in
nvme_ns. Shouldn't these per-path values all be the same, and thus couldn't
they live in nvme_ns_head? Anyway, I'm just not sure how to deal with this, so
any pointers are highly welcome!
Cheers,
Daniel
drivers/nvme/host/core.c | 2 ++
drivers/nvme/host/nvme.h | 2 ++
drivers/nvme/host/sysfs.c | 72 +++++++++++++++++++++++++++++++++++++++
3 files changed, 76 insertions(+)
Comments
On Mon, Nov 27, 2023 at 11:32:08AM +0100, Daniel Wagner wrote:
> libnvme is using the sysfs for enumarating the nvme resources. Though
> there are few missing attritbutes in the sysfs. For these libnvme issues
> commands during discovering.
>
> As the kernel already knows all these attributes and we would like to
> avoid libnvme to issue commands all the time, expose these missing
> attributes.

The id namespace 'nuse' field can be quite volatile: it can change on
any write or discard command, so caching it may quickly get out of sync
with the actual value.
On Mon, Nov 27, 2023 at 03:44:11AM -0700, Keith Busch wrote:
> On Mon, Nov 27, 2023 at 11:32:08AM +0100, Daniel Wagner wrote:
> > libnvme is using the sysfs for enumarating the nvme resources. Though
> > there are few missing attritbutes in the sysfs. For these libnvme issues
> > commands during discovering.
> >
> > As the kernel already knows all these attributes and we would like to
> > avoid libnvme to issue commands all the time, expose these missing
> > attributes.
>
> The id namespace 'nuse' field can be quite volatile: it can change on
> any write or discard command, so caching it may quickly get out of sync
> with the actual value.

libnvme itself is also caching this value and exposes it via the
nvme_ns_get_lba_util() getter. I'd say libnvme shouldn't cache it
either. Instead the function should just issue the ns command to report
the current nuse value.

I'll drop the nuse sysfs entry.

Unfortunately, 'nvme list' is using the 'nuse' field for showing the
currently used space. I was hoping to get 'nvme list' working without
issuing any commands.
On Mon, Nov 27, 2023 at 01:07:32PM +0100, Daniel Wagner wrote:
> libnvme itself is also cashing this value and exposes it via the
> nvme_ns_get_lba_util() getter. I'd say libnvme shouldn't cache it
> either. Instead the function should just issue the ns command to report
> the current nuse value.
>
> I'll drop the nuse sysfs entry.
>
> Unfortunately, 'nvme list' is using the 'nuse' field for showing the
> currently used space. I was hoping to get 'nvme list' working without
> issuing any commands.

I'd be ok with implementing nuse in a way where we issue an identify
command to read it, but rate limit the calls to something reasonable.
I think the kernel can do that much better than userspace because it
can keep that state a lot better.
On Mon, Nov 27, 2023 at 11:32:08AM +0100, Daniel Wagner wrote:
> Also getting a pointer to the nvme_ns data structure is a bit strange
> (dev_to_nvme_ns).
> This stip is necessary as many of the ns attributes are in
> nvme_ns. Shouldn't these per path values not all be the same and thus couldn't
> these be in nvme_ns_head? Anyway, just not sure who to deal with this. So any
> pointers highly welcomed!

Yes, they probably should be in the ns_head.

> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> +		down_read(&ctrl->namespaces_rwsem);
> +		list_for_each_entry(ns, &ctrl->namespaces, list) {
> +			ret = ns;
> +			break;
> +		}
> +		up_read(&ctrl->namespaces_rwsem);
> +	}
> +	return ret;
> +	}

.. I also don't think this would even be safe as-is, as we don't hold
a reference to the ns after dropping the lock.

> +static ssize_t csi_show(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", dev_to_ns_head(dev)->ids.csi);
> +}
> +static DEVICE_ATTR_RO(csi);
> +
> +static ssize_t lba_ds_show(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	struct nvme_ns *ns = dev_to_nvme_ns(dev);
> +
> +	return sysfs_emit(buf, "%d\n", ns->lba_shift);

lba_ds is a bit of an odd name.  And I also don't think we even need
this, because it really is just a weird encoding for the logical block
size already exposed by the block layer.

> +static ssize_t lba_ms_show(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	struct nvme_ns *ns = dev_to_nvme_ns(dev);
> +
> +	return sysfs_emit(buf, "%d\n", ns->ms);
> +}
> +static DEVICE_ATTR_RO(lba_ms);

I'd probably spell out metadata_size, or probably even better
metadata_bytes to match the unit postfixes elsewhere in the block code.

> +
> +static ssize_t nsze_show(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	struct nvme_ns *ns = dev_to_nvme_ns(dev);
> +
> +	return sysfs_emit(buf, "%llu\n", ns->nsze);
> +}
> +static DEVICE_ATTR_RO(nsze);

This is just the normal size of the block device we already export.
On Mon, Nov 27, 2023 at 03:18:57PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 11:32:08AM +0100, Daniel Wagner wrote:
> > +static ssize_t lba_ms_show(struct device *dev, struct device_attribute *attr,
> > +		char *buf)
> > +{
> > +	struct nvme_ns *ns = dev_to_nvme_ns(dev);
> > +
> > +	return sysfs_emit(buf, "%d\n", ns->ms);
> > +}
> > +static DEVICE_ATTR_RO(lba_ms);
>
> I'd probably spell out metadata_size, or probably even better
> metadata_bytes to match the unit postfixes elsewhere in the block code.

Should this even be an nvme specific attribute? I thought we should have
blk-integrity.c report its 'tuple_size' attribute instead. That should
work as long as we're not dealing with extended metadata at least, but
that's kind of a special format that doesn't have block layer support.
On Mon, Nov 27, 2023 at 08:44:52AM -0700, Keith Busch wrote:
> > I'd probably spell out metadata_size, or probably even better
> > metadata_bytes to match the unit postfixes elsewhere in the block code.
>
> Should this even be an nvme specific attribute? I thought we should have
> blk-integrity.c report its 'tuple_size' attribute instead. That should
> work as long as we're not dealing with extended metadata at least, but
> that's kind of a special format that doesn't have block layer support.

Reporting the tuple size is a good idea. But is that enough for
the existing nvme-cli use case?
On Mon, Nov 27, 2023 at 04:56:49PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 08:44:52AM -0700, Keith Busch wrote:
> > > I'd probably spell out metadata_size, or probably even better
> > > metadata_bytes to match the unit postfixes elsewhere in the block code.
> >
> > Should this even be an nvme specific attribute? I thought we should have
> > blk-integrity.c report its 'tuple_size' attribute instead. That should
> > work as long as we're not dealing with extended metadata at least, but
> > that's kind of a special format that doesn't have block layer support.
>
> Reporting the tuple size is a good idea. But is that enough for
> the existing nvme-cli use case?

nvme-cli currently queries with an admin passthrough identify command, so
adding a new attribute won't break that. I assume Daniel would have it
fall back to that same command for backward compatibility if a desired
sysfs attribute doesn't exist.
On Mon, Nov 27, 2023 at 09:30:14AM -0700, Keith Busch wrote:
> > > Should this even be an nvme specific attribute? I thought we should have
> > > blk-integrity.c report its 'tuple_size' attribute instead. That should
> > > work as long as we're not dealing with extended metadata at least, but
> > > that's kind of a special format that doesn't have block layer support.
> >
> > Reporting the tuple size is a good idea. But is that enough for
> > the existing nvme-cli use case?
>
> nvme-cli currently queries with admin passthrough identify command, so
> adding a new attribute won't break that. I assume Daniel would have it
> fallback to that same command for backward compatibilty if a desired
> sysfs attribute doesn't exist.

Yes. But does it care about the tuple size, or the actual size of the
metadata field even if it is bigger than the PI tuple?
On Mon, Nov 27, 2023 at 05:33:33PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 09:30:14AM -0700, Keith Busch wrote:
> > > > Should this even be an nvme specific attribute? I thought we should have
> > > > blk-integrity.c report its 'tuple_size' attribute instead. That should
> > > > work as long as we're not dealing with extended metadata at least, but
> > > > that's kind of a special format that doesn't have block layer support.
> > >
> > > Reporting the tuple size is a good idea. But is that enough for
> > > the existing nvme-cli use case?
> >
> > nvme-cli currently queries with admin passthrough identify command, so
> > adding a new attribute won't break that. I assume Daniel would have it
> > fallback to that same command for backward compatibilty if a desired
> > sysfs attribute doesn't exist.
>
> Yes. But does it care about the tuple size, or the actual size of the
> metadata field even if is bigger than the PI tuple?

tuple_size is the same value as metadata size regardless of PI usage.
See nvme_init_integrity() for how this driver sets it:

	integrity.tuple_size = ns->ms;
On Mon, Nov 27, 2023 at 09:46:43AM -0700, Keith Busch wrote:
> On Mon, Nov 27, 2023 at 05:33:33PM +0100, Christoph Hellwig wrote:
> > On Mon, Nov 27, 2023 at 09:30:14AM -0700, Keith Busch wrote:
> > > > > Should this even be an nvme specific attribute? I thought we should have
> > > > > blk-integrity.c report its 'tuple_size' attribute instead. That should
> > > > > work as long as we're not dealing with extended metadata at least, but
> > > > > that's kind of a special format that doesn't have block layer support.
> > > >
> > > > Reporting the tuple size is a good idea. But is that enough for
> > > > the existing nvme-cli use case?

'nvme list' is just listing the block size and the meta size in the
'Format' field. So nothing really crazy going on:

    Usage                      Format
    -------------------------- ----------------
    343.33 GB / 512.11 GB      512 B + 0 B

nvme-cli commands like 'nvme ns-id' etc. will always issue a command, so
that is not a concern. It's just the libnvme nvme_scan_topology() call
which should stop issuing any commands.

I'll add the missing tuple_size to the integrity sysfs dir in this case.

> > > nvme-cli currently queries with admin passthrough identify command, so
> > > adding a new attribute won't break that. I assume Daniel would have it
> > > fallback to that same command for backward compatibilty if a desired
> > > sysfs attribute doesn't exist.

Yes, a fallback will exist. There is no need to break existing users.

In summary, the only missing entries are

 - csi
 - tuple_size
 - nuse
>>>>>> Should this even be an nvme specific attribute? I thought we should have
>>>>>> blk-integrity.c report its 'tuple_size' attribute instead. That should
>>>>>> work as long as we're not dealing with extended metadata at least, but
>>>>>> that's kind of a special format that doesn't have block layer support.
>>>>>
>>>>> Reporting the tuple size is a good idea. But is that enough for
>>>>> the existing nvme-cli use case?
>
> 'nvme list' is just listening the block size and the meta size in the
> 'Format' field. So nothing really crazy going on:
>
>     Usage                      Format
>     -------------------------- ----------------
>     343.33 GB / 512.11 GB      512 B + 0 B
>
> nvme-cli commands like 'nmve ns-id' etc will always issue a command so
> that is not a concern. It's just the libnvme nvme_scan_topology() call
> which should stop issuing any commands.
>
> I'll add the missing tuple_size to the integrity sysfs dir in this case.
>
>>>> nvme-cli currently queries with admin passthrough identify command, so
>>>> adding a new attribute won't break that. I assume Daniel would have it
>>>> fallback to that same command for backward compatibilty if a desired
>>>> sysfs attribute doesn't exist.
>
> Yes, a fallback will exist. There is no need to break existing users.
>
> In summary, the only missing entries are
>
> - csi
> - tuple_size
> - nuse

I agree with the comments made, especially the one made by Christoph
that these values should be added to the nshead.
On Mon, Nov 27, 2023 at 09:46:43AM -0700, Keith Busch wrote:
>
> > Yes. But does it care about the tuple size, or the actual size of the
> > metadata field even if is bigger than the PI tuple?
>
> tuple_size is the same value as metadata size regardless of PI usage.
> See nvme_init_integrity() for how this driver sets it:
>
> 	integrity.tuple_size = ns->ms;

Yes, for the case where we actually support integrity in the kernel
for a given device. But if the device has a metadata size larger than
the PI size we still support it, and just let the device strip/insert
the PI. And if nvme-cli wants to report detailed information about
the namespace it probably needs to report the actual metadata size,
as the tuple size won't be reported given that we're never initializing
the kernel PI support.
On Tue, Nov 28, 2023 at 02:05:08PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 09:46:43AM -0700, Keith Busch wrote:
> >
> > > Yes. But does it care about the tuple size, or the actual size of the
> > > metadata field even if is bigger than the PI tuple?
> >
> > tuple_size is the same value as metadata size regardless of PI usage.
> > See nvme_init_integrity() for how this driver sets it:
> >
> > 	integrity.tuple_size = ns->ms;
>
> Yes, for the case where we actually support integrity in the kernel
> for a given device. But if the device has a metadata size larger than
> the PI size we still support it, and just let the device strip/insert
> the PI.

I'm pretty sure that isn't right. We already support PI regardless of
the metadata size as long as the PI field is in the first 8 bytes.
Strip/insert doesn't even work if metadata is larger than a PI field.
For any metadata case where PI isn't used, the driver requests
allocating an empty buffer for the purpose.

> And if nvme-cli wants to report detailed information about
> the namespace it probably needs to report the actual metadata size
> as the tuple size won't be reported given that we're never initializing
> the kernel PI support.

I don't understand. For any namespace with a metadata size, even if the
namespace format doesn't have PI support, we still register an
"integrity" profile with no-ops to get that unused buffer just so the
block layer can access the format. We always set the tuple_size to the
namespace metadata size so the kernel buffer is correctly sized. This
all works as long as the metadata is separate (not extended) and the
kernel has CONFIG_BLK_DEV_INTEGRITY.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 80673ea63fea..f100ee241bd7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2029,6 +2029,8 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	blk_mq_freeze_queue(ns->disk->queue);
 	lbaf = nvme_lbaf_index(id->flbas);
 	ns->lba_shift = id->lbaf[lbaf].ds;
+	ns->nsze = le64_to_cpu(id->nsze);
+	ns->nuse = le64_to_cpu(id->nuse);
 	nvme_set_queue_limits(ns->ctrl, ns->queue);
 
 	nvme_configure_metadata(ns, id);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f35647c470af..97652bf2c787 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -487,6 +487,8 @@ struct nvme_ns {
 	struct nvme_ns_head *head;
 
 	int lba_shift;
+	u64 nsze;
+	u64 nuse;
 	u16 ms;
 	u16 pi_size;
 	u16 sgs;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 212e1b05d298..b46faee50361 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -114,12 +114,84 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(nsid);
 
+static struct nvme_ns *dev_to_nvme_ns(struct device *dev)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (disk->fops == &nvme_bdev_ops)
+		return nvme_get_ns_from_dev(dev);
+	else {
+		struct nvme_ns_head *head = disk->private_data;
+		struct nvme_subsystem *subsys = head->subsys;
+		struct nvme_ctrl *ctrl;
+		struct nvme_ns *ns, *ret = NULL;
+
+		list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+			down_read(&ctrl->namespaces_rwsem);
+			list_for_each_entry(ns, &ctrl->namespaces, list) {
+				ret = ns;
+				break;
+			}
+			up_read(&ctrl->namespaces_rwsem);
+		}
+		return ret;
+	}
+}
+
+static ssize_t csi_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	return sysfs_emit(buf, "%d\n", dev_to_ns_head(dev)->ids.csi);
+}
+static DEVICE_ATTR_RO(csi);
+
+static ssize_t lba_ds_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct nvme_ns *ns = dev_to_nvme_ns(dev);
+
+	return sysfs_emit(buf, "%d\n", ns->lba_shift);
+}
+static DEVICE_ATTR_RO(lba_ds);
+
+static ssize_t lba_ms_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct nvme_ns *ns = dev_to_nvme_ns(dev);
+
+	return sysfs_emit(buf, "%d\n", ns->ms);
+}
+static DEVICE_ATTR_RO(lba_ms);
+
+static ssize_t nsze_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct nvme_ns *ns = dev_to_nvme_ns(dev);
+
+	return sysfs_emit(buf, "%llu\n", ns->nsze);
+}
+static DEVICE_ATTR_RO(nsze);
+
+static ssize_t nuse_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct nvme_ns *ns = dev_to_nvme_ns(dev);
+
+	return sysfs_emit(buf, "%llu\n", ns->nuse);
+}
+static DEVICE_ATTR_RO(nuse);
+
 static struct attribute *nvme_ns_id_attrs[] = {
 	&dev_attr_wwid.attr,
 	&dev_attr_uuid.attr,
 	&dev_attr_nguid.attr,
 	&dev_attr_eui.attr,
+	&dev_attr_csi.attr,
 	&dev_attr_nsid.attr,
+	&dev_attr_lba_ds.attr,
+	&dev_attr_lba_ms.attr,
+	&dev_attr_nsze.attr,
+	&dev_attr_nuse.attr,
 #ifdef CONFIG_NVME_MULTIPATH
 	&dev_attr_ana_grpid.attr,
 	&dev_attr_ana_state.attr,