[GIT,PULL] tracing: Fixes for 6.6-rc2

Message ID 20230923191420.10c42e4f@rorschach.local.home

Commit Message

Steven Rostedt Sept. 23, 2023, 11:14 p.m. UTC
  Linus,

Tracing fixes for 6.6-rc2:

- Fix the "bytes" output of the per_cpu stat file
  The tracefs per_cpu/cpu*/stats "bytes" field was giving bogus values because the
  accounting was not accurate. It is supposed to show how many used bytes are
  still in the ring buffer, but even when the ring buffer was empty it would
  still show there were bytes used.

- Fix a bug in eventfs where reading a dynamic event directory (open) and then
  creating a dynamic event that goes into that directory screws up the accounting.
  On close, the newly created event dentry will get a "dput" without ever having
  had a "dget" done for it. The fix is to allocate an array on dir open to save which
  dentries actually had a "dget" done on them, and which ones to "dput" on close.

[
   Note, the show_event_dentries was critical in debugging the above ;-)
]

Please pull the latest trace-v6.6-rc2 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace-v6.6-rc2

Tag SHA1: 3164684ad125d90cdb6c96cddb00676649831d52
Head SHA1: ef36b4f92868d66908e235980f74afdfb9742d12


Steven Rostedt (Google) (1):
      eventfs: Remember what dentries were created on dir open

Zheng Yejian (1):
      ring-buffer: Fix bytes info in per_cpu buffer stats

----
 fs/tracefs/event_inode.c   | 87 +++++++++++++++++++++++++++++++++++++---------
 kernel/trace/ring_buffer.c | 28 ++++++++-------
 2 files changed, 85 insertions(+), 30 deletions(-)
---------------------------
  

Comments

Linus Torvalds Sept. 24, 2023, 9:09 p.m. UTC | #1
On Sat, 23 Sept 2023 at 16:14, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> - Fix a bug in eventfs where reading a dynamic event directory (open) and then
>   creating a dynamic event that goes into that directory screws up the accounting.

Honestly, I'm getting more and more convinced that you just need to
stop this eventfs stuff.

This is just *incredibly* ugly:

  /*
   * This just sets the file->private_data back to the cursor and back.
   */
  static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx)
  {
        struct dentry_list *dlist = file->private_data;
        int ret;

        file->private_data = dlist->cursor;
        ret = dcache_readdir(file, ctx);
        dlist->cursor = file->private_data;
        file->private_data = dlist;
        return ret;
  }

I guess it works by the f_pos locking magic, but by christ is this
ugly beyond words.

Honestly, now, are the eventfs changes *really* making up for this
kind of continual "this is crazy" garbage? We had the whole "this is
undebuggable" discussion, now there's stuff like this.

Were people even *aware* of the f_pos locking, or did this just happen to work?

And that's entirely ignoring the disgusting thing that is that
"allocate an array of every dentry we looked at" issue. Which honestly
also looks disgusting.

I beg of you: think hard about just reverting all the eventfs changes
that moved away from using dentries natively, and instead started
doing these *incredibly* hacky and ugly things.

Really.

               Linus
  
pr-tracker-bot@kernel.org Sept. 24, 2023, 9:24 p.m. UTC | #2
The pull request you sent on Sat, 23 Sep 2023 19:14:20 -0400:

> git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git trace-v6.6-rc2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5edc6bb321d970c77d666a6cf8eeb060f2d18116

Thank you!
  
Steven Rostedt Sept. 25, 2023, 2:53 p.m. UTC | #3
On Sun, 24 Sep 2023 14:09:04 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 23 Sept 2023 at 16:14, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > - Fix a bug in eventfs where reading a dynamic event directory (open) and then
> >   creating a dynamic event that goes into that directory screws up the accounting.  
> 
> Honestly, I'm getting more and more convinced that you just need to
> stop this eventfs stuff.
> 
> This is just *incredibly* ugly:
> 
>   /*
>    * This just sets the file->private_data back to the cursor and back.
>    */
>   static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx)
>   {
>         struct dentry_list *dlist = file->private_data;
>         int ret;
> 
>         file->private_data = dlist->cursor;
>         ret = dcache_readdir(file, ctx);
>         dlist->cursor = file->private_data;
>         file->private_data = dlist;
>         return ret;
>   }
> 
> I guess it works by the f_pos locking magic, but by christ is this
> ugly beyond words.
> 
> Honestly, now, are the eventfs changes *really* making up for this
> kind of continual "this is crazy" garbage? We had the whole "this is
> undebuggable" discussion, now there's stuff like this.

I never said "this is undebuggable". The show_event_dentries file just
shows ref counts and what was allocated; not that it was "undebuggable".
When ls is done on eventfs, the dentries are created. When memory
is tight, any dentry that is not open should be reclaimed, and
show_event_dentries was added to see whether they were reclaimed or are
still around taking up memory. It also shows the ref counts, so you can
see whether closing the file/directory decrements the ref count
properly. It was used here because I found a way to dec a dentry
without first upping the refcount, as I explain below.

The purpose of eventfs here is that the /sys/kernel/tracing/events
directory currently allocates dentries and inodes for all existing
events in the kernel. Since there are now over a thousand events in the
kernel, and each event has several files created for them, and these
files are seldom looked at, why should they be allocated when not used?

These dentries and inodes are allocated for the top level directory,
wasting memory for most users. When an instance is created, it makes
matters even worse:

 mkdir /sys/kernel/tracing/instances/foo

This creates an entire copy of the events directory to maintain state
(you can enable events for this instance and not for other instances), so
all those dentries and inodes for the events are allocated again (around
20MB worth).

And since we plan on using many instances in production, we would rather
not waste all that memory. That was the reason for doing this in the first place.

Now, I presented this at LSFMM where I learned about /proc doing
something similar (and you pointed that out too) but when I looked at
that code I couldn't figure out how to easily make that work with
tracefs, so this work came out instead.

I'm not an FS developer so there may be a better way to do this. I
would be happy to hear about better alternatives.

> 
> Were people even *aware* of the f_pos locking, or did this just happen to work?

I looked at the implementation of dcache_dir_open(), dcache_readdir()
and dcache_dir_close() and saw how they allocated, modified, and freed
the file->private_data / cursor. I came to the conclusion that if
there wasn't protection around them, it would not work. In fact,
it's abstracted out enough that I could have simply copied the code
and used my struct dentry_list directly. But instead of copying
that code, I did it this way to reuse the existing code.

Would you prefer that I just copy that code (there's really not much)
and implement it less "ugly" by using my own cursor? That would still
require the f_pos locking magic, just like any other implementation of
.iterate_shared and friends.

> 
> And that's entirely ignoring the disgusting thing that is that
> "allocate an array of every dentry we looked at" issue. Which honestly
> also looks disgusting.

How is it disgusting? How else would you do it?

You open a directory to read the files, which creates the dentries and
ups their ref counts. In the meantime, a new file is created (it's OK
not to see it, as it's similar to RCU; only new opens will see the new
file), but when you close the directory, you must only do the dput on
the files referenced at open and not on the new file. The old code
walked the linked list of all the files, which also included the new
file that was not referenced at open.

> 
> I beg of you: think hard about just reverting all the eventfs changes
> that moved away from using dentries natively, and instead started
> doing these *incredibly* hacky and ugly things.

And bring back all that wasted memory?

If there's a better way to achieve the same thing, I'll happily do it.

-- Steve
  
Linus Torvalds Sept. 25, 2023, 3:40 p.m. UTC | #4
On Mon, 25 Sept 2023 at 07:53, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> These dentries and inodes are allocated for the top level directory and
> wasting memory for most users. When an instance is created it makes
> matters even worse.

So honestly, dentries are *tiny*. They are probably not any bigger
than the backing data structures you use to keep track of the names in
the first place.

The memory cost is likely almost entirely from the inodes, which are
indeed pretty big and fat.

> If there's a better way to achieve the same thing, I'll happily do it.

Do you actually *need* individual inodes for each of the entries in tracefs?

Because I think the *big* memory savings would come from using one single
inode for every file (give directories individual inodes;
anything else will confuse user-space 'pwd' etc. mightily)

Then you'd

 (a) have the actual tracefs-specific data in dentry->d_fsdata

 (b) use "inode->i_op->getattr()" to make the stat() info look
different for different files (if you even care, but you might)

and I think that would be most of it.

You might hit some gotcha somewhere - things like "dcache_readdir()
will take the inode number just from the inode, and there's no
callback for it", so if you care about showing unique inode numbers to
user space, we might need to help you out some way.

But that "reuse a single inode" is actually a very traditional
pattern, even if I suspect that pattern doesn't remain anywhere. /proc
used to do things like that, iirc.

                        Linus
  

Patch

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 9f64e7332796..5f1714089884 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -70,6 +70,7 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags);
 static int dcache_dir_open_wrapper(struct inode *inode, struct file *file);
+static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx);
 static int eventfs_release(struct inode *inode, struct file *file);
 
 static const struct inode_operations eventfs_root_dir_inode_operations = {
@@ -79,7 +80,7 @@  static const struct inode_operations eventfs_root_dir_inode_operations = {
 static const struct file_operations eventfs_file_operations = {
 	.open		= dcache_dir_open_wrapper,
 	.read		= generic_read_dir,
-	.iterate_shared	= dcache_readdir,
+	.iterate_shared	= dcache_readdir_wrapper,
 	.llseek		= generic_file_llseek,
 	.release	= eventfs_release,
 };
@@ -396,6 +397,11 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 	return ret;
 }
 
+struct dentry_list {
+	void			*cursor;
+	struct dentry		**dentries;
+};
+
 /**
  * eventfs_release - called to release eventfs file/dir
  * @inode: inode to be released
@@ -404,26 +410,25 @@  static struct dentry *eventfs_root_lookup(struct inode *dir,
 static int eventfs_release(struct inode *inode, struct file *file)
 {
 	struct tracefs_inode *ti;
-	struct eventfs_inode *ei;
-	struct eventfs_file *ef;
-	struct dentry *dentry;
-	int idx;
+	struct dentry_list *dlist = file->private_data;
+	void *cursor;
+	int i;
 
 	ti = get_tracefs(inode);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
 		return -EINVAL;
 
-	ei = ti->private;
-	idx = srcu_read_lock(&eventfs_srcu);
-	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
-		mutex_lock(&eventfs_mutex);
-		dentry = ef->dentry;
-		mutex_unlock(&eventfs_mutex);
-		if (dentry)
-			dput(dentry);
+	if (WARN_ON_ONCE(!dlist))
+		return -EINVAL;
+
+	for (i = 0; dlist->dentries[i]; i++) {
+		dput(dlist->dentries[i]);
 	}
-	srcu_read_unlock(&eventfs_srcu, idx);
+
+	cursor = dlist->cursor;
+	kfree(dlist->dentries);
+	kfree(dlist);
+	file->private_data = cursor;
 	return dcache_dir_close(inode, file);
 }
 
@@ -442,22 +447,70 @@  static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
 	struct eventfs_file *ef;
+	struct dentry_list *dlist;
+	struct dentry **dentries = NULL;
 	struct dentry *dentry = file_dentry(file);
+	struct dentry *d;
 	struct inode *f_inode = file_inode(file);
+	int cnt = 0;
 	int idx;
+	int ret;
 
 	ti = get_tracefs(f_inode);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
 		return -EINVAL;
 
+	if (WARN_ON_ONCE(file->private_data))
+		return -EINVAL;
+
+	dlist = kmalloc(sizeof(*dlist), GFP_KERNEL);
+	if (!dlist)
+		return -ENOMEM;
+
 	ei = ti->private;
 	idx = srcu_read_lock(&eventfs_srcu);
 	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-		create_dentry(ef, dentry, false);
+		d = create_dentry(ef, dentry, false);
+		if (d) {
+			struct dentry **tmp;
+
+			tmp = krealloc(dentries, sizeof(d) * (cnt + 2), GFP_KERNEL);
+			if (!tmp)
+				break;
+			tmp[cnt] = d;
+			tmp[cnt + 1] = NULL;
+			cnt++;
+			dentries = tmp;
+		}
 	}
 	srcu_read_unlock(&eventfs_srcu, idx);
-	return dcache_dir_open(inode, file);
+	ret = dcache_dir_open(inode, file);
+
+	/*
+	 * dcache_dir_open() sets file->private_data to a dentry cursor.
+	 * Need to save that but also save all the dentries that were
+	 * opened by this function.
+	 */
+	dlist->cursor = file->private_data;
+	dlist->dentries = dentries;
+	file->private_data = dlist;
+	return ret;
+}
+
+/*
+ * This just sets the file->private_data back to the cursor and back.
+ */
+static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx)
+{
+	struct dentry_list *dlist = file->private_data;
+	int ret;
+
+	file->private_data = dlist->cursor;
+	ret = dcache_readdir(file, ctx);
+	dlist->cursor = file->private_data;
+	file->private_data = dlist;
+	return ret;
 }
 
 /**
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a1651edc48d5..28daf0ce95c5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -354,6 +354,11 @@  static void rb_init_page(struct buffer_data_page *bpage)
 	local_set(&bpage->commit, 0);
 }
 
+static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
+{
+	return local_read(&bpage->page->commit);
+}
+
 static void free_buffer_page(struct buffer_page *bpage)
 {
 	free_page((unsigned long)bpage->page);
@@ -2003,7 +2008,7 @@  rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
 			 * Increment overrun to account for the lost events.
 			 */
 			local_add(page_entries, &cpu_buffer->overrun);
-			local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+			local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes);
 			local_inc(&cpu_buffer->pages_lost);
 		}
 
@@ -2367,11 +2372,6 @@  rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
 			       cpu_buffer->reader_page->read);
 }
 
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
-{
-	return local_read(&bpage->page->commit);
-}
-
 static struct ring_buffer_event *
 rb_iter_head_event(struct ring_buffer_iter *iter)
 {
@@ -2517,7 +2517,7 @@  rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
 		 * the counters.
 		 */
 		local_add(entries, &cpu_buffer->overrun);
-		local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
+		local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
 		local_inc(&cpu_buffer->pages_lost);
 
 		/*
@@ -2660,9 +2660,6 @@  rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
 
 	event = __rb_page_index(tail_page, tail);
 
-	/* account for padding bytes */
-	local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
-
 	/*
 	 * Save the original length to the meta data.
 	 * This will be used by the reader to add lost event
@@ -2676,7 +2673,8 @@  rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
 	 * write counter enough to allow another writer to slip
 	 * in on this page.
 	 * We put in a discarded commit instead, to make sure
-	 * that this space is not used again.
+	 * that this space is not used again, and this space will
+	 * not be accounted into 'entries_bytes'.
 	 *
 	 * If we are less than the minimum size, we don't need to
 	 * worry about it.
@@ -2701,6 +2699,9 @@  rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
 	/* time delta must be non zero */
 	event->time_delta = 1;
 
+	/* account for padding bytes */
+	local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
 	/* Make sure the padding is visible before the tail_page->write update */
 	smp_wmb();
 
@@ -4215,7 +4216,7 @@  u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu)
 EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
 
 /**
- * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer
  * @buffer: The ring buffer
  * @cpu: The per CPU buffer to read from.
  */
@@ -4723,6 +4724,7 @@  static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
 
 	length = rb_event_length(event);
 	cpu_buffer->reader_page->read += length;
+	cpu_buffer->read_bytes += length;
 }
 
 static void rb_advance_iter(struct ring_buffer_iter *iter)
@@ -5816,7 +5818,7 @@  int ring_buffer_read_page(struct trace_buffer *buffer,
 	} else {
 		/* update the entry counter */
 		cpu_buffer->read += rb_page_entries(reader);
-		cpu_buffer->read_bytes += BUF_PAGE_SIZE;
+		cpu_buffer->read_bytes += rb_page_commit(reader);
 
 		/* swap the pages */
 		rb_init_page(bpage);