[v2,3/5] tracing/user_events: Add auto cleanup and a flag to persist events
Commit Message
Currently user events need to be manually deleted via the delete IOCTL
call or via the dynamic_events file. Most operators and processes wish
to have these events automatically clean up when they are no longer used
by anything, to prevent them from piling up without manual maintenance.
However, some operators may not want this, such as when pre-registering
events via the dynamic_events tracefs file.
Add a persist flag to the user-facing header and honor it within the
register IOCTL call. Add a max flag as well to ensure that only known
flags can be used, now and in the future. Update user_event_put() to
attempt an auto delete of the event if it's the last reference. The
auto delete must run in a work queue to ensure proper behavior of
class->reg() invocations that don't expect the call to go away from
underneath them during the unregister. Add a work_struct to the
user_event struct to ensure we can do this reliably.
Link: https://lore.kernel.org/linux-trace-kernel/20230518093600.3f119d68@rorschach.local.home/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
---
include/uapi/linux/user_events.h | 10 ++-
kernel/trace/trace_events_user.c | 118 +++++++++++++++++++++++++++----
2 files changed, 114 insertions(+), 14 deletions(-)
Comments
On Mon, Jun 5, 2023 at 4:39 PM Beau Belgrave <beaub@linux.microsoft.com> wrote:
> + /*
> + * When the event is not enabled for auto-delete there will always
> + * be at least 1 reference to the event. During the event creation
> + * we initially set the refcnt to 2 to achieve this. In those cases
> + * the caller must acquire event_mutex and after decrement check if
> + * the refcnt is 1, meaning this is the last reference. When auto
> + * delete is enabled, there will only be 1 ref, IE: refcnt will be
> + * only set to 1 during creation to allow the below checks to go
> + * through upon the last put. The last put must always be done with
> + * the event mutex held.
> + */
> + if (!locked) {
> + lockdep_assert_not_held(&event_mutex);
> + delete = refcount_dec_and_mutex_lock(&user->refcnt, &event_mutex);
> + } else {
> + lockdep_assert_held(&event_mutex);
> + delete = refcount_dec_and_test(&user->refcnt);
> + }
> +
> + if (!delete)
> + return;
> +
> + /* We now have the event_mutex in all cases */
> +
> + if (user->reg_flags & USER_EVENT_REG_PERSIST) {
> + /* We should not get here when persist flag is set */
> + pr_alert("BUG: Auto-delete engaged on persistent event\n");
> + goto out;
> + }
> +
> + /*
> + * Unfortunately we have to attempt the actual destroy in a work
> + * queue. This is because not all cases handle a trace_event_call
> + * being removed within the class->reg() operation for unregister.
> + */
> + INIT_WORK(&user->put_work, delayed_destroy_user_event);
> +
> + /*
> + * Since the event is still in the hashtable, we have to re-inc
> + * the ref count to 1. This count will be decremented and checked
> + * in the work queue to ensure it's still the last ref. This is
> + * needed because a user-process could register the same event in
> + * between the time of event_mutex release and the work queue
> + * running the delayed destroy. If we removed the item now from
> + * the hashtable, this would result in a timing window where a
> + * user process would fail a register because the trace_event_call
> + * register would fail in the tracing layers.
> + */
> + refcount_set(&user->refcnt, 1);
The refcnt-ing scheme is quite unorthodox.
Atomically decrementing it to zero and then immediately set it back to 1?
Smells racy.
Another process can go through the same code and do another dec and set to 1,
and we'll have two work items queued?
Will mutex_lock help somehow? If yes, then why atomic refcnt?
On Tue, Jun 06, 2023 at 06:26:56PM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 5, 2023 at 4:39 PM Beau Belgrave <beaub@linux.microsoft.com> wrote:
> > + /*
> > + * When the event is not enabled for auto-delete there will always
> > + * be at least 1 reference to the event. During the event creation
> > + * we initially set the refcnt to 2 to achieve this. In those cases
> > + * the caller must acquire event_mutex and after decrement check if
> > + * the refcnt is 1, meaning this is the last reference. When auto
> > + * delete is enabled, there will only be 1 ref, IE: refcnt will be
> > + * only set to 1 during creation to allow the below checks to go
> > + * through upon the last put. The last put must always be done with
> > + * the event mutex held.
> > + */
> > + if (!locked) {
> > + lockdep_assert_not_held(&event_mutex);
> > + delete = refcount_dec_and_mutex_lock(&user->refcnt, &event_mutex);
> > + } else {
> > + lockdep_assert_held(&event_mutex);
> > + delete = refcount_dec_and_test(&user->refcnt);
> > + }
> > +
> > + if (!delete)
> > + return;
> > +
> > + /* We now have the event_mutex in all cases */
> > +
> > + if (user->reg_flags & USER_EVENT_REG_PERSIST) {
> > + /* We should not get here when persist flag is set */
> > + pr_alert("BUG: Auto-delete engaged on persistent event\n");
> > + goto out;
> > + }
> > +
> > + /*
> > + * Unfortunately we have to attempt the actual destroy in a work
> > + * queue. This is because not all cases handle a trace_event_call
> > + * being removed within the class->reg() operation for unregister.
> > + */
> > + INIT_WORK(&user->put_work, delayed_destroy_user_event);
> > +
> > + /*
> > + * Since the event is still in the hashtable, we have to re-inc
> > + * the ref count to 1. This count will be decremented and checked
> > + * in the work queue to ensure it's still the last ref. This is
> > + * needed because a user-process could register the same event in
> > + * between the time of event_mutex release and the work queue
> > + * running the delayed destroy. If we removed the item now from
> > + * the hashtable, this would result in a timing window where a
> > + * user process would fail a register because the trace_event_call
> > + * register would fail in the tracing layers.
> > + */
> > + refcount_set(&user->refcnt, 1);
>
> The refcnt-ing scheme is quite unorthodox.
Yes, it's unfortunately because we have to keep the event in the hashtable.
Typically we'd just remove the event from the hashtable, ref_dec and
then upon final ref_dec free it. The problem with that is the event in
the hashtable is an actual trace_event exposed via tracefs/perf. It
prevents us from removing it at this time as the comment tries to
explain.
> Atomically decrementing it to zero and then immediately set it back to 1?
> Smells racy.
Thanks for pointing this out, I likely need another comment in the code
explaining why this is ok.
It might smell that way :) But once it's zero, the only way to get
another reference is by acquiring event_mutex and then calling
find_user_event(). This is why it's important that at this phase we hold
event_mutex in all cases.
> Another process can go through the same code and do another dec and set to 1
> and we'll have two work queued?
Once we set it to 1, we cannot get into this code path again until the
work that was queued is finished. The scheduled work acquires
event_mutex and ensures it's still the last reference before doing any
actual deletion or a refcount_dec that would allow for another work
queue via this path.
> Will mutex_lock help somehow? If yes, then why atomic refcnt?
Hopefully I fully explained the mutex_lock above; if it's still
confusing, let me know. The reason for the atomic refcnt, though, is
that you initially get a reference to a user_event via find_user_event()
under the lock, but then each FD (and each enabler) holds a reference to
the underlying user_event (and thus the trace_event). Fork() and
enablement registers are done outside of this lock for performance
reasons. When the task exits, we also close down the enablers for that
mm/task, and the lock is not held there either (which is why the less
common refcount_dec_and_mutex_lock is required on put when the lock is
not already acquired).
Thanks,
-Beau
On Mon, 5 Jun 2023 16:38:58 -0700
Beau Belgrave <beaub@linux.microsoft.com> wrote:
> Currently user events need to be manually deleted via the delete IOCTL
> call or via the dynamic_events file. Most operators and processes wish
> to have these events auto cleanup when they are no longer used by
> anything to prevent them piling without manual maintenance. However,
> some operators may not want this, such as pre-registering events via the
> dynamic_events tracefs file.
>
> Add a persist flag to user facing header and honor it within the
> register IOCTL call. Add max flag as well to ensure that only known
> flags can be used now and in the future. Update user_event_put() to
> attempt an auto delete of the event if it's the last reference. The
> auto delete must run in a work queue to ensure proper behavior of
> class->reg() invocations that don't expect the call to go away from
> underneath them during the unregister. Add work_struct to user_event
> struct to ensure we can do this reliably.
>
> Link: https://lore.kernel.org/linux-trace-kernel/20230518093600.3f119d68@rorschach.local.home/
Since there still seems to be some controversy over the persistent events,
could we hold off on implementing them until next merge window? That is, I
would like to only have the fd owned events for this release, which would
give us time to hash out any of the issues with persistent events.
If they are not in, then they are not an API, but once they are in, then we
are stuck with them. I believe everyone is fine with the fd owned events,
right?
>
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
> ---
> include/uapi/linux/user_events.h | 10 ++-
> kernel/trace/trace_events_user.c | 118 +++++++++++++++++++++++++++----
> 2 files changed, 114 insertions(+), 14 deletions(-)
>
>
That is, we can keep all the code of this patch, but:
> static __always_inline __must_check
> @@ -1609,7 +1695,8 @@ static int user_event_create(const char *raw_command)
>
> mutex_lock(&group->reg_mutex);
>
> - ret = user_event_parse_cmd(group, name, &user, 0);
> + /* Dyn events persist, otherwise they would cleanup immediately */
> + ret = user_event_parse_cmd(group, name, &user, USER_EVENT_REG_PERSIST);
>
> if (!ret)
> user_event_put(user, false);
> @@ -1843,8 +1930,13 @@ static int user_event_parse(struct user_event_group *group, char *name,
>
Add here:
if (reg_flags) {
/* Holding off implementing PERSIST events */
ret = -EINVAL;
goto put_user_lock;
}
Which also reminds me: we should return EINVAL if any flags that we do
not know about are set. Otherwise, when we do implement new flags, the
user will not know whether they are taking effect or not.
-- Steve
> user->reg_flags = reg_flags;
>
> - /* Ensure we track self ref and caller ref (2) */
> - refcount_set(&user->refcnt, 2);
> + if (user->reg_flags & USER_EVENT_REG_PERSIST) {
> + /* Ensure we track self ref and caller ref (2) */
> + refcount_set(&user->refcnt, 2);
> + } else {
> + /* Ensure we track only caller ref (1) */
> + refcount_set(&user->refcnt, 1);
> + }
>
> dyn_event_init(&user->devent, &user_event_dops);
> dyn_event_add(&user->devent, &user->call);
> @@ -2066,8 +2158,8 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
> if (ret)
> return ret;
>
> - /* Ensure no flags, since we don't support any yet */
> - if (kreg->flags != 0)
> + /* Ensure only valid flags */
> + if (kreg->flags & ~(USER_EVENT_REG_MAX-1))
> return -EINVAL;
>
> /* Ensure supported size */
On Thu, Jun 08, 2023 at 04:25:57PM -0400, Steven Rostedt wrote:
> On Mon, 5 Jun 2023 16:38:58 -0700
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
>
> > Currently user events need to be manually deleted via the delete IOCTL
> > call or via the dynamic_events file. Most operators and processes wish
> > to have these events auto cleanup when they are no longer used by
> > anything to prevent them piling without manual maintenance. However,
> > some operators may not want this, such as pre-registering events via the
> > dynamic_events tracefs file.
> >
> > Add a persist flag to user facing header and honor it within the
> > register IOCTL call. Add max flag as well to ensure that only known
> > flags can be used now and in the future. Update user_event_put() to
> > attempt an auto delete of the event if it's the last reference. The
> > auto delete must run in a work queue to ensure proper behavior of
> > class->reg() invocations that don't expect the call to go away from
> > underneath them during the unregister. Add work_struct to user_event
> > struct to ensure we can do this reliably.
> >
> > Link: https://lore.kernel.org/linux-trace-kernel/20230518093600.3f119d68@rorschach.local.home/
>
> Since there still seems to be some controversy over the persistent events,
> could we hold off on implementing them until next merge window? That is, I
> would like to only have the fd owned events for this release, which would
> give us time to hash out any of the issues with persistent events.
>
> If they are not in, then they are not an API, but once they are in, then we
> are stuck with them. I believe everyone is fine with the fd owned events,
> right?
>
I am fine with this approach. However, FD-owned events only means that
anyone using /sys/kernel/tracing/dynamic_events will have the event
deleted immediately.
Should we flat out issue an error instead of having it work but then be
removed immediately?
NOTE:
I'm waiting for the other user_event patches to land in the tracing
for-next since there will be conflicts and I want to make sure I get
good coverage with catching all the put/get refs.
> >
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
> > ---
> > include/uapi/linux/user_events.h | 10 ++-
> > kernel/trace/trace_events_user.c | 118 +++++++++++++++++++++++++++----
> > 2 files changed, 114 insertions(+), 14 deletions(-)
> >
> >
>
> That is we can keep all the code of this patch, but:
>
> > static __always_inline __must_check
> > @@ -1609,7 +1695,8 @@ static int user_event_create(const char *raw_command)
> >
> > mutex_lock(&group->reg_mutex);
> >
> > - ret = user_event_parse_cmd(group, name, &user, 0);
> > + /* Dyn events persist, otherwise they would cleanup immediately */
> > + ret = user_event_parse_cmd(group, name, &user, USER_EVENT_REG_PERSIST);
> >
> > if (!ret)
> > user_event_put(user, false);
> > @@ -1843,8 +1930,13 @@ static int user_event_parse(struct user_event_group *group, char *name,
> >
>
> Add here:
>
> if (reg_flags) {
> /* Holding off implementing PERSIST events */
> ret = -EINVAL;
> goto put_user_lock;
> }
>
> Which also reminds me. We should return EINVAL if any flags that we do not
> know about is set. Otherwise when we do implement new flags, the user will
> not know if they are taking effect or not.
>
We do that in user_reg_get(), but point taken; some flag could come
through the dynamic_events interface and then we'd miss it.
Thanks,
-Beau
> -- Steve
>
>
> > user->reg_flags = reg_flags;
> >
> > - /* Ensure we track self ref and caller ref (2) */
> > - refcount_set(&user->refcnt, 2);
> > + if (user->reg_flags & USER_EVENT_REG_PERSIST) {
> > + /* Ensure we track self ref and caller ref (2) */
> > + refcount_set(&user->refcnt, 2);
> > + } else {
> > + /* Ensure we track only caller ref (1) */
> > + refcount_set(&user->refcnt, 1);
> > + }
> >
> > dyn_event_init(&user->devent, &user_event_dops);
> > dyn_event_add(&user->devent, &user->call);
> > @@ -2066,8 +2158,8 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
> > if (ret)
> > return ret;
> >
> > - /* Ensure no flags, since we don't support any yet */
> > - if (kreg->flags != 0)
> > + /* Ensure only valid flags */
> > + if (kreg->flags & ~(USER_EVENT_REG_MAX-1))
> > return -EINVAL;
> >
> > /* Ensure supported size */
On Mon, 5 Jun 2023 16:38:58 -0700
Beau Belgrave <beaub@linux.microsoft.com> wrote:
Continuing on what Alexei was saying ...
> + /*
> + * Unfortunately we have to attempt the actual destroy in a work
> + * queue. This is because not all cases handle a trace_event_call
> + * being removed within the class->reg() operation for unregister.
> + */
> + INIT_WORK(&user->put_work, delayed_destroy_user_event);
We initialize the work here.
> +
> + /*
> + * Since the event is still in the hashtable, we have to re-inc
> + * the ref count to 1. This count will be decremented and checked
> + * in the work queue to ensure it's still the last ref. This is
> + * needed because a user-process could register the same event in
> + * between the time of event_mutex release and the work queue
> + * running the delayed destroy. If we removed the item now from
> + * the hashtable, this would result in a timing window where a
> + * user process would fail a register because the trace_event_call
> + * register would fail in the tracing layers.
> + */
> + refcount_set(&user->refcnt, 1);
> +
> + if (!schedule_work(&user->put_work)) {
From what I understand, schedule_work() can only fail if the work is
already queued. That should never happen if we just initialized it.
That means we need a WARN_ON_ONCE() here, because it's a major bug if it
does return false.
-- Steve
> + /*
> + * If we fail we must wait for an admin to attempt delete or
> + * another register/close of the event, whichever is first.
> + */
> + pr_warn("user_events: Unable to queue delayed destroy\n");
> + }
@@ -17,6 +17,14 @@
/* Create dynamic location entry within a 32-bit value */
#define DYN_LOC(offset, size) ((size) << 16 | (offset))
+enum user_reg_flag {
+ /* Event will not delete upon last reference closing */
+ USER_EVENT_REG_PERSIST = 1U << 0,
+
+ /* This value or above is currently non-ABI */
+ USER_EVENT_REG_MAX = 1U << 1,
+};
+
/*
* Describes an event registration and stores the results of the registration.
* This structure is passed to the DIAG_IOCSREG ioctl, callers at a minimum
@@ -33,7 +41,7 @@ struct user_reg {
/* Input: Enable size in bytes at address */
__u8 enable_size;
- /* Input: Flags for future use, set to 0 */
+ /* Input: Flags can be any of the above user_reg_flag values */
__u16 flags;
/* Input: Address to update when enabled */
@@ -85,6 +85,7 @@ struct user_event {
struct hlist_node node;
struct list_head fields;
struct list_head validators;
+ struct work_struct put_work;
refcount_t refcnt;
int min_size;
int reg_flags;
@@ -169,6 +170,7 @@ static int user_event_parse(struct user_event_group *group, char *name,
static struct user_event_mm *user_event_mm_get(struct user_event_mm *mm);
static struct user_event_mm *user_event_mm_get_all(struct user_event *user);
static void user_event_mm_put(struct user_event_mm *mm);
+static int destroy_user_event(struct user_event *user);
static u32 user_event_key(char *name)
{
@@ -182,19 +184,98 @@ static struct user_event *user_event_get(struct user_event *user)
return user;
}
+static void delayed_destroy_user_event(struct work_struct *work)
+{
+ struct user_event *user = container_of(
+ work, struct user_event, put_work);
+
+ mutex_lock(&event_mutex);
+
+ if (!refcount_dec_and_test(&user->refcnt))
+ goto out;
+
+ if (destroy_user_event(user)) {
+ /*
+ * The only reason this would fail here is if we cannot
+ * update the visibility of the event. In this case the
+ * event stays in the hashtable, waiting for someone to
+ * attempt to delete it later.
+ */
+ pr_warn("user_events: Unable to delete event\n");
+ refcount_set(&user->refcnt, 1);
+ }
+out:
+ mutex_unlock(&event_mutex);
+}
+
static void user_event_put(struct user_event *user, bool locked)
{
-#ifdef CONFIG_LOCKDEP
- if (locked)
- lockdep_assert_held(&event_mutex);
- else
- lockdep_assert_not_held(&event_mutex);
-#endif
+ bool delete;
if (unlikely(!user))
return;
- refcount_dec(&user->refcnt);
+ /*
+ * When the event is not enabled for auto-delete there will always
+ * be at least 1 reference to the event. During the event creation
+ * we initially set the refcnt to 2 to achieve this. In those cases
+ * the caller must acquire event_mutex and after decrement check if
+ * the refcnt is 1, meaning this is the last reference. When auto
+ * delete is enabled, there will only be 1 ref, IE: refcnt will be
+ * only set to 1 during creation to allow the below checks to go
+ * through upon the last put. The last put must always be done with
+ * the event mutex held.
+ */
+ if (!locked) {
+ lockdep_assert_not_held(&event_mutex);
+ delete = refcount_dec_and_mutex_lock(&user->refcnt, &event_mutex);
+ } else {
+ lockdep_assert_held(&event_mutex);
+ delete = refcount_dec_and_test(&user->refcnt);
+ }
+
+ if (!delete)
+ return;
+
+ /* We now have the event_mutex in all cases */
+
+ if (user->reg_flags & USER_EVENT_REG_PERSIST) {
+ /* We should not get here when persist flag is set */
+ pr_alert("BUG: Auto-delete engaged on persistent event\n");
+ goto out;
+ }
+
+ /*
+ * Unfortunately we have to attempt the actual destroy in a work
+ * queue. This is because not all cases handle a trace_event_call
+ * being removed within the class->reg() operation for unregister.
+ */
+ INIT_WORK(&user->put_work, delayed_destroy_user_event);
+
+ /*
+ * Since the event is still in the hashtable, we have to re-inc
+ * the ref count to 1. This count will be decremented and checked
+ * in the work queue to ensure it's still the last ref. This is
+ * needed because a user-process could register the same event in
+ * between the time of event_mutex release and the work queue
+ * running the delayed destroy. If we removed the item now from
+ * the hashtable, this would result in a timing window where a
+ * user process would fail a register because the trace_event_call
+ * register would fail in the tracing layers.
+ */
+ refcount_set(&user->refcnt, 1);
+
+ if (!schedule_work(&user->put_work)) {
+ /*
+ * If we fail we must wait for an admin to attempt delete or
+ * another register/close of the event, whichever is first.
+ */
+ pr_warn("user_events: Unable to queue delayed destroy\n");
+ }
+out:
+ /* Ensure if we didn't have event_mutex before we unlock it */
+ if (!locked)
+ mutex_unlock(&event_mutex);
}
static void user_event_group_destroy(struct user_event_group *group)
@@ -793,7 +874,12 @@ static struct user_event_enabler
static __always_inline __must_check
bool user_event_last_ref(struct user_event *user)
{
- return refcount_read(&user->refcnt) == 1;
+ int last = 0;
+
+ if (user->reg_flags & USER_EVENT_REG_PERSIST)
+ last = 1;
+
+ return refcount_read(&user->refcnt) == last;
}
static __always_inline __must_check
@@ -1609,7 +1695,8 @@ static int user_event_create(const char *raw_command)
mutex_lock(&group->reg_mutex);
- ret = user_event_parse_cmd(group, name, &user, 0);
+ /* Dyn events persist, otherwise they would cleanup immediately */
+ ret = user_event_parse_cmd(group, name, &user, USER_EVENT_REG_PERSIST);
if (!ret)
user_event_put(user, false);
@@ -1843,8 +1930,13 @@ static int user_event_parse(struct user_event_group *group, char *name,
user->reg_flags = reg_flags;
- /* Ensure we track self ref and caller ref (2) */
- refcount_set(&user->refcnt, 2);
+ if (user->reg_flags & USER_EVENT_REG_PERSIST) {
+ /* Ensure we track self ref and caller ref (2) */
+ refcount_set(&user->refcnt, 2);
+ } else {
+ /* Ensure we track only caller ref (1) */
+ refcount_set(&user->refcnt, 1);
+ }
dyn_event_init(&user->devent, &user_event_dops);
dyn_event_add(&user->devent, &user->call);
@@ -2066,8 +2158,8 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
if (ret)
return ret;
- /* Ensure no flags, since we don't support any yet */
- if (kreg->flags != 0)
+ /* Ensure only valid flags */
+ if (kreg->flags & ~(USER_EVENT_REG_MAX-1))
return -EINVAL;
/* Ensure supported size */