From patchwork Thu Oct 27 22:40:10 2022
X-Patchwork-Submitter: Beau Belgrave
X-Patchwork-Id: 11999
From: Beau Belgrave
To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com, dcook@linux.microsoft.com, alanau@linux.microsoft.com
Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement
Date: Thu, 27 Oct 2022 15:40:10 -0700
Message-Id: <20221027224011.2075-2-beaub@linux.microsoft.com>
In-Reply-To: <20221027224011.2075-1-beaub@linux.microsoft.com>
References: <20221027224011.2075-1-beaub@linux.microsoft.com>

As part of the discussions around aligning user_events with user-space tracers, it was determined that user programs should register a 32-bit value in which a bit is set or cleared when an event becomes enabled or disabled. Currently a shared page is used, which requires mmap(). Remove the shared-page implementation and move to a user-registered address. In this new model, user programs specify two new values at event registration time. The first is the address to update when the event is either enabled or disabled; the sketch below shows the full registration flow.
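The following is a minimal user-space sketch of registering against the new model (it also uses the second registration value, the enable bit, described in the paragraph that follows). The struct layout mirrors the struct user_reg change below; the DIAG_IOCSREG encoding, the tracefs path, the event name "test_event u32 value", and the choice of bit 31 are illustrative assumptions, not requirements of this patch:

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Mirrors the struct user_reg layout introduced by this patch */
struct user_reg {
	uint32_t size;			/* Input: sizeof(struct user_reg) */
	uint32_t enable_bit : 5,	/* Input: bit to set/clear (0-31) */
		 __reserved : 27;
	uint64_t enable_addr;		/* Input: address the kernel updates */
	uint64_t name_args;		/* Input: event name/args string */
	uint32_t write_index;		/* Output: index used when writing data */
} __attribute__((__packed__));

/* Assumed to match the existing user_events register ioctl encoding */
#define DIAG_IOCSREG _IOWR('*', 0, struct user_reg *)

/* 32-bit aligned word the kernel updates remotely; bit 31 is left to the
 * kernel tracer, the remaining bits stay free for user tracers/ref counts */
static uint32_t enabled;

int main(void)
{
	struct user_reg reg = {0};
	int fd = open("/sys/kernel/tracing/user_events_data", O_RDWR);

	if (fd < 0)
		return 1;

	reg.size = sizeof(reg);
	reg.enable_bit = 31;
	reg.enable_addr = (uint64_t)(uintptr_t)&enabled;
	reg.name_args = (uint64_t)(uintptr_t)"test_event u32 value";

	if (ioctl(fd, DIAG_IOCSREG, &reg) < 0) {
		close(fd);
		return 1;
	}

	/* Cheap enabled check before building any event payload */
	if (__atomic_load_n(&enabled, __ATOMIC_RELAXED) & (1U << 31))
		printf("test_event enabled, write_index %u\n", reg.write_index);

	close(fd);
	return 0;
}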
The second is the bit to set/clear to reflect the event being enabled. This allows for a local 32-bit value in user programs to support both kernel and user tracers. As an example, setting bit 31 for kernel tracers when the event becomes enabled allows for user tracers to use the other bits for ref counts or other flags. The kernel side updates the bit atomically, user programs need to also update these values atomically. User provided addresses must be aligned on a 32-bit boundary, this allows for single page checking and prevents odd behaviors such as a 32-bit value straddling 2 pages instead of a single page. Currently page faults are only logged, future patches will handle these. NOTE: User programs that wish to have the enable bit shared across forks either need to use a MAP_SHARED allocated address or register a new address and file descriptor. If MAP_SHARED cannot be used or new registrations cannot be done, then it's allowable to use MAP_PRIVATE as long as the forked children never update the page themselves. Once the page has been updated, the page from the parent will be copied over to the child. This new copy-on-write page will not receive updates from the kernel until another registration has been performed with this new address. Suggested-by: Mathieu Desnoyers Signed-off-by: Beau Belgrave --- include/linux/user_events.h | 10 +- kernel/trace/trace_events_user.c | 279 ++++++++++++++++--------------- 2 files changed, 153 insertions(+), 136 deletions(-) diff --git a/include/linux/user_events.h b/include/linux/user_events.h index 592a3fbed98e..4c3bd16395a9 100644 --- a/include/linux/user_events.h +++ b/include/linux/user_events.h @@ -33,12 +33,16 @@ struct user_reg { /* Input: Size of the user_reg structure being used */ __u32 size; + /* Input: Flags/common settings */ + __u32 enable_bit : 5, /* Bit in enable address to use (0-31) */ + __reserved : 27; + + /* Input: Address to update when enabled */ + __u64 enable_addr; + /* Input: Pointer to string with event name, description and flags */ __u64 name_args; - /* Output: Bitwise index of the event within the status page */ - __u32 status_bit; - /* Output: Index of the event to use when writing data */ __u32 write_index; } __attribute__((__packed__)); diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index ae78c2d53c8a..633f24c2a1ac 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -19,6 +19,9 @@ #include #include #include +#include +#include +#include /* Reminder to move to uapi when everything works */ #ifdef CONFIG_COMPILE_TEST #include @@ -34,34 +37,11 @@ #define FIELD_DEPTH_NAME 1 #define FIELD_DEPTH_SIZE 2 -/* - * Limits how many trace_event calls user processes can create: - * Must be a power of two of PAGE_SIZE. - */ -#define MAX_PAGE_ORDER 0 -#define MAX_PAGES (1 << MAX_PAGE_ORDER) -#define MAX_BYTES (MAX_PAGES * PAGE_SIZE) -#define MAX_EVENTS (MAX_BYTES * 8) - /* Limit how long of an event name plus args within the subsystem. */ #define MAX_EVENT_DESC 512 #define EVENT_NAME(user_event) ((user_event)->tracepoint.name) #define MAX_FIELD_ARRAY_SIZE 1024 -/* - * The MAP_STATUS_* macros are used for taking a index and determining the - * appropriate byte and the bit in the byte to set/reset for an event. - * - * The lower 3 bits of the index decide which bit to set. - * The remaining upper bits of the index decide which byte to use for the bit. 
- * - * This is used when an event has a probe attached/removed to reflect live - * status of the event wanting tracing or not to user-programs via shared - * memory maps. - */ -#define MAP_STATUS_BYTE(index) ((index) >> 3) -#define MAP_STATUS_MASK(index) BIT((index) & 7) - /* * Internal bits (kernel side only) to keep track of connected probes: * These are used when status is requested in text form about an event. These @@ -75,25 +55,37 @@ #define EVENT_STATUS_OTHER BIT(7) /* - * Stores the pages, tables, and locks for a group of events. - * Each logical grouping of events has its own group, with a - * matching page for status checks within user programs. This - * allows for isolation of events to user programs by various - * means. + * Stores the system name, tables, and locks for a group of events. This + * allows isolation for events by various means. */ struct user_event_group { - struct page *pages; - char *register_page_data; char *system_name; struct hlist_node node; struct mutex reg_mutex; DECLARE_HASHTABLE(register_table, 8); - DECLARE_BITMAP(page_bitmap, MAX_EVENTS); }; /* Group for init_user_ns mapping, top-most group */ static struct user_event_group *init_group; +/* + * Describes where to change a bit when an event becomes + * enabled/disabled. These are chained together to enable + * many processes being notified when an event changes. These + * have a lifetime tied to the data files that are used to + * register them. When these go away the ref count to the mm_struct + * is decremented to ensure mm_struct lifetime last as long as + * required for the enable bit set/clear. + */ +struct user_event_enabler { + struct list_head link; + struct mm_struct *mm; + struct file *file; + unsigned long enable_addr; + unsigned int enable_bit: 5, + __reserved: 27; +}; + /* * Stores per-event properties, as users register events * within a file a user_event might be created if it does not @@ -110,8 +102,8 @@ struct user_event { struct hlist_node node; struct list_head fields; struct list_head validators; + struct list_head enablers; refcount_t refcnt; - int index; int flags; int min_size; char status; @@ -155,28 +147,8 @@ static u32 user_event_key(char *name) return jhash(name, strlen(name), 0); } -static void set_page_reservations(char *pages, bool set) -{ - int page; - - for (page = 0; page < MAX_PAGES; ++page) { - void *addr = pages + (PAGE_SIZE * page); - - if (set) - SetPageReserved(virt_to_page(addr)); - else - ClearPageReserved(virt_to_page(addr)); - } -} - static void user_event_group_destroy(struct user_event_group *group) { - if (group->register_page_data) - set_page_reservations(group->register_page_data, false); - - if (group->pages) - __free_pages(group->pages, MAX_PAGE_ORDER); - kfree(group->system_name); kfree(group); } @@ -247,19 +219,6 @@ static struct user_event_group if (!group->system_name) goto error; - group->pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PAGE_ORDER); - - if (!group->pages) - goto error; - - group->register_page_data = page_address(group->pages); - - set_page_reservations(group->register_page_data, true); - - /* Zero all bits beside 0 (which is reserved for failures) */ - bitmap_zero(group->page_bitmap, MAX_EVENTS); - set_bit(0, group->page_bitmap); - mutex_init(&group->reg_mutex); hash_init(group->register_table); @@ -271,20 +230,107 @@ static struct user_event_group return NULL; }; -static __always_inline -void user_event_register_set(struct user_event *user) +static void user_event_enabler_destroy(struct user_event_enabler *enabler) { - int i = 
user->index; + mmdrop(enabler->mm); + kfree(enabler); +} + +static void user_event_enabler_remove(struct file *file, + struct user_event *user) +{ + struct user_event_enabler *enabler, *next; + struct list_head *head = &user->enablers; + + /* Prevent racing with status changes and new events */ + mutex_lock(&event_mutex); + + list_for_each_entry_safe(enabler, next, head, link) { + if (enabler->file != file) + continue; + + list_del(&enabler->link); + user_event_enabler_destroy(enabler); + } + + mutex_unlock(&event_mutex); +} + +static void user_event_enabler_write(struct user_event_enabler *enabler, + struct user_event *user) +{ + struct mm_struct *mm = enabler->mm; + unsigned long uaddr = enabler->enable_addr; + unsigned long *ptr; + struct page *page; + void *kaddr; + int ret; + + mmap_read_lock(mm); + + ret = pin_user_pages_remote(mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT, + &page, NULL, NULL); + + mmap_read_unlock(mm); + + if (ret <= 0) { + pr_warn("user_events: Enable write failed\n"); + return; + } + + kaddr = kmap_local_page(page); + ptr = kaddr + (uaddr & ~PAGE_MASK); + + if (user->status) + set_bit(enabler->enable_bit, ptr); + else + clear_bit(enabler->enable_bit, ptr); - user->group->register_page_data[MAP_STATUS_BYTE(i)] |= MAP_STATUS_MASK(i); + kunmap_local(kaddr); + unpin_user_pages_dirty_lock(&page, 1, true); } -static __always_inline -void user_event_register_clear(struct user_event *user) +static void user_event_enabler_update(struct user_event *user) { - int i = user->index; + struct list_head *head = &user->enablers; + struct user_event_enabler *enabler; - user->group->register_page_data[MAP_STATUS_BYTE(i)] &= ~MAP_STATUS_MASK(i); + list_for_each_entry(enabler, head, link) + user_event_enabler_write(enabler, user); +} + +static struct user_event_enabler +*user_event_enabler_create(struct file *file, struct user_reg *reg, + struct user_event *user) +{ + struct user_event_enabler *enabler; + + enabler = kzalloc(sizeof(*enabler), GFP_KERNEL); + + if (!enabler) + return NULL; + + /* + * This is grabbed for accounting purposes. This is to ensure if a + * process exits before the file is released a valid memory descriptor + * will exist for the enabler. + */ + mmgrab(current->mm); + + enabler->mm = current->mm; + enabler->file = file; + enabler->enable_addr = (unsigned long)reg->enable_addr; + enabler->enable_bit = reg->enable_bit; + + /* Prevents state changes from racing with new enablers */ + mutex_lock(&event_mutex); + + list_add(&enabler->link, &user->enablers); + user_event_enabler_write(enabler, user); + + mutex_unlock(&event_mutex); + + return enabler; } static __always_inline __must_check @@ -829,9 +875,6 @@ static int destroy_user_event(struct user_event *user) return ret; dyn_event_remove(&user->devent); - - user_event_register_clear(user); - clear_bit(user->index, user->group->page_bitmap); hash_del(&user->node); user_event_destroy_validators(user); @@ -977,9 +1020,9 @@ static void user_event_perf(struct user_event *user, struct iov_iter *i, #endif /* - * Update the register page that is shared between user processes. + * Update the enabled bit among all user processes. 
*/ -static void update_reg_page_for(struct user_event *user) +static void update_enable_bit_for(struct user_event *user) { struct tracepoint *tp = &user->tracepoint; char status = 0; @@ -1010,12 +1053,9 @@ static void update_reg_page_for(struct user_event *user) rcu_read_unlock_sched(); } - if (status) - user_event_register_set(user); - else - user_event_register_clear(user); - user->status = status; + + user_event_enabler_update(user); } /* @@ -1072,10 +1112,10 @@ static int user_event_reg(struct trace_event_call *call, return ret; inc: refcount_inc(&user->refcnt); - update_reg_page_for(user); + update_enable_bit_for(user); return 0; dec: - update_reg_page_for(user); + update_enable_bit_for(user); refcount_dec(&user->refcnt); return 0; } @@ -1269,7 +1309,6 @@ static int user_event_parse(struct user_event_group *group, char *name, struct user_event **newuser) { int ret; - int index; u32 key; struct user_event *user; @@ -1288,11 +1327,6 @@ static int user_event_parse(struct user_event_group *group, char *name, return 0; } - index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS); - - if (index == MAX_EVENTS) - return -EMFILE; - user = kzalloc(sizeof(*user), GFP_KERNEL); if (!user) @@ -1301,6 +1335,7 @@ static int user_event_parse(struct user_event_group *group, char *name, INIT_LIST_HEAD(&user->class.fields); INIT_LIST_HEAD(&user->fields); INIT_LIST_HEAD(&user->validators); + INIT_LIST_HEAD(&user->enablers); user->group = group; user->tracepoint.name = name; @@ -1338,14 +1373,11 @@ static int user_event_parse(struct user_event_group *group, char *name, if (ret) goto put_user_lock; - user->index = index; - /* Ensure we track self ref and caller ref (2) */ refcount_set(&user->refcnt, 2); dyn_event_init(&user->devent, &user_event_dops); dyn_event_add(&user->devent, &user->call); - set_bit(user->index, group->page_bitmap); hash_add(group->register_table, &user->node, key); mutex_unlock(&event_mutex); @@ -1561,6 +1593,14 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) if (ret) return ret; + /* Ensure natural alignment and sanity check on max bit */ + if (kreg->enable_addr % sizeof(__u32) || kreg->enable_bit > 31) + return -EINVAL; + + /* Ensure accessible */ + if (!access_ok((const void __user *)kreg->enable_addr, sizeof(__u32))) + return -EFAULT; + kreg->size = size; return 0; @@ -1570,11 +1610,12 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) * Registers a user_event on behalf of a user process. 
*/ static long user_events_ioctl_reg(struct user_event_file_info *info, - unsigned long uarg) + struct file *file, unsigned long uarg) { struct user_reg __user *ureg = (struct user_reg __user *)uarg; struct user_reg reg; struct user_event *user; + struct user_event_enabler *enabler; char *name; long ret; @@ -1607,8 +1648,12 @@ static long user_events_ioctl_reg(struct user_event_file_info *info, if (ret < 0) return ret; + enabler = user_event_enabler_create(file, ®, user); + + if (!enabler) + return -ENOMEM; + put_user((u32)ret, &ureg->write_index); - put_user(user->index, &ureg->status_bit); return 0; } @@ -1651,7 +1696,7 @@ static long user_events_ioctl(struct file *file, unsigned int cmd, switch (cmd) { case DIAG_IOCSREG: mutex_lock(&group->reg_mutex); - ret = user_events_ioctl_reg(info, uarg); + ret = user_events_ioctl_reg(info, file, uarg); mutex_unlock(&group->reg_mutex); break; @@ -1700,8 +1745,10 @@ static int user_events_release(struct inode *node, struct file *file) for (i = 0; i < refs->count; ++i) { user = refs->events[i]; - if (user) + if (user) { + user_event_enabler_remove(file, user); refcount_dec(&user->refcnt); + } } out: file->private_data = NULL; @@ -1722,38 +1769,6 @@ static const struct file_operations user_data_fops = { .release = user_events_release, }; -static struct user_event_group *user_status_group(struct file *file) -{ - struct seq_file *m = file->private_data; - - if (!m) - return NULL; - - return m->private; -} - -/* - * Maps the shared page into the user process for checking if event is enabled. - */ -static int user_status_mmap(struct file *file, struct vm_area_struct *vma) -{ - char *pages; - struct user_event_group *group = user_status_group(file); - unsigned long size = vma->vm_end - vma->vm_start; - - if (size != MAX_BYTES) - return -EINVAL; - - if (!group) - return -EINVAL; - - pages = group->register_page_data; - - return remap_pfn_range(vma, vma->vm_start, - virt_to_phys(pages) >> PAGE_SHIFT, - size, vm_get_page_prot(VM_READ)); -} - static void *user_seq_start(struct seq_file *m, loff_t *pos) { if (*pos) @@ -1788,7 +1803,7 @@ static int user_seq_show(struct seq_file *m, void *p) status = user->status; flags = user->flags; - seq_printf(m, "%d:%s", user->index, EVENT_NAME(user)); + seq_printf(m, "%s", EVENT_NAME(user)); if (flags != 0 || status != 0) seq_puts(m, " #"); @@ -1813,7 +1828,6 @@ static int user_seq_show(struct seq_file *m, void *p) seq_puts(m, "\n"); seq_printf(m, "Active: %d\n", active); seq_printf(m, "Busy: %d\n", busy); - seq_printf(m, "Max: %ld\n", MAX_EVENTS); return 0; } @@ -1849,7 +1863,6 @@ static int user_status_open(struct inode *node, struct file *file) static const struct file_operations user_status_fops = { .open = user_status_open, - .mmap = user_status_mmap, .read = seq_read, .llseek = seq_lseek, .release = seq_release, From patchwork Thu Oct 27 22:40:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Beau Belgrave X-Patchwork-Id: 12001 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp501830wru; Thu, 27 Oct 2022 16:08:19 -0700 (PDT) X-Google-Smtp-Source: AMsMyM4wmGsaISqn3sqV4RZ+96tyX1jtXvuolNt4o1XQGH1TVwgTb9PAXAWwh9zHUKDH6JVgUaZx X-Received: by 2002:a17:902:9308:b0:182:b2ba:755 with SMTP id bc8-20020a170902930800b00182b2ba0755mr51982055plb.107.1666912099339; Thu, 27 Oct 2022 16:08:19 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666912099; cv=none; d=google.com; s=arc-20160816; 
From: Beau Belgrave
To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com, dcook@linux.microsoft.com, alanau@linux.microsoft.com
Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly
Date: Thu, 27 Oct 2022
15:40:11 -0700 Message-Id: <20221027224011.2075-3-beaub@linux.microsoft.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20221027224011.2075-1-beaub@linux.microsoft.com> References: <20221027224011.2075-1-beaub@linux.microsoft.com> MIME-Version: 1.0 X-Spam-Status: No, score=-19.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_MED, SPF_HELO_PASS,SPF_PASS,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1747884021544064698?= X-GMAIL-MSGID: =?utf-8?q?1747884021544064698?= When events are enabled within the various tracing facilities, such as ftrace/perf, the event_mutex is held. As events are enabled pages are accessed. We do not want page faults to occur under this lock. Instead queue the fault to a workqueue to be handled in a process context safe way without the lock. The enable address is disabled while the async fault-in occurs. This ensures that we don't attempt fault-in more than is necessary. Once the page has been faulted in, the address write is attempted again. If the page couldn't fault-in, then we wait until the next time the event is enabled to prevent any potential infinite loops. Signed-off-by: Beau Belgrave --- kernel/trace/trace_events_user.c | 125 ++++++++++++++++++++++++++++++- 1 file changed, 121 insertions(+), 4 deletions(-) diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index 633f24c2a1ac..f1eb8101e053 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -81,11 +81,22 @@ struct user_event_enabler { struct list_head link; struct mm_struct *mm; struct file *file; + refcount_t refcnt; unsigned long enable_addr; unsigned int enable_bit: 5, - __reserved: 27; + __reserved: 26, + disabled: 1; }; +/* Used for asynchronous faulting in of pages */ +struct user_event_enabler_fault { + struct work_struct work; + struct user_event_enabler *enabler; + struct user_event *event; +}; + +static struct kmem_cache *fault_cache; + /* * Stores per-event properties, as users register events * within a file a user_event might be created if it does not @@ -236,6 +247,19 @@ static void user_event_enabler_destroy(struct user_event_enabler *enabler) kfree(enabler); } +static __always_inline struct user_event_enabler +*user_event_enabler_get(struct user_event_enabler *enabler) +{ + refcount_inc(&enabler->refcnt); + return enabler; +} + +static void user_event_enabler_put(struct user_event_enabler *enabler) +{ + if (refcount_dec_and_test(&enabler->refcnt)) + user_event_enabler_destroy(enabler); +} + static void user_event_enabler_remove(struct file *file, struct user_event *user) { @@ -249,13 +273,93 @@ static void user_event_enabler_remove(struct file *file, if (enabler->file != file) continue; + enabler->disabled = 0; list_del(&enabler->link); - user_event_enabler_destroy(enabler); + user_event_enabler_put(enabler); } mutex_unlock(&event_mutex); } +static void user_event_enabler_write(struct user_event_enabler *enabler, + struct user_event *user); + +static void user_event_enabler_fault_fixup(struct work_struct *work) +{ + struct user_event_enabler_fault *fault = container_of( + work, struct user_event_enabler_fault, work); + struct user_event_enabler *enabler = fault->enabler; + struct user_event *user = 
fault->event; + struct mm_struct *mm = enabler->mm; + unsigned long uaddr = enabler->enable_addr; + bool unlocked = false; + int ret; + + might_sleep(); + + mmap_read_lock(mm); + + ret = fixup_user_fault(mm, uaddr, FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, + &unlocked); + + mmap_read_unlock(mm); + + if (ret) + pr_warn("user_events: Fixup fault failed with %d " + "for mm: 0x%pK offset: 0x%llx event: %s\n", ret, mm, + (unsigned long long)uaddr, EVENT_NAME(user)); + + /* Prevent state changes from racing */ + mutex_lock(&event_mutex); + + /* + * If we managed to get the page, re-issue the write. We do not + * want to get into a possible infinite loop, which is why we only + * attempt again directly if the page came in. If we couldn't get + * the page here, then we will try again the next time the event is + * enabled/disabled. + */ + enabler->disabled = 0; + + if (!ret) + user_event_enabler_write(enabler, user); + + mutex_unlock(&event_mutex); + + refcount_dec(&user->refcnt); + user_event_enabler_put(enabler); + kmem_cache_free(fault_cache, fault); +} + +static bool user_event_enabler_queue_fault(struct user_event_enabler *enabler, + struct user_event *user) +{ + struct user_event_enabler_fault *fault; + + fault = kmem_cache_zalloc(fault_cache, GFP_NOWAIT | __GFP_NOWARN); + + if (!fault) + return false; + + INIT_WORK(&fault->work, user_event_enabler_fault_fixup); + fault->enabler = user_event_enabler_get(enabler); + fault->event = user; + + refcount_inc(&user->refcnt); + enabler->disabled = 1; + + if (!schedule_work(&fault->work)) { + enabler->disabled = 0; + refcount_dec(&user->refcnt); + user_event_enabler_put(enabler); + kmem_cache_free(fault_cache, fault); + + return false; + } + + return true; +} + static void user_event_enabler_write(struct user_event_enabler *enabler, struct user_event *user) { @@ -266,6 +370,11 @@ static void user_event_enabler_write(struct user_event_enabler *enabler, void *kaddr; int ret; + lockdep_assert_held(&event_mutex); + + if (unlikely(enabler->disabled)) + return; + mmap_read_lock(mm); ret = pin_user_pages_remote(mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT, @@ -273,8 +382,10 @@ static void user_event_enabler_write(struct user_event_enabler *enabler, mmap_read_unlock(mm); - if (ret <= 0) { - pr_warn("user_events: Enable write failed\n"); + if (unlikely(ret <= 0)) { + if (!user_event_enabler_queue_fault(enabler, user)) + pr_warn("user_events: Unable to queue fault handler\n"); + return; } @@ -321,6 +432,7 @@ static struct user_event_enabler enabler->file = file; enabler->enable_addr = (unsigned long)reg->enable_addr; enabler->enable_bit = reg->enable_bit; + refcount_set(&enabler->refcnt, 1); /* Prevents state changes from racing with new enablers */ mutex_lock(&event_mutex); @@ -1902,6 +2014,11 @@ static int __init trace_events_user_init(void) { int ret; + fault_cache = KMEM_CACHE(user_event_enabler_fault, 0); + + if (!fault_cache) + return -ENOMEM; + init_group = user_event_group_create(&init_user_ns); if (!init_group)