From patchwork Thu Feb 15 19:14:02 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 201716
From: Mathieu Desnoyers
To: Dmitry Vyukov
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Peter Oskolkov,
 Peter Zijlstra, "Paul E. McKenney", Boqun Feng, Chris Kennelly,
 Andrew Morton, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, Dave Hansen, "H. Peter Anvin", linux-mm@kvack.org
Subject: [RFC PATCH 1/1] sched/rseq: Consider rseq abort in page fault handler
Date: Thu, 15 Feb 2024 14:14:02 -0500
Message-Id: <20240215191402.681674-1-mathieu.desnoyers@efficios.com>

Consider rseq abort before emitting the SIGSEGV or SIGBUS signals from
the page fault handler.

This allows using membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ to
abort rseq critical sections that access memory whose mapping can be
munmap'd or mprotect'd after the membarrier "rseq fence", without
causing SIGSEGV or SIGBUS when the page fault handler, triggered by a
faulting memory access within a rseq critical section, is preempted
before it handles the page fault.

The problematic scenario is:

CPU 0                          CPU 1
------------------------------------------------------------------
                               old_p = P
                               P = NULL
- rseq c.s. begins
- x = P
- if (x != NULL)
  - v = *x
    - page fault
    - preempted
                               membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ)
                               munmap(old_p) (or mprotect(old_p))
- handle page fault
- force_sig_fault(SIGSEGV)
- rseq resume notifier
  - move IP to abort IP
-> SIGSEGV handler runs.

This is solved by postponing force_sig_fault() until return to
user-space when the page fault handler detects that pending rseq events
will cause the thread to call the rseq resume notifier before going back
to user-space. The rseq resume notifier can then load the user-space
memory pointed to by rseq->rseq_cs and compare the IP with the rseq c.s.
range before either moving the IP to the abort handler or calling
force_sig_fault() with the parameters previously saved by the page fault
handler.

Add a new AT_RSEQ_FEATURE_FLAGS getauxval(3) entry to allow user-space
to query whether the kernel implements this behavior
(flag: RSEQ_FEATURE_PAGE_FAULT_ABORT).

Untested implementation submitted for early feedback.

Only x86 is implemented in this PoC.

Link: https://lore.kernel.org/lkml/CACT4Y+bXfekygoyhO7pCctjnL15=E=Zs31BUGXU0dk8d4rc1Cw@mail.gmail.com/
Signed-off-by: Mathieu Desnoyers
Cc: Dmitry Vyukov
Cc: Peter Oskolkov
Cc: Peter Zijlstra
Cc: "Paul E. McKenney"
Cc: Boqun Feng
Cc: Chris Kennelly
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: "H. Peter Anvin"
Cc: linux-mm@kvack.org
---
 arch/x86/mm/fault.c          |  4 ++--
 fs/binfmt_elf.c              |  1 +
 include/linux/sched.h        | 16 ++++++++++++++++
 include/linux/sched/signal.h | 24 ++++++++++++++++++++++++
 include/uapi/linux/auxvec.h  |  1 +
 include/uapi/linux/rseq.h    |  7 +++++++
 kernel/rseq.c                | 36 +++++++++++++++++++++++++++++++-----
 7 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 679b09cfe241..42ac39680cb6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -854,7 +854,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	if (si_code == SEGV_PKUERR)
 		force_sig_pkuerr((void __user *)address, pkey);
 	else
-		force_sig_fault(SIGSEGV, si_code, (void __user *)address);
+		rseq_lazy_force_sig_fault(SIGSEGV, si_code, (void __user *)address);
 
 	local_irq_disable();
 }
@@ -973,7 +973,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 		return;
 	}
 #endif
-	force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
+	rseq_lazy_force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
 }
 
 static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 5397b552fbeb..8fece0911c7d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -273,6 +273,7 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 #ifdef CONFIG_RSEQ
 	NEW_AUX_ENT(AT_RSEQ_FEATURE_SIZE, offsetof(struct rseq, end));
 	NEW_AUX_ENT(AT_RSEQ_ALIGN, __alignof__(struct rseq));
+	NEW_AUX_ENT(AT_RSEQ_FEATURE_FLAGS, RSEQ_FEATURE_FLAGS);
 #endif
 #undef NEW_AUX_ENT
 	/* AT_NULL is zero; clear the rest too */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..39aa585ba2a3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -743,6 +743,15 @@ struct kmap_ctrl {
 #endif
 };
 
+#ifdef CONFIG_RSEQ
+struct rseq_lazy_sig {
+	bool pending;
+	int sig;
+	int code;
+	void __user *addr;
+};
+#endif
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1317,6 +1326,7 @@ struct task_struct {
 	 * with respect to preemption.
 	 */
 	unsigned long rseq_event_mask;
+	struct rseq_lazy_sig rseq_lazy_sig;
 #endif
 
 #ifdef CONFIG_SCHED_MM_CID
@@ -2330,6 +2340,8 @@ unsigned long sched_cpu_util(int cpu);
 
 #ifdef CONFIG_RSEQ
 
+#define RSEQ_FEATURE_FLAGS	RSEQ_FEATURE_PAGE_FAULT_ABORT
+
 /*
  * Map the event mask on the user-space ABI enum rseq_cs_flags
  * for direct mask checks.
@@ -2390,6 +2402,8 @@ static inline void rseq_migrate(struct task_struct *t)
  */
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 {
+	WARN_ON_ONCE(current->rseq_lazy_sig.pending);
+
 	if (clone_flags & CLONE_VM) {
 		t->rseq = NULL;
 		t->rseq_len = 0;
@@ -2405,6 +2419,8 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 
 static inline void rseq_execve(struct task_struct *t)
 {
+	WARN_ON_ONCE(current->rseq_lazy_sig.pending);
+
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3499c1a8b929..0d75dfde2f9b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -781,4 +781,28 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_RSEQ
+
+static inline int rseq_lazy_force_sig_fault(int sig, int code, void __user *addr)
+{
+	struct task_struct *t = current;
+
+	if (!t->rseq_event_mask)
+		return force_sig_fault(sig, code, addr);
+	t->rseq_lazy_sig.pending = true;
+	t->rseq_lazy_sig.sig = sig;
+	t->rseq_lazy_sig.code = code;
+	t->rseq_lazy_sig.addr = addr;
+	return 0;
+}
+
+#else
+
+static inline int rseq_lazy_force_sig_fault(int sig, int code, void __user *addr)
+{
+	return force_sig_fault(sig, code, addr);
+}
+
+#endif
+
 #endif /* _LINUX_SCHED_SIGNAL_H */
diff --git a/include/uapi/linux/auxvec.h b/include/uapi/linux/auxvec.h
index 6991c4b8ab18..5044f367a219 100644
--- a/include/uapi/linux/auxvec.h
+++ b/include/uapi/linux/auxvec.h
@@ -32,6 +32,7 @@
 #define AT_HWCAP2 26	/* extension of AT_HWCAP */
 #define AT_RSEQ_FEATURE_SIZE	27	/* rseq supported feature size */
 #define AT_RSEQ_ALIGN		28	/* rseq allocation alignment */
+#define AT_RSEQ_FEATURE_FLAGS	29	/* rseq feature flags */
 
 #define AT_EXECFN  31	/* filename of program */
 
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..0fdb192e3cd3 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,13 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+/*
+ * rseq feature flags. Query with getauxval(AT_RSEQ_FEATURE_FLAGS).
+ */
+enum rseq_feature_flags {
+	RSEQ_FEATURE_PAGE_FAULT_ABORT = (1U << 0),
+};
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..f686a97abb45 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -271,6 +271,25 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
 	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
 }
 
+static void rseq_clear_lazy_sig_fault(struct task_struct *t)
+{
+	if (!t->rseq_lazy_sig.pending)
+		return;
+	t->rseq_lazy_sig.pending = false;
+	t->rseq_lazy_sig.sig = 0;
+	t->rseq_lazy_sig.code = 0;
+	t->rseq_lazy_sig.addr = NULL;
+}
+
+static void rseq_force_lazy_sig_fault(struct task_struct *t)
+{
+	if (!t->rseq_lazy_sig.pending)
+		return;
+	force_sig_fault(t->rseq_lazy_sig.sig, t->rseq_lazy_sig.code,
+			t->rseq_lazy_sig.addr);
+	rseq_clear_lazy_sig_fault(t);
+}
+
 static int rseq_ip_fixup(struct pt_regs *regs)
 {
 	unsigned long ip = instruction_pointer(regs);
@@ -280,25 +299,32 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 
 	ret = rseq_get_rseq_cs(t, &rseq_cs);
 	if (ret)
-		return ret;
+		goto nofixup;
 
 	/*
 	 * Handle potentially not being within a critical section.
 	 * If not nested over a rseq critical section, restart is useless.
 	 * Clear the rseq_cs pointer and return.
 	 */
-	if (!in_rseq_cs(ip, &rseq_cs))
-		return clear_rseq_cs(t);
+	if (!in_rseq_cs(ip, &rseq_cs)) {
+		ret = clear_rseq_cs(t);
+		goto nofixup;
+	}
 	ret = rseq_need_restart(t, rseq_cs.flags);
 	if (ret <= 0)
-		return ret;
+		goto nofixup;
 	ret = clear_rseq_cs(t);
 	if (ret)
-		return ret;
+		goto nofixup;
+	rseq_clear_lazy_sig_fault(t);
 	trace_rseq_ip_fixup(ip, rseq_cs.start_ip,
 			    rseq_cs.post_commit_offset, rseq_cs.abort_ip);
 	instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
 	return 0;
+
+nofixup:
+	rseq_force_lazy_sig_fault(t);
+	return ret;
 }
 
 /*
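
For user-space that wants to rely on this behavior, the feature query
described in the commit message could look roughly like the sketch
below. It assumes a kernel carrying this RFC: the auxv type 29 and the
flag bit (1 << 0) are copied from the uapi hunks above and are not part
of any released header, so they are redefined locally under EXAMPLE_
names. Both helper names are hypothetical. On a kernel without this
patch, auxv type 29 may be absent or may carry an unrelated meaning, so
the result should only be trusted where the patch is known to be
applied.

```c
#include <stdbool.h>
#include <sys/auxv.h>

/*
 * Assumption: these values mirror the additions proposed in this RFC to
 * include/uapi/linux/auxvec.h and include/uapi/linux/rseq.h. They are
 * defined locally because no released header provides them.
 */
#define EXAMPLE_AT_RSEQ_FEATURE_FLAGS		29
#define EXAMPLE_RSEQ_FEATURE_PAGE_FAULT_ABORT	(1UL << 0)

/* Hypothetical helper: decode an AT_RSEQ_FEATURE_FLAGS value. */
static inline bool rseq_flags_have_page_fault_abort(unsigned long flags)
{
	return (flags & EXAMPLE_RSEQ_FEATURE_PAGE_FAULT_ABORT) != 0;
}

/*
 * Hypothetical helper: query the running kernel. getauxval(3) returns 0
 * when the requested entry is not present, which decodes here as
 * "feature not available".
 */
static inline bool rseq_page_fault_abort_supported(void)
{
	return rseq_flags_have_page_fault_abort(
			getauxval(EXAMPLE_AT_RSEQ_FEATURE_FLAGS));
}
```

A caller would typically evaluate rseq_page_fault_abort_supported() once
at startup and, when it returns false, avoid relying on the membarrier
"rseq fence" alone to protect a later munmap() or mprotect() of memory
that rseq critical sections may still dereference.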