From: Mathieu Desnoyers
To: Alejandro Colomar
Cc: linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
    "linux-man @ vger . kernel . org", Peter Zijlstra,
    "Paul E . McKenney", Boqun Feng, Mathieu Desnoyers
Subject: [PATCH v2] rseq.2: New man page for the rseq(2) API
Date: Wed, 15 Feb 2023 14:08:58 -0500
Message-Id: <20230215190858.958935-1-mathieu.desnoyers@efficios.com>

Signed-off-by: Mathieu Desnoyers
---
 man2/rseq.2 | 461 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 461 insertions(+)
 create mode 100644 man2/rseq.2

diff --git a/man2/rseq.2 b/man2/rseq.2
new file mode 100644
index 000000000..1a7e4a893
--- /dev/null
+++ b/man2/rseq.2
@@ -0,0 +1,461 @@
+.\" Copyright 2015-2023 Mathieu Desnoyers
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH rseq 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+rseq \- restartable sequences system call
+.SH LIBRARY
+Standard C library
+.RI ( libc ", " \-lc )
+.SH SYNOPSIS
+.nf
+.PP
+.BR "#include <linux/rseq.h>" "  /* Definition of " RSEQ_* " constants */"
+.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
+.B #include <unistd.h>
+.PP
+.BI "int syscall(SYS_rseq, struct rseq *" rseq ", uint32_t " rseq_len ,
+.BI "            int " flags ", uint32_t " sig );
+.fi
+.PP
+.IR Note :
+glibc provides no wrapper for
+.BR rseq (),
+necessitating the use of
+.BR syscall (2).
+.SH DESCRIPTION
+The
+.BR rseq ()
+ABI accelerates specific user-space operations by registering a
+per-thread data structure shared between kernel and user-space.
+This data structure can be read from or written to by user-space to skip
+otherwise expensive system calls.
+.PP
+A restartable sequence is a sequence of instructions
+guaranteed to be executed atomically with respect to
+other threads and signal handlers on the current CPU.
+If its execution does not complete atomically,
+the kernel changes the execution flow by jumping to an abort handler
+defined by user-space for that restartable sequence.
+.PP
+Using restartable sequences requires registering a
+.BR rseq ()
+ABI per-thread data structure
+.RI ( "struct rseq" )
+through the
+.BR rseq ()
+system call.
+Only one
+.BR rseq ()
+ABI can be registered per thread,
+so user-space libraries and applications must follow a user-space ABI
+defining how to share this resource.
+The ABI for sharing this resource between applications and
+libraries is defined by the C library.
+Allocation of the per-thread
+.BR rseq ()
+ABI and its registration with the kernel are handled by glibc since version
+2.35.
+.PP
+The
+.BR rseq ()
+ABI per-thread data structure contains a
+.I rseq_cs
+field which points to the currently executing critical section.
+For each thread, a single rseq critical section can run at any given
+point.
+Each critical section needs to be implemented in assembly.
+.PP
+The
+.BR rseq ()
+ABI accelerates user-space operations on per-cpu data by defining a
+shared data structure ABI between each user-space thread and the kernel.
+.PP
+It allows user-space to perform update operations on per-cpu data
+without requiring heavy-weight atomic operations.
+.PP
+The term CPU used in this documentation refers to a hardware execution
+context.
+For instance, each CPU number returned by
+.BR sched_getcpu (3)
+is a CPU.
+The current CPU refers to the CPU on which the registered thread is
+running.
+.PP
+Restartable sequences are atomic with respect to preemption (making them
+atomic with respect to other threads running on the same CPU),
+as well as signal delivery (user-space execution contexts nested over
+the same thread).
+They either complete atomically with respect to preemption on the
+current CPU and signal delivery, or they are aborted.
+.PP
+Restartable sequences are suited for update operations on per-cpu data.
+.PP
+Restartable sequences can be used on data structures shared between threads
+within a process,
+and on data structures shared between threads across different
+processes.
+.PP
+Some examples of operations that can be accelerated or improved by this ABI:
+.IP \(bu 3
+Memory allocator per-cpu free-lists,
+.IP \(bu 3
+Querying the current CPU number,
+.IP \(bu 3
+Incrementing per-CPU counters,
+.IP \(bu 3
+Modifying data protected by per-CPU spinlocks,
+.IP \(bu 3
+Inserting/removing elements in per-CPU linked-lists,
+.IP \(bu 3
+Writing/reading per-CPU ring buffer content,
+.IP \(bu 3
+Accurately reading performance monitoring unit counters with respect to
+thread migration.
+.PP
+Restartable sequences must not perform system calls.
+Doing so may result in termination of the process by a segmentation
+fault.
+.PP
+The
+.I rseq
+argument is a pointer to the thread-local
+.I struct rseq
+to be shared between kernel and user-space.
+.PP
+The structure
+.I struct rseq
+is an extensible structure.
+Additional feature fields can be added in future kernel versions.
+Its layout is as follows:
+.TP
+.B Structure alignment
+This structure is aligned either on a 32-byte boundary,
+or on the alignment value returned by
+.BR getauxval (3)
+invoked with
+.B AT_RSEQ_ALIGN
+if the structure size differs from 32 bytes.
+.TP
+.B Structure size
+This structure size needs to be at least 32 bytes.
+It can be either 32 bytes,
+or it needs to be large enough to hold the result of
+.BR getauxval (3)
+invoked with
+.BR AT_RSEQ_FEATURE_SIZE .
+Its size is passed as a parameter to the
+.BR rseq ()
+system call.
+.in +4n
+.IP
+.EX
+#include <linux/rseq.h>
+
+struct rseq {
+    __u32 cpu_id_start;
+    __u32 cpu_id;
+    union {
+        /* ... */
+    } rseq_cs;
+    __u32 flags;
+    __u32 node_id;
+    __u32 mm_cid;
+} __attribute__((aligned(32)));
+.EE
+.in
+.TP
+.B Fields
+.RS
+.TP
+.I cpu_id_start
+Always-updated value of the CPU number on which the registered thread is
+running.
+Initialized by user-space to 0,
+updated by the kernel for threads registered with
+.BR rseq ().
+Its value is 0 when
+.BR rseq ()
+is not registered.
+Its value should always be confirmed by reading the
+.I cpu_id
+field before user-space performs any side-effect
+(e.g. storing to memory).
+.IP
+Because it is initialized to 0,
+this field can be loaded by user-space and
+used to index per-cpu data structures
+without having to check whether its value is within valid bounds.
+.IP
+For user-space applications executed on a kernel without
+.BR rseq ()
+support,
+the cpu_id_start field stays initialized at 0.
+It is therefore valid to use it as an offset in per-cpu data structures,
+and only validate whether it's actually the current CPU number by
+comparing it with the cpu_id field within the rseq critical section.
+If the kernel does not provide
+.BR rseq ()
+support, that cpu_id field stays initialized at \-1,
+so the comparison always fails, as intended.
+.IP
+This field should only be read by the thread which registered this data
+structure.
+Aligned on 32-bit.
+.IP
+It is up to user-space to implement a fall-back mechanism for scenarios where
+.BR rseq ()
+is not available.
+.TP
+.I cpu_id
+Always-updated value of the CPU number on which the registered thread is
+running.
+Initialized by user-space to \-1,
+updated by the kernel for threads registered with
+.BR rseq ().
+.IP
+This field should only be read by the thread which registered this data
+structure.
+Aligned on 32-bit.
+.TP
+.I rseq_cs
+The rseq_cs field is a pointer to a
+.IR "struct rseq_cs" .
+It is NULL when no rseq assembly block critical section is active for
+the registered thread.
+Setting it to point to a critical section descriptor
+.RI ( "struct rseq_cs" )
+marks the beginning of the critical section.
+.IP
+Initialized by user-space to NULL.
+.IP
+Updated by user-space, which sets the address of the currently
+active rseq_cs at the beginning of the assembly instruction sequence
+block,
+and set to NULL by the kernel when it restarts an assembly instruction
+sequence block,
+as well as when the kernel detects that it is preempting or delivering a
+signal outside of the range targeted by the rseq_cs.
+Also needs to be set to NULL by user-space before reclaiming memory that
+contains the targeted
+.IR "struct rseq_cs" .
+.IP
+Read and set by the kernel.
+.IP
+This field should only be updated by the thread which registered this
+data structure.
+Aligned on 64-bit.
+.TP
+.I flags
+Flags indicating the restart behavior for the registered thread.
+This is mainly used for debugging purposes.
+Can be a combination of:
+.RS
+.TP
+.B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+Inhibit instruction sequence block restart on preemption for this
+thread.
+This flag is deprecated since Linux 6.1.
+.TP
+.B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+Inhibit instruction sequence block restart on signal delivery for this
+thread.
+This flag is deprecated since Linux 6.1.
+.TP
+.B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+Inhibit instruction sequence block restart on migration for this thread.
+This flag is deprecated since Linux 6.1.
+.RE
+.IP
+Initialized by user-space, used by the kernel.
+.TP
+.I node_id
+Always-updated value of the current NUMA node ID.
+.IP
+Initialized by user-space to 0.
+.IP
+Updated by the kernel.
+Read by user-space with single-copy atomicity semantics.
+This field should only be read by the thread which registered
+this data structure.
+Aligned on 32-bit.
+.TP
+.I mm_cid
+Contains the current thread's concurrency ID
+(allocated uniquely within a memory map).
+.IP
+Updated by the kernel.
+Read by user-space with single-copy atomicity semantics.
+This field should only be read by the thread which registered this data
+structure.
+Aligned on 32-bit.
+.IP
+This concurrency ID is within the range of possible CPUs,
+and is temporarily (and uniquely) assigned while threads are actively
+running within a memory map.
+If a memory map has fewer threads than cores,
+or is limited to running on a few cores concurrently through scheduler affinity or
+cgroup cpusets,
+the concurrency IDs will be values close to 0,
+thus allowing efficient use of user-space memory for per-cpu data
+structures.
+.RE
+.PP
+The layout of
+.I struct rseq_cs
+version 0 is as follows:
+.TP
+.B Structure alignment
+This structure is aligned on a 32-byte boundary.
+.TP
+.B Structure size
+This structure has a fixed size of 32 bytes.
+.in +4n
+.IP
+.EX
+#include <linux/rseq.h>
+
+struct rseq_cs {
+    __u32 version;
+    __u32 flags;
+    __u64 start_ip;
+    __u64 post_commit_offset;
+    __u64 abort_ip;
+} __attribute__((aligned(32)));
+.EE
+.in
+.TP
+.B Fields
+.RS
+.TP
+.I version
+Version of this structure.
+Should be initialized to 0.
+.TP
+.I flags
+Flags indicating the restart behavior of this structure.
+Can be a combination of:
+.RS
+.TP
+.B RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+Inhibit instruction sequence block restart on preemption for this
+critical section.
+This flag is deprecated since Linux 6.1.
+.TP
+.B RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+Inhibit instruction sequence block restart on signal delivery for this
+critical section.
+This flag is deprecated since Linux 6.1.
+.TP
+.B RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+Inhibit instruction sequence block restart on migration for this
+critical section.
+This flag is deprecated since Linux 6.1.
+.RE
+.TP
+.I start_ip
+Instruction pointer address of the first instruction of the sequence of
+consecutive assembly instructions.
+.TP
+.I post_commit_offset
+Offset (from start_ip address) of the address after the last instruction
+of the sequence of consecutive assembly instructions.
+.TP
+.I abort_ip
+Instruction pointer address to which the execution flow is moved if the
+sequence of consecutive assembly instructions is aborted.
+.RE
+.PP
+The
+.I rseq_len
+argument is the size of the
+.I struct rseq
+to register.
+.PP
+The
+.I flags
+argument is 0 for registration, and
+.B RSEQ_FLAG_UNREGISTER
+for unregistration.
+.PP
+The
+.I sig
+argument is the 32-bit signature to be expected before the abort
+handler code.
+.PP
+A single library per process should keep the
+.I struct rseq
+in a per-thread data structure.
+The
+.I cpu_id
+field should be initialized to \-1, and the
+.I cpu_id_start
+field should be initialized to a possible CPU value (typically 0).
+.PP
+Each thread is responsible for registering and unregistering its
+.IR "struct rseq" .
+No more than one
+.I struct rseq
+address can be registered per thread at a given time.
+.PP
+The memory of a
+.I struct rseq
+object must only be reclaimed after either an explicit rseq
+unregistration is performed or after the thread exits.
+.PP
+In a typical usage scenario, the thread registering the
+.I struct rseq
+will be performing loads and stores from/to that structure.
+It is however also allowed to read that structure from other threads.
+The
+.I struct rseq
+field updates performed by the kernel provide relaxed atomicity
+semantics (atomic store, without memory ordering),
+which guarantee that other threads performing relaxed atomic reads
+(atomic load, without memory ordering) of the CPU number fields will
+always observe a consistent value.
+.SH RETURN VALUE
+A return value of 0 indicates success.
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EINVAL
+Either
+.I flags
+contains an invalid value, or
+.I rseq
+contains an address which is not appropriately aligned, or
+.I rseq_len
+contains an incorrect size.
+.TP
+.B ENOSYS
+The
+.BR rseq ()
+system call is not implemented by this kernel.
+.TP
+.B EFAULT
+.I rseq
+is an invalid address.
+.TP
+.B EBUSY
+A restartable sequence is already registered for this thread.
+.TP
+.B EPERM
+The
+.I sig
+argument on unregistration does not match the signature received
+on registration.
+.SH VERSIONS
+The
+.BR rseq ()
+system call was added in Linux 4.18.
+.SH STANDARDS
+.BR rseq ()
+is Linux-specific.
+.SH SEE ALSO
+.BR membarrier (2),
+.BR getauxval (3),
+.BR sched_getcpu (3)
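
The registration steps the page describes (initialize cpu_id to -1 and cpu_id_start to 0, then call rseq() with flags 0 and a signature, and later unregister with RSEQ_FLAG_UNREGISTER and the same signature) could be sketched in C roughly as follows. This is only an illustration and not part of the patch. The names struct rseq_area, example_area, EXAMPLE_RSEQ_SIG, EXAMPLE_RSEQ_FLAG_UNREGISTER and the rseq_area_*() helpers are hypothetical. The structure is a hand-written mirror of the 32-byte layout rather than the definition from <linux/rseq.h>, the signature is an arbitrary value chosen for the example, and the unregister flag value is assumed to match the kernel header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical mirror of the 32-byte struct rseq layout described in the
 * page; real code would normally use the definition from <linux/rseq.h>. */
struct rseq_area {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;       /* pointer to the active struct rseq_cs, or 0 */
    uint32_t flags;
    uint32_t node_id;
    uint32_t mm_cid;
} __attribute__((aligned(32)));

_Static_assert(sizeof(struct rseq_area) == 32, "area must be 32 bytes");

/* Assumptions made for this example only. */
#define EXAMPLE_RSEQ_SIG             0x53053053U /* arbitrary signature */
#define EXAMPLE_RSEQ_FLAG_UNREGISTER 1           /* assumed value of RSEQ_FLAG_UNREGISTER */

/* One area per thread, as the page requires. */
__thread struct rseq_area example_area;

/* Put the area in the state the page asks for before registration:
 * cpu_id_start = 0, cpu_id = -1, rseq_cs = NULL, flags = 0. */
void rseq_area_init(struct rseq_area *area)
{
    assert((uintptr_t)area % 32 == 0);
    memset(area, 0, sizeof(*area));
    area->cpu_id = (uint32_t)-1;
}

/* Issue the rseq system call with the given flags.
 * Returns 0 on success or a negative errno value on failure. */
static int rseq_area_call(struct rseq_area *area, int flags)
{
#ifdef SYS_rseq
    if (syscall(SYS_rseq, area, (uint32_t)sizeof(*area), flags,
                EXAMPLE_RSEQ_SIG) == 0)
        return 0;
    return -errno;
#else
    (void)area;
    (void)flags;
    return -ENOSYS;         /* headers too old to know the syscall number */
#endif
}

/* Register the calling thread's area (flags = 0). */
int rseq_area_register(struct rseq_area *area)
{
    return rseq_area_call(area, 0);
}

/* Unregister it again, passing the same signature. */
int rseq_area_unregister(struct rseq_area *area)
{
    return rseq_area_call(area, EXAMPLE_RSEQ_FLAG_UNREGISTER);
}
```

A caller would typically run rseq_area_init(&example_area) once per thread and treat a negative return from rseq_area_register() as the cue to use a fall-back mechanism, since the call can fail, for example when the kernel lacks rseq() support or when the C library has already registered its own area for the thread.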