From: Jens Axboe
To: io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: tglx@linutronix.de, mingo@redhat.com, peterz@infradead.org
Subject: [PATCHSET 0/7] Add io_uring futex/futexv support
Date: Tue, 11 Jul 2023 18:46:58 -0600
Message-Id: <20230712004705.316157-1-axboe@kernel.dk>

Hi,

This patchset first adds support for futex wake and wait, and then for
futexv. Patches 1..2 are just prep patches, patch 3 adds the wait and
wake support for io_uring, and then patches 4..6 are again prep patches
to end up with futexv support in patch 7.

For all of wait/wake/waitv, we support the bitset variant, as the
"normal" variants can be easily implemented on top of that.

PI and requeue are not supported through io_uring, just the above
mentioned parts. This may change in the future, but in the spirit of
keeping this small (and based on what people have been asking for),
this is what we currently have.

When I did these patches, I forgot that Pavel had previously posted a
futex variant for io_uring. The major thing that had been holding me
back when people asked about futexes and io_uring is that I wanted to
do this in what I consider the right way - no usage of io-wq or thread
offload, but an actually async implementation that is efficient to use
and doesn't rely on a blocking thread for futex wait/waitv. This is
what this patchset attempts to do, while being minimally invasive on
the futex side. I believe the diffstat reflects that.

As far as I can recall, the first request for futex support with
io_uring came from Andres Freund, working on postgres. His aio rework
of postgres was one of the early adopters of io_uring, and futex
support was a natural extension for that. This is relevant both from a
usability point of view and for efficiency and performance. In
Andres's words, for the former:

"Futex wait support in io_uring makes it a lot easier to avoid
deadlocks in concurrent programs that have their own buffer pool:
Obviously pages in the application buffer pool have to be locked during
IO. If the initiator of IO A needs to wait for a held lock B, the
holder of lock B might wait for the IO A to complete. The ability to
wait for a lock and IO completions at the same time provides an
efficient way to avoid such deadlocks."

and in terms of efficiency, even without unlocking the full potential
yet, Andres says:

"Futex wake support in io_uring is useful because it allows for more
efficient directed wakeups. For some "locks" postgres has queues
implemented in userspace, with wakeup logic that cannot easily be
implemented with FUTEX_WAKE_BITSET on a single "futex word" (imagine
waiting for journal flushes to have completed up to a certain point).
Thus a "lock release" sometimes needs to wake up many processes in a
row. A quick-and-dirty conversion to doing these wakeups via io_uring
led to a 3% throughput increase, with 12% fewer context switches,
albeit in a fairly extreme workload."

Some basic io_uring futex support and test cases, testing all of the
variants, are available in the liburing 'futex' branch:

https://git.kernel.dk/cgit/liburing/log/?h=futex

I originally wrote this code about a month ago and Andres has been
using it with postgres, and I'm not aware of any bugs in it. That's not
to say it's perfect, obviously, and I welcome some feedback so we can
move this forward and hash out any potential issues.

 include/linux/io_uring_types.h |   3 +
 include/uapi/linux/io_uring.h  |   4 +
 io_uring/Makefile              |   4 +-
 io_uring/cancel.c              |   5 +
 io_uring/cancel.h              |   4 +
 io_uring/futex.c               | 377 +++++++++++++++++++++++++++++++++
 io_uring/futex.h               |  36 ++++
 io_uring/io_uring.c            |   5 +
 io_uring/opdef.c               |  35 ++-
 kernel/futex/futex.h           |  30 +++
 kernel/futex/requeue.c         |   3 +-
 kernel/futex/syscalls.c        |  25 ++-
 kernel/futex/waitwake.c        |  19 +-
 13 files changed, 525 insertions(+), 25 deletions(-)

You can also find the code here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-futex
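
To illustrate the remark above that the "normal" wait and wake variants
can be implemented on top of the bitset ones, here is a minimal
userspace sketch. It uses the plain futex(2) syscall, not the io_uring
opcodes added by this series, and the helper names (futex_wait_any,
futex_wake_any, futex_bitset_demo) are made up for the example. The
assumption is simply that a bitset of FUTEX_BITSET_MATCH_ANY matches
every waiter, which is what futex(2) documents.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: a plain wait expressed as FUTEX_WAIT_BITSET with
 * an all-ones mask. No timeout is passed; note that the bitset variant
 * takes an absolute timeout, where FUTEX_WAIT takes a relative one.
 */
static long futex_wait_any(uint32_t *uaddr, uint32_t expected)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG,
                   expected, NULL, NULL, FUTEX_BITSET_MATCH_ANY);
}

/*
 * Hypothetical helper: a plain wake expressed as FUTEX_WAKE_BITSET with
 * an all-ones mask, waking at most nr waiters.
 */
static long futex_wake_any(uint32_t *uaddr, int nr)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET | FUTEX_PRIVATE_FLAG,
                   nr, NULL, NULL, FUTEX_BITSET_MATCH_ANY);
}

/* Exercise both helpers in the cases that need no second thread. */
static void futex_bitset_demo(void)
{
    uint32_t word = 1;

    /* The word holds 1 but we pass 0 as the expected value, so the
     * kernel refuses to sleep and the call fails with EAGAIN. */
    errno = 0;
    assert(futex_wait_any(&word, 0) == -1);
    assert(errno == EAGAIN);

    /* Nobody is waiting on the word, so zero waiters are woken. */
    assert(futex_wake_any(&word, INT_MAX) == 0);
}
```

A caller that wants selective wakeups would pass a narrower mask to both
sides instead of FUTEX_BITSET_MATCH_ANY; the two helpers above are just
the special case where every bit is set.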