From patchwork Thu Nov 3 08:50:01 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 14763
From: Ming Lei
To: Jens Axboe, io-uring@vger.kernel.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Miklos Szeredi, Stefan Hajnoczi,
 ZiyangZhang, Ming Lei
Subject: [RFC PATCH 1/4] io_uring/splice: support do_splice_direct
Date: Thu, 3 Nov 2022 16:50:01 +0800
Message-Id: <20221103085004.1029763-2-ming.lei@redhat.com>
In-Reply-To: <20221103085004.1029763-1-ming.lei@redhat.com>
References: <20221103085004.1029763-1-ming.lei@redhat.com>

do_splice_direct() has at least two advantages:

1) The extra pipe isn't required from the user's viewpoint, so userspace
code can be simplified. It also makes it easier to relax the current pipe
limit, since current->splice_pipe is used for direct splice.

2) In some situations, it isn't good to expose file data via
->splice_read() to userspace. One example is the upcoming zero copy support
in the ublk driver: request pages are spliced to a pipe to support zero
copy, and for a READ request userspace may then read data from kernel
pages. Direct splice avoids this kind of information leak.

Signed-off-by: Ming Lei
---
 fs/read_write.c        |  5 +++--
 include/linux/splice.h |  3 +++
 io_uring/splice.c      | 13 ++++++++++---
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 328ce8cf9a85..98869d15e884 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1253,7 +1253,7 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 			goto fput_out;
 		file_start_write(out.file);
 		retval = do_splice_direct(in.file, &pos, out.file, &out_pos,
-					  count, fl);
+					  count, fl | SPLICE_F_DIRECT);
 		file_end_write(out.file);
 	} else {
 		if (out.file->f_flags & O_NONBLOCK)
@@ -1389,7 +1389,8 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 				size_t len, unsigned int flags)
 {
 	return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
-				len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+				len > MAX_RW_COUNT ? MAX_RW_COUNT : len,
+				SPLICE_F_DIRECT);
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..9121624ad198 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,9 @@
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/* used for io_uring interface only */
+#define SPLICE_F_DIRECT	(0x10)  /* direct splice and user needn't provide pipe */
+
 /*
  * Passed to the actors
  */
diff --git a/io_uring/splice.c b/io_uring/splice.c
index 53e4232d0866..c11ea4cd1c7e 100644
--- a/io_uring/splice.c
+++ b/io_uring/splice.c
@@ -27,7 +27,8 @@ static int __io_splice_prep(struct io_kiocb *req,
 			    const struct io_uring_sqe *sqe)
 {
 	struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice);
-	unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL;
+	unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL |
+		SPLICE_F_DIRECT;
 
 	sp->len = READ_ONCE(sqe->len);
 	sp->flags = READ_ONCE(sqe->splice_flags);
@@ -109,8 +110,14 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
 	poff_in = (sp->off_in == -1) ? NULL : &sp->off_in;
 	poff_out = (sp->off_out == -1) ? NULL : &sp->off_out;
 
-	if (sp->len)
-		ret = do_splice(in, poff_in, out, poff_out, sp->len, flags);
+	if (sp->len) {
+		if (flags & SPLICE_F_DIRECT)
+			ret = do_splice_direct(in, poff_in, out, poff_out,
+					sp->len, flags);
+		else
+			ret = do_splice(in, poff_in, out, poff_out, sp->len,
+					flags);
+	}
 
 	if (!(sp->flags & SPLICE_F_FD_IN_FIXED))
 		io_put_file(in);

From patchwork Thu Nov 3 08:50:02 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 14761
From: Ming Lei
To: Jens Axboe, io-uring@vger.kernel.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Miklos Szeredi, Stefan Hajnoczi,
 ZiyangZhang, Ming Lei
Subject: [RFC PATCH 2/4] fs/splice: add helper of splice_dismiss_pipe()
Date: Thu, 3 Nov 2022 16:50:02 +0800
Message-Id: <20221103085004.1029763-3-ming.lei@redhat.com>
In-Reply-To: <20221103085004.1029763-1-ming.lei@redhat.com>
References: <20221103085004.1029763-1-ming.lei@redhat.com>

Add the helper splice_dismiss_pipe(), which cleans up
iter_file_splice_write() a bit.

The helper will be reused in the following patch to support consuming
pipe buffers via ->splice_read().

Signed-off-by: Ming Lei
---
 fs/splice.c | 47 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 17 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 0878b852b355..f8999ffe6215 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -282,6 +282,34 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
+/* return if wakeup is needed */
+static bool splice_dismiss_pipe(struct pipe_inode_info *pipe, ssize_t bytes)
+{
+	unsigned int mask = pipe->ring_size - 1;
+	unsigned int tail = pipe->tail;
+	bool need_wakeup = false;
+
+	while (bytes) {
+		struct pipe_buffer *buf = &pipe->bufs[tail & mask];
+
+		if (bytes >= buf->len) {
+			bytes -= buf->len;
+			buf->len = 0;
+			pipe_buf_release(pipe, buf);
+			tail++;
+			pipe->tail = tail;
+			if (pipe->files)
+				need_wakeup = true;
+		} else {
+			buf->offset += bytes;
+			buf->len -= bytes;
+			bytes = 0;
+		}
+	}
+
+	return need_wakeup;
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -692,23 +720,8 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		*ppos = sd.pos;
 
 		/* dismiss the fully eaten buffers, adjust the partial one */
-		tail = pipe->tail;
-		while (ret) {
-			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
-			if (ret >= buf->len) {
-				ret -= buf->len;
-				buf->len = 0;
-				pipe_buf_release(pipe, buf);
-				tail++;
-				pipe->tail = tail;
-				if (pipe->files)
-					sd.need_wakeup = true;
-			} else {
-				buf->offset += ret;
-				buf->len -= ret;
-				ret = 0;
-			}
-		}
+		if (splice_dismiss_pipe(pipe, ret))
+			sd.need_wakeup = true;
 	}
 done:
 	kfree(array);

From patchwork Thu Nov 3 08:50:03 2022
Content-Type: text/plain;
 charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 14762
From: Ming Lei
To: Jens Axboe, io-uring@vger.kernel.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Miklos Szeredi, Stefan Hajnoczi,
 ZiyangZhang, Ming Lei
Subject: [RFC PATCH 3/4] io_uring/splice: support splice from ->splice_read to ->splice_read
Date: Thu, 3 Nov 2022 16:50:03 +0800
Message-Id: <20221103085004.1029763-4-ming.lei@redhat.com>
In-Reply-To: <20221103085004.1029763-1-ming.lei@redhat.com>
References: <20221103085004.1029763-1-ming.lei@redhat.com>

The first ->splice_read() produces buffers into the pipe at
current->splice_pipe, and the second ->splice_read() consumes the buffers
in this pipe.

This helps to support zero copy of read requests for ublk and fuse.

Signed-off-by: Ming Lei
---
 fs/splice.c               | 146 ++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h        |   2 +
 include/linux/pipe_fs_i.h |   9 +++
 include/linux/splice.h    |  11 +++
 io_uring/splice.c         |  13 ++--
 5 files changed, 169 insertions(+), 12 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index f8999ffe6215..cd5255f9ff13 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -330,17 +330,70 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	struct iov_iter to;
 	struct kiocb kiocb;
 	int ret;
+	struct bio_vec *array = NULL;
+	bool consumer;
+
+	/*
+	 * So far, in->fops->splice_read() has to make sure the following
+	 * simple page use model works.
+	 *
+	 * pipe->consumed_by_read is set by the in end of the pipe
+	 */
+	if ((flags & SPLICE_F_READ_TO_READ) && pipe->consumed_by_read) {
+		unsigned int head, tail, mask;
+		int nbufs = pipe->max_usage;
+		size_t left = len;
+		int n;
+
+		if (WARN_ON_ONCE(!(flags & SPLICE_F_DIRECT)))
+			return -EINVAL;
+
+		head = pipe->head;
+		tail = pipe->tail;
+		mask = pipe->ring_size - 1;
+
+		array = kcalloc(nbufs, sizeof(struct bio_vec), GFP_KERNEL);
+		if (!array)
+			return -ENOMEM;
+
+		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs;
+				tail++) {
+			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
+			size_t this_len = buf->len;
+
+			/* zero-length bvecs are not supported, skip them */
+			if (!this_len)
+				continue;
+			this_len = min(this_len, left);
+
+			array[n].bv_page = buf->page;
+			array[n].bv_len = this_len;
+			array[n].bv_offset = buf->offset;
+			left -= this_len;
+			n++;
+		}
+
+		consumer = true;
+		iov_iter_bvec(&to, READ, array, n, len - left);
+	} else {
+		/* !consumer means one pipe buf producer */
+		consumer = false;
+		iov_iter_pipe(&to, READ, pipe, len);
+	}
 
-	iov_iter_pipe(&to, READ, pipe, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 	if (ret > 0) {
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+
+		if (consumer)
+			splice_dismiss_pipe(pipe, ret);
 	} else if (ret < 0) {
 		/* free what was emitted */
-		pipe_discard_from(pipe, to.start_head);
+		if (!consumer)
+			pipe_discard_from(pipe, to.start_head);
 		/*
 		 * callers of ->splice_read() expect -EAGAIN on
 		 * "can't put anything in there", rather than -EFAULT.
@@ -349,6 +402,11 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 			ret = -EAGAIN;
 	}
 
+	if (consumer) {
+		kfree(array);
+		pipe->consumed_by_read = false;
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_splice_read);
@@ -782,7 +840,7 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
  */
 static long do_splice_to(struct file *in, loff_t *ppos,
 			 struct pipe_inode_info *pipe, size_t len,
-			 unsigned int flags)
+			 unsigned int flags, bool consumer)
 {
 	unsigned int p_space;
 	int ret;
@@ -790,8 +848,12 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 	if (unlikely(!(in->f_mode & FMODE_READ)))
 		return -EBADF;
 
-	/* Don't try to read more the pipe has space for. */
-	p_space = pipe->max_usage - pipe_occupancy(pipe->head, pipe->tail);
+	if (consumer)	/* read is consumer */
+		p_space = pipe_occupancy(pipe->head, pipe->tail);
+	else
+		/* Don't try to read more the pipe has space for. */
+		p_space = pipe->max_usage - pipe_occupancy(pipe->head,
+				pipe->tail);
 	len = min_t(size_t, len, p_space << PAGE_SHIFT);
 
 	ret = rw_verify_area(READ, in, ppos, len);
@@ -875,7 +937,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
 
-		ret = do_splice_to(in, &pos, pipe, len, flags);
+		ret = do_splice_to(in, &pos, pipe, len, flags, false);
 		if (unlikely(ret <= 0))
 			goto out_release;
 
@@ -992,6 +1054,76 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(do_splice_direct);
 
+static int direct_splice_read_consumer_actor(struct pipe_inode_info *pipe,
+					      struct splice_desc *sd)
+{
+	struct file *file = sd->u.file;
+
+	/* Pipe in side has to notify us by ->consumed_by_read */
+	if (!pipe->consumed_by_read)
+		return -EINVAL;
+
+	return do_splice_to(file, sd->opos, pipe, sd->total_len,
+			sd->flags, true);
+}
+
+/**
+ * do_splice_direct_read_consumer - splices data directly with producer/
+ *	consumer model
+ * @in:		file to splice from
+ * @ppos:	input file offset
+ * @out:	file to splice to
+ * @opos:	output file offset
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags, SPLICE_F_READ_TO_READ is required
+ *
+ * Description:
+ *    For use by ublk or fuse to implement zero copy for READ request, and
+ *    splice directly over internal pipe from device to file, and device's
+ *    ->splice_read() produces pipe buffers, and file's ->splice_read()
+ *    consumes the buffers.
+ *
+ */
+long do_splice_direct_read_consumer(struct file *in, loff_t *ppos,
+				    struct file *out, loff_t *opos,
+				    size_t len, unsigned int flags)
+{
+	struct splice_desc sd = {
+		.len		= len,
+		.total_len	= len,
+		.flags		= flags,
+		.pos		= *ppos,
+		.u.file		= out,
+		.opos		= opos,
+	};
+	long ret;
+
+	if (!(flags & (SPLICE_F_DIRECT | SPLICE_F_READ_TO_READ)))
+		return -EINVAL;
+
+	if (unlikely(!(out->f_mode & FMODE_READ)))
+		return -EBADF;
+
+	/*
+	 * So far we just support F_READ_TO_READ if it is one plain
+	 * file which ->splice_read points to generic_file_splice_read
+	 */
+	if (out->f_op->splice_read != generic_file_splice_read)
+		return -EINVAL;
+
+	ret = rw_verify_area(READ, out, opos, len);
+	if (unlikely(ret < 0))
+		return ret;
+
+	ret = splice_direct_to_actor(in, &sd,
+			direct_splice_read_consumer_actor);
+	if (ret > 0)
+		*ppos = sd.pos;
+
+	return ret;
+}
+EXPORT_SYMBOL(do_splice_direct_read_consumer);
+
 static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 {
 	for (;;) {
@@ -1023,7 +1155,7 @@ long splice_file_to_pipe(struct file *in,
 	pipe_lock(opipe);
 	ret = wait_for_space(opipe, flags);
 	if (!ret)
-		ret = do_splice_to(in, offset, opipe, len, flags);
+		ret = do_splice_to(in, offset, opipe, len, flags, false);
 	pipe_unlock(opipe);
 	if (ret > 0)
 		wakeup_pipe_readers(opipe);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..e5f84902f149 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3169,6 +3169,8 @@ extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
 		struct file *out, loff_t *, size_t len, unsigned int flags);
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 		loff_t *opos, size_t len, unsigned int flags);
+extern long do_splice_direct_read_consumer(struct file *in, loff_t *ppos,
+		struct file *out, loff_t *opos, size_t len, unsigned int flags);
 
 
 extern void
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 6cb65df3e3ba..90c6ff8c82ef 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -72,6 +72,15 @@ struct pipe_inode_info {
 	unsigned int r_counter;
 	unsigned int w_counter;
 	bool poll_usage;
+
+	/*
+	 * If SPLICE_F_READ_TO_READ is applied, in->fops->splice_read()
+	 * should set this flag, so that out->fops->splice_read() can
+	 * observe this flag, then consume buffers in the pipe.
+	 *
+	 * Used by do_splice_direct_read_consumer() only.
+	 */
+	bool consumed_by_read;
 	struct page *tmp_page;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 9121624ad198..f48044e5e173 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -26,6 +26,17 @@
 /* used for io_uring interface only */
 #define SPLICE_F_DIRECT	(0x10)  /* direct splice and user needn't provide pipe */
 
+/*
+ * The usual splice is file-to-pipe and pipe-to-file, and this flag means the
+ * splice is file-to-pipe and file-to-pipe. Looks this way is stupid, but
+ * please understand from producer & consumer viewpoint, the 1st file-to-pipe
+ * is producer, and the 2nd file-to-pipe is consumer, so here the 2nd
+ * ->splice_read just consumes buffers stored in the pipe.
+ *
+ * And this flag is only valid in case of SPLICE_F_DIRECT.
+ */
+#define SPLICE_F_READ_TO_READ	(0x20)
+
 /*
  * Passed to the actors
  */
diff --git a/io_uring/splice.c b/io_uring/splice.c
index c11ea4cd1c7e..df66d89f4f17 100644
--- a/io_uring/splice.c
+++ b/io_uring/splice.c
@@ -28,7 +28,7 @@ static int __io_splice_prep(struct io_kiocb *req,
 {
 	struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice);
 	unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL |
-		SPLICE_F_DIRECT;
+		SPLICE_F_DIRECT | SPLICE_F_READ_TO_READ;
 
 	sp->len = READ_ONCE(sqe->len);
 	sp->flags = READ_ONCE(sqe->splice_flags);
@@ -111,12 +111,15 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
 	poff_out = (sp->off_out == -1) ? NULL : &sp->off_out;
 
 	if (sp->len) {
-		if (flags & SPLICE_F_DIRECT)
-			ret = do_splice_direct(in, poff_in, out, poff_out,
-					sp->len, flags);
-		else
+		if (!(flags & (SPLICE_F_DIRECT | SPLICE_F_READ_TO_READ)))
 			ret = do_splice(in, poff_in, out, poff_out, sp->len,
 					flags);
+		else if (flags & SPLICE_F_READ_TO_READ)
+			ret = do_splice_direct_read_consumer(in, poff_in, out,
+					poff_out, sp->len, flags);
+		else
+			ret = do_splice_direct(in, poff_in, out, poff_out,
+					sp->len, flags);
 	}
 
 	if (!(sp->flags & SPLICE_F_FD_IN_FIXED))

From patchwork Thu Nov 3 08:50:04 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 14764
From: Ming Lei
To: Jens Axboe, io-uring@vger.kernel.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Miklos Szeredi, Stefan Hajnoczi,
 ZiyangZhang, Ming Lei
Subject: [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy
Date: Thu, 3 Nov 2022 16:50:04 +0800
Message-Id: <20221103085004.1029763-5-ming.lei@redhat.com>
In-Reply-To: <20221103085004.1029763-1-ming.lei@redhat.com>
References: <20221103085004.1029763-1-ming.lei@redhat.com>
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1748464823891720836?= X-GMAIL-MSGID: =?utf-8?q?1748464823891720836?=

Pass ublk block IO request pages to the kernel backend IO handling code via a pipe, so that copying the request pages can be avoided. So far, the existing pipe/splice mechanism works for handling write requests only.

The initial idea of using splice for zero copy came from Miklos and Stefan.

Zero copy for read requests requires a change to the pipe so that one read end can produce buffers for another read end to consume. The added SPLICE_F_READ_TO_READ flag supports this.

READ is handled by sending IORING_OP_SPLICE with SPLICE_F_DIRECT | SPLICE_F_READ_TO_READ. WRITE is handled by sending IORING_OP_SPLICE with SPLICE_F_DIRECT.

A kernel-internal pipe is used to simplify userspace; it also avoids a potential info leak.

Suggested-by: Miklos Szeredi Suggested-by: Stefan Hajnoczi Signed-off-by: Ming Lei --- drivers/block/ublk_drv.c | 151 +++++++++++++++++++++++++++++++++- include/uapi/linux/ublk_cmd.h | 34 +++++++- 2 files changed, 182 insertions(+), 3 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index f96cb01e9604..c9d061547877 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -42,6 +42,8 @@ #include #include #include +#include +#include #include #define UBLK_MINORS (1U << MINORBITS) @@ -51,7 +53,8 @@ | UBLK_F_URING_CMD_COMP_IN_TASK \ | UBLK_F_NEED_GET_DATA \ | UBLK_F_USER_RECOVERY \ - | UBLK_F_USER_RECOVERY_REISSUE) + | UBLK_F_USER_RECOVERY_REISSUE \ + | UBLK_F_SPLICE_ZC) /* All UBLK_PARAM_TYPE_* should be included here */ #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD) @@ -61,6 +64,7 @@ struct ublk_rq_data { struct callback_head work; struct llist_node node; }; + atomic_t handled; }; struct ublk_uring_cmd_pdu { @@ -480,6 +484,9 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req, if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH) return rq_bytes; + if (ubq->flags & UBLK_F_SPLICE_ZC) + return rq_bytes; + if (ublk_rq_has_data(req)) { struct ublk_map_data data = { .ubq = ubq, @@ -501,6 +508,9 @@ static int ublk_unmap_io(const struct ublk_queue *ubq, { const unsigned int rq_bytes = blk_rq_bytes(req); + if (ubq->flags & UBLK_F_SPLICE_ZC) + return rq_bytes; + if (req_op(req) == REQ_OP_READ && ublk_rq_has_data(req)) { struct ublk_map_data data = { .ubq = ubq, @@ -858,6 +868,19 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, if (ublk_queue_can_use_recovery(ubq) && unlikely(ubq->force_abort)) return BLK_STS_IOERR; + if (ubq->flags & UBLK_F_SPLICE_ZC) { + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); + + atomic_set(&data->handled, 0); + + /* + * Order write ->handled and write rq->state in + * blk_mq_start_request, the pair barrier is the one 
+ * implied in atomic_add_return() in ublk_splice_read + */ + smp_wmb(); + } + blk_mq_start_request(bd->rq); if (unlikely(ubq_daemon_is_dying(ubq))) { @@ -1299,13 +1322,137 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) return -EIOCBQUEUED; } +static void ublk_pipe_buf_release(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ +} + +static const struct pipe_buf_operations ublk_pipe_buf_ops = { + .release = ublk_pipe_buf_release, +}; + +/* + * Pass request page reference to kernel backend IO handler via pipe + * + * ublk server has to handle backend IO via splice() + */ +static ssize_t ublk_splice_read(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, + size_t len, unsigned int flags) +{ + struct ublk_device *ub = in->private_data; + struct req_iterator rq_iter; + struct bio_vec bv; + struct request *req; + struct ublk_queue *ubq; + u16 tag, q_id; + unsigned int done; + int ret, buf_offset; + struct ublk_rq_data *data; + + if (!(flags & SPLICE_F_DIRECT)) + return -EPERM; + + /* No, we have to be the in side */ + if (pipe->consumed_by_read) + return -EINVAL; + + if (!ub) + return -EPERM; + + tag = ublk_pos_to_tag(*ppos); + q_id = ublk_pos_to_hwq(*ppos); + buf_offset = ublk_pos_to_buf_offset(*ppos); + + if (q_id >= ub->dev_info.nr_hw_queues) + return -EINVAL; + + ubq = ublk_get_queue(ub, q_id); + if (!ubq) + return -EINVAL; + + if (!(ubq->flags & UBLK_F_SPLICE_ZC)) + return -EINVAL; + + if (tag >= ubq->q_depth) + return -EINVAL; + + req = blk_mq_tag_to_rq(ub->tag_set.tags[q_id], tag); + if (!req || !blk_mq_request_started(req)) + return -EINVAL; + + data = blk_mq_rq_to_pdu(req); + if (atomic_add_return(len, &data->handled) > blk_rq_bytes(req) || !len) + return -EACCES; + + ret = -EINVAL; + if (!ublk_rq_has_data(req)) + goto exit; + + pr_devel("%s: qid %d tag %u offset %x, request bytes %u, len %llu\n", + __func__, q_id, tag, buf_offset, blk_rq_bytes(req), + (unsigned long long)len); + + if (buf_offset + len >
blk_rq_bytes(req)) + goto exit; + + if ((req_op(req) == REQ_OP_READ) && + !(flags & SPLICE_F_READ_TO_READ)) + goto exit; + + if ((req_op(req) != REQ_OP_READ) && + (flags & SPLICE_F_READ_TO_READ)) + goto exit; + + done = ret = 0; + /* todo: is iov_iter ready for handling multipage bvec? */ + rq_for_each_segment(bv, req, rq_iter) { + struct pipe_buffer buf = { + .ops = &ublk_pipe_buf_ops, + .flags = 0, + .page = bv.bv_page, + .offset = bv.bv_offset, + .len = bv.bv_len, + }; + + if (buf_offset > 0) { + if (buf_offset >= bv.bv_len) { + buf_offset -= bv.bv_len; + continue; + } else { + buf.offset += buf_offset; + buf.len -= buf_offset; + buf_offset = 0; + } + } + + ret = add_to_pipe(pipe, &buf); + if (unlikely(ret < 0)) + break; + done += ret; + } + + if (flags & SPLICE_F_READ_TO_READ) + pipe->consumed_by_read = true; + + WARN_ON_ONCE(done > len); + + if (done) { + *ppos += done; + ret = done; + } +exit: + return ret; +} + static const struct file_operations ublk_ch_fops = { .owner = THIS_MODULE, .open = ublk_ch_open, .release = ublk_ch_release, - .llseek = no_llseek, + .llseek = noop_llseek, .uring_cmd = ublk_ch_uring_cmd, .mmap = ublk_ch_mmap, + .splice_read = ublk_splice_read, }; static void ublk_deinit_queue(struct ublk_device *ub, int q_id) diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h index 8f88e3a29998..93d9ca7650ce 100644 --- a/include/uapi/linux/ublk_cmd.h +++ b/include/uapi/linux/ublk_cmd.h @@ -52,7 +52,36 @@ #define UBLKSRV_IO_BUF_OFFSET 0x80000000 /* tag bit is 12bit, so at most 4096 IOs for each queue */ -#define UBLK_MAX_QUEUE_DEPTH 4096 +#define UBLK_TAG_BITS 12 +#define UBLK_MAX_QUEUE_DEPTH (1U << UBLK_TAG_BITS) + +/* used in ->splice_read for supporting zero-copy */ +#define UBLK_BUFS_SIZE_BITS 42 +#define UBLK_BUFS_SIZE_MASK ((1ULL << UBLK_BUFS_SIZE_BITS) - 1) +#define UBLK_BUF_SIZE_BITS (UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS) +#define UBLK_BUF_MAX_SIZE (1ULL << UBLK_BUF_SIZE_BITS) + +static inline __u16 ublk_pos_to_hwq(__u64 
pos) +{ + return pos >> UBLK_BUFS_SIZE_BITS; +} + +static inline __u32 ublk_pos_to_buf_offset(__u64 pos) +{ + return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1); +} + +static inline __u16 ublk_pos_to_tag(__u64 pos) +{ + return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS; +} + +/* offset of single buffer, which has to be < UBLK_BUF_MAX_SIZE */ +static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset) +{ + return (((__u64)q_id) << UBLK_BUFS_SIZE_BITS) | + ((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset); +} /* * zero copy requires 4k block size, and can remap ublk driver's io @@ -79,6 +108,9 @@ #define UBLK_F_USER_RECOVERY_REISSUE (1UL << 4) +/* splice based read/write zero copy */ +#define UBLK_F_SPLICE_ZC (1UL << 5) + /* device state */ #define UBLK_S_DEV_DEAD 0 #define UBLK_S_DEV_LIVE 1
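
For reviewers who want to check the position layout used by ->splice_read, here is a minimal userspace sketch of the encode/decode helpers from the ublk_cmd.h hunk above. It is not part of the patch. The kernel __u16/__u32/__u64 types are replaced with <stdint.h> equivalents so it builds outside the kernel, and ublk_pos_roundtrip() is a hypothetical helper added only for illustration. The layout it exercises is: queue id in the bits above bit 42, tag in the next 12 bits down, buffer offset in the low 30 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Constants mirrored from the ublk_cmd.h hunk in this patch. */
#define UBLK_TAG_BITS        12
#define UBLK_BUFS_SIZE_BITS  42
#define UBLK_BUFS_SIZE_MASK  ((1ULL << UBLK_BUFS_SIZE_BITS) - 1)
#define UBLK_BUF_SIZE_BITS   (UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS)
#define UBLK_BUF_MAX_SIZE    (1ULL << UBLK_BUF_SIZE_BITS)

/* Helpers mirrored from the patch, with stdint types swapped in. */
static inline uint16_t ublk_pos_to_hwq(uint64_t pos)
{
	return pos >> UBLK_BUFS_SIZE_BITS;
}

static inline uint32_t ublk_pos_to_buf_offset(uint64_t pos)
{
	return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1);
}

static inline uint16_t ublk_pos_to_tag(uint64_t pos)
{
	return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS;
}

static inline uint64_t ublk_pos(uint16_t q_id, uint16_t tag, uint32_t offset)
{
	return (((uint64_t)q_id) << UBLK_BUFS_SIZE_BITS) |
		((((uint64_t)tag) << UBLK_BUF_SIZE_BITS) + offset);
}

/*
 * Hypothetical helper (not in the patch): returns 1 if the triple
 * survives an encode/decode round trip, 0 otherwise. An offset at or
 * above UBLK_BUF_MAX_SIZE would carry into the tag bits, and a tag at
 * or above 1 << UBLK_TAG_BITS would spill into the queue id bits, so
 * both are rejected up front.
 */
static int ublk_pos_roundtrip(uint16_t q_id, uint16_t tag, uint32_t offset)
{
	uint64_t pos;

	if (offset >= UBLK_BUF_MAX_SIZE || tag >= (1U << UBLK_TAG_BITS))
		return 0;

	pos = ublk_pos(q_id, tag, offset);
	return ublk_pos_to_hwq(pos) == q_id &&
	       ublk_pos_to_tag(pos) == tag &&
	       ublk_pos_to_buf_offset(pos) == offset;
}
```

Going by ublk_splice_read() above, which decodes *ppos with the same helpers, a ublk server would presumably pass ublk_pos(q_id, tag, offset) as the input offset of the IORING_OP_SPLICE request. The q_id, tag and offset values a caller picks here are arbitrary test inputs, not values taken from the patch.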