From patchwork Fri Nov 24 16:04:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169489
From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, gregkh@linuxfoundation.org, arnd@arndb.de, christian.koenig@amd.com, yi.l.liu@intel.com, jirislaby@kernel.org, stfrench@microsoft.com, jpanis@baylibre.com, jgg@ziepe.ca, contact@emersion.fr, dchinner@redhat.com, jack@suse.cz, linux@weissschuh.net, min15.li@samsung.com, dlemoal@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa Subject: [PATCH v6 01/11] documentation: Block Device Filtering Mechanism Date: Fri, 24 Nov 2023 17:04:49 +0100 Message-Id: <20231124160459.26227-2-sergei.shtepa@linux.dev> In-Reply-To: <20231124160459.26227-1-sergei.shtepa@linux.dev> References: <20231124160459.26227-1-sergei.shtepa@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org From: Sergei Shtepa The document contains: * A description of the purpose of the mechanism * A little historical background on the Linux kernel's capabilities for handling I/O units * A brief description of the design * A reference to the interface description Signed-off-by: Sergei Shtepa --- Documentation/block/blkfilter.rst | 66 +++++++++++++++++++++++++++++++ Documentation/block/index.rst | 1 + MAINTAINERS | 6 +++ 3 files changed, 73 insertions(+) create mode 100644 Documentation/block/blkfilter.rst diff --git a/Documentation/block/blkfilter.rst b/Documentation/block/blkfilter.rst new file mode 100644 index 000000000000..4e148e78f3d4 --- /dev/null +++ b/Documentation/block/blkfilter.rst @@ -0,0 +1,66 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Block Device Filtering Mechanism +================================ + +The block device filtering mechanism provides the ability to attach block +device filters. Block device filters allow performing additional processing +for I/O units. + +Introduction +============ + +The idea of handling I/O units on block devices is not new. Back in the +2.6 kernel, there was an undocumented possibility of handling I/O units +by substituting the make_request_fn() function, which belonged to the +request_queue structure. But none of the in-tree kernel modules used this +feature, and it was eliminated in the 5.10 kernel. + +The block device filtering mechanism restores the ability to handle I/O units. +It is possible to safely attach a filter to a block device "on the fly" without +changing the structure of the block device's stack. 
+ +It supports attaching one filter to one block device, because there is only +one filter implementation in the kernel so far. +See Documentation/block/blksnap.rst. + +Design +====== + +The block device filtering mechanism provides registration and unregistration +for filter operations. The struct blkfilter_operations contains pointers to +the callback functions for the filter. After registering the filter operations, +the filter can be managed using block device ioctls BLKFILTER_ATTACH, +BLKFILTER_DETACH and BLKFILTER_CTL. + +When the filter is attached, the callback function is called for each I/O unit +for a block device, providing I/O unit filtering. Depending on the result of +filtering the I/O unit, it can either be passed for subsequent processing by +the block layer, or skipped. + +The filter can be implemented as a loadable module. In this case, the filter +module cannot be unloaded while the filter is attached to at least one of the +block devices. + +Interface description +===================== + +The ioctl BLKFILTER_ATTACH allows user-space programs to attach a block device +filter to a block device. The ioctl BLKFILTER_DETACH allows user-space programs +to detach it. Both ioctls use &struct blkfilter_name. The ioctl BLKFILTER_CTL +allows user-space programs to send a filter-specific command. It uses &struct +blkfilter_ctl. + +.. kernel-doc:: include/uapi/linux/blk-filter.h + +To register in the system, the filter uses the &struct blkfilter_operations, +which contains callback functions, a unique filter name and module owner. When +attaching a filter to a block device, the filter creates a &struct blkfilter. +The pointer to the &struct blkfilter allows the filter to determine for which +block device the callback functions are being called. + +.. kernel-doc:: include/linux/blk-filter.h + +.. 
kernel-doc:: block/blk-filter.c + :export: diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst index 9fea696f9daa..e9712f72cd6d 100644 --- a/Documentation/block/index.rst +++ b/Documentation/block/index.rst @@ -10,6 +10,7 @@ Block bfq-iosched biovecs blk-mq + blkfilter cmdline-partition data-integrity deadline-iosched diff --git a/MAINTAINERS b/MAINTAINERS index 97f51d5ec1cf..c20cbec81b58 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3584,6 +3584,12 @@ M: Jan-Simon Moeller S: Maintained F: drivers/leds/leds-blinkm.c +BLOCK DEVICE FILTERING MECHANISM +M: Sergei Shtepa +L: linux-block@vger.kernel.org +S: Supported +F: Documentation/block/blkfilter.rst + BLOCK LAYER M: Jens Axboe L: linux-block@vger.kernel.org
From patchwork Fri Nov 24 16:04:50 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169488 From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, gregkh@linuxfoundation.org, arnd@arndb.de, christian.koenig@amd.com, yi.l.liu@intel.com, jirislaby@kernel.org, stfrench@microsoft.com, jpanis@baylibre.com, jgg@ziepe.ca, contact@emersion.fr, dchinner@redhat.com, jack@suse.cz, linux@weissschuh.net, min15.li@samsung.com, dlemoal@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa , Donald Buczek , Fabio Fantoni Subject: [PATCH v6 02/11] block: Block Device Filtering Mechanism Date: Fri, 24 Nov 2023 17:04:50 +0100 Message-Id: 
<20231124160459.26227-3-sergei.shtepa@linux.dev> In-Reply-To: <20231124160459.26227-1-sergei.shtepa@linux.dev> References: <20231124160459.26227-1-sergei.shtepa@linux.dev> X-Mailing-List: linux-kernel@vger.kernel.org From: Sergei Shtepa The block device filtering mechanism is an API that allows attaching block device filters. Block device filters allow performing additional processing for I/O units. The idea of handling I/O units on block devices is not new. Back in the 2.6 kernel, there was an undocumented possibility of handling I/O units by substituting the make_request_fn() function, which belonged to the request_queue structure. But none of the in-tree kernel modules used this feature, and it was eliminated in the 5.10 kernel. The block device filtering mechanism restores the ability to handle I/O units. It is possible to safely attach a filter to a block device "on the fly" without changing the structure of the block device's stack. 
Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Tested-by: Donald Buczek Tested-by: Fabio Fantoni Signed-off-by: Sergei Shtepa --- MAINTAINERS | 3 + block/Makefile | 3 +- block/bdev.c | 2 + block/blk-core.c | 35 ++++- block/blk-filter.c | 238 ++++++++++++++++++++++++++++++++ block/blk.h | 11 ++ block/genhd.c | 10 ++ block/ioctl.c | 7 + block/partitions/core.c | 9 ++ include/linux/blk-filter.h | 51 +++++++ include/linux/blk_types.h | 1 + include/linux/blkdev.h | 1 + include/linux/sched.h | 1 + include/uapi/linux/blk-filter.h | 35 +++++ include/uapi/linux/fs.h | 3 + 15 files changed, 408 insertions(+), 2 deletions(-) create mode 100644 block/blk-filter.c create mode 100644 include/linux/blk-filter.h create mode 100644 include/uapi/linux/blk-filter.h diff --git a/MAINTAINERS b/MAINTAINERS index c20cbec81b58..ef90cd0fec9c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3589,6 +3589,9 @@ M: Sergei Shtepa L: linux-block@vger.kernel.org S: Supported F: Documentation/block/blkfilter.rst +F: block/blk-filter.c +F: include/linux/blk-filter.h +F: include/uapi/linux/blk-filter.h BLOCK LAYER M: Jens Axboe diff --git a/block/Makefile b/block/Makefile index 46ada9dc8bbf..041c54eb0240 100644 --- a/block/Makefile +++ b/block/Makefile @@ -9,7 +9,8 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \ blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \ genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \ - disk-events.o blk-ia-ranges.o early-lookup.o + disk-events.o blk-ia-ranges.o early-lookup.o \ + blk-filter.o obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o diff --git a/block/bdev.c b/block/bdev.c index e4cfb7adb645..6039d99b3a75 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -412,6 +412,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno) return NULL; } bdev->bd_disk = disk; + bdev->bd_filter = NULL; return bdev; } @@ -1018,6 +1019,7 @@ void 
bdev_mark_dead(struct block_device *bdev, bool surprise) } invalidate_bdev(bdev); + blkfilter_detach(bdev); } /* * New drivers should not use this directly. There are some drivers however diff --git a/block/blk-core.c b/block/blk-core.c index fdf25b8d6e78..1de74240892a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -592,12 +593,34 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, static void __submit_bio(struct bio *bio) { + struct request_queue *q = bdev_get_queue(bio->bi_bdev); + bool skip_bio = false; + + if (unlikely(bio_queue_enter(bio))) + return; + + if (bio->bi_bdev->bd_filter && + bio->bi_bdev->bd_filter != current->blk_filter) { + struct blkfilter *prev = current->blk_filter; + + current->blk_filter = bio->bi_bdev->bd_filter; + skip_bio = bio->bi_bdev->bd_filter->ops->submit_bio(bio); + current->blk_filter = prev; + } + + blk_queue_exit(q); + if (skip_bio) + return; + if (unlikely(!blk_crypto_bio_prep(&bio))) return; if (!bio->bi_bdev->bd_has_submit_bio) { blk_mq_submit_bio(bio); - } else if (likely(bio_queue_enter(bio) == 0)) { + return; + } + + if (likely(bio_queue_enter(bio) == 0)) { struct gendisk *disk = bio->bi_bdev->bd_disk; disk->fops->submit_bio(bio); @@ -681,6 +704,15 @@ static void __submit_bio_noacct_mq(struct bio *bio) current->bio_list = NULL; } +/** + * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O + * from block device filter. + * @bio: The bio describing the location in memory and on the device. + * + * This is a version of submit_bio() that shall only be used for I/O that is + * resubmitted to lower level by block device filters. All file systems and + * other upper level users of the block layer should use submit_bio() instead. 
+ */ void submit_bio_noacct_nocheck(struct bio *bio) { blk_cgroup_bio_start(bio); @@ -708,6 +740,7 @@ void submit_bio_noacct_nocheck(struct bio *bio) else __submit_bio_noacct(bio); } +EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck); /** * submit_bio_noacct - re-submit a bio to the block device layer for I/O diff --git a/block/blk-filter.c b/block/blk-filter.c new file mode 100644 index 000000000000..8e2550bed0c5 --- /dev/null +++ b/block/blk-filter.c @@ -0,0 +1,238 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#include +#include +#include + +#include "blk.h" + +static LIST_HEAD(blkfilters); +static DEFINE_SPINLOCK(blkfilters_lock); + +static inline struct blkfilter_operations *__blkfilter_find(const char *name) +{ + struct blkfilter_operations *ops; + + list_for_each_entry(ops, &blkfilters, link) + if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0) + return ops; + + return NULL; +} + +static inline struct blkfilter_operations *blkfilter_find_get(const char *name) +{ + struct blkfilter_operations *ops; + + spin_lock(&blkfilters_lock); + ops = __blkfilter_find(name); + if (ops && !try_module_get(ops->owner)) + ops = NULL; + spin_unlock(&blkfilters_lock); + + return ops; +} + +static inline void blkfilter_put(const struct blkfilter_operations *ops) +{ + module_put(ops->owner); +} + +int blkfilter_ioctl_attach(struct block_device *bdev, + struct blkfilter_name __user *argp) +{ + struct blkfilter_name name; + struct blkfilter_operations *ops; + struct blkfilter *flt; + int ret; + + if (copy_from_user(&name, argp, sizeof(name))) + return -EFAULT; + + ops = blkfilter_find_get(name.name); + if (!ops) + return -ENOENT; + + mutex_lock(&bdev->bd_disk->open_mutex); + if (!disk_live(bdev->bd_disk)) { + ret = -ENODEV; + goto out_mutex_unlock; + } + ret = freeze_bdev(bdev); + if (ret) + goto out_mutex_unlock; + blk_mq_freeze_queue(bdev->bd_queue); + + if (bdev->bd_filter) { + if (bdev->bd_filter->ops == ops) + ret = 
-EALREADY; + else + ret = -EBUSY; + goto out_unfreeze; + } + + flt = ops->attach(bdev); + if (IS_ERR(flt)) { + ret = PTR_ERR(flt); + goto out_unfreeze; + } + + flt->ops = ops; + bdev->bd_filter = flt; + +out_unfreeze: + blk_mq_unfreeze_queue(bdev->bd_queue); + thaw_bdev(bdev); +out_mutex_unlock: + mutex_unlock(&bdev->bd_disk->open_mutex); + if (ret) + blkfilter_put(ops); + return ret; +} + +static void __blkfilter_detach(struct block_device *bdev) +{ + struct blkfilter *flt = bdev->bd_filter; + const struct blkfilter_operations *ops = flt->ops; + + bdev->bd_filter = NULL; + ops->detach(flt); + blkfilter_put(ops); +} + +void blkfilter_detach(struct block_device *bdev) +{ + if (bdev->bd_filter) { + blk_mq_freeze_queue(bdev->bd_queue); + __blkfilter_detach(bdev); + blk_mq_unfreeze_queue(bdev->bd_queue); + } +} + +int blkfilter_ioctl_detach(struct block_device *bdev, + struct blkfilter_name __user *argp) +{ + struct blkfilter_name name; + int ret = 0; + + if (copy_from_user(&name, argp, sizeof(name))) + return -EFAULT; + + mutex_lock(&bdev->bd_disk->open_mutex); + if (!disk_live(bdev->bd_disk)) { + ret = -ENODEV; + goto out_mutex_unlock; + } + blk_mq_freeze_queue(bdev->bd_queue); + if (!bdev->bd_filter) { + ret = -ENOENT; + goto out_unfreeze; + } + if (strncmp(bdev->bd_filter->ops->name, name.name, + BLKFILTER_NAME_LENGTH)) { + ret = -EINVAL; + goto out_unfreeze; + } + + __blkfilter_detach(bdev); +out_unfreeze: + blk_mq_unfreeze_queue(bdev->bd_queue); +out_mutex_unlock: + mutex_unlock(&bdev->bd_disk->open_mutex); + return ret; +} + +int blkfilter_ioctl_ctl(struct block_device *bdev, + struct blkfilter_ctl __user *argp) +{ + struct blkfilter_ctl ctl; + struct blkfilter *flt; + int ret; + + if (copy_from_user(&ctl, argp, sizeof(ctl))) + return -EFAULT; + + mutex_lock(&bdev->bd_disk->open_mutex); + if (!disk_live(bdev->bd_disk)) { + ret = -ENODEV; + goto out_mutex_unlock; + } + ret = blk_queue_enter(bdev_get_queue(bdev), 0); + if (ret) + goto out_mutex_unlock; + + flt = 
bdev->bd_filter; + if (!flt || strncmp(flt->ops->name, ctl.name, BLKFILTER_NAME_LENGTH)) { + ret = -ENOENT; + goto out_queue_exit; + } + + if (!flt->ops->ctl) { + ret = -ENOTTY; + goto out_queue_exit; + } + + ret = flt->ops->ctl(flt, ctl.cmd, u64_to_user_ptr(ctl.opt), + &ctl.optlen); +out_queue_exit: + blk_queue_exit(bdev_get_queue(bdev)); +out_mutex_unlock: + mutex_unlock(&bdev->bd_disk->open_mutex); + return ret; +} + +ssize_t blkfilter_show(struct block_device *bdev, char *buf) +{ + ssize_t ret = 0; + + blk_mq_freeze_queue(bdev->bd_queue); + if (bdev->bd_filter) + ret = sprintf(buf, "%s\n", bdev->bd_filter->ops->name); + else + ret = sprintf(buf, "\n"); + blk_mq_unfreeze_queue(bdev->bd_queue); + + return ret; +} + +/** + * blkfilter_register() - Register block device filter operations + * @ops: The operations to register. + * + * Return: + * 0 if succeeded, + * -EBUSY if a block device filter with the same name is already + * registered. + */ +int blkfilter_register(struct blkfilter_operations *ops) +{ + struct blkfilter_operations *found; + int ret = 0; + + spin_lock(&blkfilters_lock); + found = __blkfilter_find(ops->name); + if (found) + ret = -EBUSY; + else + list_add_tail(&ops->link, &blkfilters); + spin_unlock(&blkfilters_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(blkfilter_register); + +/** + * blkfilter_unregister() - Unregister block device filter operations + * @ops: The operations to unregister. + * + * Important: before unloading, it is necessary to detach the filter from all + * block devices. 
+ * + */ +void blkfilter_unregister(struct blkfilter_operations *ops) +{ + spin_lock(&blkfilters_lock); + list_del(&ops->link); + spin_unlock(&blkfilters_lock); +} +EXPORT_SYMBOL_GPL(blkfilter_unregister); diff --git a/block/blk.h b/block/blk.h index 08a358bc0919..1f104f4865c3 100644 --- a/block/blk.h +++ b/block/blk.h @@ -7,6 +7,8 @@ #include #include "blk-crypto-internal.h" +struct blkfilter_ctl; +struct blkfilter_name; struct elevator_type; /* Max future timer expiry for timeouts */ @@ -474,6 +476,15 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); extern const struct address_space_operations def_blk_aops; +int blkfilter_ioctl_attach(struct block_device *bdev, + struct blkfilter_name __user *argp); +int blkfilter_ioctl_detach(struct block_device *bdev, + struct blkfilter_name __user *argp); +int blkfilter_ioctl_ctl(struct block_device *bdev, + struct blkfilter_ctl __user *argp); +void blkfilter_detach(struct block_device *bdev); +ssize_t blkfilter_show(struct block_device *bdev, char *buf); + int disk_register_independent_access_ranges(struct gendisk *disk); void disk_unregister_independent_access_ranges(struct gendisk *disk); diff --git a/block/genhd.c b/block/genhd.c index c9d06f72c587..ba744e3fd581 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "blk-throttle.h" #include "blk.h" @@ -654,6 +655,7 @@ void del_gendisk(struct gendisk *disk) mutex_lock(&disk->open_mutex); xa_for_each(&disk->part_tbl, idx, part) remove_inode_hash(part->bd_inode); + blkfilter_detach(disk->part0); mutex_unlock(&disk->open_mutex); /* @@ -1044,6 +1046,12 @@ static ssize_t diskseq_show(struct device *dev, return sprintf(buf, "%llu\n", disk->diskseq); } +static ssize_t disk_filter_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return blkfilter_show(dev_to_bdev(dev), buf); +} + static DEVICE_ATTR(range, 0444, disk_range_show, NULL); static DEVICE_ATTR(ext_range, 0444, 
disk_ext_range_show, NULL); static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL); @@ -1057,6 +1065,7 @@ static DEVICE_ATTR(stat, 0444, part_stat_show, NULL); static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL); static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store); static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL); +static DEVICE_ATTR(filter, 0444, disk_filter_show, NULL); #ifdef CONFIG_FAIL_MAKE_REQUEST ssize_t part_fail_show(struct device *dev, @@ -1103,6 +1112,7 @@ static struct attribute *disk_attrs[] = { &dev_attr_events_async.attr, &dev_attr_events_poll_msecs.attr, &dev_attr_diskseq.attr, + &dev_attr_filter.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, #endif diff --git a/block/ioctl.c b/block/ioctl.c index 4160f4e6bd5b..1b11303e213b 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -2,6 +2,7 @@ #include #include #include +#include #include #include #include @@ -572,6 +573,12 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode, return blkdev_pr_preempt(bdev, mode, argp, true); case IOC_PR_CLEAR: return blkdev_pr_clear(bdev, mode, argp); + case BLKFILTER_ATTACH: + return blkfilter_ioctl_attach(bdev, argp); + case BLKFILTER_DETACH: + return blkfilter_ioctl_detach(bdev, argp); + case BLKFILTER_CTL: + return blkfilter_ioctl_ctl(bdev, argp); default: return -ENOIOCTLCMD; } diff --git a/block/partitions/core.c b/block/partitions/core.c index f47ffcfdfcec..19c69dc23d2c 100644 --- a/block/partitions/core.c +++ b/block/partitions/core.c @@ -10,6 +10,7 @@ #include #include #include +#include #include "check.h" static int (*const check_part[])(struct parsed_partitions *) = { @@ -200,6 +201,12 @@ static ssize_t part_discard_alignment_show(struct device *dev, return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev))); } +static ssize_t part_filter_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return blkfilter_show(dev_to_bdev(dev), buf); +} + static 
DEVICE_ATTR(partition, 0444, part_partition_show, NULL); static DEVICE_ATTR(start, 0444, part_start_show, NULL); static DEVICE_ATTR(size, 0444, part_size_show, NULL); @@ -208,6 +215,7 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL); static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL); static DEVICE_ATTR(stat, 0444, part_stat_show, NULL); static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL); +static DEVICE_ATTR(filter, 0444, part_filter_show, NULL); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, 0644, part_fail_show, part_fail_store); @@ -222,6 +230,7 @@ static struct attribute *part_attrs[] = { &dev_attr_discard_alignment.attr, &dev_attr_stat.attr, &dev_attr_inflight.attr, + &dev_attr_filter.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, #endif diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h new file mode 100644 index 000000000000..0afdb40f3bab --- /dev/null +++ b/include/linux/blk-filter.h @@ -0,0 +1,51 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef _LINUX_BLK_FILTER_H +#define _LINUX_BLK_FILTER_H + +#include + +struct bio; +struct block_device; +struct blkfilter_operations; + +/** + * struct blkfilter - Block device filter. + * + * @ops: Block device filter operations. + * + * For each filtered block device, the filter creates a data structure + * associated with this device. The data in this structure is specific to the + * filter, but it must contain a pointer to the block device filter account. + */ +struct blkfilter { + const struct blkfilter_operations *ops; +}; + +/** + * struct blkfilter_operations - Block device filter operations. + * + * @link: Entry in the global list of filter drivers + * (must not be accessed by the driver). + * @owner: Module implementing the filter driver. + * @name: Name of the filter driver. 
+ * @attach: Attach the filter driver to the block device. + * @detach: Detach the filter driver from the block device. + * @ctl: Send a control command to the filter driver. + * @submit_bio: Handle bio submissions to the filter driver. + */ +struct blkfilter_operations { + struct list_head link; + struct module *owner; + const char *name; + struct blkfilter *(*attach)(struct block_device *bdev); + void (*detach)(struct blkfilter *flt); + int (*ctl)(struct blkfilter *flt, const unsigned int cmd, + __u8 __user *buf, __u32 *plen); + bool (*submit_bio)(struct bio *bio); +}; + +int blkfilter_register(struct blkfilter_operations *ops); +void blkfilter_unregister(struct blkfilter_operations *ops); + +#endif /* _LINUX_BLK_FILTER_H */ diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index d5c5e59ddbd2..490865292fde 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -74,6 +74,7 @@ struct block_device { * path */ struct device bd_device; + struct blkfilter *bd_filter; } __randomize_layout; #define bdev_whole(_bdev) \ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 51fa7ffdee83..6a0754007d1d 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -834,6 +834,7 @@ void blk_request_module(dev_t devt); extern int blk_register_queue(struct gendisk *disk); extern void blk_unregister_queue(struct gendisk *disk); +void submit_bio_noacct_nocheck(struct bio *bio); void submit_bio_noacct(struct bio *bio); struct bio *bio_split_to_limits(struct bio *bio); diff --git a/include/linux/sched.h b/include/linux/sched.h index 292c31697248..e7c3cd490a80 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1190,6 +1190,7 @@ struct task_struct { /* Stack plugging: */ struct blk_plug *plug; + struct blkfilter *blk_filter; /* VM state: */ struct reclaim_state *reclaim_state; diff --git a/include/uapi/linux/blk-filter.h b/include/uapi/linux/blk-filter.h new file mode 100644 index 000000000000..18885dc1b717 --- 
/dev/null +++ b/include/uapi/linux/blk-filter.h @@ -0,0 +1,35 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef _UAPI_LINUX_BLK_FILTER_H +#define _UAPI_LINUX_BLK_FILTER_H + +#include + +#define BLKFILTER_NAME_LENGTH 32 + +/** + * struct blkfilter_name - parameter for BLKFILTER_ATTACH and BLKFILTER_DETACH + * ioctl. + * + * @name: Name of block device filter. + */ +struct blkfilter_name { + __u8 name[BLKFILTER_NAME_LENGTH]; +}; + +/** + * struct blkfilter_ctl - parameter for BLKFILTER_CTL ioctl + * + * @name: Name of block device filter. + * @cmd: The filter-specific operation code of the command. + * @optlen: Size of data at @opt. + * @opt: Userspace buffer with options. + */ +struct blkfilter_ctl { + __u8 name[BLKFILTER_NAME_LENGTH]; + __u32 cmd; + __u32 optlen; + __u64 opt; +}; + +#endif /* _UAPI_LINUX_BLK_FILTER_H */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index da43810b7485..f96809cd2f50 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -189,6 +189,9 @@ struct fsxattr { * A jump here: 130-136 are reserved for zoned block devices * (see uapi/linux/blkzoned.h) */ +#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name) +#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name) +#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl) #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ #define FIBMAP _IO(0x00,1) /* bmap access */ From patchwork Fri Nov 24 16:04:51 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169487 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1332833vqx; Fri, 24 Nov 2023 08:11:52 -0800 (PST) X-Google-Smtp-Source: AGHT+IGpQTVycevnLLhjRcmKdF33Szh9p13asyS0uokBp5NiR2L7r2UQEpg61pw/AQCbXd/RPNf8 X-Received: by 
2002:a05:6808:d53:b0:3b2:e60d:27f6 with SMTP id w19-20020a0568080d5300b003b2e60d27f6mr3380675oik.29.1700842312033; Fri, 24 Nov 2023 08:11:52 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700841908; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MDyOlhsQ3H7gKFA4WT2r4Df9eQ2QXv4dHEaGU3xXd2I=; b=J7Lzh9/scvaiTfqId+dWpkR4jUqnIolW7YbrwVLXhNkBnmOrZWjZBvWw9itFxUOAcgcQsT 0nRQQ6HyJltwvPjVq60ecfPpzIOTTDROL8AsuhVyhb8WImgxb+YZvwixE4C17dW4sg8MMX RkcxedXWBcBfh/1Boj0DJqctWzlAmZE= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, gregkh@linuxfoundation.org, arnd@arndb.de, christian.koenig@amd.com, yi.l.liu@intel.com, jirislaby@kernel.org, stfrench@microsoft.com, jpanis@baylibre.com, jgg@ziepe.ca, contact@emersion.fr, dchinner@redhat.com, jack@suse.cz, linux@weissschuh.net, min15.li@samsung.com, dlemoal@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa , Bagas Sanjaya , Fabio Fantoni Subject: [PATCH v6 03/11] documentation: Block Devices Snapshots Module Date: Fri, 24 Nov 2023 17:04:51 +0100 Message-Id: <20231124160459.26227-4-sergei.shtepa@linux.dev> In-Reply-To: <20231124160459.26227-1-sergei.shtepa@linux.dev> References: <20231124160459.26227-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: 
bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 08:11:13 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783462428180927035 X-GMAIL-MSGID: 1783462428180927035 From: Sergei Shtepa The document contains: * Description of the purpose of the mechanism * Description of features * Description of algorithms * Recommendations about using the module from the user-space side * Reference to module interface description Reviewed-by: Bagas Sanjaya Reviewed-by: Fabio Fantoni Signed-off-by: Sergei Shtepa --- Documentation/block/blksnap.rst | 352 ++++++++++++++++++++++++++++++++ Documentation/block/index.rst | 1 + MAINTAINERS | 6 + 3 files changed, 359 insertions(+) create mode 100644 Documentation/block/blksnap.rst diff --git a/Documentation/block/blksnap.rst b/Documentation/block/blksnap.rst new file mode 100644 index 000000000000..ef6010e46858 --- /dev/null +++ b/Documentation/block/blksnap.rst @@ -0,0 +1,352 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +Block Devices Snapshots Module (blksnap) +======================================== + +Introduction +============ + +At first glance, there is no novelty in the idea of creating snapshots for +block devices. The Linux kernel already has mechanisms for creating snapshots. +Device Mapper includes dm-snap, which can create snapshots of block +devices. BTRFS supports snapshots at the filesystem level. However, both of +these options have flaws that prevent their use as a universal tool for +creating backups. 
+ +The main properties that a backup tool should have are: + +- Simplicity and universality of use +- Reliability +- Minimal consumption of system resources during backup +- Minimal time required for recovery or replication of the entire system + +Taking the above properties into account, the blksnap module features: + +- Change tracker +- Snapshots at the block device level +- Dynamic allocation of space for storing differences +- Snapshot overflow resistance +- Coherent snapshot of multiple block devices + +Features +======== + +Change tracker +-------------- + +The change tracker makes it possible to determine which blocks were changed +between the most recent snapshot and any of the previous snapshots. +With a map of changes, it is enough to copy only the changed blocks; there is no +need to reread the entire block device. The change tracker makes it possible +to implement the logic of both incremental and differential backups. +Incremental backup is critical for large file repositories whose size can be +hundreds of terabytes and whose full backup time can take more than a day. +On such servers, the use of backup tools without a change tracker becomes +practically impossible. + +Snapshot at the block device level +---------------------------------- + +A snapshot at the block device level simplifies the backup algorithm +and reduces consumption of system resources. It also permits linear +reading of disk space directly, which achieves maximum reading speed +with minimal use of processor time. At the same time, the universality of +creating snapshots for any block device is achieved, regardless of the file +system located on it. The exceptions are BTRFS, ZFS and cluster file systems. + +Dynamic allocation of storage space for differences +--------------------------------------------------- + +To store differences, the module does not require pre-reserved space on a +filesystem. 
The space for storing differences can be allocated in a file on any +filesystem. In addition, the size of the difference storage can be increased +after the snapshot is created, but only for a filesystem that supports +fallocate. A shared difference storage for all images of snapshot block devices +optimizes the use of storage space. However, there is one limitation. +A snapshot cannot be taken from a block device on which the difference storage +is located. + +Snapshot overflow resistance +---------------------------- + +To create images of snapshots of block devices, the module stores blocks +of the original block device that have been changed since the snapshot +was taken. To do this, the module handles write requests and reads blocks +that are about to be overwritten. This algorithm guarantees safety of the data +of the original block device in the event of an overflow of the snapshot, +and even in the case of unpredictable critical errors. If a problem occurs +during backup, the difference storage is released, the snapshot is closed, and +no backup is created, but the server continues to work. + +Coherent snapshot of multiple block devices +------------------------------------------- + +A snapshot is created simultaneously for all block devices for which a backup +is being created, ensuring their coherent state. + + +Algorithms +========== + +Overview +-------- + +The blksnap module is a block-level filter. It handles all write I/O units. +The filter is attached to the block device when the snapshot is created +for the first time. The change tracker marks all overwritten blocks. +Information about the history of changes on the block device is available +while the snapshot is held. The module reads the blocks that are about to be +overwritten and stores them in the difference storage. When reading from +a snapshot image, reading is performed either from the original device or +from the difference storage. 
+ +Change tracking +--------------- + +A change tracker map is created for each block device. One byte of this map +corresponds to one block. The block size is set by the +``tracking_block_minimum_shift`` and ``tracking_block_maximum_count`` +module parameters. The ``tracking_block_minimum_shift`` parameter limits +the minimum block size for tracking, while ``tracking_block_maximum_count`` +defines the maximum allowed number of blocks. The size of the change tracker +block is determined depending on the size of the block device when adding +a tracking device, that is, when the snapshot is taken for the first time. +The block size must be a power of two. The ``tracking_block_maximum_shift`` +module parameter limits the maximum block size for tracking. If the +block size reaches the allowable limit, the number of blocks will exceed the +``tracking_block_maximum_count`` parameter. + +Each byte of the change map stores a number from 0 to 255. This is the +number of the latest snapshot since whose creation the block has been +changed. Each time a snapshot is created, the number of the current +snapshot is increased by one. This number is written to the cell of the +change map when writing to the block. Thus, knowing the number of one of +the previous snapshots and the number of the last snapshot, one can determine +from the change map which blocks have been changed. When the number of the +current snapshot reaches 255, the maximum value a map cell can hold, then at the +time when the next snapshot is created, the map of changes is reset to zero, +and the number of the current snapshot is assigned the value 1. The change +tracker is reset, and a new UUID is generated as a unique identifier of the +snapshot generation. The snapshot generation identifier makes it possible to detect +that a change tracking reset has been performed. + +The change map has two copies. One copy is active and tracks the current +changes on the block device. 
The second copy is available for reading +while the snapshot is being held, and contains the history up to the moment +the snapshot was taken. Copies are synchronized at the moment of snapshot +creation. After the snapshot is released, the second copy of the map is not +needed, but it is not released, so as not to allocate memory for it again +the next time the snapshot is created. + +Copy on write +------------- + +Data is copied in blocks, or rather in chunks. The term "chunk" is used to +avoid confusion with change tracker blocks and I/O blocks. In addition, +the "chunk" in the blksnap module means about the same as the "chunk" in +the dm-snap module. + +The size of the chunk is determined by the ``chunk_minimum_shift`` and +``chunk_maximum_count`` module parameters. The ``chunk_minimum_shift`` +parameter limits the minimum size of the chunk, while ``chunk_maximum_count`` +defines the maximum allowed number of chunks. The size of the chunk is +determined depending on the size of the block device at the time of taking the +snapshot. The size of the chunk must be a power of two. The module parameter +``chunk_maximum_shift`` limits the maximum chunk size. If the chunk +size reaches the allowable limit, the number of chunks will exceed the +``chunk_maximum_count`` parameter. + +One chunk is described by the ``struct chunk`` structure. A map of structures +is created for each block device. The structure contains all the necessary +information to copy the chunk's data from the original block device to the +difference storage. This information makes it possible to describe the snapshot image. +The structure contains a semaphore, which synchronizes the threads +accessing the chunk. + +The block layer in Linux has one peculiarity. If a read I/O unit was sent, and a +write I/O unit was sent after it, then the write can be performed first, and only +then the read. Therefore, the copy-on-write algorithm is executed synchronously. 
+When a write request is handled, the execution of this I/O unit is delayed +until the chunks to be overwritten are read from the original device for later +storing to the difference storage. But if, when handling a write I/O unit, it +turns out that the written range of sectors has already been prepared for +storing to the difference storage, then the I/O unit is simply passed. + +This algorithm makes it possible to efficiently back up even systems +with round-robin databases (RRD). Such databases can be overwritten several times +during the system backup. Of course, the value of a backup of the RRD monitoring +system data can be questioned. However, it is often a task to make a backup +of the entire enterprise infrastructure in order to restore or replicate it +entirely in case of problems. + +There is also a flaw in the algorithm. When overwriting at least one sector, +an entire chunk is copied. Thus, the difference storage can fill up rapidly +when data is written to a block device in small portions in random order. +This situation is possible in case of strong fragmentation of +data on the filesystem. But it must be borne in mind that with such data +fragmentation, performance of systems usually degrades greatly. So, this +problem does not occur on real servers, although it can easily be created +by artificial tests. + +Difference storage +------------------ + +The difference storage can be a block device or it can be a file on a +filesystem. Using a block device yields slightly higher performance, +but in this case, the block device is used by the kernel module exclusively. +Usually the disk space is partitioned so that no free space is left available +for backup purposes. Using a file makes it possible to place the difference storage on a +filesystem. + +The difference storage can be expanded while the snapshot is being held, +but only if the filesystem supports fallocate(). 
If the free space in the +difference storage drops below half of the value of the module parameter +``diff_storage_minimum``, then the kernel module can expand the difference +storage file within the specified limit. This limit is set when creating a +snapshot. + +If free space in the difference storage runs out, a snapshot overflow event is +generated for user space. Such a snapshot is considered +corrupted, and read I/O units to snapshot images will be terminated with an +error code. The difference storage stores outdated data required for snapshot +images, so when the snapshot overflows, the backup process is interrupted, +but the system maintains its operability without data loss. + +The difference storage has a limitation. A device on which the difference +storage is located cannot be added to the snapshot. In this case, the difference +storage can be located in virtual memory, which consists of RAM and a swap +partition (or file). To do this, it is enough to use a file in /dev/shm, or a +new tmpfs filesystem can be created for this purpose. Obviously, this variant +can be useful if the system has a lot of RAM or a large swap. The good news is +that the modern Linux kernel allows the size of the swap file to be increased "on +the fly" without changing the system configuration. + +A regular file or a block device file for the difference storage must be opened +with the O_EXCL flag. If an unnamed file with the O_TMPFILE flag is created, +then such a file will be automatically released when the snapshot is destroyed. +In addition, the use of an unnamed temporary file ensures that no one can open +this file and read its contents. + +Performing I/O for a snapshot image +----------------------------------- + +To make snapshot data readable, block devices of snapshot images +are created when a snapshot is taken. The snapshot image block devices support the write operation. 
+This makes it possible to perform additional data preparation on the filesystem before +creating a backup. + +To process the I/O unit, clones of the I/O unit are created, which redirect +the I/O unit either to the original block device or to the difference storage. +When processing of cloned I/O units is completed, the original I/O unit is +marked as completed too. + +An I/O unit can be partially processed without accessing block devices if +the I/O unit refers to a chunk that is in the queue for storing to the +difference storage. In this case, the data is read or written in a buffer in +memory. + +If, when processing the write I/O unit, it turns out that the data of the +referred chunk has not yet been stored to the difference storage or has not +even been read from the original device, then an I/O unit to read data from the +original device is initiated beforehand. After the read from the original device +completes, the data from the I/O unit partially overwrites the contents of +the buffer of the chunk in memory, and the chunk is scheduled to be saved to the +difference storage. + +How to use +========== + +Depending on the needs and the selected license, you can choose different +options for managing the module: + +- Using ioctl directly +- Using a static C++ library +- Using the blksnap console tool + +Using a BLKFILTER_CTL for block device +-------------------------------------- + +BLKFILTER_CTL makes it possible to send a filter-specific command to the filter on a block +device and get the result of its execution. The module provides the +``include/uapi/linux/blksnap.h`` header file with a description of the commands and +their data structures. + +1. ``blkfilter_ctl_blksnap_cbtinfo`` gets information from the + change tracker. +2. ``blkfilter_ctl_blksnap_cbtmap`` reads the change tracker table. If a write + operation was performed for the snapshot, then the change tracker takes this + into account. 
Therefore, it is necessary to read the tracker data only after write + operations have been completed. +3. ``blkfilter_ctl_blksnap_cbtdirty`` marks blocks as changed in the change + tracker table. This is necessary if post-processing is performed after the + backup is created, which changes the backup blocks. +4. ``blkfilter_ctl_blksnap_snapshotadd`` adds a block device to the snapshot. +5. ``blkfilter_ctl_blksnap_snapshotinfo`` gets the name of the snapshot + image block device and indicates whether an error has occurred. + +Using ioctl +----------- + +The BLKFILTER_CTL ioctl alone is not sufficient to fully manage +the blksnap module. A control file ``blksnap-control`` is created to manage +snapshots. The control commands are also described in the file +``include/uapi/linux/blksnap.h``. + +1. ``blksnap_ioctl_version`` gets the version number. +2. ``blk_snap_ioctl_snapshot_create`` initiates the snapshot creation process. +3. ``blk_snap_ioctl_snapshot_append_storage`` adds a range of blocks to the + difference storage. +4. ``blk_snap_ioctl_snapshot_take`` creates block devices of block device + snapshot images. +5. ``blk_snap_ioctl_snapshot_collect`` collects all created snapshots. +6. ``blk_snap_ioctl_snapshot_wait_event`` tracks the status of + snapshots and receives events about the requirement to expand the difference + storage or about snapshot overflow. +7. ``blk_snap_ioctl_snapshot_destroy`` releases the snapshot. + +Static C++ library +------------------ + +The [#userspace_libs]_ library was created primarily to simplify creation of +tests in C++, and it is also a good example of using the module interface. +When creating applications, direct use of control calls is preferable. +However, the library can be used in an application with a GPL-2+ license, +or a library with an LGPL-2+ license can be created, with which even a +proprietary application can be dynamically linked. 
+ +blksnap console tool +-------------------- + +The blksnap [#userspace_tools]_ console tool makes it possible to control the module from +the command line. The tool contains detailed built-in help. To get a list of +commands with usage descriptions, run ``blksnap --help``. The ``blksnap +<command name> --help`` command gives detailed information about the +parameters of each command call. This option may be convenient when creating +proprietary software, as it avoids the need to compile against the open source code. +At the same time, the blksnap tool can be used for creating backup scripts. +For example, rsync can be called to synchronize files on the filesystem of +the mounted snapshot image and files in the archive on a filesystem that +supports compression. + +Tests +----- + +A set of tests was created for regression testing [#userspace_tests]_. +Tests with simple algorithms that use the ``blksnap`` console tool to +control the module are written in Bash. More complex testing algorithms +are implemented in C++. + +References +========== + +.. [#userspace_libs] https://github.com/veeam/blksnap/tree/stable-v2.0/lib + +.. [#userspace_tools] https://github.com/veeam/blksnap/tree/stable-v2.0/tools + +.. [#userspace_tests] https://github.com/veeam/blksnap/tree/stable-v2.0/tests + +Module interface description +============================ + +.. 
kernel-doc:: include/uapi/linux/blksnap.h diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst index e9712f72cd6d..696ff150c6b7 100644 --- a/Documentation/block/index.rst +++ b/Documentation/block/index.rst @@ -11,6 +11,7 @@ Block biovecs blk-mq blkfilter + blksnap cmdline-partition data-integrity deadline-iosched diff --git a/MAINTAINERS b/MAINTAINERS index ef90cd0fec9c..9c81e4c83139 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3593,6 +3593,12 @@ F: block/blk-filter.c F: include/linux/blk-filter.h F: include/uapi/linux/blk-filter.h +BLOCK DEVICE SNAPSHOTS MODULE +M: Sergei Shtepa +L: linux-block@vger.kernel.org +S: Supported +F: Documentation/block/blksnap.rst + BLOCK LAYER M: Jens Axboe L: linux-block@vger.kernel.org From patchwork Fri Nov 24 16:04:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169490 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1341060vqx; Fri, 24 Nov 2023 08:20:42 -0800 (PST) X-Google-Smtp-Source: AGHT+IE2iKKVdfyWTC/rRZHwIDI3SPgMf6rJAztBiXJg5JP4fd8xGZ5acZvvm45+iXjCGYl/6yGi X-Received: by 2002:a05:6820:546:b0:571:aceb:26c8 with SMTP id n6-20020a056820054600b00571aceb26c8mr700773ooj.3.1700842841780; Fri, 24 Nov 2023 08:20:41 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700842841; cv=none; d=google.com; s=arc-20160816; b=lhdWk4M1C/AisXggLp8YeEjatwYgt4i4dE1lu1Ze642d/MmHMj9xicPKPCC2YmWHi/ uWJ43pKKNbocYIw4xTtSDNZjN48Uf4EN8DYOJPUj26XGMgBuv4rjumwf561XASPjeQ80 BaRkRGZvNHE1V0ZpaM9HL98sgTU6EcBY01vy0t6HGFz7b7anIpcXFPM4ETRKzf5oTnEs YaJU4KIW5DzKbMwTODyggLzcOEFin2iRdxlYSsnBEaxJKmVVFPhCGRhxs4azAkmOhfbg hS/aQX+PN8W7TzoSv0mnHLtAy1celGpZgzaTP0n0eBzlDwq/3qGwSj68WJral8sZgYRJ FcWA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version 
:references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=5yGW+7zC7kl+eEIUhoSp3TXjUTFyRBrnT16mbCZw/Uk=; fh=QaTITROV58yX3T7vzUTPzrTaIKGOlfNszXLwxhLdN5M=; b=KeYF4LXID2kezpO7DVM+Dm7AsO4aa91ep7m3c+1mUY4nwuB7IFjPPQIRT6Rmxw4z7H PlTbQwd5/n14VWUI9fgjeJeXCMk5EnMWmNru7IxQHt2XRtMrMfRfEDeC02MxG6Gyhx1v iaGJU1wTCzJtlQaCC4P1OnDFAlEpjU39FOxrolASF1O+rJaAaAs2K8tiOFdySO9mR8FE fBCK66CAfwQmYMVUMx5oAiX3jmgDPBT2DtkxKLW/BCPZujhtbT6nf7zMP7RDEId0g3Pc kZdSzZBtJsQ/aP1rLhH9vA5nAUJhaOytd2chrdEde+Oqez0CgoQYIvQQqmi3jehnM5Uw NzWQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=HQLKbHPj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from lipwig.vger.email (lipwig.vger.email. [2620:137:e000::3:3]) by mx.google.com with ESMTPS id h12-20020a4ad28c000000b0058cbeb32894si1525503oos.90.2023.11.24.08.20.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 08:20:41 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) client-ip=2620:137:e000::3:3; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=HQLKbHPj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id B8B2B832D4C2; Fri, 24 Nov 2023 08:20:36 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230367AbjKXQU1 (ORCPT + 99 others); 
Fri, 24 Nov 2023 11:20:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54574 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229742AbjKXQU0 (ORCPT ); Fri, 24 Nov 2023 11:20:26 -0500 Received: from out-184.mta0.migadu.com (out-184.mta0.migadu.com [IPv6:2001:41d0:1004:224b::b8]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 95EB712B for ; Fri, 24 Nov 2023 08:20:31 -0800 (PST) X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700841909; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5yGW+7zC7kl+eEIUhoSp3TXjUTFyRBrnT16mbCZw/Uk=; b=HQLKbHPjG5ECRX3W2QDUsBCOVgOd4zNCuHfcIt73YpGytD2dwouao5IYeV1HGmFHYs1AgF BfFqDeoybgM1Excty5fp0fBgzf+Hikw9ySRxFeccUON0VIoABOKMK6vYDwcoolLMOYnZMJ COMNmajOOBR72yHoy7i9ELgNscr7gJg= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, gregkh@linuxfoundation.org, arnd@arndb.de, christian.koenig@amd.com, yi.l.liu@intel.com, jirislaby@kernel.org, stfrench@microsoft.com, jpanis@baylibre.com, jgg@ziepe.ca, contact@emersion.fr, dchinner@redhat.com, jack@suse.cz, linux@weissschuh.net, min15.li@samsung.com, dlemoal@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa , Donald Buczek Subject: [PATCH v6 04/11] blksnap: header file of the module interface Date: Fri, 24 Nov 2023 17:04:52 +0100 Message-Id: 
<20231124160459.26227-5-sergei.shtepa@linux.dev> In-Reply-To: <20231124160459.26227-1-sergei.shtepa@linux.dev> References: <20231124160459.26227-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 08:20:36 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783462983566784648 X-GMAIL-MSGID: 1783462983566784648 From: Sergei Shtepa The header file contains a set of declarations, structures and control requests (ioctl) that allows to manage the module from the user space. Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Tested-by: Donald Buczek Signed-off-by: Sergei Shtepa --- .../userspace-api/ioctl/ioctl-number.rst | 1 + MAINTAINERS | 1 + include/uapi/linux/blksnap.h | 388 ++++++++++++++++++ 3 files changed, 390 insertions(+) create mode 100644 include/uapi/linux/blksnap.h diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 4ea5b837399a..81acae1b1859 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -203,6 +203,7 @@ Code Seq# Include File Comments 'V' C0 linux/ivtvfb.h conflict! 'V' C0 linux/ivtv.h conflict! 'V' C0 media/si4713.h conflict! +'V' 00-1F uapi/linux/blksnap.h conflict! 'W' 00-1F linux/watchdog.h conflict! 'W' 00-1F linux/wanrouter.h conflict! (pre 3.9) 'W' 00-3F sound/asound.h conflict! 
diff --git a/MAINTAINERS b/MAINTAINERS index 9c81e4c83139..9770c4d4b15d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3598,6 +3598,7 @@ M: Sergei Shtepa L: linux-block@vger.kernel.org S: Supported F: Documentation/block/blksnap.rst +F: include/uapi/linux/blksnap.h BLOCK LAYER M: Jens Axboe diff --git a/include/uapi/linux/blksnap.h b/include/uapi/linux/blksnap.h new file mode 100644 index 000000000000..be1474f2025c --- /dev/null +++ b/include/uapi/linux/blksnap.h @@ -0,0 +1,388 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef _UAPI_LINUX_BLKSNAP_H +#define _UAPI_LINUX_BLKSNAP_H + +#include + +#define BLKSNAP_CTL "blksnap-control" +#define BLKSNAP_IMAGE_NAME "blksnap-image" +#define BLKSNAP 'V' + +/** + * DOC: Block device filter interface. + * + * Control commands that are transmitted through the block device filter + * interface. + */ + +/** + * enum blkfilter_ctl_blksnap - List of commands for BLKFILTER_CTL ioctl + * + * @blkfilter_ctl_blksnap_cbtinfo: + * Get CBT information. + * The result of executing the command is a &struct blksnap_cbtinfo. + * Return 0 if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_cbtmap: + * Read the CBT map. + * The option passes the &struct blksnap_cbtmap. + * The size of the table can be quite large. Thus, the table is read in + * a loop, in each cycle of which the next offset is set to + * &blksnap_cbtmap.offset. + * Return a count of bytes read if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_cbtdirty: + * Set dirty blocks in the CBT map. + * The option passes the &struct blksnap_cbtdirty. + * There are cases when some blocks need to be marked as changed. + * This command makes that possible. + * Return 0 if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_snapshotadd: + * Add device to snapshot. + * The option passes the &struct blksnap_snapshotadd. 
+ * Return 0 if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_snapshotinfo: + * Get information about the snapshot. + * The result of executing the command is a &struct blksnap_snapshotinfo. + * Return 0 if succeeded, negative errno otherwise. + */ +enum blkfilter_ctl_blksnap { + blkfilter_ctl_blksnap_cbtinfo, + blkfilter_ctl_blksnap_cbtmap, + blkfilter_ctl_blksnap_cbtdirty, + blkfilter_ctl_blksnap_snapshotadd, + blkfilter_ctl_blksnap_snapshotinfo, +}; + +#ifndef UUID_SIZE +#define UUID_SIZE 16 +#endif + +/** + * struct blksnap_uuid - Unique 16-byte identifier. + * + * @b: + * An array of 16 bytes. + */ +struct blksnap_uuid { + __u8 b[UUID_SIZE]; +}; + +/** + * struct blksnap_cbtinfo - Result for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtinfo. + * + * @device_capacity: + * Device capacity in bytes. + * @block_size: + * Block size in bytes. + * @block_count: + * Number of blocks. + * @generation_id: + * Unique identifier of change tracking generation. + * @changes_number: + * Current changes number. + */ +struct blksnap_cbtinfo { + __u64 device_capacity; + __u32 block_size; + __u32 block_count; + struct blksnap_uuid generation_id; + __u8 changes_number; +}; + +/** + * struct blksnap_cbtmap - Option for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtmap. + * + * @offset: + * Offset from the beginning of the CBT bitmap in bytes. + * @length: + * Size of @buffer in bytes. + * @buffer: + * Pointer to the buffer for output. + */ +struct blksnap_cbtmap { + __u32 offset; + __u32 length; + __u64 buffer; +}; + +/** + * struct blksnap_sectors - Description of the block device region. + * + * @offset: + * Offset from the beginning of the disk in sectors. + * @count: + * Count of sectors. + */ +struct blksnap_sectors { + __u64 offset; + __u64 count; +}; + +/** + * struct blksnap_cbtdirty - Option for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtdirty. 
+ * + * @count: + * Count of elements in the @dirty_sectors. + * @dirty_sectors: + * Pointer to the array of &struct blksnap_sectors. + */ +struct blksnap_cbtdirty { + __u32 count; + __u64 dirty_sectors; +}; + +/** + * struct blksnap_snapshotadd - Option for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotadd. + * + * @id: + * ID of the snapshot to which the block device should be added. + */ +struct blksnap_snapshotadd { + struct blksnap_uuid id; +}; + +#define IMAGE_DISK_NAME_LEN 32 + +/** + * struct blksnap_snapshotinfo - Result for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotinfo. + * + * @error_code: + * Zero if there were no errors while holding the snapshot. + * The error code -ENOSPC means that while holding the snapshot, a snapshot + * overflow situation has occurred. Other error codes mean other reasons + * for failure. + * The error code is reset when the device is added to a new snapshot. + * @image: + * If the snapshot was taken, it stores the block device name of the + * image, or empty string otherwise. + */ +struct blksnap_snapshotinfo { + __s32 error_code; + __u8 image[IMAGE_DISK_NAME_LEN]; +}; + +/** + * DOC: Interface for managing snapshots + * + * Control commands that are transmitted through the blksnap module interface. + */ +enum blksnap_ioctl { + blksnap_ioctl_version, + blksnap_ioctl_snapshot_create, + blksnap_ioctl_snapshot_destroy, + blksnap_ioctl_snapshot_take, + blksnap_ioctl_snapshot_collect, + blksnap_ioctl_snapshot_wait_event, +}; + +/** + * struct blksnap_version - Module version. + * + * @major: + * Version major part. + * @minor: + * Version minor part. + * @revision: + * Revision number. + * @build: + * Build number. Should be zero. + */ +struct blksnap_version { + __u16 major; + __u16 minor; + __u16 revision; + __u16 build; +}; + +/** + * define IOCTL_BLKSNAP_VERSION - Get module version. + * + * The version may increase when the API changes. 
But linking the user space + * behavior to the version code does not seem to be a good idea. + * To ensure backward compatibility, API changes should be made by adding new + * ioctls without changing the behavior of existing ones. The version should be + * used for logs. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_VERSION \ + _IOR(BLKSNAP, blksnap_ioctl_version, struct blksnap_version) + +/** + * struct blksnap_snapshot_create - Argument for the + * &IOCTL_BLKSNAP_SNAPSHOT_CREATE control. + * + * @diff_storage_limit_sect: + * The maximum allowed difference storage size in sectors. + * @diff_storage_fd: + * The difference storage file descriptor. + * @id: + * Newly generated snapshot ID. + */ +struct blksnap_snapshot_create { + __u64 diff_storage_limit_sect; + __u32 diff_storage_fd; + struct blksnap_uuid id; +}; + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_CREATE - Create snapshot. + * + * Creates a snapshot structure and initializes the difference storage. + * A snapshot is created for several block devices at once. Several snapshots + * can be created at the same time, but with the condition that one block + * device can only be included in one snapshot. + * + * The difference storage can be dynamically increased as it fills up. + * The file is increased in portions, the size of which is determined by the + * module parameter &diff_storage_minimum. Each time the amount of free space + * in the difference storage is reduced to half of &diff_storage_minimum, + * the file is expanded by a portion, until it reaches the allowable limit + * &diff_storage_limit_sect. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_CREATE \ + _IOWR(BLKSNAP, blksnap_ioctl_snapshot_create, \ + struct blksnap_snapshot_create) + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_DESTROY - Release and destroy the snapshot. + * + * Destroys the snapshot whose ID is passed as a &struct blksnap_uuid. 
This leads to the + * deletion of all block device images of the snapshot. The difference storage + * is released, but the change tracker keeps tracking. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_DESTROY \ + _IOW(BLKSNAP, blksnap_ioctl_snapshot_destroy, \ + struct blksnap_uuid) + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_TAKE - Take snapshot. + * + * Creates snapshot images of block devices and switches change tracker tables. + * The snapshot must be created before this call, and the areas of block + * devices should be added to the difference storage. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_TAKE \ + _IOW(BLKSNAP, blksnap_ioctl_snapshot_take, \ + struct blksnap_uuid) + +/** + * struct blksnap_snapshot_collect - Argument for the + * &IOCTL_BLKSNAP_SNAPSHOT_COLLECT control. + * + * @count: + * Size of &blksnap_snapshot_collect.ids as a number of 16-byte UUIDs. + * @ids: + * Pointer to the array of &struct blksnap_uuid for output. + */ +struct blksnap_snapshot_collect { + __u32 count; + __u64 ids; +}; + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_COLLECT - Get collection of created snapshots. + * + * Multiple snapshots can be created at the same time. This allows for one + * system to create backups for different data with independent schedules. + * + * If &blksnap_snapshot_collect.count is less than required to store the + * &blksnap_snapshot_collect.ids, the array is not filled, and the ioctl + * returns the required count for &blksnap_snapshot_collect.ids. + * + * So, it is recommended to call the ioctl twice. The first call passes a null + * pointer in &blksnap_snapshot_collect.ids and a zero value in + * &blksnap_snapshot_collect.count. It will set the required array size in + * &blksnap_snapshot_collect.count. 
The second call with a pointer + * &blksnap_snapshot_collect.ids to an array of the required size returns the + * collection of active snapshots. + * + * Return: 0 if succeeded, -ENODATA if there is not enough space in the array + * to store collection of active snapshots, or negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_COLLECT \ + _IOR(BLKSNAP, blksnap_ioctl_snapshot_collect, \ + struct blksnap_snapshot_collect) + +/** + * enum blksnap_event_codes - Variants of event codes. + * + * @blksnap_event_code_corrupted: + * Snapshot image is corrupted event. + * If a chunk could not be allocated when trying to save data to the + * difference storage, this event is generated. However, this does not mean + * that the backup process was interrupted with an error. If the snapshot + * image has been read to the end by this time, the backup process is + * considered successful. + */ +enum blksnap_event_codes { + blksnap_event_code_corrupted, +}; + +/** + * struct blksnap_snapshot_event - Argument for the + * &IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT control. + * + * @id: + * Snapshot ID. + * @timeout_ms: + * Timeout for waiting in milliseconds. + * @time_label: + * Timestamp of the received event. + * @code: + * Code of the received event &enum blksnap_event_codes. + * @data: + * The received event body. + */ +struct blksnap_snapshot_event { + struct blksnap_uuid id; + __u32 timeout_ms; + __u32 code; + __s64 time_label; + __u8 data[4096 - 32]; +}; + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT - Wait and get the event from the + * snapshot. + * + * While holding the snapshot, the kernel module can transmit information about + * changes in its state in the form of events to the user level. + * It is very important to receive these events as quickly as possible, so the + * user's thread is in the state of interruptible sleep. + * + * Return: 0 if succeeded, negative errno otherwise. 
+ */ +#define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT \ + _IOR(BLKSNAP, blksnap_ioctl_snapshot_wait_event, \ + struct blksnap_snapshot_event) + +/** + * struct blksnap_event_corrupted - Data for the + * &blksnap_event_code_corrupted event. + * + * @dev_id_mj: + * Major part of original device ID. + * @dev_id_mn: + * Minor part of original device ID. + * @err_code: + * Error code. + */ +struct blksnap_event_corrupted { + __u32 dev_id_mj; + __u32 dev_id_mn; + __s32 err_code; +}; + +#endif /* _UAPI_LINUX_BLKSNAP_H */
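The data[4096 - 32] member of struct blksnap_snapshot_event suggests that an event record is meant to occupy exactly 4096 bytes, with a 32-byte fixed part (id, timeout_ms, code, time_label) ahead of the event body. The following is a minimal user-space sketch that checks this at compile time and copies the body of a corrupted event into the layout of struct blksnap_event_corrupted. It is not part of the patch: the *_mirror types, EVENT_CODE_CORRUPTED_MIRROR and decode_corrupted_mirror() are hypothetical names, and <stdint.h> types stand in for __u8/__u32/__s32/__s64 so that it builds without the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirrors of the structures declared in the patch. */
struct uuid_mirror {
	uint8_t b[16];
};

struct snapshot_event_mirror {
	struct uuid_mirror id;    /* bytes 0..15 */
	uint32_t timeout_ms;      /* bytes 16..19 */
	uint32_t code;            /* bytes 20..23 */
	int64_t time_label;       /* bytes 24..31 */
	uint8_t data[4096 - 32];  /* bytes 32..4095 */
};

struct event_corrupted_mirror {
	uint32_t dev_id_mj;
	uint32_t dev_id_mn;
	int32_t err_code;
};

/* Mirrors blksnap_event_code_corrupted, the first (zero) enumerator. */
enum { EVENT_CODE_CORRUPTED_MIRROR = 0 };

/* The fixed part is 32 bytes, so the whole record is 4096 bytes. */
static_assert(offsetof(struct snapshot_event_mirror, data) == 32,
	      "event body is expected to start at byte 32");
static_assert(sizeof(struct snapshot_event_mirror) == 4096,
	      "event record is expected to be 4096 bytes");

/*
 * Copy the event body into *out if the event code says it is a
 * "corrupted" event. Returns 0 on success, -1 for any other code.
 */
static int decode_corrupted_mirror(const struct snapshot_event_mirror *ev,
				   struct event_corrupted_mirror *out)
{
	if (ev->code != EVENT_CODE_CORRUPTED_MIRROR)
		return -1;
	memcpy(out, ev->data, sizeof(*out));
	return 0;
}
```

With the real header, the same copy would presumably be done on the struct blksnap_snapshot_event filled in by a successful IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT call. memcpy() is used instead of a pointer cast so that the sketch does not depend on the alignment of data.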