From patchwork Fri Nov 24 16:59:23 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sergei Shtepa
X-Patchwork-Id: 169520
From: Sergei Shtepa
To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
 viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Sergei Shtepa
Subject: [PATCH v6 01/11] documentation: Block Device Filtering Mechanism
Date: Fri, 24 Nov 2023 17:59:23 +0100
Message-Id: <20231124165933.27580-2-sergei.shtepa@linux.dev>
In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev>
References: <20231124165933.27580-1-sergei.shtepa@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

From: Sergei Shtepa

The document contains:
* A description of the purpose of the mechanism
* A little historical background on the capabilities of handling I/O units of the
  Linux kernel
* Brief description of the design
* Reference to interface description

Signed-off-by: Sergei Shtepa
---
 Documentation/block/blkfilter.rst | 66 +++++++++++++++++++++++++++++++
 Documentation/block/index.rst     |  1 +
 MAINTAINERS                       |  6 +++
 3 files changed, 73 insertions(+)
 create mode 100644 Documentation/block/blkfilter.rst

diff --git a/Documentation/block/blkfilter.rst b/Documentation/block/blkfilter.rst
new file mode 100644
index 000000000000..4e148e78f3d4
--- /dev/null
+++ b/Documentation/block/blkfilter.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Block Device Filtering Mechanism
+================================
+
+The block device filtering mechanism provides the ability to attach block
+device filters. Block device filters allow performing additional processing
+for I/O units.
+
+Introduction
+============
+
+The idea of handling I/O units on block devices is not new. Back in the
+2.6 kernel, there was an undocumented possibility of handling I/O units
+by substituting the make_request_fn() function, which belonged to the
+request_queue structure. But none of the in-tree kernel modules used this
+feature, and it was eliminated in the 5.10 kernel.
+
+The block device filtering mechanism restores the ability to handle I/O units.
+It is possible to safely attach a filter to a block device "on the fly" without
+changing the structure of the block device's stack.
+
+Only one filter can be attached to a block device at a time, because there is
+only one filter implementation in the kernel so far.
+See Documentation/block/blksnap.rst.
+
+Design
+======
+
+The block device filtering mechanism provides registration and unregistration
+for filter operations. The struct blkfilter_operations contains pointers to
+the callback functions for the filter. After registering the filter operations,
+the filter can be managed using block device ioctls BLKFILTER_ATTACH,
+BLKFILTER_DETACH and BLKFILTER_CTL.
+
+When the filter is attached, the callback function is called for each I/O unit
+for a block device, providing I/O unit filtering. Depending on the result of
+filtering the I/O unit, it can either be passed for subsequent processing by
+the block layer, or skipped.
+
+The filter can be implemented as a loadable module. In this case, the filter
+module cannot be unloaded while the filter is attached to at least one of the
+block devices.
+
+Interface description
+=====================
+
+The ioctl BLKFILTER_ATTACH allows user-space programs to attach a block device
+filter to a block device. The ioctl BLKFILTER_DETACH allows user-space programs
+to detach it. Both ioctls use &struct blkfilter_name. The ioctl BLKFILTER_CTL
+allows user-space programs to send a filter-specific command. It uses &struct
+blkfilter_ctl.
+
+.. kernel-doc:: include/uapi/linux/blk-filter.h
+
+To register in the system, the filter uses the &struct blkfilter_operations,
+which contains callback functions, unique filter name and module owner. When
+attaching a filter to a block device, the filter creates a &struct blkfilter.
+The pointer to the &struct blkfilter allows the filter to determine for which
+block device the callback functions are being called.
+
+.. kernel-doc:: include/linux/blk-filter.h
+
+.. kernel-doc:: block/blk-filter.c
+   :export:
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 9fea696f9daa..e9712f72cd6d 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -10,6 +10,7 @@ Block
    bfq-iosched
    biovecs
    blk-mq
+   blkfilter
    cmdline-partition
    data-integrity
    deadline-iosched
diff --git a/MAINTAINERS b/MAINTAINERS
index 97f51d5ec1cf..c20cbec81b58 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3584,6 +3584,12 @@ M: Jan-Simon Moeller
 S:	Maintained
 F:	drivers/leds/leds-blinkm.c
 
+BLOCK DEVICE FILTERING MECHANISM
+M:	Sergei Shtepa
+L:	linux-block@vger.kernel.org
+S:	Supported
+F:	Documentation/block/blkfilter.rst
+
 BLOCK LAYER
 M:	Jens Axboe
 L:	linux-block@vger.kernel.org

From patchwork Fri Nov 24 16:59:24 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sergei Shtepa
X-Patchwork-Id: 169522
From: Sergei Shtepa
To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
 viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Sergei Shtepa, Donald Buczek, Fabio Fantoni
Subject: [PATCH v6 02/11] block: Block Device Filtering Mechanism
Date: Fri, 24 Nov 2023 17:59:24 +0100
Message-Id: <20231124165933.27580-3-sergei.shtepa@linux.dev>
In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev>
References: <20231124165933.27580-1-sergei.shtepa@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

From: Sergei Shtepa

The block device filtering mechanism is an API that allows attaching block
device filters. Block device filters allow performing additional processing
of I/O units.

The idea of handling I/O units on block devices is not new. Back in the
2.6 kernel, there was an undocumented possibility of handling I/O units
by substituting the make_request_fn() function, which belonged to the
request_queue structure. But none of the in-tree kernel modules used this
feature, and it was eliminated in the 5.10 kernel.

The block device filtering mechanism restores the ability to handle I/O
units. It is possible to safely attach a filter to a block device "on the
fly" without changing the structure of the block device stack.

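For illustration only, a user-space program could drive the BLKFILTER_ATTACH
and BLKFILTER_DETACH ioctls roughly as sketched here. This is not part of the
patch. The structure layout and ioctl numbers are copied from
include/uapi/linux/blk-filter.h and include/uapi/linux/fs.h in this series;
the helper names blkfilter_name_set() and blkfilter_toggle() are hypothetical,
and the ioctl can only succeed on a kernel that carries this series and has a
filter with the requested name registered.

```c
/* Sketch under stated assumptions: definitions are copied from this series
 * because an installed <linux/blk-filter.h> is not assumed to exist. */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BLKFILTER_NAME_LENGTH 32

struct blkfilter_name {
	uint8_t name[BLKFILTER_NAME_LENGTH];
};

#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)

/* The kernel side copies exactly sizeof(struct blkfilter_name) bytes. */
static_assert(sizeof(struct blkfilter_name) == BLKFILTER_NAME_LENGTH,
	      "unexpected padding in struct blkfilter_name");

/*
 * Hypothetical helper: store a filter name in the fixed-size field.
 * Names that do not leave room for a terminating NUL are rejected.
 * Returns 0 on success, -1 if the name is too long.
 */
int blkfilter_name_set(struct blkfilter_name *n, const char *filter)
{
	size_t len = strlen(filter);

	if (len >= BLKFILTER_NAME_LENGTH)
		return -1;
	memset(n, 0, sizeof(*n));
	memcpy(n->name, filter, len);
	return 0;
}

/*
 * Hypothetical helper: attach (attach != 0) or detach (attach == 0) the
 * named filter on the block device at path dev.
 * Returns 0 on success, -1 on failure; errno is set by open() or ioctl().
 */
int blkfilter_toggle(const char *dev, const char *filter, int attach)
{
	struct blkfilter_name n;
	int fd, ret;

	if (blkfilter_name_set(&n, filter))
		return -1;

	fd = open(dev, O_RDWR);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, attach ? BLKFILTER_ATTACH : BLKFILTER_DETACH, &n);
	close(fd);
	return ret;
}
```

A caller might use blkfilter_toggle("/dev/sda", "blksnap", 1) to request an
attach; both the device path and the filter name "blksnap" are assumptions
made for this example. Judging from blkfilter_ioctl_attach() in this patch, a
failed attach would be expected to report ENOENT for an unknown filter name,
EALREADY if the same filter is already attached, and EBUSY if another filter
holds the device.
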
Co-developed-by: Christoph Hellwig
Signed-off-by: Christoph Hellwig
Tested-by: Donald Buczek
Tested-by: Fabio Fantoni
Signed-off-by: Sergei Shtepa
---
 MAINTAINERS                     |   3 +
 block/Makefile                  |   3 +-
 block/bdev.c                    |   2 +
 block/blk-core.c                |  35 ++++-
 block/blk-filter.c              | 238 ++++++++++++++++++++++++++++++++
 block/blk.h                     |  11 ++
 block/genhd.c                   |  10 ++
 block/ioctl.c                   |   7 +
 block/partitions/core.c         |   9 ++
 include/linux/blk-filter.h      |  51 +++++++
 include/linux/blk_types.h       |   1 +
 include/linux/blkdev.h          |   1 +
 include/linux/sched.h           |   1 +
 include/uapi/linux/blk-filter.h |  35 +++++
 include/uapi/linux/fs.h         |   3 +
 15 files changed, 408 insertions(+), 2 deletions(-)
 create mode 100644 block/blk-filter.c
 create mode 100644 include/linux/blk-filter.h
 create mode 100644 include/uapi/linux/blk-filter.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c20cbec81b58..ef90cd0fec9c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3589,6 +3589,9 @@ M: Sergei Shtepa
 L:	linux-block@vger.kernel.org
 S:	Supported
 F:	Documentation/block/blkfilter.rst
+F:	block/blk-filter.c
+F:	include/linux/blk-filter.h
+F:	include/uapi/linux/blk-filter.h
 
 BLOCK LAYER
 M:	Jens Axboe
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..041c54eb0240 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,8 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
 			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
-			disk-events.o blk-ia-ranges.o early-lookup.o
+			disk-events.o blk-ia-ranges.o early-lookup.o \
+			blk-filter.o
 
 obj-$(CONFIG_BOUNCE)		+= bounce.o
 obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
diff --git a/block/bdev.c b/block/bdev.c
index e4cfb7adb645..6039d99b3a75 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -412,6 +412,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 		return NULL;
 	}
 	bdev->bd_disk = disk;
+	bdev->bd_filter = NULL;
 	return bdev;
 }
 
@@ -1018,6 +1019,7 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
 	}
 
 	invalidate_bdev(bdev);
+	blkfilter_detach(bdev);
 }
 /*
  * New drivers should not use this directly. There are some drivers however
diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..1de74240892a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -592,12 +593,34 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 
 static void __submit_bio(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	bool skip_bio = false;
+
+	if (unlikely(bio_queue_enter(bio)))
+		return;
+
+	if (bio->bi_bdev->bd_filter &&
+	    bio->bi_bdev->bd_filter != current->blk_filter) {
+		struct blkfilter *prev = current->blk_filter;
+
+		current->blk_filter = bio->bi_bdev->bd_filter;
+		skip_bio = bio->bi_bdev->bd_filter->ops->submit_bio(bio);
+		current->blk_filter = prev;
+	}
+
+	blk_queue_exit(q);
+	if (skip_bio)
+		return;
+
 	if (unlikely(!blk_crypto_bio_prep(&bio)))
 		return;
 
 	if (!bio->bi_bdev->bd_has_submit_bio) {
 		blk_mq_submit_bio(bio);
-	} else if (likely(bio_queue_enter(bio) == 0)) {
+		return;
+	}
+
+	if (likely(bio_queue_enter(bio) == 0)) {
 		struct gendisk *disk = bio->bi_bdev->bd_disk;
 
 		disk->fops->submit_bio(bio);
@@ -681,6 +704,15 @@ static void __submit_bio_noacct_mq(struct bio *bio)
 	current->bio_list = NULL;
 }
 
+/**
+ * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O
+ *	from block device filter.
+ * @bio:  The bio describing the location in memory and on the device.
+ *
+ * This is a version of submit_bio() that shall only be used for I/O that is
+ * resubmitted to lower level by block device filters. All file systems and
+ * other upper level users of the block layer should use submit_bio() instead.
+ */
 void submit_bio_noacct_nocheck(struct bio *bio)
 {
 	blk_cgroup_bio_start(bio);
@@ -708,6 +740,7 @@ void submit_bio_noacct_nocheck(struct bio *bio)
 	else
 		__submit_bio_noacct(bio);
 }
+EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck);
 
 /**
  * submit_bio_noacct - re-submit a bio to the block device layer for I/O
diff --git a/block/blk-filter.c b/block/blk-filter.c
new file mode 100644
index 000000000000..8e2550bed0c5
--- /dev/null
+++ b/block/blk-filter.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#include
+#include
+#include
+
+#include "blk.h"
+
+static LIST_HEAD(blkfilters);
+static DEFINE_SPINLOCK(blkfilters_lock);
+
+static inline struct blkfilter_operations *__blkfilter_find(const char *name)
+{
+	struct blkfilter_operations *ops;
+
+	list_for_each_entry(ops, &blkfilters, link)
+		if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0)
+			return ops;
+
+	return NULL;
+}
+
+static inline struct blkfilter_operations *blkfilter_find_get(const char *name)
+{
+	struct blkfilter_operations *ops;
+
+	spin_lock(&blkfilters_lock);
+	ops = __blkfilter_find(name);
+	if (ops && !try_module_get(ops->owner))
+		ops = NULL;
+	spin_unlock(&blkfilters_lock);
+
+	return ops;
+}
+
+static inline void blkfilter_put(const struct blkfilter_operations *ops)
+{
+	module_put(ops->owner);
+}
+
+int blkfilter_ioctl_attach(struct block_device *bdev,
+		    struct blkfilter_name __user *argp)
+{
+	struct blkfilter_name name;
+	struct blkfilter_operations *ops;
+	struct blkfilter *flt;
+	int ret;
+
+	if (copy_from_user(&name, argp, sizeof(name)))
+		return -EFAULT;
+
+	ops = blkfilter_find_get(name.name);
+	if (!ops)
+		return -ENOENT;
+
+	mutex_lock(&bdev->bd_disk->open_mutex);
+	if (!disk_live(bdev->bd_disk)) {
+		ret = -ENODEV;
+		goto out_mutex_unlock;
+	}
+	ret = freeze_bdev(bdev);
+	if (ret)
+		goto out_mutex_unlock;
+	blk_mq_freeze_queue(bdev->bd_queue);
+
+	if (bdev->bd_filter) {
+		if (bdev->bd_filter->ops == ops)
+			ret = -EALREADY;
+		else
+			ret = -EBUSY;
+		goto out_unfreeze;
+	}
+
+	flt = ops->attach(bdev);
+	if (IS_ERR(flt)) {
+		ret = PTR_ERR(flt);
+		goto out_unfreeze;
+	}
+
+	flt->ops = ops;
+	bdev->bd_filter = flt;
+
+out_unfreeze:
+	blk_mq_unfreeze_queue(bdev->bd_queue);
+	thaw_bdev(bdev);
+out_mutex_unlock:
+	mutex_unlock(&bdev->bd_disk->open_mutex);
+	if (ret)
+		blkfilter_put(ops);
+	return ret;
+}
+
+static void __blkfilter_detach(struct block_device *bdev)
+{
+	struct blkfilter *flt = bdev->bd_filter;
+	const struct blkfilter_operations *ops = flt->ops;
+
+	bdev->bd_filter = NULL;
+	ops->detach(flt);
+	blkfilter_put(ops);
+}
+
+void blkfilter_detach(struct block_device *bdev)
+{
+	if (bdev->bd_filter) {
+		blk_mq_freeze_queue(bdev->bd_queue);
+		__blkfilter_detach(bdev);
+		blk_mq_unfreeze_queue(bdev->bd_queue);
+	}
+}
+
+int blkfilter_ioctl_detach(struct block_device *bdev,
+		    struct blkfilter_name __user *argp)
+{
+	struct blkfilter_name name;
+	int ret = 0;
+
+	if (copy_from_user(&name, argp, sizeof(name)))
+		return -EFAULT;
+
+	mutex_lock(&bdev->bd_disk->open_mutex);
+	if (!disk_live(bdev->bd_disk)) {
+		ret = -ENODEV;
+		goto out_mutex_unlock;
+	}
+	blk_mq_freeze_queue(bdev->bd_queue);
+	if (!bdev->bd_filter) {
+		ret = -ENOENT;
+		goto out_unfreeze;
+	}
+	if (strncmp(bdev->bd_filter->ops->name, name.name,
+			BLKFILTER_NAME_LENGTH)) {
+		ret = -EINVAL;
+		goto out_unfreeze;
+	}
+
+	__blkfilter_detach(bdev);
+out_unfreeze:
+	blk_mq_unfreeze_queue(bdev->bd_queue);
+out_mutex_unlock:
+	mutex_unlock(&bdev->bd_disk->open_mutex);
+	return ret;
+}
+
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+		    struct blkfilter_ctl __user *argp)
+{
+	struct blkfilter_ctl ctl;
+	struct blkfilter *flt;
+	int ret;
+
+	if (copy_from_user(&ctl, argp, sizeof(ctl)))
+		return -EFAULT;
+
+	mutex_lock(&bdev->bd_disk->open_mutex);
+	if (!disk_live(bdev->bd_disk)) {
+		ret = -ENODEV;
+		goto out_mutex_unlock;
+	}
+	ret = blk_queue_enter(bdev_get_queue(bdev), 0);
+	if (ret)
+		goto out_mutex_unlock;
+
+	flt = bdev->bd_filter;
+	if (!flt || strncmp(flt->ops->name, ctl.name, BLKFILTER_NAME_LENGTH)) {
+		ret = -ENOENT;
+		goto out_queue_exit;
+	}
+
+	if (!flt->ops->ctl) {
+		ret = -ENOTTY;
+		goto out_queue_exit;
+	}
+
+	ret = flt->ops->ctl(flt, ctl.cmd, u64_to_user_ptr(ctl.opt),
+			    &ctl.optlen);
+out_queue_exit:
+	blk_queue_exit(bdev_get_queue(bdev));
+out_mutex_unlock:
+	mutex_unlock(&bdev->bd_disk->open_mutex);
+	return ret;
+}
+
+ssize_t blkfilter_show(struct block_device *bdev, char *buf)
+{
+	ssize_t ret = 0;
+
+	blk_mq_freeze_queue(bdev->bd_queue);
+	if (bdev->bd_filter)
+		ret = sprintf(buf, "%s\n", bdev->bd_filter->ops->name);
+	else
+		ret = sprintf(buf, "\n");
+	blk_mq_unfreeze_queue(bdev->bd_queue);
+
+	return ret;
+}
+
+/**
+ * blkfilter_register() - Register block device filter operations
+ * @ops:	The operations to register.
+ *
+ * Return:
+ *	0 if succeeded,
+ *	-EBUSY if a block device filter with the same name is already
+ *	registered.
+ */
+int blkfilter_register(struct blkfilter_operations *ops)
+{
+	struct blkfilter_operations *found;
+	int ret = 0;
+
+	spin_lock(&blkfilters_lock);
+	found = __blkfilter_find(ops->name);
+	if (found)
+		ret = -EBUSY;
+	else
+		list_add_tail(&ops->link, &blkfilters);
+	spin_unlock(&blkfilters_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkfilter_register);
+
+/**
+ * blkfilter_unregister() - Unregister block device filter operations
+ * @ops:	The operations to unregister.
+ *
+ * Important: before unloading, it is necessary to detach the filter from all
+ * block devices.
+ *
+ */
+void blkfilter_unregister(struct blkfilter_operations *ops)
+{
+	spin_lock(&blkfilters_lock);
+	list_del(&ops->link);
+	spin_unlock(&blkfilters_lock);
+}
+EXPORT_SYMBOL_GPL(blkfilter_unregister);
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..1f104f4865c3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,8 @@
 #include
 #include "blk-crypto-internal.h"
 
+struct blkfilter_ctl;
+struct blkfilter_name;
 struct elevator_type;
 
 /* Max future timer expiry for timeouts */
@@ -474,6 +476,15 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
 extern const struct address_space_operations def_blk_aops;
 
+int blkfilter_ioctl_attach(struct block_device *bdev,
+		    struct blkfilter_name __user *argp);
+int blkfilter_ioctl_detach(struct block_device *bdev,
+		    struct blkfilter_name __user *argp);
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+		    struct blkfilter_ctl __user *argp);
+void blkfilter_detach(struct block_device *bdev);
+ssize_t blkfilter_show(struct block_device *bdev, char *buf);
+
 int disk_register_independent_access_ranges(struct gendisk *disk);
 void disk_unregister_independent_access_ranges(struct gendisk *disk);
 
diff --git a/block/genhd.c b/block/genhd.c
index c9d06f72c587..ba744e3fd581 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 
 #include "blk-throttle.h"
 #include "blk.h"
@@ -654,6 +655,7 @@ void del_gendisk(struct gendisk *disk)
 	mutex_lock(&disk->open_mutex);
 	xa_for_each(&disk->part_tbl, idx, part)
 		remove_inode_hash(part->bd_inode);
+	blkfilter_detach(disk->part0);
 	mutex_unlock(&disk->open_mutex);
 
 	/*
@@ -1044,6 +1046,12 @@ static ssize_t diskseq_show(struct device *dev,
 	return sprintf(buf, "%llu\n", disk->diskseq);
 }
 
+static ssize_t disk_filter_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
 static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
 static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
 static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
@@ -1057,6 +1065,7 @@ static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
 static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
 static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
+static DEVICE_ATTR(filter, 0444, disk_filter_show, NULL);
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 ssize_t part_fail_show(struct device *dev,
@@ -1103,6 +1112,7 @@ static struct attribute *disk_attrs[] = {
 	&dev_attr_events_async.attr,
 	&dev_attr_events_poll_msecs.attr,
 	&dev_attr_diskseq.attr,
+	&dev_attr_filter.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
diff --git a/block/ioctl.c b/block/ioctl.c
index 4160f4e6bd5b..1b11303e213b 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -2,6 +2,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -572,6 +573,12 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
 		return blkdev_pr_preempt(bdev, mode, argp, true);
 	case IOC_PR_CLEAR:
 		return blkdev_pr_clear(bdev, mode, argp);
+	case BLKFILTER_ATTACH:
+		return blkfilter_ioctl_attach(bdev, argp);
+	case BLKFILTER_DETACH:
+		return blkfilter_ioctl_detach(bdev, argp);
+	case BLKFILTER_CTL:
+		return blkfilter_ioctl_ctl(bdev, argp);
 	default:
 		return -ENOIOCTLCMD;
 	}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index f47ffcfdfcec..19c69dc23d2c 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 #include "check.h"
 
 static int (*const check_part[])(struct parsed_partitions *) = {
@@ -200,6 +201,12 @@ static ssize_t part_discard_alignment_show(struct device *dev,
 	return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
 }
 
+static ssize_t part_filter_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
 static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
 static DEVICE_ATTR(start, 0444, part_start_show, NULL);
 static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -208,6 +215,7 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
 static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
 static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(filter, 0444, part_filter_show, NULL);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -222,6 +230,7 @@ static struct attribute *part_attrs[] = {
 	&dev_attr_discard_alignment.attr,
 	&dev_attr_stat.attr,
 	&dev_attr_inflight.attr,
+	&dev_attr_filter.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
new file mode 100644
index 000000000000..0afdb40f3bab
--- /dev/null
+++ b/include/linux/blk-filter.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _LINUX_BLK_FILTER_H
+#define _LINUX_BLK_FILTER_H
+
+#include
+
+struct bio;
+struct block_device;
+struct blkfilter_operations;
+
+/**
+ * struct blkfilter - Block device filter.
+ *
+ * @ops:	Block device filter operations.
+ *
+ * For each filtered block device, the filter creates a data structure
+ * associated with this device. The data in this structure is specific to the
+ * filter, but it must contain a pointer to the block device filter operations.
+ */
+struct blkfilter {
+	const struct blkfilter_operations *ops;
+};
+
+/**
+ * struct blkfilter_operations - Block device filter operations.
+ *
+ * @link:	Entry in the global list of filter drivers
+ *		(must not be accessed by the driver).
+ * @owner:	Module implementing the filter driver.
+ * @name:	Name of the filter driver.
+ * @attach: Attach the filter driver to the block device. + * @detach: Detach the filter driver from the block device. + * @ctl: Send a control command to the filter driver. + * @submit_bio: Handle bio submissions to the filter driver. + */ +struct blkfilter_operations { + struct list_head link; + struct module *owner; + const char *name; + struct blkfilter *(*attach)(struct block_device *bdev); + void (*detach)(struct blkfilter *flt); + int (*ctl)(struct blkfilter *flt, const unsigned int cmd, + __u8 __user *buf, __u32 *plen); + bool (*submit_bio)(struct bio *bio); +}; + +int blkfilter_register(struct blkfilter_operations *ops); +void blkfilter_unregister(struct blkfilter_operations *ops); + +#endif /* _UAPI_LINUX_BLK_FILTER_H */ diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index d5c5e59ddbd2..490865292fde 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -74,6 +74,7 @@ struct block_device { * path */ struct device bd_device; + struct blkfilter *bd_filter; } __randomize_layout; #define bdev_whole(_bdev) \ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 51fa7ffdee83..6a0754007d1d 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -834,6 +834,7 @@ void blk_request_module(dev_t devt); extern int blk_register_queue(struct gendisk *disk); extern void blk_unregister_queue(struct gendisk *disk); +void submit_bio_noacct_nocheck(struct bio *bio); void submit_bio_noacct(struct bio *bio); struct bio *bio_split_to_limits(struct bio *bio); diff --git a/include/linux/sched.h b/include/linux/sched.h index 292c31697248..e7c3cd490a80 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1190,6 +1190,7 @@ struct task_struct { /* Stack plugging: */ struct blk_plug *plug; + struct blkfilter *blk_filter; /* VM state: */ struct reclaim_state *reclaim_state; diff --git a/include/uapi/linux/blk-filter.h b/include/uapi/linux/blk-filter.h new file mode 100644 index 000000000000..18885dc1b717 --- 
/dev/null +++ b/include/uapi/linux/blk-filter.h @@ -0,0 +1,35 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef _UAPI_LINUX_BLK_FILTER_H +#define _UAPI_LINUX_BLK_FILTER_H + +#include + +#define BLKFILTER_NAME_LENGTH 32 + +/** + * struct blkfilter_name - parameter for BLKFILTER_ATTACH and BLKFILTER_DETACH + * ioctl. + * + * @name: Name of block device filter. + */ +struct blkfilter_name { + __u8 name[BLKFILTER_NAME_LENGTH]; +}; + +/** + * struct blkfilter_ctl - parameter for BLKFILTER_CTL ioctl + * + * @name: Name of block device filter. + * @cmd: The filter-specific operation code of the command. + * @optlen: Size of data at @opt. + * @opt: Userspace buffer with options. + */ +struct blkfilter_ctl { + __u8 name[BLKFILTER_NAME_LENGTH]; + __u32 cmd; + __u32 optlen; + __u64 opt; +}; + +#endif /* _UAPI_LINUX_BLK_FILTER_H */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index da43810b7485..f96809cd2f50 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -189,6 +189,9 @@ struct fsxattr { * A jump here: 130-136 are reserved for zoned block devices * (see uapi/linux/blkzoned.h) */ +#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name) +#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name) +#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl) #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ #define FIBMAP _IO(0x00,1) /* bmap access */
From patchwork Fri Nov 24 16:59:25 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169521 From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa, Bagas Sanjaya, Fabio Fantoni Subject: [PATCH v6 03/11] documentation: Block Devices Snapshots Module Date: Fri, 24 Nov 2023 17:59:25 +0100 Message-Id: <20231124165933.27580-4-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0
From: Sergei Shtepa The document contains: * Describes the purpose of the mechanism * Description of features * Description of
algorithms * Recommendations about using the module from the user-space side * Reference to module interface description Reviewed-by: Bagas Sanjaya Reviewed-by: Fabio Fantoni Signed-off-by: Sergei Shtepa --- Documentation/block/blksnap.rst | 352 ++++++++++++++++++++++++++++++++ Documentation/block/index.rst | 1 + MAINTAINERS | 6 + 3 files changed, 359 insertions(+) create mode 100644 Documentation/block/blksnap.rst diff --git a/Documentation/block/blksnap.rst b/Documentation/block/blksnap.rst new file mode 100644 index 000000000000..ef6010e46858 --- /dev/null +++ b/Documentation/block/blksnap.rst @@ -0,0 +1,352 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +Block Devices Snapshots Module (blksnap) +======================================== + +Introduction +============ + +At first glance, there is no novelty in the idea of creating snapshots for +block devices. The Linux kernel already has mechanisms for creating snapshots. +Device Mapper includes dm-snap, which creates snapshots of block +devices. BTRFS supports snapshots at the filesystem level. However, both of +these options have flaws that prevent their use as a universal tool for +creating backups. + +The main properties that a backup tool should have are: + +- Simplicity and universality of use +- Reliability +- Minimal consumption of system resources during backup +- Minimal time required for recovery or replication of the entire system + +Taking the above properties into account, the blksnap module features: + +- Change tracker +- Snapshots at the block device level +- Dynamic allocation of space for storing differences +- Snapshot overflow resistance +- Coherent snapshot of multiple block devices + +Features +======== + +Change tracker +-------------- + +The change tracker makes it possible to determine which blocks were changed during the +time between the last snapshot created and any of the previous snapshots.
+With a map of changes, it is enough to copy only the changed blocks; there is no +need to reread the entire block device. The change tracker makes it +possible to implement both incremental and differential backups. +Incremental backup is critical for large file repositories whose size can be +hundreds of terabytes and whose full backup time can take more than a day. +On such servers, the use of backup tools without a change tracker becomes +practically impossible. + +Snapshot at the block device level +---------------------------------- + +A snapshot at the block device level simplifies the backup algorithm +and reduces consumption of system resources. It also allows linear +reading of disk space directly, which achieves maximum reading speed +with minimal use of processor time. At the same time, snapshots can be +created for any block device, regardless of the file +system located on it. The exceptions are BTRFS, ZFS and cluster file systems. + +Dynamic allocation of storage space for differences +--------------------------------------------------- + +To store differences, the module does not require pre-reserved space on +a filesystem. The space for storing differences can be allocated in a file on any +filesystem. In addition, the size of the difference storage can be increased +after the snapshot is created, but only for a filesystem that supports +fallocate. A shared difference storage for all images of snapshot block devices +optimizes the use of storage space. However, there is one limitation. +A snapshot cannot be taken from a block device on which the difference storage +is located. + +Snapshot overflow resistance +---------------------------- + +To create images of snapshots of block devices, the module stores blocks +of the original block device that have been changed since the snapshot +was taken.
To do this, the module handles write requests and reads blocks +that need to be overwritten. This algorithm guarantees safety of the data +of the original block device in the event of an overflow of the snapshot, +and even in the case of unpredictable critical errors. If a problem occurs +during backup, the difference storage is released, the snapshot is closed, +no backup is created, but the server continues to work. + +Coherent snapshot of multiple block devices +------------------------------------------- + +A snapshot is created simultaneously for all block devices for which a backup +is being created, ensuring their coherent state. + + +Algorithms +========== + +Overview +-------- + +The blksnap module is a block-level filter. It handles all write I/O units. +The filter is attached to the block device when the snapshot is created +for the first time. The change tracker marks all overwritten blocks. +Information about the history of changes on the block device is available +while holding the snapshot. The module reads the blocks that need to be +overwritten and stores them in the difference storage. When reading from +a snapshot image, reading is performed either from the original device or +from the difference storage. + +Change tracking +--------------- + +A change tracker map is created for each block device. One byte of this map +corresponds to one block. The block size is set by the +``tracking_block_minimum_shift`` and ``tracking_block_maximum_count`` +module parameters. The ``tracking_block_minimum_shift`` parameter limits +the minimum block size for tracking, while ``tracking_block_maximum_count`` +defines the maximum allowed number of blocks. The size of the change tracker +block is determined depending on the size of the block device when adding +a tracking device, that is, when the snapshot is taken for the first time. +The block size must be a power of two. 
The ``tracking_block_maximum_shift`` +module parameter limits the maximum block size for tracking. If the +block size reaches the allowable limit, the number of blocks will exceed the +``tracking_block_maximum_count`` parameter. + +Each byte of the change map stores a number from 0 to 255. This is the +number of the snapshot since whose creation the block has +changed. Each time a snapshot is created, the number of the current +snapshot is increased by one. This number is written to the cell of the +change map when writing to the block. Thus, knowing the number of one of +the previous snapshots and the number of the last snapshot, one can determine +from the change map which blocks have been changed. When the number of the +current change reaches the maximum allowed value for the map of 255, at the +time when the next snapshot is created, the map of changes is reset to zero, +and the number of the current snapshot is assigned the value 1. The change +tracker is reset, and a new UUID is generated - a unique identifier of the +snapshot generation. The snapshot generation identifier makes it possible to detect +that a change tracking reset has been performed. + +The change map has two copies. One copy is active; it tracks the current +changes on the block device. The second copy is available for reading +while the snapshot is being held, and contains the history up to the moment +the snapshot is taken. Copies are synchronized at the moment of snapshot +creation. After the snapshot is released, the second copy of the map is not +needed, but it is not released, so as not to allocate memory for it again +the next time the snapshot is created. + +Copy on write +------------- + +Data is copied in blocks, or rather in chunks. The term "chunk" is used to +avoid confusion with change tracker blocks and I/O blocks. In addition, +the "chunk" in the blksnap module means about the same as the "chunk" in +the dm-snap module.
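Editorial illustration for the change map described under "Change tracking" above (not part of the patch). This is a minimal user-space sketch, assuming the map has already been read into a byte array. The helper name count_changed_blocks and its first_changed_number parameter are hypothetical. Whether the threshold is the earlier snapshot's number or that number plus one depends on how the module numbers snapshots, which the text does not pin down, so the caller passes the smallest cell value that should count as "changed".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GENERATION_ID_SIZE 16 /* width of the 16-byte generation UUID */

/*
 * Hypothetical helper, not part of blksnap: count tracking blocks whose
 * change-map cell is at least first_changed_number.
 *
 * Assumption: first_changed_number is the smallest cell value that marks
 * a block as written after the earlier snapshot; check the module header
 * for the exact numbering before relying on it.
 *
 * Returns -1 when the generation IDs differ: the tracker was reset, the
 * snapshot numbers are no longer comparable, and a full copy is needed.
 */
static long count_changed_blocks(const uint8_t *map, size_t block_count,
                                 uint8_t first_changed_number,
                                 const uint8_t *prev_generation,
                                 const uint8_t *cur_generation)
{
    long changed = 0;
    size_t i;

    assert(map != NULL);

    if (memcmp(prev_generation, cur_generation, GENERATION_ID_SIZE) != 0)
        return -1;

    for (i = 0; i < block_count; i++)
        if (map[i] >= first_changed_number)
            changed++;

    return changed;
}
```

A backup tool following this scheme would store the generation ID and snapshot number with each backup, then copy only the counted blocks on the next run, falling back to a full copy when the helper reports a reset.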
+ +The size of the chunk is determined by the ``chunk_minimum_shift`` and +``chunk_maximum_count`` module parameters. The ``chunk_minimum_shift`` +parameter limits the minimum size of the chunk, while ``chunk_maximum_count`` +defines the maximum allowed number of chunks. The size of the chunk is +determined depending on the size of the block device at the time of taking the +snapshot. The size of the chunk must be a power of two. The module parameter +``chunk_maximum_shift`` limits the maximum chunk size. If the chunk +size reaches the allowable limit, the number of chunks will exceed the +``chunk_maximum_count`` parameter. + +One chunk is described by the ``struct chunk`` structure. A map of structures +is created for each block device. The structure contains all the necessary +information to copy the chunk's data from the original block device to the +difference storage. This information describes the snapshot image. +A semaphore is located in the structure, which allows synchronization of threads +accessing the chunk. + +The block layer in Linux has a peculiarity. If a read I/O unit was sent, and a +write I/O unit was sent after it, then a write can be performed first, and only +then a read. Therefore, the copy-on-write algorithm is executed synchronously. +When a write request is handled, the execution of this I/O unit will be delayed +until the overwritten chunks are read from the original device for later +storing to the difference storage. But if, when handling a write I/O unit, it +turns out that the written range of sectors has already been prepared for +storing to the difference storage, then the I/O unit is simply passed through. + +This algorithm makes it possible to efficiently back up even systems +with round-robin databases. Such databases can be overwritten several times +during the system backup. Of course, the value of a backup of the RRD monitoring +system data can be questioned.
However, the task is often to back up +the entire enterprise infrastructure in order to restore or replicate it +entirely in case of problems. + +There is also a flaw in the algorithm. When overwriting at least one sector, +an entire chunk is copied. Thus, the difference storage can fill up rapidly +when data is written to a block device in small portions in random +order. This situation is possible in case of strong fragmentation of +data on the filesystem. But it must be borne in mind that with such data +fragmentation, performance of systems usually degrades greatly. So this +problem does not occur on real servers, although it can easily be created +by artificial tests. + +Difference storage +------------------ + +The difference storage can be a block device or it can be a file on a +filesystem. Using a block device gives slightly higher performance, +but in this case, the block device is used by the kernel module exclusively. +Usually the disk space is partitioned so that there is no available free space +for backup purposes. Using a file makes it possible to place the difference storage on a +filesystem. + +The difference storage can be expanded while the snapshot is being held, +but only if the filesystem supports fallocate(). If the free space in the +difference storage drops below half of the value of the module parameter +``diff_storage_minimum``, then the kernel module can expand the difference +storage file within the specified limits. This limit is set when creating a +snapshot. + +If free space in the difference storage runs out, an event about the +snapshot overflow is generated for user space. Such a snapshot is considered +corrupted, and read I/O units to snapshot images will be terminated with an +error code.
The difference storage stores outdated data required for snapshot +images, so when the snapshot overflows, the backup process is interrupted, +but the system maintains its operability without data loss. + +The difference storage has a limitation. A device on which the difference +storage is located cannot be added to the snapshot. In this case, the difference +storage can be located in virtual memory, which consists of RAM and a swap +partition (or file). To do this, it is enough to use a file in /dev/shm, or a +new tmpfs filesystem can be created for this purpose. Obviously, this variant +can be useful if the system has a lot of RAM or a large swap. The good news is +that the modern Linux kernel makes it possible to increase the size of the swap file "on +the fly" without changing the system configuration. + +A regular file or a block device file for the difference storage must be opened +with the O_EXCL flag. If an unnamed file with the O_TMPFILE flag is created, +then such a file will be automatically released when the snapshot is destroyed. +In addition, the use of an unnamed temporary file ensures that no one can open +this file and read its contents. + +Performing I/O for a snapshot image +----------------------------------- + +To read snapshot data, when taking a snapshot, block devices of snapshot images +are created. The snapshot image block devices support the write operation. +This makes it possible to perform additional data preparation on the filesystem before +creating a backup. + +To process the I/O unit, clones of the I/O unit are created, which redirect +the I/O unit either to the original block device or to the difference storage. +When processing of cloned I/O units is completed, the original I/O unit is +marked as completed too. + +An I/O unit can be partially processed without accessing block devices if +the I/O unit refers to a chunk that is in the queue for storing to the +difference storage. In this case, the data is read or written in a buffer in +memory.
+ +If, when processing the write I/O unit, it turns out that the data of the +referred chunk has not yet been stored to the difference storage or has not +even been read from the original device, then an I/O unit to read data from the +original device is initiated beforehand. After the read from the original device +completes, the data from the I/O unit partially overwrites the contents of +the buffer of the chunk in memory, and the chunk is scheduled to be saved to the +difference storage. + +How to use +========== + +Depending on the needs and the selected license, you can choose different +options for managing the module: + +- Using ioctl directly +- Using a static C++ library +- Using the blksnap console tool + +Using a BLKFILTER_CTL for block device +-------------------------------------- + +BLKFILTER_CTL sends a filter-specific command to the filter on a block +device and gets the result of its execution. The module provides the +``include/uapi/linux/blksnap.h`` header file with a description of the commands and +their data structures. + +1. ``blkfilter_ctl_blksnap_cbtinfo`` gets information from the + change tracker. +2. ``blkfilter_ctl_blksnap_cbtmap`` reads the change tracker table. If a write + operation was performed for the snapshot, then the change tracker takes this + into account. Therefore, it is necessary to receive tracker data after write + operations have been completed. +3. ``blkfilter_ctl_blksnap_cbtdirty`` marks blocks as changed in the change + tracker table. This is necessary if post-processing is performed after the + backup is created, which changes the backup blocks. +4. ``blkfilter_ctl_blksnap_snapshotadd`` adds a block device to the snapshot. +5. ``blkfilter_ctl_blksnap_snapshotinfo`` gets the name of the snapshot + image block device and reports whether an error occurred. + +Using ioctl +----------- + +The BLKFILTER_CTL ioctl alone is not enough to fully manage +the blksnap module.
A control file ``blksnap-control`` is created to manage +snapshots. The control commands are also described in the file +``include/uapi/linux/blksnap.h``. + +1. ``blksnap_ioctl_version`` gets the version number. +2. ``blk_snap_ioctl_snapshot_create`` initiates the snapshot creation process. +3. ``blk_snap_ioctl_snapshot_append_storage`` adds a range of blocks to the + difference storage. +4. ``blk_snap_ioctl_snapshot_take`` creates block devices of block device + snapshot images. +5. ``blk_snap_ioctl_snapshot_collect`` collects all created snapshots. +6. ``blk_snap_ioctl_snapshot_wait_event`` tracks the status of + snapshots and receives events about the requirement to expand the difference + storage or about snapshot overflow. +7. ``blk_snap_ioctl_snapshot_destroy`` releases the snapshot. + +Static C++ library +------------------ + +The [#userspace_libs]_ library was created primarily to simplify creation of +tests in C++, and it is also a good example of using the module interface. +When creating applications, direct use of control calls is preferable. +However, the library can be used in an application with a GPL-2+ license, +or a library with an LGPL-2+ license can be created, with which even a +proprietary application can be dynamically linked. + +blksnap console tool +-------------------- + +The blksnap [#userspace_tools]_ console tool controls the module from +the command line. The tool contains detailed built-in help. To get a list of +commands with usage descriptions, run ``blksnap --help``. The ``blksnap + --help`` command gives detailed information about the +parameters of each command call. This option may be convenient when creating +proprietary software, as it avoids the need to compile against the open source code. +At the same time, the blksnap tool can be used for creating backup scripts.
+For example, rsync can be called to synchronize files on the filesystem of +the mounted snapshot image and files in the archive on a filesystem that +supports compression. + +Tests +----- + +A set of tests was created for regression testing [#userspace_tests]_. +Tests with simple algorithms that use the ``blksnap`` console tool to +control the module are written in Bash. More complex testing algorithms +are implemented in C++. + +References +========== + +.. [#userspace_libs] https://github.com/veeam/blksnap/tree/stable-v2.0/lib + +.. [#userspace_tools] https://github.com/veeam/blksnap/tree/stable-v2.0/tools + +.. [#userspace_tests] https://github.com/veeam/blksnap/tree/stable-v2.0/tests + +Module interface description +============================ + +.. kernel-doc:: include/uapi/linux/blksnap.h diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst index e9712f72cd6d..696ff150c6b7 100644 --- a/Documentation/block/index.rst +++ b/Documentation/block/index.rst @@ -11,6 +11,7 @@ Block biovecs blk-mq blkfilter + blksnap cmdline-partition data-integrity deadline-iosched diff --git a/MAINTAINERS b/MAINTAINERS index ef90cd0fec9c..9c81e4c83139 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3593,6 +3593,12 @@ F: block/blk-filter.c F: include/linux/blk-filter.h F: include/uapi/linux/blk-filter.h +BLOCK DEVICE SNAPSHOTS MODULE +M: Sergei Shtepa +L: linux-block@vger.kernel.org +S: Supported +F: Documentation/block/blksnap.rst + BLOCK LAYER M: Jens Axboe L: linux-block@vger.kernel.org
From patchwork Fri Nov 24 16:59:26 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169524 From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa, Donald Buczek Subject: [PATCH v6 04/11] blksnap: header file of the module interface Date: Fri, 24 Nov 2023 17:59:26 +0100 Message-Id: <20231124165933.27580-5-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0
From: Sergei Shtepa The header file contains a set of declarations, structures and control requests (ioctl) that allows to manage the module from the
user space. Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Tested-by: Donald Buczek Signed-off-by: Sergei Shtepa --- .../userspace-api/ioctl/ioctl-number.rst | 1 + MAINTAINERS | 1 + include/uapi/linux/blksnap.h | 388 ++++++++++++++++++ 3 files changed, 390 insertions(+) create mode 100644 include/uapi/linux/blksnap.h diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 4ea5b837399a..81acae1b1859 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -203,6 +203,7 @@ Code Seq# Include File Comments 'V' C0 linux/ivtvfb.h conflict! 'V' C0 linux/ivtv.h conflict! 'V' C0 media/si4713.h conflict! +'V' 00-1F uapi/linux/blksnap.h conflict! 'W' 00-1F linux/watchdog.h conflict! 'W' 00-1F linux/wanrouter.h conflict! (pre 3.9) 'W' 00-3F sound/asound.h conflict! diff --git a/MAINTAINERS b/MAINTAINERS index 9c81e4c83139..9770c4d4b15d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3598,6 +3598,7 @@ M: Sergei Shtepa L: linux-block@vger.kernel.org S: Supported F: Documentation/block/blksnap.rst +F: include/uapi/linux/blksnap.h BLOCK LAYER M: Jens Axboe diff --git a/include/uapi/linux/blksnap.h b/include/uapi/linux/blksnap.h new file mode 100644 index 000000000000..be1474f2025c --- /dev/null +++ b/include/uapi/linux/blksnap.h @@ -0,0 +1,388 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef _UAPI_LINUX_BLKSNAP_H +#define _UAPI_LINUX_BLKSNAP_H + +#include + +#define BLKSNAP_CTL "blksnap-control" +#define BLKSNAP_IMAGE_NAME "blksnap-image" +#define BLKSNAP 'V' + +/** + * DOC: Block device filter interface. + * + * Control commands that are transmitted through the block device filter + * interface. + */ + +/** + * enum blkfilter_ctl_blksnap - List of commands for BLKFILTER_CTL ioctl + * + * @blkfilter_ctl_blksnap_cbtinfo: + * Get CBT information. 
+ * The result of executing the command is a &struct blksnap_cbtinfo. + * Return 0 if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_cbtmap: + * Read the CBT map. + * The option passes the &struct blksnap_cbtmap. + * The size of the table can be quite large. Thus, the table is read in + * a loop, in each cycle of which the next offset is set to + * &blksnap_cbtmap.offset. + * Return a count of bytes read if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_cbtdirty: + * Set dirty blocks in the CBT map. + * The option passes the &struct blksnap_cbtdirty. + * There are cases when some blocks need to be marked as changed. + * This command makes that possible. + * Return 0 if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_snapshotadd: + * Add device to snapshot. + * The option passes the &struct blksnap_snapshotadd. + * Return 0 if succeeded, negative errno otherwise. + * @blkfilter_ctl_blksnap_snapshotinfo: + * Get information about snapshot. + * The result of executing the command is a &struct blksnap_snapshotinfo. + * Return 0 if succeeded, negative errno otherwise. + */ +enum blkfilter_ctl_blksnap { + blkfilter_ctl_blksnap_cbtinfo, + blkfilter_ctl_blksnap_cbtmap, + blkfilter_ctl_blksnap_cbtdirty, + blkfilter_ctl_blksnap_snapshotadd, + blkfilter_ctl_blksnap_snapshotinfo, +}; + +#ifndef UUID_SIZE +#define UUID_SIZE 16 +#endif + +/** + * struct blksnap_uuid - Unique 16-byte identifier. + * + * @b: + * An array of 16 bytes. + */ +struct blksnap_uuid { + __u8 b[UUID_SIZE]; +}; + +/** + * struct blksnap_cbtinfo - Result for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtinfo. + * + * @device_capacity: + * Device capacity in bytes. + * @block_size: + * Block size in bytes. + * @block_count: + * Number of blocks. + * @generation_id: + * Unique identifier of change tracking generation. + * @changes_number: + * Current changes number.
+ */ +struct blksnap_cbtinfo { + __u64 device_capacity; + __u32 block_size; + __u32 block_count; + struct blksnap_uuid generation_id; + __u8 changes_number; +}; + +/** + * struct blksnap_cbtmap - Option for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtmap. + * + * @offset: + * Offset from the beginning of the CBT bitmap in bytes. + * @length: + * Size of @buffer in bytes. + * @buffer: + * Pointer to the buffer for output. + */ +struct blksnap_cbtmap { + __u32 offset; + __u32 length; + __u64 buffer; +}; + +/** + * struct blksnap_sectors - Description of the block device region. + * + * @offset: + * Offset from the beginning of the disk in sectors. + * @count: + * Count of sectors. + */ +struct blksnap_sectors { + __u64 offset; + __u64 count; +}; + +/** + * struct blksnap_cbtdirty - Option for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtdirty. + * + * @count: + * Count of elements in the @dirty_sectors. + * @dirty_sectors: + * Pointer to the array of &struct blksnap_sectors. + */ +struct blksnap_cbtdirty { + __u32 count; + __u64 dirty_sectors; +}; + +/** + * struct blksnap_snapshotadd - Option for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotadd. + * + * @id: + * ID of the snapshot to which the block device should be added. + */ +struct blksnap_snapshotadd { + struct blksnap_uuid id; +}; + +#define IMAGE_DISK_NAME_LEN 32 + +/** + * struct blksnap_snapshotinfo - Result for the command + * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotinfo. + * + * @error_code: + * Zero if there were no errors while holding the snapshot. + * The error code -ENOSPC means that while holding the snapshot, a snapshot + * overflow situation has occurred. Other error codes mean other reasons + * for failure. + * The error code is reset when the device is added to a new snapshot. + * @image: + * If the snapshot was taken, it stores the block device name of the + * image, or empty string otherwise.
+ */ +struct blksnap_snapshotinfo { + __s32 error_code; + __u8 image[IMAGE_DISK_NAME_LEN]; +}; + +/** + * DOC: Interface for managing snapshots + * + * Control commands that are transmitted through the blksnap module interface. + */ +enum blksnap_ioctl { + blksnap_ioctl_version, + blksnap_ioctl_snapshot_create, + blksnap_ioctl_snapshot_destroy, + blksnap_ioctl_snapshot_take, + blksnap_ioctl_snapshot_collect, + blksnap_ioctl_snapshot_wait_event, +}; + +/** + * struct blksnap_version - Module version. + * + * @major: + * Version major part. + * @minor: + * Version minor part. + * @revision: + * Revision number. + * @build: + * Build number. Should be zero. + */ +struct blksnap_version { + __u16 major; + __u16 minor; + __u16 revision; + __u16 build; +}; + +/** + * define IOCTL_BLKSNAP_VERSION - Get module version. + * + * The version may increase when the API changes. But linking the user space + * behavior to the version code does not seem to be a good idea. + * To ensure backward compatibility, API changes should be made by adding new + * ioctl without changing the behavior of existing ones. The version should be + * used for logs. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_VERSION \ + _IOR(BLKSNAP, blksnap_ioctl_version, struct blksnap_version) + +/** + * struct blksnap_snapshot_create - Argument for the + * &IOCTL_BLKSNAP_SNAPSHOT_CREATE control. + * + * @diff_storage_limit_sect: + * The maximum allowed difference storage size in sectors. + * @diff_storage_fd: + * The difference storage file descriptor. + * @id: + * Generated new snapshot ID. + */ +struct blksnap_snapshot_create { + __u64 diff_storage_limit_sect; + __u32 diff_storage_fd; + struct blksnap_uuid id; +}; + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_CREATE - Create snapshot. + * + * Creates a snapshot structure and initializes the difference storage. + * A snapshot is created for several block devices at once. 
Several snapshots + * can be created at the same time, but with the condition that one block + * device can only be included in one snapshot. + * + * The difference storage can dynamically increase as it fills up. + * The file is increased in portions, the size of which is determined by the + * module parameter &diff_storage_minimum. Each time the amount of free space + * in the difference storage is reduced to half of &diff_storage_minimum, + * the file is expanded by a portion, until it reaches the allowable limit + * &diff_storage_limit_sect. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_CREATE \ + _IOWR(BLKSNAP, blksnap_ioctl_snapshot_create, \ + struct blksnap_snapshot_create) + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_DESTROY - Release and destroy the snapshot. + * + * Destroys the snapshot identified by the given &struct blksnap_uuid. This leads to the + * deletion of all block device images of the snapshot. The difference storage + * is released, but the change tracker keeps tracking. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_DESTROY \ + _IOW(BLKSNAP, blksnap_ioctl_snapshot_destroy, \ + struct blksnap_uuid) + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_TAKE - Take snapshot. + * + * Creates snapshot images of block devices and switches the change tracker tables. + * The snapshot must be created before this call, and the areas of block + * devices should be added to the difference storage. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_TAKE \ + _IOW(BLKSNAP, blksnap_ioctl_snapshot_take, \ + struct blksnap_uuid) + +/** + * struct blksnap_snapshot_collect - Argument for the + * &IOCTL_BLKSNAP_SNAPSHOT_COLLECT control. + * + * @count: + * Size of &blksnap_snapshot_collect.ids as a number of 16-byte UUIDs. + * @ids: + * Pointer to the array of struct blksnap_uuid for output.
+ */ +struct blksnap_snapshot_collect { + __u32 count; + __u64 ids; +}; + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_COLLECT - Get collection of created snapshots. + * + * Multiple snapshots can be created at the same time. This allows one + * system to create backups of different data on independent schedules. + * + * If &blksnap_snapshot_collect.count is less than required to store the + * &blksnap_snapshot_collect.ids, the array is not filled, and the ioctl + * returns the required count for &blksnap_snapshot_collect.ids. + * + * So, it is recommended to call the ioctl twice. The first call, with a null + * pointer in &blksnap_snapshot_collect.ids and a zero value in + * &blksnap_snapshot_collect.count, sets the required array size in + * &blksnap_snapshot_collect.count. The second call, with + * &blksnap_snapshot_collect.ids pointing to an array of the required size, + * returns the collection of active snapshots. + * + * Return: 0 if succeeded, -ENODATA if there is not enough space in the array + * to store collection of active snapshots, or negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_COLLECT \ + _IOR(BLKSNAP, blksnap_ioctl_snapshot_collect, \ + struct blksnap_snapshot_collect) + +/** + * enum blksnap_event_codes - Variants of event codes. + * + * @blksnap_event_code_corrupted: + * Snapshot image is corrupted event. + * If a chunk could not be allocated when trying to save data to the + * difference storage, this event is generated. However, this does not mean + * that the backup process was interrupted with an error. If the snapshot + * image has been read to the end by this time, the backup process is + * considered successful. + */ +enum blksnap_event_codes { + blksnap_event_code_corrupted, +}; + +/** + * struct blksnap_snapshot_event - Argument for the + * &IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT control. + * + * @id: + * Snapshot ID. + * @timeout_ms: + * Timeout for waiting in milliseconds.
+ * @time_label: + * Timestamp of the received event. + * @code: + * Code of the received event &enum blksnap_event_codes. + * @data: + * The received event body. + */ +struct blksnap_snapshot_event { + struct blksnap_uuid id; + __u32 timeout_ms; + __u32 code; + __s64 time_label; + __u8 data[4096 - 32]; +}; + +/** + * define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT - Wait and get the event from the + * snapshot. + * + * While holding the snapshot, the kernel module can transmit information about + * changes in its state in the form of events to the user level. + * It is very important to receive these events as quickly as possible, so the + * user's thread is in the state of interruptible sleep. + * + * Return: 0 if succeeded, negative errno otherwise. + */ +#define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT \ + _IOR(BLKSNAP, blksnap_ioctl_snapshot_wait_event, \ + struct blksnap_snapshot_event) + +/** + * struct blksnap_event_corrupted - Data for the + * &blksnap_event_code_corrupted event. + * + * @dev_id_mj: + * Major part of original device ID. + * @dev_id_mn: + * Minor part of original device ID. + * @err_code: + * Error code. 
+ */ +struct blksnap_event_corrupted { + __u32 dev_id_mj; + __u32 dev_id_mn; + __s32 err_code; +}; + +#endif /* _UAPI_LINUX_BLKSNAP_H */ From patchwork Fri Nov 24 16:59:27 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169530
From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa Subject: [PATCH v6 05/11] blksnap: module management interface functions Date: Fri, 24 Nov 2023 17:59:27 +0100 Message-Id: <20231124165933.27580-6-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 From: Sergei Shtepa Contains callback functions for loading and unloading the module and implementation of module management interface functions.
The module parameters and other mandatory declarations for the kernel module are also defined. Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Signed-off-by: Sergei Shtepa --- MAINTAINERS | 1 + drivers/block/blksnap/main.c | 475 +++++++++++++++++++++++++++++++++ drivers/block/blksnap/params.h | 16 ++ 3 files changed, 492 insertions(+) create mode 100644 drivers/block/blksnap/main.c create mode 100644 drivers/block/blksnap/params.h diff --git a/MAINTAINERS b/MAINTAINERS index 9770c4d4b15d..6f666d772cf5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3598,6 +3598,7 @@ M: Sergei Shtepa L: linux-block@vger.kernel.org S: Supported F: Documentation/block/blksnap.rst +F: drivers/block/blksnap/* F: include/uapi/linux/blksnap.h BLOCK LAYER diff --git a/drivers/block/blksnap/main.c b/drivers/block/blksnap/main.c new file mode 100644 index 000000000000..a8ae824d580f --- /dev/null +++ b/drivers/block/blksnap/main.c @@ -0,0 +1,475 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include "snapimage.h" +#include "snapshot.h" +#include "tracker.h" +#include "chunk.h" +#include "params.h" + +/* + * The power of 2 for minimum tracking block size. + * + * If we make the tracking block size small, we will get detailed information + * about the changes, but the size of the change tracker table will be too + * large, which will lead to inefficient memory usage. + */ +static unsigned int tracking_block_minimum_shift = 16; + +/* + * The maximum number of tracking blocks. + * + * A table is created in RAM to store information about the status of all + * tracking blocks. So, if the size of the tracking block is small, then the + * size of the table turns out to be large and memory is consumed inefficiently. + * As the size of the block device grows, the size of the tracking block size + * should also grow. 
For this purpose, a limit on the maximum number of + * blocks is set. + */ +static unsigned int tracking_block_maximum_count = 2097152; + +/* + * The power of 2 for maximum tracking block size. + * + * On very large capacity disks, the block size may be too large. To prevent + * this, the maximum block size is limited. If the limit on the maximum block + * size has been reached, then the number of blocks may exceed the + * &tracking_block_maximum_count. + */ +static unsigned int tracking_block_maximum_shift = 26; + +/* + * The power of 2 for minimum chunk size. + * + * The size of the chunk determines how much data will be copied to the + * difference storage when at least one sector of the block device is changed. + * If the size is small, then small I/O units will be generated, which will + * reduce performance. Too large a chunk size will lead to inefficient use of + * the difference storage. + */ +static unsigned int chunk_minimum_shift = 18; + +/* + * The power of 2 for maximum number of chunks. + * + * A table is created in RAM to store information about the state of the chunks. + * So, if the size of the chunk is small, then the size of the table turns out + * to be large and memory is consumed inefficiently. As the size of the block + * device grows, the size of the chunk should also grow. For this purpose, the + * maximum number of chunks is set. + * + * The table expands dynamically when new chunks are allocated. Therefore, + * memory consumption also depends on the intensity of writing to the block + * device under the snapshot. + */ +static unsigned int chunk_maximum_count_shift = 40; + +/* + * The power of 2 for maximum chunk size. + * + * On very large capacity disks, the chunk size may be too large. To prevent + * this, the maximum chunk size is limited. If the limit on the maximum chunk + * size has been reached, then the number of chunks may exceed the + * &chunk_maximum_count.
+ */ +static unsigned int chunk_maximum_shift = 26; + +/* + * The maximum number of chunks in queue. + * + * The chunk is not immediately stored to the difference storage. The chunks + * are put in a store queue. The store queue allows postponing the operation + * of storing a chunk's data to the difference storage and performing it later in + * the worker thread. + */ +static unsigned int chunk_maximum_in_queue = 16; + +/* + * The size of the pool of preallocated difference buffers. + * + * A buffer can be allocated for each chunk. After use, this buffer is not + * released immediately, but is sent to the pool of free buffers. However, if + * there are too many free buffers in the pool, then these free buffers will + * be released immediately. + */ +static unsigned int free_diff_buffer_pool_size = 128; + +/* + * The minimum allowable size of the difference storage in sectors. + * + * The difference storage is a part of the disk space allocated for storing + * snapshot data. If the free space in difference storage is less than half of + * this value, then the process of increasing the size of the difference storage + * file will begin. The size of the difference storage file is increased in + * portions, the size of which is determined by this value.
+ */ +static unsigned int diff_storage_minimum = 2097152; + +#define VERSION_STR "2.0.0.0" +static const struct blksnap_version version = { + .major = 2, + .minor = 0, + .revision = 0, + .build = 0, +}; + +unsigned int get_tracking_block_minimum_shift(void) +{ + return tracking_block_minimum_shift; +} + +unsigned int get_tracking_block_maximum_shift(void) +{ + return tracking_block_maximum_shift; +} + +unsigned int get_tracking_block_maximum_count(void) +{ + return tracking_block_maximum_count; +} + +unsigned int get_chunk_minimum_shift(void) +{ + return chunk_minimum_shift; +} + +unsigned int get_chunk_maximum_shift(void) +{ + return chunk_maximum_shift; +} + +unsigned long get_chunk_maximum_count(void) +{ + /* + * The XArray is used to store chunks. And 'unsigned long' is used as + * chunk number parameter. So, the number of chunks cannot exceed the + * limits of ULONG_MAX. + */ + if ((chunk_maximum_count_shift >> 3) < sizeof(unsigned long)) + return (1ul << chunk_maximum_count_shift); + return ULONG_MAX; +} + +unsigned int get_chunk_maximum_in_queue(void) +{ + return chunk_maximum_in_queue; +} + +unsigned int get_free_diff_buffer_pool_size(void) +{ + return free_diff_buffer_pool_size; +} + +sector_t get_diff_storage_minimum(void) +{ + return (sector_t)diff_storage_minimum; +} + +static int ioctl_version(struct blksnap_version __user *user_version) +{ + if (copy_to_user(user_version, &version, sizeof(version))) { + pr_err("Unable to get version: invalid user buffer\n"); + return -ENODATA; + } + + return 0; +} + +static_assert(sizeof(uuid_t) == sizeof(struct blksnap_uuid), + "Invalid size of struct blksnap_uuid."); + +static int ioctl_snapshot_create(struct blksnap_snapshot_create __user *uarg) +{ + struct blksnap_snapshot_create karg; + int ret; + + if (copy_from_user(&karg, uarg, sizeof(karg))) { + pr_err("Unable to create snapshot: invalid user buffer\n"); + return -ENODATA; + } + + ret = snapshot_create(&karg); + if (ret) + return ret; + + if
(copy_to_user(uarg, &karg, sizeof(karg))) { + pr_err("Unable to create snapshot: invalid user buffer\n"); + return -ENODATA; + } + + return 0; +} + +static int ioctl_snapshot_destroy(struct blksnap_uuid __user *user_id) +{ + uuid_t kernel_id; + + if (copy_from_user(kernel_id.b, user_id->b, sizeof(uuid_t))) { + pr_err("Unable to destroy snapshot: invalid user buffer\n"); + return -ENODATA; + } + + return snapshot_destroy(&kernel_id); +} + +static int ioctl_snapshot_take(struct blksnap_uuid __user *user_id) +{ + uuid_t kernel_id; + + if (copy_from_user(kernel_id.b, user_id->b, sizeof(uuid_t))) { + pr_err("Unable to take snapshot: invalid user buffer\n"); + return -ENODATA; + } + + return snapshot_take(&kernel_id); +} + +static int ioctl_snapshot_collect(struct blksnap_snapshot_collect __user *uarg) +{ + int ret; + struct blksnap_snapshot_collect karg; + + if (copy_from_user(&karg, uarg, sizeof(karg))) { + pr_err("Unable to collect available snapshots: invalid user buffer\n"); + return -ENODATA; + } + + ret = snapshot_collect(&karg.count, u64_to_user_ptr(karg.ids)); + + if (copy_to_user(uarg, &karg, sizeof(karg))) { + pr_err("Unable to collect available snapshots: invalid user buffer\n"); + return -ENODATA; + } + + return ret; +} + +static_assert(sizeof(struct blksnap_snapshot_event) == 4096, + "The size struct blksnap_snapshot_event should be equal to the size of the page."); + +static int ioctl_snapshot_wait_event(struct blksnap_snapshot_event __user *uarg) +{ + int ret = 0; + struct blksnap_snapshot_event *karg; + struct event *ev; + + karg = kzalloc(sizeof(struct blksnap_snapshot_event), GFP_KERNEL); + if (!karg) + return -ENOMEM; + + /* Copy only snapshot ID and timeout*/ + if (copy_from_user(karg, uarg, sizeof(uuid_t) + sizeof(__u32))) { + pr_err("Unable to get snapshot event. 
 Invalid user buffer\n"); + ret = -EINVAL; + goto out; + } + + ev = snapshot_wait_event((uuid_t *)karg->id.b, karg->timeout_ms); + if (IS_ERR(ev)) { + ret = PTR_ERR(ev); + goto out; + } + + pr_debug("Received event=%lld code=%d data_size=%d\n", ev->time, + ev->code, ev->data_size); + karg->code = ev->code; + karg->time_label = ev->time; + + if (ev->data_size > sizeof(karg->data)) { + pr_err("Event size %d is too big\n", ev->data_size); + ret = -ENOSPC; + /* If we can't copy all the data, we copy only part of it. */ + } + memcpy(karg->data, ev->data, min_t(size_t, ev->data_size, sizeof(karg->data))); + event_free(ev); + + if (copy_to_user(uarg, karg, sizeof(struct blksnap_snapshot_event))) { + pr_err("Unable to get snapshot event. Invalid user buffer\n"); + ret = -EINVAL; + } +out: + kfree(karg); + + return ret; +} + +static long blksnap_ctrl_unlocked_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) +{ + void __user *argp = (void __user *)arg; + + switch (cmd) { + case IOCTL_BLKSNAP_VERSION: + return ioctl_version(argp); + case IOCTL_BLKSNAP_SNAPSHOT_CREATE: + return ioctl_snapshot_create(argp); + case IOCTL_BLKSNAP_SNAPSHOT_DESTROY: + return ioctl_snapshot_destroy(argp); + case IOCTL_BLKSNAP_SNAPSHOT_TAKE: + return ioctl_snapshot_take(argp); + case IOCTL_BLKSNAP_SNAPSHOT_COLLECT: + return ioctl_snapshot_collect(argp); + case IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT: + return ioctl_snapshot_wait_event(argp); + default: + return -ENOTTY; + } + +} + +static const struct file_operations blksnap_ctrl_fops = { + .owner = THIS_MODULE, + .unlocked_ioctl = blksnap_ctrl_unlocked_ioctl, +}; + +static struct miscdevice blksnap_ctrl_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = BLKSNAP_CTL, + .fops = &blksnap_ctrl_fops, +}; + +static inline sector_t chunk_minimum_sectors(void) +{ + return (1ull << (chunk_minimum_shift - SECTOR_SHIFT)); +} + +static int __init parameters_init(void) +{ + pr_debug("tracking_block_minimum_shift: %d\n", + tracking_block_minimum_shift); + pr_debug("tracking_block_maximum_shift:
%d\n", + tracking_block_maximum_shift); + pr_debug("tracking_block_maximum_count: %d\n", + tracking_block_maximum_count); + + pr_debug("chunk_minimum_shift: %d\n", chunk_minimum_shift); + pr_debug("chunk_maximum_shift: %d\n", chunk_maximum_shift); + pr_debug("chunk_maximum_count_shift: %u\n", chunk_maximum_count_shift); + + pr_debug("chunk_maximum_in_queue: %d\n", chunk_maximum_in_queue); + pr_debug("free_diff_buffer_pool_size: %d\n", + free_diff_buffer_pool_size); + pr_debug("diff_storage_minimum: %d\n", diff_storage_minimum); + + if (tracking_block_maximum_shift < tracking_block_minimum_shift) { + tracking_block_maximum_shift = tracking_block_minimum_shift; + pr_warn("fixed tracking_block_maximum_shift: %d\n", + tracking_block_maximum_shift); + } + + if (chunk_minimum_shift < PAGE_SHIFT) { + chunk_minimum_shift = PAGE_SHIFT; + pr_warn("fixed chunk_minimum_shift: %d\n", + chunk_minimum_shift); + } + if (chunk_maximum_shift < chunk_minimum_shift) { + chunk_maximum_shift = chunk_minimum_shift; + pr_warn("fixed chunk_maximum_shift: %d\n", + chunk_maximum_shift); + } + if (diff_storage_minimum < (chunk_minimum_sectors() * 2)) { + diff_storage_minimum = chunk_minimum_sectors() * 2; + pr_warn("fixed diff_storage_minimum: %d\n", + diff_storage_minimum); + } + if (diff_storage_minimum & (chunk_minimum_sectors() - 1)) { + diff_storage_minimum &= ~(chunk_minimum_sectors() - 1); + pr_warn("fixed diff_storage_minimum: %d\n", + diff_storage_minimum); + } + + return 0; +} + +static int __init blksnap_init(void) +{ + int ret; + + pr_debug("Loading\n"); + pr_debug("Version: %s\n", VERSION_STR); + + ret = parameters_init(); + if (ret) + return ret; + + ret = chunk_init(); + if (ret) + goto fail_chunk_init; + + ret = tracker_init(); + if (ret) + goto fail_tracker_init; + + ret = misc_register(&blksnap_ctrl_misc); + if (ret) + goto fail_misc_register; + + return 0; + +fail_misc_register: + tracker_done(); +fail_tracker_init: + chunk_done(); +fail_chunk_init: + + return ret; +} + 
+static void __exit blksnap_exit(void) +{ + pr_debug("Unloading module\n"); + + misc_deregister(&blksnap_ctrl_misc); + + chunk_done(); + snapshot_done(); + tracker_done(); + + pr_debug("Module was unloaded\n"); +} + +module_init(blksnap_init); +module_exit(blksnap_exit); + +module_param_named(tracking_block_minimum_shift, tracking_block_minimum_shift, + uint, 0644); +MODULE_PARM_DESC(tracking_block_minimum_shift, + "The power of 2 for minimum tracking block size"); +module_param_named(tracking_block_maximum_count, tracking_block_maximum_count, + uint, 0644); +MODULE_PARM_DESC(tracking_block_maximum_count, + "The maximum number of tracking blocks"); +module_param_named(tracking_block_maximum_shift, tracking_block_maximum_shift, + uint, 0644); +MODULE_PARM_DESC(tracking_block_maximum_shift, + "The power of 2 for maximum tracking block size"); +module_param_named(chunk_minimum_shift, chunk_minimum_shift, uint, 0644); +MODULE_PARM_DESC(chunk_minimum_shift, + "The power of 2 for minimum chunk size"); +module_param_named(chunk_maximum_count_shift, chunk_maximum_count_shift, + uint, 0644); +MODULE_PARM_DESC(chunk_maximum_count_shift, + "The power of 2 for maximum number of chunks"); +module_param_named(chunk_maximum_shift, chunk_maximum_shift, uint, 0644); +MODULE_PARM_DESC(chunk_maximum_shift, + "The power of 2 for maximum chunk size"); +module_param_named(chunk_maximum_in_queue, chunk_maximum_in_queue, uint, 0644); +MODULE_PARM_DESC(chunk_maximum_in_queue, + "The maximum number of chunks in store queue"); +module_param_named(free_diff_buffer_pool_size, free_diff_buffer_pool_size, + uint, 0644); +MODULE_PARM_DESC(free_diff_buffer_pool_size, + "The size of the pool of preallocated difference buffers"); +module_param_named(diff_storage_minimum, diff_storage_minimum, uint, 0644); +MODULE_PARM_DESC(diff_storage_minimum, + "The minimum allowable size of the difference storage in sectors"); + +MODULE_DESCRIPTION("Block Device Snapshots Module");
+MODULE_VERSION(VERSION_STR); +MODULE_AUTHOR("Veeam Software Group GmbH"); +MODULE_LICENSE("GPL"); diff --git a/drivers/block/blksnap/params.h b/drivers/block/blksnap/params.h new file mode 100644 index 000000000000..3ec4cce4de39 --- /dev/null +++ b/drivers/block/blksnap/params.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_PARAMS_H +#define __BLKSNAP_PARAMS_H + +unsigned int get_tracking_block_minimum_shift(void); +unsigned int get_tracking_block_maximum_shift(void); +unsigned int get_tracking_block_maximum_count(void); +unsigned int get_chunk_minimum_shift(void); +unsigned int get_chunk_maximum_shift(void); +unsigned long get_chunk_maximum_count(void); +unsigned int get_chunk_maximum_in_queue(void); +unsigned int get_free_diff_buffer_pool_size(void); +sector_t get_diff_storage_minimum(void); + +#endif /* __BLKSNAP_PARAMS_H */ From patchwork Fri Nov 24 16:59:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169525
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700845207; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=TmWDk8Y68Mao3PzWdDlZiK1voP/L3e+BCM9sV+PzBf0=; b=jL5ZrF8L9sNR/TMyHd8GQPVJVrA2s9k8ODf21idQfM+N9aCVi966USdfL5V0ZDrYpEGkj+ Z0x5NLriwSsNvHgz7lFWQ8UIH8cQZA938QlffBoRPPHxB7saqhJB0FD+SRkbhZ6yUd7EKU 2BpUcsGWnOaiRpui1GAP295bDYKcZKE= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa Subject: [PATCH v6 06/11] blksnap: handling and tracking I/O units Date: Fri, 24 Nov 2023 17:59:28 +0100 Message-Id: <20231124165933.27580-7-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 09:01:03 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783465599118148351 X-GMAIL-MSGID: 1783465599118148351 From: Sergei Shtepa The struct tracker contains callback functions for handling the I/O units of a block device. 
When a write request is handled, the change block tracking (CBT) map functions are called and the process of copying data from the original block device to the change store is initiated. The tracker is registered and unregistered with the functions blkfilter_register() and blkfilter_unregister(). The struct cbt_map stores the history of block device changes. Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Signed-off-by: Sergei Shtepa --- drivers/block/blksnap/cbt_map.c | 228 +++++++++++++++++++++ drivers/block/blksnap/cbt_map.h | 90 +++++++++ drivers/block/blksnap/tracker.c | 344 ++++++++++++++++++++++++++++++++ drivers/block/blksnap/tracker.h | 78 ++++++++ 4 files changed, 740 insertions(+) create mode 100644 drivers/block/blksnap/cbt_map.c create mode 100644 drivers/block/blksnap/cbt_map.h create mode 100644 drivers/block/blksnap/tracker.c create mode 100644 drivers/block/blksnap/tracker.h diff --git a/drivers/block/blksnap/cbt_map.c b/drivers/block/blksnap/cbt_map.c new file mode 100644 index 000000000000..7b6ebe225c48 --- /dev/null +++ b/drivers/block/blksnap/cbt_map.c @@ -0,0 +1,228 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME "-cbt_map: " fmt + +#include +#include +#include +#include "cbt_map.h" +#include "params.h" + +static inline unsigned long long count_by_shift(sector_t capacity, + unsigned long long shift) +{ + sector_t blk_size = 1ull << (shift - SECTOR_SHIFT); + + return round_up(capacity, blk_size) / blk_size; +} + +static void cbt_map_calculate_block_size(struct cbt_map *cbt_map) +{ + unsigned long long count; + unsigned long long shift = get_tracking_block_minimum_shift(); + + pr_debug("Device capacity %llu sectors\n", cbt_map->device_capacity); + /* + * The size of the tracking block is calculated based on the size of + * the disk so that the CBT table does not exceed a reasonable size. 
+ */ + count = count_by_shift(cbt_map->device_capacity, shift); + pr_debug("Blocks count %llu\n", count); + while (count > get_tracking_block_maximum_count()) { + if (shift >= get_tracking_block_maximum_shift()) { + pr_info("The maximum allowable CBT block size has been reached.\n"); + break; + } + shift = shift + 1ull; + count = count_by_shift(cbt_map->device_capacity, shift); + pr_debug("Blocks count %llu\n", count); + } + + cbt_map->blk_size_shift = shift; + cbt_map->blk_count = count; + pr_debug("The optimal CBT block size was calculated as %llu bytes\n", + (1ull << cbt_map->blk_size_shift)); +} + +static int cbt_map_allocate(struct cbt_map *cbt_map) +{ + unsigned char *read_map = NULL; + unsigned char *write_map = NULL; + size_t size = cbt_map->blk_count; + + pr_debug("Allocate CBT map of %zu blocks\n", size); + + if (cbt_map->read_map || cbt_map->write_map) + return -EINVAL; + + read_map = __vmalloc(size, GFP_NOIO | __GFP_ZERO); + if (!read_map) + return -ENOMEM; + + write_map = __vmalloc(size, GFP_NOIO | __GFP_ZERO); + if (!write_map) { + vfree(read_map); + return -ENOMEM; + } + + cbt_map->read_map = read_map; + cbt_map->write_map = write_map; + + cbt_map->snap_number_previous = 0; + cbt_map->snap_number_active = 1; + generate_random_uuid(cbt_map->generation_id.b); + cbt_map->is_corrupted = false; + + return 0; +} + +static void cbt_map_deallocate(struct cbt_map *cbt_map) +{ + cbt_map->is_corrupted = false; + + if (cbt_map->read_map) { + vfree(cbt_map->read_map); + cbt_map->read_map = NULL; + } + + if (cbt_map->write_map) { + vfree(cbt_map->write_map); + cbt_map->write_map = NULL; + } +} + +int cbt_map_reset(struct cbt_map *cbt_map, sector_t device_capacity) +{ + cbt_map_deallocate(cbt_map); + + cbt_map->device_capacity = device_capacity; + cbt_map_calculate_block_size(cbt_map); + + return cbt_map_allocate(cbt_map); +} + +void cbt_map_destroy(struct cbt_map *cbt_map) +{ + pr_debug("CBT map destroy\n"); + + cbt_map_deallocate(cbt_map); + kfree(cbt_map); +} + 
+struct cbt_map *cbt_map_create(struct block_device *bdev) +{ + struct cbt_map *cbt_map = NULL; + int ret; + + pr_debug("CBT map create\n"); + + cbt_map = kzalloc(sizeof(struct cbt_map), GFP_KERNEL); + if (cbt_map == NULL) + return NULL; + + cbt_map->device_capacity = bdev_nr_sectors(bdev); + cbt_map_calculate_block_size(cbt_map); + + ret = cbt_map_allocate(cbt_map); + if (ret) { + pr_err("Failed to create tracker. errno=%d\n", abs(ret)); + cbt_map_destroy(cbt_map); + return NULL; + } + + spin_lock_init(&cbt_map->locker); + cbt_map->is_corrupted = false; + + return cbt_map; +} + +void cbt_map_switch(struct cbt_map *cbt_map) +{ + pr_debug("CBT map switch\n"); + spin_lock(&cbt_map->locker); + + cbt_map->snap_number_previous = cbt_map->snap_number_active; + ++cbt_map->snap_number_active; + if (cbt_map->snap_number_active == 256) { + cbt_map->snap_number_active = 1; + + memset(cbt_map->write_map, 0, cbt_map->blk_count); + + generate_random_uuid(cbt_map->generation_id.b); + + pr_debug("CBT reset\n"); + } else + memcpy(cbt_map->read_map, cbt_map->write_map, + cbt_map->blk_count); + spin_unlock(&cbt_map->locker); +} + +static inline int _cbt_map_set(struct cbt_map *cbt_map, sector_t sector_start, + sector_t sector_cnt, u8 snap_number, + unsigned char *map) +{ + int res = 0; + u8 num; + size_t inx; + size_t cbt_block_first = (size_t)( + sector_start >> (cbt_map->blk_size_shift - SECTOR_SHIFT)); + size_t cbt_block_last = (size_t)( + (sector_start + sector_cnt - 1) >> + (cbt_map->blk_size_shift - SECTOR_SHIFT)); + + for (inx = cbt_block_first; inx <= cbt_block_last; ++inx) { + if (unlikely(inx >= cbt_map->blk_count)) { + pr_err("Block index is too large\n"); + pr_err("Block #%zu was demanded, map size %zu blocks\n", + inx, cbt_map->blk_count); + res = -EINVAL; + break; + } + + num = map[inx]; + if (num < snap_number) + map[inx] = snap_number; + } + return res; +} + +int cbt_map_set(struct cbt_map *cbt_map, sector_t sector_start, + sector_t sector_cnt) +{ + int res; + + 
spin_lock(&cbt_map->locker); + if (unlikely(cbt_map->is_corrupted)) { + spin_unlock(&cbt_map->locker); + return -EINVAL; + } + res = _cbt_map_set(cbt_map, sector_start, sector_cnt, + (u8)cbt_map->snap_number_active, cbt_map->write_map); + if (unlikely(res)) + cbt_map->is_corrupted = true; + + spin_unlock(&cbt_map->locker); + + return res; +} + +int cbt_map_set_both(struct cbt_map *cbt_map, sector_t sector_start, + sector_t sector_cnt) +{ + int res; + + spin_lock(&cbt_map->locker); + if (unlikely(cbt_map->is_corrupted)) { + spin_unlock(&cbt_map->locker); + return -EINVAL; + } + res = _cbt_map_set(cbt_map, sector_start, sector_cnt, + (u8)cbt_map->snap_number_active, cbt_map->write_map); + if (!res) + res = _cbt_map_set(cbt_map, sector_start, sector_cnt, + (u8)cbt_map->snap_number_previous, + cbt_map->read_map); + spin_unlock(&cbt_map->locker); + + return res; +} diff --git a/drivers/block/blksnap/cbt_map.h b/drivers/block/blksnap/cbt_map.h new file mode 100644 index 000000000000..95dc17e6bcec --- /dev/null +++ b/drivers/block/blksnap/cbt_map.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_CBT_MAP_H +#define __BLKSNAP_CBT_MAP_H + +#include +#include +#include +#include +#include + +struct blksnap_sectors; + +/** + * struct cbt_map - The table of changes for a block device. + * + * @locker: + * Locking for atomic modification of structure members. + * @blk_size_shift: + * The power of 2 used to specify the change tracking block size. + * @blk_count: + * The number of change tracking blocks. + * @device_capacity: + * The actual capacity of the device. + * @read_map: + * A table of changes available for reading. This is the table that can + * be read after taking a snapshot. + * @write_map: + * The current table for tracking changes. + * @snap_number_active: + * The current sequential number of changes. This is the number that is + * written to the current table when the block data changes. 
+ * @snap_number_previous: + * The previous sequential number of changes. This number is used to + * identify the blocks that were changed between the penultimate snapshot + * and the last snapshot. + * @generation_id: + * UUID of the generation of changes. + * @is_corrupted: + * A flag that the change tracking data is no longer reliable. + * + * The change block tracking map is a byte table. Each byte stores the + * sequential number of changes for one block. To determine which blocks have + * changed since the previous snapshot with the change number 4, it is enough + * to find all bytes with the number more than 4. + * + * Since one byte is allocated to track changes in one block, the change table + * is created again at the 255th snapshot. At the same time, a new unique + * generation identifier is generated. Tracking changes is possible only for + * tables of the same generation. + * + * There are two tables on the change block tracking map. One is available for + * reading, and the other is available for writing. At the moment of taking + * a snapshot, the tables are synchronized. The user's process, when calling + * the corresponding ioctl, can read the readable table. At the same time, the + * change tracking mechanism continues to work with the writable table. + * + * To provide the ability to mount a snapshot image as writeable, it is + * possible to make changes to both of these tables simultaneously. 
+ * + */ +struct cbt_map { + spinlock_t locker; + + size_t blk_size_shift; + size_t blk_count; + sector_t device_capacity; + + unsigned char *read_map; + unsigned char *write_map; + + unsigned long snap_number_active; + unsigned long snap_number_previous; + uuid_t generation_id; + + bool is_corrupted; +}; + +struct cbt_map *cbt_map_create(struct block_device *bdev); +int cbt_map_reset(struct cbt_map *cbt_map, sector_t device_capacity); + +void cbt_map_destroy(struct cbt_map *cbt_map); + +void cbt_map_switch(struct cbt_map *cbt_map); +int cbt_map_set(struct cbt_map *cbt_map, sector_t sector_start, + sector_t sector_cnt); +int cbt_map_set_both(struct cbt_map *cbt_map, sector_t sector_start, + sector_t sector_cnt); + +#endif /* __BLKSNAP_CBT_MAP_H */ diff --git a/drivers/block/blksnap/tracker.c b/drivers/block/blksnap/tracker.c new file mode 100644 index 000000000000..2b8978a2f42e --- /dev/null +++ b/drivers/block/blksnap/tracker.c @@ -0,0 +1,344 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME "-tracker: " fmt + +#include +#include +#include +#include +#include +#include "tracker.h" +#include "cbt_map.h" +#include "diff_area.h" +#include "snapimage.h" +#include "snapshot.h" + +void tracker_free(struct kref *kref) +{ + struct tracker *tracker = container_of(kref, struct tracker, kref); + + might_sleep(); + + pr_debug("Free tracker for device [%u:%u]\n", MAJOR(tracker->dev_id), + MINOR(tracker->dev_id)); + + if (tracker->diff_area) + diff_area_put(tracker->diff_area); + if (tracker->cbt_map) + cbt_map_destroy(tracker->cbt_map); + + kfree(tracker); +} + +static bool tracker_submit_bio(struct bio *bio) +{ + struct blkfilter *flt = bio->bi_bdev->bd_filter; + struct tracker *tracker = container_of(flt, struct tracker, filter); + sector_t count = bio_sectors(bio); + struct bvec_iter copy_iter; + + if (!op_is_write(bio_op(bio)) || !count) + return false; + + copy_iter = bio->bi_iter; + if 
(bio_flagged(bio, BIO_REMAPPED)) + copy_iter.bi_sector -= bio->bi_bdev->bd_start_sect; + + if (cbt_map_set(tracker->cbt_map, copy_iter.bi_sector, count)) + return false; + + if (!atomic_read(&tracker->snapshot_is_taken)) + return false; + /* + * The diff_area is not blocked from releasing now, because + * changing the value of the snapshot_is_taken is performed when + * the block device queue is frozen in tracker_release_snapshot(). + */ + if (diff_area_is_corrupted(tracker->diff_area)) + return false; + + return diff_area_cow(bio, tracker->diff_area, &copy_iter); +} + +static struct blkfilter *tracker_attach(struct block_device *bdev) +{ + struct tracker *tracker = NULL; + struct cbt_map *cbt_map; + + pr_debug("Creating tracker for device [%u:%u]\n", + MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev)); + + cbt_map = cbt_map_create(bdev); + if (!cbt_map) { + pr_err("Failed to create CBT map for device [%u:%u]\n", + MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev)); + return ERR_PTR(-ENOMEM); + } + + tracker = kzalloc(sizeof(struct tracker), GFP_KERNEL); + if (tracker == NULL) { + cbt_map_destroy(cbt_map); + return ERR_PTR(-ENOMEM); + } + + tracker->orig_bdev = bdev; + mutex_init(&tracker->ctl_lock); + INIT_LIST_HEAD(&tracker->link); + kref_init(&tracker->kref); + tracker->dev_id = bdev->bd_dev; + atomic_set(&tracker->snapshot_is_taken, false); + tracker->cbt_map = cbt_map; + tracker->diff_area = NULL; + + pr_debug("New tracker for device [%u:%u] was created\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + + return &tracker->filter; +} + +static void tracker_detach(struct blkfilter *flt) +{ + struct tracker *tracker = container_of(flt, struct tracker, filter); + + pr_debug("Detach tracker from device [%u:%u]\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + + tracker_put(tracker); +} + +static int ctl_cbtinfo(struct tracker *tracker, __u8 __user *buf, __u32 *plen) +{ + struct cbt_map *cbt_map = tracker->cbt_map; + struct blksnap_cbtinfo arg; + + if (!cbt_map) + return 
-ESRCH; + + if (*plen < sizeof(arg)) + return -EINVAL; + + arg.device_capacity = (__u64)(cbt_map->device_capacity << SECTOR_SHIFT); + arg.block_size = (__u32)(1 << cbt_map->blk_size_shift); + arg.block_count = (__u32)cbt_map->blk_count; + export_uuid(arg.generation_id.b, &cbt_map->generation_id); + arg.changes_number = (__u8)cbt_map->snap_number_previous; + + if (copy_to_user(buf, &arg, sizeof(arg))) + return -ENODATA; + + *plen = sizeof(arg); + return 0; +} + +static int ctl_cbtmap(struct tracker *tracker, __u8 __user *buf, __u32 *plen) +{ + struct cbt_map *cbt_map = tracker->cbt_map; + struct blksnap_cbtmap arg; + + if (!cbt_map) + return -ESRCH; + + if (unlikely(cbt_map->is_corrupted)) { + pr_err("CBT table was corrupted\n"); + return -EFAULT; + } + + if (*plen < sizeof(arg)) + return -EINVAL; + + if (copy_from_user(&arg, buf, sizeof(arg))) + return -ENODATA; + + if (arg.length > (cbt_map->blk_count - arg.offset)) + return -ENODATA; + + if (copy_to_user(u64_to_user_ptr(arg.buffer), + cbt_map->read_map + arg.offset, arg.length)) + + return -EINVAL; + + *plen = 0; + return 0; +} + +static int ctl_cbtdirty(struct tracker *tracker, __u8 __user *buf, __u32 *plen) +{ + struct cbt_map *cbt_map = tracker->cbt_map; + struct blksnap_cbtdirty arg; + unsigned int inx; + + if (!cbt_map) + return -ESRCH; + + if (*plen < sizeof(arg)) + return -EINVAL; + + if (copy_from_user(&arg, buf, sizeof(arg))) + return -ENODATA; + + for (inx = 0; inx < arg.count; inx++) { + struct blksnap_sectors range; + int ret; + + if (copy_from_user(&range, u64_to_user_ptr(arg.dirty_sectors), + sizeof(range))) + return -ENODATA; + + ret = cbt_map_set_both(cbt_map, range.offset, range.count); + if (ret) + return ret; + } + *plen = 0; + return 0; +} + +static int ctl_snapshotadd(struct tracker *tracker, + __u8 __user *buf, __u32 *plen) +{ + struct blksnap_snapshotadd arg; + + if (*plen < sizeof(arg)) + return -EINVAL; + + if (copy_from_user(&arg, buf, sizeof(arg))) + return -ENODATA; + + *plen = 0; + 
return snapshot_add_device((uuid_t *)&arg.id, tracker); +} +static int ctl_snapshotinfo(struct tracker *tracker, + __u8 __user *buf, __u32 *plen) +{ + struct blksnap_snapshotinfo arg = {0}; + + if (*plen < sizeof(arg)) + return -EINVAL; + + if (copy_from_user(&arg, buf, sizeof(arg))) + return -ENODATA; + + if (tracker->diff_area && diff_area_is_corrupted(tracker->diff_area)) + arg.error_code = tracker->diff_area->error_code; + else + arg.error_code = 0; + + if (tracker->snap_disk) + strscpy(arg.image, tracker->snap_disk->disk_name, + IMAGE_DISK_NAME_LEN); + + if (copy_to_user(buf, &arg, sizeof(arg))) + return -ENODATA; + + *plen = sizeof(arg); + return 0; +} + +static int (*const ctl_table[])(struct tracker *tracker, + __u8 __user *buf, __u32 *plen) = { + ctl_cbtinfo, + ctl_cbtmap, + ctl_cbtdirty, + ctl_snapshotadd, + ctl_snapshotinfo, +}; + +static int tracker_ctl(struct blkfilter *flt, const unsigned int cmd, + __u8 __user *buf, __u32 *plen) +{ + int ret = 0; + struct tracker *tracker = container_of(flt, struct tracker, filter); + + if (cmd >= ARRAY_SIZE(ctl_table)) + return -ENOTTY; + + mutex_lock(&tracker->ctl_lock); + ret = ctl_table[cmd](tracker, buf, plen); + mutex_unlock(&tracker->ctl_lock); + + return ret; +} + +static struct blkfilter_operations tracker_ops = { + .owner = THIS_MODULE, + .name = "blksnap", + .attach = tracker_attach, + .detach = tracker_detach, + .ctl = tracker_ctl, + .submit_bio = tracker_submit_bio, +}; + +int tracker_take_snapshot(struct tracker *tracker) +{ + int ret = 0; + bool cbt_reset_needed = false; + struct block_device *orig_bdev = tracker->orig_bdev; + sector_t capacity; + unsigned int current_flag; + + blk_mq_freeze_queue(orig_bdev->bd_queue); + current_flag = memalloc_noio_save(); + + if (tracker->cbt_map->is_corrupted) { + cbt_reset_needed = true; + pr_warn("Corrupted CBT table detected. 
CBT fault\n"); + } + + capacity = bdev_nr_sectors(orig_bdev); + if (tracker->cbt_map->device_capacity != capacity) { + cbt_reset_needed = true; + pr_warn("Device resize detected. CBT fault\n"); + } + + if (cbt_reset_needed) { + ret = cbt_map_reset(tracker->cbt_map, capacity); + if (ret) { + pr_err("Failed to create tracker. errno=%d\n", + abs(ret)); + goto out; + } + } + + cbt_map_switch(tracker->cbt_map); + atomic_set(&tracker->snapshot_is_taken, true); +out: + memalloc_noio_restore(current_flag); + blk_mq_unfreeze_queue(orig_bdev->bd_queue); + + return ret; +} + +void tracker_release_snapshot(struct tracker *tracker) +{ + struct diff_area *diff_area = tracker->diff_area; + + if (unlikely(!diff_area)) + return; + + snapimage_free(tracker); + + blk_mq_freeze_queue(tracker->orig_bdev->bd_queue); + + pr_debug("Tracker for device [%u:%u] release snapshot\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + + atomic_set(&tracker->snapshot_is_taken, false); + tracker->diff_area = NULL; + + blk_mq_unfreeze_queue(tracker->orig_bdev->bd_queue); + + diff_area_put(diff_area); +} + +int __init tracker_init(void) +{ + pr_debug("Register filter '%s'\n", tracker_ops.name); + + return blkfilter_register(&tracker_ops); +} + +void tracker_done(void) +{ + pr_debug("Unregister filter '%s'\n", tracker_ops.name); + + blkfilter_unregister(&tracker_ops); +} diff --git a/drivers/block/blksnap/tracker.h b/drivers/block/blksnap/tracker.h new file mode 100644 index 000000000000..05ecc3c3c819 --- /dev/null +++ b/drivers/block/blksnap/tracker.h @@ -0,0 +1,78 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_TRACKER_H +#define __BLKSNAP_TRACKER_H + +#include +#include +#include +#include +#include +#include +#include + +struct cbt_map; +struct diff_area; + +/** + * struct tracker - Tracker for a block device. + * + * @filter: + * The block device filter structure. 
+ * @orig_bdev: + * The original block device this tracker is attached to. + * @ctl_lock: + * The mutex blocks simultaneous management of the tracker from different + * threads. + * @link: + * List header. Allows trackers to be combined into a list in a snapshot. + * @kref: + * The reference counter controls the lifetime of the tracker. + * @dev_id: + * Original block device ID. + * @snapshot_is_taken: + * Indicates that a snapshot was taken for the device whose I/O units are + * handled by this tracker. + * @cbt_map: + * Pointer to a change block tracker map. + * @diff_area: + * Pointer to a difference area. + * @snap_disk: + * Snapshot image disk. + * + * The goal of the tracker is to handle I/O units. The tracker detects the range + * of sectors that will change and transmits them to the CBT map and to the + * difference area. + */ +struct tracker { + struct blkfilter filter; + struct block_device *orig_bdev; + struct mutex ctl_lock; + struct list_head link; + struct kref kref; + dev_t dev_id; + + atomic_t snapshot_is_taken; + + struct cbt_map *cbt_map; + struct diff_area *diff_area; + struct gendisk *snap_disk; +}; + +int __init tracker_init(void); +void tracker_done(void); + +void tracker_free(struct kref *kref); +static inline void tracker_put(struct tracker *tracker) +{ + if (likely(tracker)) + kref_put(&tracker->kref, tracker_free); +}; +static inline void tracker_get(struct tracker *tracker) +{ + kref_get(&tracker->kref); +}; +int tracker_take_snapshot(struct tracker *tracker); +void tracker_release_snapshot(struct tracker *tracker); + +#endif /* __BLKSNAP_TRACKER_H */ From patchwork Fri Nov 24 16:59:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169528 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1371258vqx; Fri, 24 Nov 2023 09:02:35 -0800 (PST) X-Google-Smtp-Source: 
AGHT+IHzDqyLlp8lcs5fGldT/cktwIFtx3tBmj3y/5KkNCDngtrxSV/i+Ot/JFpFfdgLjq2FzWkx X-Received: by 2002:a05:6e02:b43:b0:35b:e80:57e2 with SMTP id f3-20020a056e020b4300b0035b0e8057e2mr4139343ilu.32.1700845354959; Fri, 24 Nov 2023 09:02:34 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700845354; cv=none; d=google.com; s=arc-20160816; b=kklUPWm6OMNcNVbn/pJhm+YFi8MMuewCrVAAa8+5tEcVxCZee2iCkDjJpfursEjoGY fVikseNM1PcejQL+qHx3uMYykXwfmBH1IOWGIAPJw4a40yQ4uztJb5veRZYhc+cc2eGM fwIxao9kL0FTXqQKDlth4SYmEN8Abbj8JL8pfCxmPeD6+Vn4XTzqw+twAtAf0cZCV0oO jOrxrOFY8uof1j34GBDLAhJ7tbt4gGPJYdQWI73RpAWW/v5j92N3yV8vczBty8z5i+ic CAApL6NsIJAcRIkrKM/1PVW77Lyblvbeb2jUPsmFoyKtFoR94lr70ITRElKMQaHwAeAo t12A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=chjpZrLLQa/OxHDhXgDf/zDRbpLMuNlM6DfyhZPjyv4=; fh=2mCdnEsEQzaNNg7WB3fw4oVqqq+eEoqCz0tNA1gXC/4=; b=lvNlgvFksSWAbVSXti1lE7Dd795X9kwJEJYqyvkoBjq93Q/h9WPqUOy7bthx++7u7s ewFfUDV0sJHq2td430Cki2Ejtu6aIjsOSvXgF54nK4qkIHoRVnQEjGg8ripk/BaKRnUQ g0uZ2gwUCaso5W3v6DjzklehXV81y2mSktvJ2Y7KjA09oOM1TAM7sCjvVgYVJ7Eok0rZ /LxbqqIKtZbs5wkd1m3V88U3ta2jx7BtICDKHnYgeHdBw4vcIBJdJW/uqy8rRjzoS/j+ M5zAEzGbu1eKf9cKy2D29PT3SnwCIEl4OKwdxY9Ebf33U8B7fP1xAmgONSN6CEgSJZrL +XXQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=Mh73GDxM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from groat.vger.email (groat.vger.email. 
[2620:137:e000::3:5]) by mx.google.com with ESMTPS id h19-20020a056e021d9300b0035b2559d495si1878406ila.73.2023.11.24.09.02.27 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 09:02:34 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) client-ip=2620:137:e000::3:5; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=Mh73GDxM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by groat.vger.email (Postfix) with ESMTP id 6D98883D155F; Fri, 24 Nov 2023 09:01:08 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at groat.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232709AbjKXRAq (ORCPT + 99 others); Fri, 24 Nov 2023 12:00:46 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59062 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231327AbjKXRAT (ORCPT ); Fri, 24 Nov 2023 12:00:19 -0500 Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [IPv6:2001:41d0:1004:224b::b3]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DCA561FD4 for ; Fri, 24 Nov 2023 09:00:12 -0800 (PST) X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700845211; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=chjpZrLLQa/OxHDhXgDf/zDRbpLMuNlM6DfyhZPjyv4=; b=Mh73GDxMt+28XIgFE80dWsILGMsQSLQsBFwoTPqmlCQ84W8c5gXF3jsjTlrP62fMT0HRky esuiSk15p/IuhnOq6yn2va2+D3z4bn+6dwoV09Gzn32sPXCDtltZDNbyYNt4xotxmq+l+2 qZBkUOJUqGyCbzVmq83euo6o2v2lVqo= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa Subject: [PATCH v6 07/11] blksnap: difference storage and chunk Date: Fri, 24 Nov 2023 17:59:29 +0100 Message-Id: <20231124165933.27580-8-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on groat.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (groat.vger.email [0.0.0.0]); Fri, 24 Nov 2023 09:01:08 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783465618523999474 X-GMAIL-MSGID: 1783465618523999474 From: Sergei Shtepa The struct diff_area manages the difference blocks of block devices: it stores difference blocks and reads them back to produce snapshot images. 
The struct chunk describes the minimum data storage unit of the original block device. Functions for working with these minimal blocks implement algorithms for reading and writing blocks. Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Signed-off-by: Sergei Shtepa --- drivers/block/blksnap/chunk.c | 667 +++++++++++++++++++++++++++ drivers/block/blksnap/chunk.h | 142 ++++++ drivers/block/blksnap/diff_area.c | 601 ++++++++++++++++++++++++ drivers/block/blksnap/diff_area.h | 175 +++++++ drivers/block/blksnap/diff_buffer.c | 115 +++++ drivers/block/blksnap/diff_buffer.h | 37 ++ drivers/block/blksnap/diff_storage.c | 291 ++++++++++++ drivers/block/blksnap/diff_storage.h | 104 +++++ 8 files changed, 2132 insertions(+) create mode 100644 drivers/block/blksnap/chunk.c create mode 100644 drivers/block/blksnap/chunk.h create mode 100644 drivers/block/blksnap/diff_area.c create mode 100644 drivers/block/blksnap/diff_area.h create mode 100644 drivers/block/blksnap/diff_buffer.c create mode 100644 drivers/block/blksnap/diff_buffer.h create mode 100644 drivers/block/blksnap/diff_storage.c create mode 100644 drivers/block/blksnap/diff_storage.h diff --git a/drivers/block/blksnap/chunk.c b/drivers/block/blksnap/chunk.c new file mode 100644 index 000000000000..cd82d0d9d6fd --- /dev/null +++ b/drivers/block/blksnap/chunk.c @@ -0,0 +1,667 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME "-chunk: " fmt + +#include +#include +#include "chunk.h" +#include "diff_buffer.h" +#include "diff_storage.h" +#include "params.h" + +struct chunk_bio { + struct work_struct work; + struct list_head chunks; + struct bio *orig_bio; + struct bvec_iter orig_iter; + struct bio bio; +}; + +static struct bio_set chunk_io_bioset; +static struct bio_set chunk_clone_bioset; + +static inline sector_t chunk_sector(struct chunk *chunk) +{ + return (sector_t)(chunk->number) + << (chunk->diff_area->chunk_shift - 
SECTOR_SHIFT); +} + +static inline sector_t chunk_sector_end(struct chunk *chunk) +{ + return chunk_sector(chunk) + chunk->sector_count; +} + +void chunk_store_failed(struct chunk *chunk, int error) +{ + struct diff_area *diff_area = diff_area_get(chunk->diff_area); + + WARN_ON_ONCE(chunk->state != CHUNK_ST_NEW && + chunk->state != CHUNK_ST_IN_MEMORY); + chunk->state = CHUNK_ST_FAILED; + + if (likely(chunk->diff_buffer)) { + diff_buffer_release(diff_area, chunk->diff_buffer); + chunk->diff_buffer = NULL; + } + + chunk_up(chunk); + if (error) + diff_area_set_corrupted(diff_area, error); + diff_area_put(diff_area); +}; + +static inline void chunk_io_failed(struct chunk *chunk) +{ + struct diff_area *diff_area = diff_area_get(chunk->diff_area); + + if (likely(chunk->diff_buffer)) { + diff_buffer_release(diff_area, chunk->diff_buffer); + chunk->diff_buffer = NULL; + } + + chunk_up(chunk); + diff_area_put(diff_area); +} + +static void chunk_schedule_storing(struct chunk *chunk) +{ + struct diff_area *diff_area = diff_area_get(chunk->diff_area); + bool need_work = false; + + WARN_ON_ONCE(chunk->state != CHUNK_ST_NEW && + chunk->state != CHUNK_ST_STORED); + chunk->state = CHUNK_ST_IN_MEMORY; + + spin_lock(&diff_area->store_queue_lock); + list_add_tail(&chunk->link, &diff_area->store_queue); + + need_work = (atomic_inc_return(&diff_area->store_queue_count) > + get_chunk_maximum_in_queue()) && + !diff_area->store_queue_processing; + if (need_work) + diff_area->store_queue_processing = true; + spin_unlock(&diff_area->store_queue_lock); + + chunk_up(chunk); + + if (need_work) { + /* Initiate the queue clearing process */ + queue_work(system_wq, &diff_area->store_queue_work); + } + diff_area_put(diff_area); +} + +void chunk_copy_bio(struct chunk *chunk, struct bio *bio, + struct bvec_iter *iter) +{ + unsigned int chunk_ofs, chunk_left; + + chunk_ofs = (iter->bi_sector - chunk_sector(chunk)) << SECTOR_SHIFT; + chunk_left = chunk->diff_buffer->size - chunk_ofs; + while 
(chunk_left && iter->bi_size) { + struct bio_vec bvec = bio_iter_iovec(bio, *iter); + unsigned int page_ofs = offset_in_page(chunk_ofs); + unsigned int inx = chunk_ofs >> PAGE_SHIFT; + struct page *page = chunk->diff_buffer->bvec[inx].bv_page; + unsigned int len; + + len = min3(bvec.bv_len, + chunk_left, + (unsigned int)PAGE_SIZE - page_ofs); + + if (op_is_write(bio_op(bio))) { + /* from bio to buffer */ + memcpy_page(page, page_ofs, + bvec.bv_page, bvec.bv_offset, len); + } else { + /* from buffer to bio */ + memcpy_page(bvec.bv_page, bvec.bv_offset, + page, page_ofs, len); + } + + chunk_ofs += len; + chunk_left -= len; + bio_advance_iter_single(bio, iter, len); + } +} + +static void chunk_clone_endio(struct bio *bio) +{ + struct bio *orig_bio = bio->bi_private; + + if (unlikely(bio->bi_status != BLK_STS_OK)) + bio_io_error(orig_bio); + else + bio_endio(orig_bio); +} + +static inline sector_t chunk_offset(struct chunk *chunk, struct bio *bio) +{ + return bio->bi_iter.bi_sector - chunk_sector(chunk); +} + +static inline void chunk_limit_iter(struct chunk *chunk, struct bio *bio, + sector_t sector, struct bvec_iter *iter) +{ + sector_t chunk_ofs = chunk_offset(chunk, bio); + + iter->bi_sector = sector + chunk_ofs; + iter->bi_size = min_t(unsigned int, + bio->bi_iter.bi_size, + (chunk->sector_count - chunk_ofs) << SECTOR_SHIFT); +} + +static inline unsigned int chunk_limit(struct chunk *chunk, struct bio *bio) +{ + unsigned int chunk_ofs, chunk_left; + + chunk_ofs = (unsigned int)chunk_offset(chunk, bio) << SECTOR_SHIFT; + chunk_left = chunk->diff_buffer->size - chunk_ofs; + + return min(bio->bi_iter.bi_size, chunk_left); +} + +struct bio *chunk_alloc_clone(struct block_device *bdev, struct bio *bio) +{ + return bio_alloc_clone(bdev, bio, GFP_NOIO, &chunk_clone_bioset); +} + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) +void chunk_diff_bio_tobdev(struct chunk *chunk, struct bio *bio) +{ + struct bio *new_bio; + + new_bio = chunk_alloc_clone(chunk->diff_bdev, bio); + 
chunk_limit_iter(chunk, bio, chunk->diff_ofs_sect, &new_bio->bi_iter); + new_bio->bi_end_io = chunk_clone_endio; + new_bio->bi_private = bio; + + bio_advance(bio, new_bio->bi_iter.bi_size); + bio_inc_remaining(bio); + + submit_bio_noacct(new_bio); +} +#endif + +static inline void chunk_io_ctx_free(struct chunk_io_ctx *io_ctx, long ret) +{ + struct chunk *chunk = io_ctx->chunk; + struct bio *bio = io_ctx->bio; + + kfree(io_ctx); + if (ret < 0) { + bio_io_error(bio); + chunk_io_failed(chunk); + return; + } + + bio_endio(bio); + chunk_up(chunk); +} + +#ifdef CONFIG_BLKSNAP_CHUNK_DIFF_BIO_SYNC +void chunk_diff_bio_execute(struct chunk_io_ctx *io_ctx) +{ + struct file *diff_file = io_ctx->chunk->diff_file; + ssize_t len; + + if (io_ctx->iov_iter.data_source) { + file_start_write(diff_file); + len = vfs_iter_write(diff_file, &io_ctx->iov_iter, + &io_ctx->pos, 0); + file_end_write(diff_file); + } else { + len = vfs_iter_read(diff_file, &io_ctx->iov_iter, + &io_ctx->pos, 0); + } + + chunk_io_ctx_free(io_ctx, len); +} +#else +static void chunk_diff_bio_complete_read(struct kiocb *iocb, long ret) +{ + struct chunk_io_ctx *io_ctx; + + io_ctx = container_of(iocb, struct chunk_io_ctx, iocb); + chunk_io_ctx_free(io_ctx, ret); +} + +static void chunk_diff_bio_complete_write(struct kiocb *iocb, long ret) +{ + struct chunk_io_ctx *io_ctx; + + io_ctx = container_of(iocb, struct chunk_io_ctx, iocb); + file_end_write(io_ctx->iocb.ki_filp); + chunk_io_ctx_free(io_ctx, ret); +} + +static inline void chunk_diff_bio_execute_write(struct chunk_io_ctx *io_ctx) +{ + struct file *diff_file = io_ctx->chunk->diff_file; + ssize_t len; + + file_start_write(diff_file); + len = vfs_iocb_iter_write(diff_file, &io_ctx->iocb, &io_ctx->iov_iter); + + if (len != -EIOCBQUEUED) { + if (unlikely(len < 0)) + pr_err("Failed to write data to difference storage\n"); + file_end_write(diff_file); + chunk_io_ctx_free(io_ctx, len); + } +} + +static inline void chunk_diff_bio_execute_read(struct chunk_io_ctx 
*io_ctx) +{ + struct file *diff_file = io_ctx->chunk->diff_file; + ssize_t len; + + len = vfs_iocb_iter_read(diff_file, &io_ctx->iocb, &io_ctx->iov_iter); + if (len != -EIOCBQUEUED) { + if (unlikely(len < 0)) + pr_err("Failed to read data from difference storage\n"); + chunk_io_ctx_free(io_ctx, len); + } +} + +void chunk_diff_bio_execute(struct chunk_io_ctx *io_ctx) +{ + if (io_ctx->iov_iter.data_source) + chunk_diff_bio_execute_write(io_ctx); + else + chunk_diff_bio_execute_read(io_ctx); +} +#endif + +static inline void chunk_diff_bio_schedule(struct diff_area *diff_area, + struct chunk_io_ctx *io_ctx) +{ + spin_lock(&diff_area->image_io_queue_lock); + list_add_tail(&io_ctx->link, &diff_area->image_io_queue); + spin_unlock(&diff_area->image_io_queue_lock); + queue_work(system_wq, &diff_area->image_io_work); +} + +/* + * The data from the bio is written to the diff file or read from it. + */ +int chunk_diff_bio(struct chunk *chunk, struct bio *bio) +{ + bool is_write = op_is_write(bio_op(bio)); + loff_t chunk_ofs, chunk_left; + struct bio_vec iter_bvec, *bio_bvec; + struct bvec_iter iter; + unsigned long nr_segs = 0; + size_t nbytes = 0; + struct chunk_io_ctx *io_ctx; + + io_ctx = kzalloc(sizeof(struct chunk_io_ctx), GFP_NOIO); + if (!io_ctx) + return -ENOMEM; + + chunk_ofs = (bio->bi_iter.bi_sector - chunk_sector(chunk)) + << SECTOR_SHIFT; + chunk_left = (chunk->sector_count << SECTOR_SHIFT) - chunk_ofs; + bio_for_each_segment(iter_bvec, bio, iter) { + if (chunk_left == 0) + break; + + if (chunk_left > iter_bvec.bv_len) { + chunk_left -= iter_bvec.bv_len; + nbytes += iter_bvec.bv_len; + } else { + nbytes += chunk_left; + chunk_left = 0; + } + nr_segs++; + } + bio_bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); + iov_iter_bvec(&io_ctx->iov_iter, is_write ? 
WRITE : READ, + bio_bvec, nr_segs, nbytes); + io_ctx->iov_iter.iov_offset = bio->bi_iter.bi_bvec_done; +#ifdef CONFIG_BLKSNAP_CHUNK_DIFF_BIO_SYNC + io_ctx->pos = (chunk->diff_ofs_sect << SECTOR_SHIFT) + chunk_ofs; +#else + io_ctx->iocb.ki_filp = chunk->diff_file; + io_ctx->iocb.ki_pos = (chunk->diff_ofs_sect << SECTOR_SHIFT) + + chunk_ofs; + io_ctx->iocb.ki_flags = IOCB_DIRECT; + if (is_write) + io_ctx->iocb.ki_flags |= IOCB_WRITE; + io_ctx->iocb.ki_ioprio = get_current_ioprio(); + io_ctx->iocb.ki_complete = is_write ? chunk_diff_bio_complete_write + : chunk_diff_bio_complete_read; +#endif + io_ctx->chunk = chunk; + io_ctx->bio = bio; + bio_inc_remaining(bio); + bio_advance(bio, nbytes); + + chunk_diff_bio_schedule(chunk->diff_area, io_ctx); + + return 0; +} + +static inline struct chunk *get_chunk_from_cbio(struct chunk_bio *cbio) +{ + struct chunk *chunk = list_first_entry_or_null(&cbio->chunks, + struct chunk, link); + + if (chunk) + list_del_init(&chunk->link); + return chunk; +} + +static void notify_load_and_schedule_io(struct work_struct *work) +{ + struct chunk_bio *cbio = container_of(work, struct chunk_bio, work); + struct chunk *chunk; + + while ((chunk = get_chunk_from_cbio(cbio))) { + if (unlikely(cbio->bio.bi_status != BLK_STS_OK)) { + chunk_store_failed(chunk, -EIO); + continue; + } + if (chunk->state == CHUNK_ST_FAILED) { + chunk_up(chunk); + continue; + } + + chunk_copy_bio(chunk, cbio->orig_bio, &cbio->orig_iter); + bio_endio(cbio->orig_bio); + + chunk_schedule_storing(chunk); + } + + bio_put(&cbio->bio); +} + +static void notify_load_and_postpone_io(struct work_struct *work) +{ + struct chunk_bio *cbio = container_of(work, struct chunk_bio, work); + struct chunk *chunk; + + while ((chunk = get_chunk_from_cbio(cbio))) { + if (unlikely(cbio->bio.bi_status != BLK_STS_OK)) { + chunk_store_failed(chunk, -EIO); + continue; + } + if (chunk->state == CHUNK_ST_FAILED) { + chunk_up(chunk); + continue; + } + + chunk_schedule_storing(chunk); + } + + /* 
submit the original bio fed into the tracker */ + submit_bio_noacct_nocheck(cbio->orig_bio); + bio_put(&cbio->bio); +} + +static void chunk_notify_store(struct chunk *chunk, int err) +{ + if (err) { + chunk_store_failed(chunk, err); + return; + } + + WARN_ON_ONCE(chunk->state != CHUNK_ST_IN_MEMORY); + chunk->state = CHUNK_ST_STORED; + + if (chunk->diff_buffer) { + diff_buffer_release(chunk->diff_area, + chunk->diff_buffer); + chunk->diff_buffer = NULL; + } + chunk_up(chunk); +} + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) +static void chunk_notify_store_tobdev(struct work_struct *work) +{ + struct chunk_bio *cbio = container_of(work, struct chunk_bio, work); + struct chunk *chunk; + + while ((chunk = get_chunk_from_cbio(cbio))) { + if (unlikely(cbio->bio.bi_status != BLK_STS_OK)) { + chunk_store_failed(chunk, -EIO); + continue; + } + + WARN_ON_ONCE(chunk->state != CHUNK_ST_IN_MEMORY); + chunk->state = CHUNK_ST_STORED; + + if (chunk->diff_buffer) { + diff_buffer_release(chunk->diff_area, + chunk->diff_buffer); + chunk->diff_buffer = NULL; + } + chunk_up(chunk); + } + + bio_put(&cbio->bio); +} +#endif + +static void chunk_io_endio(struct bio *bio) +{ + struct chunk_bio *cbio = container_of(bio, struct chunk_bio, bio); + + queue_work(system_wq, &cbio->work); +} + +static void chunk_submit_bio(struct bio *bio) +{ + bio->bi_end_io = chunk_io_endio; + submit_bio_noacct(bio); +} + +static inline unsigned short calc_max_vecs(sector_t left) +{ + return bio_max_segs(round_up(left, PAGE_SECTORS) / PAGE_SECTORS); +} + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) +void chunk_store_tobdev(struct chunk *chunk) +{ + struct block_device *bdev = chunk->diff_bdev; + sector_t sector = chunk->diff_ofs_sect; + sector_t count = chunk->sector_count; + unsigned int inx = 0; + struct bio *bio; + struct chunk_bio *cbio; + + bio = bio_alloc_bioset(bdev, calc_max_vecs(count), + REQ_OP_WRITE | REQ_SYNC | REQ_FUA, GFP_NOIO, + &chunk_io_bioset); + bio->bi_iter.bi_sector = sector; + + while (count) { + 
struct bio *next; + sector_t portion = min_t(sector_t, count, PAGE_SECTORS); + unsigned int bytes = portion << SECTOR_SHIFT; + + if (bio_add_page(bio, chunk->diff_buffer->bvec[inx].bv_page, + bytes, 0) == bytes) { + inx++; + count -= portion; + continue; + } + + /* Create next bio */ + next = bio_alloc_bioset(bdev, calc_max_vecs(count), + REQ_OP_WRITE | REQ_SYNC | REQ_FUA, + GFP_NOIO, &chunk_io_bioset); + next->bi_iter.bi_sector = bio_end_sector(bio); + bio_chain(bio, next); + submit_bio_noacct(bio); + bio = next; + } + + cbio = container_of(bio, struct chunk_bio, bio); + + INIT_WORK(&cbio->work, chunk_notify_store_tobdev); + INIT_LIST_HEAD(&cbio->chunks); + list_add_tail(&chunk->link, &cbio->chunks); + cbio->orig_bio = NULL; + chunk_submit_bio(bio); +} +#endif + +/* + * Synchronously store chunk to diff file. + */ +void chunk_diff_write(struct chunk *chunk) +{ + loff_t pos = chunk->diff_ofs_sect << SECTOR_SHIFT; + size_t length = chunk->sector_count << SECTOR_SHIFT; + struct iov_iter iov_iter; + ssize_t len; + int err = 0; + + iov_iter_bvec(&iov_iter, ITER_SOURCE, chunk->diff_buffer->bvec, + chunk->diff_buffer->nr_pages, length); + file_start_write(chunk->diff_file); + while (length) { + len = vfs_iter_write(chunk->diff_file, &iov_iter, &pos, 0); + if (len < 0) { + err = (int)len; + pr_debug("vfs_iter_write complete with error code %zd", + len); + break; + } + length -= len; + } + file_end_write(chunk->diff_file); + chunk_notify_store(chunk, err); +} + +static struct bio *chunk_origin_load_async(struct chunk *chunk) +{ + struct block_device *bdev; + struct bio *bio = NULL; + struct diff_buffer *diff_buffer; + unsigned int inx = 0; + sector_t sector, count = chunk->sector_count; + + diff_buffer = diff_buffer_take(chunk->diff_area); + if (IS_ERR(diff_buffer)) + return ERR_CAST(diff_buffer); + chunk->diff_buffer = diff_buffer; + + bdev = chunk->diff_area->orig_bdev; + sector = chunk_sector(chunk); + + bio = bio_alloc_bioset(bdev, calc_max_vecs(count), + REQ_OP_READ, 
GFP_NOIO, &chunk_io_bioset); + bio->bi_iter.bi_sector = sector; + + while (count) { + struct bio *next; + sector_t portion = min_t(sector_t, count, PAGE_SECTORS); + unsigned int bytes = portion << SECTOR_SHIFT; + struct page *pg = chunk->diff_buffer->bvec[inx].bv_page; + + if (bio_add_page(bio, pg, bytes, 0) == bytes) { + inx++; + count -= portion; + continue; + } + + /* Create next bio */ + next = bio_alloc_bioset(bdev, calc_max_vecs(count), + REQ_OP_READ, GFP_NOIO, + &chunk_io_bioset); + next->bi_iter.bi_sector = bio_end_sector(bio); + bio_chain(bio, next); + submit_bio_noacct(bio); + bio = next; + } + + return bio; +} + +/* + * Load the chunk asynchronously. + */ +int chunk_load_and_postpone_io(struct chunk *chunk, struct bio **chunk_bio) +{ + struct bio *prev = *chunk_bio, *bio; + + bio = chunk_origin_load_async(chunk); + if (IS_ERR(bio)) + return PTR_ERR(bio); + + if (prev) { + bio_chain(prev, bio); + submit_bio_noacct(prev); + } + + *chunk_bio = bio; + return 0; +} + +void chunk_load_and_postpone_io_finish(struct list_head *chunks, + struct bio *chunk_bio, struct bio *orig_bio) +{ + struct chunk_bio *cbio; + + cbio = container_of(chunk_bio, struct chunk_bio, bio); + INIT_LIST_HEAD(&cbio->chunks); + while (!list_empty(chunks)) { + struct chunk *it; + + it = list_first_entry(chunks, struct chunk, link); + list_del_init(&it->link); + + list_add_tail(&it->link, &cbio->chunks); + } + INIT_WORK(&cbio->work, notify_load_and_postpone_io); + cbio->orig_bio = orig_bio; + chunk_submit_bio(chunk_bio); +} + +bool chunk_load_and_schedule_io(struct chunk *chunk, struct bio *orig_bio) +{ + struct chunk_bio *cbio; + struct bio *bio; + + bio = chunk_origin_load_async(chunk); + if (IS_ERR(bio)) { + chunk_up(chunk); + return false; + } + + cbio = container_of(bio, struct chunk_bio, bio); + INIT_LIST_HEAD(&cbio->chunks); + list_add_tail(&chunk->link, &cbio->chunks); + INIT_WORK(&cbio->work, notify_load_and_schedule_io); + cbio->orig_bio = orig_bio; + cbio->orig_iter = 
orig_bio->bi_iter; + bio_advance_iter_single(orig_bio, &orig_bio->bi_iter, + chunk_limit(chunk, orig_bio)); + bio_inc_remaining(orig_bio); + + chunk_submit_bio(bio); + return true; +} + +int __init chunk_init(void) +{ + int ret; + + ret = bioset_init(&chunk_io_bioset, 64, + offsetof(struct chunk_bio, bio), + BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER); + if (!ret) + ret = bioset_init(&chunk_clone_bioset, 64, 0, + BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER); + return ret; +} + +void chunk_done(void) +{ + bioset_exit(&chunk_io_bioset); + bioset_exit(&chunk_clone_bioset); +} diff --git a/drivers/block/blksnap/chunk.h b/drivers/block/blksnap/chunk.h new file mode 100644 index 000000000000..148e4c07b883 --- /dev/null +++ b/drivers/block/blksnap/chunk.h @@ -0,0 +1,142 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_CHUNK_H +#define __BLKSNAP_CHUNK_H + +#include +#include +#include +#include +#include "diff_area.h" + +struct diff_area; + +/** + * enum chunk_st - Possible states for a chunk. + * + * @CHUNK_ST_NEW: + * No data is associated with the chunk. + * @CHUNK_ST_IN_MEMORY: + * The data of the chunk is ready to be read from the RAM buffer. + * The flag is removed when a chunk is removed from the store queue + * and its buffer is released. + * @CHUNK_ST_STORED: + * The data of the chunk has been written to the difference storage. + * @CHUNK_ST_FAILED: + * An error occurred while processing the chunk data. + * + * Chunk life cycle: + * CHUNK_ST_NEW -> CHUNK_ST_IN_MEMORY <-> CHUNK_ST_STORED + */ + +enum chunk_st { + CHUNK_ST_NEW, + CHUNK_ST_IN_MEMORY, + CHUNK_ST_STORED, + CHUNK_ST_FAILED, +}; + +/** + * struct chunk - Minimum data storage unit. + * + * @link: + * The list header allows creating a queue of chunks. + * @number: + * Sequential number of the chunk. + * @sector_count: + * Number of sectors in the current chunk. This is especially true + * for the last chunk. + * @lock: + * Binary semaphore. 
Synchronizes access to the chunk's fields: state, + * diff_buffer, diff_file and diff_ofs_sect. + * @diff_area: + * Pointer to the difference area - the difference storage area for a + * specific device. This field is only available when the chunk is locked. + * It protects the difference area from early release. + * @state: + * Defines the state of a chunk. + * @diff_bdev: + * The difference storage block device. + * @diff_file: + * The difference storage file. + * @diff_ofs_sect: + * The sector offset of the region's first sector. + * @diff_buffer: + * Pointer to &struct diff_buffer. Describes a buffer in the memory + * for storing the chunk data. + * on the difference storage. + * + * This structure describes the block of data that the module operates + * with when executing the copy-on-write algorithm and when performing I/O + * to snapshot images. + * + * If the data of the chunk has been changed, then the chunk is placed in the + * store queue. The queue provides caching of chunks. Saving chunks to the storage is + * performed in a separate working thread. This ensures the best system + * performance. + * + * The semaphore is blocked for writing if there is no actual data in the + * buffer, since a block of data is being read from the original device or + * from a difference storage. If data is being read from or written to the + * diff_buffer, the semaphore must be locked. 
+ */ +struct chunk { + struct list_head link; + unsigned long number; + sector_t sector_count; + + struct semaphore lock; + struct diff_area *diff_area; + + enum chunk_st state; + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + struct block_device *diff_bdev; +#endif + struct file *diff_file; + sector_t diff_ofs_sect; + + struct diff_buffer *diff_buffer; +}; + +static inline void chunk_up(struct chunk *chunk) +{ + struct diff_area *diff_area = chunk->diff_area; + + chunk->diff_area = NULL; + up(&chunk->lock); + diff_area_put(diff_area); +}; + +struct chunk_io_ctx { + struct list_head link; +#ifdef CONFIG_BLKSNAP_CHUNK_DIFF_BIO_SYNC + loff_t pos; +#else + struct kiocb iocb; +#endif + struct iov_iter iov_iter; + struct chunk *chunk; + struct bio *bio; +}; +void chunk_diff_bio_execute(struct chunk_io_ctx *io_ctx); + +void chunk_store_failed(struct chunk *chunk, int error); +struct bio *chunk_alloc_clone(struct block_device *bdev, struct bio *bio); + +void chunk_copy_bio(struct chunk *chunk, struct bio *bio, + struct bvec_iter *iter); +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) +void chunk_diff_bio_tobdev(struct chunk *chunk, struct bio *bio); +void chunk_store_tobdev(struct chunk *chunk); +#endif +int chunk_diff_bio(struct chunk *chunk, struct bio *bio); +void chunk_diff_write(struct chunk *chunk); +bool chunk_load_and_schedule_io(struct chunk *chunk, struct bio *orig_bio); +int chunk_load_and_postpone_io(struct chunk *chunk, struct bio **chunk_bio); +void chunk_load_and_postpone_io_finish(struct list_head *chunks, + struct bio *chunk_bio, struct bio *orig_bio); + +int __init chunk_init(void); +void chunk_done(void); +#endif /* __BLKSNAP_CHUNK_H */ diff --git a/drivers/block/blksnap/diff_area.c b/drivers/block/blksnap/diff_area.c new file mode 100644 index 000000000000..965c9822ec27 --- /dev/null +++ b/drivers/block/blksnap/diff_area.c @@ -0,0 +1,601 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME 
"-diff-area: " fmt + +#include +#include +#include +#include +#include "chunk.h" +#include "diff_buffer.h" +#include "diff_storage.h" +#include "params.h" +#include "tracker.h" + +static inline sector_t diff_area_chunk_offset(struct diff_area *diff_area, + sector_t sector) +{ + return sector & ((1ull << (diff_area->chunk_shift - SECTOR_SHIFT)) - 1); +} + +static inline unsigned long diff_area_chunk_number(struct diff_area *diff_area, + sector_t sector) +{ + return (unsigned long)(sector >> + (diff_area->chunk_shift - SECTOR_SHIFT)); +} + +static inline sector_t chunk_sector(struct chunk *chunk) +{ + return (sector_t)(chunk->number) + << (chunk->diff_area->chunk_shift - SECTOR_SHIFT); +} + +static inline sector_t last_chunk_size(sector_t sector_count, sector_t capacity) +{ + sector_t capacity_rounded = round_down(capacity, sector_count); + + if (capacity > capacity_rounded) + sector_count = capacity - capacity_rounded; + + return sector_count; +} + +static inline unsigned long long count_by_shift(sector_t capacity, + unsigned long long shift) +{ + unsigned long long shift_sector = (shift - SECTOR_SHIFT); + + return round_up(capacity, (1ull << shift_sector)) >> shift_sector; +} + +static inline struct chunk *chunk_alloc(struct diff_area *diff_area, + unsigned long number) +{ + struct chunk *chunk; + + chunk = kzalloc(sizeof(struct chunk), GFP_NOIO); + if (!chunk) + return NULL; + + INIT_LIST_HEAD(&chunk->link); + sema_init(&chunk->lock, 1); + chunk->diff_area = NULL; + chunk->number = number; + chunk->state = CHUNK_ST_NEW; + + chunk->sector_count = diff_area_chunk_sectors(diff_area); + /* + * The last chunk has a special size. 
+ */ + if (unlikely((number + 1) == diff_area->chunk_count)) { + chunk->sector_count = bdev_nr_sectors(diff_area->orig_bdev) - + (chunk->sector_count * number); + } + + return chunk; +} + +static inline void chunk_free(struct diff_area *diff_area, struct chunk *chunk) +{ + down(&chunk->lock); + if (chunk->diff_buffer) + diff_buffer_release(diff_area, chunk->diff_buffer); + up(&chunk->lock); + kfree(chunk); +} + +static void diff_area_calculate_chunk_size(struct diff_area *diff_area) +{ + unsigned long count; + unsigned long shift = get_chunk_minimum_shift(); + sector_t capacity; + sector_t min_io_sect; + + min_io_sect = (sector_t)(bdev_io_min(diff_area->orig_bdev) >> + SECTOR_SHIFT); + capacity = bdev_nr_sectors(diff_area->orig_bdev); + pr_debug("Minimal IO block %llu sectors\n", min_io_sect); + pr_debug("Device capacity %llu sectors\n", capacity); + + count = count_by_shift(capacity, shift); + pr_debug("Chunks count %lu\n", count); + while ((count > get_chunk_maximum_count()) || + ((1ul << (shift - SECTOR_SHIFT)) < min_io_sect)) { + shift++; + count = count_by_shift(capacity, shift); + pr_debug("Chunks count %lu\n", count); + } + + diff_area->chunk_shift = shift; + diff_area->chunk_count = (unsigned long)DIV_ROUND_UP_ULL(capacity, + (1ul << (shift - SECTOR_SHIFT))); +} + +void diff_area_free(struct kref *kref) +{ + unsigned long inx = 0; + struct chunk *chunk; + struct diff_area *diff_area; + + might_sleep(); + diff_area = container_of(kref, struct diff_area, kref); + + flush_work(&diff_area->image_io_work); + flush_work(&diff_area->store_queue_work); + xa_for_each(&diff_area->chunk_map, inx, chunk) + if (chunk) + chunk_free(diff_area, chunk); + xa_destroy(&diff_area->chunk_map); + + diff_buffer_cleanup(diff_area); + tracker_put(diff_area->tracker); + kfree(diff_area); +} + +static inline bool diff_area_store_one(struct diff_area *diff_area) +{ + struct chunk *iter, *chunk = NULL; + + spin_lock(&diff_area->store_queue_lock); + list_for_each_entry(iter, 
&diff_area->store_queue, link) { + if (!down_trylock(&iter->lock)) { + chunk = iter; + atomic_dec(&diff_area->store_queue_count); + list_del_init(&chunk->link); + chunk->diff_area = diff_area_get(diff_area); + break; + } + /* + * If it is not possible to lock a chunk for writing, then it is + * currently in use, and we try to clean up the next chunk. + */ + } + if (!chunk) + diff_area->store_queue_processing = false; + spin_unlock(&diff_area->store_queue_lock); + if (!chunk) + return false; + + if (chunk->state != CHUNK_ST_IN_MEMORY) { + /* + * There cannot be a chunk in the store queue whose buffer has + * not been read into memory. + */ + chunk_up(chunk); + pr_warn("Cannot release empty buffer for chunk #%ld", + chunk->number); + return true; + } + + if (diff_area_is_corrupted(diff_area)) { + chunk_store_failed(chunk, 0); + return true; + } + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + if (!chunk->diff_file && !chunk->diff_bdev) { +#else + if (!chunk->diff_file) { +#endif + int ret; + + ret = diff_storage_alloc(diff_area->diff_storage, + diff_area_chunk_sectors(diff_area), +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + &chunk->diff_bdev, +#endif + &chunk->diff_file, + &chunk->diff_ofs_sect); + if (ret) { + pr_debug("Cannot get store for chunk #%ld\n", + chunk->number); + chunk_store_failed(chunk, ret); + return true; + } + } + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + if (chunk->diff_bdev) { + chunk_store_tobdev(chunk); + return true; + } +#endif + chunk_diff_write(chunk); + return true; +} + +static void diff_area_store_queue_work(struct work_struct *work) +{ + struct diff_area *diff_area = container_of( + work, struct diff_area, store_queue_work); + unsigned int old_nofs; + struct blkfilter *prev_filter = current->blk_filter; + + current->blk_filter = &diff_area->tracker->filter; + old_nofs = memalloc_nofs_save(); + while (diff_area_store_one(diff_area)) + ; + memalloc_nofs_restore(old_nofs); + current->blk_filter = prev_filter; +} + +static inline struct 
chunk_io_ctx *chunk_io_ctx_take( + struct diff_area *diff_area) +{ + struct chunk_io_ctx *io_ctx; + + spin_lock(&diff_area->image_io_queue_lock); + io_ctx = list_first_entry_or_null(&diff_area->image_io_queue, + struct chunk_io_ctx, link); + if (io_ctx) + list_del(&io_ctx->link); + spin_unlock(&diff_area->image_io_queue_lock); + + return io_ctx; +} + +static void diff_area_image_io_work(struct work_struct *work) +{ + struct diff_area *diff_area = container_of( + work, struct diff_area, image_io_work); + struct chunk_io_ctx *io_ctx; + unsigned int old_nofs; + struct blkfilter *prev_filter = current->blk_filter; + + current->blk_filter = &diff_area->tracker->filter; + old_nofs = memalloc_nofs_save(); + while ((io_ctx = chunk_io_ctx_take(diff_area))) + chunk_diff_bio_execute(io_ctx); + memalloc_nofs_restore(old_nofs); + current->blk_filter = prev_filter; +} + +struct diff_area *diff_area_new(struct tracker *tracker, + struct diff_storage *diff_storage) +{ + int ret = 0; + struct diff_area *diff_area = NULL; + struct block_device *bdev = tracker->orig_bdev; + + diff_area = kzalloc(sizeof(struct diff_area), GFP_KERNEL); + if (!diff_area) + return ERR_PTR(-ENOMEM); + + kref_init(&diff_area->kref); + diff_area->orig_bdev = bdev; + diff_area->diff_storage = diff_storage; + + diff_area_calculate_chunk_size(diff_area); + if (diff_area->chunk_shift > get_chunk_maximum_shift()) { + pr_info("The maximum allowable chunk size has been reached.\n"); + return ERR_PTR(-EFAULT); + } + pr_debug("The optimal chunk size was calculated as %llu bytes for device [%d:%d]\n", + (1ull << diff_area->chunk_shift), + MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev)); + + xa_init(&diff_area->chunk_map); + + tracker_get(tracker); + diff_area->tracker = tracker; + + spin_lock_init(&diff_area->store_queue_lock); + INIT_LIST_HEAD(&diff_area->store_queue); + atomic_set(&diff_area->store_queue_count, 0); + INIT_WORK(&diff_area->store_queue_work, diff_area_store_queue_work); + + 
spin_lock_init(&diff_area->free_diff_buffers_lock); + INIT_LIST_HEAD(&diff_area->free_diff_buffers); + atomic_set(&diff_area->free_diff_buffers_count, 0); + + spin_lock_init(&diff_area->image_io_queue_lock); + INIT_LIST_HEAD(&diff_area->image_io_queue); + INIT_WORK(&diff_area->image_io_work, diff_area_image_io_work); + + diff_area->physical_blksz = bdev_physical_block_size(bdev); + diff_area->logical_blksz = bdev_logical_block_size(bdev); + diff_area->corrupt_flag = 0; + diff_area->store_queue_processing = false; + + if (ret) { + diff_area_put(diff_area); + return ERR_PTR(ret); + } + + return diff_area; +} + +static inline unsigned int chunk_limit(struct chunk *chunk, + struct bvec_iter *iter) +{ + sector_t chunk_ofs = iter->bi_sector - chunk_sector(chunk); + sector_t chunk_left = chunk->sector_count - chunk_ofs; + + return min(iter->bi_size, (unsigned int)(chunk_left << SECTOR_SHIFT)); +} + +/* + * Implements the copy-on-write mechanism. + */ +bool diff_area_cow(struct bio *bio, struct diff_area *diff_area, + struct bvec_iter *iter) +{ + bool nowait = bio->bi_opf & REQ_NOWAIT; + struct bio *chunk_bio = NULL; + LIST_HEAD(chunks); + int ret = 0; + + while (iter->bi_size) { + unsigned long nr = diff_area_chunk_number(diff_area, + iter->bi_sector); + struct chunk *chunk = xa_load(&diff_area->chunk_map, nr); + unsigned int len; + + if (!chunk) { + chunk = chunk_alloc(diff_area, nr); + if (!chunk) { + diff_area_set_corrupted(diff_area, -EINVAL); + ret = -ENOMEM; + goto fail; + } + + ret = xa_insert(&diff_area->chunk_map, nr, chunk, + GFP_NOIO); + if (likely(!ret)) { + /* new chunk has been added */ + } else if (ret == -EBUSY) { + /* another chunk has just been created */ + chunk_free(diff_area, chunk); + chunk = xa_load(&diff_area->chunk_map, nr); + WARN_ON_ONCE(!chunk); + if (unlikely(!chunk)) { + ret = -EINVAL; + diff_area_set_corrupted(diff_area, ret); + goto fail; + } + } else if (ret) { + pr_err("Failed insert chunk to chunk map\n"); + chunk_free(diff_area, chunk); 
+ diff_area_set_corrupted(diff_area, ret); + goto fail; + } + } + + if (nowait) { + if (down_trylock(&chunk->lock)) { + ret = -EAGAIN; + goto fail; + } + } else { + ret = down_killable(&chunk->lock); + if (unlikely(ret)) + goto fail; + } + chunk->diff_area = diff_area_get(diff_area); + + len = chunk_limit(chunk, iter); + bio_advance_iter_single(bio, iter, len); + + if (chunk->state == CHUNK_ST_NEW) { + if (nowait) { + /* + * If the data of this chunk has not yet been + * copied to the difference storage, then it is + * impossible to process the I/O write unit with + * the NOWAIT flag. + */ + chunk_up(chunk); + ret = -EAGAIN; + goto fail; + } + + /* + * Load the chunk asynchronously. + */ + ret = chunk_load_and_postpone_io(chunk, &chunk_bio); + if (ret) { + chunk_up(chunk); + goto fail; + } + list_add_tail(&chunk->link, &chunks); + } else { + /* + * The chunk has already been: + * - failed, when the snapshot is corrupted + * - read into the buffer + * - stored into the diff storage + * In this case, we do not change the chunk. + */ + chunk_up(chunk); + } + } + + if (chunk_bio) { + /* Postpone bio processing in a callback. */ + chunk_load_and_postpone_io_finish(&chunks, chunk_bio, bio); + return true; + } + /* Pass bio to the low level */ + return false; + +fail: + if (chunk_bio) { + chunk_bio->bi_status = errno_to_blk_status(ret); + bio_endio(chunk_bio); + } + + if (ret == -EAGAIN) { + /* + * The -EAGAIN error code means that it is not possible to + * process an I/O unit with the REQ_NOWAIT flag. + * Processing of the I/O unit is completed with this error. + */ + bio->bi_status = BLK_STS_AGAIN; + bio_endio(bio); + return true; + } + /* + * In any other case, the processing of the I/O unit continues. 
+ */ + return false; +} + +static void orig_clone_endio(struct bio *bio) +{ + struct bio *orig_bio = bio->bi_private; + + if (unlikely(bio->bi_status != BLK_STS_OK)) + bio_io_error(orig_bio); + else + bio_endio(orig_bio); +} + +static void orig_clone_bio(struct diff_area *diff_area, struct bio *bio) +{ + struct bio *new_bio; + struct block_device *bdev = diff_area->orig_bdev; + sector_t chunk_limit; + + new_bio = chunk_alloc_clone(bdev, bio); + WARN_ON(!new_bio); + + chunk_limit = diff_area_chunk_sectors(diff_area) - + diff_area_chunk_offset(diff_area, bio->bi_iter.bi_sector); + + new_bio->bi_iter.bi_sector = bio->bi_iter.bi_sector; + new_bio->bi_iter.bi_size = min_t(unsigned int, + bio->bi_iter.bi_size, chunk_limit << SECTOR_SHIFT); + + new_bio->bi_end_io = orig_clone_endio; + new_bio->bi_private = bio; + + bio_advance(bio, new_bio->bi_iter.bi_size); + bio_inc_remaining(bio); + + submit_bio_noacct(new_bio); +} + +bool diff_area_submit_chunk(struct diff_area *diff_area, struct bio *bio) +{ + int ret; + struct chunk *chunk; + unsigned long nr = diff_area_chunk_number(diff_area, + bio->bi_iter.bi_sector); + + chunk = xa_load(&diff_area->chunk_map, nr); + /* + * If this chunk is not in the chunk map, then the COW algorithm did + * not access this part of the disk space, and writing to the snapshot + * in this part was also not performed. + */ + if (!chunk) { + if (!op_is_write(bio_op(bio))) { + /* + * To read, we simply redirect the bio to the original + * block device. + */ + orig_clone_bio(diff_area, bio); + return true; + } + + /* + * To process a write bio, we need to allocate a new chunk. 
+ */ + chunk = chunk_alloc(diff_area, nr); + WARN_ON_ONCE(!chunk); + if (unlikely(!chunk)) + return false; + + ret = xa_insert(&diff_area->chunk_map, nr, chunk, + GFP_NOIO); + if (likely(!ret)) { + /* new chunk has been added */ + } else if (ret == -EBUSY) { + /* another chunk has just been created */ + chunk_free(diff_area, chunk); + chunk = xa_load(&diff_area->chunk_map, nr); + WARN_ON_ONCE(!chunk); + if (unlikely(!chunk)) + return false; + } else if (ret) { + pr_err("Failed insert chunk to chunk map\n"); + chunk_free(diff_area, chunk); + return false; + } + } + + if (down_killable(&chunk->lock)) + return false; + chunk->diff_area = diff_area_get(diff_area); + + switch (chunk->state) { + case CHUNK_ST_IN_MEMORY: + /* + * Directly copy data from the in-memory chunk or + * copy to the in-memory chunk for write operation. + */ + chunk_copy_bio(chunk, bio, &bio->bi_iter); + chunk_up(chunk); + return true; + case CHUNK_ST_STORED: + /* + * Data is read from the difference storage or written to it. + */ +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + if (chunk->diff_bdev) { + chunk_diff_bio_tobdev(chunk, bio); + chunk_up(chunk); + return true; + } +#endif + ret = chunk_diff_bio(chunk, bio); + return (ret == 0); + case CHUNK_ST_NEW: + if (!op_is_write(bio_op(bio))) { + /* + * Read from original block device + */ + orig_clone_bio(diff_area, bio); + chunk_up(chunk); + return true; + } + + /* + * Starts asynchronous loading of a chunk from the original + * block device and schedule copying data to (or from) the + * in-memory chunk. 
+ */ + return chunk_load_and_schedule_io(chunk, bio); + default: /* CHUNK_ST_FAILED */ + pr_err("Chunk #%ld corrupted\n", chunk->number); + chunk_up(chunk); + return false; + } +} + +static inline void diff_area_event_corrupted(struct diff_area *diff_area) +{ + struct blksnap_event_corrupted data = { + .dev_id_mj = MAJOR(diff_area->orig_bdev->bd_dev), + .dev_id_mn = MINOR(diff_area->orig_bdev->bd_dev), + .err_code = abs(diff_area->error_code), + }; + + event_gen(&diff_area->diff_storage->event_queue, GFP_NOIO, + blksnap_event_code_corrupted, &data, + sizeof(struct blksnap_event_corrupted)); +} + +void diff_area_set_corrupted(struct diff_area *diff_area, int err_code) +{ + if (test_and_set_bit(0, &diff_area->corrupt_flag)) + return; + + diff_area->error_code = err_code; + diff_area_event_corrupted(diff_area); + + pr_err("Set snapshot device is corrupted for [%u:%u] with error code %d\n", + MAJOR(diff_area->orig_bdev->bd_dev), + MINOR(diff_area->orig_bdev->bd_dev), abs(err_code)); +} diff --git a/drivers/block/blksnap/diff_area.h b/drivers/block/blksnap/diff_area.h new file mode 100644 index 000000000000..3fff8138276c --- /dev/null +++ b/drivers/block/blksnap/diff_area.h @@ -0,0 +1,175 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_DIFF_AREA_H +#define __BLKSNAP_DIFF_AREA_H + +#include +#include +#include +#include +#include +#include +#include +#include "event_queue.h" + +struct diff_storage; +struct chunk; +struct tracker; + +/** + * struct diff_area - Describes the difference area for one original device. + * + * @kref: + * The reference counter allows to manage the lifetime of an object. + * @orig_bdev: + * A pointer to the structure of an opened block device. + * @diff_storage: + * Pointer to difference storage for storing difference data. + * @tracker: + * Back pointer to the tracker for this &struct diff_area + * @chunk_shift: + * Power of 2 used to specify the chunk size. 
This allows different + * chunk sizes to be set for huge and small block devices. + * @chunk_count: + * Count of chunks. The number of chunks into which the block device + * is divided. + * @chunk_map: + * A map of chunks. The map stores only chunks of differences. Chunks are + * added to the map if this data block was overwritten on the original + * device, or was overwritten on the snapshot. If there is no chunk in the + * map, then when accessing the snapshot, I/O units are redirected to the + * original device. + * @store_queue_lock: + * The spinlock guarantees consistency of the linked list that forms the + * chunk queue. + * @store_queue: + * The queue of chunks waiting to be stored to the difference storage. + * @store_queue_count: + * The number of chunks in the store queue. + * @store_queue_work: + * The workqueue work item. This worker stores chunks to the difference + * storage, freeing up the cache. It limits the number of chunks that + * keep their data in RAM. + * @store_queue_processing: + * The flag indicates that the &diff_area.store_queue_work is + * running or has been scheduled to run. + * @free_diff_buffers_lock: + * The spinlock guarantees consistency of the linked list of free + * difference buffers. + * @free_diff_buffers: + * A linked list of free difference buffers that reduces the number + * of buffer allocation and release operations. + * @free_diff_buffers_count: + * The number of free difference buffers in the linked list. + * @image_io_queue_lock: + * The spinlock guarantees consistency of the linked list of I/O + * requests to the image. + * @image_io_queue: + * A linked list of I/O units for the snapshot image that require reading + * from the difference storage to be processed. + * @image_io_work: + * A worker that services the I/O units for reading or writing data to the + * difference storage file. If the difference storage is a block device, + * then this worker is not used to process the I/O units of the snapshot + * image. 
+ * @physical_blksz: + * The physical block size for the snapshot image is equal to the + * physical block size of the original device. + * @logical_blksz: + * The logical block size for the snapshot image is equal to the + * logical block size of the original device. + * @corrupt_flag: + * The flag is set if an error occurred in the operation of the data + * saving mechanism in the diff area. In this case, an error will be + * generated when reading from the snapshot image. + * @error_code: + * The error code that caused the snapshot to be corrupted. + * + * The &struct diff_area is created for each block device in the snapshot. It + * is used to store the differences between the original block device and the + * snapshot image. That is, when writing data to the original device, the + * differences are copied as chunks to the difference storage. Reads from and + * writes to the snapshot image are also performed using &struct diff_area. + * + * The map of chunks is an xarray. Its capacity is limited, which is + * especially noticeable on 32-bit systems: there the number of chunks must + * be less than 2^32. + * + * For example, for a 256 TiB disk and a chunk size of 65536 bytes, the number + * of chunks in the chunk map will be equal to 2^32. This number no longer + * fits in 32 bits. Therefore, large disks require a larger + * chunk size. + * + * The store queue makes it possible to postpone storing a chunk's data + * to the difference storage and to perform it later in the worker thread. + * + * The linked list of difference buffers keeps a certain number of + * "hot" buffers available, which reduces the number of memory allocations + * and releases. + * + * If processing an I/O unit from the snapshot image requires reading from or + * writing to the difference storage file, the operation is performed in a separate + * thread. 
To do this, a worker &diff_area.image_io_work and a queue + * &diff_area.image_io_queue are used. An attempt to read a file from the same + * thread that initiated the block I/O can lead to a deadlock state. + */ +struct diff_area { + struct kref kref; + struct block_device *orig_bdev; + struct diff_storage *diff_storage; + struct tracker *tracker; + + unsigned long chunk_shift; + unsigned long chunk_count; + struct xarray chunk_map; + + spinlock_t store_queue_lock; + struct list_head store_queue; + atomic_t store_queue_count; + struct work_struct store_queue_work; + bool store_queue_processing; + + spinlock_t free_diff_buffers_lock; + struct list_head free_diff_buffers; + atomic_t free_diff_buffers_count; + + spinlock_t image_io_queue_lock; + struct list_head image_io_queue; + struct work_struct image_io_work; + + unsigned int physical_blksz; + unsigned int logical_blksz; + + unsigned long corrupt_flag; + int error_code; +}; + +struct diff_area *diff_area_new(struct tracker *tracker, + struct diff_storage *diff_storage); +void diff_area_free(struct kref *kref); +static inline struct diff_area *diff_area_get(struct diff_area *diff_area) +{ + kref_get(&diff_area->kref); + return diff_area; +}; +static inline void diff_area_put(struct diff_area *diff_area) +{ + kref_put(&diff_area->kref, diff_area_free); +}; + +void diff_area_set_corrupted(struct diff_area *diff_area, int err_code); +static inline bool diff_area_is_corrupted(struct diff_area *diff_area) +{ + return !!diff_area->corrupt_flag; +}; +static inline sector_t diff_area_chunk_sectors(struct diff_area *diff_area) +{ + return (sector_t)(1ull << (diff_area->chunk_shift - SECTOR_SHIFT)); +}; +bool diff_area_cow(struct bio *bio, struct diff_area *diff_area, + struct bvec_iter *iter); + +bool diff_area_submit_chunk(struct diff_area *diff_area, struct bio *bio); +void diff_area_rw_chunk(struct kref *kref); + +#endif /* __BLKSNAP_DIFF_AREA_H */ diff --git a/drivers/block/blksnap/diff_buffer.c 
b/drivers/block/blksnap/diff_buffer.c new file mode 100644 index 000000000000..ed0d2da94ff8 --- /dev/null +++ b/drivers/block/blksnap/diff_buffer.c @@ -0,0 +1,115 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME "-diff-buffer: " fmt + +#include "diff_buffer.h" +#include "diff_area.h" +#include "params.h" + +static void diff_buffer_free(struct diff_buffer *diff_buffer) +{ + size_t inx = 0; + + if (unlikely(!diff_buffer)) + return; + + for (inx = 0; inx < diff_buffer->nr_pages; inx++) + __free_page(diff_buffer->bvec[inx].bv_page); + + kfree(diff_buffer); +} + +static struct diff_buffer *diff_buffer_new(size_t nr_pages, size_t size, + gfp_t gfp_mask) +{ + struct diff_buffer *diff_buffer; + size_t inx = 0; + + if (unlikely(nr_pages <= 0)) + return NULL; + + diff_buffer = kzalloc(sizeof(struct diff_buffer) + + nr_pages * sizeof(struct bio_vec), + gfp_mask); + if (!diff_buffer) + return NULL; + + INIT_LIST_HEAD(&diff_buffer->link); + diff_buffer->size = size; + diff_buffer->nr_pages = nr_pages; + + for (inx = 0; inx < nr_pages; inx++) { + struct page *page = alloc_page(gfp_mask); + + if (!page) + goto fail; + bvec_set_page(&diff_buffer->bvec[inx], page, PAGE_SIZE, 0); + } + return diff_buffer; +fail: + diff_buffer_free(diff_buffer); + return NULL; +} + +struct diff_buffer *diff_buffer_take(struct diff_area *diff_area) +{ + struct diff_buffer *diff_buffer = NULL; + sector_t chunk_sectors; + size_t page_count; + + spin_lock(&diff_area->free_diff_buffers_lock); + diff_buffer = list_first_entry_or_null(&diff_area->free_diff_buffers, + struct diff_buffer, link); + if (diff_buffer) { + list_del(&diff_buffer->link); + atomic_dec(&diff_area->free_diff_buffers_count); + } + spin_unlock(&diff_area->free_diff_buffers_lock); + + /* Return free buffer if it was found in a pool */ + if (diff_buffer) + return diff_buffer; + + /* Allocate new buffer */ + chunk_sectors = diff_area_chunk_sectors(diff_area); + 
page_count = round_up(chunk_sectors, PAGE_SECTORS) / PAGE_SECTORS; + diff_buffer = diff_buffer_new(page_count, chunk_sectors << SECTOR_SHIFT, + GFP_NOIO); + if (unlikely(!diff_buffer)) + return ERR_PTR(-ENOMEM); + return diff_buffer; +} + +void diff_buffer_release(struct diff_area *diff_area, + struct diff_buffer *diff_buffer) +{ + if (atomic_read(&diff_area->free_diff_buffers_count) > + get_free_diff_buffer_pool_size()) { + diff_buffer_free(diff_buffer); + return; + } + spin_lock(&diff_area->free_diff_buffers_lock); + list_add_tail(&diff_buffer->link, &diff_area->free_diff_buffers); + atomic_inc(&diff_area->free_diff_buffers_count); + spin_unlock(&diff_area->free_diff_buffers_lock); +} + +void diff_buffer_cleanup(struct diff_area *diff_area) +{ + struct diff_buffer *diff_buffer = NULL; + + do { + spin_lock(&diff_area->free_diff_buffers_lock); + diff_buffer = + list_first_entry_or_null(&diff_area->free_diff_buffers, + struct diff_buffer, link); + if (diff_buffer) { + list_del(&diff_buffer->link); + atomic_dec(&diff_area->free_diff_buffers_count); + } + spin_unlock(&diff_area->free_diff_buffers_lock); + + if (diff_buffer) + diff_buffer_free(diff_buffer); + } while (diff_buffer); +} diff --git a/drivers/block/blksnap/diff_buffer.h b/drivers/block/blksnap/diff_buffer.h new file mode 100644 index 000000000000..077fcf4a2292 --- /dev/null +++ b/drivers/block/blksnap/diff_buffer.h @@ -0,0 +1,37 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_DIFF_BUFFER_H +#define __BLKSNAP_DIFF_BUFFER_H + +#include +#include +#include +#include + +struct diff_area; + +/** + * struct diff_buffer - Difference buffer. + * @link: + * The list header allows to create a pool of the diff_buffer structures. + * @size: + * Count of bytes in the buffer. + * @nr_pages: + * The number of pages reserved for the buffer. + * @bvec: + * An array of pages in bio_vec form. + * + * Describes the memory buffer for a chunk in the memory. 
+ */ +struct diff_buffer { + struct list_head link; + size_t size; + unsigned long nr_pages; + struct bio_vec bvec[]; +}; + +struct diff_buffer *diff_buffer_take(struct diff_area *diff_area); +void diff_buffer_release(struct diff_area *diff_area, + struct diff_buffer *diff_buffer); +void diff_buffer_cleanup(struct diff_area *diff_area); +#endif /* __BLKSNAP_DIFF_BUFFER_H */ diff --git a/drivers/block/blksnap/diff_storage.c b/drivers/block/blksnap/diff_storage.c new file mode 100644 index 000000000000..e6de1d1efe89 --- /dev/null +++ b/drivers/block/blksnap/diff_storage.c @@ -0,0 +1,291 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME "-diff-storage: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include "chunk.h" +#include "diff_buffer.h" +#include "diff_storage.h" +#include "params.h" + +static void diff_storage_reallocate_work(struct work_struct *work) +{ + int ret; + sector_t req_sect; + struct diff_storage *diff_storage = container_of( + work, struct diff_storage, reallocate_work); + bool complete = false; + + do { + spin_lock(&diff_storage->lock); + req_sect = diff_storage->requested; + spin_unlock(&diff_storage->lock); + + ret = vfs_fallocate(diff_storage->file, 0, 0, + (loff_t)(req_sect << SECTOR_SHIFT)); + if (ret) { + pr_err("Failed to fallocate difference storage file\n"); + break; + } + + spin_lock(&diff_storage->lock); + diff_storage->capacity = req_sect; + complete = (diff_storage->capacity >= diff_storage->requested); + if (complete) + atomic_set(&diff_storage->low_space_flag, 0); + spin_unlock(&diff_storage->lock); + + pr_debug("Diff storage reallocate. 
Capacity: %llu sectors\n", + req_sect); + } while (!complete); +} + +static bool diff_storage_calculate_requested(struct diff_storage *diff_storage) +{ + bool ret = false; + + spin_lock(&diff_storage->lock); + if (diff_storage->capacity < diff_storage->limit) { + diff_storage->requested += min(get_diff_storage_minimum(), + diff_storage->limit - diff_storage->capacity); + ret = true; + } + pr_debug("The size of the difference storage was %llu MiB\n", + diff_storage->capacity >> (20 - SECTOR_SHIFT)); + pr_debug("The limit is %llu MiB\n", + diff_storage->limit >> (20 - SECTOR_SHIFT)); + spin_unlock(&diff_storage->lock); + + return ret; +} + +static inline bool is_halffull(const sector_t sectors_left) +{ + return sectors_left <= (get_diff_storage_minimum() / 2); +} + +static inline void check_halffull(struct diff_storage *diff_storage, + const sector_t sectors_left) +{ + if (is_halffull(sectors_left) && + (atomic_inc_return(&diff_storage->low_space_flag) == 1)) { + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + if (diff_storage->bdev) { + pr_warn("Reallocating is allowed only for a regular file\n"); + return; + } +#endif + if (!diff_storage_calculate_requested(diff_storage)) { + pr_info("The limit size of the difference storage has been reached\n"); + return; + } + + pr_debug("Diff storage low free space.\n"); + queue_work(system_wq, &diff_storage->reallocate_work); + } +} + +struct diff_storage *diff_storage_new(void) +{ + struct diff_storage *diff_storage; + + diff_storage = kzalloc(sizeof(struct diff_storage), GFP_KERNEL); + if (!diff_storage) + return NULL; + + kref_init(&diff_storage->kref); + spin_lock_init(&diff_storage->lock); + diff_storage->limit = 0; + + INIT_WORK(&diff_storage->reallocate_work, diff_storage_reallocate_work); + event_queue_init(&diff_storage->event_queue); + + return diff_storage; +} + +void diff_storage_free(struct kref *kref) +{ + struct diff_storage *diff_storage; + + diff_storage = container_of(kref, struct diff_storage, kref); + 
flush_work(&diff_storage->reallocate_work); + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + if (diff_storage->bdev) + blkdev_put(diff_storage->bdev, NULL); +#endif + if (diff_storage->file) + fput(diff_storage->file); + event_queue_done(&diff_storage->event_queue); + kfree(diff_storage); +} + +static inline bool unsupported_mode(const umode_t m) +{ + return (S_ISCHR(m) || S_ISFIFO(m) || S_ISSOCK(m)); +} + +static inline bool unsupported_flags(const unsigned int flags) +{ + if ((flags & O_ACCMODE) != O_RDWR) { + pr_err("Read and write access is required\n"); + return true; + } + if (!(flags | O_EXCL)) { + pr_err("Exclusive access is required\n"); + return true; + } + + return false; +} + +int diff_storage_set_diff_storage(struct diff_storage *diff_storage, + unsigned int fd, sector_t limit) +{ + int ret = 0; + struct file *file; + + file = fget(fd); + if (!file) { + pr_err("Invalid file descriptor\n"); + return -EINVAL; + } + + if (unsupported_mode(file_inode(file)->i_mode)) { + pr_err("The difference storage can only be a regular file or a block device\n"); + ret = -EINVAL; + goto fail_fput; + } + + if (unsupported_flags(file->f_flags)) { + pr_err("Invalid flags 0x%x with which the file was opened\n", + file->f_flags); + ret = -EINVAL; + goto fail_fput; + } + + if (S_ISBLK(file_inode(file)->i_mode)) { + struct block_device *bdev; + dev_t dev_id = file_inode(file)->i_rdev; + + pr_debug("Open a block device %d:%d\n", + MAJOR(dev_id), MINOR(dev_id)); + /* + * The block device is opened non-exclusively. + * Exclusive access should be held by the file whose descriptor is + * passed to the module. 
+ */ + bdev = blkdev_get_by_dev(dev_id, + BLK_OPEN_READ | BLK_OPEN_WRITE, + NULL, NULL); + if (IS_ERR(bdev)) { + pr_err("Cannot open a block device %d:%d\n", + MAJOR(dev_id), MINOR(dev_id)); + ret = PTR_ERR(bdev); + bdev = NULL; + goto fail_fput; + } + + pr_debug("A block device is selected for difference storage\n"); + diff_storage->dev_id = file_inode(file)->i_rdev; + diff_storage->capacity = bdev_nr_sectors(bdev); +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + diff_storage->bdev = bdev; +#else + blkdev_put(bdev, NULL); +#endif + } else { + pr_debug("A regular file is selected for difference storage\n"); + diff_storage->dev_id = file_inode(file)->i_sb->s_dev; + diff_storage->capacity = + i_size_read(file_inode(file)) >> SECTOR_SHIFT; + } + + diff_storage->file = get_file(file); + diff_storage->requested = diff_storage->capacity; + diff_storage->limit = limit; + + if (is_halffull(diff_storage->requested)) { + sector_t req_sect; + + if (diff_storage->capacity == diff_storage->limit) { + pr_info("The limit size of the difference storage has been reached\n"); + ret = 0; + goto fail_fput; + } + if (diff_storage->capacity > diff_storage->limit) { + pr_err("The limit size of the difference storage has been exceeded\n"); + ret = -ENOSPC; + goto fail_fput; + } + + diff_storage->requested += min(get_diff_storage_minimum(), + diff_storage->limit - diff_storage->capacity); + req_sect = diff_storage->requested; + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + if (diff_storage->bdev) { + pr_warn("Difference storage on block device is not large enough\n"); + pr_warn("Requested: %llu sectors\n", req_sect); + ret = 0; + goto fail_fput; + } +#endif + pr_debug("Difference storage is not large enough\n"); + pr_debug("Requested: %llu sectors\n", req_sect); + + ret = vfs_fallocate(diff_storage->file, 0, 0, + (loff_t)(req_sect << SECTOR_SHIFT)); + if (ret) { + pr_err("Failed to fallocate difference storage file\n"); + pr_warn("The difference storage is not large enough\n"); + goto fail_fput; + 
} + diff_storage->capacity = req_sect; + } +fail_fput: + fput(file); + return ret; +} + +int diff_storage_alloc(struct diff_storage *diff_storage, sector_t count, +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + struct block_device **bdev, +#endif + struct file **file, sector_t *sector) + +{ + sector_t sectors_left; + + if (atomic_read(&diff_storage->overflow_flag)) + return -ENOSPC; + + spin_lock(&diff_storage->lock); + if ((diff_storage->filled + count) > diff_storage->requested) { + atomic_inc(&diff_storage->overflow_flag); + spin_unlock(&diff_storage->lock); + return -ENOSPC; + } + +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + *bdev = diff_storage->bdev; +#endif + *file = diff_storage->file; + *sector = diff_storage->filled; + + diff_storage->filled += count; + sectors_left = diff_storage->requested - diff_storage->filled; + + spin_unlock(&diff_storage->lock); + + check_halffull(diff_storage, sectors_left); + return 0; +} diff --git a/drivers/block/blksnap/diff_storage.h b/drivers/block/blksnap/diff_storage.h new file mode 100644 index 000000000000..f186956630e5 --- /dev/null +++ b/drivers/block/blksnap/diff_storage.h @@ -0,0 +1,104 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_DIFF_STORAGE_H +#define __BLKSNAP_DIFF_STORAGE_H + +#include "event_queue.h" + +struct blksnap_sectors; + +/** + * struct diff_storage - Difference storage. + * + * @kref: + * The reference counter. + * @lock: + * Spinlock allows to safely change structure fields in a multithreaded + * environment. + * @dev_id: + * ID of the block device on which the difference storage file is located. + * @bdev: + * A pointer to the block device that has been selected for the + * difference storage. Available only if configuration BLKSNAP_DIFF_BLKDEV + * is enabled. + * @file: + * A pointer to the file that was selected for the difference storage. + * @capacity: + * Total amount of available difference storage space. 
+ * @limit: + * The limit to which the difference storage can be allowed to grow. + * @filled: + * The number of sectors already filled in. + * @requested: + * The number of sectors already requested from user space. + * @low_space_flag: + * The flag is set if the number of free regions available in the + * difference storage is less than the allowed minimum. + * @overflow_flag: + * The request for a free region failed due to the absence of free + * regions in the difference storage. + * @reallocate_work: + * The working thread in which the difference storage file is growing. + * @event_queue: + * A queue of events to pass events to user space. + * + * The difference storage manages the block device or file that are used + * to store the data of the original block devices in the snapshot. + * The difference storage is created one per snapshot and is used to store + * data from all block devices. + * + * The difference storage file has the ability to increase while holding the + * snapshot as needed within the specified limits. This is done using the + * function vfs_fallocate(). + * + * Changing the file size leads to a change in the file metadata in the file + * system, which leads to the generation of I/O units for the block device. + * Using a separate working thread ensures that metadata changes will be + * handled and correctly processed by the block-level filters. + * + * The event queue allows to inform the user land about changes in the state + * of the difference storage. 
+ */ +struct diff_storage { + struct kref kref; + spinlock_t lock; + + dev_t dev_id; +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + struct block_device *bdev; +#endif + struct file *file; + sector_t capacity; + sector_t limit; + sector_t filled; + sector_t requested; + + atomic_t low_space_flag; + atomic_t overflow_flag; + + struct work_struct reallocate_work; + struct event_queue event_queue; +}; + +struct diff_storage *diff_storage_new(void); +void diff_storage_free(struct kref *kref); + +static inline void diff_storage_get(struct diff_storage *diff_storage) +{ + kref_get(&diff_storage->kref); +}; +static inline void diff_storage_put(struct diff_storage *diff_storage) +{ + if (likely(diff_storage)) + kref_put(&diff_storage->kref, diff_storage_free); +}; + +int diff_storage_set_diff_storage(struct diff_storage *diff_storage, + unsigned int fd, sector_t limit); + +int diff_storage_alloc(struct diff_storage *diff_storage, sector_t count, +#if defined(CONFIG_BLKSNAP_DIFF_BLKDEV) + struct block_device **bdev, +#endif + struct file **file, sector_t *sector); +#endif /* __BLKSNAP_DIFF_STORAGE_H */
From patchwork Fri Nov 24 16:59:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169523
From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa Subject: [PATCH v6 08/11] blksnap: event queue from the difference storage Date: Fri, 24 Nov 2023 17:59:30 +0100 Message-Id: <20231124165933.27580-9-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org
From: Sergei Shtepa Events are used to immediately notify the user land of a change in the snapshot state. 
For example, an event is generated if an error occurs, while the snapshot is held, when reading data from the original block device or from the difference storage. Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Signed-off-by: Sergei Shtepa --- drivers/block/blksnap/event_queue.c | 81 +++++++++++++++++++++++++++++ drivers/block/blksnap/event_queue.h | 64 +++++++++++++++++++++++ 2 files changed, 145 insertions(+) create mode 100644 drivers/block/blksnap/event_queue.c create mode 100644 drivers/block/blksnap/event_queue.h diff --git a/drivers/block/blksnap/event_queue.c b/drivers/block/blksnap/event_queue.c new file mode 100644 index 000000000000..2256167b631b --- /dev/null +++ b/drivers/block/blksnap/event_queue.c @@ -0,0 +1,81 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME "-event_queue: " fmt + +#include +#include +#include "event_queue.h" + +void event_queue_init(struct event_queue *event_queue) +{ + INIT_LIST_HEAD(&event_queue->list); + spin_lock_init(&event_queue->lock); + init_waitqueue_head(&event_queue->wq_head); +} + +void event_queue_done(struct event_queue *event_queue) +{ + struct event *event; + + spin_lock(&event_queue->lock); + while (!list_empty(&event_queue->list)) { + event = list_first_entry(&event_queue->list, struct event, + link); + list_del(&event->link); + event_free(event); + } + spin_unlock(&event_queue->lock); +} + +int event_gen(struct event_queue *event_queue, gfp_t flags, int code, + const void *data, int data_size) +{ + struct event *event; + + event = kzalloc(sizeof(struct event) + data_size + 1, flags); + if (!event) + return -ENOMEM; + + event->time = ktime_get(); + event->code = code; + event->data_size = data_size; + memcpy(event->data, data, data_size); + + pr_debug("Generate event: time=%lld code=%d data_size=%d\n", + event->time, event->code, event->data_size); + + spin_lock(&event_queue->lock); + list_add_tail(&event->link, &event_queue->list); + 
spin_unlock(&event_queue->lock); + + wake_up(&event_queue->wq_head); + return 0; +} + +struct event *event_wait(struct event_queue *event_queue, + unsigned long timeout_ms) +{ + int ret; + + ret = wait_event_interruptible_timeout(event_queue->wq_head, + !list_empty(&event_queue->list), msecs_to_jiffies(timeout_ms)); + if (ret >= 0) { + struct event *event = ERR_PTR(-ENOENT); + + spin_lock(&event_queue->lock); + if (!list_empty(&event_queue->list)) { + event = list_first_entry(&event_queue->list, + struct event, link); + list_del(&event->link); + } + spin_unlock(&event_queue->lock); + return event; + } + if (ret == -ERESTARTSYS) { + pr_debug("event waiting interrupted\n"); + return ERR_PTR(-EINTR); + } + + pr_err("Failed to wait for event. errno=%d\n", abs(ret)); + return ERR_PTR(ret); +} diff --git a/drivers/block/blksnap/event_queue.h b/drivers/block/blksnap/event_queue.h new file mode 100644 index 000000000000..c919eee3ed96 --- /dev/null +++ b/drivers/block/blksnap/event_queue.h @@ -0,0 +1,64 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_EVENT_QUEUE_H +#define __BLKSNAP_EVENT_QUEUE_H + +#include +#include +#include +#include +#include + +/** + * struct event - An event to be passed to user space. + * @link: + * List head that links the event into the queue. + * @time: + * Timestamp of when the event occurred. + * @code: + * Event code. + * @data_size: + * The number of bytes in the event data array. + * @data: + * An array of event data. + * + * Events differ in kind, so they carry different data. The size of the + * data array is not fixed, but it is bounded: the size of the whole + * event structure is limited to PAGE_SIZE (typically 4096 bytes). + */ +struct event { + struct list_head link; + ktime_t time; + int code; + int data_size; + char data[]; +}; + +/** + * struct event_queue - A queue of &struct event. + * @list: + * Linked list for storing events. 
+ * @lock: + * Spinlock allows to guarantee safety of the linked list. + * @wq_head: + * A wait queue allows to put a user thread in a waiting state until + * an event appears in the linked list. + */ +struct event_queue { + struct list_head list; + spinlock_t lock; + struct wait_queue_head wq_head; +}; + +void event_queue_init(struct event_queue *event_queue); +void event_queue_done(struct event_queue *event_queue); + +int event_gen(struct event_queue *event_queue, gfp_t flags, int code, + const void *data, int data_size); +struct event *event_wait(struct event_queue *event_queue, + unsigned long timeout_ms); +static inline void event_free(struct event *event) +{ + kfree(event); +}; +#endif /* __BLKSNAP_EVENT_QUEUE_H */ From patchwork Fri Nov 24 16:59:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169526 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1371186vqx; Fri, 24 Nov 2023 09:02:31 -0800 (PST) X-Google-Smtp-Source: AGHT+IGsPhVhw10SAs6hwnU3TttYWQqYEpD8ZFOJkEh8nQ0HEAvXr7z8BM7+wC46antrVrO1/L9c X-Received: by 2002:a81:bb48:0:b0:5ae:c0e2:da1b with SMTP id a8-20020a81bb48000000b005aec0e2da1bmr3451033ywl.45.1700845351328; Fri, 24 Nov 2023 09:02:31 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700845351; cv=none; d=google.com; s=arc-20160816; b=ayNF1MQEnK/yadcxJS6/XcvZMugVbbgGw7wNkm7KDBUlB5qP6yVtU5aHbC7F3WLpnr GeQV3XcHplEYLZNGLB9aWvkhaw4D1W8rk/g/tSifQalyFwQg0zz9t+hHU7pjpoYN9nEM kMJnvWgYkj1JhSo1uL2dbEj3nCYBJ65L6NUIz0LhAvYTOTs0noBsDSj/tz/8WFvFi6Ja 33kEO0y5HsJkOAbyY7NV6kKSqiFDDavVGkLLRuCQv8tR1yWbCVVYhVkYSkz4f9b/6kDQ udsfr5CDIhAGxTTFW3lK9BpCGtEDLohQctRHLGRT1krcOOPDX4fYAAgTJGs8ktW0u3Ia y7Fw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from 
:dkim-signature; bh=zT8W0511OjWhWNR4kOTUZaAUsYLmsetenUDLVK6jUuQ=; fh=2mCdnEsEQzaNNg7WB3fw4oVqqq+eEoqCz0tNA1gXC/4=; b=saxFyrRsCv2IYjfdxTNwGtJJvV6Md5j4tUKV8nYOdPoXpcOt8qYPwHZ+3km4X3Z5H7 YE+NJeDWmNMw7Vs+zZ0BpG2lYsw8w7C8m5uOYIBNhHo2DET7du/MRY1WhnR/TOPkhwG5 F+neOinV0P1UggSgtQ3joIC2ps5q6h2qMN3lJFotCD5AK8WOsmz/m7pQIFlwaGAzm9uN AijtilZrnyEQcimv+BvZeTlmLwtGOK3YjmMVzUNUjZ2CLxAUlzE+Hj0qoAJFt9JJXIOJ U289Oi67EpbRvuG6LqwDd29DLqR33qANWkoP8O8KN6W6aFSX7ETghO86btfYTahxHhpn UVDw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=mg29T9fv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from lipwig.vger.email (lipwig.vger.email. [23.128.96.33]) by mx.google.com with ESMTPS id s33-20020a814521000000b005cca9169320si2559541ywa.248.2023.11.24.09.02.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 09:02:31 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) client-ip=23.128.96.33; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=mg29T9fv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.33 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id 4872480AA26B; Fri, 24 Nov 2023 09:01:43 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346020AbjKXRA6 (ORCPT + 99 others); Fri, 24 Nov 2023 12:00:58 -0500 Received: from lindbergh.monkeyblade.net 
([23.128.96.19]:41642 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229907AbjKXRAW (ORCPT ); Fri, 24 Nov 2023 12:00:22 -0500 Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [91.218.175.179]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1A8CF2125 for ; Fri, 24 Nov 2023 09:00:22 -0800 (PST) X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700845220; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=zT8W0511OjWhWNR4kOTUZaAUsYLmsetenUDLVK6jUuQ=; b=mg29T9fvYzHSof4M/GR7M9Li/3bW2JN/FuWllDj5AiebiSH6KK1o3K2doCoizULEhzm3N7 +3p2wfQxpifJDNYkZ9uMLCwwdXqJGp0dyPJWh4N2Y4TCRoBSD7YE9QAgcxIfQuIfRNkw5S DWu1R4Ss0ksk8oNiTyRbXJtLwa8esko= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa Subject: [PATCH v6 09/11] blksnap: snapshot and snapshot image block device Date: Fri, 24 Nov 2023 17:59:31 +0100 Message-Id: <20231124165933.27580-10-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 09:01:43 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783465614788717267 X-GMAIL-MSGID: 1783465614788717267 From: Sergei Shtepa The struct snapshot combines the block devices for which a snapshot is created, the block devices of their snapshot images, and a difference storage. Several snapshots may exist at the same time, but they must not share block devices. This is useful when backups are scheduled once an hour for some block devices, once a day for others, and once a week for yet others. In that case, three snapshots may be in use at the same time. Snapshot images of block devices provide read and write operations. They redirect I/O units to the original block device or to the difference storage devices. Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Signed-off-by: Sergei Shtepa --- drivers/block/blksnap/snapimage.c | 134 +++++++++ drivers/block/blksnap/snapimage.h | 10 + drivers/block/blksnap/snapshot.c | 440 ++++++++++++++++++++++++++++++ drivers/block/blksnap/snapshot.h | 64 +++++ 4 files changed, 648 insertions(+) create mode 100644 drivers/block/blksnap/snapimage.c create mode 100644 drivers/block/blksnap/snapimage.h create mode 100644 drivers/block/blksnap/snapshot.c create mode 100644 drivers/block/blksnap/snapshot.h diff --git a/drivers/block/blksnap/snapimage.c b/drivers/block/blksnap/snapimage.c new file mode 100644 index 000000000000..6efd39d2ce79 --- /dev/null +++ b/drivers/block/blksnap/snapimage.c @@ -0,0 +1,134 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +/* + * Present the snapshot image as a block device. 
+ */ +#define pr_fmt(fmt) KBUILD_MODNAME "-image: " fmt +#include +#include +#include +#include +#include +#include "snapimage.h" +#include "tracker.h" +#include "chunk.h" +#include "cbt_map.h" + +/* + * The snapshot image supports write operations. This makes it possible, for + * example, to delete files from the file system before backing up the volume. + * Written data is stored only in the difference storage. Therefore, before a + * region is partially overwritten, it must be read from the original device. + */ +static void snapimage_submit_bio(struct bio *bio) +{ + struct tracker *tracker = bio->bi_bdev->bd_disk->private_data; + struct diff_area *diff_area = tracker->diff_area; + unsigned int old_nofs; + struct blkfilter *prev_filter; + bool is_success = true; + + /* + * We can use the diff_area here without fear that it will be released. + * The diff_area is not released at this point, because + * snapimage_free() is called before diff_area_put() in + * tracker_release_snapshot(). + */ + if (diff_area_is_corrupted(diff_area)) { + bio_io_error(bio); + return; + } + + /* + * The change tracking table must record that the image block device + * differs from the original device. At the next snapshot, such + * blocks must always be reread. 
+ */ + if (op_is_write(bio_op(bio))) + cbt_map_set_both(tracker->cbt_map, bio->bi_iter.bi_sector, + bio_sectors(bio)); + + prev_filter = current->blk_filter; + current->blk_filter = &tracker->filter; + old_nofs = memalloc_nofs_save(); + while (bio->bi_iter.bi_size && is_success) + is_success = diff_area_submit_chunk(diff_area, bio); + memalloc_nofs_restore(old_nofs); + current->blk_filter = prev_filter; + + if (is_success) + bio_endio(bio); + else + bio_io_error(bio); +} + +static const struct block_device_operations bd_ops = { + .owner = THIS_MODULE, + .submit_bio = snapimage_submit_bio, +}; + +void snapimage_free(struct tracker *tracker) +{ + struct gendisk *disk = tracker->snap_disk; + + if (!disk) + return; + + pr_debug("Snapshot image disk %s delete\n", disk->disk_name); + del_gendisk(disk); + put_disk(disk); + + tracker->snap_disk = NULL; +} + +int snapimage_create(struct tracker *tracker) +{ + int ret = 0; + dev_t dev_id = tracker->dev_id; + struct gendisk *disk; + + pr_info("Create snapshot image device for original device [%u:%u]\n", + MAJOR(dev_id), MINOR(dev_id)); + + disk = blk_alloc_disk(NUMA_NO_NODE); + if (!disk) { + pr_err("Failed to allocate disk\n"); + return -ENOMEM; + } + + disk->flags = GENHD_FL_NO_PART; + disk->fops = &bd_ops; + disk->private_data = tracker; + set_capacity(disk, tracker->cbt_map->device_capacity); + ret = snprintf(disk->disk_name, DISK_NAME_LEN, "%s_%d:%d", + BLKSNAP_IMAGE_NAME, MAJOR(dev_id), MINOR(dev_id)); + if (ret < 0) { + pr_err("Unable to set disk name for snapshot image device: invalid device id [%d:%d]\n", + MAJOR(dev_id), MINOR(dev_id)); + ret = -EINVAL; + goto fail_cleanup_disk; + } + pr_debug("Snapshot image disk name [%s]\n", disk->disk_name); + + blk_queue_physical_block_size(disk->queue, + tracker->diff_area->physical_blksz); + blk_queue_logical_block_size(disk->queue, + tracker->diff_area->logical_blksz); + + ret = add_disk(disk); + if (ret) { + pr_err("Failed to add disk [%s] for snapshot image device\n", + 
disk->disk_name); + goto fail_cleanup_disk; + } + tracker->snap_disk = disk; + + pr_debug("Image block device [%d:%d] has been created\n", + disk->major, disk->first_minor); + + return 0; + +fail_cleanup_disk: + put_disk(disk); + return ret; +} diff --git a/drivers/block/blksnap/snapimage.h b/drivers/block/blksnap/snapimage.h new file mode 100644 index 000000000000..cb2df7019eb8 --- /dev/null +++ b/drivers/block/blksnap/snapimage.h @@ -0,0 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_SNAPIMAGE_H +#define __BLKSNAP_SNAPIMAGE_H + +struct tracker; + +void snapimage_free(struct tracker *tracker); +int snapimage_create(struct tracker *tracker); +#endif /* __BLKSNAP_SNAPIMAGE_H */ diff --git a/drivers/block/blksnap/snapshot.c b/drivers/block/blksnap/snapshot.c new file mode 100644 index 000000000000..21d94f12b5fc --- /dev/null +++ b/drivers/block/blksnap/snapshot.c @@ -0,0 +1,440 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#define pr_fmt(fmt) KBUILD_MODNAME "-snapshot: " fmt + +#include +#include +#include +#include +#include "snapshot.h" +#include "tracker.h" +#include "diff_storage.h" +#include "diff_area.h" +#include "snapimage.h" +#include "cbt_map.h" + +static LIST_HEAD(snapshots); +static DECLARE_RWSEM(snapshots_lock); + +static void snapshot_free(struct kref *kref) +{ + struct snapshot *snapshot = container_of(kref, struct snapshot, kref); + + pr_info("Release snapshot %pUb\n", &snapshot->id); + while (!list_empty(&snapshot->trackers)) { + struct tracker *tracker; + + tracker = list_first_entry(&snapshot->trackers, struct tracker, + link); + list_del_init(&tracker->link); + tracker_release_snapshot(tracker); + tracker_put(tracker); + } + + diff_storage_put(snapshot->diff_storage); + snapshot->diff_storage = NULL; + kfree(snapshot); +} + +static inline void snapshot_get(struct snapshot *snapshot) +{ + kref_get(&snapshot->kref); +}; +static 
inline void snapshot_put(struct snapshot *snapshot) +{ + if (likely(snapshot)) + kref_put(&snapshot->kref, snapshot_free); +}; + +static struct snapshot *snapshot_new(void) +{ + int ret; + struct snapshot *snapshot = NULL; + + snapshot = kzalloc(sizeof(struct snapshot), GFP_KERNEL); + if (!snapshot) + return ERR_PTR(-ENOMEM); + + snapshot->diff_storage = diff_storage_new(); + if (!snapshot->diff_storage) { + ret = -ENOMEM; + goto fail_free_snapshot; + } + + INIT_LIST_HEAD(&snapshot->link); + kref_init(&snapshot->kref); + uuid_gen(&snapshot->id); + init_rwsem(&snapshot->rw_lock); + snapshot->is_taken = false; + INIT_LIST_HEAD(&snapshot->trackers); + + return snapshot; + +fail_free_snapshot: + kfree(snapshot); + + return ERR_PTR(ret); +} + +void __exit snapshot_done(void) +{ + struct snapshot *snapshot; + + pr_debug("Cleanup snapshots\n"); + do { + down_write(&snapshots_lock); + snapshot = list_first_entry_or_null(&snapshots, struct snapshot, + link); + if (snapshot) + list_del(&snapshot->link); + up_write(&snapshots_lock); + + snapshot_put(snapshot); + } while (snapshot); +} + +int snapshot_create(struct blksnap_snapshot_create *arg) +{ + int ret; + struct snapshot *snapshot = NULL; + + snapshot = snapshot_new(); + if (IS_ERR(snapshot)) { + pr_err("Unable to create snapshot: failed to allocate snapshot structure\n"); + return PTR_ERR(snapshot); + } + + export_uuid(arg->id.b, &snapshot->id); + + ret = diff_storage_set_diff_storage(snapshot->diff_storage, + arg->diff_storage_fd, + arg->diff_storage_limit_sect); + if (ret) { + pr_err("Unable to create snapshot: invalid difference storage file\n"); + snapshot_put(snapshot); + return ret; + } + + down_write(&snapshots_lock); + list_add_tail(&snapshot->link, &snapshots); + up_write(&snapshots_lock); + + pr_info("Snapshot %pUb was created\n", arg->id.b); + return 0; +} + +static struct snapshot *snapshot_get_by_id(const uuid_t *id) +{ + struct snapshot *snapshot = NULL; + struct snapshot *s; + + down_read(&snapshots_lock); 
+ if (list_empty(&snapshots)) + goto out; + + list_for_each_entry(s, &snapshots, link) { + if (uuid_equal(&s->id, id)) { + snapshot = s; + snapshot_get(snapshot); + break; + } + } +out: + up_read(&snapshots_lock); + return snapshot; +} + +int snapshot_add_device(const uuid_t *id, struct tracker *tracker) +{ + int ret = 0; + struct snapshot *snapshot = NULL; + + snapshot = snapshot_get_by_id(id); + if (!snapshot) + return -ESRCH; + + down_write(&snapshot->rw_lock); + if (tracker->dev_id == snapshot->diff_storage->dev_id) { + pr_err("The block device %d:%d is already being used as difference storage\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + goto out_up; + } + if (!list_empty(&snapshot->trackers)) { + struct tracker *tr; + + list_for_each_entry(tr, &snapshot->trackers, link) { + if ((tr == tracker) || + (tr->dev_id == tracker->dev_id)) { + ret = -EALREADY; + goto out_up; + } + } + } + if (list_empty(&tracker->link)) { + tracker_get(tracker); + list_add_tail(&tracker->link, &snapshot->trackers); + } else + ret = -EBUSY; +out_up: + up_write(&snapshot->rw_lock); + + snapshot_put(snapshot); + + return ret; +} + +int snapshot_destroy(const uuid_t *id) +{ + struct snapshot *snapshot = NULL; + + pr_info("Destroy snapshot %pUb\n", id); + down_write(&snapshots_lock); + if (!list_empty(&snapshots)) { + struct snapshot *s = NULL; + + list_for_each_entry(s, &snapshots, link) { + if (uuid_equal(&s->id, id)) { + snapshot = s; + list_del(&snapshot->link); + break; + } + } + } + up_write(&snapshots_lock); + + if (!snapshot) { + pr_err("Unable to destroy snapshot: cannot find snapshot by id %pUb\n", + id); + return -ENODEV; + } + snapshot_put(snapshot); + + return 0; +} + +static int snapshot_take_trackers(struct snapshot *snapshot) +{ + int ret = 0; + struct tracker *tracker; + + down_write(&snapshot->rw_lock); + + if (list_empty(&snapshot->trackers)) { + ret = -ENODEV; + goto fail; + } + + list_for_each_entry(tracker, &snapshot->trackers, link) { + struct diff_area 
*diff_area = + diff_area_new(tracker, snapshot->diff_storage); + + if (IS_ERR(diff_area)) { + ret = PTR_ERR(diff_area); + break; + } + tracker->diff_area = diff_area; + } + if (ret) + goto fail; + + /* + * Try to flush and freeze file system on each original block device. + */ + list_for_each_entry(tracker, &snapshot->trackers, link) { + if (freeze_bdev(tracker->diff_area->orig_bdev)) + pr_warn("Failed to freeze device [%u:%u]\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + else { + pr_debug("Device [%u:%u] was frozen\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + } + } + + /* + * Take snapshot - switch CBT tables and enable COW logic for each + * tracker. + */ + list_for_each_entry(tracker, &snapshot->trackers, link) { + ret = tracker_take_snapshot(tracker); + if (ret) { + pr_err("Unable to take snapshot: failed to capture snapshot %pUb\n", + &snapshot->id); + break; + } + } + + if (!ret) + snapshot->is_taken = true; + + /* + * Thaw file systems on original block devices. + */ + list_for_each_entry(tracker, &snapshot->trackers, link) { + if (thaw_bdev(tracker->diff_area->orig_bdev)) + pr_warn("Failed to thaw device [%u:%u]\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + else + pr_debug("Device [%u:%u] was unfrozen\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + } +fail: + if (ret) { + list_for_each_entry(tracker, &snapshot->trackers, link) { + if (tracker->diff_area) { + diff_area_put(tracker->diff_area); + tracker->diff_area = NULL; + } + } + } + up_write(&snapshot->rw_lock); + return ret; +} + +/* + * Sometimes a snapshot is in the state of corrupt immediately after it is + * taken. 
+ */ +static int snapshot_check_trackers(struct snapshot *snapshot) +{ + int ret = 0; + struct tracker *tracker; + + down_read(&snapshot->rw_lock); + + list_for_each_entry(tracker, &snapshot->trackers, link) { + if (unlikely(diff_area_is_corrupted(tracker->diff_area))) { + pr_err("Unable to create snapshot for device [%u:%u]: diff area is corrupted\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id)); + ret = -EFAULT; + break; + } + } + + up_read(&snapshot->rw_lock); + + return ret; +} + +/* + * Create all image block devices. + */ +static int snapshot_take_images(struct snapshot *snapshot) +{ + int ret = 0; + struct tracker *tracker; + + down_write(&snapshot->rw_lock); + + list_for_each_entry(tracker, &snapshot->trackers, link) { + ret = snapimage_create(tracker); + + if (ret) { + pr_err("Failed to create snapshot image for device [%u:%u] with error=%d\n", + MAJOR(tracker->dev_id), MINOR(tracker->dev_id), + ret); + break; + } + } + + up_write(&snapshot->rw_lock); + return ret; +} + +static int snapshot_release_trackers(struct snapshot *snapshot) +{ + int ret = 0; + struct tracker *tracker; + + down_write(&snapshot->rw_lock); + + list_for_each_entry(tracker, &snapshot->trackers, link) + tracker_release_snapshot(tracker); + + up_write(&snapshot->rw_lock); + return ret; +} + +int snapshot_take(const uuid_t *id) +{ + int ret = 0; + struct snapshot *snapshot; + + snapshot = snapshot_get_by_id(id); + if (!snapshot) + return -ESRCH; + + if (!snapshot->is_taken) { + ret = snapshot_take_trackers(snapshot); + if (!ret) { + ret = snapshot_check_trackers(snapshot); + if (!ret) + ret = snapshot_take_images(snapshot); + } + + if (ret) + snapshot_release_trackers(snapshot); + } else + ret = -EALREADY; + + snapshot_put(snapshot); + + if (ret) + pr_err("Unable to take snapshot %pUb\n", id); + else + pr_info("Snapshot %pUb was taken successfully\n", + id); + return ret; +} + +int snapshot_collect(unsigned int *pcount, + struct blksnap_uuid __user *id_array) 
+{ + int ret = 0; + int inx = 0; + struct snapshot *s; + + pr_debug("Collect snapshots\n"); + + down_read(&snapshots_lock); + if (list_empty(&snapshots)) + goto out; + + if (!id_array) { + list_for_each_entry(s, &snapshots, link) + inx++; + goto out; + } + + list_for_each_entry(s, &snapshots, link) { + if (inx >= *pcount) { + ret = -ENODATA; + goto out; + } + + if (copy_to_user(id_array[inx].b, &s->id.b, sizeof(uuid_t))) { + pr_err("Unable to collect snapshots: failed to copy data to user buffer\n"); + goto out; + } + + inx++; + } +out: + up_read(&snapshots_lock); + *pcount = inx; + return ret; +} + +struct event *snapshot_wait_event(const uuid_t *id, unsigned long timeout_ms) +{ + struct snapshot *snapshot; + struct event *event; + + snapshot = snapshot_get_by_id(id); + if (!snapshot) + return ERR_PTR(-ESRCH); + + event = event_wait(&snapshot->diff_storage->event_queue, timeout_ms); + + snapshot_put(snapshot); + return event; +} diff --git a/drivers/block/blksnap/snapshot.h b/drivers/block/blksnap/snapshot.h new file mode 100644 index 000000000000..8d24926bf86e --- /dev/null +++ b/drivers/block/blksnap/snapshot.h @@ -0,0 +1,64 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2023 Veeam Software Group GmbH */ +#ifndef __BLKSNAP_SNAPSHOT_H +#define __BLKSNAP_SNAPSHOT_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include "event_queue.h" + +struct tracker; +struct diff_storage; +/** + * struct snapshot - Snapshot structure. + * @link: + * The list header allows to store snapshots in a linked list. + * @kref: + * Protects the structure from being released during the processing of + * an ioctl. + * @id: + * UUID of snapshot. + * @rw_lock: + * Protects the structure from being modified by different threads. + * @is_taken: + * Flag that the snapshot was taken. + * @diff_storage: + * A pointer to the difference storage of this snapshot. + * @trackers: + * List of block device trackers. 
+ * + * A snapshot corresponds to a single backup session and provides snapshot + * images for multiple block devices. Several backup sessions can be performed + * at the same time, which means that several snapshots can exist at the same + * time. However, the original block device can only belong to one snapshot. + * Creating multiple snapshots from the same block device is not allowed. + */ +struct snapshot { + struct list_head link; + struct kref kref; + uuid_t id; + + struct rw_semaphore rw_lock; + + bool is_taken; + struct diff_storage *diff_storage; + struct list_head trackers; +}; + +void __exit snapshot_done(void); + +int snapshot_create(struct blksnap_snapshot_create *arg); +int snapshot_destroy(const uuid_t *id); +int snapshot_add_device(const uuid_t *id, struct tracker *tracker); +int snapshot_take(const uuid_t *id); +int snapshot_collect(unsigned int *pcount, + struct blksnap_uuid __user *id_array); +struct event *snapshot_wait_event(const uuid_t *id, unsigned long timeout_ms); + +#endif /* __BLKSNAP_SNAPSHOT_H */ From patchwork Fri Nov 24 16:59:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169529 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1371312vqx; Fri, 24 Nov 2023 09:02:38 -0800 (PST) X-Google-Smtp-Source: AGHT+IHRFxTRIEKcVJYbfmDQmObTUPuvmGuXWN5JDL9PYzo0aJbfQ741XzhYTBEcayhOl131aQk2 X-Received: by 2002:a05:6808:3386:b0:3b8:3e7d:db75 with SMTP id ce6-20020a056808338600b003b83e7ddb75mr4058577oib.55.1700845358399; Fri, 24 Nov 2023 09:02:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700845358; cv=none; d=google.com; s=arc-20160816; b=tNdiykJ87R4U8CrdKoHFAelHvsT+4p/BqLI6XGsIzRPbHmktAgENKEEQ8oSHaAl78d ThXkeWA+AruHGPAvns/MUJnr8n/ISTiYIaWfFa4xXjJMk44g1NF+gS4XuQaKzQQ5Njt5 HjmnWvgEUBhCpUTOde/DJ9axgLVGfWXfwtXVISgfjxncsJLg7FTLohtOEItp+c7y7Tcf 
+pgD8t+2TQlAoxhNuMjRCnd3rePKBKlZhWuyE7M3OGyEEZaelDW6oQbAGXoUpM0YuDMB SbdENI5Xqn5uxOVhLx0Uhgr0gTeZyghiLFQWPf6k2A5I/Mpq3+FAk0pijIREmZchKpp5 JQbQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=3H2EngB4IC7J9sBiro1fbNUKQKRJyLxYEzCizNHIL5o=; fh=2mCdnEsEQzaNNg7WB3fw4oVqqq+eEoqCz0tNA1gXC/4=; b=qYRykH9pW6oAje7RV89pS2DZxm7Q9lp3a6pfJHYWiPQpVoOhXq762GL4ReE7uhpD6N 2SbZp/FKdQLthQKdw0Nyx73jHnSc941G4WudxgkYFPKCzVdt+1PCxK7zcYWTAT4/0XMA c514HoayFy7dW2D3Edx72MkpQdv8W2/1wvZPqZCPGQRKBMYmr66wwj4GHA4l8emLST4Z VnwyN8Mj1bfaE2jpoVwDakoAYH0m+QUA4WviBfKlIJFGlvhLw7dzSvxSRa/exMv0icg5 UPP5+kKxZWxLVHpVTxBhyDVmMfrRjtOh1xvJjirlfD/pcAgaPYoqcsZsZeME6dvTNwvE IUDA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=Vz45gGWB; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from lipwig.vger.email (lipwig.vger.email. 
[2620:137:e000::3:3]) by mx.google.com with ESMTPS id s14-20020a05680810ce00b003afafb944aesi1662577ois.27.2023.11.24.09.02.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 09:02:38 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) client-ip=2620:137:e000::3:3; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=Vz45gGWB; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:3 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by lipwig.vger.email (Postfix) with ESMTP id BD3D182B3DD2; Fri, 24 Nov 2023 09:02:07 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at lipwig.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232903AbjKXRBF (ORCPT + 99 others); Fri, 24 Nov 2023 12:01:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47276 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235292AbjKXRAZ (ORCPT ); Fri, 24 Nov 2023 12:00:25 -0500 Received: from out-173.mta0.migadu.com (out-173.mta0.migadu.com [IPv6:2001:41d0:1004:224b::ad]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 69D3719BB; Fri, 24 Nov 2023 09:00:26 -0800 (PST) X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700845224; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=3H2EngB4IC7J9sBiro1fbNUKQKRJyLxYEzCizNHIL5o=; b=Vz45gGWBhO5z+XKwoOw0E6D48Hta7AOYAEttidaa3w/z4W5sITlJaeXjSXd85Nt+hl3kCU aTXHFwqMqwV7hj0k8jUi9LEvpxFLXXr4H0TTC/BuUMrA9KKE5gAvN0U8zPgoak0/Y9qGD0 EHNGghNq4QYnKjn5s5T4+I7bOJV0p9k= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa Subject: [PATCH v6 10/11] blksnap: Kconfig and Makefile Date: Fri, 24 Nov 2023 17:59:32 +0100 Message-Id: <20231124165933.27580-11-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Fri, 24 Nov 2023 09:02:07 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783465622512795405 X-GMAIL-MSGID: 1783465622512795405 From: Sergei Shtepa Allows to build a module and add the blksnap to the kernel tree. 
Co-developed-by: Christoph Hellwig Signed-off-by: Christoph Hellwig Signed-off-by: Sergei Shtepa --- drivers/block/Kconfig | 2 ++ drivers/block/Makefile | 2 ++ drivers/block/blksnap/Kconfig | 31 +++++++++++++++++++++++++++++++ drivers/block/blksnap/Makefile | 15 +++++++++++++++ 4 files changed, 50 insertions(+) create mode 100644 drivers/block/blksnap/Kconfig create mode 100644 drivers/block/blksnap/Makefile diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 5b9d4aaebb81..74d2d55526a3 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -404,4 +404,6 @@ config BLKDEV_UBLK_LEGACY_OPCODES source "drivers/block/rnbd/Kconfig" +source "drivers/block/blksnap/Kconfig" + endif # BLK_DEV diff --git a/drivers/block/Makefile b/drivers/block/Makefile index 101612cba303..9a2a9a56a247 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -40,3 +40,5 @@ obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk/ obj-$(CONFIG_BLK_DEV_UBLK) += ublk_drv.o swim_mod-y := swim.o swim_asm.o + +obj-$(CONFIG_BLKSNAP) += blksnap/ diff --git a/drivers/block/blksnap/Kconfig b/drivers/block/blksnap/Kconfig new file mode 100644 index 000000000000..f52272c12e1b --- /dev/null +++ b/drivers/block/blksnap/Kconfig @@ -0,0 +1,31 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Block device snapshot module configuration +# + +config BLKSNAP + tristate "Block Devices Snapshots Module (blksnap)" + help + Allow to create snapshots and track block changes for block devices. + Designed for creating backups for block devices. Snapshots are + temporary and are released when backup is completed. Change block + tracking allows to create incremental or differential backups. + +config BLKSNAP_DIFF_BLKDEV + bool "Use an optimized algorithm to store difference on a block device" + depends on BLKSNAP + default y + help + The difference storage for a snapshot can be a regular file or a + block device. We can work with a block device through the interface + of a regular file. 
However, direct management of I/O units should + allow for higher performance. + +config BLKSNAP_CHUNK_DIFF_BIO_SYNC + bool "Use a synchronous I/O unit processing algorithm for the snapshot image" + depends on BLKSNAP + default n + help + Theoretical asynchronous algorithm for processing I/O units should + have higher performance. However, an attempt to confirm this on test + runs did not bring any results. diff --git a/drivers/block/blksnap/Makefile b/drivers/block/blksnap/Makefile new file mode 100644 index 000000000000..8d528b95579a --- /dev/null +++ b/drivers/block/blksnap/Makefile @@ -0,0 +1,15 @@ +# SPDX-License-Identifier: GPL-2.0 + +blksnap-y := \ + cbt_map.o \ + chunk.o \ + diff_area.o \ + diff_buffer.o \ + diff_storage.o \ + event_queue.o \ + main.o \ + snapimage.o \ + snapshot.o \ + tracker.o + +obj-$(CONFIG_BLKSNAP) += blksnap.o From patchwork Fri Nov 24 16:59:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sergei Shtepa X-Patchwork-Id: 169527 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1371231vqx; Fri, 24 Nov 2023 09:02:33 -0800 (PST) X-Google-Smtp-Source: AGHT+IHvpQZikikJkAE+0cCiNiXTDuUr28gRpb0NtOU7Di3gxIR3PaquPUeTo06rfBVklu2JN0yc X-Received: by 2002:a05:6e02:20c7:b0:358:141:8584 with SMTP id 7-20020a056e0220c700b0035801418584mr4934764ilq.17.1700845353450; Fri, 24 Nov 2023 09:02:33 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700845353; cv=none; d=google.com; s=arc-20160816; b=uj2lu8J7gV2/Dn2cj5HR+p2FTVHKhjPp1GAfNL3tecenxnI40ERA/hi+XpsH7iZd4V Pt2sPXeKeri1432nMjB8T4UTqceB6FHOFSdnlP17iKAb7i8Z6/qCNMvoQPA9nwpqqXGv rfNx5/zfK0bgfq71TXw7149IoBfRUP0DoZbh2UV8VIPLwKKeYrof1EjtfmoTZZOkhCYa fFYVvADI0Q2YDsXEMFqOliqDLoKepIAJgQLfzp1SJHlihDSbEfxgfRQtYkKY5pKZDCso YWDYJFPk9Vm84T0mPIktpoCz8YoKf0Dduydstq74N51fNLwzIRqR7mjssiqqilrmuYJo jO+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; 
h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=QWk7YTMvB3DTfZc9sNUwNxdg+CAbRo1hOQbqO2wSQ6A=; fh=lPjKKhlWGDRjdXlpc1pR+zlTQWWXClUGj0ay55zHgiQ=; b=k2JZU7rNWQAxXSEHVay8bNWc/R4M96r5yhfzsJW+Af5X/zRHJSZtYKbgmThjv7nkUO 7pzk4Pfj/xZBKEe4TwB0WqPIbnOomJFFz3nX9H7joxdI3Z+w5fXQGcHpImVuG4sn2hsA BTOIgE3r4+1/7dMJNYt+gskQknqAcBJWqQWxxiLHy7qN3Hb5GwRasFxME6HTxQiFiVI7 75dX4gOKywYITc+n/mrGaJpxwdMHo08Xjo++4AVu4jR/oOnArDcBYmFWGtCEFGPUJ6VA KDYZlR+feDk+8Ye1/9DYZL20t1sLxgvr3P262ytwIBKyoNhuCAwZmaH7lByGNFV9DPAa KrEA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b="YS/3MDkQ"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from groat.vger.email (groat.vger.email. [2620:137:e000::3:5]) by mx.google.com with ESMTPS id a19-20020a056638165300b004667a7ba3f6si2286266jat.126.2023.11.24.09.02.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 09:02:33 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) client-ip=2620:137:e000::3:5; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b="YS/3MDkQ"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by groat.vger.email (Postfix) with ESMTP id 34A1A83D1554; Fri, 24 Nov 2023 09:01:52 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at groat.vger.email Received: (majordomo@vger.kernel.org) by 
vger.kernel.org via listexpand id S235084AbjKXRBP (ORCPT + 99 others); Fri, 24 Nov 2023 12:01:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49362 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345801AbjKXRA3 (ORCPT ); Fri, 24 Nov 2023 12:00:29 -0500 Received: from out-185.mta0.migadu.com (out-185.mta0.migadu.com [91.218.175.185]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A4D161BC8 for ; Fri, 24 Nov 2023 09:00:30 -0800 (PST) X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700845228; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=QWk7YTMvB3DTfZc9sNUwNxdg+CAbRo1hOQbqO2wSQ6A=; b=YS/3MDkQDJLZ1KOe1iNDAZDteB/JflqObZ/sxnSGYirEVXMmWDGjQitpPoxMWbwITRRGcK lCaiEJ8jACRUkhE5INGw8DLQ17uYGYUcBA+kCBAPjM3CKQX83iJY0AdVDX5aUYv45OjLes yYiDU6Zy3ApLCGmeU0KMq1smX6MXIK4= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa , Eric Biggers Subject: [PATCH v6 11/11] blksnap: prevents using devices with data integrity or inline encryption Date: Fri, 24 Nov 2023 17:59:33 +0100 Message-Id: <20231124165933.27580-12-sergei.shtepa@linux.dev> In-Reply-To: <20231124165933.27580-1-sergei.shtepa@linux.dev> References: <20231124165933.27580-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, 
SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on groat.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (groat.vger.email [0.0.0.0]); Fri, 24 Nov 2023 09:01:53 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783465617276094491 X-GMAIL-MSGID: 1783465617276094491 From: Sergei Shtepa There is a concern that using the blksnap module may compromise the security of encrypted data. The difference storage file may be located on an unreliable disk or even on network storage. Implementing secure compatibility with devices that use hardware inline encryption will require a discussion of algorithms and restrictions. For example, restricting the difference storage to virtual memory might help. Currently, there is no need for the blksnap module to be compatible with hardware inline encryption. I see no obstacles to making the blksnap module compatible with block devices that use data integrity. However, this functionality was neither planned nor tested. This compatibility may be implemented in the future. It is theoretically possible that a block device is added to the snapshot before crypto_profile and integrity.profile are initialized. Checking the values of bi_crypt_context and bi_integrity ensures that blksnap does not act on I/O units it is not compatible with.
Reported-by: Eric Biggers Signed-off-by: Sergei Shtepa --- drivers/block/blksnap/snapshot.c | 17 +++++++++++++++++ drivers/block/blksnap/tracker.c | 14 ++++++++++++++ 2 files changed, 31 insertions(+) diff --git a/drivers/block/blksnap/snapshot.c b/drivers/block/blksnap/snapshot.c index 21d94f12b5fc..a7675fdcf359 100644 --- a/drivers/block/blksnap/snapshot.c +++ b/drivers/block/blksnap/snapshot.c @@ -149,6 +149,23 @@ int snapshot_add_device(const uuid_t *id, struct tracker *tracker) int ret = 0; struct snapshot *snapshot = NULL; +#ifdef CONFIG_BLK_DEV_INTEGRITY + if (tracker->orig_bdev->bd_disk->queue->integrity.profile) { + pr_err("Blksnap is not compatible with data integrity\n"); + ret = -EPERM; + goto out_up; + } else + pr_debug("Data integrity not found\n"); +#endif + +#ifdef CONFIG_BLK_INLINE_ENCRYPTION + if (tracker->orig_bdev->bd_disk->queue->crypto_profile) { + pr_err("Blksnap is not compatible with hardware inline encryption\n"); + ret = -EPERM; + goto out_up; + } else + pr_debug("Inline encryption not found\n"); +#endif snapshot = snapshot_get_by_id(id); if (!snapshot) return -ESRCH; diff --git a/drivers/block/blksnap/tracker.c b/drivers/block/blksnap/tracker.c index 2b8978a2f42e..b38ead9afa69 100644 --- a/drivers/block/blksnap/tracker.c +++ b/drivers/block/blksnap/tracker.c @@ -57,6 +57,20 @@ static bool tracker_submit_bio(struct bio *bio) if (diff_area_is_corrupted(tracker->diff_area)) return false; +#ifdef CONFIG_BLK_INLINE_ENCRYPTION + if (bio->bi_crypt_context) { + pr_err_once("Hardware inline encryption is not supported\n"); + diff_area_set_corrupted(tracker->diff_area, -EPERM); + return false; + } +#endif +#ifdef CONFIG_BLK_DEV_INTEGRITY + if (bio->bi_integrity) { + pr_err_once("Data integrity is not supported\n"); + diff_area_set_corrupted(tracker->diff_area, -EPERM); + return false; + } +#endif return diff_area_cow(bio, tracker->diff_area, &copy_iter); }
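The two levels of checking in this patch can be illustrated with a small userspace model in C. This is only a sketch and not the blksnap code: the model_* structures and functions are hypothetical stand-ins for the request queue, the bio and the tracker, and the single corrupted_err field is an assumed simplification of diff_area_set_corrupted(). Only the order of the checks follows the patch: a device whose queue advertises an integrity or crypto profile is refused when it is added, and a bio that still carries bi_crypt_context or bi_integrity later on marks the difference area as corrupted with -EPERM.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the queue fields tested in snapshot_add_device(). */
struct model_queue {
	const void *integrity_profile;	/* models queue->integrity.profile */
	const void *crypto_profile;	/* models queue->crypto_profile */
};

/* Hypothetical stand-in for the bio fields tested in tracker_submit_bio(). */
struct model_bio {
	const void *bi_crypt_context;
	const void *bi_integrity;
};

/* Hypothetical stand-in for the tracker and its difference area state. */
struct model_tracker {
	struct model_queue *queue;
	int corrupted_err;	/* 0 while healthy, negative errno once corrupted */
};

/* Level 1: refuse the device up front if either profile is already set. */
static int model_add_device(const struct model_tracker *t)
{
	assert(t != NULL && t->queue != NULL);

	if (t->queue->integrity_profile)
		return -EPERM;
	if (t->queue->crypto_profile)
		return -EPERM;
	return 0;
}

/*
 * Level 2: per-bio check. Returns true if the bio may proceed to the
 * copy-on-write step, false if it must be left alone. The first
 * incompatible bio marks the difference area as corrupted, so every
 * later bio is skipped as well.
 */
static bool model_bio_is_trackable(struct model_tracker *t,
				   const struct model_bio *bio)
{
	assert(t != NULL && bio != NULL);

	if (t->corrupted_err)
		return false;
	if (bio->bi_crypt_context || bio->bi_integrity) {
		t->corrupted_err = -EPERM;
		return false;
	}
	return true;
}
```

With neither profile set, model_add_device() returns 0. If a bio with bi_crypt_context set arrives afterwards, model_bio_is_trackable() returns false and leaves corrupted_err at -EPERM, so subsequent plain bios are skipped too. This corresponds to the case from the commit message where the profiles are initialized only after the device has been added to the snapshot.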