Message ID | 20230414110821.21548-1-p.raghav@samsung.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp305541vqo; Fri, 14 Apr 2023 04:27:03 -0700 (PDT) X-Google-Smtp-Source: AKy350a2kKYJdhDNdER473tZxy31xY870BKi/kTgIldMHl28ntMtfFrV80XyhorTp2+yHUUGocN3 X-Received: by 2002:a05:6a00:1514:b0:637:f447:9916 with SMTP id q20-20020a056a00151400b00637f4479916mr9622547pfu.16.1681471622960; Fri, 14 Apr 2023 04:27:02 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1681471622; cv=none; d=google.com; s=arc-20160816; b=iVsaC6AKfUCRjXrBkTMwMSy31b6IdxP724MbOrrc/J8vd+SzkclCuGEgPHUqiC0UlE 6vdlg1A6mqVqt5ByMAvzL3YPSylH3n60/V39UeHv8q49xCGa2QRktFrFcu5YEUMMcKlj OP/WylcDzG16G9vLEhahXue9r+shQj8EDos+hFjvF7DEJDhlVzGnj0yOa+tONZ5N2NPv QLgw+r7NLEGeVh/+fe/rS8xs7dbfD6YYlB87ZGHJO2D8NsalKC1Op9YLcILsilFQ/Pvg umNED4fpzsSfUe3646f7OnqclOTDMj5X7w1Hzn6o/bxuUonTLONJBn1dzIkD0sMS6GyC sbeA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:references:cms-type:content-transfer-encoding :mime-version:message-id:date:subject:cc:to:from:dkim-signature :dkim-filter; bh=ruVO44Xm7S0hzAbgY2i3Oo5sjo+xMT5oybMi6DD7RXc=; b=Pn0M28DuxOxrbm4AYwo4h5HEIx7R7IUXokCCLozGh5VFsImF+oUBYAzi3R7aTheVir rJvYpPYVcbyo4G4GTWJlsU/dwIKypdHBONROm3OwV1fAJ6DaHI0v4T3qVQnjwXevfDt7 czZ16o3dLskJSeQDGmXgapDMc9+90cat8i5zcE2t2xtvvhV9eXfHpnbvkpL3+Rb4t4EO 7ncm/j//DZ3zpQupRRR1rVQcTIAu1bhcH6x2Z+oUBOQkekSzZRNqzwD+wu93xYCEug1I 5076BnF5jDnqpnVUdW2It7PYuslhmFjdOEIzjXIAqTagWWTs00tgpunQRdCOw+rS27yX dm7w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@samsung.com header.s=mail20170921 header.b="UgGc/0p0"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=samsung.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id t22-20020a635f16000000b0051b3ea07eccsi4500690pgb.91.2023.04.14.04.26.48; Fri, 14 Apr 2023 04:27:02 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@samsung.com header.s=mail20170921 header.b="UgGc/0p0"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=samsung.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230397AbjDNLIc (ORCPT <rfc822;leviz.kernel.dev@gmail.com> + 99 others); Fri, 14 Apr 2023 07:08:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44520 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230129AbjDNLIb (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 14 Apr 2023 07:08:31 -0400 Received: from mailout2.w1.samsung.com (mailout2.w1.samsung.com [210.118.77.12]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5FA48138 for <linux-kernel@vger.kernel.org>; Fri, 14 Apr 2023 04:08:29 -0700 (PDT) Received: from eucas1p1.samsung.com (unknown [182.198.249.206]) by mailout2.w1.samsung.com (KnoxPortal) with ESMTP id 20230414110826euoutp02d98b999f26fe39af489eaade890da69a~VyIRczom11232612326euoutp02T for <linux-kernel@vger.kernel.org>; Fri, 14 Apr 2023 11:08:26 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 mailout2.w1.samsung.com 20230414110826euoutp02d98b999f26fe39af489eaade890da69a~VyIRczom11232612326euoutp02T DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=samsung.com; s=mail20170921; t=1681470506; bh=ruVO44Xm7S0hzAbgY2i3Oo5sjo+xMT5oybMi6DD7RXc=; h=From:To:Cc:Subject:Date:References:From; b=UgGc/0p0LwyuCv7oKYEUmekDv2L2FYUDGUgj7xRLi/XBJhnOmwDN57RxXNmkCYhA9 WL+uYD/eTPh/TzoOh+QlC1JQGR46/P/Ozkg8sNTLV2PmesmznZhmnskRtYnTNeRg7v 3dkL802nrOY1/oqGKxuSsu7hflCnKEjzWST52eek= Received: from eusmges2new.samsung.com (unknown [203.254.199.244]) by eucas1p2.samsung.com (KnoxPortal) with ESMTP id 20230414110825eucas1p21fdd751ca71c03c3451ce9f115e12d2e~VyIQmO-Sk0896208962eucas1p2U; Fri, 14 Apr 2023 11:08:25 +0000 (GMT) Received: from eucas1p2.samsung.com ( [182.198.249.207]) by eusmges2new.samsung.com (EUCPMTA) with SMTP id AA.25.09966.92439346; Fri, 14 Apr 2023 12:08:25 +0100 (BST) Received: from eusmtrp1.samsung.com (unknown [182.198.249.138]) by eucas1p1.samsung.com (KnoxPortal) with ESMTPA id 20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d~VyIQMS3Rj2214822148eucas1p1u; Fri, 14 Apr 2023 11:08:25 +0000 (GMT) Received: from eusmgms1.samsung.com (unknown [182.198.249.179]) by eusmtrp1.samsung.com (KnoxPortal) with ESMTP id 20230414110825eusmtrp1c2b4d631c12ee590a3efb275fb589b29~VyIQLtiec3053530535eusmtrp1Y; Fri, 14 Apr 2023 11:08:25 +0000 (GMT) X-AuditID: cbfec7f4-d4fff700000026ee-e3-643934292f92 Received: from eusmtip1.samsung.com ( [203.254.199.221]) by eusmgms1.samsung.com (EUCPMTA) with SMTP id D0.78.22108.92439346; Fri, 14 Apr 2023 12:08:25 +0100 (BST) Received: from localhost (unknown [106.210.248.243]) by eusmtip1.samsung.com (KnoxPortal) with ESMTPA id 20230414110825eusmtip11a66d34285144c8be31b895d63d417b6~VyIP93LY22918529185eusmtip1y; Fri, 14 Apr 2023 11:08:25 +0000 (GMT) From: Pankaj Raghav <p.raghav@samsung.com> To: brauner@kernel.org, willy@infradead.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, mcgrof@kernel.org, gost.dev@samsung.com, hare@suse.de, Pankaj Raghav <p.raghav@samsung.com> Subject: [RFC 0/4] convert create_page_buffers to create_folio_buffers Date: Fri, 14 Apr 2023 13:08:17 +0200 Message-Id: <20230414110821.21548-1-p.raghav@samsung.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFlrKKsWRmVeSWpSXmKPExsWy7djP87qaJpYpBvfaNS3mrF/DZvH68CdG i5sHdjJZ7Fk0CUjsPclicXnXHDaLGxOeMlp8XtrCbnH+73FWi98/5rA5cHlsXqHlsWlVJ5vH iRm/WTz6tqxi9Nh8utrj8yY5j01P3jIFsEdx2aSk5mSWpRbp2yVwZRw+eoG94BVXxZFOrQbG 6xxdjJwcEgImEo/2zGLvYuTiEBJYwSjxYUI/O0hCSOALo8SBFcEQic+MEj/v/GGB6fh64R0r RGI5o8TM91egnJeMEn0vf7J1MXJwsAloSTR2soOYIgKJEjffK4CUMAssYJS4dfs92AZhATeJ j/tbmEBsFgFViaV/XzGC2LwClhI3+yYyQyyTl9h/8CwzRFxQ4uTMJ2BHMAPFm7fOZgYZKiFw hUPi1OvFbBANLhJ7tm+AulRY4tXxLewQtozE6ck9UPFqiac3fkM1tzBK9O9cD3a0hIC1RN+Z HBCTWUBTYv0ufYhyR4m+y/sYISr4JG68FYQ4gU9i0rbpzBBhXomONiGIaiWJnT+fQC2VkLjc NAdqqYdEy8QfzJCwjZW43/6IaQKjwiwkj81C8tgshBsWMDKvYhRPLS3OTU8tNspLLdcrTswt Ls1L10vOz93ECExJp/8d/7KDcfmrj3qHGJk4GA8xSnAwK4nw/nAxTRHiTUmsrEotyo8vKs1J LT7EKM3BoiTOq217MllIID2xJDU7NbUgtQgmy8TBKdXAtPWggsc/kdap/7ds1SkJ85BbKcV4 7MAv+73n9shm5MRLGJkYdj5g6nk0LdzzeYSFk981oQLj1D1+Mu0da+w/T15ukf3gTuPUkwwL A9a0vYm3FS2a9dllT+TK1Ey77Ftvbrc8aH14/PKGs9xf3X5s2LXPS7OPt6nYy7SEaUGx3MTZ zBI+7Z9vVX076zHbgM2l/F9QrXTc6+5AmdOpxperd7GrnROxcZb/a6564YFQ30snxpdPzXjf CM94sUQ62urPla1/87ldW3dWbVk/v+HL7gWNbefPnRT/Oy3Rbo5bnFvz3iuMH+wMGvs6uLWa Nij5hqibuSsu8zNdcuzds/qdF7j7kq9a3Ip8OzVFYaWtEktxRqKhFnNRcSIADFWv5bgDAAA= X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFprCIsWRmVeSWpSXmKPExsVy+t/xu7qaJpYpBpf3CFrMWb+GzeL14U+M FjcP7GSy2LNoEpDYe5LF4vKuOWwWNyY8ZbT4vLSF3eL83+OsFr9/zGFz4PLYvELLY9OqTjaP EzN+s3j0bVnF6LH5dLXH501yHpuevGUKYI/SsynKLy1JVcjILy6xVYo2tDDSM7S00DMysdQz NDaPtTIyVdK3s0lJzcksSy3St0vQyzh89AJ7wSuuiiOdWg2M1zm6GDk5JARMJL5eeMfaxcjF ISSwlFHi9LHzjBAJCYnbC5ugbGGJP9e62CCKnjNKHJ60nLmLkYODTUBLorGTHaRGRCBV4vSJ j2A1zALLGCXWzH3BBpIQFnCT+Li/hQnEZhFQlVj69xXYUF4BS4mbfROZIRbIS+w/eBZsJrOA psT6XfoQJYISJ2c+YQGxmYFKmrfOZp7AyD8LoWoWkqpZSKoWMDKvYhRJLS3OTc8tNtQrTswt Ls1L10vOz93ECIyebcd+bt7BOO/VR71DjEwcjIcYJTiYlUR4f7iYpgjxpiRWVqUW5ccXleak Fh9iNAW6eiKzlGhyPjB+80riDc0MTA1NzCwNTC3NjJXEeT0LOhKFBNITS1KzU1MLUotg+pg4 OKUamNLXpFxN1Yvfa/mzPy4pbD6ne/GVX9p/+Q69b/5vxHjO788/381+9551hibYnyyv/rRu btkq7pONjc++S/S8eptyeeI9l+9OzfYd6btd/k6WMqv693raVI3yf6WLQjaUqG6SX7JN7WB/ 4qTFPbOM/t2/vOgee3uR0M3H0nKS34TjvBLtS7f3b3t1Rql24hkZf4ctoe8ZLG9uUE8SnHln 7gexKS8Fv5am714aWfPuydLgrjlCa0/9snfrmsNTJnZ39et72bc8pntcnFL+2qSgvk28dMb0 iWv3RMwKKwub6nJqu5CTzroFUQ7LkoJWfvzyw7jYborXjCKX0OhPtjxccxJPdk09tyfjhvFK pqnz2NcpsRRnJBpqMRcVJwIA3tSlDScDAAA= X-CMS-MailID: 20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d X-Msg-Generator: CA Content-Type: text/plain; charset="utf-8" X-RootMTR: 20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d X-EPHeader: CA CMS-TYPE: 201P X-CMS-RootMailID: 20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d References: <CGME20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d@eucas1p1.samsung.com> X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_PASS,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1763150788345063250?= X-GMAIL-MSGID: =?utf-8?q?1763150788345063250?= |
Series |
convert create_page_buffers to create_folio_buffers
|
|
Message
Pankaj Raghav
April 14, 2023, 11:08 a.m. UTC
One of the first kernel panic we hit when we try to increase the block size > 4k is inside create_page_buffers()[1]. Even though buffer.c function do not support large folios (folios > PAGE_SIZE) at the moment, these changes are required when we want to remove that constraint. Willy had already mentioned that he wanted to convert create_page_buffers to create_folio_buffers but didn't get to it yet, so I decided to take a shot. No functional changes introduced. OI: - I don't like the fact that I had to introduce folio_create_empty_buffers() as create_empty_buffers() is used in many parts of the kernel. Should I do a big bang change as a part of this series where we convert create_empty_buffers to take a folio and change the callers to pass a folio instead of a page? - I split the series into 4 commits for clarity. I could squash them into one patch if needed. [1] https://lore.kernel.org/all/ZBnGc4WbBOlnRUgd@casper.infradead.org/ Pankaj Raghav (4): fs/buffer: add set_bh_folio helper buffer: add alloc_folio_buffers() helper fs/buffer: add folio_create_empty_buffers helper fs/buffer: convert create_page_buffers to create_folio_buffers fs/buffer.c | 131 +++++++++++++++++++++++++++++++++--- include/linux/buffer_head.h | 4 ++ 2 files changed, 125 insertions(+), 10 deletions(-)
Comments
On 4/14/23 13:08, Pankaj Raghav wrote: > One of the first kernel panic we hit when we try to increase the > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c > function do not support large folios (folios > PAGE_SIZE) at the moment, > these changes are required when we want to remove that constraint. > > Willy had already mentioned that he wanted to convert create_page_buffers to > create_folio_buffers but didn't get to it yet, so I decided to take a > shot. > > No functional changes introduced. > > OI: > - I don't like the fact that I had to introduce > folio_create_empty_buffers() as create_empty_buffers() is used in > many parts of the kernel. Should I do a big bang change as a part of > this series where we convert create_empty_buffers to take a folio and > change the callers to pass a folio instead of a page? > > - I split the series into 4 commits for clarity. I could squash them > into one patch if needed. > > [1] https://lore.kernel.org/all/ZBnGc4WbBOlnRUgd@casper.infradead.org/ > > Pankaj Raghav (4): > fs/buffer: add set_bh_folio helper > buffer: add alloc_folio_buffers() helper > fs/buffer: add folio_create_empty_buffers helper > fs/buffer: convert create_page_buffers to create_folio_buffers > > fs/buffer.c | 131 +++++++++++++++++++++++++++++++++--- > include/linux/buffer_head.h | 4 ++ > 2 files changed, 125 insertions(+), 10 deletions(-) > Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches. I've had to use two additional patches to get my modified 'brd' driver off the ground with logical blocksize of 16k: - mm/filemap: allocate folios according to the blocksize (will be sending the patch separately) - Modify read_folio() to use the correct order: @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) limit = inode->i_sb->s_maxbytes; - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); - head = create_folio_buffers(folio, inode, 0); blocksize = head->b_size; bbits = block_size_bits(blocksize); - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); + if (WARN_ON(PAGE_SHIFT < bbits)) { + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); + } else { + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); + } lblock = (limit+blocksize-1) >> bbits; bh = head; nr = 0; With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd and have it happily loaded. Haven't tested the write path yet, though; there's surely quite some work to be done. BTW; I've got another patch replacing 'writepage' with 'write_folio' (and the corresponding argument update). Is that a direction you want to go? Cheers, Hannes
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > BTW; I've got another patch replacing 'writepage' with 'write_folio' > (and the corresponding argument update). Is that a direction you want to go? No; ->writepage is being deleted. It's already gone from ext4 and xfs.
On 4/14/23 15:51, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: >> BTW; I've got another patch replacing 'writepage' with 'write_folio' >> (and the corresponding argument update). Is that a direction you want to go? > > No; ->writepage is being deleted. It's already gone from ext4 and xfs. Aw. And here's me having converted block/fops over to using iomap w/ iomap_writepage(). Tough. Oh well. Wasn't a great fit anyway as for a sb_bread() replacement we would need a sub-page access for iomap. Question is whether we really need that or shouldn't read PAGE_SIZE sectors always. Surely would make life easier. And would save us all the logic with bh_lru etc as we can rely on the page cache. But probably an item for the iomap discussion at LSF. Unless you got plans already ... Cheers, Hannes
>> Pankaj Raghav (4): >> fs/buffer: add set_bh_folio helper >> buffer: add alloc_folio_buffers() helper >> fs/buffer: add folio_create_empty_buffers helper >> fs/buffer: convert create_page_buffers to create_folio_buffers >> >> fs/buffer.c | 131 +++++++++++++++++++++++++++++++++--- >> include/linux/buffer_head.h | 4 ++ >> 2 files changed, 125 insertions(+), 10 deletions(-) >> > Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches. > I've had to use two additional patches to get my modified 'brd' driver off the ground with logical > blocksize of 16k: Good to know that we are working on a similar direction here. > - mm/filemap: allocate folios according to the blocksize > (will be sending the patch separately) > - Modify read_folio() to use the correct order: > > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; > > > With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd > and have it happily loaded. I will give your patches a try as well.
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > On 4/14/23 13:08, Pankaj Raghav wrote: > > One of the first kernel panic we hit when we try to increase the > > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c > > function do not support large folios (folios > PAGE_SIZE) at the moment, > > these changes are required when we want to remove that constraint. > Funnily enough, I've been tinkering along the same lines, and ended up with > pretty similar patches. > I've had to use two additional patches to get my modified 'brd' driver off > the ground with logical blocksize of 16k: > - mm/filemap: allocate folios according to the blocksize > (will be sending the patch separately) > - Modify read_folio() to use the correct order: > > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, > get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; At a quick glance I think both approaches (unless Hannes already did it) seem to just miss that pesky static *arr[MAX_BUF_PER_PAGE], and so I think we need to: a) dynamically allocate those now b) do a cursory review of the users of that and prepare them to grok buffer heads which are blocksize based rather than PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for bs > PAGE_SIZE right now. Luis
On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote: > a) dynamically allocate those now > b) do a cursory review of the users of that and prepare them > to grok buffer heads which are blocksize based rather than > PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. > > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for > bs > PAGE_SIZE right now. Worse, we'll overflow the array and corrupt the stack. This one is a simple fix ... +++ b/fs/buffer.c @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr, i; int fully_mapped = 1; @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) } /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + bh = head; + do { lock_buffer(bh); mark_buffer_async_read(bh); - } + bh = bh->b_this_page; + } while (bh != head); /* * Stage 3: start the IO. Check for uptodateness * inside the buffer lock in case another process reading * the underlying blockdev brought it uptodate (the sct fix). */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + bh = head; + do { if (buffer_uptodate(bh)) end_buffer_async_read(bh, 1); else submit_bh(REQ_OP_READ, bh); - } + bh = bh->b_this_page; + } while (bh != head); + return 0; } EXPORT_SYMBOL(block_read_full_folio);
On Sat, Apr 15, 2023 at 03:31:54AM +0100, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote: > > a) dynamically allocate those now > > b) do a cursory review of the users of that and prepare them > > to grok buffer heads which are blocksize based rather than > > PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. > > > > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for > > bs > PAGE_SIZE right now. > > Worse, we'll overflow the array and corrupt the stack. > > This one is a simple fix ... > > +++ b/fs/buffer.c > @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > { > struct inode *inode = folio->mapping->host; > sector_t iblock, lblock; > - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; > + struct buffer_head *bh, *head; > unsigned int blocksize, bbits; > int nr, i; > int fully_mapped = 1; > @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (buffer_uptodate(bh)) > continue; > } > - arr[nr++] = bh; > } while (i++, iblock++, (bh = bh->b_this_page) != head); > > if (fully_mapped) > @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > } > > /* Stage two: lock the buffers */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + bh = head; > + do { > lock_buffer(bh); > mark_buffer_async_read(bh); > - } > + bh = bh->b_this_page; > + } while (bh != head); > > /* > * Stage 3: start the IO. Check for uptodateness > * inside the buffer lock in case another process reading > * the underlying blockdev brought it uptodate (the sct fix). > */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + bh = head; > + do { > if (buffer_uptodate(bh)) > end_buffer_async_read(bh, 1); > else > submit_bh(REQ_OP_READ, bh); > - } > + bh = bh->b_this_page; > + } while (bh != head); > + > return 0; I thought of that but I saw that the loop that assigns the arr only pegs a bh if we don't "continue" for certain conditions, which made me believe that we only wanted to keep on the array as non-null items which meet the initial loop's criteria. If that is not accurate then yes, the simplication is nice! Luis
On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote: > I thought of that but I saw that the loop that assigns the arr only > pegs a bh if we don't "continue" for certain conditions, which made me > believe that we only wanted to keep on the array as non-null items which > meet the initial loop's criteria. If that is not accurate then yes, > the simplication is nice! Uh, right. A little bit more carefully this time ... how does this look? diff --git a/fs/buffer.c b/fs/buffer.c index 5e67e21b350a..dff671079b02 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr, i; int fully_mapped = 1; @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; + nr++; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) return 0; } - /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + /* + * Stage two: lock the buffers. Recheck the uptodate flag under + * the lock in case somebody else brought it uptodate first. + */ + bh = head; + do { + if (buffer_uptodate(bh)) + continue; lock_buffer(bh); + if (buffer_uptodate(bh)) { + unlock_buffer(bh); + continue; + } mark_buffer_async_read(bh); - } + } while ((bh = bh->b_this_page) != head); - /* - * Stage 3: start the IO. Check for uptodateness - * inside the buffer lock in case another process reading - * the underlying blockdev brought it uptodate (the sct fix). - */ - for (i = 0; i < nr; i++) { - bh = arr[i]; - if (buffer_uptodate(bh)) - end_buffer_async_read(bh, 1); - else + /* Stage 3: start the IO */ + bh = head; + do { + if (buffer_async_read(bh)) submit_bh(REQ_OP_READ, bh); - } + } while ((bh = bh->b_this_page) != head); + return 0; } EXPORT_SYMBOL(block_read_full_folio); I do wonder how much it's worth doing this vs switching to non-BH methods. I appreciate that's a lot of work still.
On 4/15/23 05:44, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote: >> I thought of that but I saw that the loop that assigns the arr only >> pegs a bh if we don't "continue" for certain conditions, which made me >> believe that we only wanted to keep on the array as non-null items which >> meet the initial loop's criteria. If that is not accurate then yes, >> the simplication is nice! > > Uh, right. A little bit more carefully this time ... how does this > look? > > diff --git a/fs/buffer.c b/fs/buffer.c > index 5e67e21b350a..dff671079b02 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > { > struct inode *inode = folio->mapping->host; > sector_t iblock, lblock; > - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; > + struct buffer_head *bh, *head; > unsigned int blocksize, bbits; > int nr, i; > int fully_mapped = 1; > @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (buffer_uptodate(bh)) > continue; > } > - arr[nr++] = bh; > + nr++; > } while (i++, iblock++, (bh = bh->b_this_page) != head); > > if (fully_mapped) > @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > return 0; > } > > - /* Stage two: lock the buffers */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + /* > + * Stage two: lock the buffers. Recheck the uptodate flag under > + * the lock in case somebody else brought it uptodate first. > + */ > + bh = head; > + do { > + if (buffer_uptodate(bh)) > + continue; > lock_buffer(bh); > + if (buffer_uptodate(bh)) { > + unlock_buffer(bh); > + continue; > + } > mark_buffer_async_read(bh); > - } > + } while ((bh = bh->b_this_page) != head); > > - /* > - * Stage 3: start the IO. Check for uptodateness > - * inside the buffer lock in case another process reading > - * the underlying blockdev brought it uptodate (the sct fix). > - */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > - if (buffer_uptodate(bh)) > - end_buffer_async_read(bh, 1); > - else > + /* Stage 3: start the IO */ > + bh = head; > + do { > + if (buffer_async_read(bh)) > submit_bh(REQ_OP_READ, bh); > - } > + } while ((bh = bh->b_this_page) != head); > + > return 0; > } > EXPORT_SYMBOL(block_read_full_folio); > > > I do wonder how much it's worth doing this vs switching to non-BH methods. > I appreciate that's a lot of work still. That's what I've been wondering, too. I would _vastly_ prefer to switch over to iomap; however, the blasted sb_bread() is getting in the way. Currently iomap only runs on entire pages / folios, but a lot of (older) filesystems insist on doing 512 byte I/O. While this seem logical (seeing that 512 bytes is the default, and, in most cases, the only supported sector size) question is whether _we_ from the linux side need to do that. We _could_ upgrade to always do full page I/O; there's a good chance we'll be using the entire page anyway eventually. And with storage bandwidth getting larger and larger we might even get a performance boost there. And it would save us having to implement sub-page I/O for iomap. Hmm? Cheers, Hannes
On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > On 4/15/23 05:44, Matthew Wilcox wrote: > > I do wonder how much it's worth doing this vs switching to non-BH methods. > > I appreciate that's a lot of work still. > > That's what I've been wondering, too. > > I would _vastly_ prefer to switch over to iomap; however, the blasted > sb_bread() is getting in the way. Currently iomap only runs on entire > pages / folios, but a lot of (older) filesystems insist on doing 512 Hang on, no, iomap can issue sub-page reads. eg iomap_read_folio_sync() will read the parts of the folio which have not yet been read when called from __iomap_write_begin(). > byte I/O. While this seem logical (seeing that 512 bytes is the > default, and, in most cases, the only supported sector size) question > is whether _we_ from the linux side need to do that. > We _could_ upgrade to always do full page I/O; there's a good > chance we'll be using the entire page anyway eventually. > And with storage bandwidth getting larger and larger we might even > get a performance boost there. I think we need to look at this from the filesystem side. What do filesystems actually want to do? The first thing is they want to read the superblock. That's either going to be immediately freed ("Oh, this isn't a JFS filesystem after all") or it's going to hang around indefinitely. There's no particular need to keep it in any kind of cache (buffer or page). Except ... we want to probe a dozen different filesystems, and half of them keep their superblock at the same offset from the start of the block device. So we do want to keep it cached. That's arguing for using the page cache, at least to read it. Now, do we want userspace to be able to dd a new superblock into place and have the mounted filesystem see it? I suspect that confuses just about every filesystem out there. So I think the right answer is to read the page into the bdev's page cache and then copy it into a kmalloc'ed buffer which the filesystem is then responsible for freeing. It's also responsible for writing it back (so that's another API we need), and for a journalled filesystem, it needs to fit into the journalling scheme. Also, we may need to write back multiple copies of the superblock, possibly with slight modifications. There are a lot of considerations here, and I don't feel like I have enough of an appreciation of filesystem needs to come up with a decent API. I'd hope we can get a good discussion going at LSFMM.
On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > On 4/15/23 05:44, Matthew Wilcox wrote: > > We _could_ upgrade to always do full page I/O; there's a good > > chance we'll be using the entire page anyway eventually. *Iff* doing away with buffer head 512 granularity could help block sizes greater than page size where physical and logical block size > PAGE_SIZE we whould also be able to see it on 4kn drives (logical and physical block size == 4k). A projection could be made after. In so far as experimenting with this, if you already have some effort on IOMAP for bdev aops one possibility for pure experimentation for now would be to peg a new set of aops could be set in the path of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early for us to know if the device is has (lbs = pbs) > 512. For NVMe for instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We put together and set the logical and phyiscal block size on NVMe on nvme_update_ns_info() --> nvme_update_disk_info(), right before we call device_add_disk(). The only way to override the aops then would be right before device_add_disk(), or as part of a new device_add_disk_aops() or whatever. > > And with storage bandwidth getting larger and larger we might even > > get a performance boost there. > > I think we need to look at this from the filesystem side. Before that let's recap the bdev cache current issues. Today by just adding the disk we move on to partition scanning immediately unless your block driver has a flag that says otherwise. The current crash we're evaluating with brd and that we also hit with NVMe is due to this part. device_add_disk() --> disk_scan_partitions() --> blkdev_get_whole() --> bdev_disk_changed() --> filemap_read_folio() --> filler() The filler is from aops. We don't even have a filesystem yet on these devices at this point. The entire partition core does this partition scanning. Refer to: disk_scan_partitions() --> block/partitions/core.c : bdev_disk_changed() And all of that stuff is also under a 512-byte atomic operation assumption, we could do better if wanted to. > What do filesystems actually want to do? So you are suggesting that the early reads of the block device by the block cache and its use of the page cache cache should be aligned / perhaps redesigned to assist more clearly with what modern filesystems might actually would want today? > The first thing is they want to read > the superblock. That's either going to be immediately freed ("Oh, > this isn't a JFS filesystem after all") or it's going to hang around > indefinitely. There's no particular need to keep it in any kind of > cache (buffer or page). And the bdev cache would not be able to know before hand that's the case. > Except ... we want to probe a dozen different > filesystems, and half of them keep their superblock at the same offset > from the start of the block device. So we do want to keep it cached. > That's arguing for using the page cache, at least to read it. Do we currently share anything from the bdev cache with the fs for this? Let's say that first block device blocksize in memory. > Now, do we want userspace to be able to dd a new superblock into place > and have the mounted filesystem see it? Not sure I follow this. dd a new super block? > I suspect that confuses just > about every filesystem out there. So I think the right answer is to read > the page into the bdev's page cache and then copy it into a kmalloc'ed > buffer which the filesystem is then responsible for freeing. It's also > responsible for writing it back (so that's another API we need), and for > a journalled filesystem, it needs to fit into the journalling scheme. > Also, we may need to write back multiple copies of the superblock, > possibly with slight modifications. Are you considering these as extentions to the bdev cache? Luis
On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote: > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > > On 4/15/23 05:44, Matthew Wilcox wrote: > > > We _could_ upgrade to always do full page I/O; there's a good > > > chance we'll be using the entire page anyway eventually. > > *Iff* doing away with buffer head 512 granularity could help block sizes > greater than page size where physical and logical block size > PAGE_SIZE > we whould also be able to see it on 4kn drives (logical and physical > block size == 4k). A projection could be made after. > > In so far as experimenting with this, if you already have some > effort on IOMAP for bdev aops one possibility for pure experimentation > for now would be to peg a new set of aops could be set in the path > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early > for us to know if the device is has (lbs = pbs) > 512. For NVMe for > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We > put together and set the logical and phyiscal block size on NVMe on > nvme_update_ns_info() --> nvme_update_disk_info(), right before we > call device_add_disk(). The only way to override the aops then would > be right before device_add_disk(), or as part of a new device_add_disk_aops() > or whatever. I think you're making this harder than you need to. For LBA size > PAGE_SIZE, there is always only going to be one BH per folio-that-is-LBA-size, so all the problems with more-than-8-BHs-per-4k-page don't actually exist. I don't think we should be overriding the aops, and if we narrow the scope of large folio support in blockdev t only supporting folio_size == LBA size, it becomes much more feasible. > > > And with storage bandwidth getting larger and larger we might even > > > get a performance boost there. > > > > I think we need to look at this from the filesystem side. > > Before that let's recap the bdev cache current issues. Ooh, yes, this is good! I was totally neglecting the partition scanning code. > Today by just adding the disk we move on to partition scanning > immediately unless your block driver has a flag that says otherwise. The > current crash we're evaluating with brd and that we also hit with NVMe > is due to this part. > > device_add_disk() --> > disk_scan_partitions() --> > blkdev_get_whole() --> > bdev_disk_changed() --> > filemap_read_folio() --> filler() > > The filler is from aops. Right, so you missed a step in that callchain, which is read_mapping_folio(). That ends up in do_read_cache_folio(), which contains the deadly: folio = filemap_alloc_folio(gfp, 0); so that needs to be changed to get the minimum folio order from the mapping, and then it should work. > > What do filesystems actually want to do? > > So you are suggesting that the early reads of the block device by the > block cache and its use of the page cache cache should be aligned / > perhaps redesigned to assist more clearly with what modern filesystems > might actually would want today? Perhaps? I'm just saying the needs of the block device are not the only consideration here. I'd like an API that makes sense for the fs author. > > The first thing is they want to read > > the superblock. That's either going to be immediately freed ("Oh, > > this isn't a JFS filesystem after all") or it's going to hang around > > indefinitely. There's no particular need to keep it in any kind of > > cache (buffer or page). > > And the bdev cache would not be able to know before hand that's the case. Right, nobody knows until it's been read and examined. > > Except ... we want to probe a dozen different > > filesystems, and half of them keep their superblock at the same offset > > from the start of the block device. So we do want to keep it cached. > > That's arguing for using the page cache, at least to read it. > > Do we currently share anything from the bdev cache with the fs for this? > Let's say that first block device blocksize in memory. sb_bread() is used by most filesystems, and the buffer cache aliases into the page cache. > > Now, do we want userspace to be able to dd a new superblock into place > > and have the mounted filesystem see it? > > Not sure I follow this. dd a new super block? In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', I can overwrite the superblock. Do we want filesystems to see that kind of vandalism, or do we want the mounted filesystem to have its own copy of the data and overwrite what userspace wrote the next time it updates the superblock? (the trick is that this may not be vandalism, it might be the sysadmin updating the uuid or running some fsck-ish program or trying to update the superblock to support fabulous-new-feature on next mount. does this change the answer?) > > I suspect that confuses just > > about every filesystem out there. So I think the right answer is to read > > the page into the bdev's page cache and then copy it into a kmalloc'ed > > buffer which the filesystem is then responsible for freeing. It's also > > responsible for writing it back (so that's another API we need), and for > > a journalled filesystem, it needs to fit into the journalling scheme. > > Also, we may need to write back multiple copies of the superblock, > > possibly with slight modifications. > > Are you considering these as extentions to the bdev cache? I'm trying to suggest some of the considerations that need to go into a replacement for sb_bread().
On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote: > > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > > > On 4/15/23 05:44, Matthew Wilcox wrote: > > > > We _could_ upgrade to always do full page I/O; there's a good > > > > chance we'll be using the entire page anyway eventually. > > > > *Iff* doing away with buffer head 512 granularity could help block sizes > > greater than page size where physical and logical block size > PAGE_SIZE > > we whould also be able to see it on 4kn drives (logical and physical > > block size == 4k). A projection could be made after. > > > > In so far as experimenting with this, if you already have some > > effort on IOMAP for bdev aops one possibility for pure experimentation > > for now would be to peg a new set of aops could be set in the path > > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early > > for us to know if the device is has (lbs = pbs) > 512. For NVMe for > > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We > > put together and set the logical and phyiscal block size on NVMe on > > nvme_update_ns_info() --> nvme_update_disk_info(), right before we > > call device_add_disk(). The only way to override the aops then would > > be right before device_add_disk(), or as part of a new device_add_disk_aops() > > or whatever. > > I think you're making this harder than you need to. > For LBA size > PAGE_SIZE, there is always only going to be > one BH per folio-that-is-LBA-size, so all the problems with > more-than-8-BHs-per-4k-page don't actually exist. Oh! Then yes, sorry! > I don't think we > should be overriding the aops, and if we narrow the scope of large folio > support in blockdev t only supporting folio_size == LBA size, it becomes > much more feasible. I'm trying to think of the possible use cases where folio_size != LBA size and I cannot immediately think of some. Yes there are cases where a filesystem may use a different block for say meta data than data, but that I believe is side issue, ie, read/writes for small metadata would have to be accepted. At least for NVMe we have metadata size as part of the LBA format, but from what I understand no Linux filesystem yet uses that. > > > > And with storage bandwidth getting larger and larger we might even > > > > get a performance boost there. > > > > > > I think we need to look at this from the filesystem side. > > > > Before that let's recap the bdev cache current issues. > > Ooh, yes, this is good! I was totally neglecting the partition > scanning code. > > > Today by just adding the disk we move on to partition scanning > > immediately unless your block driver has a flag that says otherwise. The > > current crash we're evaluating with brd and that we also hit with NVMe > > is due to this part. > > > > device_add_disk() --> > > disk_scan_partitions() --> > > blkdev_get_whole() --> > > bdev_disk_changed() --> > > filemap_read_folio() --> filler() > > > > The filler is from aops. > > Right, so you missed a step in that callchain, which is > read_mapping_folio(). That ends up in do_read_cache_folio(), which > contains the deadly: > > folio = filemap_alloc_folio(gfp, 0); Right and before this we have: if (!filler) filler = mapping->a_ops->read_folio; The folio is order 0 and after filemap_alloc_folio() its added to the page cache, then filemap_read_folio(file, filler, folio) uses the filler blkdev_read_folio() --> fs/buffer.c block_read_full_folio() Just for posterity: fs/buffer.c int block_read_full_folio(struct folio *folio, get_block_t *get_block) { ... struct buffer_head *bh, *head,...; ... head = create_page_buffers(&folio->page, inode, 0); ... } static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state) { BUG_ON(!PageLocked(page)); if (!page_has_buffers(page)) create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits), b_state); return page_buffers(page); } void create_empty_buffers(struct page *page, unsigned long blocksize, unsigned long b_state) { struct buffer_head *bh, *head, *tail; head = alloc_page_buffers(page, blocksize, true); bh = head; do { bh->b_state |= b_state; -----> CRASH HERE head is NULL tail = bh; bh = bh->b_this_page; } while (bh); tail->b_this_page = head; } Why is it NULL? The blocksize passed to alloc_page_buffers() is larger than PAGE_SIZE and below offset is PAGE_SIZE. PAGE_SIZE - blocksize gives a negative number and so the loop is not traversed. struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, bool retry) { struct buffer_head *bh, *head; gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT; long offset; struct mem_cgroup *memcg, *old_memcg; if (retry) gfp |= __GFP_NOFAIL; /* The page lock pins the memcg */ memcg = page_memcg(page); old_memcg = set_active_memcg(memcg); head = NULL; offset = PAGE_SIZE; while ((offset -= size) >= 0) { bh = alloc_buffer_head(gfp); if (!bh) goto no_grow; bh->b_this_page = head; bh->b_blocknr = -1; head = bh; bh->b_size = size; /* Link the buffer to its page */ set_bh_page(bh, page, offset); } out: set_active_memcg(old_memcg); return head; ... } I see now what you say about the buffer head being of the block size bh->b_size = size above. > so that needs to be changed to get the minimum folio order from the > mapping, and then it should work. > > > > What do filesystems actually want to do? > > > > So you are suggesting that the early reads of the block device by the > > block cache and its use of the page cache cache should be aligned / > > perhaps redesigned to assist more clearly with what modern filesystems > > might actually would want today? > > Perhaps? I'm just saying the needs of the block device are not the > only consideration here. I'd like an API that makes sense for the fs > author. Makes sense! > > > The first thing is they want to read > > > the superblock. That's either going to be immediately freed ("Oh, > > > this isn't a JFS filesystem after all") or it's going to hang around > > > indefinitely. There's no particular need to keep it in any kind of > > > cache (buffer or page). > > > > And the bdev cache would not be able to know before hand that's the case. > > Right, nobody knows until it's been read and examined. > > > > Except ... we want to probe a dozen different > > > filesystems, and half of them keep their superblock at the same offset > > > from the start of the block device. So we do want to keep it cached. > > > That's arguing for using the page cache, at least to read it. > > > > Do we currently share anything from the bdev cache with the fs for this? > > Let's say that first block device blocksize in memory. > > sb_bread() is used by most filesystems, and the buffer cache aliases > into the page cache. I see thanks. I checked what xfs does and its xfs_readsb() uses its own xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why they did that. > > > Now, do we want userspace to be able to dd a new superblock into place > > > and have the mounted filesystem see it? > > > > Not sure I follow this. dd a new super block? > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > I can overwrite the superblock. Do we want filesystems to see that > kind of vandalism, or do we want the mounted filesystem to have its > own copy of the data and overwrite what userspace wrote the next time it > updates the superblock? Oh, what happens today? > (the trick is that this may not be vandalism, it might be the sysadmin > updating the uuid or running some fsck-ish program or trying to update > the superblock to support fabulous-new-feature on next mount. does this > change the answer?) > > > > I suspect that confuses just > > > about every filesystem out there. So I think the right answer is to read > > > the page into the bdev's page cache and then copy it into a kmalloc'ed > > > buffer which the filesystem is then responsible for freeing. It's also > > > responsible for writing it back (so that's another API we need), and for > > > a journalled filesystem, it needs to fit into the journalling scheme. > > > Also, we may need to write back multiple copies of the superblock, > > > possibly with slight modifications. > > > > Are you considering these as extentions to the bdev cache? > > I'm trying to suggest some of the considerations that need to go into > a replacement for sb_bread(). I see! That would also help EOL buffer heads for that purpose. Luis
On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote: > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > > I don't think we > > should be overriding the aops, and if we narrow the scope of large folio > > support in blockdev t only supporting folio_size == LBA size, it becomes > > much more feasible. > > I'm trying to think of the possible use cases where folio_size != LBA size > and I cannot immediately think of some. Yes there are cases where a > filesystem may use a different block for say meta data than data, but that > I believe is side issue, ie, read/writes for small metadata would have > to be accepted. At least for NVMe we have metadata size as part of the > LBA format, but from what I understand no Linux filesystem yet uses that. NVMe metadata is per-block metadata -- a CRC or similar. Filesystem metadata is things like directories, inode tables, free space bitmaps, etc. > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, > bool retry) > { [...] > head = NULL; > offset = PAGE_SIZE; > while ((offset -= size) >= 0) { > > I see now what you say about the buffer head being of the block size > bh->b_size = size above. Yes, just changing that to 'offset = page_size(page);' will do the trick. > > sb_bread() is used by most filesystems, and the buffer cache aliases > > into the page cache. > > I see thanks. I checked what xfs does and its xfs_readsb() uses its own > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why > they did that. IRIX didn't have an sb_bread() ;-) > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > > I can overwrite the superblock. Do we want filesystems to see that > > kind of vandalism, or do we want the mounted filesystem to have its > > own copy of the data and overwrite what userspace wrote the next time it > > updates the superblock? > > Oh, what happens today? Depends on the filesystem, I think? Not really sure, to be honest.
On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote: > > > > Except ... we want to probe a dozen different > > > > filesystems, and half of them keep their superblock at the same offset > > > > from the start of the block device. So we do want to keep it cached. > > > > That's arguing for using the page cache, at least to read it. > > > > > > Do we currently share anything from the bdev cache with the fs for this? > > > Let's say that first block device blocksize in memory. > > > > sb_bread() is used by most filesystems, and the buffer cache aliases > > into the page cache. > > I see thanks. I checked what xfs does and its xfs_readsb() uses its own > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why > they did that. XFS has it's own metadata address space for caching - it does not use the block device page cache at all. This is not new, it never has. The xfs_buf buffer cache does not use the page cache, either. It does it's own thing, has it's own indexing, locking, shrinkers, etc. IOWs, it does not use the iomap infrastructure at all - iomap is used by XFS exclusively for data IO. As for why we use an uncached buffer for the superblock? That's largely historic because prior to 2007 every modification that did allocation/free needed to lock and modify the superblock at transaction commit. Hence it's always needed in memory but a critical fast path, so it is always directly available without needing to do a cache lookup to callers that need it. In 2007, lazy superblock counters got rid of the requirement to lock the superblock buffer in every transaction commit, so the uncached buffer optimisation hasn't really been needed for the past decade. But if it ain't broke, don't try to fix it.... > > > > Now, do we want userspace to be able to dd a new superblock into place > > > > and have the mounted filesystem see it? > > > > > > Not sure I follow this. dd a new super block? > > > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > > I can overwrite the superblock. Do we want filesystems to see that > > kind of vandalism, or do we want the mounted filesystem to have its > > own copy of the data and overwrite what userspace wrote the next time it > > updates the superblock? > > Oh, what happens today? In XFS, it will completely ignore the fact the the superblock got trashed like this. When the fs goes idle, or the sb modified for some other reason, it will relog the in-memory superblock and write it back to disk, thereby fixing the corruption. i.e. while the filesystem is mounted, the superblock is _write-only_... > > (the trick is that this may not be vandalism, it might be the sysadmin > > updating the uuid or running some fsck-ish program or trying to update > > the superblock to support fabulous-new-feature on next mount. does this > > change the answer?) If you need to change anything in the superblock while the XFS fs is mounted, then you have to use ioctls to modify the superblock contents through the running transaction subsystem. Editting the block device directly breaks the security model of filesystems that assume they have exclusive access to the block device whilst the filesystem is mounted.... -Dave.
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, > get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; BTW I See this pattern in: fs/mpage.c: do_mpage_readpage() fs/mpage.c: __mpage_writepage() A helper might be in order. Luis
On 4/17/23 04:27, Luis Chamberlain wrote: > On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: >> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, >> get_block_t *get_block) >> if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) >> limit = inode->i_sb->s_maxbytes; >> >> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); >> - >> head = create_folio_buffers(folio, inode, 0); >> blocksize = head->b_size; >> bbits = block_size_bits(blocksize); >> >> - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); >> + if (WARN_ON(PAGE_SHIFT < bbits)) { >> + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); >> + } else { >> + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); >> + } >> lblock = (limit+blocksize-1) >> bbits; >> bh = head; >> nr = 0; > > BTW I See this pattern in: > > fs/mpage.c: do_mpage_readpage() > fs/mpage.c: __mpage_writepage() > > A helper might be in order. > But only _iff_ we decide to stick with buffer_heads and upgrade the buffer_head code to handle multi-page block sizes. The idea was to move over to iomap, hence I'm not sure into which lengths we should go keeping buffer_heads alive ... Cheers, Hannes
On Sun, Apr 16, 2023 at 03:07:33PM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote: > > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > > > I don't think we > > > should be overriding the aops, and if we narrow the scope of large folio > > > support in blockdev t only supporting folio_size == LBA size, it becomes > > > much more feasible. > > > > I'm trying to think of the possible use cases where folio_size != LBA size > > and I cannot immediately think of some. Yes there are cases where a > > filesystem may use a different block for say meta data than data, but that > > I believe is side issue, ie, read/writes for small metadata would have > > to be accepted. At least for NVMe we have metadata size as part of the > > LBA format, but from what I understand no Linux filesystem yet uses that. > > NVMe metadata is per-block metadata -- a CRC or similar. Filesystem > metadata is things like directories, inode tables, free space bitmaps, > etc. > > > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, > > bool retry) > > { > [...] > > head = NULL; > > offset = PAGE_SIZE; > > while ((offset -= size) >= 0) { > > > > I see now what you say about the buffer head being of the block size > > bh->b_size = size above. > > Yes, just changing that to 'offset = page_size(page);' will do the trick. > > > > sb_bread() is used by most filesystems, and the buffer cache aliases > > > into the page cache. > > > > I see thanks. I checked what xfs does and its xfs_readsb() uses its own > > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and > > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why > > they did that. > > IRIX didn't have an sb_bread() ;-) > > > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > > > I can overwrite the superblock. Do we want filesystems to see that > > > kind of vandalism, or do we want the mounted filesystem to have its > > > own copy of the data and overwrite what userspace wrote the next time it > > > updates the superblock? > > > > Oh, what happens today? > > Depends on the filesystem, I think? Not really sure, to be honest. The filesystem driver sees the vandalism, and can very well crash as a result[1]. In that case it was corrupted journal contents being replayed, but the same thing would happen if you wrote a malicious userspace program to set the metadata_csum feature flag in the ondisk superblock after mounting the fs. https://bugzilla.kernel.org/show_bug.cgi?id=82201#c4 I've tried to prevent people from writing to mounted block devices in the past, but did not succeed. If you try to prevent programs from opening such devices with O_RDWR/O_WRONLY you then break lvm tools which require that ability even though they don't actually write anything to the block device. If you make the block device write_iter function fail, then old e2fsprogs breaks and you get shouted at for breaking userspace. Hence I decided to let security researchers find these bugs and control the design discussion via CVE. That's not correct and it's not smart, but it preserves some of my sanity. --D