From patchwork Fri May 26 20:14:56 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gao Xiang
X-Patchwork-Id: 99654
From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: linux-erofs@lists.ozlabs.org
Cc: LKML <linux-kernel@vger.kernel.org>,
    Gao Xiang <hsiangkao@linux.alibaba.com>
Subject: [PATCH 3/6] erofs: kill hooked chains to avoid loops on deduplicated compressed images
Date: Sat, 27 May 2023 04:14:56 +0800
Message-Id: <20230526201459.128169-4-hsiangkao@linux.alibaba.com>
X-Mailer: git-send-email 2.24.4
In-Reply-To: <20230526201459.128169-1-hsiangkao@linux.alibaba.com>
References: <20230526201459.128169-1-hsiangkao@linux.alibaba.com>

After heavily stressing EROFS for more than 46 days with several images,
including a hand-crafted image of repeated patterns, I found that two
chains could link to each other almost simultaneously and form a loop, so
that the entire loop is never submitted.  As a consequence, the
corresponding file pages remain locked forever.

This can _only_ be observed on data-deduplicated compressed images.

For example, consider two chains with five pclusters in total:
	Chain 1:  2->3->4->5    -- the tail pcluster is 5;
	Chain 2:  5->1->2       -- the tail pcluster is 2.

Chain 2 could link to Chain 1 with pcluster 5; at the same time, Chain 1
could link to Chain 2 with pcluster 2.

Since hooked chains are all linked locklessly now, I see no simple way to
avoid this race.  Instead, let's avoid hooked chains completely until I
can work out a proper way to fix this and end users tell us that they
really need it back.
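To make the race concrete, below is a minimal userspace sketch of the
removed "type 2" hooking, using C11 atomics in place of the kernel's
cmpxchg().  It is illustrative only: 'struct pcl', 'TAIL' and
'hook_chain()' are made-up names, and the two racing threads are shown
sequentially since the interleaving no longer matters once both CAS
operations can succeed:

	#include <stdatomic.h>
	#include <stdio.h>

	#define TAIL	((void *)0x5F0ECAFE)	/* "still open" end-of-chain sentinel */

	struct pcl {
		_Atomic(void *) next;		/* lockless chain link */
	};

	/* "type 2" hooking: attach our chain to the tail of an existing open one */
	static int hook_chain(struct pcl *tail, void *owned_head)
	{
		void *expected = TAIL;

		/* succeeds iff the other chain is still open (tail->next == TAIL) */
		return atomic_compare_exchange_strong(&tail->next, &expected,
						      owned_head);
	}

	int main(void)
	{
		struct pcl p5 = { .next = TAIL }; /* pcluster 5, tail of chain 1 (2->3->4->5) */
		struct pcl p2 = { .next = TAIL }; /* pcluster 2, tail of chain 2 (5->1->2)    */

		/*
		 * Two racing threads: one hooks chain 2 onto pcluster 5 while
		 * the other hooks chain 1 onto pcluster 2.  Both CASes can
		 * succeed because each one targets a different 'next' field...
		 */
		int a = hook_chain(&p5, &p2);	/* chain 1's tail now points into chain 2 */
		int b = hook_chain(&p2, &p5);	/* chain 2's tail now points into chain 1 */

		/* ...leaving 2->..->5->..->2: a loop that is never submitted. */
		printf("both hooks succeeded: %s\n", (a && b) ? "loop formed" : "no loop");
		return 0;
	}

Both compare-and-swap calls can succeed because each targets a different
'next' field; afterwards the two chains point into each other, so no
traversal starting from either owned_head ever reaches the end-of-chain
sentinel and the whole loop is never submitted.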
Admittedly, the benefit of this optimization shows up under multi-threaded
workloads (and even more often on deduplicated compressed images), yet I'm
not sure whether the overall system impact of dropping it outweighs the
implementation complexity of keeping it.

Fixes: 267f2492c8f7 ("erofs: introduce multi-reference pclusters (fully-referenced)")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 fs/erofs/zdata.c | 72 ++++++++----------------------------------------
 1 file changed, 11 insertions(+), 61 deletions(-)

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index a67f4ac19c48..76488824f146 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -93,11 +93,8 @@ struct z_erofs_pcluster {
 
 /* let's avoid the valid 32-bit kernel addresses */
 
-/* the chained workgroup has't submitted io (still open) */
+/* the end of a chain of pclusters */
 #define Z_EROFS_PCLUSTER_TAIL ((void *)0x5F0ECAFE)
-/* the chained workgroup has already submitted io */
-#define Z_EROFS_PCLUSTER_TAIL_CLOSED ((void *)0x5F0EDEAD)
-
 #define Z_EROFS_PCLUSTER_NIL (NULL)
 
 struct z_erofs_decompressqueue {
@@ -506,20 +503,6 @@ int __init z_erofs_init_zip_subsystem(void)
 
 enum z_erofs_pclustermode {
 	Z_EROFS_PCLUSTER_INFLIGHT,
-	/*
-	 * The current pclusters was the tail of an exist chain, in addition
-	 * that the previous processed chained pclusters are all decided to
-	 * be hooked up to it.
-	 * A new chain will be created for the remaining pclusters which are
-	 * not processed yet, so different from Z_EROFS_PCLUSTER_FOLLOWED,
-	 * the next pcluster cannot reuse the whole page safely for inplace I/O
-	 * in the following scenario:
-	 *  ________________________________________________________________
-	 * |      tail (partial) page     |       head (partial) page       |
-	 * |   (belongs to the next pcl)  |  (belongs to the current pcl)   |
-	 * |_______PCLUSTER_FOLLOWED______|________PCLUSTER_HOOKED__________|
-	 */
-	Z_EROFS_PCLUSTER_HOOKED,
 	/*
	 * a weak form of Z_EROFS_PCLUSTER_FOLLOWED, the difference is that it
	 * could be dispatched into bypass queue later due to uptodated managed
@@ -537,8 +520,8 @@ enum z_erofs_pclustermode {
 	 *  ________________________________________________________________
 	 * | tail (partial) page |          head (partial) page            |
 	 * | (of the current cl) |      (of the previous collection)       |
-	 * | PCLUSTER_FOLLOWED or|                                         |
-	 * |_____PCLUSTER_HOOKED__|___________PCLUSTER_FOLLOWED____________|
+	 * |                     |                                         |
+	 * |__PCLUSTER_FOLLOWED___|___________PCLUSTER_FOLLOWED____________|
 	 *
 	 * [  (*) the above page can be used as inplace I/O.               ]
 	 */
@@ -552,7 +535,7 @@ struct z_erofs_decompress_frontend {
 
 	struct page *pagepool;
 	struct page *candidate_bvpage;
-	struct z_erofs_pcluster *pcl, *tailpcl;
+	struct z_erofs_pcluster *pcl;
 	z_erofs_next_pcluster_t owned_head;
 	enum z_erofs_pclustermode mode;
@@ -757,19 +740,7 @@ static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
 		return;
 	}
 
-	/*
-	 * type 2, link to the end of an existing open chain, be careful
-	 * that its submission is controlled by the original attached chain.
-	 */
-	if (*owned_head != &pcl->next && pcl != f->tailpcl &&
-	    cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
-		    *owned_head) == Z_EROFS_PCLUSTER_TAIL) {
-		*owned_head = Z_EROFS_PCLUSTER_TAIL;
-		f->mode = Z_EROFS_PCLUSTER_HOOKED;
-		f->tailpcl = NULL;
-		return;
-	}
-	/* type 3, it belongs to a chain, but it isn't the end of the chain */
+	/* type 2, it belongs to an ongoing chain */
 	f->mode = Z_EROFS_PCLUSTER_INFLIGHT;
 }
 
@@ -830,9 +801,6 @@ static int z_erofs_register_pcluster(struct z_erofs_decompress_frontend *fe)
 			goto err_out;
 		}
 	}
-	/* used to check tail merging loop due to corrupted images */
-	if (fe->owned_head == Z_EROFS_PCLUSTER_TAIL)
-		fe->tailpcl = pcl;
 	fe->owned_head = &pcl->next;
 	fe->pcl = pcl;
 	return 0;
@@ -853,7 +821,6 @@ static int z_erofs_collector_begin(struct z_erofs_decompress_frontend *fe)
 
 	/* must be Z_EROFS_PCLUSTER_TAIL or pointed to previous pcluster */
 	DBG_BUGON(fe->owned_head == Z_EROFS_PCLUSTER_NIL);
-	DBG_BUGON(fe->owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
 
 	if (!(map->m_flags & EROFS_MAP_META)) {
 		grp = erofs_find_workgroup(fe->inode->i_sb,
@@ -872,10 +839,6 @@ static int z_erofs_collector_begin(struct z_erofs_decompress_frontend *fe)
 	if (ret == -EEXIST) {
 		mutex_lock(&fe->pcl->lock);
-		/* used to check tail merging loop due to corrupted images */
-		if (fe->owned_head == Z_EROFS_PCLUSTER_TAIL)
-			fe->tailpcl = fe->pcl;
-
 		z_erofs_try_to_claim_pcluster(fe);
 	} else if (ret) {
 		return ret;
 	}
@@ -1030,8 +993,7 @@ static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
 	 * those chains are handled asynchronously thus the page cannot be used
 	 * for inplace I/O or bvpage (should be processed in a strict order.)
 	 */
-	tight &= (fe->mode >= Z_EROFS_PCLUSTER_HOOKED &&
-		  fe->mode != Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE);
+	tight &= (fe->mode > Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE);
 
 	cur = end - min_t(unsigned int, offset + end - map->m_la, end);
 	if (!(map->m_flags & EROFS_MAP_MAPPED)) {
@@ -1400,10 +1362,7 @@ static void z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io,
 	};
 	z_erofs_next_pcluster_t owned = io->head;
 
-	while (owned != Z_EROFS_PCLUSTER_TAIL_CLOSED) {
-		/* impossible that 'owned' equals Z_EROFS_WORK_TPTR_TAIL */
-		DBG_BUGON(owned == Z_EROFS_PCLUSTER_TAIL);
-		/* impossible that 'owned' equals Z_EROFS_PCLUSTER_NIL */
+	while (owned != Z_EROFS_PCLUSTER_TAIL) {
 		DBG_BUGON(owned == Z_EROFS_PCLUSTER_NIL);
 
 		be.pcl = container_of(owned, struct z_erofs_pcluster, next);
@@ -1420,7 +1379,7 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
 		container_of(work, struct z_erofs_decompressqueue, u.work);
 	struct page *pagepool = NULL;
 
-	DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
+	DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL);
 	z_erofs_decompress_queue(bgq, &pagepool);
 	erofs_release_pages(&pagepool);
 	kvfree(bgq);
@@ -1608,7 +1567,7 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
 		q->sync = true;
 	}
 	q->sb = sb;
-	q->head = Z_EROFS_PCLUSTER_TAIL_CLOSED;
+	q->head = Z_EROFS_PCLUSTER_TAIL;
 	return q;
 }
 
@@ -1626,11 +1585,7 @@ static void move_to_bypass_jobqueue(struct z_erofs_pcluster *pcl,
 	z_erofs_next_pcluster_t *const submit_qtail = qtail[JQ_SUBMIT];
 	z_erofs_next_pcluster_t *const bypass_qtail = qtail[JQ_BYPASS];
 
-	DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
-	if (owned_head == Z_EROFS_PCLUSTER_TAIL)
-		owned_head = Z_EROFS_PCLUSTER_TAIL_CLOSED;
-
-	WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_TAIL_CLOSED);
+	WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_TAIL);
 	WRITE_ONCE(*submit_qtail, owned_head);
 	WRITE_ONCE(*bypass_qtail, &pcl->next);
 
@@ -1700,15 +1655,10 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 		unsigned int i = 0;
 		bool bypass = true;
 
-		/* no possible 'owned_head' equals the following */
-		DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
 		DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_NIL);
-
 		pcl = container_of(owned_head, struct z_erofs_pcluster, next);
+		owned_head = READ_ONCE(pcl->next);
 
-		/* close the main owned chain at first */
-		owned_head = cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
-				     Z_EROFS_PCLUSTER_TAIL_CLOSED);
 		if (z_erofs_is_inline_pcluster(pcl)) {
 			move_to_bypass_jobqueue(pcl, qtail, owned_head);
 			continue;