Message ID | 20221013222719.277923-1-stephen.s.brennan@oracle.com |
---|---|
State | New |
Headers |
From: Stephen Brennan <stephen.s.brennan@oracle.com>
To: Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Stephen Brennan <stephen.s.brennan@oracle.com>
Subject: [RFC] fsnotify: allow sleepable child dentry flag update
Date: Thu, 13 Oct 2022 15:27:19 -0700
Message-Id: <20221013222719.277923-1-stephen.s.brennan@oracle.com>
(full transport, DKIM/ARC, and antispam headers omitted) |
Series | [RFC] fsnotify: allow sleepable child dentry flag update |
Commit Message
Stephen Brennan
Oct. 13, 2022, 10:27 p.m. UTC
Hi Jan, Amir, Al - this is a quite ugly patch, but I want to discuss the idea behind it, to see whether we can find something workable. I apologize for the length of text here, but I think it's necessary to give full context and ideas.

For background, on machines with lots of memory and weird workloads, __fsnotify_update_child_dentry_flags() has been a real thorn in our side. It grabs a couple of spinlocks and iterates over the whole d_subdirs list. If that list is long, this can take a while. The list can be long due to lots of negative dentries (which can easily number in the hundreds of millions if you have a process that's relatively frequently looking up nonexistent files). But the list can also be long due to *positive* dentries. I've seen directories with ~7 million positive dentry children falling victim to this function before (XFS allows lots of files per dir)! Positive dentries take longer to process in this function (since they're actually locked and written to), so you don't need as many for them to be a problem.

Anyway, if you have a huge d_subdirs list, then you can have problems with soft lockups. From my measurements with ftrace, 100 million negative dentries means that the function takes about 6 seconds to complete (varies wildly by CPU and kernel config/version). That's bad, but it can get *much worse*. Imagine that there are many frequently accessed files in such a directory, and you have an inotify watch. As soon as that watch is removed, the last fsnotify connector goes away, and i_fsnotify_mask becomes 0. System calls accessing dentries still see DCACHE_FSNOTIFY_PARENT_WATCHED, so they fall into __fsnotify_parent and will try to update the dentry flags. In my experience, a thundering herd of CPUs race to __fsnotify_update_child_dentry_flags(). The winner begins updating and the rest spin waiting for the parent inode's i_lock. Many CPUs make it to that point, and they *all* will proceed to iterate through d_subdirs, regardless of how long the list is, even though only the first CPU needed to do it. So now your 6-second spin gets multiplied by 5-10. And since the directory is frequently accessed, all the dget/dputs from other CPUs will all spin for this long time. This amounts to a nearly unusable system.

Previously I've tried to generally limit or manage the number of negative dentries in the dcache, which as a general problem is very difficult to get consensus on. I've also tried the patches to reorder dentries in d_subdirs so negative dentries are at the end, which has some difficult issues interacting with d_walk. Neither of those ideas would help for a directory full of positive dentries either.

So I have two more narrowly scoped strategies to improve the situation. Both are included in the hacky, awful patch below.

First is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody is holding the spinlock for several seconds at a time. We can actually achieve this via a cursor, the same way that simple_readdir() is implemented. I think this might require moving the declaration of d_alloc_cursor() and maybe exporting it. I had to #include fs/internal.h, which is not ok.

On its own, that actually makes problems worse, because it allows several tasks to update at the same time, and they're constantly locking/unlocking, which makes contention worse. So second is to add an inode flag saying that __fsnotify_update_child_dentry_flags() is already in progress. This prevents concurrent execution, and it allows the caller to skip updating since they know it's being handled, so it eliminates the thundering herd problem.

The patch works great! It eliminates the chances of soft lockups and makes the system responsive under those weird loads. But now, I know I've added a new issue. Updating dentry flags is no longer atomic, and we've lost the guarantee that after fsnotify_recalc_mask(), child dentries are all flagged when necessary. It's possible that __fsnotify_update_child_dentry_flags() will skip executing since it sees an update is already happening, and inotify_add_watch() would return without the watch being fully ready.

I think the approach can still be salvaged; we just need a way to resolve this. E.g. a wait queue or mutex in the connector would allow us to preserve the guarantee that the child dentries are flagged when necessary. But I know that's a big addition, so I wanted to get some feedback from you as the maintainers. Is the strategy here stupid? Am I missing an easier option? Is some added synchronization in the connector workable?

Thanks!
Stephen

---

You may find it useful to create negative dentries with [1] while using this patch. Create 100 million dentries as follows. Then use [2] to create userspace load that's accessing files in the same directory. Finally, inotifywait will set up a watch, terminate after one event, and tear it down. Without the patch, the thundering herd of userspace tasks all line up to update the flags and lock up the system.

    ./negdentcreate -p /tmp/dir -c 100000000 -t 4
    touch /tmp/dir/file{1..10}
    for i in {1..10}; do ./openclose /tmp/dir/file$i & done
    inotifywait /tmp/dir

[1] https://github.com/brenns10/kernel_stuff/tree/master/negdentcreate
[2] https://github.com/brenns10/kernel_stuff/tree/master/openclose

 fs/notify/fsnotify.c | 96 ++++++++++++++++++++++++++++++++++----------
 include/linux/fs.h   |  4 ++
 2 files changed, 78 insertions(+), 22 deletions(-)
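For concreteness, the guard in the second strategy might look something like the sketch below. This is illustrative only, not the patch itself: I_FSNOTIFY_UPDATING is an assumed name for the new inode state bit, and fsnotify_try_begin_update()/fsnotify_end_update() are hypothetical helpers.

#include <linux/fs.h>
#include <linux/spinlock.h>

/*
 * Sketch of the "update already in progress" guard, assuming a
 * hypothetical inode state bit I_FSNOTIFY_UPDATING protected by
 * i_lock. The real patch defines its own flag in include/linux/fs.h.
 */
static bool fsnotify_try_begin_update(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	if (inode->i_state & I_FSNOTIFY_UPDATING) {
		/* Another task is already updating; the caller may skip. */
		spin_unlock(&inode->i_lock);
		return false;
	}
	inode->i_state |= I_FSNOTIFY_UPDATING;
	spin_unlock(&inode->i_lock);
	return true;
}

static void fsnotify_end_update(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	inode->i_state &= ~I_FSNOTIFY_UPDATING;
	spin_unlock(&inode->i_lock);
}

The skip-if-busy behavior is exactly what breaks the atomicity guarantee discussed above: a caller that skips has no way to know when the in-flight update finishes.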
Comments
On Thu, Oct 13, 2022 at 03:27:19PM -0700, Stephen Brennan wrote:
> So I have two more narrowly scoped strategies to improve the situation. Both are included in the hacky, awful patch below.
>
> First is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody is holding the spinlock for several seconds at a time. We can actually achieve this via a cursor, the same way that simple_readdir() is implemented. I think this might require moving the declaration of d_alloc_cursor() and maybe exporting it. I had to #include fs/internal.h, which is not ok.

Er... Won't that expose every filesystem to having to deal with cursors? Currently it's entirely up to the filesystem in question, and I wouldn't bet a dime on everyone being ready to cope with such surprises...
On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>
> Hi Jan, Amir, Al - this is a quite ugly patch, but I want to discuss the idea behind it, to see whether we can find something workable. I apologize for the length of text here, but I think it's necessary to give full context and ideas.

Hi Stephen!

[snip]

> The patch works great! It eliminates the chances of soft lockups and makes the system responsive under those weird loads. But now, I know I've added a new issue. Updating dentry flags is no longer atomic, and we've lost the guarantee

Just between us ;) the update of the inode event mask is not atomic anyway, because the test for 'parent_watched' and fsnotify_inode_watches_children() in __fsnotify_parent() are done without any memory access synchronization.

IOW, the only guarantee for users is that *sometime* after adding events to a mark mask, events will start being delivered, and *sometime* after removing events from a mark mask, events will stop being delivered. Some events may have implicit memory barriers that make event delivery more deterministic, but others may not.

This may not be considered an issue for asynchronous events, but actually, for permission events, I would like to fix that. To understand my motivations you can look at:
https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Evicting_file_content

> that after fsnotify_recalc_mask(), child dentries are all flagged when necessary. It's possible that __fsnotify_update_child_dentry_flags() will skip executing since it sees an update is already happening, and inotify_add_watch() would return without the watch being fully ready.
>
> I think the approach can still be salvaged; we just need a way to resolve this. E.g. a wait queue or mutex in the connector would allow us to preserve the guarantee that the child dentries are flagged when necessary. But I know that's a big addition, so I wanted to get some feedback from you as the maintainers. Is the strategy here stupid? Am I missing an easier option?

I think you may be missing an easier option.

The call to __fsnotify_update_child_dentry_flags() in __fsnotify_parent() is a very aggressive optimization and I think it may be an overkill, and a footgun, according to your analysis.

If only called from the context of fsnotify_recalc_mask() (i.e. update mark mask), __fsnotify_update_child_dentry_flags() can take the dir inode_lock() to synchronize.

I don't think that the dir inode spin lock needs to be held at all during children iteration.

I think that d_find_any_alias() should be used to obtain the alias with elevated refcount instead of the awkward d_u.d_alias iteration loop.

In the context of __fsnotify_parent(), I think the optimization should stick with updating the flags for the specific child dentry that had the false positive parent_watched indication, leaving the rest of the siblings alone.

Would that address the performance issues of your workload?

Jan, can you foresee any problems with this change?

Thanks,
Amir.
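For concreteness, a minimal sketch of what Amir describes, assuming the walk is only entered from fsnotify_recalc_mask(): take the directory inode_lock() for mutual exclusion and grab one alias via d_find_any_alias(). fsnotify_update_children() is a hypothetical name; error handling and memory barriers are omitted.

#include <linux/dcache.h>
#include <linux/fs.h>
#include <linux/spinlock.h>

/*
 * Illustrative sketch, not the real fsnotify code: synchronize via
 * the directory inode lock, obtain the alias with d_find_any_alias()
 * (which returns it with an elevated refcount), and flag children
 * under the parent dentry lock.
 */
static void fsnotify_update_children(struct inode *dir, bool watched)
{
	struct dentry *alias, *child;

	inode_lock(dir);			/* serialize flag updates */
	alias = d_find_any_alias(dir);		/* holds a reference */
	if (!alias)
		goto out_unlock;

	spin_lock(&alias->d_lock);
	list_for_each_entry(child, &alias->d_subdirs, d_child) {
		if (!child->d_inode)		/* negative dentry: skip */
			continue;
		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
		if (watched)
			child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
		else
			child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
		spin_unlock(&child->d_lock);
	}
	spin_unlock(&alias->d_lock);
	dput(alias);
out_unlock:
	inode_unlock(dir);
}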
Amir Goldstein <amir73il@gmail.com> writes:
> On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>>
>> Hi Jan, Amir, Al - this is a quite ugly patch, but I want to discuss the idea behind it, to see whether we can find something workable. I apologize for the length of text here, but I think it's necessary to give full context and ideas.

[snip]

>> The patch works great! It eliminates the chances of soft lockups and makes the system responsive under those weird loads. But now, I know I've added a new issue. Updating dentry flags is no longer atomic, and we've lost the guarantee
>
> Just between us ;) the update of the inode event mask is not atomic anyway, because the test for 'parent_watched' and fsnotify_inode_watches_children() in __fsnotify_parent() are done without any memory access synchronization.
>
> IOW, the only guarantee for users is that *sometime* after adding events to a mark mask, events will start being delivered, and *sometime* after removing events from a mark mask, events will stop being delivered. Some events may have implicit memory barriers that make event delivery more deterministic, but others may not.

I did wonder about whether it was truly atomic even without the sleeping... the sleeping just makes matters much worse. But without the sleeping, I feel like it wouldn't take much memory synchronization. The dentry flags modification is protected by a spinlock; I assume we would just need a memory barrier to pair with the unlock?

(But then again, I really need to read and then reread the memory model document, and think about it when it's not late for me.)

> This may not be considered an issue for asynchronous events, but actually, for permission events, I would like to fix that. To understand my motivations you can look at:
> https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Evicting_file_content

I'll take a deeper look in (my) morning! It definitely helps for me to better understand use cases since I really don't know much beyond inotify.

[snip]

> I think you may be missing an easier option.
>
> The call to __fsnotify_update_child_dentry_flags() in __fsnotify_parent() is a very aggressive optimization and I think it may be an overkill, and a footgun, according to your analysis.

Agreed!

> If only called from the context of fsnotify_recalc_mask() (i.e. update mark mask), __fsnotify_update_child_dentry_flags() can take the dir inode_lock() to synchronize.
>
> I don't think that the dir inode spin lock needs to be held at all during children iteration.

Definitely a sleeping lock is better than the spin lock. And if we did something like that, it would be worth keeping a little bit of state in the connector to keep track of whether the dentry flags are set or not. This way, if several marks are added in a row, you don't repeatedly iterate over the children to do the same operation over and over again.

No matter what, we have to hold the parent dentry's spinlock, and that's expensive. So if we can make the update happen only when it would actually enable or disable the flags, that's worth a few bits of state.

> I think that d_find_any_alias() should be used to obtain the alias with elevated refcount instead of the awkward d_u.d_alias iteration loop.

D'oh! Much better idea :)
Do you think the BUG_ON would still be worthwhile?

> In the context of __fsnotify_parent(), I think the optimization should stick with updating the flags for the specific child dentry that had the false positive parent_watched indication, leaving the rest of the siblings alone.
>
> Would that address the performance issues of your workload?

I think synchronizing __fsnotify_update_child_dentry_flags() with a mutex and getting rid of the call from __fsnotify_parent() would go a *huge* way (maybe 80%?) towards resolving the performance issues we've seen. To be clear, I'm representing not one single workload, but a few different customer workloads which center around this area.

There are some extreme cases I've seen, where the dentry list is so huge that even iterating over it once with the parent dentry spinlock held is enough to trigger softlockups - no need for several calls to __fsnotify_update_child_dentry_flags() queueing up as described in the original mail. So ideally, I'd love to try to make *something* work with the cursor idea as well. But I think the two ideas can be separated easily, and I can discuss with Al further about whether cursors can be salvaged at all.

Thanks so much for the detailed look at this!
Stephen
Hi guys!

On Fri 14-10-22 11:01:39, Amir Goldstein wrote:
> On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
> > Hi Jan, Amir, Al - this is a quite ugly patch, but I want to discuss the idea behind it, to see whether we can find something workable. I apologize for the length of text here, but I think it's necessary to give full context and ideas.

[snip]

> Just between us ;) the update of the inode event mask is not atomic anyway, because the test for 'parent_watched' and fsnotify_inode_watches_children() in __fsnotify_parent() are done without any memory access synchronization.
>
> IOW, the only guarantee for users is that *sometime* after adding events to a mark mask, events will start being delivered, and *sometime* after removing events from a mark mask, events will stop being delivered. Some events may have implicit memory barriers that make event delivery more deterministic, but others may not.

This holds even for the fsnotify_inode_watches_children() call in __fsnotify_update_child_dentry_flags(). In principle we can have racing calls to __fsnotify_update_child_dentry_flags() which result in temporary inconsistency of child dentry flags with the mask in the parent's connector.

[snip]

> I think you may be missing an easier option.
>
> The call to __fsnotify_update_child_dentry_flags() in __fsnotify_parent() is a very aggressive optimization and I think it may be an overkill, and a footgun, according to your analysis.
>
> If only called from the context of fsnotify_recalc_mask() (i.e. update mark mask), __fsnotify_update_child_dentry_flags() can take the dir inode_lock() to synchronize.

This will nest the inode lock into the fsnotify group lock though. I'm not aware of any immediate lock ordering problem with that, but it might be problematic. Otherwise this seems workable.

BTW if we call __fsnotify_update_child_dentry_flags() only from fsnotify_recalc_mask(), I don't think we even need more state in the connector to detect whether a dentry flags update is needed or not. fsnotify_recalc_mask() knows the old & new mask state, so it has all the information it needs to detect whether a dentry flags update is needed or not. We just have to resolve the flags update races first to avoid maintaining the inconsistent flags state for a long time.

> I don't think that the dir inode spin lock needs to be held at all during children iteration.
>
> I think that d_find_any_alias() should be used to obtain the alias with elevated refcount instead of the awkward d_u.d_alias iteration loop.
>
> In the context of __fsnotify_parent(), I think the optimization should stick with updating the flags for the specific child dentry that had the false positive parent_watched indication, leaving the rest of the siblings alone.

Yep, that looks sensible to me. Essentially this should be just an uncommon fixup path until the full child dentry walk from fsnotify_recalc_mask() catches up with synchronizing all child dentry flags.

								Honza
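A sketch of the check Jan describes, for concreteness: fsnotify_children_update_needed() is a hypothetical helper, while FS_EVENTS_POSS_ON_CHILD is the existing fsnotify mask of events that can occur on children.

#include <linux/fsnotify_backend.h>

/*
 * Sketch only: with the old and new cumulative masks in hand,
 * fsnotify_recalc_mask() only needs to walk d_subdirs when the
 * "events possible on child" property actually flips.
 */
static bool fsnotify_children_update_needed(__u32 old_mask, __u32 new_mask)
{
	bool was_watching = old_mask & FS_EVENTS_POSS_ON_CHILD;
	bool now_watching = new_mask & FS_EVENTS_POSS_ON_CHILD;

	return was_watching != now_watching;
}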
On Mon, Oct 17, 2022 at 10:59 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
> > On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote:

[snip]

> > Just between us ;) the update of the inode event mask is not atomic anyway, because the test for 'parent_watched' and fsnotify_inode_watches_children() in __fsnotify_parent() are done without any memory access synchronization.
> >
> > IOW, the only guarantee for users is that *sometime* after adding events to a mark mask, events will start being delivered, and *sometime* after removing events from a mark mask, events will stop being delivered. Some events may have implicit memory barriers that make event delivery more deterministic, but others may not.
>
> I did wonder about whether it was truly atomic even without the sleeping... the sleeping just makes matters much worse. But without the sleeping, I feel like it wouldn't take much memory synchronization. The dentry flags modification is protected by a spinlock; I assume we would just need a memory barrier to pair with the unlock?

Haha, "just" cannot be used here :)

Yes, some sort of memory barrier is missing. The trick is not hurting performance in the common fast path where the event subscription mask has not been changed. This is especially true considering that all applications to date did just fine without atomic semantics for updating mark masks. Some applications may have been living in blissful ignorance...

> (But then again, I really need to read and then reread the memory model document, and think about it when it's not late for me.)
>
> > This may not be considered an issue for asynchronous events, but actually, for permission events, I would like to fix that. To understand my motivations you can look at:
> > https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Evicting_file_content
>
> I'll take a deeper look in (my) morning! It definitely helps for me to better understand use cases since I really don't know much beyond inotify.

Don't stress yourself over this doc, it's a WIP, but it describes a system where the atomic update of mark masks would be important for correctness. The document does provide a method of working around the atomic mask update requirement (making sure that the sb event mask includes the needed event), which should be good enough for the described use case.

[snip]

> > I don't think that the dir inode spin lock needs to be held at all during children iteration.
>
> Definitely a sleeping lock is better than the spin lock. And if we did something like that, it would be worth keeping a little bit of state in the connector to keep track of whether the dentry flags are set or not. This way, if several marks are added in a row, you don't repeatedly iterate over the children to do the same operation over and over again.
>
> No matter what, we have to hold the parent dentry's spinlock, and that's expensive. So if we can make the update happen only when it would actually enable or disable the flags, that's worth a few bits of state.

As Jan wrote, this state already exists in i_fsnotify_mask. fsnotify_recalc_mask() is very capable of knowing when the state described by fsnotify_inode_watches_children() is going to change.

> > I think that d_find_any_alias() should be used to obtain the alias with elevated refcount instead of the awkward d_u.d_alias iteration loop.
>
> D'oh! Much better idea :)
> Do you think the BUG_ON would still be worthwhile?

Which BUG_ON()? In general no; if there are ever multiple aliases for a directory inode, updating dentry flags would be the last of our problems.

> > In the context of __fsnotify_parent(), I think the optimization should stick with updating the flags for the specific child dentry that had the false positive parent_watched indication, leaving the rest of the siblings alone.
> >
> > Would that address the performance issues of your workload?
>
> I think synchronizing __fsnotify_update_child_dentry_flags() with a mutex and getting rid of the call from __fsnotify_parent() would go a *huge* way (maybe 80%?) towards resolving the performance issues we've seen. To be clear, I'm representing not one single workload, but a few different customer workloads which center around this area.
>
> There are some extreme cases I've seen, where the dentry list is so huge that even iterating over it once with the parent dentry spinlock held is enough to trigger softlockups - no need for several calls to __fsnotify_update_child_dentry_flags() queueing up as described in the original mail. So ideally, I'd love to try to make *something* work with the cursor idea as well. But I think the two ideas can be separated easily, and I can discuss with Al further about whether cursors can be salvaged at all.

Assuming that you take the dir inode_lock() in __fsnotify_update_child_dentry_flags(), then I *think* that children dentries cannot be added to the dcache, and children dentries cannot turn from positive to negative and vice versa.

Probably the only thing that can change d_subdirs is children dentries being evicted from the dcache(?), so I *think* that once in N children you can dget(child), drop alias->d_lock, cond_resched(), and then continue the d_subdirs iteration from child->d_child.

Thanks,
Amir.
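A sketch of that resume trick, for concreteness. It is illustrative only and leans on the assumption, questioned later in the thread, that holding the dir inode_lock() keeps d_subdirs stable apart from evictions; the function name and batch size are assumed.

#include <linux/dcache.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/*
 * Sketch of the "once in N children" resume trick described above.
 * Assumes the dir inode_lock() is held so the pinned child stays on
 * d_subdirs across the reschedule. Flag updates are elided (see the
 * earlier sketch).
 */
static void fsnotify_update_children_sleepable(struct dentry *alias, bool watched)
{
	struct dentry *child, *pinned = NULL;
	unsigned int visited = 0;

	spin_lock(&alias->d_lock);
	list_for_each_entry(child, &alias->d_subdirs, d_child) {
		if (!(++visited % 1024) && need_resched()) {
			dget(child);		/* pin our place in the list */
			spin_unlock(&alias->d_lock);
			dput(pinned);		/* release the previous pin */
			pinned = child;
			cond_resched();
			spin_lock(&alias->d_lock);
			/* resume from pinned->d_child on the next iteration */
		}
		/* ... update child->d_flags as in the earlier sketch ... */
	}
	spin_unlock(&alias->d_lock);
	dput(pinned);			/* dput(NULL) is a no-op */
}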
Amir Goldstein <amir73il@gmail.com> writes:
> On Mon, Oct 17, 2022 at 10:59 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
>>
>> Amir Goldstein <amir73il@gmail.com> writes:

[snip]

>> > I think that d_find_any_alias() should be used to obtain the alias with elevated refcount instead of the awkward d_u.d_alias iteration loop.
>>
>> D'oh! Much better idea :)
>> Do you think the BUG_ON would still be worthwhile?
>
> Which BUG_ON()? In general no; if there are ever multiple aliases for a directory inode, updating dentry flags would be the last of our problems.

Sorry, I meant the one in my patch which asserts that the dentry is the only alias for that inode. I suppose you're right about having bigger problems in that case -- but the existing code "handles" it by iterating over the alias list.

[snip]

> Assuming that you take the dir inode_lock() in __fsnotify_update_child_dentry_flags(), then I *think* that children dentries cannot be added to the dcache, and children dentries cannot turn from positive to negative and vice versa.
>
> Probably the only thing that can change d_subdirs is children dentries being evicted from the dcache(?), so I *think* that once in N children you can dget(child), drop alias->d_lock, cond_resched(), and then continue the d_subdirs iteration from child->d_child.

This sounds like an excellent idea. I can't think of anything which would remove a dentry from d_subdirs without the inode lock held. Cursors wouldn't move without the lock held in read mode. Temporary dentries from d_alloc_parallel are similar - they need the inode locked shared in order to be removed from the parent's list.

I'll try implementing it (along with the fsnotify changes we've discussed in this thread). I'll add a BUG_ON after we wake up from cond_resched() to guarantee that the parent is the same dentry as expected - just in case the assumption is wrong.

Al - if you've read this far :) - does this approach sound reasonable, compared to the cursor? I'll send out some concrete patches as soon as I've implemented and done a few tests on them.

Thanks,
Stephen
On Mon, Oct 17, 2022 at 8:00 PM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
>
> > On Mon, Oct 17, 2022 at 10:59 AM Stephen Brennan
> > <stephen.s.brennan@oracle.com> wrote:
> >>
> >> Amir Goldstein <amir73il@gmail.com> writes:
> [snip]
> >> > I think that d_find_any_alias() should be used to obtain
> >> > the alias with elevated refcount instead of the awkward
> >> > d_u.d_alias iteration loop.
> >>
> >> D'oh! Much better idea :)
> >> Do you think the BUG_ON would still be worthwhile?
> >>
> >
> > Which BUG_ON()?
> > In general no, if there are ever multiple aliases for
> > a directory inode, updating dentry flags would be the last
> > of our problems.
>
> Sorry, I meant the one in my patch which asserts that the dentry is the
> only alias for that inode. I suppose you're right about having bigger
> problems in that case -- but the existing code "handles" it by iterating
> over the alias list.
>

It is not important IMO.

> >
> >> > In the context of __fsnotify_parent(), I think the optimization
> >> > should stick with updating the flags for the specific child dentry
> >> > that had the false positive parent_watched indication,
> >> > leaving the rest of
> >>
> >> > Would that address the performance issues of your workload?
> >>
> >> I think synchronizing the __fsnotify_update_child_dentry_flags() with a
> >> mutex and getting rid of the call from __fsnotify_parent() would go a
> >> *huge* way (maybe 80%?) towards resolving the performance issues we've
> >> seen. To be clear, I'm representing not one single workload, but a few
> >> different customer workloads which center around this area.
> >>
> >> There are some extreme cases I've seen, where the dentry list is so
> >> huge, that even iterating over it once with the parent dentry spinlock
> >> held is enough to trigger softlockups - no need for several calls to
> >> __fsnotify_update_child_dentry_flags() queueing up as described in the
> >> original mail. So ideally, I'd love to try to make *something* work with
> >> the cursor idea as well. But I think the two ideas can be separated
> >> easily, and I can discuss with Al further about whether cursors can be
> >> salvaged at all.
> >>
> >
> > Assuming that you take the dir inode_lock() in
> > __fsnotify_update_child_dentry_flags(), then I *think* that children
> > dentries cannot be added to dcache and children dentries cannot
> > turn from positive to negative and vice versa.
> >
> > Probably the only thing that can change d_subdirs is children dentries
> > being evicted from dcache(?), so I *think* that once in N children
> > you can dget(child), drop alias->d_lock, cond_resched(),
> > and then continue d_subdirs iteration from child->d_child.
>
> This sounds like an excellent idea. I can't think of anything which
> would remove a dentry from d_subdirs without the inode lock held.
> Cursors wouldn't move without the lock held in read mode. Temporary
> dentries from d_alloc_parallel() are similar - they need the inode locked
> shared in order to be removed from the parent's list.
>
> I'll try implementing it (along with the fsnotify changes we've
> discussed in this thread). I'll add a BUG_ON after we wake up from
> cond_resched() to guarantee that the parent is the same dentry as
> expected - just in case the assumption is wrong.

BUG_ON() is almost never a good idea.
If anything, you should use if (WARN_ON_ONCE()) and break out of the
loop, either returning an error to fanotify_mark() or not.

I personally think that, for an unexpected code assertion, returning an
error to the user is not a must in this case.

Thanks,
Amir.

>
> Al - if you've read this far :) - does this approach sound reasonable,
> compared to the cursor? I'll send out some concrete patches as soon as
> I've implemented and done a few tests on them.
>
> Thanks,
> Stephen
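[Editorial note: a sketch of the WARN_ON_ONCE() bail-out Amir suggests, as
it might look inside the d_subdirs walk after re-taking alias->d_lock. The
exact condition checked is an assumption for illustration, not the final
patch.]

	if (WARN_ON_ONCE(child->d_parent != alias)) {
		/*
		 * Our held child was unexpectedly reparented while we
		 * slept: warn once and stop updating flags instead of
		 * crashing the box with BUG_ON().
		 */
		break;
	}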
On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> Hi Jan, Amir, Al,
>
> Here's my first shot at implementing what we discussed. I tested it using the
> negative dentry creation tool I mentioned in my previous message, with a similar
> workflow. Rather than having a bunch of threads accessing the directory to
> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
> just started a lot of inotifywait tasks:
>
> 1. Create 100 million negative dentries in a dir
> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
>      trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
>      sudo cat /sys/kernel/debug/tracing/trace_pipe
> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
>
> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
> Once that completes, all the inotifywait tasks say "Watches established".
> Similarly, once an access occurs in the directory, a single
> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
> In short: it works great!
>
> However, while testing this, I've observed a "dentry still in use" warning during
> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
> use, and I assume that fsnotify must have been used to trigger this. The error
> is not there on mainline without my patch, so it's definitely caused by this
> code. I'll continue debugging it, but I wanted to share my first take on this so
> you could take a look.
>
> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
>

Hmm, the assumption we made about partial stability of d_subdirs
under dir inode lock looks incorrect for rpc_pipefs.
None of the functions that update the rpc_pipefs dcache take the parent
inode lock.

The assumption looks incorrect for other pseudo fs as well.

The other side of the coin is that we do not really need to worry
about walking a huge list of pseudo fs children.

The question is how to classify those pseudo fs and whether there
are other cases like this that we missed.

Perhaps having simple_dentry_operations is a good enough clue,
but perhaps it is not enough. I am not sure.
It covers all the cases of pseudo fs that I know about, so you
can certainly use this clue to avoid going to sleep in the
update loop as a first approximation.

I can try to figure this out, but I prefer that Al will chime in to
provide reliable answers to those questions.

Thanks,
Amir.
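[Editorial note: a rough sketch of that "first approximation". The helper
name is hypothetical, and whether simple_dentry_operations is a sufficient
classifier is exactly the open question raised above.]

	#include <linux/dcache.h>
	#include <linux/fs.h>

	/*
	 * Heuristic only: treat libfs-style pseudo filesystems (those
	 * using simple_dentry_operations) as unsafe to sleep under
	 * during the d_subdirs walk.
	 */
	static bool fsnotify_update_may_sleep(struct dentry *alias)
	{
		return alias->d_op != &simple_dentry_operations;
	}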
Amir Goldstein <amir73il@gmail.com> writes:
> On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> <stephen.s.brennan@oracle.com> wrote:
>>
>> Hi Jan, Amir, Al,
>>
>> Here's my first shot at implementing what we discussed. I tested it using the
>> negative dentry creation tool I mentioned in my previous message, with a similar
>> workflow. Rather than having a bunch of threads accessing the directory to
>> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
>> just started a lot of inotifywait tasks:
>>
>> 1. Create 100 million negative dentries in a dir
>> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
>>      trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
>>      sudo cat /sys/kernel/debug/tracing/trace_pipe
>> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
>>
>> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
>> Once that completes, all the inotifywait tasks say "Watches established".
>> Similarly, once an access occurs in the directory, a single
>> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
>> In short: it works great!
>>
>> However, while testing this, I've observed a "dentry still in use" warning during
>> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
>> use, and I assume that fsnotify must have been used to trigger this. The error
>> is not there on mainline without my patch, so it's definitely caused by this
>> code. I'll continue debugging it, but I wanted to share my first take on this so
>> you could take a look.
>>
>> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
>>
>
> Hmm, the assumption we made about partial stability of d_subdirs
> under dir inode lock looks incorrect for rpc_pipefs.
> None of the functions that update the rpc_pipefs dcache take the parent
> inode lock.

That may be, but I'm confused how that would trigger this issue. If I'm
understanding correctly, this warning indicates a reference counting
bug.

If __fsnotify_update_child_dentry_flags() had gone to sleep and the list
were edited, then it seems like there could be only two possibilities
that could cause bugs:

1. The dentry we slept holding a reference to was removed from the list,
and maybe moved to a different one, or just removed. If that were the
case, we're quite unlucky, because we'll start looping indefinitely as
we'll never get back to the beginning of the list, or worse.

2. A dentry adjacent to the one we held a reference to was removed. In
that case, our dentry's d_child pointers should get rearranged, and when
we wake, we should see those updates and continue.

In neither of those cases do I understand where we could have done a
dget() unpaired with a dput(), which is what seemingly would trigger
this issue.

I'm probably wrong, but without understanding the mechanism behind the
error, I'm not sure how to approach it.

> The assumption looks incorrect for other pseudo fs as well.
>
> The other side of the coin is that we do not really need to worry
> about walking a huge list of pseudo fs children.
>
> The question is how to classify those pseudo fs and whether there
> are other cases like this that we missed.
>
> Perhaps having simple_dentry_operations is a good enough clue,
> but perhaps it is not enough. I am not sure.
>
> It covers all the cases of pseudo fs that I know about, so you
> can certainly use this clue to avoid going to sleep in the
> update loop as a first approximation.

I would worry that it would become an exercise in whack-a-mole.
Allow/deny-listing certain filesystems for certain behavior seems scary.

> I can try to figure this out, but I prefer that Al will chime in to
> provide reliable answers to those questions.

I have a core dump from the warning (with panic_on_warn=1) and will see
if I can trace or otherwise identify the exact mechanism myself.

> Thanks,
> Amir.
>

Thanks for your detailed review of both the patches. I didn't get much
time today to update the patches and test them. Your feedback looks very
helpful though, and I hope to send out an updated revision tomorrow.

In the absolute worst case (and I don't want to concede defeat just
yet), keeping patch 1 without patch 2 (sleepable iteration) would still
be a major win, since it resolves the thundering herd problem, which is
what compounds the problem of the long lists.

Thanks!
Stephen
On Wed, Oct 19, 2022 at 2:52 AM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
> > On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> > <stephen.s.brennan@oracle.com> wrote:
> >>
> >> Hi Jan, Amir, Al,
> >>
> >> Here's my first shot at implementing what we discussed. I tested it using the
> >> negative dentry creation tool I mentioned in my previous message, with a similar
> >> workflow. Rather than having a bunch of threads accessing the directory to
> >> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
> >> just started a lot of inotifywait tasks:
> >>
> >> 1. Create 100 million negative dentries in a dir
> >> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
> >>      trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
> >>      sudo cat /sys/kernel/debug/tracing/trace_pipe
> >> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
> >>
> >> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
> >> Once that completes, all the inotifywait tasks say "Watches established".
> >> Similarly, once an access occurs in the directory, a single
> >> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
> >> In short: it works great!
> >>
> >> However, while testing this, I've observed a "dentry still in use" warning during
> >> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
> >> use, and I assume that fsnotify must have been used to trigger this. The error
> >> is not there on mainline without my patch, so it's definitely caused by this
> >> code. I'll continue debugging it, but I wanted to share my first take on this so
> >> you could take a look.
> >>
> >> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
> >>
> >
> > Hmm, the assumption we made about partial stability of d_subdirs
> > under dir inode lock looks incorrect for rpc_pipefs.
> > None of the functions that update the rpc_pipefs dcache take the parent
> > inode lock.
>
> That may be, but I'm confused how that would trigger this issue. If I'm
> understanding correctly, this warning indicates a reference counting
> bug.

Yes.
On generic_shutdown_super() there should be no more
references to dentries.

>
> If __fsnotify_update_child_dentry_flags() had gone to sleep and the list
> were edited, then it seems like there could be only two possibilities
> that could cause bugs:
>
> 1. The dentry we slept holding a reference to was removed from the list,
> and maybe moved to a different one, or just removed. If that were the
> case, we're quite unlucky, because we'll start looping indefinitely as
> we'll never get back to the beginning of the list, or worse.
>
> 2. A dentry adjacent to the one we held a reference to was removed. In
> that case, our dentry's d_child pointers should get rearranged, and when
> we wake, we should see those updates and continue.
>
> In neither of those cases do I understand where we could have done a
> dget() unpaired with a dput(), which is what seemingly would trigger
> this issue.
>

I got the same impression.

> I'm probably wrong, but without understanding the mechanism behind the
> error, I'm not sure how to approach it.
>
> > The assumption looks incorrect for other pseudo fs as well.
> >
> > The other side of the coin is that we do not really need to worry
> > about walking a huge list of pseudo fs children.
> >
> > The question is how to classify those pseudo fs and whether there
> > are other cases like this that we missed.
> >
> > Perhaps having simple_dentry_operations is a good enough clue,
> > but perhaps it is not enough. I am not sure.
> >
> > It covers all the cases of pseudo fs that I know about, so you
> > can certainly use this clue to avoid going to sleep in the
> > update loop as a first approximation.
>
> I would worry that it would become an exercise in whack-a-mole.
> Allow/deny-listing certain filesystems for certain behavior seems scary.
>

Totally agree.

> > I can try to figure this out, but I prefer that Al will chime in to
> > provide reliable answers to those questions.
>
> I have a core dump from the warning (with panic_on_warn=1) and will see
> if I can trace or otherwise identify the exact mechanism myself.
>

Most likely the refcount was already leaked earlier, but worth trying.

>
> Thanks for your detailed review of both the patches. I didn't get much
> time today to update the patches and test them. Your feedback looks very
> helpful though, and I hope to send out an updated revision tomorrow.
>
> In the absolute worst case (and I don't want to concede defeat just
> yet), keeping patch 1 without patch 2 (sleepable iteration) would still
> be a major win, since it resolves the thundering herd problem, which is
> what compounds the problem of the long lists.
>

Makes sense.
Patch 1 logic is solid.
Hope my suggestions won't complicate things too much for you; if they
do, I am sure Jan will find a way to simplify ;)

Thanks,
Amir.
Amir Goldstein <amir73il@gmail.com> writes:
> On Wed, Oct 19, 2022 at 2:52 AM Stephen Brennan
> <stephen.s.brennan@oracle.com> wrote:
>>
>> Amir Goldstein <amir73il@gmail.com> writes:
>> > On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
>> > <stephen.s.brennan@oracle.com> wrote:
>> >>
>> >> Hi Jan, Amir, Al,
>> >>
>> >> Here's my first shot at implementing what we discussed. I tested it using the
>> >> negative dentry creation tool I mentioned in my previous message, with a similar
>> >> workflow. Rather than having a bunch of threads accessing the directory to
>> >> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
>> >> just started a lot of inotifywait tasks:
>> >>
>> >> 1. Create 100 million negative dentries in a dir
>> >> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
>> >>      trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
>> >>      sudo cat /sys/kernel/debug/tracing/trace_pipe
>> >> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
>> >>
>> >> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
>> >> Once that completes, all the inotifywait tasks say "Watches established".
>> >> Similarly, once an access occurs in the directory, a single
>> >> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
>> >> In short: it works great!
>> >>
>> >> However, while testing this, I've observed a "dentry still in use" warning during
>> >> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
>> >> use, and I assume that fsnotify must have been used to trigger this. The error
>> >> is not there on mainline without my patch, so it's definitely caused by this
>> >> code. I'll continue debugging it, but I wanted to share my first take on this so
>> >> you could take a look.
>> >>
>> >> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
>> >>
>> >
>> > Hmm, the assumption we made about partial stability of d_subdirs
>> > under dir inode lock looks incorrect for rpc_pipefs.
>> > None of the functions that update the rpc_pipefs dcache take the parent
>> > inode lock.
>>
>> That may be, but I'm confused how that would trigger this issue. If I'm
>> understanding correctly, this warning indicates a reference counting
>> bug.
>
> Yes.
> On generic_shutdown_super() there should be no more
> references to dentries.
>
>>
>> If __fsnotify_update_child_dentry_flags() had gone to sleep and the list
>> were edited, then it seems like there could be only two possibilities
>> that could cause bugs:
>>
>> 1. The dentry we slept holding a reference to was removed from the list,
>> and maybe moved to a different one, or just removed. If that were the
>> case, we're quite unlucky, because we'll start looping indefinitely as
>> we'll never get back to the beginning of the list, or worse.
>>
>> 2. A dentry adjacent to the one we held a reference to was removed. In
>> that case, our dentry's d_child pointers should get rearranged, and when
>> we wake, we should see those updates and continue.
>>
>> In neither of those cases do I understand where we could have done a
>> dget() unpaired with a dput(), which is what seemingly would trigger
>> this issue.
>>
>
> I got the same impression.

Well, I feel stupid. The reason behind this seems to be... that
d_find_any_alias() returns a reference to the dentry, and I promptly
leaked that. I'll have it fixed in v3, which I'm putting through testing
now.

Stephen
>
> Well, I feel stupid. The reason behind this seems to be... that
> d_find_any_alias() returns a reference to the dentry, and I promptly
> leaked that. I'll have it fixed in v3, which I'm putting through testing
> now.
>

I reckon that if you ran the LTP fsnotify tests, you would have seen this
warning a lot more often, instead of just on one random pseudo filesystem
that some process happens to set a watch on...

You should run the existing LTP tests to check for regressions.
The fanotify/inotify test cases in LTP are easy to run; for example,
run make in testcases/kernel/syscalls/fanotify and execute the
individual ./fanotify* executables.

If you point me to a branch, I can run the tests until you get your LTP
setup ready.

Thanks,
Amir.
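[Editorial note: for readers unfamiliar with LTP, the steps Amir describes
look roughly like this -- a sketch assuming a fresh checkout of the
linux-test-project/ltp repository; exact build steps may vary by version.]

	git clone https://github.com/linux-test-project/ltp.git
	cd ltp
	make autotools && ./configure        # one-time build setup
	cd testcases/kernel/syscalls/fanotify
	make                                 # builds fanotify01, fanotify02, ...
	./fanotify01                         # run an individual test binary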
Al Viro <viro@zeniv.linux.org.uk> writes:
> On Thu, Oct 13, 2022 at 03:27:19PM -0700, Stephen Brennan wrote:
>
>> So I have two more narrowly scoped strategies to improve the situation. Both are
>> included in the hacky, awful patch below.
>>
>> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
>> is holding the spinlock for several seconds at a time. We can actually achieve
>> this via a cursor, the same way that simple_readdir() is implemented. I think
>> this might require moving the declaration of d_alloc_cursor() and maybe
>> exporting it. I had to #include fs/internal.h which is not ok.
>
> Er...  Won't that expose every filesystem to having to deal with cursors?
> Currently it's entirely up to the filesystem in question and I wouldn't
> bet a dime on everyone being ready to cope with such surprises...

Hi Al,

I wanted to follow up on this. Yeah, I didn't realize that this could
cause issues for some filesystems when I wrote it, since I hadn't
considered that rather few filesystems use dcache_readdir(), and so most
aren't prepared to encounter a cursor. Thanks for that catch.

I think I came up with a better solution, which you can see in context
in v3 [1]. The only change I have from the posting there is to eliminate
the unnecessary "child->d_parent != parent" check.

So the new idea is to take a reference to the current child and then go
to sleep. That alone should be enough to prevent dentry_kill() from
removing the child from its parent's list. However, it wouldn't prevent
d_move() from moving it to a new list, nor would it prevent it from
moving along the list if it were a cursor. However, in this situation,
we also hold the parent i_rwsem in write mode, which means that no
concurrent rename or readdir can happen. You can see my full analysis
here [2].

So here's the new approach. If you can call out anything I've missed, or
confirm that it's sound, I'd really appreciate it!

/* Must be called with inode->i_rwsem held in write mode */
static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
{
	struct dentry *child, *alias, *last_ref = NULL;
	/* determine if the children should tell inode about their events */
	int watched = fsnotify_inode_watches_children(inode);

	alias = d_find_any_alias(inode);
	if (!alias)
		return watched;

	/*
	 * These lists can get very long, so we may need to sleep during
	 * iteration. Normally this would be impossible without a cursor,
	 * but since we have the inode locked exclusive, we're guaranteed
	 * that the directory won't be modified, so whichever dentry we
	 * pick to sleep on won't get moved.
	 */
	spin_lock(&alias->d_lock);
	list_for_each_entry(child, &alias->d_subdirs, d_child) {
		if (need_resched()) {
			/*
			 * We need to hold a reference while we sleep. But when
			 * we wake, dput() could free the dentry, invalidating
			 * the list pointers. We can't look at the list pointers
			 * until we re-lock the parent, and we can't dput() once
			 * we have the parent locked. So the solution is to hold
			 * onto our reference and free it the *next* time we drop
			 * alias->d_lock: either at the end of the function, or
			 * at the time of the next sleep.
			 */
			dget(child);
			spin_unlock(&alias->d_lock);
			dput(last_ref);
			last_ref = child;
			cond_resched();
			spin_lock(&alias->d_lock);
		}

		if (!child->d_inode)
			continue;

		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
		if (watched)
			child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
		else
			child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
		spin_unlock(&child->d_lock);
	}
	spin_unlock(&alias->d_lock);

	dput(last_ref);
	dput(alias);
	return watched;
}

Thanks,
Stephen

[1]: https://lore.kernel.org/linux-fsdevel/20221028001016.332663-4-stephen.s.brennan@oracle.com/
[2]: https://lore.kernel.org/linux-fsdevel/874jvigye9.fsf@oracle.com/
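[Editorial note: to make the stated precondition concrete, here is a
minimal sketch of the assumed call site -- a hypothetical caller, not the
actual call path in v3.]

	bool watched;

	/*
	 * Holding the directory's i_rwsem exclusive is what keeps
	 * d_subdirs stable enough to sleep inside the walk above.
	 */
	inode_lock(inode);	/* i_rwsem, write mode */
	watched = __fsnotify_update_children_dentry_flags(inode);
	inode_unlock(inode);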
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..d74949c01a29 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -13,6 +13,7 @@
 
 #include <linux/fsnotify_backend.h>
 #include "fsnotify.h"
+#include "../internal.h"
 
 /*
  * Clear all of the marks on an inode when it is being evicted from core
@@ -102,42 +103,93 @@ void fsnotify_sb_delete(struct super_block *sb)
  * on a child we run all of our children and set a dentry flag saying that the
  * parent cares.  Thus when an event happens on a child it can quickly tell
  * if there is a need to find a parent and send the event to the parent.
+ *
+ * Some directories may contain too many entries, making iterating through the
+ * parent list slow enough to cause soft lockups. So, we use a cursor -- just as
+ * simple_readdir() does -- which allows us to drop the locks, sleep, and pick
+ * up where we left off. To maintain mutual exclusion we set an inode state
+ * flag.
  */
 void __fsnotify_update_child_dentry_flags(struct inode *inode)
 {
-	struct dentry *alias;
-	int watched;
+	struct dentry *alias, *child, *cursor = NULL;
+	const int BATCH_SIZE = 50000;
+	int watched, prev_watched, batch_remaining = BATCH_SIZE;
+	struct list_head *p;
 
 	if (!S_ISDIR(inode->i_mode))
 		return;
 
-	/* determine if the children should tell inode about their events */
-	watched = fsnotify_inode_watches_children(inode);
+	/* Do a quick check while inode is unlocked. This avoids unnecessary
+	 * contention on i_lock. */
+	if (inode->i_state & I_FSNOTIFY_UPDATING)
+		return;
 
 	spin_lock(&inode->i_lock);
+
+	if (inode->i_state & I_FSNOTIFY_UPDATING) {
+		spin_unlock(&inode->i_lock);
+		return;
+	} else {
+		inode->i_state |= I_FSNOTIFY_UPDATING;
+	}
+
 	/* run all of the dentries associated with this inode.  Since this is a
-	 * directory, there damn well better only be one item on this list */
-	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
-		struct dentry *child;
-
-		/* run all of the children of the original inode and fix their
-		 * d_flags to indicate parental interest (their parent is the
-		 * original inode) */
-		spin_lock(&alias->d_lock);
-		list_for_each_entry(child, &alias->d_subdirs, d_child) {
-			if (!child->d_inode)
-				continue;
+	 * directory, there damn well better be exactly one item on this list */
+	BUG_ON(!inode->i_dentry.first);
+	BUG_ON(inode->i_dentry.first->next);
+	alias = container_of(inode->i_dentry.first, struct dentry, d_u.d_alias);
+
+	cursor = d_alloc_cursor(alias);
+	if (!cursor) {
+		inode->i_state &= ~I_FSNOTIFY_UPDATING;
+		spin_unlock(&inode->i_lock);
+		return; // TODO -ENOMEM
+	}
 
-			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
-			if (watched)
-				child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
-			else
-				child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
-			spin_unlock(&child->d_lock);
+	/* determine if the children should tell inode about their events */
+	watched = fsnotify_inode_watches_children(inode);
+	spin_lock(&alias->d_lock);
+	p = alias->d_subdirs.next;
+
+	while (p != &alias->d_subdirs) {
+		if (batch_remaining-- <= 0 && need_resched()) {
+			batch_remaining = BATCH_SIZE;
+			list_move(&cursor->d_child, p);
+			p = &cursor->d_child;
+			spin_unlock(&alias->d_lock);
+			spin_unlock(&inode->i_lock);
+			cond_resched();
+			spin_lock(&inode->i_lock);
+			spin_lock(&alias->d_lock);
+			prev_watched = watched;
+			watched = fsnotify_inode_watches_children(inode);
+			if (watched != prev_watched) {
+				/* Somebody else is messing with the flags,
+				 * bail and let them handle it.
+				 */
+				goto out;
+			}
+			/* TODO: is it possible that i_dentry list is changed? */
 		}
-		spin_unlock(&alias->d_lock);
+
+		child = list_entry(p, struct dentry, d_child);
+		p = p->next;
+
+		if (!child->d_inode || child->d_flags & DCACHE_DENTRY_CURSOR)
+			continue;
+
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		if (watched)
+			child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+		else
+			child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+		spin_unlock(&child->d_lock);
 	}
+
+out:
+	inode->i_state &= ~I_FSNOTIFY_UPDATING;
+	spin_unlock(&alias->d_lock);
 	spin_unlock(&inode->i_lock);
+	dput(cursor);
 }
 
 /* Are inode/sb/mount interested in parent and name info with this event? */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..f66aab9f792a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2433,6 +2433,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_PINNING_FSCACHE_WB	Inode is pinning an fscache object for writeback.
  *
+ * I_FSNOTIFY_UPDATING	Used by fsnotify to avoid duplicated work when updating
+ *			child dentry flag DCACHE_FSNOTIFY_PARENT_WATCHED.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2456,6 +2459,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DONTCACHE		(1 << 16)
 #define I_SYNC_QUEUED		(1 << 17)
 #define I_PINNING_FSCACHE_WB	(1 << 18)
+#define I_FSNOTIFY_UPDATING	(1 << 19)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)