Message ID | 20240213055554.1802415-1-ankur.a.arora@oracle.com |
---|---|
Headers |
From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org
Subject: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling
Date: Mon, 12 Feb 2024 21:55:24 -0800
Message-Id: <20240213055554.1802415-1-ankur.a.arora@oracle.com> |
Series |
PREEMPT_AUTO: support lazy rescheduling
|
|
Message
Ankur Arora
Feb. 13, 2024, 5:55 a.m. UTC
Hi,

This series adds a new scheduling model PREEMPT_AUTO, which like
PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend on
explicit preemption points for the voluntary models.

The series is based on Thomas' original proposal, which he outlined in
[1], [2] and in his PoC [3]. An earlier RFC version is at [4].

Design
==

PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
PREEMPT_COUNT). This means that the scheduler can always safely
preempt. (This is identical to CONFIG_PREEMPT.)

Having that, the next step is to make the rescheduling policy dependent
on the chosen scheduling model. Currently, the scheduler uses a single
need-resched bit (TIF_NEED_RESCHED) to state that a reschedule is
needed. PREEMPT_AUTO adds a second need-resched bit
(TIF_NEED_RESCHED_LAZY) which, together with TIF_NEED_RESCHED, allows
the scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (TIF_NEED_RESCHED), or express a need for
rescheduling while allowing the task on the runqueue to run to
timeslice completion (TIF_NEED_RESCHED_LAZY).

The scheduler decides which need-resched bit to set based on the
preemption model in use:

                 TIF_NEED_RESCHED      TIF_NEED_RESCHED_LAZY
   none          never                 always [*]
   voluntary     higher sched class    other tasks [*]
   full          always                never

   [*] some details elided here.

The last part of the puzzle is when preemption happens, or alternately
stated, when the need-resched bits are checked:

                       exit-to-user    ret-to-kernel    preempt_count()
   NEED_RESCHED_LAZY        Y                N                 N
   NEED_RESCHED             Y                Y                 Y

Using NEED_RESCHED_LAZY thus gives run-to-completion semantics when the
none/voluntary preemption policies are in effect, and eager semantics
under full preemption.
In addition, since this is driven purely by the scheduler (not
depending on cond_resched() placement and the like), there is enough
flexibility in the scheduler to cope with edge cases -- ex. a kernel
task not relinquishing the CPU under NEED_RESCHED_LAZY can be handled
by simply upgrading it to a full NEED_RESCHED, which can use more
coercive instruments like a resched IPI to induce a context-switch.

Performance
==

The performance in the basic tests (perf bench sched messaging,
kernbench) is fairly close to what we see under PREEMPT_DYNAMIC.
(See patches 24, 25.)

Comparing stress-ng --cyclic latencies with a background kernel load
(stress-ng --mmap) serves as a good demonstration of how letting the
scheduler enforce priorities, tick exhaustion etc. helps:

  PREEMPT_DYNAMIC, preempt=voluntary

  stress-ng: info: [12252] setting to a 300 second (5 mins, 0.00 secs) run per stressor
  stress-ng: info: [12252] dispatching hogs: 1 cyclic
  stress-ng: info: [12253] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
  stress-ng: info: [12253] cyclic: mean: 19973.46 ns, mode: 3560 ns
  stress-ng: info: [12253] cyclic: min: 2541 ns, max: 2751830 ns, std.dev. 68891.71
  stress-ng: info: [12253] cyclic: latency percentiles:
  stress-ng: info: [12253] cyclic: 25.00%:  4800 ns
  stress-ng: info: [12253] cyclic: 50.00%: 12458 ns
  stress-ng: info: [12253] cyclic: 75.00%: 25220 ns
  stress-ng: info: [12253] cyclic: 90.00%: 35404 ns

  PREEMPT_AUTO, preempt=voluntary

  stress-ng: info: [8883] setting to a 300 second (5 mins, 0.00 secs) run per stressor
  stress-ng: info: [8883] dispatching hogs: 1 cyclic
  stress-ng: info: [8884] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
  stress-ng: info: [8884] cyclic: mean: 14169.08 ns, mode: 3355 ns
  stress-ng: info: [8884] cyclic: min: 2570 ns, max: 2234939 ns, std.dev. 66056.95
  stress-ng: info: [8884] cyclic: latency percentiles:
  stress-ng: info: [8884] cyclic: 25.00%:  3665 ns
  stress-ng: info: [8884] cyclic: 50.00%:  5409 ns
  stress-ng: info: [8884] cyclic: 75.00%: 16009 ns
  stress-ng: info: [8884] cyclic: 90.00%: 24392 ns

Notice how much lower the 25/50/75/90 percentile latencies are for the
PREEMPT_AUTO case. (See patch 26 for the full performance numbers.)

For a macro test, a colleague in Oracle's Exadata team tried two OLTP
benchmarks (on a 5.4.17-based Oracle kernel, with this series
backported.) In both tests the data was cached on remote nodes (cells),
and the database nodes (compute) served client queries, with clients
being local in the first test and remote in the second.

  Compute node:     Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
  Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs

                              PREEMPT_VOLUNTARY              PREEMPT_AUTO (preempt=voluntary)
                            ==============================   ==============================
                   clients  throughput   cpu-usage           throughput   cpu-usage          Gain
                            (tx/min)     (utime %/stime %)   (tx/min)     (utime %/stime %)
                   -------  ----------   -----------------   ----------   -----------------  -----
  OLTP                 384   9,315,653   25/ 6                9,253,252   25/ 6              -0.7%
  benchmark           1536  13,177,565   50/10               13,657,306   50/10              +3.6%
  (local clients)     3456  14,063,017   63/12               14,179,706   64/12              +0.8%

  OLTP                  96   8,973,985   17/ 2                8,924,926   17/ 2              -0.5%
  benchmark            384  22,577,254   60/ 8               22,211,419   59/ 8              -1.6%
  (remote clients,    2304  25,882,857   82/11               25,536,100   82/11              -1.3%
   90/10 RW ratio)

(Both sets of tests have a fair amount of NW traffic since the query
tables etc. are cached on the cells. Additionally, the first set, given
the local clients, stresses the scheduler a bit more than the second.)

The comparative performance for both tests is fairly close, more or
less within a margin of error.

IMO the tests above (sched-messaging, kernbench, stress-ng, OLTP) show
that this scheduling model has legs.
That said, the none/voluntary models under PREEMPT_AUTO are
conceptually different enough that there likely are workloads where
performance would be subpar. That needs more extensive testing to
figure out the weak points.

Series layout
==

Patch 1,
  "preempt: introduce CONFIG_PREEMPT_AUTO"
introduces the new scheduling model.

Patches 2-5,
  "thread_info: selector for TIF_NEED_RESCHED[_LAZY]",
  "thread_info: tif_need_resched() now takes resched_t as param",
  "sched: make test_*_tsk_thread_flag() return bool",
  "sched: *_tsk_need_resched() now takes resched_t as param"
introduce new thread_info/task helper interfaces or make changes to
pre-existing ones that will be used in the rest of the series.

Patches 6-9,
  "entry: handle lazy rescheduling at user-exit",
  "entry/kvm: handle lazy rescheduling at guest-entry",
  "entry: irqentry_exit only preempts for TIF_NEED_RESCHED",
  "sched: __schedule_loop() doesn't need to check for need_resched_lazy()"
make changes to, and document, the rescheduling points.

Patches 10-11,
  "sched: separate PREEMPT_DYNAMIC config logic",
  "sched: runtime preemption config under PREEMPT_AUTO"
reuse the PREEMPT_DYNAMIC runtime configuration logic.

Patches 12-16,
  "rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO",
  "rcu: fix header guard for rcu_all_qs()",
  "preempt,rcu: warn on PREEMPT_RCU=n, preempt_model_full",
  "rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y",
  "rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y"
add RCU support.

Patch 17,
  "x86/thread_info: define TIF_NEED_RESCHED_LAZY"
adds x86 support.

Note on platform support: this is x86-only for now. However, supporting
architectures with !ARCH_NO_PREEMPT is straightforward -- especially if
they support GENERIC_ENTRY.

Patches 18-21,
  "sched: prepare for lazy rescheduling in resched_curr()",
  "sched: default preemption policy for PREEMPT_AUTO",
  "sched: handle idle preemption for PREEMPT_AUTO",
  "sched: schedule eagerly in resched_cpu()"
are preparatory patches for adding PREEMPT_AUTO. Among other things
they add the default need-resched policy for !PREEMPT_AUTO,
PREEMPT_AUTO, and the idle task.

Patches 22-23,
  "sched/fair: refactor update_curr(), entity_tick()",
  "sched/fair: handle tick expiry under lazy preemption"
handle the 'hog' problem, where a kernel task does not voluntarily
schedule out.

And, finally, patches 24-26,
  "sched: support preempt=none under PREEMPT_AUTO",
  "sched: support preempt=full under PREEMPT_AUTO",
  "sched: handle preempt=voluntary under PREEMPT_AUTO"
add support for the three preemption models.

Patches 27-30,
  "sched: latency warn for TIF_NEED_RESCHED_LAZY",
  "tracing: support lazy resched",
  "Documentation: tracing: add TIF_NEED_RESCHED_LAZY",
  "osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y"
handle the remaining bits and pieces to do with TIF_NEED_RESCHED_LAZY.

Changelog
==

RFC:
 - Addresses review comments and is generally a more focused version of
   the RFC.
 - Lots of code reorganization.
 - Bugfixes all over.
 - need_resched() now only checks for TIF_NEED_RESCHED instead of
   TIF_NEED_RESCHED|TIF_NEED_RESCHED_LAZY.
 - set_nr_if_polling() now does not check for TIF_NEED_RESCHED_LAZY.
 - Tighten idle related checks.
 - RCU changes to force context-switches when a quiescent state is
   urgently needed.
 - Does not break live-patching anymore.

Also at: github.com/terminus/linux preempt-v1

Please review.
Thanks
Ankur

[1] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[2] https://lore.kernel.org/lkml/87led2wdj0.ffs@tglx/
[3] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
[4] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Ankur Arora (30):
  preempt: introduce CONFIG_PREEMPT_AUTO
  thread_info: selector for TIF_NEED_RESCHED[_LAZY]
  thread_info: tif_need_resched() now takes resched_t as param
  sched: make test_*_tsk_thread_flag() return bool
  sched: *_tsk_need_resched() now takes resched_t as param
  entry: handle lazy rescheduling at user-exit
  entry/kvm: handle lazy rescheduling at guest-entry
  entry: irqentry_exit only preempts for TIF_NEED_RESCHED
  sched: __schedule_loop() doesn't need to check for need_resched_lazy()
  sched: separate PREEMPT_DYNAMIC config logic
  sched: runtime preemption config under PREEMPT_AUTO
  rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
  rcu: fix header guard for rcu_all_qs()
  preempt,rcu: warn on PREEMPT_RCU=n, preempt_model_full
  rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
  rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
  x86/thread_info: define TIF_NEED_RESCHED_LAZY
  sched: prepare for lazy rescheduling in resched_curr()
  sched: default preemption policy for PREEMPT_AUTO
  sched: handle idle preemption for PREEMPT_AUTO
  sched: schedule eagerly in resched_cpu()
  sched/fair: refactor update_curr(), entity_tick()
  sched/fair: handle tick expiry under lazy preemption
  sched: support preempt=none under PREEMPT_AUTO
  sched: support preempt=full under PREEMPT_AUTO
  sched: handle preempt=voluntary under PREEMPT_AUTO
  sched: latency warn for TIF_NEED_RESCHED_LAZY
  tracing: support lazy resched
  Documentation: tracing: add TIF_NEED_RESCHED_LAZY
  osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

 .../admin-guide/kernel-parameters.txt |   1 +
 Documentation/trace/ftrace.rst        |   6 +-
 arch/s390/include/asm/preempt.h       |   4 +-
 arch/s390/mm/pfault.c                 |   2 +-
 arch/x86/Kconfig                      |   1 +
 arch/x86/include/asm/thread_info.h    |  10 +-
 drivers/acpi/processor_idle.c         |   2 +-
 include/asm-generic/preempt.h         |   4 +-
 include/linux/entry-common.h          |   2 +-
 include/linux/entry-kvm.h             |   2 +-
 include/linux/preempt.h               |   2 +-
 include/linux/rcutree.h               |   2 +-
 include/linux/sched.h                 |  43 ++-
 include/linux/sched/idle.h            |   8 +-
 include/linux/thread_info.h           |  57 +++-
 include/linux/trace_events.h          |   6 +-
 init/Makefile                         |   1 +
 kernel/Kconfig.preempt                |  37 ++-
 kernel/entry/common.c                 |  12 +-
 kernel/entry/kvm.c                    |   4 +-
 kernel/rcu/Kconfig                    |   2 +-
 kernel/rcu/tiny.c                     |   2 +-
 kernel/rcu/tree.c                     |  17 +-
 kernel/rcu/tree_exp.h                 |   4 +-
 kernel/rcu/tree_plugin.h              |  15 +-
 kernel/rcu/tree_stall.h               |   2 +-
 kernel/sched/core.c                   | 311 ++++++++++++------
 kernel/sched/deadline.c               |   6 +-
 kernel/sched/debug.c                  |  13 +-
 kernel/sched/fair.c                   |  56 ++--
 kernel/sched/idle.c                   |   4 +-
 kernel/sched/rt.c                     |   6 +-
 kernel/sched/sched.h                  |  27 +-
 kernel/trace/trace.c                  |   4 +-
 kernel/trace/trace_osnoise.c          |  22 +-
 kernel/trace/trace_output.c           |  16 +-
 36 files changed, 498 insertions(+), 215 deletions(-)
Comments
Hi Ankur,

On Tue, Feb 13, 2024 at 6:56 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.

Thanks for your series!

Can you please reduce the CC list for future submissions? It is always
a good idea to have a closer look at the output of
scripts/get_maintainer.pl, and edit it manually. There is usually no
need to include everyone who ever contributed a tiny change to one of
the affected files.

Thanks again!

Gr{oetje,eeting}s,

Geert
Geert Uytterhoeven <geert@linux-m68k.org> writes:

> Hi Ankur,
>
> On Tue, Feb 13, 2024 at 6:56 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>
> Thanks for your series!
>
> Can you please reduce the CC list for future submissions?

Will do.

> It is always a good idea to have a closer look at the output of
> scripts/get_maintainer.pl, and edit it manually. There is usually no
> need to include everyone who ever contributed a tiny change to one of
> the affected files.

I was in two minds about whether to prune the CC list or not. So this
is very helpful.

Thanks

--
ankur
On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> An earlier RFC version is at [4].

This uncovered a couple of latent bugs in RCU due to its having been
a good long time since anyone built a !SMP preemptible kernel with
non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
likely for the merge window after next, but let me know if you need
them sooner.

I am also seeing OOM conditions during rcutorture testing of callback
flooding, but I am still looking into this.

The full diff on top of your series on top of v6.8-rc4 is shown below.
Please let me know if I have messed up the Kconfig options.

							Thanx, Paul

[1] 6a4352fd1418 ("rcu: Update lockdep while in RCU read-side critical section")
    1b85e92eabcd ("rcu: Make TINY_RCU depend on !PREEMPT_RCU rather than !PREEMPTION")

------------------------------------------------------------------------

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0746b1b0b6639..b0b61b8598b03 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -778,9 +778,9 @@ static inline void rcu_read_unlock(void)
 {
 	RCU_LOCKDEP_WARN(!rcu_is_watching(),
 			 "rcu_read_unlock() used illegally while idle");
+	rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
 	__release(RCU);
 	__rcu_read_unlock();
-	rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
 }

 /**
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index d0ecc8ef17a72..6bf969857a85b 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -31,7 +31,7 @@ config PREEMPT_RCU

 config TINY_RCU
 	bool
-	default y if !PREEMPTION && !SMP
+	default y if !PREEMPT_RCU && !SMP
 	help
 	  This option selects the RCU implementation that is
 	  designed for UP systems from which real-time response
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N
index 07f5e0a70ae70..737389417c7b3 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N
@@ -3,8 +3,10 @@ CONFIG_SMP=y
 CONFIG_NR_CPUS=4
 CONFIG_HOTPLUG_CPU=y
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 #CHECK#CONFIG_RCU_EXPERT=n
 CONFIG_KPROBES=n
 CONFIG_FTRACE=n
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
index c70cf0405f248..c9aca21d02f8c 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
@@ -1,5 +1,6 @@
 CONFIG_SMP=n
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 CONFIG_PREEMPT_DYNAMIC=n
@@ -10,3 +11,4 @@ CONFIG_PROVE_LOCKING=y
 CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_DEBUG_ATOMIC_SLEEP=y
 #CHECK#CONFIG_PREEMPT_COUNT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
index 2f9fcffff5ae3..472259f9e0a6a 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
@@ -1,8 +1,10 @@
 CONFIG_SMP=n
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 CONFIG_PREEMPT_DYNAMIC=n
 #CHECK#CONFIG_TASKS_RCU=y
 CONFIG_FORCE_TASKS_RCU=y
 CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TINY02 b/tools/testing/selftests/rcutorture/configs/rcu/TINY02
index 30439f6fc20e6..df408933e7013 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TINY02
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TINY02
@@ -1,5 +1,6 @@
 CONFIG_SMP=n
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 CONFIG_PREEMPT_DYNAMIC=n
@@ -13,3 +14,4 @@ CONFIG_DEBUG_LOCK_ALLOC=y
 CONFIG_DEBUG_OBJECTS=y
 CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
 CONFIG_DEBUG_ATOMIC_SLEEP=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
index 85b407467454a..2f75c7349d83a 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
@@ -2,6 +2,7 @@ CONFIG_SMP=y
 CONFIG_NR_CPUS=5
 CONFIG_HOTPLUG_CPU=y
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 CONFIG_PREEMPT_DYNAMIC=n
@@ -12,3 +13,4 @@ CONFIG_FORCE_TASKS_TRACE_RCU=y
 #CHECK#CONFIG_TASKS_TRACE_RCU=y
 CONFIG_TASKS_TRACE_RCU_READ_MB=y
 CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..9ef845d54fa41 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -2,6 +2,7 @@ CONFIG_SMP=y
 CONFIG_NR_CPUS=8
 CONFIG_PREEMPT_NONE=n
 CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT=n
 CONFIG_PREEMPT_DYNAMIC=n
 #CHECK#CONFIG_TREE_RCU=y
@@ -16,3 +17,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE05 b/tools/testing/selftests/rcutorture/configs/rcu/TREE05
index 9f48c73709ec3..31afd943d85ef 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE05
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE05
@@ -1,6 +1,7 @@
 CONFIG_SMP=y
 CONFIG_NR_CPUS=8
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 #CHECK#CONFIG_TREE_RCU=y
@@ -18,3 +19,4 @@ CONFIG_PROVE_LOCKING=y
 CONFIG_PROVE_RCU_LIST=y
 CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE06 b/tools/testing/selftests/rcutorture/configs/rcu/TREE06
index db27651de04b8..1180fe36a3a12 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE06
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE06
@@ -1,6 +1,7 @@
 CONFIG_SMP=y
 CONFIG_NR_CPUS=8
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 #CHECK#CONFIG_TREE_RCU=y
@@ -17,3 +18,4 @@ CONFIG_PROVE_LOCKING=y
 CONFIG_DEBUG_OBJECTS=y
 CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
 CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE07 b/tools/testing/selftests/rcutorture/configs/rcu/TREE07
index d30922d8c8832..969e852bd618b 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE07
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE07
@@ -1,6 +1,7 @@
 CONFIG_SMP=y
 CONFIG_NR_CPUS=16
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
 CONFIG_PREEMPT_DYNAMIC=n
@@ -15,3 +16,4 @@ CONFIG_RCU_FANOUT_LEAF=2
 CONFIG_DEBUG_LOCK_ALLOC=n
 CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE10 b/tools/testing/selftests/rcutorture/configs/rcu/TREE10
index a323d8948b7cf..4af22599f13ed 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE10
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE10
@@ -1,6 +1,7 @@
 CONFIG_SMP=y
 CONFIG_NR_CPUS=56
 CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n @@ -16,3 +17,4 @@ CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_OBJECTS=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=n +CONFIG_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL b/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL index 5d546efa68e83..7b2c9fb0cd826 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL +++ b/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL @@ -1,6 +1,7 @@ CONFIG_SMP=y CONFIG_NR_CPUS=8 CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_AUTO=y CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=n CONFIG_HZ_PERIODIC=n @@ -9,3 +10,4 @@ CONFIG_NO_HZ_FULL=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y +CONFIG_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/TINY b/tools/testing/selftests/rcutorture/configs/rcuscale/TINY index 0fa2dc086e10c..80230745e9dc7 100644 --- a/tools/testing/selftests/rcutorture/configs/rcuscale/TINY +++ b/tools/testing/selftests/rcutorture/configs/rcuscale/TINY @@ -1,5 +1,6 @@ CONFIG_SMP=n CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_AUTO=y CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=n CONFIG_PREEMPT_DYNAMIC=n @@ -14,3 +15,4 @@ CONFIG_RCU_BOOST=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y CONFIG_RCU_TRACE=y +CONFIG_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01 index 0059592c7408a..eb47f36712305 100644 --- a/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01 +++ b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01 @@ -1,5 +1,6 @@ CONFIG_SMP=y CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_AUTO=y CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=n CONFIG_PREEMPT_DYNAMIC=n @@ -14,3 +15,4 @@ CONFIG_RCU_BOOST=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y CONFIG_RCU_TRACE=y +CONFIG_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT 
b/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT index 67f9d2998afd3..cb3219cb98c78 100644 --- a/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT +++ b/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT @@ -1,5 +1,6 @@ CONFIG_SMP=y CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_AUTO=y CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=n CONFIG_PREEMPT_DYNAMIC=n @@ -18,3 +19,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y CONFIG_KPROBES=n CONFIG_FTRACE=n +CONFIG_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT b/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT index 6133f54ce2a7d..241f28e965e57 100644 --- a/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT +++ b/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT @@ -1,5 +1,6 @@ CONFIG_SMP=y CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_AUTO=y CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=n CONFIG_PREEMPT_DYNAMIC=n @@ -11,3 +12,4 @@ CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PROVE_LOCKING=n CONFIG_KPROBES=n CONFIG_FTRACE=n +CONFIG_EXPERT=y
Paul E. McKenney <paulmck@kernel.org> writes: > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: >> Hi, >> >> This series adds a new scheduling model PREEMPT_AUTO, which like >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend >> on explicit preemption points for the voluntary models. >> >> The series is based on Thomas' original proposal which he outlined >> in [1], [2] and in his PoC [3]. >> >> An earlier RFC version is at [4]. > > This uncovered a couple of latent bugs in RCU due to its having been > a good long time since anyone built a !SMP preemptible kernel with > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > likely for the merge window after next, but let me know if you need > them sooner. Thanks. As you can probably tell, I skipped out on !SMP in my testing. But, the attached diff should tide me over until the fixes are in. > I am also seeing OOM conditions during rcutorture testing of callback > flooding, but I am still looking into this. That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? Thanks -- ankur
On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > >> Hi, > >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > >> on explicit preemption points for the voluntary models. > >> > >> The series is based on Thomas' original proposal which he outlined > >> in [1], [2] and in his PoC [3]. > >> > >> An earlier RFC version is at [4]. > > > > This uncovered a couple of latent bugs in RCU due to its having been > > a good long time since anyone built a !SMP preemptible kernel with > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > > likely for the merge window after next, but let me know if you need > > them sooner. > > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > But, the attached diff should tide me over until the fixes are in. That was indeed my guess. ;-) > > I am also seeing OOM conditions during rcutorture testing of callback > > flooding, but I am still looking into this. > > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on two of them thus far. I am running a longer test to see if this might be just luck. If not, I look to see what rcutorture scenarios TREE10 and TRACE01 have in common. Thanx, Paul
On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > > > > Paul E. McKenney <paulmck@kernel.org> writes: > > > > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > > >> Hi, > > >> > > >> This series adds a new scheduling model PREEMPT_AUTO, which like > > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > > >> on explicit preemption points for the voluntary models. > > >> > > >> The series is based on Thomas' original proposal which he outlined > > >> in [1], [2] and in his PoC [3]. > > >> > > >> An earlier RFC version is at [4]. > > > > > > This uncovered a couple of latent bugs in RCU due to its having been > > > a good long time since anyone built a !SMP preemptible kernel with > > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > > > likely for the merge window after next, but let me know if you need > > > them sooner. > > > > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > > But, the attached diff should tide me over until the fixes are in. > > That was indeed my guess. ;-) > > > > I am also seeing OOM conditions during rcutorture testing of callback > > > flooding, but I am still looking into this. > > > > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > > On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > two of them thus far. I am running a longer test to see if this might > be just luck. If not, I look to see what rcutorture scenarios TREE10 > and TRACE01 have in common. And still TRACE01 and TREE10 are hitting OOMs, still not seeing what sets them apart. I also hit a grace-period hang in TREE04, which does CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something to dig into more. 
I am also getting these from builds that enable KASAN:

vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section

Does tif_resched() need to be marked noinstr or some such?

Tracing got harder to disable, but I believe that is unrelated to lazy
preemption.  ;-)

							Thanx, Paul
On Thu, Feb 15 2024 at 11:28, Paul E. McKenney wrote:
> On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> I am also getting these from builds that enable KASAN:
>
> vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
>
> Does tif_resched() need to be marked noinstr or some such?

__always_inline() probably
Paul E. McKenney <paulmck@kernel.org> writes: > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: >> > >> > Paul E. McKenney <paulmck@kernel.org> writes: >> > >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: >> > >> Hi, >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend >> > >> on explicit preemption points for the voluntary models. >> > >> >> > >> The series is based on Thomas' original proposal which he outlined >> > >> in [1], [2] and in his PoC [3]. >> > >> >> > >> An earlier RFC version is at [4]. >> > > >> > > This uncovered a couple of latent bugs in RCU due to its having been >> > > a good long time since anyone built a !SMP preemptible kernel with >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most >> > > likely for the merge window after next, but let me know if you need >> > > them sooner. >> > >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. >> > But, the attached diff should tide me over until the fixes are in. >> >> That was indeed my guess. ;-) >> >> > > I am also seeing OOM conditions during rcutorture testing of callback >> > > flooding, but I am still looking into this. >> > >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on >> two of them thus far. I am running a longer test to see if this might >> be just luck. If not, I look to see what rcutorture scenarios TREE10 >> and TRACE01 have in common. > > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > sets them apart. I also hit a grace-period hang in TREE04, which does > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. 
Something
> to dig into more.
>
> I am also getting these from builds that enable KASAN:
>
> vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section

Thanks Paul. Yeah, with KASAN, tif_resched() transforms into this out of
line function:

ffffffff810fec20 <tif_resched.constprop.0>:
ffffffff810fec20:	e8 5b c6 20 00	call   ffffffff8130b280 <__sanitizer_cov_trace_pc>
ffffffff810fec25:	b8 03 00 00 00	mov    $0x3,%eax
ffffffff810fec2a:	e9 71 56 61 01	jmp    ffffffff827142a0 <__x86_return_thunk>
ffffffff810fec2f:	90	nop

> Does tif_resched() need to be marked noinstr or some such?

Builds fine with Thomas' suggested fix.

--------
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 8752dbc2dac7..0810ddeb365d 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -81,12 +81,12 @@ typedef enum {
  * reduce to the same value (TIF_NEED_RESCHED) leaving any scheduling behaviour
  * unchanged.
  */
-static inline int tif_resched(resched_t rs)
+static __always_inline int tif_resched(resched_t rs)
 {
 	return TIF_NEED_RESCHED + rs * TIF_NEED_RESCHED_LAZY_OFFSET;
 }

-static inline int _tif_resched(resched_t rs)
+static __always_inline int _tif_resched(resched_t rs)
 {
 	return 1 << tif_resched(rs);
 }
On Thu, Feb 15, 2024 at 09:04:00PM +0100, Thomas Gleixner wrote:
> On Thu, Feb 15 2024 at 11:28, Paul E. McKenney wrote:
> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> > I am also getting these from builds that enable KASAN:
> >
> > vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> >
> > Does tif_resched() need to be marked noinstr or some such?
>
> __always_inline() probably

That does the trick, thank you!

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 8752dbc2dac75..43b729935804e 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -81,7 +81,7 @@ typedef enum {
  * reduce to the same value (TIF_NEED_RESCHED) leaving any scheduling behaviour
  * unchanged.
  */
-static inline int tif_resched(resched_t rs)
+static __always_inline int tif_resched(resched_t rs)
 {
 	return TIF_NEED_RESCHED + rs * TIF_NEED_RESCHED_LAZY_OFFSET;
 }
On Thu, Feb 15, 2024 at 12:53:13PM -0800, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > >> > > >> > Paul E. McKenney <paulmck@kernel.org> writes: > >> > > >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > >> > >> Hi, > >> > >> > >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > >> > >> on explicit preemption points for the voluntary models. > >> > >> > >> > >> The series is based on Thomas' original proposal which he outlined > >> > >> in [1], [2] and in his PoC [3]. > >> > >> > >> > >> An earlier RFC version is at [4]. > >> > > > >> > > This uncovered a couple of latent bugs in RCU due to its having been > >> > > a good long time since anyone built a !SMP preemptible kernel with > >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > >> > > likely for the merge window after next, but let me know if you need > >> > > them sooner. > >> > > >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > >> > But, the attached diff should tide me over until the fixes are in. > >> > >> That was indeed my guess. ;-) > >> > >> > > I am also seeing OOM conditions during rcutorture testing of callback > >> > > flooding, but I am still looking into this. > >> > > >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > >> > >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > >> two of them thus far. I am running a longer test to see if this might > >> be just luck. If not, I look to see what rcutorture scenarios TREE10 > >> and TRACE01 have in common. 
> > > > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > > sets them apart. I also hit a grace-period hang in TREE04, which does > > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something > > to dig into more. > > > > I am also getting these from builds that enable KASAN: > > > > vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section > > vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section > > vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section > > vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section > > vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section > > vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section > > Thanks Paul. Yeah, with KASAN, tif_resched() transforms into this out of > line function: > > ffffffff810fec20 <tif_resched.constprop.0>: > ffffffff810fec20: e8 5b c6 20 00 call ffffffff8130b280 <__sanitizer_cov_trace_pc> > ffffffff810fec25: b8 03 00 00 00 mov $0x3,%eax > ffffffff810fec2a: e9 71 56 61 01 jmp ffffffff827142a0 <__x86_return_thunk> > ffffffff810fec2f: 90 nop > > > Does tif_resched() need to be marked noinstr or some such? > > Builds fine with Thomas' suggested fix. You beat me to it. ;-) Thanx, Paul > -------- > diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h > index 8752dbc2dac7..0810ddeb365d 100644 > --- a/include/linux/thread_info.h > +++ b/include/linux/thread_info.h > @@ -81,12 +81,12 @@ typedef enum { > * reduce to the same value (TIF_NEED_RESCHED) leaving any scheduling behaviour > * unchanged. 
> */
> -static inline int tif_resched(resched_t rs)
> +static __always_inline int tif_resched(resched_t rs)
> {
> 	return TIF_NEED_RESCHED + rs * TIF_NEED_RESCHED_LAZY_OFFSET;
> }
>
> -static inline int _tif_resched(resched_t rs)
> +static __always_inline int _tif_resched(resched_t rs)
> {
> 	return 1 << tif_resched(rs);
> }
Paul E. McKenney <paulmck@kernel.org> writes: > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: >> > >> > Paul E. McKenney <paulmck@kernel.org> writes: >> > >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: >> > >> Hi, >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend >> > >> on explicit preemption points for the voluntary models. >> > >> >> > >> The series is based on Thomas' original proposal which he outlined >> > >> in [1], [2] and in his PoC [3]. >> > >> >> > >> An earlier RFC version is at [4]. >> > > >> > > This uncovered a couple of latent bugs in RCU due to its having been >> > > a good long time since anyone built a !SMP preemptible kernel with >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most >> > > likely for the merge window after next, but let me know if you need >> > > them sooner. >> > >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. >> > But, the attached diff should tide me over until the fixes are in. >> >> That was indeed my guess. ;-) >> >> > > I am also seeing OOM conditions during rcutorture testing of callback >> > > flooding, but I am still looking into this. >> > >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on >> two of them thus far. I am running a longer test to see if this might >> be just luck. If not, I look to see what rcutorture scenarios TREE10 >> and TRACE01 have in common. > > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > sets them apart. I also hit a grace-period hang in TREE04, which does > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. 
Something
> to dig into more.

So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
if you would continue to hit the TREE04 hang with CONFIG_PREEMPT_NONE=y
as well?
(Just in the interest of minimizing configurations.)

---
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index 9ef845d54fa4..819cff9113d8 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -1,7 +1,7 @@
 CONFIG_SMP=y
 CONFIG_NR_CPUS=8
-CONFIG_PREEMPT_NONE=n
-CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
 CONFIG_PREEMPT_AUTO=y
 CONFIG_PREEMPT=n
 CONFIG_PREEMPT_DYNAMIC=n
On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > >> > > >> > Paul E. McKenney <paulmck@kernel.org> writes: > >> > > >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > >> > >> Hi, > >> > >> > >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > >> > >> on explicit preemption points for the voluntary models. > >> > >> > >> > >> The series is based on Thomas' original proposal which he outlined > >> > >> in [1], [2] and in his PoC [3]. > >> > >> > >> > >> An earlier RFC version is at [4]. > >> > > > >> > > This uncovered a couple of latent bugs in RCU due to its having been > >> > > a good long time since anyone built a !SMP preemptible kernel with > >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > >> > > likely for the merge window after next, but let me know if you need > >> > > them sooner. > >> > > >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > >> > But, the attached diff should tide me over until the fixes are in. > >> > >> That was indeed my guess. ;-) > >> > >> > > I am also seeing OOM conditions during rcutorture testing of callback > >> > > flooding, but I am still looking into this. > >> > > >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > >> > >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > >> two of them thus far. I am running a longer test to see if this might > >> be just luck. If not, I look to see what rcutorture scenarios TREE10 > >> and TRACE01 have in common. 
> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> > sets them apart. I also hit a grace-period hang in TREE04, which does
> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> > to dig into more.
>
> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> if you would continue to hit the TREE04 hang with CONFIG_PREEMPT_NONE=y
> as well?
> (Just in the interest of minimizing configurations.)

I would be happy to, but in the spirit of full disclosure...

First, I have seen that failure only once, which is not enough to
conclude that it has much to do with TREE04.  It might simply be low
probability, so that TREE04 simply was unlucky enough to hit it first.
In contrast, I have sufficient data to be reasonably confident that the
callback-flooding OOMs really do have something to do with the TRACE01 and
TREE10 scenarios, even though I am not yet seeing what these two scenarios
have in common that they don't also have in common with other scenarios.
But what is life without a bit of mystery?  ;-)

Second, please see the attached tarball, which contains .csv files showing
Kconfig options and kernel boot parameters for the various torture tests.
The portions of the filenames preceding the "config.csv" correspond to
the directories in tools/testing/selftests/rcutorture/configs.

Third, there are additional scenarios hand-crafted by the script at
tools/testing/selftests/rcutorture/bin/torture.sh.  Thus far, none of
them have triggered, other than via the newly increased difficulty
of configuring a tracing-free kernel with which to test, but they
can still be useful in ruling out particular Kconfig options or kernel
boot parameters being related to a given issue.

But please do take a look at the .csv files and let me know what
adjustments would be appropriate given the failure information.
Thanx, Paul > --- > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 > index 9ef845d54fa4..819cff9113d8 100644 > --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 > @@ -1,7 +1,7 @@ > CONFIG_SMP=y > CONFIG_NR_CPUS=8 > -CONFIG_PREEMPT_NONE=n > -CONFIG_PREEMPT_VOLUNTARY=y > +CONFIG_PREEMPT_NONE=y > +CONFIG_PREEMPT_VOLUNTARY=n > CONFIG_PREEMPT_AUTO=y > CONFIG_PREEMPT=n > CONFIG_PREEMPT_DYNAMIC=n
On Thu, Feb 15, 2024 at 02:54:45PM -0800, Paul E. McKenney wrote: > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: > > > > Paul E. McKenney <paulmck@kernel.org> writes: > > > > > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > > >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > > >> > > > >> > Paul E. McKenney <paulmck@kernel.org> writes: > > >> > > > >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > > >> > >> Hi, > > >> > >> > > >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > > >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > > >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > > >> > >> on explicit preemption points for the voluntary models. > > >> > >> > > >> > >> The series is based on Thomas' original proposal which he outlined > > >> > >> in [1], [2] and in his PoC [3]. > > >> > >> > > >> > >> An earlier RFC version is at [4]. > > >> > > > > >> > > This uncovered a couple of latent bugs in RCU due to its having been > > >> > > a good long time since anyone built a !SMP preemptible kernel with > > >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > > >> > > likely for the merge window after next, but let me know if you need > > >> > > them sooner. > > >> > > > >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > > >> > But, the attached diff should tide me over until the fixes are in. > > >> > > >> That was indeed my guess. ;-) > > >> > > >> > > I am also seeing OOM conditions during rcutorture testing of callback > > >> > > flooding, but I am still looking into this. > > >> > > > >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > > >> > > >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > > >> two of them thus far. I am running a longer test to see if this might > > >> be just luck. 
If not, I look to see what rcutorture scenarios TREE10 > > >> and TRACE01 have in common. > > > > > > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > > > sets them apart. I also hit a grace-period hang in TREE04, which does > > > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something > > > to dig into more. > > > > So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder > > if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y > > as well? > > (Just in the interest of minimizing configurations.) This time with the tarball actually attached! :-/ Thanx, Paul > I would be happy to, but in the spirit of full disclosure... > > First, I have seen that failure only once, which is not enough to > conclude that it has much to do with TREE04. It might simply be low > probability, so that TREE04 simply was unlucky enough to hit it first. > In contrast, I have sufficient data to be reasonably confident that the > callback-flooding OOMs really do have something to do with the TRACE01 and > TREE10 scenarios, even though I am not yet seeing what these two scenarios > have in common that they don't also have in common with other scenarios. > But what is life without a bit of mystery? ;-) > > Second, please see the attached tarball, which contains .csv files showing > Kconfig options and kernel boot parameters for the various torture tests. > The portions of the filenames preceding the "config.csv" correspond to > the directories in tools/testing/selftests/rcutorture/configs. > > Third, there are additional scenarios hand-crafted by the script at > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of > them have triggered, other than via the newly increased difficulty > of configurating a tracing-free kernel with which to test, but they > can still be useful in ruling out particular Kconfig options or kernel > boot parameters being related to a given issue. 
> > But please do take a look at the .csv files and let me know what > adjustments would be appropriate given the failure information. > > Thanx, Paul > > > --- > > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 > > index 9ef845d54fa4..819cff9113d8 100644 > > --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 > > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 > > @@ -1,7 +1,7 @@ > > CONFIG_SMP=y > > CONFIG_NR_CPUS=8 > > -CONFIG_PREEMPT_NONE=n > > -CONFIG_PREEMPT_VOLUNTARY=y > > +CONFIG_PREEMPT_NONE=y > > +CONFIG_PREEMPT_VOLUNTARY=n > > CONFIG_PREEMPT_AUTO=y > > CONFIG_PREEMPT=n > > CONFIG_PREEMPT_DYNAMIC=n
Paul E. McKenney <paulmck@kernel.org> writes: > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: >> >> Paul E. McKenney <paulmck@kernel.org> writes: >> >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: >> >> > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: >> >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: >> >> > >> Hi, >> >> > >> >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend >> >> > >> on explicit preemption points for the voluntary models. >> >> > >> >> >> > >> The series is based on Thomas' original proposal which he outlined >> >> > >> in [1], [2] and in his PoC [3]. >> >> > >> >> >> > >> An earlier RFC version is at [4]. >> >> > > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been >> >> > > a good long time since anyone built a !SMP preemptible kernel with >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most >> >> > > likely for the merge window after next, but let me know if you need >> >> > > them sooner. >> >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. >> >> > But, the attached diff should tide me over until the fixes are in. >> >> >> >> That was indeed my guess. ;-) >> >> >> >> > > I am also seeing OOM conditions during rcutorture testing of callback >> >> > > flooding, but I am still looking into this. >> >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? >> >> >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on >> >> two of them thus far. I am running a longer test to see if this might >> >> be just luck. 
If not, I look to see what rcutorture scenarios TREE10 >> >> and TRACE01 have in common. >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what >> > sets them apart. I also hit a grace-period hang in TREE04, which does >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something >> > to dig into more. >> >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y >> as well? >> (Just in the interest of minimizing configurations.) > > I would be happy to, but in the spirit of full disclosure... > > First, I have seen that failure only once, which is not enough to > conclude that it has much to do with TREE04. It might simply be low > probability, so that TREE04 simply was unlucky enough to hit it first. > In contrast, I have sufficient data to be reasonably confident that the > callback-flooding OOMs really do have something to do with the TRACE01 and > TREE10 scenarios, even though I am not yet seeing what these two scenarios > have in common that they don't also have in common with other scenarios. > But what is life without a bit of mystery? ;-) :). > Second, please see the attached tarball, which contains .csv files showing > Kconfig options and kernel boot parameters for the various torture tests. > The portions of the filenames preceding the "config.csv" correspond to > the directories in tools/testing/selftests/rcutorture/configs. So, at least some of the HZ_FULL=y tests don't run into problems. > Third, there are additional scenarios hand-crafted by the script at > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of > them have triggered, other than via the newly increased difficulty > of configurating a tracing-free kernel with which to test, but they > can still be useful in ruling out particular Kconfig options or kernel > boot parameters being related to a given issue. 
> > But please do take a look at the .csv files and let me know what > adjustments would be appropriate given the failure information. Nothing stands out just yet. Let me start a run here and see if that gives me some ideas. I'm guessing the splats don't give any useful information or you would have attached them ;). Thanks for testing, btw. -- ankur
On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: > >> > >> Paul E. McKenney <paulmck@kernel.org> writes: > >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > >> >> > > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: > >> >> > > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > >> >> > >> Hi, > >> >> > >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > >> >> > >> on explicit preemption points for the voluntary models. > >> >> > >> > >> >> > >> The series is based on Thomas' original proposal which he outlined > >> >> > >> in [1], [2] and in his PoC [3]. > >> >> > >> > >> >> > >> An earlier RFC version is at [4]. > >> >> > > > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been > >> >> > > a good long time since anyone built a !SMP preemptible kernel with > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > >> >> > > likely for the merge window after next, but let me know if you need > >> >> > > them sooner. > >> >> > > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > >> >> > But, the attached diff should tide me over until the fixes are in. > >> >> > >> >> That was indeed my guess. ;-) > >> >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback > >> >> > > flooding, but I am still looking into this. > >> >> > > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > >> >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > >> >> two of them thus far. 
I am running a longer test to see if this might > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10 > >> >> and TRACE01 have in common. > >> > > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > >> > sets them apart. I also hit a grace-period hang in TREE04, which does > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something > >> > to dig into more. > >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y > >> as well? > >> (Just in the interest of minimizing configurations.) > > > > I would be happy to, but in the spirit of full disclosure... > > > > First, I have seen that failure only once, which is not enough to > > conclude that it has much to do with TREE04. It might simply be low > > probability, so that TREE04 simply was unlucky enough to hit it first. > > In contrast, I have sufficient data to be reasonably confident that the > > callback-flooding OOMs really do have something to do with the TRACE01 and > > TREE10 scenarios, even though I am not yet seeing what these two scenarios > > have in common that they don't also have in common with other scenarios. > > But what is life without a bit of mystery? ;-) > > :). > > > Second, please see the attached tarball, which contains .csv files showing > > Kconfig options and kernel boot parameters for the various torture tests. > > The portions of the filenames preceding the "config.csv" correspond to > > the directories in tools/testing/selftests/rcutorture/configs. > > So, at least some of the HZ_FULL=y tests don't run into problems. > > > Third, there are additional scenarios hand-crafted by the script at > > tools/testing/selftests/rcutorture/bin/torture.sh. 
Thus far, none of > > them have triggered, other than via the newly increased difficulty > > of configurating a tracing-free kernel with which to test, but they > > can still be useful in ruling out particular Kconfig options or kernel > > boot parameters being related to a given issue. > > > > But please do take a look at the .csv files and let me know what > > adjustments would be appropriate given the failure information. > > Nothing stands out just yet. Let me start a run here and see if > that gives me some ideas. Sounds good, thank you! > I'm guessing the splats don't give any useful information or > you would have attached them ;). My plan is to extract what can be extracted from the overnight run that I just started. Just in case the fixes have any effect on things, unlikely though that might be given those fixes and the runs that failed. > Thanks for testing, btw. The sooner we find them, the sooner they get fixed. ;-) Thanx, Paul
On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote: > On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote: > > > > Paul E. McKenney <paulmck@kernel.org> writes: > > > > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: > > >> > > >> Paul E. McKenney <paulmck@kernel.org> writes: > > >> > > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > > >> >> > > > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: > > >> >> > > > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > > >> >> > >> Hi, > > >> >> > >> > > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > > >> >> > >> on explicit preemption points for the voluntary models. > > >> >> > >> > > >> >> > >> The series is based on Thomas' original proposal which he outlined > > >> >> > >> in [1], [2] and in his PoC [3]. > > >> >> > >> > > >> >> > >> An earlier RFC version is at [4]. > > >> >> > > > > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been > > >> >> > > a good long time since anyone built a !SMP preemptible kernel with > > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > > >> >> > > likely for the merge window after next, but let me know if you need > > >> >> > > them sooner. > > >> >> > > > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > > >> >> > But, the attached diff should tide me over until the fixes are in. > > >> >> > > >> >> That was indeed my guess. ;-) > > >> >> > > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback > > >> >> > > flooding, but I am still looking into this. 
> > >> >> > > > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > > >> >> > > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > > >> >> two of them thus far. I am running a longer test to see if this might > > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10 > > >> >> and TRACE01 have in common. > > >> > > > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > > >> > sets them apart. I also hit a grace-period hang in TREE04, which does > > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something > > >> > to dig into more. > > >> > > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder > > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y > > >> as well? > > >> (Just in the interest of minimizing configurations.) > > > > > > I would be happy to, but in the spirit of full disclosure... > > > > > > First, I have seen that failure only once, which is not enough to > > > conclude that it has much to do with TREE04. It might simply be low > > > probability, so that TREE04 simply was unlucky enough to hit it first. > > > In contrast, I have sufficient data to be reasonably confident that the > > > callback-flooding OOMs really do have something to do with the TRACE01 and > > > TREE10 scenarios, even though I am not yet seeing what these two scenarios > > > have in common that they don't also have in common with other scenarios. > > > But what is life without a bit of mystery? ;-) > > > > :). > > > > > Second, please see the attached tarball, which contains .csv files showing > > > Kconfig options and kernel boot parameters for the various torture tests. > > > The portions of the filenames preceding the "config.csv" correspond to > > > the directories in tools/testing/selftests/rcutorture/configs. > > > > So, at least some of the HZ_FULL=y tests don't run into problems. 
> > > > > Third, there are additional scenarios hand-crafted by the script at > > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of > > > them have triggered, other than via the newly increased difficulty > > > of configurating a tracing-free kernel with which to test, but they > > > can still be useful in ruling out particular Kconfig options or kernel > > > boot parameters being related to a given issue. > > > > > > But please do take a look at the .csv files and let me know what > > > adjustments would be appropriate given the failure information. > > > > Nothing stands out just yet. Let me start a run here and see if > > that gives me some ideas. > > Sounds good, thank you! > > > I'm guessing the splats don't give any useful information or > > you would have attached them ;). > > My plan is to extract what can be extracted from the overnight run > that I just started. Just in case the fixes have any effect on things, > unlikely though that might be given those fixes and the runs that failed. And I got no failures from either TREE10 or TRACE01 on last night's run. I merged your series on top of v6.8-rc4 with the -rcu tree's dev branch, the latter to get the RCU fixes. But this means that last night's results are not really comparable to earlier results. I did get a few TREE09 failures, but I get those anyway. I took it apart below for you because I got confused and thought that it was a TREE10 failure. So just in case you were curious what one of these looks like and because I am too lazy to delete it. ;-) So from the viewpoint of moderate rcutorture testing, this series looks good. Woo hoo!!! We did uncover a separate issue with Tasks RCU, which I will report on in more detail separately. However, this issue does not (repeat, *not*) affect lazy preemption as such, but instead any attempt to remove all of the cond_resched() invocations. My next step is to try this on bare metal on a system configured as is the fleet. 
But good progress for a week!!! Thanx, Paul ------------------------------------------------------------------------ [ 3458.875819] rcu_torture_fwd_prog: Starting forward-progress test 0 [ 3458.877155] rcu_torture_fwd_prog_cr: Starting forward-progress test 0 This says that rcutorture is starting a callback-flood forward-progress test. [ 3459.252546] rcu-torture: rtc: 00000000ec445089 ver: 298757 tfle: 0 rta: 298758 rtaf: 0 rtf: 298747 rtmbe: 0 rtmbkf: 0/0 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0 nt: 895741 barrier: 27715/27716:0 read-exits: 3984 nocb-toggles: 0:0 [ 3459.261545] rcu-torture: Reader Pipe: 363878289 159521 0 0 0 0 0 0 0 0 0 [ 3459.263883] rcu-torture: Reader Batch: 363126419 911391 0 0 0 0 0 0 0 0 0 [ 3459.267544] rcu-torture: Free-Block Circulation: 298757 298756 298754 298753 298752 298751 298750 298749 298748 298747 0 These lines are just statistics that rcutorture prints out periodically. Thus far, nothing bad happened. This is one of a few SMP scenarios that does not do CPU hotplug. But the TRACE01 scenario does do CPU hotplug, so not likely a determining factor. Another difference is that TREE10 is the only scenario with more than 16 CPUs, but then again, TRACE01 has only five. [ 3459.733109] ------------[ cut here ]------------ [ 3459.734165] rcutorture_oom_notify invoked upon OOM during forward-progress testing. [ 3459.735828] WARNING: CPU: 0 PID: 43 at kernel/rcu/rcutorture.c:2874 rcutorture_oom_notify+0x3e/0x1d0 Now something bad happened. RCU was unable to keep up with the callback flood. Given that users can create callback floods with close(open()) loops, this is not good. 
[ 3459.737761] Modules linked in: [ 3459.738408] CPU: 0 PID: 43 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00096-g40c2642e6f24 #8252 [ 3459.740295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 3459.742651] RIP: 0010:rcutorture_oom_notify+0x3e/0x1d0 [ 3459.743821] Code: e8 37 48 c2 00 48 8b 1d f8 b4 dc 01 48 85 db 0f 84 80 01 00 00 90 48 c7 c6 40 f5 e0 92 48 c7 c7 10 52 23 93 e8 d3 b9 f9 ff 90 <0f> 0b 90 90 8b 35 f8 c0 66 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83 [ 3459.747935] RSP: 0018:ffffabbb0015bb30 EFLAGS: 00010282 [ 3459.749061] RAX: 0000000000000000 RBX: ffff9485812ae000 RCX: 00000000ffffdfff [ 3459.750601] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001 [ 3459.752026] RBP: ffffabbb0015bb98 R08: ffffffff93539388 R09: 00000000ffffdfff [ 3459.753616] R10: ffffffff934593a0 R11: ffffffff935093a0 R12: 0000000000000000 [ 3459.755134] R13: ffffabbb0015bb98 R14: ffffffff93547da0 R15: 00000000ffffffff [ 3459.756695] FS: 0000000000000000(0000) GS:ffffffff9344f000(0000) knlGS:0000000000000000 [ 3459.758443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3459.759672] CR2: 0000000000600298 CR3: 0000000001958000 CR4: 00000000000006f0 [ 3459.761253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3459.762748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3459.764472] Call Trace: [ 3459.765003] <TASK> [ 3459.765483] ? __warn+0x61/0xe0 [ 3459.766176] ? rcutorture_oom_notify+0x3e/0x1d0 [ 3459.767154] ? report_bug+0x174/0x180 [ 3459.767942] ? handle_bug+0x3d/0x70 [ 3459.768715] ? exc_invalid_op+0x18/0x70 [ 3459.769561] ? asm_exc_invalid_op+0x1a/0x20 [ 3459.770494] ? rcutorture_oom_notify+0x3e/0x1d0 [ 3459.771501] blocking_notifier_call_chain+0x5c/0x80 [ 3459.772553] out_of_memory+0x236/0x4b0 [ 3459.773365] __alloc_pages+0x9ca/0xb10 [ 3459.774233] ? 
set_next_entity+0x8b/0x150 [ 3459.775107] new_slab+0x382/0x430 [ 3459.776454] ___slab_alloc+0x23c/0x8c0 [ 3459.777315] ? preempt_schedule_irq+0x32/0x50 [ 3459.778319] ? rcu_torture_fwd_prog+0x6bf/0x970 [ 3459.779291] ? rcu_torture_fwd_prog+0x6bf/0x970 [ 3459.780264] ? rcu_torture_fwd_prog+0x6bf/0x970 [ 3459.781244] kmalloc_trace+0x179/0x1a0 [ 3459.784651] rcu_torture_fwd_prog+0x6bf/0x970 [ 3459.785529] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 [ 3459.786617] ? kthread+0xc3/0xf0 [ 3459.787352] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 [ 3459.788417] kthread+0xc3/0xf0 [ 3459.789101] ? __pfx_kthread+0x10/0x10 [ 3459.789906] ret_from_fork+0x2f/0x50 [ 3459.790708] ? __pfx_kthread+0x10/0x10 [ 3459.791523] ret_from_fork_asm+0x1a/0x30 [ 3459.792359] </TASK> [ 3459.792835] ---[ end trace 0000000000000000 ]--- Standard rcutorture stack trace for this failure mode. [ 3459.793849] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 913 jiffies): 1s/10: 0:1 2s/10: 719677:32517 3s/10: 646965:0 So the whole thing lasted less than a second (913 jiffies). Each element of the histogram is 100 milliseconds worth. Nothing came through during the first 100 ms (not surprising), and one grace period elapsed (also not surprising). A lot of callbacks came through in the second 100 ms (also not surprising), but there were some tens of thousand grace periods (extremely surprising). The third 100 ms got a lot of callbacks but no grace periods (not surprising). Huh. This started at t=3458.877155 and we got the OOM at t=3459.733109, which roughly matches the reported time. [ 3459.796413] rcu: rcu_fwd_progress_check: GP age 737 jiffies The callback flood does seem to have stalled grace periods, though not by all *that* much. 
[ 3459.799402] rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 740 ->gp_activity 0 ->gp_req_activity 747 ->gp_wake_time 68 ->gp_wake_seq 5535689 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->gp_max 713 ->gp_flags 0x0 The RCU grace-period kthread is in its loop looking for quiescent states, and is executing normally ("->gp_activity 0", as opposed to some huge number indicating that the kthread was never awakened). [ 3459.804267] rcu: rcu_node 0:0 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->qsmask 0x0 ...G ->n_boosts 0 The "->qsmask 0x0" says that all CPUs have provided a quiescent state, but the "G" indicates that the normal grace period is blocked by some task preempted within an RCU read-side critical section. This output is strange because a 56-CPU scenario should have considerably more output. Plus this means that this cannot possibly be TREE10 because that scenario is non-preemptible, so there cannot be grace periods waiting for quiescent states on anything but CPUs. This happens from time to time because TREE09 runs on a single CPU, and has preemption enabled, but not RCU priority boosting. And this output is quite appropriate for a single-CPU scenario. I probably should enable RCU priority boosting on this scenario. I would also need it to be pretty fast off the mark, because we OOMed about 700 milliseconds into the grace period. But that has nothing to do with lazy preemption! [ 3459.806271] rcu: cpu 0 ->gp_seq_needed 5535692 [ 3459.807139] rcu: RCU callbacks invoked since boot: 65398010 [ 3459.808374] rcu: rcu_fwd_progress_check: callbacks 0: 7484262 [ 3459.809640] rcutorture_oom_notify: Freed 1 RCU callbacks. [ 3460.616268] rcutorture_oom_notify: Freed 7484253 RCU callbacks. [ 3460.619551] rcutorture_oom_notify: Freed 0 RCU callbacks. [ 3460.620740] rcutorture_oom_notify returning after OOM processing. 
[ 3460.622719] rcu_torture_fwd_prog_cr: Waiting for CBs: rcu_barrier+0x0/0x2c0() 0 [ 3461.678556] rcu_torture_fwd_prog_nr: Starting forward-progress test 0 [ 3461.684546] rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x2c0() 0
Paul E. McKenney <paulmck@kernel.org> writes: > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote: >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote: >> > >> > Paul E. McKenney <paulmck@kernel.org> writes: >> > >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: >> > >> >> > >> Paul E. McKenney <paulmck@kernel.org> writes: >> > >> >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: >> > >> >> > >> > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: >> > >> >> > >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: >> > >> >> > >> Hi, >> > >> >> > >> >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend >> > >> >> > >> on explicit preemption points for the voluntary models. >> > >> >> > >> >> > >> >> > >> The series is based on Thomas' original proposal which he outlined >> > >> >> > >> in [1], [2] and in his PoC [3]. >> > >> >> > >> >> > >> >> > >> An earlier RFC version is at [4]. >> > >> >> > > >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most >> > >> >> > > likely for the merge window after next, but let me know if you need >> > >> >> > > them sooner. >> > >> >> > >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. >> > >> >> > But, the attached diff should tide me over until the fixes are in. >> > >> >> >> > >> >> That was indeed my guess. 
;-) >> > >> >> >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback >> > >> >> > > flooding, but I am still looking into this. >> > >> >> > >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? >> > >> >> >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on >> > >> >> two of them thus far. I am running a longer test to see if this might >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10 >> > >> >> and TRACE01 have in common. >> > >> > >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something >> > >> > to dig into more. >> > >> >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder >> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y >> > >> as well? >> > >> (Just in the interest of minimizing configurations.) >> > > >> > > I would be happy to, but in the spirit of full disclosure... >> > > >> > > First, I have seen that failure only once, which is not enough to >> > > conclude that it has much to do with TREE04. It might simply be low >> > > probability, so that TREE04 simply was unlucky enough to hit it first. >> > > In contrast, I have sufficient data to be reasonably confident that the >> > > callback-flooding OOMs really do have something to do with the TRACE01 and >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios >> > > have in common that they don't also have in common with other scenarios. >> > > But what is life without a bit of mystery? ;-) >> > >> > :). >> > >> > > Second, please see the attached tarball, which contains .csv files showing >> > > Kconfig options and kernel boot parameters for the various torture tests. 
>> > > The portions of the filenames preceding the "config.csv" correspond to >> > > the directories in tools/testing/selftests/rcutorture/configs. >> > >> > So, at least some of the HZ_FULL=y tests don't run into problems. >> > >> > > Third, there are additional scenarios hand-crafted by the script at >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of >> > > them have triggered, other than via the newly increased difficulty >> > > of configurating a tracing-free kernel with which to test, but they >> > > can still be useful in ruling out particular Kconfig options or kernel >> > > boot parameters being related to a given issue. >> > > >> > > But please do take a look at the .csv files and let me know what >> > > adjustments would be appropriate given the failure information. >> > >> > Nothing stands out just yet. Let me start a run here and see if >> > that gives me some ideas. >> >> Sounds good, thank you! >> >> > I'm guessing the splats don't give any useful information or >> > you would have attached them ;). >> >> My plan is to extract what can be extracted from the overnight run >> that I just started. Just in case the fixes have any effect on things, >> unlikely though that might be given those fixes and the runs that failed. > > And I only got no failures from either TREE10 or TRACE01 on last night's > run. Oh that's great news. Same for my overnight runs for TREE04 and TRACE01. Ongoing: a 24 hour run for those. Let's see how that goes. > I merged your series on top of v6.8-rc4 with the -rcu tree's > dev branch, the latter to get the RCU fixes. But this means that last > night's results are not really comparable to earlier results. > > I did get a few TREE09 failures, but I get those anyway. I took it > apart below for you because I got confused and thought that it was a > TREE10 failure. So just in case you were curious what one of these > looks like and because I am too lazy to delete it. ;-) Heh. 
Well, thanks for being lazy /after/ dissecting it nicely. > So from the viewpoint of moderate rcutorture testing, this series > looks good. Woo hoo!!! Awesome! > We did uncover a separate issue with Tasks RCU, which I will report on > in more detail separately. However, this issue does not (repeat, *not*) > affect lazy preemption as such, but instead any attempt to remove all > of the cond_resched() invocations. So, that sounds like it happens even with (CONFIG_PREEMPT_AUTO=n, CONFIG_PREEMPT=y)? Anyway will look out for it when you go into the detail. > My next step is to try this on bare metal on a system configured as > is the fleet. But good progress for a week!!! Yeah this is great. Fingers crossed for the wider set of tests. Thanks -- ankur
On Fri, Feb 16, 2024 at 07:59:45PM -0800, Ankur Arora wrote: > Paul E. McKenney <paulmck@kernel.org> writes: > > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote: > >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote: > >> > > >> > Paul E. McKenney <paulmck@kernel.org> writes: > >> > > >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: > >> > >> > >> > >> Paul E. McKenney <paulmck@kernel.org> writes: > >> > >> > >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > >> > >> >> > > >> > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: > >> > >> >> > > >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > >> > >> >> > >> Hi, > >> > >> >> > >> > >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > >> > >> >> > >> on explicit preemption points for the voluntary models. > >> > >> >> > >> > >> > >> >> > >> The series is based on Thomas' original proposal which he outlined > >> > >> >> > >> in [1], [2] and in his PoC [3]. > >> > >> >> > >> > >> > >> >> > >> An earlier RFC version is at [4]. > >> > >> >> > > > >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been > >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with > >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > >> > >> >> > > likely for the merge window after next, but let me know if you need > >> > >> >> > > them sooner. > >> > >> >> > > >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > >> > >> >> > But, the attached diff should tide me over until the fixes are in. > >> > >> >> > >> > >> >> That was indeed my guess. 
;-) > >> > >> >> > >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback > >> > >> >> > > flooding, but I am still looking into this. > >> > >> >> > > >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > >> > >> >> > >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > >> > >> >> two of them thus far. I am running a longer test to see if this might > >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10 > >> > >> >> and TRACE01 have in common. > >> > >> > > >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does > >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something > >> > >> > to dig into more. > >> > >> > >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder > >> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y > >> > >> as well? > >> > >> (Just in the interest of minimizing configurations.) > >> > > > >> > > I would be happy to, but in the spirit of full disclosure... > >> > > > >> > > First, I have seen that failure only once, which is not enough to > >> > > conclude that it has much to do with TREE04. It might simply be low > >> > > probability, so that TREE04 simply was unlucky enough to hit it first. > >> > > In contrast, I have sufficient data to be reasonably confident that the > >> > > callback-flooding OOMs really do have something to do with the TRACE01 and > >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios > >> > > have in common that they don't also have in common with other scenarios. > >> > > But what is life without a bit of mystery? ;-) > >> > > >> > :). > >> > > >> > > Second, please see the attached tarball, which contains .csv files showing > >> > > Kconfig options and kernel boot parameters for the various torture tests. 
> >> > > The portions of the filenames preceding the "config.csv" correspond to > >> > > the directories in tools/testing/selftests/rcutorture/configs. > >> > > >> > So, at least some of the HZ_FULL=y tests don't run into problems. > >> > > >> > > Third, there are additional scenarios hand-crafted by the script at > >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of > >> > > them have triggered, other than via the newly increased difficulty > >> > > of configurating a tracing-free kernel with which to test, but they > >> > > can still be useful in ruling out particular Kconfig options or kernel > >> > > boot parameters being related to a given issue. > >> > > > >> > > But please do take a look at the .csv files and let me know what > >> > > adjustments would be appropriate given the failure information. > >> > > >> > Nothing stands out just yet. Let me start a run here and see if > >> > that gives me some ideas. > >> > >> Sounds good, thank you! > >> > >> > I'm guessing the splats don't give any useful information or > >> > you would have attached them ;). > >> > >> My plan is to extract what can be extracted from the overnight run > >> that I just started. Just in case the fixes have any effect on things, > >> unlikely though that might be given those fixes and the runs that failed. > > > > And I only got no failures from either TREE10 or TRACE01 on last night's > > run. > > Oh that's great news. Same for my overnight runs for TREE04 and TRACE01. > > Ongoing: a 24 hour run for those. Let's see how that goes. > > > I merged your series on top of v6.8-rc4 with the -rcu tree's > > dev branch, the latter to get the RCU fixes. But this means that last > > night's results are not really comparable to earlier results. > > > > I did get a few TREE09 failures, but I get those anyway. I took it > > apart below for you because I got confused and thought that it was a > > TREE10 failure. 
So just in case you were curious what one of these > > looks like and because I am too lazy to delete it. ;-) > > Heh. Well, thanks for being lazy /after/ dissecting it nicely. > > > So from the viewpoint of moderate rcutorture testing, this series > > looks good. Woo hoo!!! > > Awesome! > > > We did uncover a separate issue with Tasks RCU, which I will report on > > in more detail separately. However, this issue does not (repeat, *not*) > > affect lazy preemption as such, but instead any attempt to remove all > > of the cond_resched() invocations. > > So, that sounds like it happens even with (CONFIG_PREEMPT_AUTO=n, > CONFIG_PREEMPT=y)? > Anyway will look out for it when you go into the detail. Fair point, normally Tasks RCU isn't present when cond_resched() means anything. I will look again -- it is quite possible that I was confused by earlier in-fleet setups that had Tasks RCU enabled even when preemption was disabled. (We don't do that anymore, and, had I been paying sufficient attention, would not have been doing it to start with. Back in the day, enabling rcutorture, even as a module, had the side effect of enabling Tasks RCU. How else to test it, right? Well...) > > My next step is to try this on bare metal on a system configured as > > is the fleet. But good progress for a week!!! > > Yeah this is great. Fingers crossed for the wider set of tests. I got what might be a one-off when hitting rcutorture and KASAN harder. I am running 320*TRACE01 to see if it reproduces. In the meantime... [ 2242.502082] ------------[ cut here ]------------ [ 2242.502946] WARNING: CPU: 3 PID: 72 at kernel/rcu/rcutorture.c:2839 rcu_torture_fwd_prog+0x1125/0x14e0 Callback-flooding events go for at most eight seconds, and end earlier if the RCU flavor under test can clear the flood sooner. If the flood does consume the full eight seconds, then we get the above WARN_ON if too few callbacks are invoked in the meantime. 
So we get a splat, which is mostly there to make sure that rcutorture complains about this, not much information here. [ 2242.504580] Modules linked in: [ 2242.505125] CPU: 3 PID: 72 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00101-ga3ecbc334926 #8321 [ 2242.506652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 2242.508577] RIP: 0010:rcu_torture_fwd_prog+0x1125/0x14e0 [ 2242.509513] Code: 4b f9 ff ff e8 ac 16 3d 00 e9 0e f9 ff ff 48 c7 c7 c0 b9 00 91 e8 9b 16 3d 00 e9 71 f9 ff ff e8 91 16 3d 00 e9 bb f0 ff ff 90 <0f> 0b 90 e9 38 fe ff ff 48 8b bd 48 ff ff ff e8 47 16 3d 00 e9 53 [ 2242.512705] RSP: 0018:ffff8880028b7da0 EFLAGS: 00010293 [ 2242.513615] RAX: 000000010031ebc4 RBX: 0000000000000000 RCX: ffffffff8d5c6040 [ 2242.514843] RDX: 00000001001da27d RSI: 0000000000000008 RDI: 0000000000000e10 [ 2242.516073] RBP: ffff8880028b7ee0 R08: 0000000000000000 R09: ffffed100036d340 [ 2242.517308] R10: ffff888001b69a07 R11: 0000000000030001 R12: dffffc0000000000 [ 2242.518537] R13: 000000000001869e R14: ffffffff9100b9c0 R15: ffff888002714000 [ 2242.519765] FS: 0000000000000000(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000 [ 2242.521152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2242.522151] CR2: 0000000000000000 CR3: 0000000003f26000 CR4: 00000000000006f0 [ 2242.523392] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2242.524624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2242.525857] Call Trace: [ 2242.526301] <TASK> [ 2242.526674] ? __warn+0xc8/0x260 [ 2242.527258] ? rcu_torture_fwd_prog+0x1125/0x14e0 [ 2242.528077] ? report_bug+0x291/0x2e0 [ 2242.528726] ? handle_bug+0x3d/0x80 [ 2242.529348] ? exc_invalid_op+0x18/0x50 [ 2242.530022] ? asm_exc_invalid_op+0x1a/0x20 [ 2242.530753] ? kthread_should_stop+0x70/0xc0 [ 2242.531508] ? rcu_torture_fwd_prog+0x1125/0x14e0 [ 2242.532333] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 [ 2242.533191] ? 
__pfx__raw_spin_lock_irqsave+0x10/0x10 [ 2242.534083] ? set_cpus_allowed_ptr+0x7c/0xb0 [ 2242.534847] ? __pfx_set_cpus_allowed_ptr+0x10/0x10 [ 2242.535696] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 [ 2242.536547] ? kthread+0x24a/0x300 [ 2242.537159] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 [ 2242.538038] kthread+0x24a/0x300 [ 2242.538607] ? __pfx_kthread+0x10/0x10 [ 2242.539283] ret_from_fork+0x2f/0x70 [ 2242.539907] ? __pfx_kthread+0x10/0x10 [ 2242.540571] ret_from_fork_asm+0x1b/0x30 [ 2242.541273] </TASK> [ 2242.541661] ---[ end trace 0000000000000000 ]--- But now there is some information... [ 2242.542471] rcu_torture_fwd_prog_cr Duration 8000 barrier: 749 pending 49999 n_launders: 99998 n_launders_sa: 99998 n_max_gps: 0 n_max_cbs: 50000 cver 0 gps 0 The flood lasted the full eight seconds ("Duration 8000"). The rcu_barrier_trace() operation consumed an additional 749ms ("barrier: 749"). There were almost 50K callbacks for that rcu_barrier_trace() to deal with ("pending 49999"). There were almost 100K that were recycled directly, as opposed to being newly allocated ("n_launders: 99998"), and all launders happened after the last allocation ("n_launders_sa: 99998"). This is to be expected because RCU Tasks Trace does not do priority boosting of preempted readers, and therefore rcutorture limits the number of outstanding callbacks in the flood to 50K. And it might never boost readers, given that it is legal to block in an RCU Tasks Trace read-side critical section. Insufficient callbacks were invoked ("n_max_gps: 0") and the full 50K permitted was allocated ("n_max_cbs: 50000"). The rcu_torture_writer() did not see a full grace period in the meantime ("cver 0") and there was no recorded grace period in the meantime ("gps 0"). 
[ 2242.544890] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 8751 jiffies): 1s/10: 0:0 2s/10: 0:0 3s/10: 0:0 4s/10: 0:0 5s/10: 0:0 6s/10: 0:0 7s/10: 0:0 8s/10: 50000:0 9s/10: 0:0 10s/10: 0:0 11s/10: 0:0 12s/10: 0:0 13s/10: 0:0 14s/10: 0:0 15s/10: 0:0 16s/10: 49999:0 17s/10: 0:0 18s/10: 0:0 19s/10: 0:0 20s/10: 0:0 21s/10: 0:0 22s/10: 0:0 23s/10: 0:0 24s/10: 0:0 25s/10: 0:0 26s/10: 0:0 27s/10: 0:0 28s/10: 0:0 29s/10: 0:0 30s/10: 0:0 31s/10: 0:0 32s/10: 0:0 33s/10: 0:0 34s/10: 0:0 35s/10: 0:0 36s/10: 0:0 37s/10: 0:0 38s/10: 0:0 39s/10: 0:0 40s/10: 0:0 41s/10: 0:0 42s/10: 0:0 43s/10: 0:0 44s/10: 0:0 45s/10: 0:0 46s/10: 0:0 47s/10: 0:0 48s/10: 0:0 49s/10: 0:0 50s/10: 0:0 51s/10: 0:0 52s/10: 0:0 53s/10: 0:0 54s/10: 0:0 55s/10: 0:0 56s/10: 0:0 57s/10: 0:0 58s/10: 0:0 59s/10: 0:0 60s/10: 0:0 61s/10: 0:0 62s/10: 0:0 63s/10: 0:0 64s/10: 0:0 65s/10: 0:0 66s/10: 0:0 67s/10: 0:0 68s/10: 0:0 69s/10: 0:0 70s/10: 0:0 71s/10: 0:0 72s/10: 0:0 73s/10: 0:0 74s/10: 0:0 75s/10: 0:0 76s/10: 0:0 77s/10: 0:0 78s/10: 0:0 [ 2242.545050] 79s/10: 0:0 80s/10: 0:0 81s/10: 49999:0 Except that we can see callbacks having been invoked during this time ("8s/10: 50000:0", "16s/10: 49999:0", "81s/10: 49999:0"). In part because rcutorture is unaware of RCU Tasks Trace's grace-period sequence number. So, first see if it is reproducible, second enable more diagnostics, third make more grace-period sequence numbers available to rcutorture, fourth recheck the diagnostics code, and then see where we go from there. It might be that lazy preemption needs adjustment, or it might be that it just tickled latent diagnostic issues in rcutorture. (I rarely hit this WARN_ON() except in early development, when the problem is usually glaringly obvious, hence all the uncertainty.) Thanx, Paul
On Sun, Feb 18, 2024 at 10:17:48AM -0800, Paul E. McKenney wrote: > On Fri, Feb 16, 2024 at 07:59:45PM -0800, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote: > > >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote: > > >> > > > >> > Paul E. McKenney <paulmck@kernel.org> writes: > > >> > > > >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: > > >> > >> > > >> > >> Paul E. McKenney <paulmck@kernel.org> writes: > > >> > >> > > >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > > >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > > >> > >> >> > > > >> > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: > > >> > >> >> > > > >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > > >> > >> >> > >> Hi, > > >> > >> >> > >> > > >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > > >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > > >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > > >> > >> >> > >> on explicit preemption points for the voluntary models. > > >> > >> >> > >> > > >> > >> >> > >> The series is based on Thomas' original proposal which he outlined > > >> > >> >> > >> in [1], [2] and in his PoC [3]. > > >> > >> >> > >> > > >> > >> >> > >> An earlier RFC version is at [4]. > > >> > >> >> > > > > >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been > > >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with > > >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > > >> > >> >> > > likely for the merge window after next, but let me know if you need > > >> > >> >> > > them sooner. > > >> > >> >> > > > >> > >> >> > Thanks. 
As you can probably tell, I skipped out on !SMP in my testing. > > >> > >> >> > But, the attached diff should tide me over until the fixes are in. > > >> > >> >> > > >> > >> >> That was indeed my guess. ;-) > > >> > >> >> > > >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback > > >> > >> >> > > flooding, but I am still looking into this. > > >> > >> >> > > > >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > > >> > >> >> > > >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > > >> > >> >> two of them thus far. I am running a longer test to see if this might > > >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10 > > >> > >> >> and TRACE01 have in common. > > >> > >> > > > >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > > >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does > > >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something > > >> > >> > to dig into more. > > >> > >> > > >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder > > >> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y > > >> > >> as well? > > >> > >> (Just in the interest of minimizing configurations.) > > >> > > > > >> > > I would be happy to, but in the spirit of full disclosure... > > >> > > > > >> > > First, I have seen that failure only once, which is not enough to > > >> > > conclude that it has much to do with TREE04. It might simply be low > > >> > > probability, so that TREE04 simply was unlucky enough to hit it first. 
> > >> > > In contrast, I have sufficient data to be reasonably confident that the > > >> > > callback-flooding OOMs really do have something to do with the TRACE01 and > > >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios > > >> > > have in common that they don't also have in common with other scenarios. > > >> > > But what is life without a bit of mystery? ;-) > > >> > > > >> > :). > > >> > > > >> > > Second, please see the attached tarball, which contains .csv files showing > > >> > > Kconfig options and kernel boot parameters for the various torture tests. > > >> > > The portions of the filenames preceding the "config.csv" correspond to > > >> > > the directories in tools/testing/selftests/rcutorture/configs. > > >> > > > >> > So, at least some of the HZ_FULL=y tests don't run into problems. > > >> > > > >> > > Third, there are additional scenarios hand-crafted by the script at > > >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of > > >> > > them have triggered, other than via the newly increased difficulty > > >> > > of configurating a tracing-free kernel with which to test, but they > > >> > > can still be useful in ruling out particular Kconfig options or kernel > > >> > > boot parameters being related to a given issue. > > >> > > > > >> > > But please do take a look at the .csv files and let me know what > > >> > > adjustments would be appropriate given the failure information. > > >> > > > >> > Nothing stands out just yet. Let me start a run here and see if > > >> > that gives me some ideas. > > >> > > >> Sounds good, thank you! > > >> > > >> > I'm guessing the splats don't give any useful information or > > >> > you would have attached them ;). > > >> > > >> My plan is to extract what can be extracted from the overnight run > > >> that I just started. Just in case the fixes have any effect on things, > > >> unlikely though that might be given those fixes and the runs that failed. 
> > > > > > And I only got no failures from either TREE10 or TRACE01 on last night's > > > run. > > > > Oh that's great news. Same for my overnight runs for TREE04 and TRACE01. > > > > Ongoing: a 24 hour run for those. Let's see how that goes. > > > > > I merged your series on top of v6.8-rc4 with the -rcu tree's > > > dev branch, the latter to get the RCU fixes. But this means that last > > > night's results are not really comparable to earlier results. > > > > > > I did get a few TREE09 failures, but I get those anyway. I took it > > > apart below for you because I got confused and thought that it was a > > > TREE10 failure. So just in case you were curious what one of these > > > looks like and because I am too lazy to delete it. ;-) > > > > Heh. Well, thanks for being lazy /after/ dissecting it nicely. > > > > > So from the viewpoint of moderate rcutorture testing, this series > > > looks good. Woo hoo!!! > > > > Awesome! > > > > > We did uncover a separate issue with Tasks RCU, which I will report on > > > in more detail separately. However, this issue does not (repeat, *not*) > > > affect lazy preemption as such, but instead any attempt to remove all > > > of the cond_resched() invocations. > > > > So, that sounds like it happens even with (CONFIG_PREEMPT_AUTO=n, > > CONFIG_PREEMPT=y)? > > Anyway will look out for it when you go into the detail. > > Fair point, normally Tasks RCU isn't present when cond_resched() > means anything. > > I will look again -- it is quite possible that I was confused by earlier > in-fleet setups that had Tasks RCU enabled even when preemption was > disabled. (We don't do that anymore, and, had I been paying sufficient > attention, would not have been doing it to start with. Back in the day, > enabling rcutorture, even as a module, had the side effect of enabling > Tasks RCU. How else to test it, right? Well...) OK, I got my head straight on this one... 
And the problem is in fact that Tasks RCU isn't normally present in non-preemptible kernels. This is because normal RCU will wait for preemption-disabled regions of code, and in PREEMPT_NONE and PREEMPT_VOLUNTARY kernels, that includes pretty much any region of code lacking an explicit schedule() or similar. And as I understand it, tracing trampolines rely on this implicit lack of preemption. So, with lazy preemption, we could preempt in the middle of a trampoline, and synchronize_rcu() won't save us. Steve and Mathieu will correct me if I am wrong. If I do understand this correctly, one workaround is to remove the "if PREEMPTIBLE" on all occurrences of "select TASKS_RCU". That way, all kernels would use synchronize_rcu_tasks(), which would wait for a voluntary context switch. This workaround does increase the overhead and tracepoint-removal latency on non-preemptible kernels, so it might be time to revisit the synchronization of trampolines. Unfortunately, the things I have come up with thus far have disadvantages: o Keep a set of permanent trampolines that enter and exit some sort of explicit RCU read-side critical section. If the address for this trampoline to call is in a register, then these permanent trampolines remain constant so that no synchronization of them is required. The selected flavor of RCU can then be used to deal with the non-permanent trampolines. The disadvantage here is a significant increase in the complexity and overhead of trampoline code and the code that invokes the trampolines. This overhead limits where tracing may be used in the kernel, which is of course undesirable. o Check for being preempted within a trampoline, and track this within the tasks structure. The disadvantage here is that this requires keeping track of all of the trampolines and adding a check for being in one on a scheduler fast path. 
o Have a variant of Tasks RCU which checks the stack of preempted tasks, waiting until all have been seen without being preempted in a trampoline. This still requires keeping track of all the trampolines in an easy-to-search manner, but gets the overhead of searching off of the scheduler fastpaths. It is also necessary to check running tasks, which might have been interrupted from within a trampoline. I would have a hard time convincing myself that these return addresses were unconditionally reliable. But maybe they are? o Your idea here! Again, the short-term workaround is to remove the "if PREEMPTIBLE" from all of the "select TASKS_RCU" clauses. > > > My next step is to try this on bare metal on a system configured as > > > is the fleet. But good progress for a week!!! > > > > Yeah this is great. Fingers crossed for the wider set of tests. > > I got what might be a one-off when hitting rcutorture and KASAN harder. > I am running 320*TRACE01 to see if it reproduces. [ . . . ] > So, first see if it is reproducible, second enable more diagnostics, > third make more grace-period sequence numbers available to rcutorture, > fourth recheck the diagnostics code, and then see where we go from there. > It might be that lazy preemption needs adjustment, or it might be that > it just tickled latent diagnostic issues in rcutorture. > > (I rarely hit this WARN_ON() except in early development, when the > problem is usually glaringly obvious, hence all the uncertainty.) And it is eminently reproducible. Digging into it... Thanx, Paul
Paul E. McKenney <paulmck@kernel.org> writes: > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote: >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote: >> > >> > Paul E. McKenney <paulmck@kernel.org> writes: >> > >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: >> > >> >> > >> Paul E. McKenney <paulmck@kernel.org> writes: >> > >> >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: >> > >> >> > >> > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: >> > >> >> > >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: >> > >> >> > >> Hi, >> > >> >> > >> >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend >> > >> >> > >> on explicit preemption points for the voluntary models. >> > >> >> > >> >> > >> >> > >> The series is based on Thomas' original proposal which he outlined >> > >> >> > >> in [1], [2] and in his PoC [3]. >> > >> >> > >> >> > >> >> > >> An earlier RFC version is at [4]. >> > >> >> > > >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most >> > >> >> > > likely for the merge window after next, but let me know if you need >> > >> >> > > them sooner. >> > >> >> > >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. >> > >> >> > But, the attached diff should tide me over until the fixes are in. >> > >> >> >> > >> >> That was indeed my guess. 
;-) >> > >> >> >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback >> > >> >> > > flooding, but I am still looking into this. >> > >> >> > >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? >> > >> >> >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on >> > >> >> two of them thus far. I am running a longer test to see if this might >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10 >> > >> >> and TRACE01 have in common. >> > >> > >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something >> > >> > to dig into more. >> > >> >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder >> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y >> > >> as well? >> > >> (Just in the interest of minimizing configurations.) >> > > >> > > I would be happy to, but in the spirit of full disclosure... >> > > >> > > First, I have seen that failure only once, which is not enough to >> > > conclude that it has much to do with TREE04. It might simply be low >> > > probability, so that TREE04 simply was unlucky enough to hit it first. >> > > In contrast, I have sufficient data to be reasonably confident that the >> > > callback-flooding OOMs really do have something to do with the TRACE01 and >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios >> > > have in common that they don't also have in common with other scenarios. >> > > But what is life without a bit of mystery? ;-) >> > >> > :). >> > >> > > Second, please see the attached tarball, which contains .csv files showing >> > > Kconfig options and kernel boot parameters for the various torture tests. 
>> > > The portions of the filenames preceding the "config.csv" correspond to >> > > the directories in tools/testing/selftests/rcutorture/configs. >> > >> > So, at least some of the HZ_FULL=y tests don't run into problems. >> > >> > > Third, there are additional scenarios hand-crafted by the script at >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of >> > > them have triggered, other than via the newly increased difficulty >> > > of configurating a tracing-free kernel with which to test, but they >> > > can still be useful in ruling out particular Kconfig options or kernel >> > > boot parameters being related to a given issue. >> > > >> > > But please do take a look at the .csv files and let me know what >> > > adjustments would be appropriate given the failure information. >> > >> > Nothing stands out just yet. Let me start a run here and see if >> > that gives me some ideas. >> >> Sounds good, thank you! >> >> > I'm guessing the splats don't give any useful information or >> > you would have attached them ;). >> >> My plan is to extract what can be extracted from the overnight run >> that I just started. Just in case the fixes have any effect on things, >> unlikely though that might be given those fixes and the runs that failed. > > And I only got no failures from either TREE10 or TRACE01 on last night's > run. I merged your series on top of v6.8-rc4 with the -rcu tree's > dev branch, the latter to get the RCU fixes. But this means that last > night's results are not really comparable to earlier results. Not sure if you saw any other instances of this since, but a couple of things I belatedly noticed below. [ ... ] > [ 3459.733109] ------------[ cut here ]------------ > [ 3459.734165] rcutorture_oom_notify invoked upon OOM during forward-progress testing. > [ 3459.735828] WARNING: CPU: 0 PID: 43 at kernel/rcu/rcutorture.c:2874 rcutorture_oom_notify+0x3e/0x1d0 > > Now something bad happened. 
RCU was unable to keep up with the > callback flood. Given that users can create callback floods > with close(open()) loops, this is not good. > > [ 3459.737761] Modules linked in: > [ 3459.738408] CPU: 0 PID: 43 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00096-g40c2642e6f24 #8252 > [ 3459.740295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > [ 3459.742651] RIP: 0010:rcutorture_oom_notify+0x3e/0x1d0 > [ 3459.743821] Code: e8 37 48 c2 00 48 8b 1d f8 b4 dc 01 48 85 db 0f 84 80 01 00 00 90 48 c7 c6 40 f5 e0 92 48 c7 c7 10 52 23 93 e8 d3 b9 f9 ff 90 <0f> 0b 90 90 8b 35 f8 c0 66 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83 > [ 3459.747935] RSP: 0018:ffffabbb0015bb30 EFLAGS: 00010282 > [ 3459.749061] RAX: 0000000000000000 RBX: ffff9485812ae000 RCX: 00000000ffffdfff > [ 3459.750601] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001 > [ 3459.752026] RBP: ffffabbb0015bb98 R08: ffffffff93539388 R09: 00000000ffffdfff > [ 3459.753616] R10: ffffffff934593a0 R11: ffffffff935093a0 R12: 0000000000000000 > [ 3459.755134] R13: ffffabbb0015bb98 R14: ffffffff93547da0 R15: 00000000ffffffff > [ 3459.756695] FS: 0000000000000000(0000) GS:ffffffff9344f000(0000) knlGS:0000000000000000 > [ 3459.758443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 3459.759672] CR2: 0000000000600298 CR3: 0000000001958000 CR4: 00000000000006f0 > [ 3459.761253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 3459.762748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 3459.764472] Call Trace: > [ 3459.765003] <TASK> > [ 3459.765483] ? __warn+0x61/0xe0 > [ 3459.766176] ? rcutorture_oom_notify+0x3e/0x1d0 > [ 3459.767154] ? report_bug+0x174/0x180 > [ 3459.767942] ? handle_bug+0x3d/0x70 > [ 3459.768715] ? exc_invalid_op+0x18/0x70 > [ 3459.769561] ? asm_exc_invalid_op+0x1a/0x20 > [ 3459.770494] ? 
rcutorture_oom_notify+0x3e/0x1d0 > [ 3459.771501] blocking_notifier_call_chain+0x5c/0x80 > [ 3459.772553] out_of_memory+0x236/0x4b0 > [ 3459.773365] __alloc_pages+0x9ca/0xb10 > [ 3459.774233] ? set_next_entity+0x8b/0x150 > [ 3459.775107] new_slab+0x382/0x430 > [ 3459.776454] ___slab_alloc+0x23c/0x8c0 > [ 3459.777315] ? preempt_schedule_irq+0x32/0x50 > [ 3459.778319] ? rcu_torture_fwd_prog+0x6bf/0x970 > [ 3459.779291] ? rcu_torture_fwd_prog+0x6bf/0x970 > [ 3459.780264] ? rcu_torture_fwd_prog+0x6bf/0x970 > [ 3459.781244] kmalloc_trace+0x179/0x1a0 > [ 3459.784651] rcu_torture_fwd_prog+0x6bf/0x970 > [ 3459.785529] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 > [ 3459.786617] ? kthread+0xc3/0xf0 > [ 3459.787352] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 > [ 3459.788417] kthread+0xc3/0xf0 > [ 3459.789101] ? __pfx_kthread+0x10/0x10 > [ 3459.789906] ret_from_fork+0x2f/0x50 > [ 3459.790708] ? __pfx_kthread+0x10/0x10 > [ 3459.791523] ret_from_fork_asm+0x1a/0x30 > [ 3459.792359] </TASK> > [ 3459.792835] ---[ end trace 0000000000000000 ]--- > > Standard rcutorture stack trace for this failure mode. I see a preempt_schedule_irq() in the stack. So, I guess that at some point current (the task responsible for the callback flood?) was marked for lazy scheduling, did not schedule out, and then was eagerly preempted out at the next tick. > [ 3459.793849] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 913 jiffies): 1s/10: 0:1 2s/10: 719677:32517 3s/10: 646965:0 > > So the whole thing lasted less than a second (913 jiffies). > Each element of the histogram is 100 milliseconds worth. Nothing > came through during the first 100 ms (not surprising), and one > grace period elapsed (also not surprising). A lot of callbacks > came through in the second 100 ms (also not surprising), but there > were some tens of thousand grace periods (extremely surprising). > The third 100 ms got a lot of callbacks but no grace periods > (not surprising). > > Huh. 
This started at t=3458.877155 and we got the OOM at > t=3459.733109, which roughly matches the reported time. > > [ 3459.796413] rcu: rcu_fwd_progress_check: GP age 737 jiffies > > The callback flood does seem to have stalled grace periods, > though not by all *that* much. > > [ 3459.799402] rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 740 ->gp_activity 0 ->gp_req_activity 747 ->gp_wake_time 68 ->gp_wake_seq 5535689 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->gp_max 713 ->gp_flags 0x0 > > The RCU grace-period kthread is in its loop looking for > quiescent states, and is executing normally ("->gp_activity 0", > as opposed to some huge number indicating that the kthread was > never awakened). > > [ 3459.804267] rcu: rcu_node 0:0 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->qsmask 0x0 ...G ->n_boosts 0 > > The "->qsmask 0x0" says that all CPUs have provided a quiescent > state, but the "G" indicates that the normal grace period is > blocked by some task preempted within an RCU read-side critical > section. This output is strange because a 56-CPU scenario should > have considerably more output. > > Plus this means that this cannot possibly be TREE10 because that > scenario is non-preemptible, so there cannot be grace periods > waiting for quiescent states on anything but CPUs. Might be missing the point, but with CONFIG_PREEMPT_NONE, you could be preempted if you exceed your time quanta by more than one tick. Though that of course needs the task to not be in the read-side critical section. Thanks -- ankur
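The preempt_schedule_irq() observation in the stack trace matches the PREEMPT_AUTO behavior being discussed: the scheduler first sets only a lazy resched flag, and a task that keeps running past the next tick is upgraded to an eager resched and preempted from IRQ exit. A minimal userspace model of that policy (the flag names and the upgrade-at-tick rule are assumptions drawn from this thread, not the kernel implementation):

```c
#include <stdbool.h>

/* Models TIF_NEED_RESCHED_LAZY vs. TIF_NEED_RESCHED as described in the
 * thread; an illustrative sketch, not the actual kernel logic. */
enum resched_flag { RESCHED_NONE, RESCHED_LAZY, RESCHED_EAGER };

struct task_model {
	enum resched_flag resched;
};

/* Common case under PREEMPT_AUTO: only the lazy flag is set, deferring
 * the context switch to exit-to-user or a later tick. */
static void set_need_resched_lazy(struct task_model *t)
{
	if (t->resched == RESCHED_NONE)
		t->resched = RESCHED_LAZY;
}

/* Tick handler: a task that ran through a full tick with the lazy flag
 * still set is upgraded, so even a none/voluntary model eventually
 * preempts a CPU hog via preempt_schedule_irq(). */
static void scheduler_tick_model(struct task_model *t)
{
	if (t->resched == RESCHED_LAZY)
		t->resched = RESCHED_EAGER;
}

/* IRQ-exit check: only the eager flag forces preemption of kernel code. */
static bool preempt_on_irq_exit(const struct task_model *t)
{
	return t->resched == RESCHED_EAGER;
}
```

Under this policy a well-behaved task never takes the eager path, while a kthread that floods callbacks for seconds at a time is exactly the kind of task that gets tick-upgraded.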
On 2/13/2024 11:25 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> An earlier RFC version is at [4].
> [...]

Hello Ankur,

Thank you for the series. Just giving a crisp summary, since I am
expecting a respin of the patch series with the minor changes suggested
by Thomas and Mark and a fix by Paul, and am looking forward to testing
that.

I was able to test the current patchset, though in a different way. On
Milan (2 nodes, 256 CPUs, 512 GB RAM), I did my regular benchmark
testing to see if there are any surprises. I will do more detailed
testing/analysis with some of the scheduler-specific tests after your
respin.

Configurations tested:
a) Base kernel (6.7),
b) patched, with PREEMPT_AUTO voluntary preemption.
c) patched, with PREEMPT_DYNAMIC voluntary preemption.

Workloads I tested and their %gain:

            case b   case c
NAS          +2.7     +1.9
Hashjoin     +0       +0
Graph500     -6       +0
XSBench      +1.7     +0

I also ran kernbench etc. from Mel's mmtests suite and did not notice
much difference.

In summary, the benchmarks are mostly on the positive side.

Thanks and Regards
- Raghu
On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
> Configuration tested.
> a) Base kernel (6.7),

Which scheduling model is the baseline using?

> b) patched with PREEMPT_AUTO voluntary preemption.
> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>
> Workloads I tested and their %gain,
>             case b   case c
> NAS          +2.7     +1.9
> Hashjoin     +0       +0
> XSBench      +1.7     +0
> Graph500     -6       +0

The Graph500 stands out. Needs some analysis.

Thanks,

        tglx
On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>> Configuration tested.
>> a) Base kernel (6.7),
>
> Which scheduling model is the baseline using?

The baseline is also PREEMPT_DYNAMIC with voluntary preemption.

>> b) patched with PREEMPT_AUTO voluntary preemption.
>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>
>> Workloads I tested and their %gain,
>>             case b   case c
>> NAS          +2.7     +1.9
>> Hashjoin     +0       +0
>> XSBench      +1.7     +0
>> Graph500     -6       +0
>
> The Graph500 stands out. Needs some analysis.

Sure. Will do a more detailed analysis and come back on this.

Thanks
- Raghu
On Tue, Feb 20, 2024 at 10:48:41PM -0800, Ankur Arora wrote: > Paul E. McKenney <paulmck@kernel.org> writes: > > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote: > >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote: > >> > > >> > Paul E. McKenney <paulmck@kernel.org> writes: > >> > > >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote: > >> > >> > >> > >> Paul E. McKenney <paulmck@kernel.org> writes: > >> > >> > >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote: > >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote: > >> > >> >> > > >> > >> >> > Paul E. McKenney <paulmck@kernel.org> writes: > >> > >> >> > > >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote: > >> > >> >> > >> Hi, > >> > >> >> > >> > >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like > >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full > >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend > >> > >> >> > >> on explicit preemption points for the voluntary models. > >> > >> >> > >> > >> > >> >> > >> The series is based on Thomas' original proposal which he outlined > >> > >> >> > >> in [1], [2] and in his PoC [3]. > >> > >> >> > >> > >> > >> >> > >> An earlier RFC version is at [4]. > >> > >> >> > > > >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been > >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with > >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most > >> > >> >> > > likely for the merge window after next, but let me know if you need > >> > >> >> > > them sooner. > >> > >> >> > > >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing. > >> > >> >> > But, the attached diff should tide me over until the fixes are in. > >> > >> >> > >> > >> >> That was indeed my guess. 
;-) > >> > >> >> > >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback > >> > >> >> > > flooding, but I am still looking into this. > >> > >> >> > > >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration? > >> > >> >> > >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on > >> > >> >> two of them thus far. I am running a longer test to see if this might > >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10 > >> > >> >> and TRACE01 have in common. > >> > >> > > >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what > >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does > >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something > >> > >> > to dig into more. > >> > >> > >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder > >> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y > >> > >> as well? > >> > >> (Just in the interest of minimizing configurations.) > >> > > > >> > > I would be happy to, but in the spirit of full disclosure... > >> > > > >> > > First, I have seen that failure only once, which is not enough to > >> > > conclude that it has much to do with TREE04. It might simply be low > >> > > probability, so that TREE04 simply was unlucky enough to hit it first. > >> > > In contrast, I have sufficient data to be reasonably confident that the > >> > > callback-flooding OOMs really do have something to do with the TRACE01 and > >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios > >> > > have in common that they don't also have in common with other scenarios. > >> > > But what is life without a bit of mystery? ;-) > >> > > >> > :). > >> > > >> > > Second, please see the attached tarball, which contains .csv files showing > >> > > Kconfig options and kernel boot parameters for the various torture tests. 
> >> > > The portions of the filenames preceding the "config.csv" correspond to > >> > > the directories in tools/testing/selftests/rcutorture/configs. > >> > > >> > So, at least some of the HZ_FULL=y tests don't run into problems. > >> > > >> > > Third, there are additional scenarios hand-crafted by the script at > >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of > >> > > them have triggered, other than via the newly increased difficulty > >> > > of configurating a tracing-free kernel with which to test, but they > >> > > can still be useful in ruling out particular Kconfig options or kernel > >> > > boot parameters being related to a given issue. > >> > > > >> > > But please do take a look at the .csv files and let me know what > >> > > adjustments would be appropriate given the failure information. > >> > > >> > Nothing stands out just yet. Let me start a run here and see if > >> > that gives me some ideas. > >> > >> Sounds good, thank you! > >> > >> > I'm guessing the splats don't give any useful information or > >> > you would have attached them ;). > >> > >> My plan is to extract what can be extracted from the overnight run > >> that I just started. Just in case the fixes have any effect on things, > >> unlikely though that might be given those fixes and the runs that failed. > > > > And I only got no failures from either TREE10 or TRACE01 on last night's > > run. I merged your series on top of v6.8-rc4 with the -rcu tree's > > dev branch, the latter to get the RCU fixes. But this means that last > > night's results are not really comparable to earlier results. > > Not sure if you saw any othe instances of this since, but a couple of > things I tbelatedly noticed below. Thank you for taking a look! > [ ... ] > > > [ 3459.733109] ------------[ cut here ]------------ > > [ 3459.734165] rcutorture_oom_notify invoked upon OOM during forward-progress testing. 
> > [ 3459.735828] WARNING: CPU: 0 PID: 43 at kernel/rcu/rcutorture.c:2874 rcutorture_oom_notify+0x3e/0x1d0 > > > > Now something bad happened. RCU was unable to keep up with the > > callback flood. Given that users can create callback floods > > with close(open()) loops, this is not good. > > > > [ 3459.737761] Modules linked in: > > [ 3459.738408] CPU: 0 PID: 43 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00096-g40c2642e6f24 #8252 > > [ 3459.740295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > > [ 3459.742651] RIP: 0010:rcutorture_oom_notify+0x3e/0x1d0 > > [ 3459.743821] Code: e8 37 48 c2 00 48 8b 1d f8 b4 dc 01 48 85 db 0f 84 80 01 00 00 90 48 c7 c6 40 f5 e0 92 48 c7 c7 10 52 23 93 e8 d3 b9 f9 ff 90 <0f> 0b 90 90 8b 35 f8 c0 66 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83 > > [ 3459.747935] RSP: 0018:ffffabbb0015bb30 EFLAGS: 00010282 > > [ 3459.749061] RAX: 0000000000000000 RBX: ffff9485812ae000 RCX: 00000000ffffdfff > > [ 3459.750601] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001 > > [ 3459.752026] RBP: ffffabbb0015bb98 R08: ffffffff93539388 R09: 00000000ffffdfff > > [ 3459.753616] R10: ffffffff934593a0 R11: ffffffff935093a0 R12: 0000000000000000 > > [ 3459.755134] R13: ffffabbb0015bb98 R14: ffffffff93547da0 R15: 00000000ffffffff > > [ 3459.756695] FS: 0000000000000000(0000) GS:ffffffff9344f000(0000) knlGS:0000000000000000 > > [ 3459.758443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 3459.759672] CR2: 0000000000600298 CR3: 0000000001958000 CR4: 00000000000006f0 > > [ 3459.761253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [ 3459.762748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > [ 3459.764472] Call Trace: > > [ 3459.765003] <TASK> > > [ 3459.765483] ? __warn+0x61/0xe0 > > [ 3459.766176] ? rcutorture_oom_notify+0x3e/0x1d0 > > [ 3459.767154] ? report_bug+0x174/0x180 > > [ 3459.767942] ? 
handle_bug+0x3d/0x70 > > [ 3459.768715] ? exc_invalid_op+0x18/0x70 > > [ 3459.769561] ? asm_exc_invalid_op+0x1a/0x20 > > [ 3459.770494] ? rcutorture_oom_notify+0x3e/0x1d0 > > [ 3459.771501] blocking_notifier_call_chain+0x5c/0x80 > > [ 3459.772553] out_of_memory+0x236/0x4b0 > > [ 3459.773365] __alloc_pages+0x9ca/0xb10 > > [ 3459.774233] ? set_next_entity+0x8b/0x150 > > [ 3459.775107] new_slab+0x382/0x430 > > [ 3459.776454] ___slab_alloc+0x23c/0x8c0 > > [ 3459.777315] ? preempt_schedule_irq+0x32/0x50 > > [ 3459.778319] ? rcu_torture_fwd_prog+0x6bf/0x970 > > [ 3459.779291] ? rcu_torture_fwd_prog+0x6bf/0x970 > > [ 3459.780264] ? rcu_torture_fwd_prog+0x6bf/0x970 > > [ 3459.781244] kmalloc_trace+0x179/0x1a0 > > [ 3459.784651] rcu_torture_fwd_prog+0x6bf/0x970 > > [ 3459.785529] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 > > [ 3459.786617] ? kthread+0xc3/0xf0 > > [ 3459.787352] ? __pfx_rcu_torture_fwd_prog+0x10/0x10 > > [ 3459.788417] kthread+0xc3/0xf0 > > [ 3459.789101] ? __pfx_kthread+0x10/0x10 > > [ 3459.789906] ret_from_fork+0x2f/0x50 > > [ 3459.790708] ? __pfx_kthread+0x10/0x10 > > [ 3459.791523] ret_from_fork_asm+0x1a/0x30 > > [ 3459.792359] </TASK> > > [ 3459.792835] ---[ end trace 0000000000000000 ]--- > > > > Standard rcutorture stack trace for this failure mode. > > I see a preempt_schedule_irq() in the stack. So, I guess that at some > point current (the task responsible for the callback flood?) was marked > for lazy scheduling, did not schedule out, and then was eagerly > preempted out at the next tick. That is expected, given that the kthread doing the callback flooding will run for up to eight seconds. Some instrumentation shows grace periods waiting on tasks, but that instrumentation is later than would be good, after the barrier operation. > > [ 3459.793849] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 913 jiffies): 1s/10: 0:1 2s/10: 719677:32517 3s/10: 646965:0 > > > > So the whole thing lasted less than a second (913 jiffies). 
> > Each element of the histogram is 100 milliseconds worth. Nothing > > came through during the first 100 ms (not surprising), and one > > grace period elapsed (also not surprising). A lot of callbacks > > came through in the second 100 ms (also not surprising), but there > > were some tens of thousand grace periods (extremely surprising). > > The third 100 ms got a lot of callbacks but no grace periods > > (not surprising). > > > > Huh. This started at t=3458.877155 and we got the OOM at > > t=3459.733109, which roughly matches the reported time. > > > > [ 3459.796413] rcu: rcu_fwd_progress_check: GP age 737 jiffies > > > > The callback flood does seem to have stalled grace periods, > > though not by all *that* much. > > > > [ 3459.799402] rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 740 ->gp_activity 0 ->gp_req_activity 747 ->gp_wake_time 68 ->gp_wake_seq 5535689 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->gp_max 713 ->gp_flags 0x0 > > > > The RCU grace-period kthread is in its loop looking for > > quiescent states, and is executing normally ("->gp_activity 0", > > as opposed to some huge number indicating that the kthread was > > never awakened). > > > > [ 3459.804267] rcu: rcu_node 0:0 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->qsmask 0x0 ...G ->n_boosts 0 > > > > The "->qsmask 0x0" says that all CPUs have provided a quiescent > > state, but the "G" indicates that the normal grace period is > > blocked by some task preempted within an RCU read-side critical > > section. This output is strange because a 56-CPU scenario should > > have considerably more output. > > > > Plus this means that this cannot possibly be TREE10 because that > > scenario is non-preemptible, so there cannot be grace periods > > waiting for quiescent states on anything but CPUs. > > Might be missing the point, but with CONFIG_PREEMPT_NONE, you could > be preempted if you exceed your time quanta by more than one tick. 
> Though that of course needs the task to not be in the read-side critical > section. I have three things on my list: (1) Improve the instrumentation so that it captures the grace-period diagnostics periodically in a list of strings, then prints them only if something bad happened, (2) Use bisection to work out which commit instigates this behavior, and (3) that old fallback, code inspection. Other thoughts? Thanx, Paul
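Item (1) on Paul's list — capturing the grace-period diagnostics periodically but printing them only if something bad happened — can be sketched as a small ring of snapshot strings. All names and sizes below are hypothetical, not rcutorture's:

```c
#include <stdio.h>

/* Keep the last few grace-period diagnostic snapshots and dump them only
 * from the failure path (e.g. the OOM notifier), so the periodic
 * captures are nearly free in the common case. */
#define DIAG_SLOTS 8
#define DIAG_LEN   128

static char diag_ring[DIAG_SLOTS][DIAG_LEN];
static unsigned int diag_next;

/* Called periodically, e.g. from the forward-progress kthread. */
static void diag_record(const char *msg)
{
	snprintf(diag_ring[diag_next % DIAG_SLOTS], DIAG_LEN, "%s", msg);
	diag_next++;
}

/* Called only when something bad happened: print the retained history,
 * oldest snapshot first. */
static void diag_dump(void)
{
	unsigned int n = diag_next < DIAG_SLOTS ? diag_next : DIAG_SLOTS;
	unsigned int first = diag_next - n;

	for (unsigned int i = 0; i < n; i++)
		printf("gp diag: %s\n", diag_ring[(first + i) % DIAG_SLOTS]);
}
```

The point of the design is that the expensive part (formatting and printing) is deferred to the failure path, so the diagnostics themselves do not perturb the callback-flood timing being measured.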
On Mon, 19 Feb 2024 08:48:20 -0800 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > I will look again -- it is quite possible that I was confused by earlier > > in-fleet setups that had Tasks RCU enabled even when preemption was > > disabled. (We don't do that anymore, and, had I been paying sufficient > > attention, would not have been doing it to start with. Back in the day, > > enabling rcutorture, even as a module, had the side effect of enabling > > Tasks RCU. How else to test it, right? Well...) > > OK, I got my head straight on this one... > > And the problem is in fact that Tasks RCU isn't normally present > in non-preemptible kernels. This is because normal RCU will wait > for preemption-disabled regions of code, and in PREMPT_NONE and > PREEMPT_VOLUNTARY kernels, that includes pretty much any region of code > lacking an explicit schedule() or similar. And as I understand it, > tracing trampolines rely on this implicit lack of preemption. > > So, with lazy preemption, we could preempt in the middle of a > trampoline, and synchronize_rcu() won't save us. > > Steve and Mathieu will correct me if I am wrong. > > If I do understand this correctly, one workaround is to remove the > "if PREEMPTIBLE" on all occurrences of "select TASKS_RCU". That way, > all kernels would use synchronize_rcu_tasks(), which would wait for > a voluntary context switch. > > This workaround does increase the overhead and tracepoint-removal > latency on non-preemptible kernels, so it might be time to revisit the > synchronization of trampolines. Unfortunately, the things I have come > up with thus far have disadvantages: > > o Keep a set of permanent trampolines that enter and exit > some sort of explicit RCU read-side critical section. > If the address for this trampoline to call is in a register, > then these permanent trampolines remain constant so that > no synchronization of them is required. 
The selected > flavor of RCU can then be used to deal with the non-permanent > trampolines. > > The disadvantage here is a significant increase in the complexity > and overhead of trampoline code and the code that invokes the > trampolines. This overhead limits where tracing may be used > in the kernel, which is of course undesirable. I wonder if we can just see if the instruction pointer at preemption is at something that was allocated? That is, if it __is_kernel(addr) returns false, then we need to do more work. Of course that means modules will also trigger this. We could check __is_module_text() but that does a bit more work and may cause too much overhead. But who knows, if the module check is only done if the __is_kernel() check fails, maybe it's not that bad. -- Steve > > o Check for being preempted within a trampoline, and track this > within the tasks structure. The disadvantage here is that this > requires keeping track of all of the trampolines and adding a > check for being in one on a scheduler fast path. > > o Have a variant of Tasks RCU which checks the stack of preempted > tasks, waiting until all have been seen without being preempted > in a trampoline. This still requires keeping track of all the > trampolines in an easy-to-search manner, but gets the overhead > of searching off of the scheduler fastpaths. > > It is also necessary to check running tasks, which might have > been interrupted from within a trampoline. > > I would have a hard time convincing myself that these return > addresses were unconditionally reliable. But maybe they are? > > o Your idea here! > > Again, the short-term workaround is to remove the "if PREEMPTIBLE" from > all of the "select TASKS_RCU" clauses. > > > > > My next step is to try this on bare metal on a system configured as > > > > is the fleet. But good progress for a week!!! > > > > > > Yeah this is great. Fingers crossed for the wider set of tests. 
> > > > I got what might be a one-off when hitting rcutorture and KASAN harder. > > I am running 320*TRACE01 to see if it reproduces. > > [ . . . ] > > > So, first see if it is reproducible, second enable more diagnostics, > > third make more grace-period sequence numbers available to rcutorture, > > fourth recheck the diagnostics code, and then see where we go from there. > > It might be that lazy preemption needs adjustment, or it might be that > > it just tickled latent diagnostic issues in rcutorture. > > > > (I rarely hit this WARN_ON() except in early development, when the > > problem is usually glaringly obvious, hence all the uncertainty.) > > And it is eminently reproducible. Digging into it...
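Steve's ordering above — the cheap core-kernel-text check first, the costlier module check only when it fails — can be modeled with simple address ranges. The ranges and helper names here are invented for illustration; real code would consult the kernel's section bounds and module list:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical text ranges standing in for core kernel and module text. */
struct text_range { uintptr_t start, end; };

static const struct text_range kernel_text = { 0x1000, 0x2000 };
static const struct text_range module_text = { 0x8000, 0x9000 };

static bool in_range(const struct text_range *r, uintptr_t ip)
{
	return ip >= r->start && ip < r->end;
}

/* Cheap check first; the module lookup runs only when the core check
 * fails, keeping the common case fast. An IP in neither range lies in
 * allocated text -- possibly a tracing trampoline. */
static bool ip_might_be_trampoline(uintptr_t ip)
{
	if (in_range(&kernel_text, ip))
		return false;
	if (in_range(&module_text, ip))
		return false;
	return true;
}
```

This captures the shape of the proposal: the extra work is paid only for the rare preemption whose instruction pointer is outside all known static text.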
On Wed, Feb 21, 2024 at 01:19:01PM -0500, Steven Rostedt wrote: > On Mon, 19 Feb 2024 08:48:20 -0800 > "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > I will look again -- it is quite possible that I was confused by earlier > > > in-fleet setups that had Tasks RCU enabled even when preemption was > > > disabled. (We don't do that anymore, and, had I been paying sufficient > > > attention, would not have been doing it to start with. Back in the day, > > > enabling rcutorture, even as a module, had the side effect of enabling > > > Tasks RCU. How else to test it, right? Well...) > > > > OK, I got my head straight on this one... > > > > And the problem is in fact that Tasks RCU isn't normally present > > in non-preemptible kernels. This is because normal RCU will wait > > for preemption-disabled regions of code, and in PREMPT_NONE and > > PREEMPT_VOLUNTARY kernels, that includes pretty much any region of code > > lacking an explicit schedule() or similar. And as I understand it, > > tracing trampolines rely on this implicit lack of preemption. > > > > So, with lazy preemption, we could preempt in the middle of a > > trampoline, and synchronize_rcu() won't save us. > > > > Steve and Mathieu will correct me if I am wrong. > > > > If I do understand this correctly, one workaround is to remove the > > "if PREEMPTIBLE" on all occurrences of "select TASKS_RCU". That way, > > all kernels would use synchronize_rcu_tasks(), which would wait for > > a voluntary context switch. > > > > This workaround does increase the overhead and tracepoint-removal > > latency on non-preemptible kernels, so it might be time to revisit the > > synchronization of trampolines. Unfortunately, the things I have come > > up with thus far have disadvantages: > > > > o Keep a set of permanent trampolines that enter and exit > > some sort of explicit RCU read-side critical section. 
> > If the address for this trampoline to call is in a register, > > then these permanent trampolines remain constant so that > > no synchronization of them is required. The selected > > flavor of RCU can then be used to deal with the non-permanent > > trampolines. > > > > The disadvantage here is a significant increase in the complexity > > and overhead of trampoline code and the code that invokes the > > trampolines. This overhead limits where tracing may be used > > in the kernel, which is of course undesirable. > > I wonder if we can just see if the instruction pointer at preemption is at > something that was allocated? That is, if it __is_kernel(addr) returns > false, then we need to do more work. Of course that means modules will also > trigger this. We could check __is_module_text() but that does a bit more > work and may cause too much overhead. But who knows, if the module check is > only done if the __is_kernel() check fails, maybe it's not that bad. I do like very much that idea, but it requires that we be able to identify this instruction pointer perfectly, no matter what. It might also require that we be able to perfectly identify any IRQ return addresses as well, for example, if the preemption was triggered within an interrupt handler. And interrupts from softirq environments might require identifying an additional level of IRQ return address. The original IRQ might have interrupted a trampoline, and then after transitioning into softirq, another IRQ might also interrupt a trampoline, and this last IRQ handler might have instigated a preemption. Are there additional levels or mechanisms requiring identifying return addresses? For whatever it is worth, and in case it should prove necessary, I have added a sneak preview of the Kconfig workaround at the end of this message. Thanx, Paul > -- Steve > > > > > o Check for being preempted within a trampoline, and track this > > within the tasks structure. 
The disadvantage here is that this > > requires keeping track of all of the trampolines and adding a > > check for being in one on a scheduler fast path. > > > > o Have a variant of Tasks RCU which checks the stack of preempted > > tasks, waiting until all have been seen without being preempted > > in a trampoline. This still requires keeping track of all the > > trampolines in an easy-to-search manner, but gets the overhead > > of searching off of the scheduler fastpaths. > > > > It is also necessary to check running tasks, which might have > > been interrupted from within a trampoline. > > > > I would have a hard time convincing myself that these return > > addresses were unconditionally reliable. But maybe they are? > > > > o Your idea here! > > > > Again, the short-term workaround is to remove the "if PREEMPTIBLE" from > > all of the "select TASKS_RCU" clauses. > > > > > > > My next step is to try this on bare metal on a system configured as > > > > > is the fleet. But good progress for a week!!! > > > > > > > > Yeah this is great. Fingers crossed for the wider set of tests. > > > > > > I got what might be a one-off when hitting rcutorture and KASAN harder. > > > I am running 320*TRACE01 to see if it reproduces. > > > > [ . . . ] > > > > > So, first see if it is reproducible, second enable more diagnostics, > > > third make more grace-period sequence numbers available to rcutorture, > > > fourth recheck the diagnostics code, and then see where we go from there. > > > It might be that lazy preemption needs adjustment, or it might be that > > > it just tickled latent diagnostic issues in rcutorture. > > > > > > (I rarely hit this WARN_ON() except in early development, when the > > > problem is usually glaringly obvious, hence all the uncertainty.) > > > > And it is eminently reproducible. Digging into it... 
diff --git a/arch/Kconfig b/arch/Kconfig index c91917b508736..154f994547632 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -55,7 +55,7 @@ config KPROBES depends on MODULES depends on HAVE_KPROBES select KALLSYMS - select TASKS_RCU if PREEMPTION + select NEED_TASKS_RCU help Kprobes allows you to trap at almost any kernel address and execute a callback function. register_kprobe() establishes @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST config OPTPROBES def_bool y depends on KPROBES && HAVE_OPTPROBES - select TASKS_RCU if PREEMPTION + select NEED_TASKS_RCU config KPROBES_ON_FTRACE def_bool y diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig index 6a906ff930065..ce9fbc3b27ecf 100644 --- a/kernel/bpf/Kconfig +++ b/kernel/bpf/Kconfig @@ -27,7 +27,7 @@ config BPF_SYSCALL bool "Enable bpf() system call" select BPF select IRQ_WORK - select TASKS_RCU if PREEMPTION + select NEED_TASKS_RCU select TASKS_TRACE_RCU select BINARY_PRINTF select NET_SOCK_MSG if NET diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index 7dca0138260c3..3e079de0f5b43 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -85,9 +85,13 @@ config FORCE_TASKS_RCU idle, and user-mode execution as quiescent states. Not for manual selection in most cases. 
-config TASKS_RCU +config NEED_TASKS_RCU bool default n + +config TASKS_RCU + bool + default NEED_TASKS_RCU && (PREEMPTION || PREEMPT_AUTO) select IRQ_WORK config FORCE_TASKS_RUDE_RCU diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596d..6cdc5ff919b09 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -163,7 +163,7 @@ config TRACING select BINARY_PRINTF select EVENT_TRACING select TRACE_CLOCK - select TASKS_RCU if PREEMPTION + select NEED_TASKS_RCU config GENERIC_TRACER bool @@ -204,7 +204,7 @@ config FUNCTION_TRACER select GENERIC_TRACER select CONTEXT_SWITCH_TRACER select GLOB - select TASKS_RCU if PREEMPTION + select NEED_TASKS_RCU select TASKS_RUDE_RCU help Enable the kernel to trace every kernel function. This is done
On Wed, 21 Feb 2024 11:41:47 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > I wonder if we can just see if the instruction pointer at preemption is at
> > something that was allocated? That is, if it __is_kernel(addr) returns
> > false, then we need to do more work. Of course that means modules will also
> > trigger this. We could check __is_module_text() but that does a bit more
> > work and may cause too much overhead. But who knows, if the module check is
> > only done if the __is_kernel() check fails, maybe it's not that bad.
>
> I do like very much that idea, but it requires that we be able to identify
> this instruction pointer perfectly, no matter what. It might also require
> that we be able to perfectly identify any IRQ return addresses as well,
> for example, if the preemption was triggered within an interrupt handler.
> And interrupts from softirq environments might require identifying an
> additional level of IRQ return address. The original IRQ might have
> interrupted a trampoline, and then after transitioning into softirq,
> another IRQ might also interrupt a trampoline, and this last IRQ handler
> might have instigated a preemption.

Note, softirqs still require a real interrupt to happen in order to preempt
executing code. Otherwise it should never be running from a trampoline.

> Are there additional levels or mechanisms requiring identifying
> return addresses?

Hmm, could we add to irq_enter_rcu()

	__this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));

That is to save off where the ip was when it was interrupted.

Hmm, but it looks like the get_irq_regs() is set up outside of
irq_enter_rcu() :-(

I wonder how hard it would be to change all the architectures to pass in
pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
available.

Either way, it looks like it will be a bit of work around the trampoline or
around RCU to get this efficiently done.

-- Steve
On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> On Wed, 21 Feb 2024 11:41:47 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > > I wonder if we can just see if the instruction pointer at preemption is at
> > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > false, then we need to do more work. Of course that means modules will also
> > > trigger this. We could check __is_module_text() but that does a bit more
> > > work and may cause too much overhead. But who knows, if the module check is
> > > only done if the __is_kernel() check fails, maybe it's not that bad.
> >
> > I do like very much that idea, but it requires that we be able to identify
> > this instruction pointer perfectly, no matter what. It might also require
> > that we be able to perfectly identify any IRQ return addresses as well,
> > for example, if the preemption was triggered within an interrupt handler.
> > And interrupts from softirq environments might require identifying an
> > additional level of IRQ return address. The original IRQ might have
> > interrupted a trampoline, and then after transitioning into softirq,
> > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > might have instigated a preemption.
>
> Note, softirqs still require a real interrupt to happen in order to preempt
> executing code. Otherwise it should never be running from a trampoline.

Yes, the first interrupt interrupted a trampoline. Then, on return,
that interrupt transitioned to softirq (as opposed to ksoftirqd).
While a softirq handler was executing within a trampoline, we got
another interrupt. We thus have two interrupted trampolines.

Or am I missing something that prevents this?

> > Are there additional levels or mechanisms requiring identifying
> > return addresses?
>
> Hmm, could we add to irq_enter_rcu()
>
> 	__this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
>
> That is to save off where the ip was when it was interrupted.
>
> Hmm, but it looks like the get_irq_regs() is set up outside of
> irq_enter_rcu() :-(
>
> I wonder how hard it would be to change all the architectures to pass in
> pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> available.
>
> Either way, it looks like it will be a bit of work around the trampoline or
> around RCU to get this efficiently done.

One approach would be to make Tasks RCU be present for PREEMPT_AUTO
kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
the needed return-address infrastructure, transition those architectures
to something more precise.

Or maybe the workaround will prove to be good enough. We did
inadvertently test it for a year or so at scale, so I am at least
confident that it works. ;-)

							Thanx, Paul
Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>> Configuration tested.
>>> a) Base kernel (6.7),
>> Which scheduling model is the baseline using?
>
> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>
>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>>
>>> Workloads I tested and their %gain,
>>>             case b   case c
>>> NAS         +2.7     +1.9
>>> Hashjoin    +0       +0
>>> XSBench     +1.7     +0
>>> Graph500    -6       +0
>> The Graph500 stands out. Needs some analysis.
>
> Sure. Will do more detailed analysis and come back on this.

Thanks Raghu. Please keep me posted.

Also, let me try to reproduce this locally. Could you post the
parameters that you used for the Graph500 run?

Thanks

--
ankur
On 2/22/2024 2:46 AM, Ankur Arora wrote:
>
> Raghavendra K T <raghavendra.kt@amd.com> writes:
>
>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>> Configuration tested.
>>>> a) Base kernel (6.7),
>>> Which scheduling model is the baseline using?
>>
>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>
>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>>>
>>>> Workloads I tested and their %gain,
>>>>             case b   case c
>>>> NAS         +2.7     +1.9
>>>> Hashjoin    +0       +0
>>>> XSBench     +1.7     +0
>>>> Graph500    -6       +0
>>> The Graph500 stands out. Needs some analysis.
>>
>> Sure. Will do more detailed analysis and come back on this.
>
> Thanks Raghu. Please keep me posted.
>
> Also, let me try to reproduce this locally. Could you post the
> parameters that you used for the Graph500 run?

This was run as part of a test suite; from the output I see the
parameters as:

SCALE: 27
nvtx: 134217728
edgefactor: 16
terasize: 3.43597383679999993e-02
A: 5.69999999999999951e-01
B: 1.90000000000000002e-01
C: 1.90000000000000002e-01
D: 5.00000000000000444e-02
generation_time: 4.93902114900000022e+00
construction_time: 2.55216929010000015e+01
nbfs: 64

Meanwhile, since the stddev I saw for the runs was a little on the
higher side, I did think the results were okay. Rerunning with more
iterations to see if there is a valid concern; if so, I will dig more
deeply as Thomas suggested. Will also post the results of the run.

Thanks and Regards
- Raghu
On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>> Configuration tested.
>> a) Base kernel (6.7),
>
> Which scheduling model is the baseline using?
>
>> b) patched with PREEMPT_AUTO voluntary preemption.
>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>
>> Workloads I tested and their %gain,
>>             case b   case c
>> NAS         +2.7     +1.9
>> Hashjoin    +0       +0
>> XSBench     +1.7     +0
>> Graph500    -6       +0
>
> The Graph500 stands out. Needs some analysis.

Hello Thomas, Ankur,

Because of the high stddev I saw with the Graph500 runs, I continued
taking results with more iterations. Here is the result; it does not
look like there is a concern here. (You can see the *min* side of the
preempt-auto case, which could have produced the negative result in the
earlier analysis. I should have posted the stddev along with that;
sorry for not being louder there.)

Overall this looks good: sometimes better, but all within noise level.

Benchmark = Graph500

x 6.7.0+
+ 6.7.0-preempt-auto+
    N           Min           Max        Median           Avg        Stddev
x  15 6.7165689e+09 7.7607743e+09 7.2213638e+09 7.2759563e+09 3.3353312e+08
+  15 6.4856432e+09  7.942607e+09 7.3115082e+09 7.3386124e+09 4.6474773e+08
No difference proven at 80.0% confidence
No difference proven at 95.0% confidence
No difference proven at 99.0% confidence

Thanks and Regards
- Raghu
On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > On Wed, 21 Feb 2024 11:41:47 -0800
> > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >
> > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > false, then we need to do more work. Of course that means modules will also
> > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > work and may cause too much overhead. But who knows, if the module check is
> > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > >
> > > I do like very much that idea, but it requires that we be able to identify
> > > this instruction pointer perfectly, no matter what. It might also require
> > > that we be able to perfectly identify any IRQ return addresses as well,
> > > for example, if the preemption was triggered within an interrupt handler.
> > > And interrupts from softirq environments might require identifying an
> > > additional level of IRQ return address. The original IRQ might have
> > > interrupted a trampoline, and then after transitioning into softirq,
> > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > might have instigated a preemption.
> >
> > Note, softirqs still require a real interrupt to happen in order to preempt
> > executing code. Otherwise it should never be running from a trampoline.
>
> Yes, the first interrupt interrupted a trampoline. Then, on return,
> that interrupt transitioned to softirq (as opposed to ksoftirqd).
> While a softirq handler was executing within a trampoline, we got
> another interrupt. We thus have two interrupted trampolines.
>
> Or am I missing something that prevents this?

Surely the problematic case is where the first interrupt is taken from a
trampoline, but the inner interrupt is taken from not-a-trampoline? If the
innermost interrupt context is a trampoline, that's the same as that without
any nesting.

We could handle nesting with a thread flag (e.g. TIF_IN_TRAMPOLINE) and a flag
in irqentry_state_t (which is on the stack, and so each nested IRQ gets its
own):

* At IRQ exception entry, if TIF_IN_TRAMPOLINE is clear and pt_regs::ip is a
  trampoline, set TIF_IN_TRAMPOLINE and irqentry_state_t::entered_trampoline.

* At IRQ exception exit, if irqentry_state_t::entered_trampoline is set, clear
  TIF_IN_TRAMPOLINE.

That naturally nests since the inner IRQ sees TIF_IN_TRAMPOLINE is already set
and does nothing on entry or exit, and anything in between can inspect
TIF_IN_TRAMPOLINE and see the right value.

On arm64 we don't dynamically allocate trampolines, *but* we potentially have a
similar problem when changing the active ftrace_ops for a callsite, as all
callsites share a common trampoline in the kernel text which reads a pointer to
an ftrace_ops out of the callsite, then reads ftrace_ops::func from that.

Since the ops could be dynamically allocated, we want to wait for reads of that
to complete before reusing the memory, and ideally we wouldn't have new
entries into the func after we think we'd completed the transition. So Tasks
RCU might be preferable as it waits for both the trampoline *and* the func to
complete.

> > > Are there additional levels or mechanisms requiring identifying
> > > return addresses?
> >
> > Hmm, could we add to irq_enter_rcu()
> >
> > 	__this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
> >
> > That is to save off where the ip was when it was interrupted.
> >
> > Hmm, but it looks like the get_irq_regs() is set up outside of
> > irq_enter_rcu() :-(
> >
> > I wonder how hard it would be to change all the architectures to pass in
> > pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> > available.
> >
> > Either way, it looks like it will be a bit of work around the trampoline or
> > around RCU to get this efficiently done.
>
> One approach would be to make Tasks RCU be present for PREEMPT_AUTO
> kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
> the needed return-address infrastructure, transition those architectures
> to something more precise.

FWIW, that sounds good to me.

Mark.
On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > >
> > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > false, then we need to do more work. Of course that means modules will also
> > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > >
> > > > I do like very much that idea, but it requires that we be able to identify
> > > > this instruction pointer perfectly, no matter what. It might also require
> > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > for example, if the preemption was triggered within an interrupt handler.
> > > > And interrupts from softirq environments might require identifying an
> > > > additional level of IRQ return address. The original IRQ might have
> > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > might have instigated a preemption.
> > >
> > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > executing code. Otherwise it should never be running from a trampoline.
> >
> > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > While a softirq handler was executing within a trampoline, we got
> > another interrupt. We thus have two interrupted trampolines.
> >
> > Or am I missing something that prevents this?
>
> Surely the problematic case is where the first interrupt is taken from a
> trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> innermost interrupt context is a trampoline, that's the same as that without
> any nesting.

It depends. If we wait for each task to not have a trampoline in effect
then yes, we only need to know whether or not a given task has at least
one trampoline in use. One concern with this approach is that a given
task might have at least one trampoline in effect every time it is
checked, unlikely though that might seem.

If this is a problem, one way around it is to instead ask whether the
current task still has a reference to one of a set of trampolines that
has recently been removed. This avoids the problem of a task always
being on some trampoline or another, but requires exact identification
of any and all trampolines a given task is currently using.

Either way, we need some way of determining whether or not a given
PC value resides in a trampoline. This likely requires some data
structure (hash table? tree? something else?) that must be traversed
in order to carry out that determination. Depending on the traversal
overhead, it might (or might not) be necessary to make sure that the
traversal is not on the entry/exit/scheduler fast paths. It is also
necessary to keep the trampoline-use overhead low and the trampoline
call points small.

> We could handle nesting with a thread flag (e.g. TIF_IN_TRAMPOLINE) and a flag
> in irqentry_state_t (which is on the stack, and so each nested IRQ gets its
> own):
>
> * At IRQ exception entry, if TIF_IN_TRAMPOLINE is clear and pt_regs::ip is a
>   trampoline, set TIF_IN_TRAMPOLINE and irqentry_state_t::entered_trampoline.
>
> * At IRQ exception exit, if irqentry_state_t::entered_trampoline is set, clear
>   TIF_IN_TRAMPOLINE.
>
> That naturally nests since the inner IRQ sees TIF_IN_TRAMPOLINE is already set
> and does nothing on entry or exit, and anything in between can inspect
> TIF_IN_TRAMPOLINE and see the right value.

If the overhead of determining whether pt_regs::ip is a trampoline is
sufficiently low, sounds good to me!

I suppose that different architectures might have different answers to
this question, just to keep things entertaining. ;-)

> On arm64 we don't dynamically allocate trampolines, *but* we potentially have a
> similar problem when changing the active ftrace_ops for a callsite, as all
> callsites share a common trampoline in the kernel text which reads a pointer to
> an ftrace_ops out of the callsite, then reads ftrace_ops::func from that.
>
> Since the ops could be dynamically allocated, we want to wait for reads of that
> to complete before reusing the memory, and ideally we wouldn't have new
> entries into the func after we think we'd completed the transition. So Tasks
> RCU might be preferable as it waits for both the trampoline *and* the func to
> complete.

OK, I am guessing that it is easier to work out whether pt_regs::ip is
a trampoline than cases involving dynamic allocation of trampolines.

> > > > Are there additional levels or mechanisms requiring identifying
> > > > return addresses?
> > >
> > > Hmm, could we add to irq_enter_rcu()
> > >
> > > 	__this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
> > >
> > > That is to save off where the ip was when it was interrupted.
> > >
> > > Hmm, but it looks like the get_irq_regs() is set up outside of
> > > irq_enter_rcu() :-(
> > >
> > > I wonder how hard it would be to change all the architectures to pass in
> > > pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> > > available.
> > >
> > > Either way, it looks like it will be a bit of work around the trampoline or
> > > around RCU to get this efficiently done.
> >
> > One approach would be to make Tasks RCU be present for PREEMPT_AUTO
> > kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
> > the needed return-address infrastructure, transition those architectures
> > to something more precise.
>
> FWIW, that sounds good to me.

Thank you, and I will Cc you on the patches.

							Thanx, Paul
On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>> Configuration tested.
>>> a) Base kernel (6.7),
>>
>> Which scheduling model is the baseline using?
>
> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>
>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.

Which RCU variant do you have enabled with a, b, c ?

I.e. PREEMPT_RCU=?

Thanks,

        tglx
Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>> Configuration tested.
>>>> a) Base kernel (6.7),
>>>
>>> Which scheduling model is the baseline using?
>>
>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>
>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>
> Which RCU variant do you have enabled with a, b, c ?
>
> I.e. PREEMPT_RCU=?

Raghu please confirm this, but if the defaults were chosen
then we should have:

>> baseline is also PREEMPT_DYNAMIC with voluntary preemption

PREEMPT_RCU=y

>>>> b) patched with PREEMPT_AUTO voluntary preemption.

If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n.
If with CONFIG_PREEMPT, PREEMPT_RCU=y.

Might be worth rerunning the tests with the other combination
as well (still with voluntary preemption).

>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.

PREEMPT_RCU=y

Thanks

--
ankur
On 2/23/2024 8:44 AM, Ankur Arora wrote:
>
> Thomas Gleixner <tglx@linutronix.de> writes:
>
>> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
>>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>>> Configuration tested.
>>>>> a) Base kernel (6.7),
>>>>
>>>> Which scheduling model is the baseline using?
>>>
>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>>
>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>
>> Which RCU variant do you have enabled with a, b, c ?
>>
>> I.e. PREEMPT_RCU=?
>
> Raghu please confirm this, but if the defaults were chosen
> then we should have:
>
>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
> PREEMPT_RCU=y
>
>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>
> If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n.
> If with CONFIG_PREEMPT, PREEMPT_RCU=y.
>
> Might be worth rerunning the tests with the other combination
> as well (still with voluntary preemption).
>
>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
> PREEMPT_RCU=y

Hello Thomas, Ankur,

Yes, Ankur's understanding is right; defaults were chosen all the time,
so:

a) base 6.7.0+ + PREEMPT_DYNAMIC with voluntary preemption: PREEMPT_RCU=y
b) patched + PREEMPT_AUTO voluntary preemption: PREEMPT_RCU=n
c) patched + PREEMPT_DYNAMIC with voluntary preemption: PREEMPT_RCU=y

I will check with the other combination (CONFIG_PREEMPT/PREEMPT_RCU)
for (b) and come back if I see anything interesting.

Thanks and Regards
- Raghu
On Thu, Feb 22, 2024 at 11:11:34AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> > On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > >
> > > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > > false, then we need to do more work. Of course that means modules will also
> > > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > > >
> > > > > I do like very much that idea, but it requires that we be able to identify
> > > > > this instruction pointer perfectly, no matter what. It might also require
> > > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > > for example, if the preemption was triggered within an interrupt handler.
> > > > > And interrupts from softirq environments might require identifying an
> > > > > additional level of IRQ return address. The original IRQ might have
> > > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > > might have instigated a preemption.
> > > >
> > > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > > executing code. Otherwise it should never be running from a trampoline.
> > >
> > > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > > While a softirq handler was executing within a trampoline, we got
> > > another interrupt. We thus have two interrupted trampolines.
> > >
> > > Or am I missing something that prevents this?
> >
> > Surely the problematic case is where the first interrupt is taken from a
> > trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> > innermost interrupt context is a trampoline, that's the same as that without
> > any nesting.
>
> It depends. If we wait for each task to not have a trampoline in effect
> then yes, we only need to know whether or not a given task has at least
> one trampoline in use. One concern with this approach is that a given
> task might have at least one trampoline in effect every time it is
> checked, unlikely though that might seem.
>
> If this is a problem, one way around it is to instead ask whether the
> current task still has a reference to one of a set of trampolines that
> has recently been removed. This avoids the problem of a task always
> being on some trampoline or another, but requires exact identification
> of any and all trampolines a given task is currently using.
>
> Either way, we need some way of determining whether or not a given
> PC value resides in a trampoline. This likely requires some data
> structure (hash table? tree? something else?) that must be traversed
> in order to carry out that determination. Depending on the traversal
> overhead, it might (or might not) be necessary to make sure that the
> traversal is not on the entry/exit/scheduler fast paths. It is also
> necessary to keep the trampoline-use overhead low and the trampoline
> call points small.

Thanks; I hadn't thought about that shape of livelock problem; with that in
mind my suggestion using flags was inadequate.

I'm definitely in favour of just using Tasks RCU! That's what arm64 does
today, anyhow!

Mark.
On Fri, Feb 23, 2024 at 11:05:45AM +0000, Mark Rutland wrote:
> On Thu, Feb 22, 2024 at 11:11:34AM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> > > On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > > >
> > > > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > > > false, then we need to do more work. Of course that means modules will also
> > > > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > > > >
> > > > > > I do like very much that idea, but it requires that we be able to identify
> > > > > > this instruction pointer perfectly, no matter what. It might also require
> > > > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > > > for example, if the preemption was triggered within an interrupt handler.
> > > > > > And interrupts from softirq environments might require identifying an
> > > > > > additional level of IRQ return address. The original IRQ might have
> > > > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > > > might have instigated a preemption.
> > > > >
> > > > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > > > executing code. Otherwise it should never be running from a trampoline.
> > > >
> > > > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > > > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > > > While a softirq handler was executing within a trampoline, we got
> > > > another interrupt. We thus have two interrupted trampolines.
> > > >
> > > > Or am I missing something that prevents this?
> > >
> > > Surely the problematic case is where the first interrupt is taken from a
> > > trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> > > innermost interrupt context is a trampoline, that's the same as that without
> > > any nesting.
> >
> > It depends. If we wait for each task to not have a trampoline in effect
> > then yes, we only need to know whether or not a given task has at least
> > one trampoline in use. One concern with this approach is that a given
> > task might have at least one trampoline in effect every time it is
> > checked, unlikely though that might seem.
> >
> > If this is a problem, one way around it is to instead ask whether the
> > current task still has a reference to one of a set of trampolines that
> > has recently been removed. This avoids the problem of a task always
> > being on some trampoline or another, but requires exact identification
> > of any and all trampolines a given task is currently using.
> >
> > Either way, we need some way of determining whether or not a given
> > PC value resides in a trampoline. This likely requires some data
> > structure (hash table? tree? something else?) that must be traversed
> > in order to carry out that determination. Depending on the traversal
> > overhead, it might (or might not) be necessary to make sure that the
> > traversal is not on the entry/exit/scheduler fast paths. It is also
> > necessary to keep the trampoline-use overhead low and the trampoline
> > call points small.
>
> Thanks; I hadn't thought about that shape of livelock problem; with that in
> mind my suggestion using flags was inadequate.
>
> I'm definitely in favour of just using Tasks RCU! That's what arm64 does today,
> anyhow!

Full speed ahead, then!!! But if you come up with a nicer solution,
please do not keep it a secret!

							Thanx, Paul
On 2/23/2024 11:58 AM, Raghavendra K T wrote: > On 2/23/2024 8:44 AM, Ankur Arora wrote: >> >> Thomas Gleixner <tglx@linutronix.de> writes: >> >>> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote: >>>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote: >>>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote: >>>>>> Configuration tested. >>>>>> a) Base kernel (6.7), >>>>> >>>>> Which scheduling model is the baseline using? >>>>> >>>> >>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption >>>> >>>>>> b) patched with PREEMPT_AUTO voluntary preemption. >>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption. >>> >>> Which RCU variant do you have enabled with a, b, c ? >>> >>> I.e. PREEMPT_RCU=? >> >> Raghu please confirm this, but if the defaults were chosen >> then we should have: >> >>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption >> PREEMPT_RCU=y >> >>>>>> b) patched with PREEMPT_AUTO voluntary preemption. >> >> If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n. >> If with CONFIG_PREEMPT, PREEMPT_RCU=y. >> >> Might be worth rerunning the tests with the other combination >> as well (still with voluntary preemption). >> >>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption. >> PREEMPT_RCU=y > > Hello Thomas, Ankur, > Yes, Ankur's understanding is right, defaults were chosen all the time so > for > a) base 6.7.0+ + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y > b) patched + PREEMPT_AUTO voluntary preemption. PREEMPT_RCU = n > c) patched + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y > I will check with other combination (CONFIG_PREEMPT/PREEMPT_RCU) for (b) > and comeback if I see anything interesting. > I see that d) patched + PREEMPT_AUTO=y voluntary preemption CONFIG_PREEMPT, PREEMPT_RCU = y All the results at 80% confidence case (d) HashJoin 0% Graph500 0% XSBench +1.2% NAS-ft +2.1% In general averages are better for all the benchmarks but at 99% confidence there seem to be no difference. 
Overall looks on par or better for case (d)

Thanks and Regards
- Raghu
Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 2/23/2024 11:58 AM, Raghavendra K T wrote:
>> On 2/23/2024 8:44 AM, Ankur Arora wrote:
>>>
>>> Thomas Gleixner <tglx@linutronix.de> writes:
>>>
>>>> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
>>>>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>>>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>>>>> Configuration tested.
>>>>>>> a) Base kernel (6.7),
>>>>>>
>>>>>> Which scheduling model is the baseline using?
>>>>>>
>>>>>
>>>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>>>>
>>>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>>>
>>>> Which RCU variant do you have enabled with a, b, c ?
>>>>
>>>> I.e. PREEMPT_RCU=?
>>>
>>> Raghu please confirm this, but if the defaults were chosen
>>> then we should have:
>>>
>>>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>> PREEMPT_RCU=y
>>>
>>>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>
>>> If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n.
>>> If with CONFIG_PREEMPT, PREEMPT_RCU=y.
>>>
>>> Might be worth rerunning the tests with the other combination
>>> as well (still with voluntary preemption).
>>>
>>>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>> PREEMPT_RCU=y
>>
>> Hello Thomas, Ankur,
>> Yes, Ankur's understanding is right, defaults were chosen all the time, so
>> for
>> a) base 6.7.0+ + PREEMPT_DYNAMIC with voluntary preemption, PREEMPT_RCU=y
>> b) patched + PREEMPT_AUTO voluntary preemption, PREEMPT_RCU=n
>> c) patched + PREEMPT_DYNAMIC with voluntary preemption, PREEMPT_RCU=y
>
>> I will check with the other combination (CONFIG_PREEMPT/PREEMPT_RCU) for (b)
>> and come back if I see anything interesting.
>>
>
> I see that
>
> d) patched + PREEMPT_AUTO=y voluntary preemption CONFIG_PREEMPT, PREEMPT_RCU=y
>
> All the results at 80% confidence
> case (d)
> HashJoin    0%
> Graph500    0%
> XSBench    +1.2%
> NAS-ft     +2.1%
>
> In general averages are better for all the benchmarks, but at 99%
> confidence there seems to be no difference.
>
> Overall looks on par or better for case (d)

Thanks for running all of these Raghu. The numbers look pretty good
(better than I expected honestly).

--
ankur
On Fri, Feb 23, 2024 at 07:31:50AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 23, 2024 at 11:05:45AM +0000, Mark Rutland wrote:
> > On Thu, Feb 22, 2024 at 11:11:34AM -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> > > > On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > > > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > > > > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > > > > something that was allocated? That is, if __is_kernel(addr) returns
> > > > > > > > false, then we need to do more work. Of course that means modules will also
> > > > > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > > > > >
> > > > > > > I do like very much that idea, but it requires that we be able to identify
> > > > > > > this instruction pointer perfectly, no matter what. It might also require
> > > > > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > > > > for example, if the preemption was triggered within an interrupt handler.
> > > > > > > And interrupts from softirq environments might require identifying an
> > > > > > > additional level of IRQ return address. The original IRQ might have
> > > > > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > > > > might have instigated a preemption.
> > > > > >
> > > > > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > > > > executing code.
> > > > > > Otherwise it should never be running from a trampoline.
> > > > >
> > > > > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > > > > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > > > > While a softirq handler was executing within a trampoline, we got
> > > > > another interrupt. We thus have two interrupted trampolines.
> > > > >
> > > > > Or am I missing something that prevents this?
> > > >
> > > > Surely the problematic case is where the first interrupt is taken from a
> > > > trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> > > > innermost interrupt context is a trampoline, that's the same as that without
> > > > any nesting.
> > >
> > > It depends. If we wait for each task to not have a trampoline in effect
> > > then yes, we only need to know whether or not a given task has at least
> > > one trampoline in use. One concern with this approach is that a given
> > > task might have at least one trampoline in effect every time it is
> > > checked, unlikely though that might seem.
> > >
> > > If this is a problem, one way around it is to instead ask whether the
> > > current task still has a reference to one of a set of trampolines that
> > > has recently been removed. This avoids the problem of a task always
> > > being on some trampoline or another, but requires exact identification
> > > of any and all trampolines a given task is currently using.
> > >
> > > Either way, we need some way of determining whether or not a given
> > > PC value resides in a trampoline. This likely requires some data
> > > structure (hash table? tree? something else?) that must be traversed
> > > in order to carry out that determination. Depending on the traversal
> > > overhead, it might (or might not) be necessary to make sure that the
> > > traversal is not on the entry/exit/scheduler fast paths. It is also
> > > necessary to keep the trampoline-use overhead low and the trampoline
> > > call points small.
> >
> > Thanks; I hadn't thought about that shape of livelock problem; with that in
> > mind my suggestion using flags was inadequate.
> >
> > I'm definitely in favour of just using Tasks RCU! That's what arm64 does today,
> > anyhow!
>
> Full speed ahead, then!!!  But if you come up with a nicer solution,
> please do not keep it a secret!

The networking NAPI code ends up needing special help to avoid starving
Tasks RCU grace periods [1]. I am therefore revisiting trying to make
Tasks RCU directly detect trampoline usage, but without quite as much
need to identify specific trampolines...

I am putting this information in a Google document for future
reference [2].

Thoughts?

							Thanx, Paul

[1] https://lore.kernel.org/all/Zd4DXTyCf17lcTfq@debian.debian/
[2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing