Message ID | 20230131224236.122805-6-eric.devolder@oracle.com |
---|---|
State | New |
Series | crash: Kernel handling of CPU and memory hot un/plug |
Commit Message
Eric DeVolder
Jan. 31, 2023, 10:42 p.m. UTC
In crash_prepare_elf64_headers(), for_each_present_cpu() is used to create the new elfcorehdr. When handling CPU hot unplug/offline events, the CPU is still in the for_each_present_cpu() list (the CPU does not leave that list until cpuhp state processing reaches CPUHP_OFFLINE). Thus the CPU must be explicitly excluded when building the new list of CPUs.

This change identifies the CPU to be excluded in handle_hotplug_event(), and adds the corresponding check to crash_prepare_elf64_headers() so that CPU is skipped.

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
---
 kernel/crash_core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
Comments
Eric!

On Tue, Jan 31 2023 at 17:42, Eric DeVolder wrote:
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
>
>  	/* Prepare one phdr of type PT_NOTE for each present CPU */
>  	for_each_present_cpu(cpu) {
> +#ifdef CONFIG_CRASH_HOTPLUG
> +		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
> +			/* Skip the soon-to-be offlined cpu */
> +			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
> +			    (cpu == image->offlinecpu))
> +				continue;
> +		}
> +#endif

I'm failing to see how the above is correct in any way. Look at the
following sequence of events:

 1) Offline CPU$N

    -> Prepare elf headers with CPU$N excluded

 2) Another hotplug operation != 'Online CPU$N'

    -> Prepare elf headers with CPU$N included

Also in case of loading the crash kernel in the situation where not all
present CPUs are online (think boot time SMT disable) then your
resulting crash image will contain all present CPUs and none of the
offline CPUs are excluded.

How does that make any sense at all?

This image->hp_action and image->offlinecpu dance is engineering
voodoo. You just can do:

    for_each_present_cpu(cpu) {
        if (!cpu_online(cpu))
            continue;
        do_stuff(cpu);

which does the right thing in all situations and can be further
simplified to:

    for_each_online_cpu(cpu) {
        do_stuff(cpu);

without the need for ifdefs or whatever.

No?

Thanks,

        tglx
Hello Thomas,

On 01/02/23 17:03, Thomas Gleixner wrote:
> Eric!
>
> On Tue, Jan 31 2023 at 17:42, Eric DeVolder wrote:
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
>>
>>  	/* Prepare one phdr of type PT_NOTE for each present CPU */
>>  	for_each_present_cpu(cpu) {
>> +#ifdef CONFIG_CRASH_HOTPLUG
>> +		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
>> +			/* Skip the soon-to-be offlined cpu */
>> +			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
>> +			    (cpu == image->offlinecpu))
>> +				continue;
>> +		}
>> +#endif
> I'm failing to see how the above is correct in any way. Look at the
> following sequence of events:
>
> 1) Offline CPU$N
>
>    -> Prepare elf headers with CPU$N excluded
>
> 2) Another hotplug operation != 'Online CPU$N'
>
>    -> Prepare elf headers with CPU$N included
>
> Also in case of loading the crash kernel in the situation where not all
> present CPUs are online (think boot time SMT disable) then your
> resulting crash image will contain all present CPUs and none of the
> offline CPUs are excluded.
>
> How does that make any sense at all?
>
> This image->hp_action and image->offlinecpu dance is engineering
> voodoo. You just can do:
>
>     for_each_present_cpu(cpu) {
>         if (!cpu_online(cpu))
>             continue;
>         do_stuff(cpu);
>
> which does the right thing in all situations and can be further
> simplified to:
>
>     for_each_online_cpu(cpu) {
>         do_stuff(cpu);

What will be the implication on x86 if we pack PT_NOTEs for possible CPUs?

IIUC, on boot the crash notes are created for possible CPUs using pcpu_alloc,
and when the system is on the crash path the crash notes for online CPUs are
populated with the required data while the rest of the crash notes are untouched.
And I think the /proc/vmcore generation in the kdump/second kernel and makedumpfile
do take care of empty crash notes belonging to offline CPUs.

Any thoughts?

Thanks,
Sourabh
On Mon, Feb 06 2023 at 13:42, Sourabh Jain wrote:
> On 01/02/23 17:03, Thomas Gleixner wrote:
>> Also in case of loading the crash kernel in the situation where not all
>> present CPUs are online (think boot time SMT disable) then your
>> resulting crash image will contain all present CPUs and none of the
>> offline CPUs are excluded.
>>
>> How does that make any sense at all?
>>
>> This image->hp_action and image->offlinecpu dance is engineering
>> voodoo. You just can do:
>>
>>     for_each_present_cpu(cpu) {
>>         if (!cpu_online(cpu))
>>             continue;
>>         do_stuff(cpu);
>>
>> which does the right thing in all situations and can be further
>> simplified to:
>>
>>     for_each_online_cpu(cpu) {
>>         do_stuff(cpu);
>
> What will be the implication on x86 if we pack PT_NOTEs for possible CPUs?

I don't know.

> IIUC, on boot the crash notes are created for possible CPUs using pcpu_alloc,
> and when the system is on the crash path the crash notes for online CPUs are
> populated with the required data while the rest of the crash notes are untouched.

Which should be fine. That's a problem of postprocessing and it's
unclear to me from the changelogs what the actual problem is which is
trying to be solved here.

Thanks,

        tglx
On 2/1/23 05:33, Thomas Gleixner wrote:
> Eric!
>
> On Tue, Jan 31 2023 at 17:42, Eric DeVolder wrote:
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
>>
>>  	/* Prepare one phdr of type PT_NOTE for each present CPU */
>>  	for_each_present_cpu(cpu) {
>> +#ifdef CONFIG_CRASH_HOTPLUG
>> +		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
>> +			/* Skip the soon-to-be offlined cpu */
>> +			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
>> +			    (cpu == image->offlinecpu))
>> +				continue;
>> +		}
>> +#endif
>
> I'm failing to see how the above is correct in any way. Look at the
> following sequence of events:
>
> 1) Offline CPU$N
>
>    -> Prepare elf headers with CPU$N excluded
>
> 2) Another hotplug operation != 'Online CPU$N'
>
>    -> Prepare elf headers with CPU$N included
>
> Also in case of loading the crash kernel in the situation where not all
> present CPUs are online (think boot time SMT disable) then your
> resulting crash image will contain all present CPUs and none of the
> offline CPUs are excluded.
>
> How does that make any sense at all?
>
> This image->hp_action and image->offlinecpu dance is engineering
> voodoo. You just can do:
>
>     for_each_present_cpu(cpu) {
>         if (!cpu_online(cpu))
>             continue;
>         do_stuff(cpu);
>
> which does the right thing in all situations and can be further
> simplified to:
>
>     for_each_online_cpu(cpu) {
>         do_stuff(cpu);
>
> without the need for ifdefs or whatever.
>
> No?
>
> Thanks,
>
>         tglx

Thomas,
I've been re-examining the cpuhp framework and understand its operation a bit better.

Up until now, this patch series has been using either CPUHP_AP_ONLINE_DYN or, more recently,
CPUHP_BP_PREPARE_DYN with the same handler for both the startup and teardown callbacks. This
resulted in the cpu state, as seen by my handler, being incorrect in one direction or the other.
For example, when using CPUHP_AP_ONLINE_DYN, cpu_online() always resulted in 1 for the cpu in my
callback, even during tear down. For CPUHP_BP_PREPARE_DYN, cpu_online() always resulted in 0.
Thus the offlinecpu voodoo. But no more!

The reason, as I now understand, is simple. A cpu will not show as online until after state
CPUHP_BRINGUP_CPU (when working from CPUHP_OFFLINE towards CPUHP_ONLINE). And a cpu will not show
as offline until after state CPUHP_TEARDOWN_CPU (when working in reverse order from CPUHP_ONLINE
to CPUHP_OFFLINE).

The CPUHP_BRINGUP_CPU is the last state of the PREPARE section, and boots the new cpu. It is code
running on the booting cpu that marks itself as online.

 CPUHP_BRINGUP_CPU
  .startup()
   bringup_cpu()
    __cpu_up()
     smp_ops.cpu_up()
      native_cpu_up()
       do_boot_cpu()
 ===== on new cpu! =====
       start_secondary()
        set_cpu_online(true)

There are quite a few CPUHP_..._STARTING states before the cpu is in a productive state.

The CPUHP_TEARDOWN_CPU is the last state in the STARTING section, and takes the cpu down.
Work/irqs are removed from this cpu and re-assigned to others.

 CPUHP_TEARDOWN_CPU
  .teardown()
   takedown_cpu()
    take_cpu_down()
     __cpu_disable()
      smp_ops.cpu_disable()
       native_cpu_disable()
        cpu_disable_common()
         remove_cpu_from_maps()
          set_cpu_online(false)

So my latest solution is to introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.

The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My attempts at
locating this state failed when inside the STARTING section, so I located this just inside
the ONLINE section. The crash hotplug handler is registered on this state as the callback
for the .startup method.

The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I placed it
at the end of the PREPARE section. This crash hotplug handler is also registered on this state
as the callback for the .teardown method.

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6c6859bfc454..52d2db4d793e 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -131,6 +131,7 @@ enum cpuhp_state {
 	CPUHP_ZCOMP_PREPARE,
 	CPUHP_TIMERS_PREPARE,
 	CPUHP_MIPS_SOC_PREPARE,
+	CPUHP_BP_ELFCOREHDR_OFFLINE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END	= CPUHP_BP_PREPARE_DYN + 20,
 	CPUHP_BRINGUP_CPU,
@@ -205,6 +206,7 @@ enum cpuhp_state {
 	/* Online section invoked on the hotplugged CPU from the hotplug thread */
 	CPUHP_AP_ONLINE_IDLE,
+	CPUHP_AP_ELFCOREHDR_ONLINE,
 	CPUHP_AP_SCHED_WAIT_EMPTY,
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 8a439b6d723b..e1a3430f06f4 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
+	if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
+		result = cpuhp_setup_state_nocalls(CPUHP_AP_ELFCOREHDR_ONLINE,
+			"crash/cpuhp_online", crash_cpuhp_online, NULL);
+		result = cpuhp_setup_state_nocalls(CPUHP_BP_ELFCOREHDR_OFFLINE,
+			"crash/cpuhp_offline", NULL, crash_cpuhp_offline);
+	}

With the above, there is no need for offlinecpu, as the crash hotplug handler callback now
observes the correct cpu_online() state in both online and offline activities.

Which leads me to the next item. Thomas, you suggested:

    for_each_online_cpu(cpu) {
        do_stuff(cpu);

I've been looking into this further, and don't yet have a conclusion. In light of Sourabh's
comments/concerns about packing PT_NOTES, I need to determine whether my introduction of

    if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
        if (!cpu_online(cpu))
            continue;
    }

causes other downstream issues. My testing was focused on hot plug/unplugging cpus in a
last-on-first-off manner, whereas I now realize cpus can be onlined/offlined sparsely (thus the
PT_NOTE packing concern). I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
makedumpfile and (the consumer of it all) the userspace crash utility, in order to understand
the impact of moving from for_each_present_cpu() to for_each_online_cpu().

At any rate, I wanted to at least put forth the introduction of the two new CPUHP states and
solicit feedback there while I investigate the for_each_online_cpu() matter.

Thanks for pushing me on this topic!
eric
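The registration snippet above references crash_cpuhp_online() and crash_cpuhp_offline() without showing them. A minimal sketch of what such callbacks might look like, assuming they simply invoke the series' crash hotplug handler: KEXEC_CRASH_HP_REMOVE_CPU appears in the quoted patch, while handle_hotplug_event()'s exact signature and KEXEC_CRASH_HP_ADD_CPU are assumed here by symmetry.

```c
/* Sketch only; handle_hotplug_event() and the KEXEC_CRASH_HP_* action
 * values are assumed from the rest of this series, not shown here. */
static int crash_cpuhp_online(unsigned int cpu)
{
	handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
	return 0;
}

static int crash_cpuhp_offline(unsigned int cpu)
{
	handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
	return 0;
}
```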
Eric!

On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
> On 2/1/23 05:33, Thomas Gleixner wrote:
>
> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>
> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
> attempts at locating this state failed when inside the STARTING section, so I located
> this just inside the ONLINE section. The crash hotplug handler is registered on
> this state as the callback for the .startup method.
>
> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
> placed it at the end of the PREPARE section. This crash hotplug handler is also
> registered on this state as the callback for the .teardown method.

TBH, that's still overengineered. Something like this:

    bool cpu_is_alive(unsigned int cpu)
    {
        struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

        return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
    }

and use this to query the actual state at crash time. That spares all
those callback heuristics.

> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
> makedumpfile and (the consumer of it all) the userspace crash utility,
> in order to understand the impact of moving from for_each_present_cpu()
> to for_each_online_cpu().

Is the packing actually worth the trouble? What's the actual win?

Thanks,

        tglx
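A hypothetical sketch of how such a query could replace the online/offline callbacks when walking CPUs (not from the thread, and assuming cpu_is_alive() is made to return true for CPUs that actually came up):

```c
/* Hypothetical use of the helper above: walk present CPUs and skip the
 * ones that never came up, with no hotplug callbacks needed. */
for_each_present_cpu(cpu) {
	if (!cpu_is_alive(cpu))
		continue;
	/* ... emit the PT_NOTE phdr for this CPU's crash notes ... */
}
```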
On 2/8/23 07:44, Thomas Gleixner wrote:
> Eric!
>
> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>
>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>
>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>> attempts at locating this state failed when inside the STARTING section, so I located
>> this just inside the ONLINE section. The crash hotplug handler is registered on
>> this state as the callback for the .startup method.
>>
>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>> registered on this state as the callback for the .teardown method.
>
> TBH, that's still overengineered. Something like this:
>
>     bool cpu_is_alive(unsigned int cpu)
>     {
>         struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>
>         return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>     }
>
> and use this to query the actual state at crash time. That spares all
> those callback heuristics.
>
>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
>> makedumpfile and (the consumer of it all) the userspace crash utility,
>> in order to understand the impact of moving from for_each_present_cpu()
>> to for_each_online_cpu().
>
> Is the packing actually worth the trouble? What's the actual win?
>
> Thanks,
>
>         tglx

Thomas,
I've investigated the passing of crash notes through the vmcore. What I've learned is that:

- linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does not care what the
  contents of cpu PT_NOTES are, but it does coalesce them together.

- makedumpfile will count the number of cpu PT_NOTES in order to determine its nr_cpus variable,
  which is reported in a header, but otherwise unused (except for the sadump method).

- the crash utility, for the purposes of determining the cpus, does not appear to reference the
  elfcorehdr PT_NOTEs. Instead it locates the various cpu_[possible|present|online]_mask and
  computes nr_cpus from that, and also of course which are online. In addition, when crash does
  reference the cpu PT_NOTE, to get its prstatus, it does so by using a percpu technique directly
  in the vmcore image memory, not via the ELF structure. Said differently, it appears to me that
  the crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them via kernel
  cpumasks and the memory within the vmcore.

With this understanding, I did some testing. Perhaps the most telling test was that I changed the
number of cpu PT_NOTEs emitted in crash_prepare_elf64_headers() to just 1, hot plugged some cpus,
then also took a few offline sparsely via chcpu, then generated a vmcore. The crash utility had
no problem loading the vmcore, it reported the proper number of cpus and the number offline
(despite only one cpu PT_NOTE), and changing to a different cpu via 'set -c 30' worked and the
backtrace was completely valid.

My take away is that the crash utility does not rely upon ELF cpu PT_NOTEs; it obtains the cpu
information directly from kernel data structures. Perhaps at one time crash relied upon the ELF
information, but no more. (Perhaps there are other crash dump analyzers that might rely on the
ELF info?)

So, all this to say that I see no need to change crash_prepare_elf64_headers(). There is no
compelling reason to move away from for_each_present_cpu(), or modify the list for online/offline.

Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting back to
the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to be a compelling
need to accurately track whether the cpu went online/offline for the purposes of creating the
elfcorehdr, as ultimately the crash utility pulls that from kernel data structures, not the
elfcorehdr.

I think this is what Sourabh has known, and he has been advocating for an optimization path that
avoids regenerating the elfcorehdr on cpu changes (because all the percpu structs are already
laid out). I do think it best to leave that as an arch choice.

Comments?

Thanks!
eric
Hello Eric, On 09/02/23 23:01, Eric DeVolder wrote: > > > On 2/8/23 07:44, Thomas Gleixner wrote: >> Eric! >> >> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>> On 2/1/23 05:33, Thomas Gleixner wrote: >>> >>> So my latest solution is introduce two new CPUHP states, >>> CPUHP_AP_ELFCOREHDR_ONLINE >>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open >>> to better names. >>> >>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after >>> CPUHP_BRINGUP_CPU. My >>> attempts at locating this state failed when inside the STARTING >>> section, so I located >>> this just inside the ONLINE sectoin. The crash hotplug handler is >>> registered on >>> this state as the callback for the .startup method. >>> >>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before >>> CPUHP_TEARDOWN_CPU, and I >>> placed it at the end of the PREPARE section. This crash hotplug >>> handler is also >>> registered on this state as the callback for the .teardown method. >> >> TBH, that's still overengineered. Something like this: >> >> bool cpu_is_alive(unsigned int cpu) >> { >> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >> >> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >> } >> >> and use this to query the actual state at crash time. That spares all >> those callback heuristics. >> >>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>> makedumpfile and (the consumer of it all) the userspace crash utility, >>> in order to understand the impact of moving from for_each_present_cpu() >>> to for_each_online_cpu(). >> >> Is the packing actually worth the trouble? What's the actual win? >> >> Thanks, >> >> tglx >> >> > > Thomas, > I've investigated the passing of crash notes through the vmcore. What > I've learned is that: > > - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) > does > not care what the contents of cpu PT_NOTES are, but it does coalesce > them together. > > - makedumpfile will count the number of cpu PT_NOTES in order to > determine its > nr_cpus variable, which is reported in a header, but otherwise > unused (except > for sadump method). > > - the crash utility, for the purposes of determining the cpus, does > not appear to > reference the elfcorehdr PT_NOTEs. Instead it locates the various > cpu_[possible|present|online]_mask and computes nr_cpus from that, > and also of > course which are online. In addition, when crash does reference the > cpu PT_NOTE, > to get its prstatus, it does so by using a percpu technique directly > in the vmcore > image memory, not via the ELF structure. Said differently, it > appears to me that > crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it > obtains them > via kernel cpumasks and the memory within the vmcore. > > With this understanding, I did some testing. Perhaps the most telling > test was that I > changed the number of cpu PT_NOTEs emitted in the > crash_prepare_elf64_headers() to just 1, > hot plugged some cpus, then also took a few offline sparsely via > chcpu, then generated a > vmcore. The crash utility had no problem loading the vmcore, it > reported the proper number > of cpus and the number offline (despite only one cpu PT_NOTE), and > changing to a different > cpu via 'set -c 30' and the backtrace was completely valid. > > My take away is that crash utility does not rely upon ELF cpu > PT_NOTEs, it obtains the > cpu information directly from kernel data structures. Perhaps at one > time crash relied > upon the ELF information, but no more. 
(Perhaps there are other crash > dump analyzers > that might rely on the ELF info?) > > So, all this to say that I see no need to change > crash_prepare_elf64_headers(). There > is no compelling reason to move away from for_each_present_cpu(), or > modify the list for > online/offline. > > Which then leaves the topic of the cpuhp state on which to register. > Perhaps reverting > back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There > does not appear to > be a compelling need to accurately track whether the cpu went > online/offline for the > purposes of creating the elfcorehdr, as ultimately the crash utility > pulls that from > kernel data structures, not the elfcorehdr. > > I think this is what Sourabh has known and has been advocating for an > optimization > path that allows not regenerating the elfcorehdr on cpu changes > (because all the percpu > structs are all laid out). I do think it best to leave that as an arch > choice. Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c], makedumpfile, and crash tool I need your opinion on this: Do we really need to regenerate elfcorehdr for CPU hotplug events? If yes, can you please list the elfcorehdr components that changes due to CPU hotplug. From what I understood, crash notes are prepared for possible CPUs as system boots and could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr during the kdump kernel load. Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to regenerate it for CPU hotplug events. Or do we? Thanks, Sourabh Jain
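For reference, the idea Sourabh describes would amount to switching the existing per-present-CPU program header loop in crash_prepare_elf64_headers() over to possible CPUs; a simplified sketch modeled on that loop (surrounding variable setup and bounds handling omitted):

```c
/* One PT_NOTE program header per possible CPU, each pointing at that
 * CPU's preallocated percpu crash_notes buffer. */
for_each_possible_cpu(cpu) {
	phdr->p_type = PT_NOTE;
	notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
	phdr->p_offset = phdr->p_paddr = notes_addr;
	phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
	(ehdr->e_phnum)++;
	phdr++;
}
```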
On 2/9/23 12:43, Sourabh Jain wrote: > Hello Eric, > > On 09/02/23 23:01, Eric DeVolder wrote: >> >> >> On 2/8/23 07:44, Thomas Gleixner wrote: >>> Eric! >>> >>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>> >>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE >>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names. >>>> >>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My >>>> attempts at locating this state failed when inside the STARTING section, so I located >>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on >>>> this state as the callback for the .startup method. >>>> >>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I >>>> placed it at the end of the PREPARE section. This crash hotplug handler is also >>>> registered on this state as the callback for the .teardown method. >>> >>> TBH, that's still overengineered. Something like this: >>> >>> bool cpu_is_alive(unsigned int cpu) >>> { >>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>> >>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>> } >>> >>> and use this to query the actual state at crash time. That spares all >>> those callback heuristics. >>> >>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>>> makedumpfile and (the consumer of it all) the userspace crash utility, >>>> in order to understand the impact of moving from for_each_present_cpu() >>>> to for_each_online_cpu(). >>> >>> Is the packing actually worth the trouble? What's the actual win? >>> >>> Thanks, >>> >>> tglx >>> >>> >> >> Thomas, >> I've investigated the passing of crash notes through the vmcore. What I've learned is that: >> >> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does >> not care what the contents of cpu PT_NOTES are, but it does coalesce them together. >> >> - makedumpfile will count the number of cpu PT_NOTES in order to determine its >> nr_cpus variable, which is reported in a header, but otherwise unused (except >> for sadump method). >> >> - the crash utility, for the purposes of determining the cpus, does not appear to >> reference the elfcorehdr PT_NOTEs. Instead it locates the various >> cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of >> course which are online. In addition, when crash does reference the cpu PT_NOTE, >> to get its prstatus, it does so by using a percpu technique directly in the vmcore >> image memory, not via the ELF structure. Said differently, it appears to me that >> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them >> via kernel cpumasks and the memory within the vmcore. >> >> With this understanding, I did some testing. Perhaps the most telling test was that I >> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1, >> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a >> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number >> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different >> cpu via 'set -c 30' and the backtrace was completely valid. >> >> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the >> cpu information directly from kernel data structures. 
Perhaps at one time crash relied >> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers >> that might rely on the ELF info?) >> >> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There >> is no compelling reason to move away from for_each_present_cpu(), or modify the list for >> online/offline. >> >> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting >> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to >> be a compelling need to accurately track whether the cpu went online/offline for the >> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from >> kernel data structures, not the elfcorehdr. >> >> I think this is what Sourabh has known and has been advocating for an optimization >> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu >> structs are all laid out). I do think it best to leave that as an arch choice. > > Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c], > makedumpfile, and crash tool I need your opinion on this: > > Do we really need to regenerate elfcorehdr for CPU hotplug events? > If yes, can you please list the elfcorehdr components that changes due to CPU hotplug. Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the number of cpu PT_NOTEs (as the cpus are still present). > > From what I understood, crash notes are prepared for possible CPUs as system boots and > could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr > during the kdump kernel load. > > Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to > regenerate it for CPU hotplug events. Or do we? For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However, for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The caveat here of course is that if crash utility is the only coredump analyzer of concern, then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them. Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts any of this. Perhaps the one item that might help here is to distinguish between actual hot un/plug of cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug event and an online event (and unplug/offline). If those were distinguishable, then we could only regenerate on un/plug events. Or perhaps moving to for_each_possible_cpu() is the better choice? eric > > Thanks, > Sourabh Jain
On 10/02/23 01:09, Eric DeVolder wrote: > > > On 2/9/23 12:43, Sourabh Jain wrote: >> Hello Eric, >> >> On 09/02/23 23:01, Eric DeVolder wrote: >>> >>> >>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>> Eric! >>>> >>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>> >>>>> So my latest solution is introduce two new CPUHP states, >>>>> CPUHP_AP_ELFCOREHDR_ONLINE >>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm >>>>> open to better names. >>>>> >>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after >>>>> CPUHP_BRINGUP_CPU. My >>>>> attempts at locating this state failed when inside the STARTING >>>>> section, so I located >>>>> this just inside the ONLINE sectoin. The crash hotplug handler is >>>>> registered on >>>>> this state as the callback for the .startup method. >>>>> >>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before >>>>> CPUHP_TEARDOWN_CPU, and I >>>>> placed it at the end of the PREPARE section. This crash hotplug >>>>> handler is also >>>>> registered on this state as the callback for the .teardown method. >>>> >>>> TBH, that's still overengineered. Something like this: >>>> >>>> bool cpu_is_alive(unsigned int cpu) >>>> { >>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>> >>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>> } >>>> >>>> and use this to query the actual state at crash time. That spares all >>>> those callback heuristics. >>>> >>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>>>> makedumpfile and (the consumer of it all) the userspace crash >>>>> utility, >>>>> in order to understand the impact of moving from >>>>> for_each_present_cpu() >>>>> to for_each_online_cpu(). >>>> >>>> Is the packing actually worth the trouble? What's the actual win? >>>> >>>> Thanks, >>>> >>>> tglx >>>> >>>> >>> >>> Thomas, >>> I've investigated the passing of crash notes through the vmcore. >>> What I've learned is that: >>> >>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its >>> job) does >>> not care what the contents of cpu PT_NOTES are, but it does >>> coalesce them together. >>> >>> - makedumpfile will count the number of cpu PT_NOTES in order to >>> determine its >>> nr_cpus variable, which is reported in a header, but otherwise >>> unused (except >>> for sadump method). >>> >>> - the crash utility, for the purposes of determining the cpus, does >>> not appear to >>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>> cpu_[possible|present|online]_mask and computes nr_cpus from that, >>> and also of >>> course which are online. In addition, when crash does reference >>> the cpu PT_NOTE, >>> to get its prstatus, it does so by using a percpu technique >>> directly in the vmcore >>> image memory, not via the ELF structure. Said differently, it >>> appears to me that >>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it >>> obtains them >>> via kernel cpumasks and the memory within the vmcore. >>> >>> With this understanding, I did some testing. Perhaps the most >>> telling test was that I >>> changed the number of cpu PT_NOTEs emitted in the >>> crash_prepare_elf64_headers() to just 1, >>> hot plugged some cpus, then also took a few offline sparsely via >>> chcpu, then generated a >>> vmcore. 
The crash utility had no problem loading the vmcore, it >>> reported the proper number >>> of cpus and the number offline (despite only one cpu PT_NOTE), and >>> changing to a different >>> cpu via 'set -c 30' and the backtrace was completely valid. >>> >>> My take away is that crash utility does not rely upon ELF cpu >>> PT_NOTEs, it obtains the >>> cpu information directly from kernel data structures. Perhaps at one >>> time crash relied >>> upon the ELF information, but no more. (Perhaps there are other >>> crash dump analyzers >>> that might rely on the ELF info?) >>> >>> So, all this to say that I see no need to change >>> crash_prepare_elf64_headers(). There >>> is no compelling reason to move away from for_each_present_cpu(), or >>> modify the list for >>> online/offline. >>> >>> Which then leaves the topic of the cpuhp state on which to register. >>> Perhaps reverting >>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There >>> does not appear to >>> be a compelling need to accurately track whether the cpu went >>> online/offline for the >>> purposes of creating the elfcorehdr, as ultimately the crash utility >>> pulls that from >>> kernel data structures, not the elfcorehdr. >>> >>> I think this is what Sourabh has known and has been advocating for >>> an optimization >>> path that allows not regenerating the elfcorehdr on cpu changes >>> (because all the percpu >>> structs are all laid out). I do think it best to leave that as an >>> arch choice. >> >> Since things are clear on how the PT_NOTES are consumed in kdump >> kernel [fs/proc/vmcore.c], >> makedumpfile, and crash tool I need your opinion on this: >> >> Do we really need to regenerate elfcorehdr for CPU hotplug events? >> If yes, can you please list the elfcorehdr components that changes >> due to CPU hotplug. > Due to the use of for_each_present_cpu(), it is possible for the > number of cpu PT_NOTEs > to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does > not impact the > number of cpu PT_NOTEs (as the cpus are still present). > >> >> From what I understood, crash notes are prepared for possible CPUs >> as system boots and >> could be used to create a PT_NOTE section for each possible CPU while >> generating the elfcorehdr >> during the kdump kernel load. >> >> Now once the elfcorehdr is loaded with PT_NOTEs for every possible >> CPU there is no need to >> regenerate it for CPU hotplug events. Or do we? > > For onlining/offlining of cpus, there is no need to regenerate the > elfcorehdr. However, > for actual hot un/plug of cpus, the answer is yes due to > for_each_present_cpu(). The > caveat here of course is that if crash utility is the only coredump > analyzer of concern, > then it doesn't care about these cpu PT_NOTEs and there would be no > need to re-generate them. > > Also, I'm not sure if ARM cpu hotplug, which is just now coming into > mainstream, impacts > any of this. > > Perhaps the one item that might help here is to distinguish between > actual hot un/plug of > cpus, versus onlining/offlining. At the moment, I can not distinguish > between a hot plug > event and an online event (and unplug/offline). If those were > distinguishable, then we > could only regenerate on un/plug events. > > Or perhaps moving to for_each_possible_cpu() is the better choice? Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about hot[un]plug case. 
Here is my view on how things should be handled if a core-dump analyzer depends on
elfcorehdr PT_NOTEs to find online/offline CPUs.

A PT_NOTE in the elfcorehdr holds the address of the corresponding crash notes (the kernel has
one crash note per CPU for every possible CPU). Though the crash notes are allocated during
boot, they are populated when the system is on the crash path.

This is how crash notes are populated on PowerPC and I am expecting it would be something
similar on other architectures too.

The crashing CPU sends an IPI to every other online CPU with a callback function that updates
the crash notes of that specific CPU. Once the IPI completes, the crashing CPU updates its own
crash note and proceeds further.

The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the
time of the system crash. The core-dump analyzer should be able to identify [un]initialized
crash notes and display the information accordingly.

Thoughts?

- Sourabh
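If an analyzer did follow this model, distinguishing a populated crash note from one the crashing kernel never wrote could be as simple as checking whether the note buffer is still all zeroes; a hypothetical post-processing helper (not from the thread):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dump-analyzer check: a crash note left untouched by the
 * crashing kernel (CPU offline or unplugged at crash time) remains in
 * its zero-initialized state. */
static bool crash_note_is_populated(const unsigned char *note, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (note[i])
			return true;
	}
	return false;
}
```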
On 2/10/23 00:29, Sourabh Jain wrote: > > On 10/02/23 01:09, Eric DeVolder wrote: >> >> >> On 2/9/23 12:43, Sourabh Jain wrote: >>> Hello Eric, >>> >>> On 09/02/23 23:01, Eric DeVolder wrote: >>>> >>>> >>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>> Eric! >>>>> >>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>> >>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names. >>>>>> >>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My >>>>>> attempts at locating this state failed when inside the STARTING section, so I located >>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on >>>>>> this state as the callback for the .startup method. >>>>>> >>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I >>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also >>>>>> registered on this state as the callback for the .teardown method. >>>>> >>>>> TBH, that's still overengineered. Something like this: >>>>> >>>>> bool cpu_is_alive(unsigned int cpu) >>>>> { >>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>> >>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>> } >>>>> >>>>> and use this to query the actual state at crash time. That spares all >>>>> those callback heuristics. >>>>> >>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>>>>> makedumpfile and (the consumer of it all) the userspace crash utility, >>>>>> in order to understand the impact of moving from for_each_present_cpu() >>>>>> to for_each_online_cpu(). >>>>> >>>>> Is the packing actually worth the trouble? What's the actual win? >>>>> >>>>> Thanks, >>>>> >>>>> tglx >>>>> >>>>> >>>> >>>> Thomas, >>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that: >>>> >>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does >>>> not care what the contents of cpu PT_NOTES are, but it does coalesce them together. >>>> >>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its >>>> nr_cpus variable, which is reported in a header, but otherwise unused (except >>>> for sadump method). >>>> >>>> - the crash utility, for the purposes of determining the cpus, does not appear to >>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>> cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of >>>> course which are online. In addition, when crash does reference the cpu PT_NOTE, >>>> to get its prstatus, it does so by using a percpu technique directly in the vmcore >>>> image memory, not via the ELF structure. Said differently, it appears to me that >>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them >>>> via kernel cpumasks and the memory within the vmcore. >>>> >>>> With this understanding, I did some testing. Perhaps the most telling test was that I >>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1, >>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a >>>> vmcore. 
The crash utility had no problem loading the vmcore, it reported the proper number >>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different >>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>> >>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the >>>> cpu information directly from kernel data structures. Perhaps at one time crash relied >>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers >>>> that might rely on the ELF info?) >>>> >>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There >>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for >>>> online/offline. >>>> >>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting >>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to >>>> be a compelling need to accurately track whether the cpu went online/offline for the >>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from >>>> kernel data structures, not the elfcorehdr. >>>> >>>> I think this is what Sourabh has known and has been advocating for an optimization >>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu >>>> structs are all laid out). I do think it best to leave that as an arch choice. >>> >>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c], >>> makedumpfile, and crash tool I need your opinion on this: >>> >>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>> If yes, can you please list the elfcorehdr components that changes due to CPU hotplug. >> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs >> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the >> number of cpu PT_NOTEs (as the cpus are still present). >> >>> >>> From what I understood, crash notes are prepared for possible CPUs as system boots and >>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr >>> during the kdump kernel load. >>> >>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to >>> regenerate it for CPU hotplug events. Or do we? >> >> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However, >> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The >> caveat here of course is that if crash utility is the only coredump analyzer of concern, >> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them. >> >> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts >> any of this. >> >> Perhaps the one item that might help here is to distinguish between actual hot un/plug of >> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug >> event and an online event (and unplug/offline). If those were distinguishable, then we >> could only regenerate on un/plug events. >> >> Or perhaps moving to for_each_possible_cpu() is the better choice? > > Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about > hot[un]plug case. > > Here is my view on how things should be handled if a core-dump analyzer is dependent on > elfcorehdr PT_NOTEs to find online/offline CPUs. 
> > A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has > one crash note per CPU for every possible CPU). Though the crash notes are allocated > during the boot time they are populated when the system is on the crash path. > > This is how crash notes are populated on PowerPC and I am expecting it would be something > similar on other architectures too. > > The crashing CPU sends IPI to every other online CPU with a callback function that updates the > crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash > note and proceeds further. > > The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time > system crash. The core-dump analyzer should be able to identify [un]/initialized crash notes > and display the information accordingly. > > Thoughts? > > - Sourabh In general, I agree with your points. You've presented a strong case to go with for_each_possible_cpu() in crash_prepare_elf64_headers() and those crash notes would always be present, and we can ignore changes to cpus wrt/ elfcorehdr updates. But what do we do about kexec_load() syscall? The way the userspace utility works is it determines cpus by: nr_cpus = sysconf(_SC_NPROCESSORS_CONF); which is not the equivalent of possible_cpus. So the complete list of cpu PT_NOTEs is not generated up front. We would need a solution for that? Thanks, eric PS. I'll be on vacation all of next week, returning 20feb.
On 11/02/23 06:05, Eric DeVolder wrote: > > > On 2/10/23 00:29, Sourabh Jain wrote: >> >> On 10/02/23 01:09, Eric DeVolder wrote: >>> >>> >>> On 2/9/23 12:43, Sourabh Jain wrote: >>>> Hello Eric, >>>> >>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>> >>>>> >>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>> Eric! >>>>>> >>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>> >>>>>>> So my latest solution is introduce two new CPUHP states, >>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm >>>>>>> open to better names. >>>>>>> >>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after >>>>>>> CPUHP_BRINGUP_CPU. My >>>>>>> attempts at locating this state failed when inside the STARTING >>>>>>> section, so I located >>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler >>>>>>> is registered on >>>>>>> this state as the callback for the .startup method. >>>>>>> >>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before >>>>>>> CPUHP_TEARDOWN_CPU, and I >>>>>>> placed it at the end of the PREPARE section. This crash hotplug >>>>>>> handler is also >>>>>>> registered on this state as the callback for the .teardown method. >>>>>> >>>>>> TBH, that's still overengineered. Something like this: >>>>>> >>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>> { >>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>> >>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>> } >>>>>> >>>>>> and use this to query the actual state at crash time. That spares >>>>>> all >>>>>> those callback heuristics. >>>>>> >>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, >>>>>>> vmcoreinfo, >>>>>>> makedumpfile and (the consumer of it all) the userspace crash >>>>>>> utility, >>>>>>> in order to understand the impact of moving from >>>>>>> for_each_present_cpu() >>>>>>> to for_each_online_cpu(). >>>>>> >>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> tglx >>>>>> >>>>>> >>>>> >>>>> Thomas, >>>>> I've investigated the passing of crash notes through the vmcore. >>>>> What I've learned is that: >>>>> >>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its >>>>> job) does >>>>> not care what the contents of cpu PT_NOTES are, but it does >>>>> coalesce them together. >>>>> >>>>> - makedumpfile will count the number of cpu PT_NOTES in order to >>>>> determine its >>>>> nr_cpus variable, which is reported in a header, but otherwise >>>>> unused (except >>>>> for sadump method). >>>>> >>>>> - the crash utility, for the purposes of determining the cpus, >>>>> does not appear to >>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>> cpu_[possible|present|online]_mask and computes nr_cpus from >>>>> that, and also of >>>>> course which are online. In addition, when crash does reference >>>>> the cpu PT_NOTE, >>>>> to get its prstatus, it does so by using a percpu technique >>>>> directly in the vmcore >>>>> image memory, not via the ELF structure. Said differently, it >>>>> appears to me that >>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather >>>>> it obtains them >>>>> via kernel cpumasks and the memory within the vmcore. >>>>> >>>>> With this understanding, I did some testing. 
Perhaps the most >>>>> telling test was that I >>>>> changed the number of cpu PT_NOTEs emitted in the >>>>> crash_prepare_elf64_headers() to just 1, >>>>> hot plugged some cpus, then also took a few offline sparsely via >>>>> chcpu, then generated a >>>>> vmcore. The crash utility had no problem loading the vmcore, it >>>>> reported the proper number >>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and >>>>> changing to a different >>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>> >>>>> My take away is that crash utility does not rely upon ELF cpu >>>>> PT_NOTEs, it obtains the >>>>> cpu information directly from kernel data structures. Perhaps at >>>>> one time crash relied >>>>> upon the ELF information, but no more. (Perhaps there are other >>>>> crash dump analyzers >>>>> that might rely on the ELF info?) >>>>> >>>>> So, all this to say that I see no need to change >>>>> crash_prepare_elf64_headers(). There >>>>> is no compelling reason to move away from for_each_present_cpu(), >>>>> or modify the list for >>>>> online/offline. >>>>> >>>>> Which then leaves the topic of the cpuhp state on which to >>>>> register. Perhaps reverting >>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There >>>>> does not appear to >>>>> be a compelling need to accurately track whether the cpu went >>>>> online/offline for the >>>>> purposes of creating the elfcorehdr, as ultimately the crash >>>>> utility pulls that from >>>>> kernel data structures, not the elfcorehdr. >>>>> >>>>> I think this is what Sourabh has known and has been advocating for >>>>> an optimization >>>>> path that allows not regenerating the elfcorehdr on cpu changes >>>>> (because all the percpu >>>>> structs are all laid out). I do think it best to leave that as an >>>>> arch choice. >>>> >>>> Since things are clear on how the PT_NOTES are consumed in kdump >>>> kernel [fs/proc/vmcore.c], >>>> makedumpfile, and crash tool I need your opinion on this: >>>> >>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>> If yes, can you please list the elfcorehdr components that changes >>>> due to CPU hotplug. >>> Due to the use of for_each_present_cpu(), it is possible for the >>> number of cpu PT_NOTEs >>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does >>> not impact the >>> number of cpu PT_NOTEs (as the cpus are still present). >>> >>>> >>>> From what I understood, crash notes are prepared for possible CPUs >>>> as system boots and >>>> could be used to create a PT_NOTE section for each possible CPU >>>> while generating the elfcorehdr >>>> during the kdump kernel load. >>>> >>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible >>>> CPU there is no need to >>>> regenerate it for CPU hotplug events. Or do we? >>> >>> For onlining/offlining of cpus, there is no need to regenerate the >>> elfcorehdr. However, >>> for actual hot un/plug of cpus, the answer is yes due to >>> for_each_present_cpu(). The >>> caveat here of course is that if crash utility is the only coredump >>> analyzer of concern, >>> then it doesn't care about these cpu PT_NOTEs and there would be no >>> need to re-generate them. >>> >>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into >>> mainstream, impacts >>> any of this. >>> >>> Perhaps the one item that might help here is to distinguish between >>> actual hot un/plug of >>> cpus, versus onlining/offlining. 
At the moment, I can not >>> distinguish between a hot plug >>> event and an online event (and unplug/offline). If those were >>> distinguishable, then we >>> could only regenerate on un/plug events. >>> >>> Or perhaps moving to for_each_possible_cpu() is the better choice? >> >> Yes, because once elfcorehdr is built with possible CPUs we don't >> have to worry about >> hot[un]plug case. >> >> Here is my view on how things should be handled if a core-dump >> analyzer is dependent on >> elfcorehdr PT_NOTEs to find online/offline CPUs. >> >> A PT_NOTE in elfcorehdr holds the address of the corresponding crash >> notes (kernel has >> one crash note per CPU for every possible CPU). Though the crash >> notes are allocated >> during the boot time they are populated when the system is on the >> crash path. >> >> This is how crash notes are populated on PowerPC and I am expecting >> it would be something >> similar on other architectures too. >> >> The crashing CPU sends IPI to every other online CPU with a callback >> function that updates the >> crash notes of that specific CPU. Once the IPI completes the crashing >> CPU updates its own crash >> note and proceeds further. >> >> The crash notes of CPUs remain uninitialized if the CPUs were offline >> or hot unplugged at the time >> system crash. The core-dump analyzer should be able to identify >> [un]/initialized crash notes >> and display the information accordingly. >> >> Thoughts? >> >> - Sourabh > > In general, I agree with your points. You've presented a strong case > to go with for_each_possible_cpu() in crash_prepare_elf64_headers() > and those crash notes would always be present, and we can ignore > changes to cpus wrt/ elfcorehdr updates. > > But what do we do about kexec_load() syscall? The way the userspace > utility works is it determines cpus by: > nr_cpus = sysconf(_SC_NPROCESSORS_CONF); > which is not the equivalent of possible_cpus. So the complete list of > cpu PT_NOTEs is not generated up front. We would need a solution for > that? Hello Eric, The sysconf document says _SC_NPROCESSORS_CONF is processors configured, isn't that equivalent to possible CPUs? What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? IIUC, on powerPC it is possible CPUs. In case sysconf(_SC_NPROCESSORS_CONF) is not consistent then we can go with: /sys/devices/system/cpu/possible for kexec_load case. Thoughts? - Sourabh Jain
On Mon, Feb 13 2023 at 10:10, Sourabh Jain wrote:
> The sysconf document says _SC_NPROCESSORS_CONF is processors configured,
> isn't that equivalent to possible CPUs?

glibc tries to evaluate that in the following order:

 1) /sys/devices/system/cpu/cpu*

    That's present CPUs not possible CPUs

 2) /proc/stat

    That's online CPUs

 3) sched_getaffinity()

    That's online CPUs at best. In the worst case it's an affinity mask
    which is set on a process group

Thanks,

        tglx
On 13/02/23 18:22, Thomas Gleixner wrote: > On Mon, Feb 13 2023 at 10:10, Sourabh Jain wrote: >> The sysconf document says _SC_NPROCESSORS_CONF is processors configured, >> isn't that equivalent to possible CPUs? > glibc tries to evaluate that in the following order: > > 1) /sys/devices/system/cpu/cpu* > > That's present CPUs not possible CPUs > > 2) /proc/stat > > That's online CPUs > > 3) sched_getaffinity() > > That's online CPUs at best. In the worst case it's an affinity mask > which is set on a process group Thanks for the clarification Thomas. - Sourabh
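Given Thomas' list above, sysconf(_SC_NPROCESSORS_CONF) cannot be relied on to return possible CPUs, so a kexec_load()-style loader would have to read the possible mask directly, as Sourabh suggested. A minimal userspace sketch of that (illustrative only, not kexec-tools code; it assumes the usual cpulist format such as "0-63" in /sys/devices/system/cpu/possible):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Return the number of possible CPUs, or -1 on error. */
static long possible_cpus(void)
{
	char buf[256], *p;
	long last = -1;
	FILE *f = fopen("/sys/devices/system/cpu/possible", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* The cpulist may hold several ranges; the last number is the highest id. */
	for (p = buf; *p; p++) {
		if (*p >= '0' && *p <= '9') {
			last = strtol(p, &p, 10);
			p--;
		}
	}
	return last < 0 ? -1 : last + 1;
}

int main(void)
{
	printf("nr_possible_cpus=%ld vs sysconf=%ld\n",
	       possible_cpus(), sysconf(_SC_NPROCESSORS_CONF));
	return 0;
}

With that count a loader could reserve one PT_NOTE per possible CPU up front, though the crash_notes address of a not-yet-present CPU is still not exposed via sysfs, which is the remaining gap discussed below.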
On 2/10/23 00:29, Sourabh Jain wrote: > > On 10/02/23 01:09, Eric DeVolder wrote: >> >> >> On 2/9/23 12:43, Sourabh Jain wrote: >>> Hello Eric, >>> >>> On 09/02/23 23:01, Eric DeVolder wrote: >>>> >>>> >>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>> Eric! >>>>> >>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>> >>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names. >>>>>> >>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My >>>>>> attempts at locating this state failed when inside the STARTING section, so I located >>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on >>>>>> this state as the callback for the .startup method. >>>>>> >>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I >>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also >>>>>> registered on this state as the callback for the .teardown method. >>>>> >>>>> TBH, that's still overengineered. Something like this: >>>>> >>>>> bool cpu_is_alive(unsigned int cpu) >>>>> { >>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>> >>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>> } >>>>> >>>>> and use this to query the actual state at crash time. That spares all >>>>> those callback heuristics. >>>>> >>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>>>>> makedumpfile and (the consumer of it all) the userspace crash utility, >>>>>> in order to understand the impact of moving from for_each_present_cpu() >>>>>> to for_each_online_cpu(). >>>>> >>>>> Is the packing actually worth the trouble? What's the actual win? >>>>> >>>>> Thanks, >>>>> >>>>> tglx >>>>> >>>>> >>>> >>>> Thomas, >>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that: >>>> >>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does >>>> not care what the contents of cpu PT_NOTES are, but it does coalesce them together. >>>> >>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its >>>> nr_cpus variable, which is reported in a header, but otherwise unused (except >>>> for sadump method). >>>> >>>> - the crash utility, for the purposes of determining the cpus, does not appear to >>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>> cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of >>>> course which are online. In addition, when crash does reference the cpu PT_NOTE, >>>> to get its prstatus, it does so by using a percpu technique directly in the vmcore >>>> image memory, not via the ELF structure. Said differently, it appears to me that >>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them >>>> via kernel cpumasks and the memory within the vmcore. >>>> >>>> With this understanding, I did some testing. Perhaps the most telling test was that I >>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1, >>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a >>>> vmcore. 
The crash utility had no problem loading the vmcore, it reported the proper number >>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different >>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>> >>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the >>>> cpu information directly from kernel data structures. Perhaps at one time crash relied >>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers >>>> that might rely on the ELF info?) >>>> >>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There >>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for >>>> online/offline. >>>> >>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting >>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to >>>> be a compelling need to accurately track whether the cpu went online/offline for the >>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from >>>> kernel data structures, not the elfcorehdr. >>>> >>>> I think this is what Sourabh has known and has been advocating for an optimization >>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu >>>> structs are all laid out). I do think it best to leave that as an arch choice. >>> >>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c], >>> makedumpfile, and crash tool I need your opinion on this: >>> >>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>> If yes, can you please list the elfcorehdr components that changes due to CPU hotplug. >> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs >> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the >> number of cpu PT_NOTEs (as the cpus are still present). >> >>> >>> From what I understood, crash notes are prepared for possible CPUs as system boots and >>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr >>> during the kdump kernel load. >>> >>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to >>> regenerate it for CPU hotplug events. Or do we? >> >> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However, >> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The >> caveat here of course is that if crash utility is the only coredump analyzer of concern, >> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them. >> >> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts >> any of this. >> >> Perhaps the one item that might help here is to distinguish between actual hot un/plug of >> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug >> event and an online event (and unplug/offline). If those were distinguishable, then we >> could only regenerate on un/plug events. >> >> Or perhaps moving to for_each_possible_cpu() is the better choice? > > Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about > hot[un]plug case. > > Here is my view on how things should be handled if a core-dump analyzer is dependent on > elfcorehdr PT_NOTEs to find online/offline CPUs. 
> > A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has > one crash note per CPU for every possible CPU). Though the crash notes are allocated > during the boot time they are populated when the system is on the crash path. > > This is how crash notes are populated on PowerPC and I am expecting it would be something > similar on other architectures too. > > The crashing CPU sends IPI to every other online CPU with a callback function that updates the > crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash > note and proceeds further. > > The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time > system crash. The core-dump analyzer should be able to identify [un]/initialized crash notes > and display the information accordingly. > > Thoughts? > > - Sourabh I've been examining what it would mean to move to for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it means: - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers(). - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the load with the elfcorehdr generated by crash_prepare_elf64_headers(). - Eliminate/remove the cpuhp machinery for handling crash hotplug events. This would then setup PT_NOTEs for all possible cpus, which should in theory accommodate crash analyzers that rely on ELF PT_NOTEs for crash_notes. If staying with for_each_present_cpu() is ultimately decided, then I think leaving the cpuhp machinery in place and each arch could decide how to handle crash cpu hotplug events. The overhead for doing this is very minimal, and the events are likely very infrequent. No matter which is decided, to support crash hotplug for kexec_load still requires changes to the userspace kexec-tools utility (for excluding the elfcorehdr from the purgatory hash, and providing an appropriately sized elfcorehdr buffer). I know Sourabh votes for for_each_possible_cpu(), Thomas/Boris/Baoquan/others, I'd appreciate your opinion/insight here! Thanks! eric
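For reference, the first item in Eric's list above amounts to changing the iterator in the PT_NOTE loop. A sketch of that loop, assuming the upstream shape of crash_prepare_elf64_headers() (abridged; cpu, notes_addr, phdr and ehdr are the function's existing locals, crash_notes and note_buf_t come from kexec core, and this is not the posted patch):

	/* Prepare one phdr of type PT_NOTE for each possible CPU */
	for_each_possible_cpu(cpu) {
		phdr->p_type = PT_NOTE;
		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
		phdr->p_offset = phdr->p_paddr = notes_addr;
		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
		(ehdr->e_phnum)++;
		phdr++;
	}

Because the crash notes are a percpu allocation, every possible CPU has a buffer to point at; notes for CPUs that never come online simply remain unfilled, as Sourabh describes above.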
On 24/02/23 02:04, Eric DeVolder wrote: > > > On 2/10/23 00:29, Sourabh Jain wrote: >> >> On 10/02/23 01:09, Eric DeVolder wrote: >>> >>> >>> On 2/9/23 12:43, Sourabh Jain wrote: >>>> Hello Eric, >>>> >>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>> >>>>> >>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>> Eric! >>>>>> >>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>> >>>>>>> So my latest solution is introduce two new CPUHP states, >>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm >>>>>>> open to better names. >>>>>>> >>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after >>>>>>> CPUHP_BRINGUP_CPU. My >>>>>>> attempts at locating this state failed when inside the STARTING >>>>>>> section, so I located >>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler >>>>>>> is registered on >>>>>>> this state as the callback for the .startup method. >>>>>>> >>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before >>>>>>> CPUHP_TEARDOWN_CPU, and I >>>>>>> placed it at the end of the PREPARE section. This crash hotplug >>>>>>> handler is also >>>>>>> registered on this state as the callback for the .teardown method. >>>>>> >>>>>> TBH, that's still overengineered. Something like this: >>>>>> >>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>> { >>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>> >>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>> } >>>>>> >>>>>> and use this to query the actual state at crash time. That spares >>>>>> all >>>>>> those callback heuristics. >>>>>> >>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, >>>>>>> vmcoreinfo, >>>>>>> makedumpfile and (the consumer of it all) the userspace crash >>>>>>> utility, >>>>>>> in order to understand the impact of moving from >>>>>>> for_each_present_cpu() >>>>>>> to for_each_online_cpu(). >>>>>> >>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> tglx >>>>>> >>>>>> >>>>> >>>>> Thomas, >>>>> I've investigated the passing of crash notes through the vmcore. >>>>> What I've learned is that: >>>>> >>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its >>>>> job) does >>>>> not care what the contents of cpu PT_NOTES are, but it does >>>>> coalesce them together. >>>>> >>>>> - makedumpfile will count the number of cpu PT_NOTES in order to >>>>> determine its >>>>> nr_cpus variable, which is reported in a header, but otherwise >>>>> unused (except >>>>> for sadump method). >>>>> >>>>> - the crash utility, for the purposes of determining the cpus, >>>>> does not appear to >>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>> cpu_[possible|present|online]_mask and computes nr_cpus from >>>>> that, and also of >>>>> course which are online. In addition, when crash does reference >>>>> the cpu PT_NOTE, >>>>> to get its prstatus, it does so by using a percpu technique >>>>> directly in the vmcore >>>>> image memory, not via the ELF structure. Said differently, it >>>>> appears to me that >>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather >>>>> it obtains them >>>>> via kernel cpumasks and the memory within the vmcore. >>>>> >>>>> With this understanding, I did some testing. 
Perhaps the most >>>>> telling test was that I >>>>> changed the number of cpu PT_NOTEs emitted in the >>>>> crash_prepare_elf64_headers() to just 1, >>>>> hot plugged some cpus, then also took a few offline sparsely via >>>>> chcpu, then generated a >>>>> vmcore. The crash utility had no problem loading the vmcore, it >>>>> reported the proper number >>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and >>>>> changing to a different >>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>> >>>>> My take away is that crash utility does not rely upon ELF cpu >>>>> PT_NOTEs, it obtains the >>>>> cpu information directly from kernel data structures. Perhaps at >>>>> one time crash relied >>>>> upon the ELF information, but no more. (Perhaps there are other >>>>> crash dump analyzers >>>>> that might rely on the ELF info?) >>>>> >>>>> So, all this to say that I see no need to change >>>>> crash_prepare_elf64_headers(). There >>>>> is no compelling reason to move away from for_each_present_cpu(), >>>>> or modify the list for >>>>> online/offline. >>>>> >>>>> Which then leaves the topic of the cpuhp state on which to >>>>> register. Perhaps reverting >>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There >>>>> does not appear to >>>>> be a compelling need to accurately track whether the cpu went >>>>> online/offline for the >>>>> purposes of creating the elfcorehdr, as ultimately the crash >>>>> utility pulls that from >>>>> kernel data structures, not the elfcorehdr. >>>>> >>>>> I think this is what Sourabh has known and has been advocating for >>>>> an optimization >>>>> path that allows not regenerating the elfcorehdr on cpu changes >>>>> (because all the percpu >>>>> structs are all laid out). I do think it best to leave that as an >>>>> arch choice. >>>> >>>> Since things are clear on how the PT_NOTES are consumed in kdump >>>> kernel [fs/proc/vmcore.c], >>>> makedumpfile, and crash tool I need your opinion on this: >>>> >>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>> If yes, can you please list the elfcorehdr components that changes >>>> due to CPU hotplug. >>> Due to the use of for_each_present_cpu(), it is possible for the >>> number of cpu PT_NOTEs >>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does >>> not impact the >>> number of cpu PT_NOTEs (as the cpus are still present). >>> >>>> >>>> From what I understood, crash notes are prepared for possible CPUs >>>> as system boots and >>>> could be used to create a PT_NOTE section for each possible CPU >>>> while generating the elfcorehdr >>>> during the kdump kernel load. >>>> >>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible >>>> CPU there is no need to >>>> regenerate it for CPU hotplug events. Or do we? >>> >>> For onlining/offlining of cpus, there is no need to regenerate the >>> elfcorehdr. However, >>> for actual hot un/plug of cpus, the answer is yes due to >>> for_each_present_cpu(). The >>> caveat here of course is that if crash utility is the only coredump >>> analyzer of concern, >>> then it doesn't care about these cpu PT_NOTEs and there would be no >>> need to re-generate them. >>> >>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into >>> mainstream, impacts >>> any of this. >>> >>> Perhaps the one item that might help here is to distinguish between >>> actual hot un/plug of >>> cpus, versus onlining/offlining. 
At the moment, I can not >>> distinguish between a hot plug >>> event and an online event (and unplug/offline). If those were >>> distinguishable, then we >>> could only regenerate on un/plug events. >>> >>> Or perhaps moving to for_each_possible_cpu() is the better choice? >> >> Yes, because once elfcorehdr is built with possible CPUs we don't >> have to worry about >> hot[un]plug case. >> >> Here is my view on how things should be handled if a core-dump >> analyzer is dependent on >> elfcorehdr PT_NOTEs to find online/offline CPUs. >> >> A PT_NOTE in elfcorehdr holds the address of the corresponding crash >> notes (kernel has >> one crash note per CPU for every possible CPU). Though the crash >> notes are allocated >> during the boot time they are populated when the system is on the >> crash path. >> >> This is how crash notes are populated on PowerPC and I am expecting >> it would be something >> similar on other architectures too. >> >> The crashing CPU sends IPI to every other online CPU with a callback >> function that updates the >> crash notes of that specific CPU. Once the IPI completes the crashing >> CPU updates its own crash >> note and proceeds further. >> >> The crash notes of CPUs remain uninitialized if the CPUs were offline >> or hot unplugged at the time >> system crash. The core-dump analyzer should be able to identify >> [un]/initialized crash notes >> and display the information accordingly. >> >> Thoughts? >> >> - Sourabh > > I've been examining what it would mean to move to > for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it > means: > > - Changing for_each_present_cpu() to for_each_possible_cpu() in > crash_prepare_elf64_headers(). > - For kexec_load() syscall path, rewrite the incoming/supplied > elfcorehdr immediately on the load with the elfcorehdr generated by > crash_prepare_elf64_headers(). > - Eliminate/remove the cpuhp machinery for handling crash hotplug events. If for_each_present_cpu is replaced with for_each_possible_cpu I still need cpuhp machinery to update FDT kexec segment for CPU hot add case. > > This would then setup PT_NOTEs for all possible cpus, which should in > theory accommodate crash analyzers that rely on ELF PT_NOTEs for > crash_notes. > > If staying with for_each_present_cpu() is ultimately decided, then I > think leaving the cpuhp machinery in place and each arch could decide > how to handle crash cpu hotplug events. The overhead for doing this is > very minimal, and the events are likely very infrequent. I agree. Some architectures may need cpuhp machinery to update kexec segment[s] other then elfcorehdr. For example FDT on PowerPC. - Sourabh Jain
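To keep the cpuhp machinery Sourabh needs (for example the FDT segment on PowerPC), the registration itself can stay generic. A hypothetical sketch of such a registration on CPUHP_BP_PREPARE_DYN, reusing the crash_handle_hotplug_event() and KEXEC_CRASH_HP_* names that appear in the proof-of-concept diffs later in this thread (not the posted patch; each arch would do its own segment updates inside its arch crash hotplug handler):

static int crash_cpuhp_online(unsigned int cpu)
{
	/* Let the arch update its crash kexec segments (elfcorehdr, FDT, ...). */
	crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
	return 0;
}

static int crash_cpuhp_offline(unsigned int cpu)
{
	crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
	return 0;
}

static int __init crash_cpuhp_init(void)
{
	int ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/cpuhp",
					    crash_cpuhp_online, crash_cpuhp_offline);

	/* A dynamic state returns the allocated state number on success. */
	return ret < 0 ? ret : 0;
}
subsys_initcall(crash_cpuhp_init);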
On 2/24/23 02:34, Sourabh Jain wrote: > > On 24/02/23 02:04, Eric DeVolder wrote: >> >> >> On 2/10/23 00:29, Sourabh Jain wrote: >>> >>> On 10/02/23 01:09, Eric DeVolder wrote: >>>> >>>> >>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>> Hello Eric, >>>>> >>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>> >>>>>> >>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>> Eric! >>>>>>> >>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>> >>>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names. >>>>>>>> >>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My >>>>>>>> attempts at locating this state failed when inside the STARTING section, so I located >>>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on >>>>>>>> this state as the callback for the .startup method. >>>>>>>> >>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I >>>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also >>>>>>>> registered on this state as the callback for the .teardown method. >>>>>>> >>>>>>> TBH, that's still overengineered. Something like this: >>>>>>> >>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>> { >>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>> >>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>> } >>>>>>> >>>>>>> and use this to query the actual state at crash time. That spares all >>>>>>> those callback heuristics. >>>>>>> >>>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility, >>>>>>>> in order to understand the impact of moving from for_each_present_cpu() >>>>>>>> to for_each_online_cpu(). >>>>>>> >>>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> tglx >>>>>>> >>>>>>> >>>>>> >>>>>> Thomas, >>>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that: >>>>>> >>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does >>>>>> not care what the contents of cpu PT_NOTES are, but it does coalesce them together. >>>>>> >>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its >>>>>> nr_cpus variable, which is reported in a header, but otherwise unused (except >>>>>> for sadump method). >>>>>> >>>>>> - the crash utility, for the purposes of determining the cpus, does not appear to >>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of >>>>>> course which are online. In addition, when crash does reference the cpu PT_NOTE, >>>>>> to get its prstatus, it does so by using a percpu technique directly in the vmcore >>>>>> image memory, not via the ELF structure. Said differently, it appears to me that >>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them >>>>>> via kernel cpumasks and the memory within the vmcore. >>>>>> >>>>>> With this understanding, I did some testing. 
Perhaps the most telling test was that I >>>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1, >>>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a >>>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number >>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different >>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>> >>>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the >>>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied >>>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers >>>>>> that might rely on the ELF info?) >>>>>> >>>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There >>>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for >>>>>> online/offline. >>>>>> >>>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting >>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to >>>>>> be a compelling need to accurately track whether the cpu went online/offline for the >>>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from >>>>>> kernel data structures, not the elfcorehdr. >>>>>> >>>>>> I think this is what Sourabh has known and has been advocating for an optimization >>>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu >>>>>> structs are all laid out). I do think it best to leave that as an arch choice. >>>>> >>>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c], >>>>> makedumpfile, and crash tool I need your opinion on this: >>>>> >>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>> If yes, can you please list the elfcorehdr components that changes due to CPU hotplug. >>>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs >>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the >>>> number of cpu PT_NOTEs (as the cpus are still present). >>>> >>>>> >>>>> From what I understood, crash notes are prepared for possible CPUs as system boots and >>>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr >>>>> during the kdump kernel load. >>>>> >>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to >>>>> regenerate it for CPU hotplug events. Or do we? >>>> >>>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However, >>>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The >>>> caveat here of course is that if crash utility is the only coredump analyzer of concern, >>>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them. >>>> >>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts >>>> any of this. >>>> >>>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of >>>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug >>>> event and an online event (and unplug/offline). 
If those were distinguishable, then we >>>> could only regenerate on un/plug events. >>>> >>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>> >>> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about >>> hot[un]plug case. >>> >>> Here is my view on how things should be handled if a core-dump analyzer is dependent on >>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>> >>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has >>> one crash note per CPU for every possible CPU). Though the crash notes are allocated >>> during the boot time they are populated when the system is on the crash path. >>> >>> This is how crash notes are populated on PowerPC and I am expecting it would be something >>> similar on other architectures too. >>> >>> The crashing CPU sends IPI to every other online CPU with a callback function that updates the >>> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash >>> note and proceeds further. >>> >>> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time >>> system crash. The core-dump analyzer should be able to identify [un]/initialized crash notes >>> and display the information accordingly. >>> >>> Thoughts? >>> >>> - Sourabh >> >> I've been examining what it would mean to move to for_each_possible_cpu() in >> crash_prepare_elf64_headers(). I think it means: >> >> - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers(). >> - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the load >> with the elfcorehdr generated by crash_prepare_elf64_headers(). >> - Eliminate/remove the cpuhp machinery for handling crash hotplug events. > > If for_each_present_cpu is replaced with for_each_possible_cpu I still need cpuhp machinery > to update FDT kexec segment for CPU hot add case. Ah, ok, that's important! So the cpuhp callbacks are still needed. > > >> >> This would then setup PT_NOTEs for all possible cpus, which should in theory accommodate crash >> analyzers that rely on ELF PT_NOTEs for crash_notes. >> >> If staying with for_each_present_cpu() is ultimately decided, then I think leaving the cpuhp >> machinery in place and each arch could decide how to handle crash cpu hotplug events. The overhead >> for doing this is very minimal, and the events are likely very infrequent. > > I agree. Some architectures may need cpuhp machinery to update kexec segment[s] other then > elfcorehdr. For example FDT on PowerPC. > > - Sourabh Jain OK, I was thinking that the desire was to eliminate the cpuhp callbacks. In reality, the desire is to change to for_each_possible_cpu(). Given that the kernel creates crash_notes for all possible cpus upon kernel boot, there seems to be no reason to not do this? HOWEVER... It's not clear to me that this particular change needs to be part of this series. It's inclusion would facilitate PPC support, but doesn't "solve" anything in general. In fact it causes kexec_load and kexec_file_load to deviate (kexec_load via userspace kexec does the equivalent of for_each_present_cpu() where as with this change kexec_file_load would do for_each_possible_cpu(); until a hot plug event then both would do for_each_possible_cpu()). And if this change were to arrive as part of Sourabh's PPC support, then it does not appear to impact x86 (not sure about other arches). 
And the 'crash' dump analyzer doesn't care either way.

Including this change would enable an optimization path (for x86 at least) that short-circuits cpu hotplug changes in the arch crash handler, for example:

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index aca3f1817674..0883f6b11de4 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
 	unsigned long mem, memsz;
 	unsigned long elfsz = 0;

+	if (image->file_mode && (
+		image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
+		image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
+		return;
+
 	/*
 	 * Create the new elfcorehdr reflecting the changes to CPU and/or
 	 * memory resources.

I'm not sure that is compelling given the infrequent nature of cpu hotplug events.

In my mind I still have a question about the kexec_load() path. The userspace kexec can not do the equivalent of for_each_possible_cpu(). It can obtain the max possible cpus from /sys/devices/system/cpu/possible, but for those cpus not present the /sys/devices/system/cpu/cpuXX directory is not available and so the crash_notes entries are not available. My attempts to expose all cpuXX led to odd behavior that required changes in ACPI and arch code that looked untenable.

There seem to be these options available for the kexec_load() path:

- immediately rewrite the elfcorehdr upon load via a call to crash_prepare_elf64_headers(). I've made this work with the following, as proof of concept:

diff --git a/kernel/kexec.c b/kernel/kexec.c
index cb8e6e6f983c..4eb201270f97 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, unsigned long
 	kimage_free(image);
 out_unlock:
 	kexec_unlock();
+	if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
+		if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) {
+			crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, KEXEC_CRASH_HP_INVALID_CPU);
+		}
+	}
 	return ret;
 }

- Another option is to spend the time to determine whether exposing all cpuXX is a viable solution; I have no idea what the impacts to userspace would be for possible-but-not-yet-present cpuXX entries. It might also mean requiring a 'present' entry available within the cpuXX.

- Another option is to simply let the hot plug events rewrite the elfcorehdr on demand. This is what I originally put forth, but I'm not sure how this impacts PPC given the for_each_possible_cpu() change.

The concern is that today, both kexec_load and kexec_file_load mirror each other with respect to for_each_present_cpu(); that is, userspace kexec is able to generate the elfcorehdr the same as would kexec_file_load, for cpus. But by changing to for_each_possible_cpu(), the two would deviate.

Thoughts?
eric
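To make the cpuXX gap above concrete: a kexec_load()-style loader learns each CPU's crash note address from sysfs, and that attribute only exists for present CPUs. A minimal userspace sketch of that lookup (illustrative only, not kexec-tools code; it assumes the crash_notes attribute prints the physical address in hex):

#include <stdio.h>
#include <inttypes.h>

/* Return 0 and fill *paddr on success, -1 if the CPU has no sysfs entry. */
static int crash_notes_addr(int cpu, uint64_t *paddr)
{
	char path[128];
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/crash_notes", cpu);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* possible-but-not-present CPU: no cpuXX directory */
	if (fscanf(f, "%" SCNx64, paddr) == 1)
		ret = 0;
	fclose(f);
	return ret;
}

This is why a userspace-generated elfcorehdr can only describe present CPUs, and why rewriting it in the kernel at load time (the first option above) closes that gap.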
On 25/02/23 01:46, Eric DeVolder wrote: > > > On 2/24/23 02:34, Sourabh Jain wrote: >> >> On 24/02/23 02:04, Eric DeVolder wrote: >>> >>> >>> On 2/10/23 00:29, Sourabh Jain wrote: >>>> >>>> On 10/02/23 01:09, Eric DeVolder wrote: >>>>> >>>>> >>>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>>> Hello Eric, >>>>>> >>>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>>> >>>>>>> >>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>>> Eric! >>>>>>>> >>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>>> >>>>>>>>> So my latest solution is introduce two new CPUHP states, >>>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. >>>>>>>>> I'm open to better names. >>>>>>>>> >>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after >>>>>>>>> CPUHP_BRINGUP_CPU. My >>>>>>>>> attempts at locating this state failed when inside the >>>>>>>>> STARTING section, so I located >>>>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler >>>>>>>>> is registered on >>>>>>>>> this state as the callback for the .startup method. >>>>>>>>> >>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before >>>>>>>>> CPUHP_TEARDOWN_CPU, and I >>>>>>>>> placed it at the end of the PREPARE section. This crash >>>>>>>>> hotplug handler is also >>>>>>>>> registered on this state as the callback for the .teardown >>>>>>>>> method. >>>>>>>> >>>>>>>> TBH, that's still overengineered. Something like this: >>>>>>>> >>>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>>> { >>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>>> >>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>>> } >>>>>>>> >>>>>>>> and use this to query the actual state at crash time. That >>>>>>>> spares all >>>>>>>> those callback heuristics. >>>>>>>> >>>>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, >>>>>>>>> vmcoreinfo, >>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash >>>>>>>>> utility, >>>>>>>>> in order to understand the impact of moving from >>>>>>>>> for_each_present_cpu() >>>>>>>>> to for_each_online_cpu(). >>>>>>>> >>>>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> tglx >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> Thomas, >>>>>>> I've investigated the passing of crash notes through the vmcore. >>>>>>> What I've learned is that: >>>>>>> >>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do >>>>>>> its job) does >>>>>>> not care what the contents of cpu PT_NOTES are, but it does >>>>>>> coalesce them together. >>>>>>> >>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to >>>>>>> determine its >>>>>>> nr_cpus variable, which is reported in a header, but otherwise >>>>>>> unused (except >>>>>>> for sadump method). >>>>>>> >>>>>>> - the crash utility, for the purposes of determining the cpus, >>>>>>> does not appear to >>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from >>>>>>> that, and also of >>>>>>> course which are online. In addition, when crash does >>>>>>> reference the cpu PT_NOTE, >>>>>>> to get its prstatus, it does so by using a percpu technique >>>>>>> directly in the vmcore >>>>>>> image memory, not via the ELF structure. 
Said differently, it >>>>>>> appears to me that >>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; >>>>>>> rather it obtains them >>>>>>> via kernel cpumasks and the memory within the vmcore. >>>>>>> >>>>>>> With this understanding, I did some testing. Perhaps the most >>>>>>> telling test was that I >>>>>>> changed the number of cpu PT_NOTEs emitted in the >>>>>>> crash_prepare_elf64_headers() to just 1, >>>>>>> hot plugged some cpus, then also took a few offline sparsely via >>>>>>> chcpu, then generated a >>>>>>> vmcore. The crash utility had no problem loading the vmcore, it >>>>>>> reported the proper number >>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), >>>>>>> and changing to a different >>>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>>> >>>>>>> My take away is that crash utility does not rely upon ELF cpu >>>>>>> PT_NOTEs, it obtains the >>>>>>> cpu information directly from kernel data structures. Perhaps at >>>>>>> one time crash relied >>>>>>> upon the ELF information, but no more. (Perhaps there are other >>>>>>> crash dump analyzers >>>>>>> that might rely on the ELF info?) >>>>>>> >>>>>>> So, all this to say that I see no need to change >>>>>>> crash_prepare_elf64_headers(). There >>>>>>> is no compelling reason to move away from >>>>>>> for_each_present_cpu(), or modify the list for >>>>>>> online/offline. >>>>>>> >>>>>>> Which then leaves the topic of the cpuhp state on which to >>>>>>> register. Perhaps reverting >>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. >>>>>>> There does not appear to >>>>>>> be a compelling need to accurately track whether the cpu went >>>>>>> online/offline for the >>>>>>> purposes of creating the elfcorehdr, as ultimately the crash >>>>>>> utility pulls that from >>>>>>> kernel data structures, not the elfcorehdr. >>>>>>> >>>>>>> I think this is what Sourabh has known and has been advocating >>>>>>> for an optimization >>>>>>> path that allows not regenerating the elfcorehdr on cpu changes >>>>>>> (because all the percpu >>>>>>> structs are all laid out). I do think it best to leave that as >>>>>>> an arch choice. >>>>>> >>>>>> Since things are clear on how the PT_NOTES are consumed in kdump >>>>>> kernel [fs/proc/vmcore.c], >>>>>> makedumpfile, and crash tool I need your opinion on this: >>>>>> >>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>>> If yes, can you please list the elfcorehdr components that >>>>>> changes due to CPU hotplug. >>>>> Due to the use of for_each_present_cpu(), it is possible for the >>>>> number of cpu PT_NOTEs >>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus >>>>> does not impact the >>>>> number of cpu PT_NOTEs (as the cpus are still present). >>>>> >>>>>> >>>>>> From what I understood, crash notes are prepared for possible >>>>>> CPUs as system boots and >>>>>> could be used to create a PT_NOTE section for each possible CPU >>>>>> while generating the elfcorehdr >>>>>> during the kdump kernel load. >>>>>> >>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every >>>>>> possible CPU there is no need to >>>>>> regenerate it for CPU hotplug events. Or do we? >>>>> >>>>> For onlining/offlining of cpus, there is no need to regenerate the >>>>> elfcorehdr. However, >>>>> for actual hot un/plug of cpus, the answer is yes due to >>>>> for_each_present_cpu(). 
The >>>>> caveat here of course is that if crash utility is the only >>>>> coredump analyzer of concern, >>>>> then it doesn't care about these cpu PT_NOTEs and there would be >>>>> no need to re-generate them. >>>>> >>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming >>>>> into mainstream, impacts >>>>> any of this. >>>>> >>>>> Perhaps the one item that might help here is to distinguish >>>>> between actual hot un/plug of >>>>> cpus, versus onlining/offlining. At the moment, I can not >>>>> distinguish between a hot plug >>>>> event and an online event (and unplug/offline). If those were >>>>> distinguishable, then we >>>>> could only regenerate on un/plug events. >>>>> >>>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>>> >>>> Yes, because once elfcorehdr is built with possible CPUs we don't >>>> have to worry about >>>> hot[un]plug case. >>>> >>>> Here is my view on how things should be handled if a core-dump >>>> analyzer is dependent on >>>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>>> >>>> A PT_NOTE in elfcorehdr holds the address of the corresponding >>>> crash notes (kernel has >>>> one crash note per CPU for every possible CPU). Though the crash >>>> notes are allocated >>>> during the boot time they are populated when the system is on the >>>> crash path. >>>> >>>> This is how crash notes are populated on PowerPC and I am expecting >>>> it would be something >>>> similar on other architectures too. >>>> >>>> The crashing CPU sends IPI to every other online CPU with a >>>> callback function that updates the >>>> crash notes of that specific CPU. Once the IPI completes the >>>> crashing CPU updates its own crash >>>> note and proceeds further. >>>> >>>> The crash notes of CPUs remain uninitialized if the CPUs were >>>> offline or hot unplugged at the time >>>> system crash. The core-dump analyzer should be able to identify >>>> [un]/initialized crash notes >>>> and display the information accordingly. >>>> >>>> Thoughts? >>>> >>>> - Sourabh >>> >>> I've been examining what it would mean to move to >>> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it >>> means: >>> >>> - Changing for_each_present_cpu() to for_each_possible_cpu() in >>> crash_prepare_elf64_headers(). >>> - For kexec_load() syscall path, rewrite the incoming/supplied >>> elfcorehdr immediately on the load with the elfcorehdr generated by >>> crash_prepare_elf64_headers(). >>> - Eliminate/remove the cpuhp machinery for handling crash hotplug >>> events. >> >> If for_each_present_cpu is replaced with for_each_possible_cpu I >> still need cpuhp machinery >> to update FDT kexec segment for CPU hot add case. > > Ah, ok, that's important! So the cpuhp callbacks are still needed. >> >> >>> >>> This would then setup PT_NOTEs for all possible cpus, which should >>> in theory accommodate crash analyzers that rely on ELF PT_NOTEs for >>> crash_notes. >>> >>> If staying with for_each_present_cpu() is ultimately decided, then I >>> think leaving the cpuhp machinery in place and each arch could >>> decide how to handle crash cpu hotplug events. The overhead for >>> doing this is very minimal, and the events are likely very infrequent. >> >> I agree. Some architectures may need cpuhp machinery to update kexec >> segment[s] other then elfcorehdr. For example FDT on PowerPC. >> >> - Sourabh Jain > > OK, I was thinking that the desire was to eliminate the cpuhp > callbacks. In reality, the desire is to change to > for_each_possible_cpu(). 
Given that the kernel creates crash_notes for > all possible cpus upon kernel boot, there seems to be no reason to not > do this? > > HOWEVER... > > It's not clear to me that this particular change needs to be part of > this series. It's inclusion would facilitate PPC support, but doesn't > "solve" anything in general. In fact it causes kexec_load and > kexec_file_load to deviate (kexec_load via userspace kexec does the > equivalent of for_each_present_cpu() where as with this change > kexec_file_load would do for_each_possible_cpu(); until a hot plug > event then both would do for_each_possible_cpu()). And if this change > were to arrive as part of Sourabh's PPC support, then it does not > appear to impact x86 (not sure about other arches). And the 'crash' > dump analyzer doesn't care either way. > > Including this change would enable an optimization path (for x86 at > least) that short-circuits cpu hotplug changes in the arch crash > handler, for example: > > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c > index aca3f1817674..0883f6b11de4 100644 > --- a/arch/x86/kernel/crash.c > +++ b/arch/x86/kernel/crash.c > @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct > kimage *image) > unsigned long mem, memsz; > unsigned long elfsz = 0; > > + if (image->file_mode && ( > + image->hp_action == KEXEC_CRASH_HP_ADD_CPU || > + image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)) > + return; > + > /* > * Create the new elfcorehdr reflecting the changes to CPU and/or > * memory resources. > > I'm not sure that is compelling given the infrequent nature of cpu > hotplug events. It certainly closes/reduces the window where kdump is not active due kexec segment update.| > > In my mind I still have a question about kexec_load() path. The > userspace kexec can not do the equivalent of for_each_possible_cpu(). > It can obtain max possible cpus from /sys/devices/system/cpu/possible, > but for those cpus not present the /sys/devices/system/cpu/cpuXX is > not available and so the crash_notes entries is not available. My > attempts to expose all cpuXX lead to odd behavior that was requiring > changes in ACPI and arch code that looked untenable. > > There seem to be these options available for kexec_load() path: > - immediately rewrite the elfcorehdr upon load via a call to > crash_prepare_elf64_headers(). I've made this work with the following, > as proof of concept: Yes regenerating/patching the elfcorehdr could be an option for kexec_load syscall. > > diff --git a/kernel/kexec.c b/kernel/kexec.c > index cb8e6e6f983c..4eb201270f97 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, > unsigned long > kimage_free(image); > out_unlock: > kexec_unlock(); > + if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) { > + if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) { > + crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, > KEXEC_CRASH_HP_INVALID_CPU); > + } > + } > return ret; > } > > - Another option is spend the time to determine whether exposing all > cpuXX is a viable solution; I have no idea what impacts to userspace > would be for possible-but-not-yet-present cpuXX entries would be. It > might also mean requiring a 'present' entry available within the cpuXX. > > - Another option is to simply let the hot plug events rewrite the > elfcorehdr on demand. This is what I've originally put forth, but not > sure how this impacts PPC given for_each_possible_cpu() change. 
Given that /sys/devices/system/cpu/cpuXX is not present for possible-but-not-yet-present CPUs, I am wondering whether we even have crash notes for possible CPUs on x86? > > The concern is that today, both kexec_load and kexec_file_load mirror > each other with respect to for_each_present_cpu(); that is userspace > kexec is able to generate the elfcorehdr the same as would > kexec_file_load, for cpus. But by changing to for_each_possible_cpu(), > the two would deviate. Thanks, Sourabh Jain
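As a point of reference, here is a minimal sketch of the change at the center of this discussion. It follows the general shape of the loop in crash_prepare_elf64_headers() that emits one PT_NOTE program header per CPU crash note; only the iterator changes. Treat it as illustrative rather than the literal patch, and assume phdr points at the next free Elf64_Phdr slot and ehdr at the ELF header being built:

	unsigned long notes_addr;
	int cpu;

	/* Was: for_each_present_cpu(cpu) */
	for_each_possible_cpu(cpu) {
		phdr->p_type = PT_NOTE;
		/* Physical address of this CPU's percpu crash_notes buffer */
		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
		phdr->p_offset = phdr->p_paddr = notes_addr;
		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
		(ehdr->e_phnum)++;
		phdr++;
	}

With this, the number of cpu PT_NOTEs in the elfcorehdr no longer fluctuates as cpus are hot un/plugged, which is the property being argued for above.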
On 02/13/23 at 10:10am, Sourabh Jain wrote: > > On 11/02/23 06:05, Eric DeVolder wrote: > > > > > > On 2/10/23 00:29, Sourabh Jain wrote: > > > > > > On 10/02/23 01:09, Eric DeVolder wrote: > > > > > > > > > > > > On 2/9/23 12:43, Sourabh Jain wrote: > > > > > Hello Eric, > > > > > > > > > > On 09/02/23 23:01, Eric DeVolder wrote: > > > > > > > > > > > > > > > > > > On 2/8/23 07:44, Thomas Gleixner wrote: > > > > > > > Eric! > > > > > > > > > > > > > > On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: > > > > > > > > On 2/1/23 05:33, Thomas Gleixner wrote: > > > > > > > > > > > > > > > > So my latest solution is introduce two new CPUHP > > > > > > > > states, CPUHP_AP_ELFCOREHDR_ONLINE > > > > > > > > for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for > > > > > > > > offlining. I'm open to better names. > > > > > > > > > > > > > > > > The CPUHP_AP_ELFCOREHDR_ONLINE needs to be > > > > > > > > placed after CPUHP_BRINGUP_CPU. My > > > > > > > > attempts at locating this state failed when > > > > > > > > inside the STARTING section, so I located > > > > > > > > this just inside the ONLINE sectoin. The crash > > > > > > > > hotplug handler is registered on > > > > > > > > this state as the callback for the .startup method. > > > > > > > > > > > > > > > > The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be > > > > > > > > placed before CPUHP_TEARDOWN_CPU, and I > > > > > > > > placed it at the end of the PREPARE section. > > > > > > > > This crash hotplug handler is also > > > > > > > > registered on this state as the callback for the .teardown method. > > > > > > > > > > > > > > TBH, that's still overengineered. Something like this: > > > > > > > > > > > > > > bool cpu_is_alive(unsigned int cpu) > > > > > > > { > > > > > > > struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); > > > > > > > > > > > > > > return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; > > > > > > > } > > > > > > > > > > > > > > and use this to query the actual state at crash > > > > > > > time. That spares all > > > > > > > those callback heuristics. > > > > > > > > > > > > > > > I'm making my way though percpu crash_notes, > > > > > > > > elfcorehdr, vmcoreinfo, > > > > > > > > makedumpfile and (the consumer of it all) the > > > > > > > > userspace crash utility, > > > > > > > > in order to understand the impact of moving from > > > > > > > > for_each_present_cpu() > > > > > > > > to for_each_online_cpu(). > > > > > > > > > > > > > > Is the packing actually worth the trouble? What's the actual win? > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > tglx > > > > > > > > > > > > > > > > > > > > > > > > > > Thomas, > > > > > > I've investigated the passing of crash notes through the > > > > > > vmcore. What I've learned is that: > > > > > > > > > > > > - linux/fs/proc/vmcore.c (which makedumpfile references > > > > > > to do its job) does > > > > > > not care what the contents of cpu PT_NOTES are, but it > > > > > > does coalesce them together. > > > > > > > > > > > > - makedumpfile will count the number of cpu PT_NOTES in > > > > > > order to determine its > > > > > > nr_cpus variable, which is reported in a header, but > > > > > > otherwise unused (except > > > > > > for sadump method). > > > > > > > > > > > > - the crash utility, for the purposes of determining the > > > > > > cpus, does not appear to > > > > > > reference the elfcorehdr PT_NOTEs. 
Instead it locates the various > > > > > > cpu_[possible|present|online]_mask and computes > > > > > > nr_cpus from that, and also of > > > > > > course which are online. In addition, when crash does > > > > > > reference the cpu PT_NOTE, > > > > > > to get its prstatus, it does so by using a percpu > > > > > > technique directly in the vmcore > > > > > > image memory, not via the ELF structure. Said > > > > > > differently, it appears to me that > > > > > > crash utility doesn't rely on the ELF PT_NOTEs for > > > > > > cpus; rather it obtains them > > > > > > via kernel cpumasks and the memory within the vmcore. > > > > > > > > > > > > With this understanding, I did some testing. Perhaps the > > > > > > most telling test was that I > > > > > > changed the number of cpu PT_NOTEs emitted in the > > > > > > crash_prepare_elf64_headers() to just 1, > > > > > > hot plugged some cpus, then also took a few offline > > > > > > sparsely via chcpu, then generated a > > > > > > vmcore. The crash utility had no problem loading the > > > > > > vmcore, it reported the proper number > > > > > > of cpus and the number offline (despite only one cpu > > > > > > PT_NOTE), and changing to a different > > > > > > cpu via 'set -c 30' and the backtrace was completely valid. > > > > > > > > > > > > My take away is that crash utility does not rely upon > > > > > > ELF cpu PT_NOTEs, it obtains the > > > > > > cpu information directly from kernel data structures. > > > > > > Perhaps at one time crash relied > > > > > > upon the ELF information, but no more. (Perhaps there > > > > > > are other crash dump analyzers > > > > > > that might rely on the ELF info?) > > > > > > > > > > > > So, all this to say that I see no need to change > > > > > > crash_prepare_elf64_headers(). There > > > > > > is no compelling reason to move away from > > > > > > for_each_present_cpu(), or modify the list for > > > > > > online/offline. > > > > > > > > > > > > Which then leaves the topic of the cpuhp state on which > > > > > > to register. Perhaps reverting > > > > > > back to the use of CPUHP_BP_PREPARE_DYN is the right > > > > > > answer. There does not appear to > > > > > > be a compelling need to accurately track whether the cpu > > > > > > went online/offline for the > > > > > > purposes of creating the elfcorehdr, as ultimately the > > > > > > crash utility pulls that from > > > > > > kernel data structures, not the elfcorehdr. > > > > > > > > > > > > I think this is what Sourabh has known and has been > > > > > > advocating for an optimization > > > > > > path that allows not regenerating the elfcorehdr on cpu > > > > > > changes (because all the percpu > > > > > > structs are all laid out). I do think it best to leave > > > > > > that as an arch choice. > > > > > > > > > > Since things are clear on how the PT_NOTES are consumed in > > > > > kdump kernel [fs/proc/vmcore.c], > > > > > makedumpfile, and crash tool I need your opinion on this: > > > > > > > > > > Do we really need to regenerate elfcorehdr for CPU hotplug events? > > > > > If yes, can you please list the elfcorehdr components that > > > > > changes due to CPU hotplug. > > > > Due to the use of for_each_present_cpu(), it is possible for the > > > > number of cpu PT_NOTEs > > > > to fluctuate as cpus are un/plugged. Onlining/offlining of cpus > > > > does not impact the > > > > number of cpu PT_NOTEs (as the cpus are still present). 
> > > > > > > > > > > > > > From what I understood, crash notes are prepared for > > > > > possible CPUs as system boots and > > > > > could be used to create a PT_NOTE section for each possible > > > > > CPU while generating the elfcorehdr > > > > > during the kdump kernel load. > > > > > > > > > > Now once the elfcorehdr is loaded with PT_NOTEs for every > > > > > possible CPU there is no need to > > > > > regenerate it for CPU hotplug events. Or do we? > > > > > > > > For onlining/offlining of cpus, there is no need to regenerate > > > > the elfcorehdr. However, > > > > for actual hot un/plug of cpus, the answer is yes due to > > > > for_each_present_cpu(). The > > > > caveat here of course is that if crash utility is the only > > > > coredump analyzer of concern, > > > > then it doesn't care about these cpu PT_NOTEs and there would be > > > > no need to re-generate them. > > > > > > > > Also, I'm not sure if ARM cpu hotplug, which is just now coming > > > > into mainstream, impacts > > > > any of this. > > > > > > > > Perhaps the one item that might help here is to distinguish > > > > between actual hot un/plug of > > > > cpus, versus onlining/offlining. At the moment, I can not > > > > distinguish between a hot plug > > > > event and an online event (and unplug/offline). If those were > > > > distinguishable, then we > > > > could only regenerate on un/plug events. > > > > > > > > Or perhaps moving to for_each_possible_cpu() is the better choice? > > > > > > Yes, because once elfcorehdr is built with possible CPUs we don't > > > have to worry about > > > hot[un]plug case. > > > > > > Here is my view on how things should be handled if a core-dump > > > analyzer is dependent on > > > elfcorehdr PT_NOTEs to find online/offline CPUs. > > > > > > A PT_NOTE in elfcorehdr holds the address of the corresponding crash > > > notes (kernel has > > > one crash note per CPU for every possible CPU). Though the crash > > > notes are allocated > > > during the boot time they are populated when the system is on the > > > crash path. > > > > > > This is how crash notes are populated on PowerPC and I am expecting > > > it would be something > > > similar on other architectures too. > > > > > > The crashing CPU sends IPI to every other online CPU with a callback > > > function that updates the > > > crash notes of that specific CPU. Once the IPI completes the > > > crashing CPU updates its own crash > > > note and proceeds further. > > > > > > The crash notes of CPUs remain uninitialized if the CPUs were > > > offline or hot unplugged at the time > > > system crash. The core-dump analyzer should be able to identify > > > [un]/initialized crash notes > > > and display the information accordingly. > > > > > > Thoughts? > > > > > > - Sourabh > > > > In general, I agree with your points. You've presented a strong case to > > go with for_each_possible_cpu() in crash_prepare_elf64_headers() and > > those crash notes would always be present, and we can ignore changes to > > cpus wrt/ elfcorehdr updates. > > > > But what do we do about kexec_load() syscall? The way the userspace > > utility works is it determines cpus by: > > nr_cpus = sysconf(_SC_NPROCESSORS_CONF); > > which is not the equivalent of possible_cpus. So the complete list of > > cpu PT_NOTEs is not generated up front. We would need a solution for > > that? > Hello Eric, > > The sysconf document says _SC_NPROCESSORS_CONF is processors configured, > isn't that equivalent to possible CPUs? > > What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? 
IIUC, on powerPC > it is possible CPUs. From sysconf man page, with my understanding, _SC_NPROCESSORS_CONF is returning the possible cpus, while _SC_NPROCESSORS_ONLN returns present cpus. If these are true, we can use them. But I am wondering why the existing present cpu way is going to be discarded. Sorry, I tried to go through this thread, it's too long, can anyone summarize the reason with shorter and clear sentences. Sorry again for that. > > In case sysconf(_SC_NPROCESSORS_CONF) is not consistent then we can go with: > /sys/devices/system/cpu/possible for kexec_load case. > > Thoughts? > > - Sourabh Jain >
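The sysconf() question above is easy to probe directly. Here is a trivial userspace check (not part of any patch) that prints what glibc reports for the two values under discussion; on its own it does not tell you whether the result means possible, present or online cpus, which is exactly the ambiguity being debated:

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* _SC_NPROCESSORS_CONF: processors "configured" per glibc */
		long conf = sysconf(_SC_NPROCESSORS_CONF);
		/* _SC_NPROCESSORS_ONLN: processors currently online */
		long onln = sysconf(_SC_NPROCESSORS_ONLN);

		printf("configured: %ld, online: %ld\n", conf, onln);
		return 0;
	}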
On 2/28/23 06:44, Baoquan He wrote: > On 02/13/23 at 10:10am, Sourabh Jain wrote: >> >> On 11/02/23 06:05, Eric DeVolder wrote: >>> >>> >>> On 2/10/23 00:29, Sourabh Jain wrote: >>>> >>>> On 10/02/23 01:09, Eric DeVolder wrote: >>>>> >>>>> >>>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>>> Hello Eric, >>>>>> >>>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>>> >>>>>>> >>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>>> Eric! >>>>>>>> >>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>>> >>>>>>>>> So my latest solution is introduce two new CPUHP >>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for >>>>>>>>> offlining. I'm open to better names. >>>>>>>>> >>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be >>>>>>>>> placed after CPUHP_BRINGUP_CPU. My >>>>>>>>> attempts at locating this state failed when >>>>>>>>> inside the STARTING section, so I located >>>>>>>>> this just inside the ONLINE sectoin. The crash >>>>>>>>> hotplug handler is registered on >>>>>>>>> this state as the callback for the .startup method. >>>>>>>>> >>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be >>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I >>>>>>>>> placed it at the end of the PREPARE section. >>>>>>>>> This crash hotplug handler is also >>>>>>>>> registered on this state as the callback for the .teardown method. >>>>>>>> >>>>>>>> TBH, that's still overengineered. Something like this: >>>>>>>> >>>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>>> { >>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>>> >>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>>> } >>>>>>>> >>>>>>>> and use this to query the actual state at crash >>>>>>>> time. That spares all >>>>>>>> those callback heuristics. >>>>>>>> >>>>>>>>> I'm making my way though percpu crash_notes, >>>>>>>>> elfcorehdr, vmcoreinfo, >>>>>>>>> makedumpfile and (the consumer of it all) the >>>>>>>>> userspace crash utility, >>>>>>>>> in order to understand the impact of moving from >>>>>>>>> for_each_present_cpu() >>>>>>>>> to for_each_online_cpu(). >>>>>>>> >>>>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> tglx >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> Thomas, >>>>>>> I've investigated the passing of crash notes through the >>>>>>> vmcore. What I've learned is that: >>>>>>> >>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references >>>>>>> to do its job) does >>>>>>> not care what the contents of cpu PT_NOTES are, but it >>>>>>> does coalesce them together. >>>>>>> >>>>>>> - makedumpfile will count the number of cpu PT_NOTES in >>>>>>> order to determine its >>>>>>> nr_cpus variable, which is reported in a header, but >>>>>>> otherwise unused (except >>>>>>> for sadump method). >>>>>>> >>>>>>> - the crash utility, for the purposes of determining the >>>>>>> cpus, does not appear to >>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>>>> cpu_[possible|present|online]_mask and computes >>>>>>> nr_cpus from that, and also of >>>>>>> course which are online. In addition, when crash does >>>>>>> reference the cpu PT_NOTE, >>>>>>> to get its prstatus, it does so by using a percpu >>>>>>> technique directly in the vmcore >>>>>>> image memory, not via the ELF structure. 
Said >>>>>>> differently, it appears to me that >>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for >>>>>>> cpus; rather it obtains them >>>>>>> via kernel cpumasks and the memory within the vmcore. >>>>>>> >>>>>>> With this understanding, I did some testing. Perhaps the >>>>>>> most telling test was that I >>>>>>> changed the number of cpu PT_NOTEs emitted in the >>>>>>> crash_prepare_elf64_headers() to just 1, >>>>>>> hot plugged some cpus, then also took a few offline >>>>>>> sparsely via chcpu, then generated a >>>>>>> vmcore. The crash utility had no problem loading the >>>>>>> vmcore, it reported the proper number >>>>>>> of cpus and the number offline (despite only one cpu >>>>>>> PT_NOTE), and changing to a different >>>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>>> >>>>>>> My take away is that crash utility does not rely upon >>>>>>> ELF cpu PT_NOTEs, it obtains the >>>>>>> cpu information directly from kernel data structures. >>>>>>> Perhaps at one time crash relied >>>>>>> upon the ELF information, but no more. (Perhaps there >>>>>>> are other crash dump analyzers >>>>>>> that might rely on the ELF info?) >>>>>>> >>>>>>> So, all this to say that I see no need to change >>>>>>> crash_prepare_elf64_headers(). There >>>>>>> is no compelling reason to move away from >>>>>>> for_each_present_cpu(), or modify the list for >>>>>>> online/offline. >>>>>>> >>>>>>> Which then leaves the topic of the cpuhp state on which >>>>>>> to register. Perhaps reverting >>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right >>>>>>> answer. There does not appear to >>>>>>> be a compelling need to accurately track whether the cpu >>>>>>> went online/offline for the >>>>>>> purposes of creating the elfcorehdr, as ultimately the >>>>>>> crash utility pulls that from >>>>>>> kernel data structures, not the elfcorehdr. >>>>>>> >>>>>>> I think this is what Sourabh has known and has been >>>>>>> advocating for an optimization >>>>>>> path that allows not regenerating the elfcorehdr on cpu >>>>>>> changes (because all the percpu >>>>>>> structs are all laid out). I do think it best to leave >>>>>>> that as an arch choice. >>>>>> >>>>>> Since things are clear on how the PT_NOTES are consumed in >>>>>> kdump kernel [fs/proc/vmcore.c], >>>>>> makedumpfile, and crash tool I need your opinion on this: >>>>>> >>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>>> If yes, can you please list the elfcorehdr components that >>>>>> changes due to CPU hotplug. >>>>> Due to the use of for_each_present_cpu(), it is possible for the >>>>> number of cpu PT_NOTEs >>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus >>>>> does not impact the >>>>> number of cpu PT_NOTEs (as the cpus are still present). >>>>> >>>>>> >>>>>> From what I understood, crash notes are prepared for >>>>>> possible CPUs as system boots and >>>>>> could be used to create a PT_NOTE section for each possible >>>>>> CPU while generating the elfcorehdr >>>>>> during the kdump kernel load. >>>>>> >>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every >>>>>> possible CPU there is no need to >>>>>> regenerate it for CPU hotplug events. Or do we? >>>>> >>>>> For onlining/offlining of cpus, there is no need to regenerate >>>>> the elfcorehdr. However, >>>>> for actual hot un/plug of cpus, the answer is yes due to >>>>> for_each_present_cpu(). 
The >>>>> caveat here of course is that if crash utility is the only >>>>> coredump analyzer of concern, >>>>> then it doesn't care about these cpu PT_NOTEs and there would be >>>>> no need to re-generate them. >>>>> >>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming >>>>> into mainstream, impacts >>>>> any of this. >>>>> >>>>> Perhaps the one item that might help here is to distinguish >>>>> between actual hot un/plug of >>>>> cpus, versus onlining/offlining. At the moment, I can not >>>>> distinguish between a hot plug >>>>> event and an online event (and unplug/offline). If those were >>>>> distinguishable, then we >>>>> could only regenerate on un/plug events. >>>>> >>>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>>> >>>> Yes, because once elfcorehdr is built with possible CPUs we don't >>>> have to worry about >>>> hot[un]plug case. >>>> >>>> Here is my view on how things should be handled if a core-dump >>>> analyzer is dependent on >>>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>>> >>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash >>>> notes (kernel has >>>> one crash note per CPU for every possible CPU). Though the crash >>>> notes are allocated >>>> during the boot time they are populated when the system is on the >>>> crash path. >>>> >>>> This is how crash notes are populated on PowerPC and I am expecting >>>> it would be something >>>> similar on other architectures too. >>>> >>>> The crashing CPU sends IPI to every other online CPU with a callback >>>> function that updates the >>>> crash notes of that specific CPU. Once the IPI completes the >>>> crashing CPU updates its own crash >>>> note and proceeds further. >>>> >>>> The crash notes of CPUs remain uninitialized if the CPUs were >>>> offline or hot unplugged at the time >>>> system crash. The core-dump analyzer should be able to identify >>>> [un]/initialized crash notes >>>> and display the information accordingly. >>>> >>>> Thoughts? >>>> >>>> - Sourabh >>> >>> In general, I agree with your points. You've presented a strong case to >>> go with for_each_possible_cpu() in crash_prepare_elf64_headers() and >>> those crash notes would always be present, and we can ignore changes to >>> cpus wrt/ elfcorehdr updates. >>> >>> But what do we do about kexec_load() syscall? The way the userspace >>> utility works is it determines cpus by: >>> nr_cpus = sysconf(_SC_NPROCESSORS_CONF); >>> which is not the equivalent of possible_cpus. So the complete list of >>> cpu PT_NOTEs is not generated up front. We would need a solution for >>> that? >> Hello Eric, >> >> The sysconf document says _SC_NPROCESSORS_CONF is processors configured, >> isn't that equivalent to possible CPUs? >> >> What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? IIUC, on powerPC >> it is possible CPUs. > Baoquan, > From sysconf man page, with my understanding, _SC_NPROCESSORS_CONF is > returning the possible cpus, while _SC_NPROCESSORS_ONLN returns present > cpus. If these are true, we can use them. Thomas Gleixner has pointed out that: glibc tries to evaluate that in the following order: 1) /sys/devices/system/cpu/cpu* That's present CPUs not possible CPUs 2) /proc/stat That's online CPUs 3) sched_getaffinity() That's online CPUs at best. In the worst case it's an affinity mask which is set on a process group meaning that _SC_NPROCESSORS_CONF is not equivalent to possible_cpus(). 
Furthermore, the /sys/devices/system/cpu/cpuXX entries are not available for not-present-but-possible cpus; thus the userspace kexec utility cannot write out the elfcorehdr with all possible cpus listed. > > But I am wondering why the existing present cpu way is going to be > discarded. Sorry, I tried to go through this thread, it's too long, can > anyone summarize the reason with shorter and clear sentences. Sorry > again for that. By utilizing for_each_possible_cpu() in crash_prepare_elf64_headers(), in the case of kexec_file_load(), this change would simplify some issues Sourabh has encountered for PPC support. It would also enable an optimization that permits NOT re-generating the elfcorehdr on cpu changes, as all the [possible] cpus are already described in the elfcorehdr. I've pointed out that this change would have kexec_load (as kexec-tools can only write out, initially, the present_cpus()) initially deviate from kexec_file_load (which would now write out the possible_cpus()). This deviation would disappear after the first hotplug event (due to calling crash_prepare_elf64_headers()). Or I've provided a simple way for kexec_load to rewrite its elfcorehdr upon initial load (by calling into the crash hotplug handler). Can you think of any side effects of going to for_each_possible_cpu()? Thanks, eric > >> >> In case sysconf(_SC_NPROCESSORS_CONF) is not consistent then we can go with: >> /sys/devices/system/cpu/possible for kexec_load case. >> >> Thoughts? >> >> - Sourabh Jain >> >
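If kexec_load were to go this route, the possible-cpu count itself is still available to userspace even though the per-cpu sysfs directories are not: /sys/devices/system/cpu/possible holds a range list such as "0-7" or "0-3,8-11". A hedged sketch of parsing it follows (how kexec-tools would actually consume this, if at all, is a separate question):

	#include <stdio.h>
	#include <stdlib.h>

	/* Count possible CPUs by parsing /sys/devices/system/cpu/possible. */
	static long count_possible_cpus(void)
	{
		FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
		char buf[256];
		long count = 0;
		char *p;

		if (!f)
			return -1;
		if (!fgets(buf, sizeof(buf), f)) {
			fclose(f);
			return -1;
		}
		fclose(f);

		p = buf;
		while (*p) {
			long first = strtol(p, &p, 10);  /* start of range */
			long last = first;

			if (*p == '-')                   /* "A-B" form */
				last = strtol(p + 1, &p, 10);
			count += last - first + 1;
			if (*p == ',')                   /* next range */
				p++;
			else
				break;
		}
		return count;
	}

The catch, as noted above, is that the count alone does not give userspace the per-cpu crash_notes addresses for not-yet-present cpus, since those come from the cpuXX sysfs entries.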
On 2/27/23 00:11, Sourabh Jain wrote: > > On 25/02/23 01:46, Eric DeVolder wrote: >> >> >> On 2/24/23 02:34, Sourabh Jain wrote: >>> >>> On 24/02/23 02:04, Eric DeVolder wrote: >>>> >>>> >>>> On 2/10/23 00:29, Sourabh Jain wrote: >>>>> >>>>> On 10/02/23 01:09, Eric DeVolder wrote: >>>>>> >>>>>> >>>>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>>>> Hello Eric, >>>>>>> >>>>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>>>> Eric! >>>>>>>>> >>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>>>> >>>>>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names. >>>>>>>>>> >>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My >>>>>>>>>> attempts at locating this state failed when inside the STARTING section, so I located >>>>>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on >>>>>>>>>> this state as the callback for the .startup method. >>>>>>>>>> >>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I >>>>>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also >>>>>>>>>> registered on this state as the callback for the .teardown method. >>>>>>>>> >>>>>>>>> TBH, that's still overengineered. Something like this: >>>>>>>>> >>>>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>>>> { >>>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>>>> >>>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>>>> } >>>>>>>>> >>>>>>>>> and use this to query the actual state at crash time. That spares all >>>>>>>>> those callback heuristics. >>>>>>>>> >>>>>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility, >>>>>>>>>> in order to understand the impact of moving from for_each_present_cpu() >>>>>>>>>> to for_each_online_cpu(). >>>>>>>>> >>>>>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> tglx >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> Thomas, >>>>>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that: >>>>>>>> >>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does >>>>>>>> not care what the contents of cpu PT_NOTES are, but it does coalesce them together. >>>>>>>> >>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its >>>>>>>> nr_cpus variable, which is reported in a header, but otherwise unused (except >>>>>>>> for sadump method). >>>>>>>> >>>>>>>> - the crash utility, for the purposes of determining the cpus, does not appear to >>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of >>>>>>>> course which are online. In addition, when crash does reference the cpu PT_NOTE, >>>>>>>> to get its prstatus, it does so by using a percpu technique directly in the vmcore >>>>>>>> image memory, not via the ELF structure. Said differently, it appears to me that >>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them >>>>>>>> via kernel cpumasks and the memory within the vmcore. 
>>>>>>>> >>>>>>>> With this understanding, I did some testing. Perhaps the most telling test was that I >>>>>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1, >>>>>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a >>>>>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number >>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different >>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>>>> >>>>>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the >>>>>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied >>>>>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers >>>>>>>> that might rely on the ELF info?) >>>>>>>> >>>>>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There >>>>>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for >>>>>>>> online/offline. >>>>>>>> >>>>>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting >>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to >>>>>>>> be a compelling need to accurately track whether the cpu went online/offline for the >>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from >>>>>>>> kernel data structures, not the elfcorehdr. >>>>>>>> >>>>>>>> I think this is what Sourabh has known and has been advocating for an optimization >>>>>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu >>>>>>>> structs are all laid out). I do think it best to leave that as an arch choice. >>>>>>> >>>>>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c], >>>>>>> makedumpfile, and crash tool I need your opinion on this: >>>>>>> >>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>>>> If yes, can you please list the elfcorehdr components that changes due to CPU hotplug. >>>>>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs >>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the >>>>>> number of cpu PT_NOTEs (as the cpus are still present). >>>>>> >>>>>>> >>>>>>> From what I understood, crash notes are prepared for possible CPUs as system boots and >>>>>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr >>>>>>> during the kdump kernel load. >>>>>>> >>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to >>>>>>> regenerate it for CPU hotplug events. Or do we? >>>>>> >>>>>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However, >>>>>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The >>>>>> caveat here of course is that if crash utility is the only coredump analyzer of concern, >>>>>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them. >>>>>> >>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts >>>>>> any of this. >>>>>> >>>>>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of >>>>>> cpus, versus onlining/offlining. 
At the moment, I can not distinguish between a hot plug >>>>>> event and an online event (and unplug/offline). If those were distinguishable, then we >>>>>> could only regenerate on un/plug events. >>>>>> >>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>>>> >>>>> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about >>>>> hot[un]plug case. >>>>> >>>>> Here is my view on how things should be handled if a core-dump analyzer is dependent on >>>>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>>>> >>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has >>>>> one crash note per CPU for every possible CPU). Though the crash notes are allocated >>>>> during the boot time they are populated when the system is on the crash path. >>>>> >>>>> This is how crash notes are populated on PowerPC and I am expecting it would be something >>>>> similar on other architectures too. >>>>> >>>>> The crashing CPU sends IPI to every other online CPU with a callback function that updates the >>>>> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash >>>>> note and proceeds further. >>>>> >>>>> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time >>>>> system crash. The core-dump analyzer should be able to identify [un]/initialized crash notes >>>>> and display the information accordingly. >>>>> >>>>> Thoughts? >>>>> >>>>> - Sourabh >>>> >>>> I've been examining what it would mean to move to for_each_possible_cpu() in >>>> crash_prepare_elf64_headers(). I think it means: >>>> >>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers(). >>>> - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the >>>> load with the elfcorehdr generated by crash_prepare_elf64_headers(). >>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug events. >>> >>> If for_each_present_cpu is replaced with for_each_possible_cpu I still need cpuhp machinery >>> to update FDT kexec segment for CPU hot add case. >> >> Ah, ok, that's important! So the cpuhp callbacks are still needed. >>> >>> >>>> >>>> This would then setup PT_NOTEs for all possible cpus, which should in theory accommodate crash >>>> analyzers that rely on ELF PT_NOTEs for crash_notes. >>>> >>>> If staying with for_each_present_cpu() is ultimately decided, then I think leaving the cpuhp >>>> machinery in place and each arch could decide how to handle crash cpu hotplug events. The >>>> overhead for doing this is very minimal, and the events are likely very infrequent. >>> >>> I agree. Some architectures may need cpuhp machinery to update kexec segment[s] other then >>> elfcorehdr. For example FDT on PowerPC. >>> >>> - Sourabh Jain >> >> OK, I was thinking that the desire was to eliminate the cpuhp callbacks. In reality, the desire is >> to change to for_each_possible_cpu(). Given that the kernel creates crash_notes for all possible >> cpus upon kernel boot, there seems to be no reason to not do this? >> >> HOWEVER... >> >> It's not clear to me that this particular change needs to be part of this series. It's inclusion >> would facilitate PPC support, but doesn't "solve" anything in general. 
In fact it causes >> kexec_load and kexec_file_load to deviate (kexec_load via userspace kexec does the equivalent of >> for_each_present_cpu() where as with this change kexec_file_load would do for_each_possible_cpu(); >> until a hot plug event then both would do for_each_possible_cpu()). And if this change were to >> arrive as part of Sourabh's PPC support, then it does not appear to impact x86 (not sure about >> other arches). And the 'crash' dump analyzer doesn't care either way. >> >> Including this change would enable an optimization path (for x86 at least) that short-circuits cpu >> hotplug changes in the arch crash handler, for example: >> >> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >> index aca3f1817674..0883f6b11de4 100644 >> --- a/arch/x86/kernel/crash.c >> +++ b/arch/x86/kernel/crash.c >> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image) >> unsigned long mem, memsz; >> unsigned long elfsz = 0; >> >> + if (image->file_mode && ( >> + image->hp_action == KEXEC_CRASH_HP_ADD_CPU || >> + image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)) >> + return; >> + >> /* >> * Create the new elfcorehdr reflecting the changes to CPU and/or >> * memory resources. >> >> I'm not sure that is compelling given the infrequent nature of cpu hotplug events. > It certainly closes/reduces the window where kdump is not active due kexec segment update.| Fair enough. I plan to include this change in v19. > >> >> In my mind I still have a question about kexec_load() path. The userspace kexec can not do the >> equivalent of for_each_possible_cpu(). It can obtain max possible cpus from >> /sys/devices/system/cpu/possible, but for those cpus not present the /sys/devices/system/cpu/cpuXX >> is not available and so the crash_notes entries is not available. My attempts to expose all cpuXX >> lead to odd behavior that was requiring changes in ACPI and arch code that looked untenable. >> >> There seem to be these options available for kexec_load() path: >> - immediately rewrite the elfcorehdr upon load via a call to crash_prepare_elf64_headers(). I've >> made this work with the following, as proof of concept: > Yes regenerating/patching the elfcorehdr could be an option for kexec_load syscall. So this is not needed by x86, but more so by ppc. Should this change be in the ppc set or this set? > >> >> diff --git a/kernel/kexec.c b/kernel/kexec.c >> index cb8e6e6f983c..4eb201270f97 100644 >> --- a/kernel/kexec.c >> +++ b/kernel/kexec.c >> @@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, unsigned long >> kimage_free(image); >> out_unlock: >> kexec_unlock(); >> + if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) { >> + if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) { >> + crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, KEXEC_CRASH_HP_INVALID_CPU); >> + } >> + } >> return ret; >> } >> >> - Another option is spend the time to determine whether exposing all cpuXX is a viable solution; I >> have no idea what impacts to userspace would be for possible-but-not-yet-present cpuXX entries >> would be. It might also mean requiring a 'present' entry available within the cpuXX. >> >> - Another option is to simply let the hot plug events rewrite the elfcorehdr on demand. This is >> what I've originally put forth, but not sure how this impacts PPC given for_each_possible_cpu() >> change. > Given that /sys/devices/system/cpu/cpuXX is not present for possbile-but-not-yet-present CPUs, I am > wondering do we even have crash notes for possible CPUs on x86? 
Yes, there are crash_notes for all possible cpus on x86. eric >> >> The concern is that today, both kexec_load and kexec_file_load mirror each other with respect to >> for_each_present_cpu(); that is userspace kexec is able to generate the elfcorehdr the same as >> would kexec_file_load, for cpus. But by changing to for_each_possible_cpu(), the two would deviate. > > Thanks, > Sourabh Jain
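The reason the notes exist for every possible cpu is that they come from a boot-time percpu allocation, and percpu storage is instantiated for all possible cpus. Roughly (paraphrased from crash_notes_memory_init() in the kernel, not quoted exactly):

	static int __init crash_notes_memory_init(void)
	{
		/* Percpu allocation: storage exists for every possible CPU,
		 * even those not yet present or online. */
		crash_notes = alloc_percpu(note_buf_t);
		if (!crash_notes) {
			pr_warn("Memory allocation for saving cpu register states failed\n");
			return -ENOMEM;
		}
		return 0;
	}
	subsys_initcall(crash_notes_memory_init);

The notes are only filled in on the crash path (as Sourabh described for PowerPC), but their addresses are stable from boot, which is what lets crash_prepare_elf64_headers() point PT_NOTEs at them for possible cpus.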
On 01/03/23 03:20, Eric DeVolder wrote: > > > On 2/27/23 00:11, Sourabh Jain wrote: >> >> On 25/02/23 01:46, Eric DeVolder wrote: >>> >>> >>> On 2/24/23 02:34, Sourabh Jain wrote: >>>> >>>> On 24/02/23 02:04, Eric DeVolder wrote: >>>>> >>>>> >>>>> On 2/10/23 00:29, Sourabh Jain wrote: >>>>>> >>>>>> On 10/02/23 01:09, Eric DeVolder wrote: >>>>>>> >>>>>>> >>>>>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>>>>> Hello Eric, >>>>>>>> >>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>>>>> Eric! >>>>>>>>>> >>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>>>>> >>>>>>>>>>> So my latest solution is introduce two new CPUHP states, >>>>>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. >>>>>>>>>>> I'm open to better names. >>>>>>>>>>> >>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after >>>>>>>>>>> CPUHP_BRINGUP_CPU. My >>>>>>>>>>> attempts at locating this state failed when inside the >>>>>>>>>>> STARTING section, so I located >>>>>>>>>>> this just inside the ONLINE sectoin. The crash hotplug >>>>>>>>>>> handler is registered on >>>>>>>>>>> this state as the callback for the .startup method. >>>>>>>>>>> >>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before >>>>>>>>>>> CPUHP_TEARDOWN_CPU, and I >>>>>>>>>>> placed it at the end of the PREPARE section. This crash >>>>>>>>>>> hotplug handler is also >>>>>>>>>>> registered on this state as the callback for the .teardown >>>>>>>>>>> method. >>>>>>>>>> >>>>>>>>>> TBH, that's still overengineered. Something like this: >>>>>>>>>> >>>>>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>>>>> { >>>>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>>>>> >>>>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> and use this to query the actual state at crash time. That >>>>>>>>>> spares all >>>>>>>>>> those callback heuristics. >>>>>>>>>> >>>>>>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, >>>>>>>>>>> vmcoreinfo, >>>>>>>>>>> makedumpfile and (the consumer of it all) the userspace >>>>>>>>>>> crash utility, >>>>>>>>>>> in order to understand the impact of moving from >>>>>>>>>>> for_each_present_cpu() >>>>>>>>>>> to for_each_online_cpu(). >>>>>>>>>> >>>>>>>>>> Is the packing actually worth the trouble? What's the actual >>>>>>>>>> win? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> tglx >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> Thomas, >>>>>>>>> I've investigated the passing of crash notes through the >>>>>>>>> vmcore. What I've learned is that: >>>>>>>>> >>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do >>>>>>>>> its job) does >>>>>>>>> not care what the contents of cpu PT_NOTES are, but it does >>>>>>>>> coalesce them together. >>>>>>>>> >>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order >>>>>>>>> to determine its >>>>>>>>> nr_cpus variable, which is reported in a header, but >>>>>>>>> otherwise unused (except >>>>>>>>> for sadump method). >>>>>>>>> >>>>>>>>> - the crash utility, for the purposes of determining the cpus, >>>>>>>>> does not appear to >>>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the >>>>>>>>> various >>>>>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from >>>>>>>>> that, and also of >>>>>>>>> course which are online. 
In addition, when crash does >>>>>>>>> reference the cpu PT_NOTE, >>>>>>>>> to get its prstatus, it does so by using a percpu technique >>>>>>>>> directly in the vmcore >>>>>>>>> image memory, not via the ELF structure. Said differently, >>>>>>>>> it appears to me that >>>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; >>>>>>>>> rather it obtains them >>>>>>>>> via kernel cpumasks and the memory within the vmcore. >>>>>>>>> >>>>>>>>> With this understanding, I did some testing. Perhaps the most >>>>>>>>> telling test was that I >>>>>>>>> changed the number of cpu PT_NOTEs emitted in the >>>>>>>>> crash_prepare_elf64_headers() to just 1, >>>>>>>>> hot plugged some cpus, then also took a few offline sparsely >>>>>>>>> via chcpu, then generated a >>>>>>>>> vmcore. The crash utility had no problem loading the vmcore, >>>>>>>>> it reported the proper number >>>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), >>>>>>>>> and changing to a different >>>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>>>>> >>>>>>>>> My take away is that crash utility does not rely upon ELF cpu >>>>>>>>> PT_NOTEs, it obtains the >>>>>>>>> cpu information directly from kernel data structures. Perhaps >>>>>>>>> at one time crash relied >>>>>>>>> upon the ELF information, but no more. (Perhaps there are >>>>>>>>> other crash dump analyzers >>>>>>>>> that might rely on the ELF info?) >>>>>>>>> >>>>>>>>> So, all this to say that I see no need to change >>>>>>>>> crash_prepare_elf64_headers(). There >>>>>>>>> is no compelling reason to move away from >>>>>>>>> for_each_present_cpu(), or modify the list for >>>>>>>>> online/offline. >>>>>>>>> >>>>>>>>> Which then leaves the topic of the cpuhp state on which to >>>>>>>>> register. Perhaps reverting >>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. >>>>>>>>> There does not appear to >>>>>>>>> be a compelling need to accurately track whether the cpu went >>>>>>>>> online/offline for the >>>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash >>>>>>>>> utility pulls that from >>>>>>>>> kernel data structures, not the elfcorehdr. >>>>>>>>> >>>>>>>>> I think this is what Sourabh has known and has been advocating >>>>>>>>> for an optimization >>>>>>>>> path that allows not regenerating the elfcorehdr on cpu >>>>>>>>> changes (because all the percpu >>>>>>>>> structs are all laid out). I do think it best to leave that as >>>>>>>>> an arch choice. >>>>>>>> >>>>>>>> Since things are clear on how the PT_NOTES are consumed in >>>>>>>> kdump kernel [fs/proc/vmcore.c], >>>>>>>> makedumpfile, and crash tool I need your opinion on this: >>>>>>>> >>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>>>>> If yes, can you please list the elfcorehdr components that >>>>>>>> changes due to CPU hotplug. >>>>>>> Due to the use of for_each_present_cpu(), it is possible for the >>>>>>> number of cpu PT_NOTEs >>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus >>>>>>> does not impact the >>>>>>> number of cpu PT_NOTEs (as the cpus are still present). >>>>>>> >>>>>>>> >>>>>>>> From what I understood, crash notes are prepared for possible >>>>>>>> CPUs as system boots and >>>>>>>> could be used to create a PT_NOTE section for each possible CPU >>>>>>>> while generating the elfcorehdr >>>>>>>> during the kdump kernel load. 
>>>>>>>> >>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every >>>>>>>> possible CPU there is no need to >>>>>>>> regenerate it for CPU hotplug events. Or do we? >>>>>>> >>>>>>> For onlining/offlining of cpus, there is no need to regenerate >>>>>>> the elfcorehdr. However, >>>>>>> for actual hot un/plug of cpus, the answer is yes due to >>>>>>> for_each_present_cpu(). The >>>>>>> caveat here of course is that if crash utility is the only >>>>>>> coredump analyzer of concern, >>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be >>>>>>> no need to re-generate them. >>>>>>> >>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming >>>>>>> into mainstream, impacts >>>>>>> any of this. >>>>>>> >>>>>>> Perhaps the one item that might help here is to distinguish >>>>>>> between actual hot un/plug of >>>>>>> cpus, versus onlining/offlining. At the moment, I can not >>>>>>> distinguish between a hot plug >>>>>>> event and an online event (and unplug/offline). If those were >>>>>>> distinguishable, then we >>>>>>> could only regenerate on un/plug events. >>>>>>> >>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>>>>> >>>>>> Yes, because once elfcorehdr is built with possible CPUs we don't >>>>>> have to worry about >>>>>> hot[un]plug case. >>>>>> >>>>>> Here is my view on how things should be handled if a core-dump >>>>>> analyzer is dependent on >>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>>>>> >>>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding >>>>>> crash notes (kernel has >>>>>> one crash note per CPU for every possible CPU). Though the crash >>>>>> notes are allocated >>>>>> during the boot time they are populated when the system is on the >>>>>> crash path. >>>>>> >>>>>> This is how crash notes are populated on PowerPC and I am >>>>>> expecting it would be something >>>>>> similar on other architectures too. >>>>>> >>>>>> The crashing CPU sends IPI to every other online CPU with a >>>>>> callback function that updates the >>>>>> crash notes of that specific CPU. Once the IPI completes the >>>>>> crashing CPU updates its own crash >>>>>> note and proceeds further. >>>>>> >>>>>> The crash notes of CPUs remain uninitialized if the CPUs were >>>>>> offline or hot unplugged at the time >>>>>> system crash. The core-dump analyzer should be able to identify >>>>>> [un]/initialized crash notes >>>>>> and display the information accordingly. >>>>>> >>>>>> Thoughts? >>>>>> >>>>>> - Sourabh >>>>> >>>>> I've been examining what it would mean to move to >>>>> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think >>>>> it means: >>>>> >>>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in >>>>> crash_prepare_elf64_headers(). >>>>> - For kexec_load() syscall path, rewrite the incoming/supplied >>>>> elfcorehdr immediately on the load with the elfcorehdr generated >>>>> by crash_prepare_elf64_headers(). >>>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug >>>>> events. >>>> >>>> If for_each_present_cpu is replaced with for_each_possible_cpu I >>>> still need cpuhp machinery >>>> to update FDT kexec segment for CPU hot add case. >>> >>> Ah, ok, that's important! So the cpuhp callbacks are still needed. >>>> >>>> >>>>> >>>>> This would then setup PT_NOTEs for all possible cpus, which should >>>>> in theory accommodate crash analyzers that rely on ELF PT_NOTEs >>>>> for crash_notes. 
>>>>> >>>>> If staying with for_each_present_cpu() is ultimately decided, then >>>>> I think leaving the cpuhp machinery in place and each arch could >>>>> decide how to handle crash cpu hotplug events. The overhead for >>>>> doing this is very minimal, and the events are likely very >>>>> infrequent. >>>> >>>> I agree. Some architectures may need cpuhp machinery to update >>>> kexec segment[s] other then elfcorehdr. For example FDT on PowerPC. >>>> >>>> - Sourabh Jain >>> >>> OK, I was thinking that the desire was to eliminate the cpuhp >>> callbacks. In reality, the desire is to change to >>> for_each_possible_cpu(). Given that the kernel creates crash_notes >>> for all possible cpus upon kernel boot, there seems to be no reason >>> to not do this? >>> >>> HOWEVER... >>> >>> It's not clear to me that this particular change needs to be part of >>> this series. It's inclusion would facilitate PPC support, but >>> doesn't "solve" anything in general. In fact it causes kexec_load >>> and kexec_file_load to deviate (kexec_load via userspace kexec does >>> the equivalent of for_each_present_cpu() where as with this change >>> kexec_file_load would do for_each_possible_cpu(); until a hot plug >>> event then both would do for_each_possible_cpu()). And if this >>> change were to arrive as part of Sourabh's PPC support, then it does >>> not appear to impact x86 (not sure about other arches). And the >>> 'crash' dump analyzer doesn't care either way. >>> >>> Including this change would enable an optimization path (for x86 at >>> least) that short-circuits cpu hotplug changes in the arch crash >>> handler, for example: >>> >>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>> index aca3f1817674..0883f6b11de4 100644 >>> --- a/arch/x86/kernel/crash.c >>> +++ b/arch/x86/kernel/crash.c >>> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct >>> kimage *image) >>> unsigned long mem, memsz; >>> unsigned long elfsz = 0; >>> >>> + if (image->file_mode && ( >>> + image->hp_action == KEXEC_CRASH_HP_ADD_CPU || >>> + image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)) >>> + return; >>> + >>> /* >>> * Create the new elfcorehdr reflecting the changes to CPU and/or >>> * memory resources. >>> >>> I'm not sure that is compelling given the infrequent nature of cpu >>> hotplug events. >> It certainly closes/reduces the window where kdump is not active due >> kexec segment update.| > > Fair enough. I plan to include this change in v19. > >> >>> >>> In my mind I still have a question about kexec_load() path. The >>> userspace kexec can not do the equivalent of >>> for_each_possible_cpu(). It can obtain max possible cpus from >>> /sys/devices/system/cpu/possible, but for those cpus not present the >>> /sys/devices/system/cpu/cpuXX is not available and so the >>> crash_notes entries is not available. My attempts to expose all >>> cpuXX lead to odd behavior that was requiring changes in ACPI and >>> arch code that looked untenable. >>> >>> There seem to be these options available for kexec_load() path: >>> - immediately rewrite the elfcorehdr upon load via a call to >>> crash_prepare_elf64_headers(). I've made this work with the >>> following, as proof of concept: >> Yes regenerating/patching the elfcorehdr could be an option for >> kexec_load syscall. > So this is not needed by x86, but more so by ppc. Should this change > be in the ppc set or this set? 
Since /sys/devices/system/cpu/cpuXX represents possible CPUs on PowerPC, there is no need for elfcorehdr regeneration on PowerPC for the kexec_load case for CPU hotplug events. My ask is: keep the cpuhp machinery so that architectures can update other kexec segments, if needed, for the CPU add/remove case. In case x86 has nothing to update on CPU hotplug events and you want to remove the CPU hp machinery, I can add the same in the ppc patch series. Thanks, Sourabh Jain
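For what it's worth, keeping the cpuhp machinery while short-circuiting the elfcorehdr work is cheap. Below is a hedged sketch of what the registration could look like, reusing the handler and KEXEC_CRASH_HP_* names from the diffs earlier in this thread; the callback names, the "crash/cpuhp" string and the use of CPUHP_BP_PREPARE_DYN here are placeholders, not necessarily what the series does:

	static int crash_cpuhp_online(unsigned int cpu)
	{
		crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
		return 0;
	}

	static int crash_cpuhp_offline(unsigned int cpu)
	{
		crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
		return 0;
	}

	static int __init crash_hotplug_init(void)
	{
		int ret;

		/* Dynamic state in the PREPARE section; the exact state to use
		 * is still under discussion earlier in the thread. */
		ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/cpuhp",
						crash_cpuhp_online,
						crash_cpuhp_offline);
		return ret < 0 ? ret : 0;
	}
	subsys_initcall(crash_hotplug_init);

An architecture such as PowerPC would then do its FDT segment update from its arch_crash_handle_hotplug_event() implementation, while x86 could return early for CPU events as in the snippet Eric posted above.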
On 3/1/23 00:22, Sourabh Jain wrote: > > On 01/03/23 03:20, Eric DeVolder wrote: >> >> >> On 2/27/23 00:11, Sourabh Jain wrote: >>> >>> On 25/02/23 01:46, Eric DeVolder wrote: >>>> >>>> >>>> On 2/24/23 02:34, Sourabh Jain wrote: >>>>> >>>>> On 24/02/23 02:04, Eric DeVolder wrote: >>>>>> >>>>>> >>>>>> On 2/10/23 00:29, Sourabh Jain wrote: >>>>>>> >>>>>>> On 10/02/23 01:09, Eric DeVolder wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>>>>>> Hello Eric, >>>>>>>>> >>>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>>>>>> Eric! >>>>>>>>>>> >>>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>>>>>> >>>>>>>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names. >>>>>>>>>>>> >>>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My >>>>>>>>>>>> attempts at locating this state failed when inside the STARTING section, so I located >>>>>>>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on >>>>>>>>>>>> this state as the callback for the .startup method. >>>>>>>>>>>> >>>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I >>>>>>>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also >>>>>>>>>>>> registered on this state as the callback for the .teardown method. >>>>>>>>>>> >>>>>>>>>>> TBH, that's still overengineered. Something like this: >>>>>>>>>>> >>>>>>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>>>>>> { >>>>>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>>>>>> >>>>>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> and use this to query the actual state at crash time. That spares all >>>>>>>>>>> those callback heuristics. >>>>>>>>>>> >>>>>>>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo, >>>>>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility, >>>>>>>>>>>> in order to understand the impact of moving from for_each_present_cpu() >>>>>>>>>>>> to for_each_online_cpu(). >>>>>>>>>>> >>>>>>>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> tglx >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thomas, >>>>>>>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that: >>>>>>>>>> >>>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does >>>>>>>>>> not care what the contents of cpu PT_NOTES are, but it does coalesce them together. >>>>>>>>>> >>>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its >>>>>>>>>> nr_cpus variable, which is reported in a header, but otherwise unused (except >>>>>>>>>> for sadump method). >>>>>>>>>> >>>>>>>>>> - the crash utility, for the purposes of determining the cpus, does not appear to >>>>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>>>>>>> cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of >>>>>>>>>> course which are online. 
In addition, when crash does reference the cpu PT_NOTE, >>>>>>>>>> to get its prstatus, it does so by using a percpu technique directly in the vmcore >>>>>>>>>> image memory, not via the ELF structure. Said differently, it appears to me that >>>>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them >>>>>>>>>> via kernel cpumasks and the memory within the vmcore. >>>>>>>>>> >>>>>>>>>> With this understanding, I did some testing. Perhaps the most telling test was that I >>>>>>>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1, >>>>>>>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a >>>>>>>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number >>>>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different >>>>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>>>>>> >>>>>>>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the >>>>>>>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied >>>>>>>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers >>>>>>>>>> that might rely on the ELF info?) >>>>>>>>>> >>>>>>>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There >>>>>>>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for >>>>>>>>>> online/offline. >>>>>>>>>> >>>>>>>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting >>>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to >>>>>>>>>> be a compelling need to accurately track whether the cpu went online/offline for the >>>>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from >>>>>>>>>> kernel data structures, not the elfcorehdr. >>>>>>>>>> >>>>>>>>>> I think this is what Sourabh has known and has been advocating for an optimization >>>>>>>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu >>>>>>>>>> structs are all laid out). I do think it best to leave that as an arch choice. >>>>>>>>> >>>>>>>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c], >>>>>>>>> makedumpfile, and crash tool I need your opinion on this: >>>>>>>>> >>>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>>>>>> If yes, can you please list the elfcorehdr components that changes due to CPU hotplug. >>>>>>>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs >>>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the >>>>>>>> number of cpu PT_NOTEs (as the cpus are still present). >>>>>>>> >>>>>>>>> >>>>>>>>> From what I understood, crash notes are prepared for possible CPUs as system boots and >>>>>>>>> could be used to create a PT_NOTE section for each possible CPU while generating the >>>>>>>>> elfcorehdr >>>>>>>>> during the kdump kernel load. >>>>>>>>> >>>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to >>>>>>>>> regenerate it for CPU hotplug events. Or do we? >>>>>>>> >>>>>>>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. 
However, >>>>>>>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The >>>>>>>> caveat here of course is that if crash utility is the only coredump analyzer of concern, >>>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them. >>>>>>>> >>>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts >>>>>>>> any of this. >>>>>>>> >>>>>>>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of >>>>>>>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug >>>>>>>> event and an online event (and unplug/offline). If those were distinguishable, then we >>>>>>>> could only regenerate on un/plug events. >>>>>>>> >>>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>>>>>> >>>>>>> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about >>>>>>> hot[un]plug case. >>>>>>> >>>>>>> Here is my view on how things should be handled if a core-dump analyzer is dependent on >>>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>>>>>> >>>>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has >>>>>>> one crash note per CPU for every possible CPU). Though the crash notes are allocated >>>>>>> during the boot time they are populated when the system is on the crash path. >>>>>>> >>>>>>> This is how crash notes are populated on PowerPC and I am expecting it would be something >>>>>>> similar on other architectures too. >>>>>>> >>>>>>> The crashing CPU sends IPI to every other online CPU with a callback function that updates the >>>>>>> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash >>>>>>> note and proceeds further. >>>>>>> >>>>>>> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the >>>>>>> time >>>>>>> system crash. The core-dump analyzer should be able to identify [un]/initialized crash notes >>>>>>> and display the information accordingly. >>>>>>> >>>>>>> Thoughts? >>>>>>> >>>>>>> - Sourabh >>>>>> >>>>>> I've been examining what it would mean to move to for_each_possible_cpu() in >>>>>> crash_prepare_elf64_headers(). I think it means: >>>>>> >>>>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers(). >>>>>> - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the >>>>>> load with the elfcorehdr generated by crash_prepare_elf64_headers(). >>>>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug events. >>>>> >>>>> If for_each_present_cpu is replaced with for_each_possible_cpu I still need cpuhp machinery >>>>> to update FDT kexec segment for CPU hot add case. >>>> >>>> Ah, ok, that's important! So the cpuhp callbacks are still needed. >>>>> >>>>> >>>>>> >>>>>> This would then setup PT_NOTEs for all possible cpus, which should in theory accommodate crash >>>>>> analyzers that rely on ELF PT_NOTEs for crash_notes. >>>>>> >>>>>> If staying with for_each_present_cpu() is ultimately decided, then I think leaving the cpuhp >>>>>> machinery in place and each arch could decide how to handle crash cpu hotplug events. The >>>>>> overhead for doing this is very minimal, and the events are likely very infrequent. >>>>> >>>>> I agree. Some architectures may need cpuhp machinery to update kexec segment[s] other then >>>>> elfcorehdr. For example FDT on PowerPC. 
>>>>> >>>>> - Sourabh Jain >>>> >>>> OK, I was thinking that the desire was to eliminate the cpuhp callbacks. In reality, the desire >>>> is to change to for_each_possible_cpu(). Given that the kernel creates crash_notes for all >>>> possible cpus upon kernel boot, there seems to be no reason to not do this? >>>> >>>> HOWEVER... >>>> >>>> It's not clear to me that this particular change needs to be part of this series. It's inclusion >>>> would facilitate PPC support, but doesn't "solve" anything in general. In fact it causes >>>> kexec_load and kexec_file_load to deviate (kexec_load via userspace kexec does the equivalent of >>>> for_each_present_cpu() where as with this change kexec_file_load would do >>>> for_each_possible_cpu(); until a hot plug event then both would do for_each_possible_cpu()). And >>>> if this change were to arrive as part of Sourabh's PPC support, then it does not appear to >>>> impact x86 (not sure about other arches). And the 'crash' dump analyzer doesn't care either way. >>>> >>>> Including this change would enable an optimization path (for x86 at least) that short-circuits >>>> cpu hotplug changes in the arch crash handler, for example: >>>> >>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>> index aca3f1817674..0883f6b11de4 100644 >>>> --- a/arch/x86/kernel/crash.c >>>> +++ b/arch/x86/kernel/crash.c >>>> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image) >>>> unsigned long mem, memsz; >>>> unsigned long elfsz = 0; >>>> >>>> + if (image->file_mode && ( >>>> + image->hp_action == KEXEC_CRASH_HP_ADD_CPU || >>>> + image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)) >>>> + return; >>>> + >>>> /* >>>> * Create the new elfcorehdr reflecting the changes to CPU and/or >>>> * memory resources. >>>> >>>> I'm not sure that is compelling given the infrequent nature of cpu hotplug events. >>> It certainly closes/reduces the window where kdump is not active due kexec segment update.| >> >> Fair enough. I plan to include this change in v19. >> >>> >>>> >>>> In my mind I still have a question about kexec_load() path. The userspace kexec can not do the >>>> equivalent of for_each_possible_cpu(). It can obtain max possible cpus from >>>> /sys/devices/system/cpu/possible, but for those cpus not present the >>>> /sys/devices/system/cpu/cpuXX is not available and so the crash_notes entries is not available. >>>> My attempts to expose all cpuXX lead to odd behavior that was requiring changes in ACPI and arch >>>> code that looked untenable. >>>> >>>> There seem to be these options available for kexec_load() path: >>>> - immediately rewrite the elfcorehdr upon load via a call to crash_prepare_elf64_headers(). I've >>>> made this work with the following, as proof of concept: >>> Yes regenerating/patching the elfcorehdr could be an option for kexec_load syscall. >> So this is not needed by x86, but more so by ppc. Should this change be in the ppc set or this set? > Since /sys/devices/system/cpu/cpuXX represents possible CPUs on PowerPC, there is no need for > elfcorehdr regeneration on PowerPC for kexec_load case > for CPU hotplug events. > > My ask is, keep the cpuhp machinery so that architectures can update other kexec segments if needed > of CPU add/remove case. > > In case x86 has nothing to update on CPU hotplug events and you want remove the CPU hp machinery I > can add the same > in ppc patch series. 
I'll keep the cpuhp machinery; it is still needed in general, and for kexec_load usage in particular, since we are changing crash_prepare_elf64_headers() to for_each_possible_cpu(). eric > > Thanks, > Sourabh Jain
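For context, "keeping the cpuhp machinery" essentially means leaving a pair of hotplug callbacks registered so the crash handler runs on CPU add/remove. Below is a minimal sketch only, assuming a dynamic CPUHP_BP_PREPARE_DYN state and reusing the handle_hotplug_event()/KEXEC_CRASH_HP_* names that appear in the patches elsewhere in this thread; the actual registration in the series may differ.

/*
 * Sketch: register crash hotplug callbacks on a dynamic PREPARE state.
 * Handler names and the chosen state are illustrative placeholders.
 */
static int crash_cpuhp_online(unsigned int cpu)
{
	handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
	return 0;
}

static int crash_cpuhp_offline(unsigned int cpu)
{
	handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
	return 0;
}

static int __init crash_hotplug_init(void)
{
	int ret;

	/* CPUHP_BP_PREPARE_DYN allocates a dynamic state below CPUHP_BRINGUP_CPU */
	ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/cpuhp",
					crash_cpuhp_online, crash_cpuhp_offline);
	return ret < 0 ? ret : 0;
}
subsys_initcall(crash_hotplug_init);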
On 2/28/23 12:52, Eric DeVolder wrote: > > > On 2/28/23 06:44, Baoquan He wrote: >> On 02/13/23 at 10:10am, Sourabh Jain wrote: >>> >>> On 11/02/23 06:05, Eric DeVolder wrote: >>>> >>>> >>>> On 2/10/23 00:29, Sourabh Jain wrote: >>>>> >>>>> On 10/02/23 01:09, Eric DeVolder wrote: >>>>>> >>>>>> >>>>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>>>> Hello Eric, >>>>>>> >>>>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>>>> Eric! >>>>>>>>> >>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>>>> >>>>>>>>>> So my latest solution is introduce two new CPUHP >>>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for >>>>>>>>>> offlining. I'm open to better names. >>>>>>>>>> >>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be >>>>>>>>>> placed after CPUHP_BRINGUP_CPU. My >>>>>>>>>> attempts at locating this state failed when >>>>>>>>>> inside the STARTING section, so I located >>>>>>>>>> this just inside the ONLINE sectoin. The crash >>>>>>>>>> hotplug handler is registered on >>>>>>>>>> this state as the callback for the .startup method. >>>>>>>>>> >>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be >>>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I >>>>>>>>>> placed it at the end of the PREPARE section. >>>>>>>>>> This crash hotplug handler is also >>>>>>>>>> registered on this state as the callback for the .teardown method. >>>>>>>>> >>>>>>>>> TBH, that's still overengineered. Something like this: >>>>>>>>> >>>>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>>>> { >>>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>>>> >>>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>>>> } >>>>>>>>> >>>>>>>>> and use this to query the actual state at crash >>>>>>>>> time. That spares all >>>>>>>>> those callback heuristics. >>>>>>>>> >>>>>>>>>> I'm making my way though percpu crash_notes, >>>>>>>>>> elfcorehdr, vmcoreinfo, >>>>>>>>>> makedumpfile and (the consumer of it all) the >>>>>>>>>> userspace crash utility, >>>>>>>>>> in order to understand the impact of moving from >>>>>>>>>> for_each_present_cpu() >>>>>>>>>> to for_each_online_cpu(). >>>>>>>>> >>>>>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> tglx >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> Thomas, >>>>>>>> I've investigated the passing of crash notes through the >>>>>>>> vmcore. What I've learned is that: >>>>>>>> >>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references >>>>>>>> to do its job) does >>>>>>>> not care what the contents of cpu PT_NOTES are, but it >>>>>>>> does coalesce them together. >>>>>>>> >>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in >>>>>>>> order to determine its >>>>>>>> nr_cpus variable, which is reported in a header, but >>>>>>>> otherwise unused (except >>>>>>>> for sadump method). >>>>>>>> >>>>>>>> - the crash utility, for the purposes of determining the >>>>>>>> cpus, does not appear to >>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the various >>>>>>>> cpu_[possible|present|online]_mask and computes >>>>>>>> nr_cpus from that, and also of >>>>>>>> course which are online. In addition, when crash does >>>>>>>> reference the cpu PT_NOTE, >>>>>>>> to get its prstatus, it does so by using a percpu >>>>>>>> technique directly in the vmcore >>>>>>>> image memory, not via the ELF structure. 
Said >>>>>>>> differently, it appears to me that >>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for >>>>>>>> cpus; rather it obtains them >>>>>>>> via kernel cpumasks and the memory within the vmcore. >>>>>>>> >>>>>>>> With this understanding, I did some testing. Perhaps the >>>>>>>> most telling test was that I >>>>>>>> changed the number of cpu PT_NOTEs emitted in the >>>>>>>> crash_prepare_elf64_headers() to just 1, >>>>>>>> hot plugged some cpus, then also took a few offline >>>>>>>> sparsely via chcpu, then generated a >>>>>>>> vmcore. The crash utility had no problem loading the >>>>>>>> vmcore, it reported the proper number >>>>>>>> of cpus and the number offline (despite only one cpu >>>>>>>> PT_NOTE), and changing to a different >>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>>>> >>>>>>>> My take away is that crash utility does not rely upon >>>>>>>> ELF cpu PT_NOTEs, it obtains the >>>>>>>> cpu information directly from kernel data structures. >>>>>>>> Perhaps at one time crash relied >>>>>>>> upon the ELF information, but no more. (Perhaps there >>>>>>>> are other crash dump analyzers >>>>>>>> that might rely on the ELF info?) >>>>>>>> >>>>>>>> So, all this to say that I see no need to change >>>>>>>> crash_prepare_elf64_headers(). There >>>>>>>> is no compelling reason to move away from >>>>>>>> for_each_present_cpu(), or modify the list for >>>>>>>> online/offline. >>>>>>>> >>>>>>>> Which then leaves the topic of the cpuhp state on which >>>>>>>> to register. Perhaps reverting >>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right >>>>>>>> answer. There does not appear to >>>>>>>> be a compelling need to accurately track whether the cpu >>>>>>>> went online/offline for the >>>>>>>> purposes of creating the elfcorehdr, as ultimately the >>>>>>>> crash utility pulls that from >>>>>>>> kernel data structures, not the elfcorehdr. >>>>>>>> >>>>>>>> I think this is what Sourabh has known and has been >>>>>>>> advocating for an optimization >>>>>>>> path that allows not regenerating the elfcorehdr on cpu >>>>>>>> changes (because all the percpu >>>>>>>> structs are all laid out). I do think it best to leave >>>>>>>> that as an arch choice. >>>>>>> >>>>>>> Since things are clear on how the PT_NOTES are consumed in >>>>>>> kdump kernel [fs/proc/vmcore.c], >>>>>>> makedumpfile, and crash tool I need your opinion on this: >>>>>>> >>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>>>> If yes, can you please list the elfcorehdr components that >>>>>>> changes due to CPU hotplug. >>>>>> Due to the use of for_each_present_cpu(), it is possible for the >>>>>> number of cpu PT_NOTEs >>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus >>>>>> does not impact the >>>>>> number of cpu PT_NOTEs (as the cpus are still present). >>>>>> >>>>>>> >>>>>>> From what I understood, crash notes are prepared for >>>>>>> possible CPUs as system boots and >>>>>>> could be used to create a PT_NOTE section for each possible >>>>>>> CPU while generating the elfcorehdr >>>>>>> during the kdump kernel load. >>>>>>> >>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every >>>>>>> possible CPU there is no need to >>>>>>> regenerate it for CPU hotplug events. Or do we? >>>>>> >>>>>> For onlining/offlining of cpus, there is no need to regenerate >>>>>> the elfcorehdr. However, >>>>>> for actual hot un/plug of cpus, the answer is yes due to >>>>>> for_each_present_cpu(). 
The >>>>>> caveat here of course is that if crash utility is the only >>>>>> coredump analyzer of concern, >>>>>> then it doesn't care about these cpu PT_NOTEs and there would be >>>>>> no need to re-generate them. >>>>>> >>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming >>>>>> into mainstream, impacts >>>>>> any of this. >>>>>> >>>>>> Perhaps the one item that might help here is to distinguish >>>>>> between actual hot un/plug of >>>>>> cpus, versus onlining/offlining. At the moment, I can not >>>>>> distinguish between a hot plug >>>>>> event and an online event (and unplug/offline). If those were >>>>>> distinguishable, then we >>>>>> could only regenerate on un/plug events. >>>>>> >>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>>>> >>>>> Yes, because once elfcorehdr is built with possible CPUs we don't >>>>> have to worry about >>>>> hot[un]plug case. >>>>> >>>>> Here is my view on how things should be handled if a core-dump >>>>> analyzer is dependent on >>>>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>>>> >>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash >>>>> notes (kernel has >>>>> one crash note per CPU for every possible CPU). Though the crash >>>>> notes are allocated >>>>> during the boot time they are populated when the system is on the >>>>> crash path. >>>>> >>>>> This is how crash notes are populated on PowerPC and I am expecting >>>>> it would be something >>>>> similar on other architectures too. >>>>> >>>>> The crashing CPU sends IPI to every other online CPU with a callback >>>>> function that updates the >>>>> crash notes of that specific CPU. Once the IPI completes the >>>>> crashing CPU updates its own crash >>>>> note and proceeds further. >>>>> >>>>> The crash notes of CPUs remain uninitialized if the CPUs were >>>>> offline or hot unplugged at the time >>>>> system crash. The core-dump analyzer should be able to identify >>>>> [un]/initialized crash notes >>>>> and display the information accordingly. >>>>> >>>>> Thoughts? >>>>> >>>>> - Sourabh >>>> >>>> In general, I agree with your points. You've presented a strong case to >>>> go with for_each_possible_cpu() in crash_prepare_elf64_headers() and >>>> those crash notes would always be present, and we can ignore changes to >>>> cpus wrt/ elfcorehdr updates. >>>> >>>> But what do we do about kexec_load() syscall? The way the userspace >>>> utility works is it determines cpus by: >>>> nr_cpus = sysconf(_SC_NPROCESSORS_CONF); >>>> which is not the equivalent of possible_cpus. So the complete list of >>>> cpu PT_NOTEs is not generated up front. We would need a solution for >>>> that? >>> Hello Eric, >>> >>> The sysconf document says _SC_NPROCESSORS_CONF is processors configured, >>> isn't that equivalent to possible CPUs? >>> >>> What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? IIUC, on powerPC >>> it is possible CPUs. >> > Baoquan, > >> From sysconf man page, with my understanding, _SC_NPROCESSORS_CONF is >> returning the possible cpus, while _SC_NPROCESSORS_ONLN returns present >> cpus. If these are true, we can use them. > > Thomas Gleixner has pointed out that: > > glibc tries to evaluate that in the following order: > 1) /sys/devices/system/cpu/cpu* > That's present CPUs not possible CPUs > 2) /proc/stat > That's online CPUs > 3) sched_getaffinity() > That's online CPUs at best. 
In the worst case it's an affinity mask > which is set on a process group > > meaning that _SC_NPROCESSORS_CONF is not equivalent to possible_cpus(). Furthermore, the > /sys/system/devices/cpus/cpuXX entries are not available for not-present-but-possible cpus; thus > userspace kexec utility can not write out the elfcorehdr with all possible cpus listed. > >> >> But I am wondering why the existing present cpu way is going to be >> discarded. Sorry, I tried to go through this thread, it's too long, can >> anyone summarize the reason with shorter and clear sentences. Sorry >> again for that. > > By utilizing for_each_possible_cpu() in crash_prepare_elf64_headers(), in the case of the > kexec_file_load(), this change would simplify some issues Sourabh has encountered for PPC support. > It would also enable an optimization that permits NOT re-generating the elfcorehdr on cpu changes, > as all the [possible] cpus are already described in the elfcorehdr. > > I've pointed out that this change would have kexec_load (as kexec-tools can only write out, > initially, the present_cpus()) initially deviate from kexec_file_load (which would now write out the > possible_cpus()). This deviation would disappear after the first hotplug event (due to calling > crash_prepare_elf64_headers()). Or I've provided a simple way for kexec_load to rewrite its > elfcorehdr upon initial load (by calling into the crash hotplug handler). > > Can you think of any side effects of going to for_each_possible_cpu()? > > Thanks, > eric Well, this won't be shorter sentences, but hopefully it makes the case clearer. Below I've cut-n-pasted my current patch w/ commit message which explains it all. Please let me know if you can think of any side effects not addressed! Thanks, eric > > >> >>> >>> In case sysconf(_SC_NPROCESSORS_CONF) is not consistent then we can go with: >>> /sys/devices/system/cpu/possible for kexec_load case. >>> >>> Thoughts? >>> >>> - Sourabh Jain >>> >> From b56aa428b07d970f26e3c3704d54ce8805f05ddc Mon Sep 17 00:00:00 2001 From: Eric DeVolder <eric.devolder@oracle.com> Date: Tue, 28 Feb 2023 14:20:04 -0500 Subject: [PATCH v19 3/7] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu() The function crash_prepare_elf64_headers() generates the elfcorehdr which describes the cpus and memory in the system for the crash kernel. In particular, it writes out ELF PT_NOTEs for memory regions and the processors in the system. With respect to the cpus, the current implementation utilizes for_each_present_cpu() which means that as cpus are added and removed, the elfcorehdr must again be updated to reflect the new set of cpus. The reasoning behind the change to use for_each_possible_cpu(), is: - At kernel boot time, all percpu crash_notes are allocated for all possible cpus; that is, crash_notes are not allocated dynamically when cpus are plugged/unplugged. Thus the crash_notes for each possible cpu are always available. - The crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu. Changing to for_each_possible_cpu() is valid as the crash_notes pointed to by each cpu PT_NOTE are present and always valid. Furthermore, examining a common crash processing path of: kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer elfcorehdr /proc/vmcore vmcore reveals how the ELF cpu PT_NOTEs are utilized: - Upon panic, each cpu is sent an IPI and shuts itself down, recording its state in its crash_notes. When all cpus are shutdown, the crash kernel is launched with a pointer to the elfcorehdr. 
- The crash kernel via linux/fs/proc/vmcore.c does not examine or use the contents of the PT_NOTEs, it exposes them via /proc/vmcore. - The makedumpfile utility uses /proc/vmcore and reads the cpu PT_NOTEs to craft a nr_cpus variable, which is reported in a header but otherwise generally unused. Makedumpfile creates the vmcore. - The 'crash' dump analyzer does not appear to reference the cpu PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask symbols and directly examines those structure contents from vmcore memory. From that information it is able to determine which cpus are present and online, and locate the corresponding crash_notes. Said differently, it appears to me that 'crash' analyzer does not rely on the ELF PT_NOTEs for cpus; rather it obtains the information directly via kernel symbols and the memory within the vmcore. (There maybe other vmcore generating and analysis tools that do use these PT_NOTEs, but 'makedumpfile' and 'crash' seem to me to be the most common solution.) This change results in the benefit of having all cpus described in the elfcorehdr, and therefore reducing the need to re-generate the elfcorehdr on cpu changes, at the small expense of an additional 56 bytes per PT_NOTE for not-present-but-possible cpus. On systems where kexec_file_load() syscall is utilized, all the above is valid. On systems where kexec_load() syscall is utilized, there may be the need for the elfcorehdr to be regenerated once. The reason being that some archs only populate the 'present' cpus in the /sys/devices/system/cpus entries, which the userspace 'kexec' utility uses to generate the userspace-supplied elfcorehdr. In this situation, one memory or cpu change will rewrite the elfcorehdr via the crash_prepare_elf64_headers() function and now all possible cpus will be described, just as with kexec_file_load() syscall. Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> --- kernel/crash_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index dba4b75f7541..537b199a8774 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -365,7 +365,7 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, ehdr->e_phentsize = sizeof(Elf64_Phdr); /* Prepare one phdr of type PT_NOTE for each present CPU */ - for_each_present_cpu(cpu) { + for_each_possible_cpu(cpu) { phdr->p_type = PT_NOTE; notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); phdr->p_offset = phdr->p_paddr = notes_addr;
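The commit message's claim that crash_notes exist for every possible cpu rests on them being percpu data allocated once at boot. Roughly, simplified from crash_notes_memory_init() in kernel/kexec_core.c (details may differ across kernel versions):

static note_buf_t __percpu *crash_notes;

static int __init crash_notes_memory_init(void)
{
	size_t size, align;

	/*
	 * __alloc_percpu() reserves a slot in every possible CPU's percpu
	 * area, so a crash note exists even for not-yet-present CPUs.
	 */
	size = sizeof(note_buf_t);
	align = min(roundup_pow_of_two(sizeof(note_buf_t)), PAGE_SIZE);
	crash_notes = __alloc_percpu(size, align);
	if (!crash_notes) {
		pr_warn("Memory allocation for saving cpu register states failed\n");
		return -ENOMEM;
	}
	return 0;
}
subsys_initcall(crash_notes_memory_init);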
On 01/03/23 00:22, Eric DeVolder wrote: > > > On 2/28/23 06:44, Baoquan He wrote: >> On 02/13/23 at 10:10am, Sourabh Jain wrote: >>> >>> On 11/02/23 06:05, Eric DeVolder wrote: >>>> >>>> >>>> On 2/10/23 00:29, Sourabh Jain wrote: >>>>> >>>>> On 10/02/23 01:09, Eric DeVolder wrote: >>>>>> >>>>>> >>>>>> On 2/9/23 12:43, Sourabh Jain wrote: >>>>>>> Hello Eric, >>>>>>> >>>>>>> On 09/02/23 23:01, Eric DeVolder wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote: >>>>>>>>> Eric! >>>>>>>>> >>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote: >>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote: >>>>>>>>>> >>>>>>>>>> So my latest solution is introduce two new CPUHP >>>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE >>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for >>>>>>>>>> offlining. I'm open to better names. >>>>>>>>>> >>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be >>>>>>>>>> placed after CPUHP_BRINGUP_CPU. My >>>>>>>>>> attempts at locating this state failed when >>>>>>>>>> inside the STARTING section, so I located >>>>>>>>>> this just inside the ONLINE sectoin. The crash >>>>>>>>>> hotplug handler is registered on >>>>>>>>>> this state as the callback for the .startup method. >>>>>>>>>> >>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be >>>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I >>>>>>>>>> placed it at the end of the PREPARE section. >>>>>>>>>> This crash hotplug handler is also >>>>>>>>>> registered on this state as the callback for the .teardown >>>>>>>>>> method. >>>>>>>>> >>>>>>>>> TBH, that's still overengineered. Something like this: >>>>>>>>> >>>>>>>>> bool cpu_is_alive(unsigned int cpu) >>>>>>>>> { >>>>>>>>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); >>>>>>>>> >>>>>>>>> return data_race(st->state) <= CPUHP_AP_IDLE_DEAD; >>>>>>>>> } >>>>>>>>> >>>>>>>>> and use this to query the actual state at crash >>>>>>>>> time. That spares all >>>>>>>>> those callback heuristics. >>>>>>>>> >>>>>>>>>> I'm making my way though percpu crash_notes, >>>>>>>>>> elfcorehdr, vmcoreinfo, >>>>>>>>>> makedumpfile and (the consumer of it all) the >>>>>>>>>> userspace crash utility, >>>>>>>>>> in order to understand the impact of moving from >>>>>>>>>> for_each_present_cpu() >>>>>>>>>> to for_each_online_cpu(). >>>>>>>>> >>>>>>>>> Is the packing actually worth the trouble? What's the actual win? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> tglx >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> Thomas, >>>>>>>> I've investigated the passing of crash notes through the >>>>>>>> vmcore. What I've learned is that: >>>>>>>> >>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references >>>>>>>> to do its job) does >>>>>>>> not care what the contents of cpu PT_NOTES are, but it >>>>>>>> does coalesce them together. >>>>>>>> >>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in >>>>>>>> order to determine its >>>>>>>> nr_cpus variable, which is reported in a header, but >>>>>>>> otherwise unused (except >>>>>>>> for sadump method). >>>>>>>> >>>>>>>> - the crash utility, for the purposes of determining the >>>>>>>> cpus, does not appear to >>>>>>>> reference the elfcorehdr PT_NOTEs. Instead it locates the >>>>>>>> various >>>>>>>> cpu_[possible|present|online]_mask and computes >>>>>>>> nr_cpus from that, and also of >>>>>>>> course which are online. 
In addition, when crash does >>>>>>>> reference the cpu PT_NOTE, >>>>>>>> to get its prstatus, it does so by using a percpu >>>>>>>> technique directly in the vmcore >>>>>>>> image memory, not via the ELF structure. Said >>>>>>>> differently, it appears to me that >>>>>>>> crash utility doesn't rely on the ELF PT_NOTEs for >>>>>>>> cpus; rather it obtains them >>>>>>>> via kernel cpumasks and the memory within the vmcore. >>>>>>>> >>>>>>>> With this understanding, I did some testing. Perhaps the >>>>>>>> most telling test was that I >>>>>>>> changed the number of cpu PT_NOTEs emitted in the >>>>>>>> crash_prepare_elf64_headers() to just 1, >>>>>>>> hot plugged some cpus, then also took a few offline >>>>>>>> sparsely via chcpu, then generated a >>>>>>>> vmcore. The crash utility had no problem loading the >>>>>>>> vmcore, it reported the proper number >>>>>>>> of cpus and the number offline (despite only one cpu >>>>>>>> PT_NOTE), and changing to a different >>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid. >>>>>>>> >>>>>>>> My take away is that crash utility does not rely upon >>>>>>>> ELF cpu PT_NOTEs, it obtains the >>>>>>>> cpu information directly from kernel data structures. >>>>>>>> Perhaps at one time crash relied >>>>>>>> upon the ELF information, but no more. (Perhaps there >>>>>>>> are other crash dump analyzers >>>>>>>> that might rely on the ELF info?) >>>>>>>> >>>>>>>> So, all this to say that I see no need to change >>>>>>>> crash_prepare_elf64_headers(). There >>>>>>>> is no compelling reason to move away from >>>>>>>> for_each_present_cpu(), or modify the list for >>>>>>>> online/offline. >>>>>>>> >>>>>>>> Which then leaves the topic of the cpuhp state on which >>>>>>>> to register. Perhaps reverting >>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right >>>>>>>> answer. There does not appear to >>>>>>>> be a compelling need to accurately track whether the cpu >>>>>>>> went online/offline for the >>>>>>>> purposes of creating the elfcorehdr, as ultimately the >>>>>>>> crash utility pulls that from >>>>>>>> kernel data structures, not the elfcorehdr. >>>>>>>> >>>>>>>> I think this is what Sourabh has known and has been >>>>>>>> advocating for an optimization >>>>>>>> path that allows not regenerating the elfcorehdr on cpu >>>>>>>> changes (because all the percpu >>>>>>>> structs are all laid out). I do think it best to leave >>>>>>>> that as an arch choice. >>>>>>> >>>>>>> Since things are clear on how the PT_NOTES are consumed in >>>>>>> kdump kernel [fs/proc/vmcore.c], >>>>>>> makedumpfile, and crash tool I need your opinion on this: >>>>>>> >>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events? >>>>>>> If yes, can you please list the elfcorehdr components that >>>>>>> changes due to CPU hotplug. >>>>>> Due to the use of for_each_present_cpu(), it is possible for the >>>>>> number of cpu PT_NOTEs >>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus >>>>>> does not impact the >>>>>> number of cpu PT_NOTEs (as the cpus are still present). >>>>>> >>>>>>> >>>>>>> From what I understood, crash notes are prepared for >>>>>>> possible CPUs as system boots and >>>>>>> could be used to create a PT_NOTE section for each possible >>>>>>> CPU while generating the elfcorehdr >>>>>>> during the kdump kernel load. >>>>>>> >>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every >>>>>>> possible CPU there is no need to >>>>>>> regenerate it for CPU hotplug events. Or do we? 
>>>>>> >>>>>> For onlining/offlining of cpus, there is no need to regenerate >>>>>> the elfcorehdr. However, >>>>>> for actual hot un/plug of cpus, the answer is yes due to >>>>>> for_each_present_cpu(). The >>>>>> caveat here of course is that if crash utility is the only >>>>>> coredump analyzer of concern, >>>>>> then it doesn't care about these cpu PT_NOTEs and there would be >>>>>> no need to re-generate them. >>>>>> >>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming >>>>>> into mainstream, impacts >>>>>> any of this. >>>>>> >>>>>> Perhaps the one item that might help here is to distinguish >>>>>> between actual hot un/plug of >>>>>> cpus, versus onlining/offlining. At the moment, I can not >>>>>> distinguish between a hot plug >>>>>> event and an online event (and unplug/offline). If those were >>>>>> distinguishable, then we >>>>>> could only regenerate on un/plug events. >>>>>> >>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice? >>>>> >>>>> Yes, because once elfcorehdr is built with possible CPUs we don't >>>>> have to worry about >>>>> hot[un]plug case. >>>>> >>>>> Here is my view on how things should be handled if a core-dump >>>>> analyzer is dependent on >>>>> elfcorehdr PT_NOTEs to find online/offline CPUs. >>>>> >>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash >>>>> notes (kernel has >>>>> one crash note per CPU for every possible CPU). Though the crash >>>>> notes are allocated >>>>> during the boot time they are populated when the system is on the >>>>> crash path. >>>>> >>>>> This is how crash notes are populated on PowerPC and I am expecting >>>>> it would be something >>>>> similar on other architectures too. >>>>> >>>>> The crashing CPU sends IPI to every other online CPU with a callback >>>>> function that updates the >>>>> crash notes of that specific CPU. Once the IPI completes the >>>>> crashing CPU updates its own crash >>>>> note and proceeds further. >>>>> >>>>> The crash notes of CPUs remain uninitialized if the CPUs were >>>>> offline or hot unplugged at the time >>>>> system crash. The core-dump analyzer should be able to identify >>>>> [un]/initialized crash notes >>>>> and display the information accordingly. >>>>> >>>>> Thoughts? >>>>> >>>>> - Sourabh >>>> >>>> In general, I agree with your points. You've presented a strong >>>> case to >>>> go with for_each_possible_cpu() in crash_prepare_elf64_headers() and >>>> those crash notes would always be present, and we can ignore >>>> changes to >>>> cpus wrt/ elfcorehdr updates. >>>> >>>> But what do we do about kexec_load() syscall? The way the userspace >>>> utility works is it determines cpus by: >>>> nr_cpus = sysconf(_SC_NPROCESSORS_CONF); >>>> which is not the equivalent of possible_cpus. So the complete list of >>>> cpu PT_NOTEs is not generated up front. We would need a solution for >>>> that? >>> Hello Eric, >>> >>> The sysconf document says _SC_NPROCESSORS_CONF is processors >>> configured, >>> isn't that equivalent to possible CPUs? >>> >>> What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? IIUC, on >>> powerPC >>> it is possible CPUs. >> > Baoquan, > >> From sysconf man page, with my understanding, _SC_NPROCESSORS_CONF is >> returning the possible cpus, while _SC_NPROCESSORS_ONLN returns present >> cpus. If these are true, we can use them. 
> > Thomas Gleixner has pointed out that: > > glibc tries to evaluate that in the following order: > 1) /sys/devices/system/cpu/cpu* > That's present CPUs not possible CPUs > 2) /proc/stat > That's online CPUs > 3) sched_getaffinity() > That's online CPUs at best. In the worst case it's an affinity mask > which is set on a process group > > meaning that _SC_NPROCESSORS_CONF is not equivalent to > possible_cpus(). Furthermore, the /sys/system/devices/cpus/cpuXX > entries are not available for not-present-but-possible cpus; thus > userspace kexec utility can not write out the elfcorehdr with all > possible cpus listed. > >> >> But I am wondering why the existing present cpu way is going to be >> discarded. Sorry, I tried to go through this thread, it's too long, can >> anyone summarize the reason with shorter and clear sentences. Sorry >> again for that. > Hello Eric, > By utilizing for_each_possible_cpu() in crash_prepare_elf64_headers(), > in the case of the kexec_file_load(), this change would simplify some > issues Sourabh has encountered for PPC support. Things are fine even with for_each_present_cpu on PPC. It is just that I want to avoid the regeneration of elfcorehdr for every CPU change by packing possible CPUs at once. Thanks, Sourabh Jain
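To make the sysconf distinction concrete, here is a small illustrative userspace check (not part of the series): on architectures where glibc derives _SC_NPROCESSORS_CONF from /sys/devices/system/cpu/cpu*, it tracks present CPUs, while the "possible" file reports the full possible range.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char possible[64] = "";
	FILE *f = fopen("/sys/devices/system/cpu/possible", "r");

	if (f) {
		if (!fgets(possible, sizeof(possible), f))
			possible[0] = '\0';
		fclose(f);
	}

	/* e.g. possible: "0-127" while _SC_NPROCESSORS_CONF reports 8 */
	printf("possible        : %s", possible);
	printf("NPROCESSORS_CONF: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
	printf("NPROCESSORS_ONLN: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
	return 0;
}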
On 03/01/23 at 09:48am, Eric DeVolder wrote: ...... > From b56aa428b07d970f26e3c3704d54ce8805f05ddc Mon Sep 17 00:00:00 2001 > From: Eric DeVolder <eric.devolder@oracle.com> > Date: Tue, 28 Feb 2023 14:20:04 -0500 > Subject: [PATCH v19 3/7] crash: change crash_prepare_elf64_headers() to > for_each_possible_cpu() > > The function crash_prepare_elf64_headers() generates the elfcorehdr > which describes the cpus and memory in the system for the crash kernel. > In particular, it writes out ELF PT_NOTEs for memory regions and the > processors in the system. > > With respect to the cpus, the current implementation utilizes > for_each_present_cpu() which means that as cpus are added and removed, > the elfcorehdr must again be updated to reflect the new set of cpus. > > The reasoning behind the change to use for_each_possible_cpu(), is: > > - At kernel boot time, all percpu crash_notes are allocated for all > possible cpus; that is, crash_notes are not allocated dynamically > when cpus are plugged/unplugged. Thus the crash_notes for each > possible cpu are always available. > > - The crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu. > Changing to for_each_possible_cpu() is valid as the crash_notes > pointed to by each cpu PT_NOTE are present and always valid. > > Furthermore, examining a common crash processing path of: > > kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer > elfcorehdr /proc/vmcore vmcore > > reveals how the ELF cpu PT_NOTEs are utilized: > > - Upon panic, each cpu is sent an IPI and shuts itself down, recording > its state in its crash_notes. When all cpus are shutdown, the > crash kernel is launched with a pointer to the elfcorehdr. > > - The crash kernel via linux/fs/proc/vmcore.c does not examine or > use the contents of the PT_NOTEs, it exposes them via /proc/vmcore. > > - The makedumpfile utility uses /proc/vmcore and reads the cpu > PT_NOTEs to craft a nr_cpus variable, which is reported in a > header but otherwise generally unused. Makedumpfile creates the > vmcore. > > - The 'crash' dump analyzer does not appear to reference the cpu > PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask > symbols and directly examines those structure contents from vmcore > memory. From that information it is able to determine which cpus > are present and online, and locate the corresponding crash_notes. > Said differently, it appears to me that 'crash' analyzer does not > rely on the ELF PT_NOTEs for cpus; rather it obtains the information > directly via kernel symbols and the memory within the vmcore. > > (There maybe other vmcore generating and analysis tools that do use > these PT_NOTEs, but 'makedumpfile' and 'crash' seem to me to be the > most common solution.) > > This change results in the benefit of having all cpus described in > the elfcorehdr, and therefore reducing the need to re-generate the > elfcorehdr on cpu changes, at the small expense of an additional > 56 bytes per PT_NOTE for not-present-but-possible cpus. > > On systems where kexec_file_load() syscall is utilized, all the above > is valid. On systems where kexec_load() syscall is utilized, there > may be the need for the elfcorehdr to be regenerated once. The reason > being that some archs only populate the 'present' cpus in the > /sys/devices/system/cpus entries, which the userspace 'kexec' utility > uses to generate the userspace-supplied elfcorehdr. 
In this situation, > one memory or cpu change will rewrite the elfcorehdr via the > crash_prepare_elf64_headers() function and now all possible cpus will > be described, just as with kexec_file_load() syscall. So, with for_each_possible_cpu(), we don't need to respond to cpu hotplug event, right? If so, it does bring benefit. While kexec_load won't benefit from that. So far, it looks not bad. > > Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com> > Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> > --- > kernel/crash_core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index dba4b75f7541..537b199a8774 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -365,7 +365,7 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, > ehdr->e_phentsize = sizeof(Elf64_Phdr); > > /* Prepare one phdr of type PT_NOTE for each present CPU */ > - for_each_present_cpu(cpu) { > + for_each_possible_cpu(cpu) { > phdr->p_type = PT_NOTE; > notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu)); > phdr->p_offset = phdr->p_paddr = notes_addr; > -- > 2.31.1 >
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 5545de4597d0..d985d334fae4 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
 	/* Prepare one phdr of type PT_NOTE for each present CPU */
 	for_each_present_cpu(cpu) {
+#ifdef CONFIG_CRASH_HOTPLUG
+		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
+			/* Skip the soon-to-be offlined cpu */
+			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
+			    (cpu == image->offlinecpu))
+				continue;
+		}
+#endif
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
@@ -769,6 +777,14 @@ static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
 	/* Differentiate between normal load and hotplug update */
 	image->hp_action = hp_action;
+	/*
+	 * Record which CPU is being unplugged/offlined, so that it
+	 * is explicitly excluded in crash_prepare_elf64_headers().
+	 */
+	image->offlinecpu =
+		(hp_action == KEXEC_CRASH_HP_REMOVE_CPU) ?
+			cpu : KEXEC_CRASH_HP_INVALID_CPU;
+
 	/* Now invoke arch-specific update handler */
 	arch_crash_handle_hotplug_event(image);
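For readers following the diff above, the fields and constants it relies on would live alongside struct kimage. A rough sketch of the supporting declarations (the names hp_action, offlinecpu and KEXEC_CRASH_HP_* are taken from the patches in this thread; the values and exact placement here are illustrative, not the final series):

/* hotplug actions passed to the crash hotplug handler (illustrative values) */
#define KEXEC_CRASH_HP_ADD_CPU		1
#define KEXEC_CRASH_HP_REMOVE_CPU	2
#define KEXEC_CRASH_HP_INVALID_CPU	(-1U)

struct kimage {
	/* ...existing members... */
#ifdef CONFIG_CRASH_HOTPLUG
	unsigned int hp_action;		/* hotplug event being handled */
	unsigned int offlinecpu;	/* cpu being removed, else INVALID_CPU */
#endif
};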