Message ID | 20230104023738.1258925-1-sdonthineni@nvidia.com |
---|---|
State | New |
Headers |
From: Shanker Donthineni <sdonthineni@nvidia.com>
To: Catalin Marinas <catalin.marinas@arm.com>, Marc Zyngier <maz@kernel.org>, Will Deacon <will@kernel.org>, James Morse <james.morse@arm.com>
Cc: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2] arm64: gic: increase the number of IRQ descriptors
Date: Tue, 3 Jan 2023 20:37:38 -0600
Message-ID: <20230104023738.1258925-1-sdonthineni@nvidia.com>
X-Mailing-List: linux-kernel@vger.kernel.org |
Series | [v2] arm64: gic: increase the number of IRQ descriptors |
Commit Message
Shanker Donthineni
Jan. 4, 2023, 2:37 a.m. UTC
The default value of NR_IRQS is not sufficient to support GICv4.1
features and ~56K LPIs. This parameter is too small for certain server
platforms that have many I/O devices and are capable of direct
injection of vSGIs and vLPIs.

Currently, a maximum of 64 + 8192 (IRQ_BITMAP_BITS) IRQ descriptors
is allowed. With kvm-arm.vgic_v4_enable=1, vCPU creation fails after
the count reaches ~400.

This patch increases NR_IRQS to 2^19 (524288) to cover ~56K LPIs and
262144 vSGIs (16K vPEs x 16).
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
Changes since v1:
- Rebased onto v6.2-rc1 and edited the commit text
arch/arm64/include/asm/irq.h | 4 ++++
1 file changed, 4 insertions(+)
Comments
On Wed, 04 Jan 2023 02:37:38 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>
> The default value of NR_IRQS is not sufficient to support GICv4.1
> features and ~56K LPIs. [...]
>
> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index fac08e18bcd5..3fffc0b8b704 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -4,6 +4,10 @@
>
>  #ifndef __ASSEMBLER__
>
> +#if defined(CONFIG_ARM_GIC_V3_ITS)
> +#define NR_IRQS (1 << 19)
> +#endif
> +
>  #include <asm-generic/irq.h>
>
>  struct pt_regs;

Sorry, but I don't think this is an acceptable change. This is a large
overhead that affects *everyone*, and that will eventually be too small
anyway with larger systems and larger interrupt spaces.

A better way to address this would be to move to a more dynamic
allocation: convert the irqdesc rb-tree into an xarray, get rid of the
bitmaps (the allocation bitmap and the resend one), and track
everything in the xarray. This would scale, avoid allocations, and
benefit all architectures.

Thanks,

M.
Hi Marc,

On 1/4/23 03:14, Marc Zyngier wrote:
> [...]
> A better way to address this would be to move to a more dynamic
> allocation, converting the irqdesc rb-tree into an xarray, getting rid
> of the bitmaps (the allocation bitmap and the resend one), and track
> everything in the xarray.

The actual memory allocation for IRQ descriptors is still dynamic on
ARM64. This change increases static memory for the variable
'allocated_irqs' by 64KB, which does not feel like a noticeable
overhead.

If 64KB is too high, can we change NR_IRQS to 65536?

#ifdef CONFIG_SPARSE_IRQ
# define IRQ_BITMAP_BITS	(NR_IRQS + 8196)
#else
# define IRQ_BITMAP_BITS	NR_IRQS
#endif

static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);

For ARM64, CONFIG_SPARSE_IRQ is set to y.
On Wed, 04 Jan 2023 13:47:03 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> [...]
> The actual memory allocation for IRQ descriptors is still dynamic for ARM64.
> This change increases static memory for variable 'allocated_irqs' by 64KB,
> feel not a noticeable overhead.

64kB for each bitmap, so that's already 128kB (you missed the
irqs_resend bitmap). And that's for a number of IRQs that is still way
below what the GIC architecture supports today.

The architecture supports 32bit INTIDs, and that's 1GB worth of
bitmaps, only for the physical side. Add the virtual stuff for which
we create host-side descriptors, and we can go way beyond that.

So what happens next, once you exceed the arbitrary limit that only
satisfies your own use case? We will bump it up again, and again,
bloating the kernel with useless static data that *nobody* needs.
Specially not the VMs that you plan to run.

So I'm putting my foot down right now, and saying that it needs to be
fixed once and for all. The current scheme was OK for small interrupt
spaces, but it isn't fit for purpose anymore, certainly not with
things like the GICv4 architecture.

I'm happy to help with it, but I'm certainly not willing to accept any
sort of new compile-time limit.

Thanks,

M.
On 1/5/23 04:59, Marc Zyngier wrote:
> [...]
> So I'm putting my foot down right now, and saying that it needs to be
> fixed once and for all. The current scheme was OK for small interrupt
> spaces, but it isn't fit for purpose anymore, certainly not with
> things like the GICv4 architecture.
>
> I'm happy to help with it, but I'm certainly not willing to accept any
> sort of new compile-time limit.

Thanks for helping with a scalable solution instead of static
allocation. Please include me whenever patches are posted to LKML. I'm
happy to verify on NVIDIA server platforms and provide test feedback.
On Thu, 05 Jan 2023 14:47:44 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> [...]
> Thanks for helping with a scalable solution instead of static
> allocation. Please include me whenever patches posted to LKML. I'm
> happy to verify on NVIDIA server platforms and provide test
> feedback.

I offered to help you. I didn't offer to do the work for you! ;-)

Thanks,

M.
On 1/9/23 10:41, Marc Zyngier wrote: > External email: Use caution opening links or attachments > > > On Thu, 05 Jan 2023 14:47:44 +0000, > Shanker Donthineni <sdonthineni@nvidia.com> wrote: >> >> >> >> On 1/5/23 04:59, Marc Zyngier wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On Wed, 04 Jan 2023 13:47:03 +0000, >>> Shanker Donthineni <sdonthineni@nvidia.com> wrote: >>>> >>>> Hi Marc, >>>> >>>> On 1/4/23 03:14, Marc Zyngier wrote: >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>> On Wed, 04 Jan 2023 02:37:38 +0000, >>>>> Shanker Donthineni <sdonthineni@nvidia.com> wrote: >>>>>> >>>>>> The default value of NR_IRQS is not sufficient to support GICv4.1 >>>>>> features and ~56K LPIs. This parameter would be too small for certain >>>>>> server platforms where it has many IO devices and is capable of >>>>>> direct injection of vSGI and vLPI features. >>>>>> >>>>>> Currently, maximum of 64 + 8192 (IRQ_BITMAP_BITS) IRQ descriptors >>>>>> are allowed. The vCPU creation fails after reaching count ~400 with >>>>>> kvm-arm.vgic_v4_enable=1. >>>>>> >>>>>> This patch increases NR_IRQS to 1^19 to cover 56K LPIs and 262144 >>>>>> vSGIs (16K vPEs x 16). >>>>>> >>>>>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> >>>>>> --- >>>>>> Changes since v1: >>>>>> -create from v6.2-rc1 and edit commit text >>>>>> >>>>>> arch/arm64/include/asm/irq.h | 4 ++++ >>>>>> 1 file changed, 4 insertions(+) >>>>>> >>>>>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h >>>>>> index fac08e18bcd5..3fffc0b8b704 100644 >>>>>> --- a/arch/arm64/include/asm/irq.h >>>>>> +++ b/arch/arm64/include/asm/irq.h >>>>>> @@ -4,6 +4,10 @@ >>>>>> >>>>>> #ifndef __ASSEMBLER__ >>>>>> >>>>>> +#if defined(CONFIG_ARM_GIC_V3_ITS) >>>>>> +#define NR_IRQS (1 << 19) >>>>>> +#endif >>>>>> + >>>>>> #include <asm-generic/irq.h> >>>>>> >>>>>> struct pt_regs; >>>>> >>>>> Sorry, but I don't think this is an acceptable change. 
This is a large >>>>> overhead that affects *everyone*, and that will eventually be too >>>>> small anyway with larger systems and larger interrupt spaces. >>>>> >>>>> A better way to address this would be to move to a more dynamic >>>>> allocation, converting the irqdesc rb-tree into an xarray, getting rid >>>>> of the bitmaps (the allocation bitmap and the resend one), and track >>>>> everything in the xarray. >>>> >>>> The actual memory allocation for IRQ descriptors is still dynamic for ARM64. >>>> This change increases static memory for variable 'allocated_irqs' by 64KB, >>>> feel not a noticeable overhead. >>> >>> 64kB for each bitmap, so that's already 128kB (you missed the >>> irqs_resend bitmap). And that's for a number of IRQs that is still way >>> below what the GIC architecture supports today. >>> >>> The architecture supports 32bit INTIDs, and that's 1GB worth of >>> bitmaps, only for the physical side. Add the virtual stuff for which >>> we create host-side descriptors, and we can go way beyond that. >>> >>> So what happens next, once you exceed the arbitrary limit that only >>> satisfies your own use case? We will bump it up again, and again, >>> bloating the kernel with useless static data that *nobody* needs. >>> Specially not the VMs that you plan to run. >>> >>> So I'm putting my foot down right now, and saying that it needs to be >>> fixed once and for all. The current scheme was OK for small interrupt >>> spaces, but it isn't fit for purpose anymore, certainly not with >>> things like the GICv4 architecture. >>> >>> I'm happy to help with it, but I'm certainly not willing to accept any >>> sort of new compile-time limit. >> >> Thanks for helping with a scalable solution instead of static >> allocation. Please include me whenever patches posted to LKML. I'm >> happy to verify on NVIDIA server platforms and provide test >> feedback. >> > > I offered to help you. I didn't offer to do the work for you! ;-) > I've looked at the IDR/IDA API. 
There is no suitable function for allocating contiguous IDs to replace the bitmap API.

__irq_alloc_descs():

	mutex_lock(&sparse_irq_lock);

	start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
					   from, cnt, 0);
	ret = -EEXIST;

Is there any existing API that I can use for allocating contiguous IDs?
On Mon, 09 Jan 2023 17:13:25 +0000, Shanker Donthineni <sdonthineni@nvidia.com> wrote: > > >>> I'm happy to help with it, but I'm certainly not willing to accept any > >>> sort of new compile-time limit. > >> > >> Thanks for helping with a scalable solution instead of static > >> allocation. Please include me whenever patches posted to LKML. I'm > >> happy to verify on NVIDIA server platforms and provide test > >> feedback. > >> > > > > I offered to help you. I didn't offer to do the work for you! ;-) > > > > I've looked at the IDR/IDA API. There is no suitable function for > allocating contiguous IDs to replace bitmap API. > > __irq_alloc_descs(): > > mutex_lock(&sparse_irq_lock); > > start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, > from, cnt, 0); > ret = -EEXIST; > > Is there any existing API that I can use for allocating contiguous IDs? I think you should address the problem the other way around, as there are lower hanging fruits: - turn the irq_desc_tree radix tree into a XArray - use the XArray mark feature to reimplement the irqs_resend bitmap Once you have done that, you have already halved the memory usage. To implement the allocated_irqs bitmap functionality, you have a bunch of options: - make the XArray an allocating XArray, and iterate over XA_FREE_MARK to find the free range (see how the infiniband subsystem is doing exactly that) - use another Xarray mark to annotate the allocated IRQs, find the distance between two allocations, and use this range if the request fits (a poor man's variation of the above) - use a sideband data structure such as the GICv3 LPI allocator, which is already dealing with range allocation (I'd rather avoid that) - something else? It should be fairly straightforward to perform the conversion in place. Thanks, M.
Hi Marc, On 1/10/23 02:20, Marc Zyngier wrote: > External email: Use caution opening links or attachments > > > On Mon, 09 Jan 2023 17:13:25 +0000, > Shanker Donthineni <sdonthineni@nvidia.com> wrote: >> >>>>> I'm happy to help with it, but I'm certainly not willing to accept any >>>>> sort of new compile-time limit. >>>> >>>> Thanks for helping with a scalable solution instead of static >>>> allocation. Please include me whenever patches posted to LKML. I'm >>>> happy to verify on NVIDIA server platforms and provide test >>>> feedback. >>>> >>> >>> I offered to help you. I didn't offer to do the work for you! ;-) >>> >> >> I've looked at the IDR/IDA API. There is no suitable function for >> allocating contiguous IDs to replace bitmap API. >> >> __irq_alloc_descs(): >> >> mutex_lock(&sparse_irq_lock); >> >> start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, >> from, cnt, 0); >> ret = -EEXIST; >> >> Is there any existing API that I can use for allocating contiguous IDs? > > I think you should address the problem the other way around, as there > are lower hanging fruits: > > - turn the irq_desc_tree radix tree into a XArray > > - use the XArray mark feature to reimplement the irqs_resend bitmap > > Once you have done that, you have already halved the memory usage. > To implement the allocated_irqs bitmap functionality, you have a > bunch of options: > > - make the XArray an allocating XArray, and iterate over XA_FREE_MARK > to find the free range (see how the infiniband subsystem is doing > exactly that) > > - use another Xarray mark to annotate the allocated IRQs, find the > distance between two allocations, and use this range if the request > fits (a poor man's variation of the above) > > - use a sideband data structure such as the GICv3 LPI allocator, which > is already dealing with range allocation (I'd rather avoid that) > > - something else? > Thanks for providing the guidance. The irq_resend change will be simple, IDR will fit perfectly. 
Could you comment on the two patches below, which use the IDR API? One IDR variable is used for both the IRQ ID allocation & descriptors. I'll test and post patches for comments if you're okay with the approach. Patch 1/2: genirq: Prepare code for IDR based allocation Introduce helper functions for managing Linux IRQ IDs, defined separately for SPARSE_IRQ and non-SPARSE_IRQ. There is no change in functional behavior. Changes: -Helper function irq_alloc_descs_ids() for allocating IRQ IDs -Helper function irq_free_descs_ids() to free IRQ IDs -Helper function irq_get_next_id() to get next IRQ ID diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index fd0996274401..a40ac0c58550 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -131,7 +131,6 @@ int nr_irqs = NR_IRQS; EXPORT_SYMBOL_GPL(nr_irqs); static DEFINE_MUTEX(sparse_irq_lock); -static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS); #ifdef CONFIG_SPARSE_IRQ @@ -344,6 +343,7 @@ static void irq_sysfs_del(struct irq_desc *desc) {} #endif /* CONFIG_SYSFS */ +static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS); static RADIX_TREE(irq_desc_tree, GFP_KERNEL); static void irq_insert_desc(unsigned int irq, struct irq_desc *desc) @@ -469,6 +469,22 @@ static void free_desc(unsigned int irq) call_rcu(&desc->rcu, delayed_free_desc); } +static void irq_free_descs_ids(unsigned int from, unsigned int cnt) +{ + bitmap_clear(allocated_irqs, from, cnt); +} + +static int irq_alloc_descs_ids(unsigned int from, unsigned int cnt) +{ + return bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, + from, cnt, 0); +} + +static unsigned int irq_get_next_id(unsigned int offset) +{ + return find_next_bit(allocated_irqs, nr_irqs, offset); +} + static int alloc_descs(unsigned int start, unsigned int cnt, int node, const struct irq_affinity_desc *affinity, struct module *owner) @@ -553,6 +569,8 @@ int __init early_irq_init(void) #else /* !CONFIG_SPARSE_IRQ */ +static DECLARE_BITMAP(allocated_irqs,
IRQ_BITMAP_BITS); + struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = { [0 ... NR_IRQS-1] = { .handle_irq = handle_bad_irq, @@ -591,6 +609,22 @@ struct irq_desc *irq_to_desc(unsigned int irq) } EXPORT_SYMBOL(irq_to_desc); +static void irq_free_descs_ids(unsigned int from, unsigned int cnt) +{ + bitmap_clear(allocated_irqs, from, cnt); +} + +static int irq_alloc_descs_ids(unsigned int from, unsigned int cnt) +{ + return bitmap_find_next_zero_area(allocated_irqs, NR_IRQS, + from, cnt, 0); +} + +static unsigned int irq_get_next_id(unsigned int offset) +{ + return find_next_bit(allocated_irqs, nr_irqs, offset); +} + static void free_desc(unsigned int irq) { struct irq_desc *desc = irq_to_desc(irq); @@ -768,7 +802,7 @@ void irq_free_descs(unsigned int from, unsigned int cnt) for (i = 0; i < cnt; i++) free_desc(from + i); - bitmap_clear(allocated_irqs, from, cnt); + irq_free_descs_ids(from, cnt); mutex_unlock(&sparse_irq_lock); } EXPORT_SYMBOL_GPL(irq_free_descs); @@ -810,8 +844,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node, mutex_lock(&sparse_irq_lock); - start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, - from, cnt, 0); + start = irq_alloc_descs_ids(from, cnt); ret = -EEXIST; if (irq >=0 && start != irq) goto unlock; @@ -836,7 +869,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs); */ unsigned int irq_get_next_irq(unsigned int offset) { - return find_next_bit(allocated_irqs, nr_irqs, offset); + return irq_get_next_id(offset); } PATCH 2/2: genirq: Use IDR API for Linux-IRQ IDs allocation The build time config paramter IRQ_BITMAP_BITS (NR_IRQS+8196) may not be sufficient for some architectures. The interrupt ID sparse is huge for ARM-GIC architecture ~32 bits. Static bitmap memory for managing IDs is not optimal when NR_IRQS is set to a high value. It uses the IDR API for the IRQ ID allocation/deallocation and its descriptors management insertion/deletion/search. 
No other references to macro IRQ_BITMAP_BITS hence remove it. And also covert static allocation of the 'irqs_resend' bitmap to dynamic allocation using IDR. diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index 5fdc0b557579..501f90962644 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -11,12 +11,6 @@ #include <linux/pm_runtime.h> #include <linux/sched/clock.h> -#ifdef CONFIG_SPARSE_IRQ -# define IRQ_BITMAP_BITS (NR_IRQS + 8196) -#else -# define IRQ_BITMAP_BITS NR_IRQS -#endif - #define istate core_internal_state__do_not_mess_with_it extern bool noirqdebug; diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index a40ac0c58550..bb1febd3a420 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -343,25 +343,25 @@ static void irq_sysfs_del(struct irq_desc *desc) {} #endif /* CONFIG_SYSFS */ -static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS); -static RADIX_TREE(irq_desc_tree, GFP_KERNEL); +static DEFINE_IDR(idr_irq_descs); static void irq_insert_desc(unsigned int irq, struct irq_desc *desc) { - radix_tree_insert(&irq_desc_tree, irq, desc); + idr_replace(&idr_irq_descs, desc, irq); } struct irq_desc *irq_to_desc(unsigned int irq) { - return radix_tree_lookup(&irq_desc_tree, irq); + return idr_find(&idr_irq_descs, irq); } + #ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE EXPORT_SYMBOL_GPL(irq_to_desc); #endif static void delete_irq_desc(unsigned int irq) { - radix_tree_delete(&irq_desc_tree, irq); + idr_replace(&idr_irq_descs, NULL, irq); } #ifdef CONFIG_SMP @@ -471,18 +471,48 @@ static void free_desc(unsigned int irq) static void irq_free_descs_ids(unsigned int from, unsigned int cnt) { - bitmap_clear(allocated_irqs, from, cnt); + int i; + + for (i = 0; i < cnt; i++) + idr_remove(&idr_irq_descs, from + i); } static int irq_alloc_descs_ids(unsigned int from, unsigned int cnt) { - return bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, - from, cnt, 0); + int start, id, i; + + do { + /* Allocate starting ID */ + start = 
idr_alloc(&idr_irq_descs, NULL, from, 0, GFP_ATOMIC); + if (start < 0) + return start; + idr_set_cursor(&idr_irq_descs, start); + + /* Allocate contiguous IDs */ + for (i = 1; i < cnt; i++) { + id = idr_alloc_cyclic(&idr_irq_descs, NULL, start + i, + start + i, GFP_ATOMIC); + if (id < 0) { + irq_free_descs_ids(from, i); + break; + } + } + + /* Allocated 'cnt' IDs */ + if (i == cnt) + return start; + from = idr_get_cursor(&idr_irq_descs); + } while (from < INT_MAX); + + irq_free_descs_ids(start, i); + return -ENOSPC; } static unsigned int irq_get_next_id(unsigned int offset) { - return find_next_bit(allocated_irqs, nr_irqs, offset); + int id; + + return idr_get_next(&idr_irqs, &id) ? id : -EINVAL; } static int alloc_descs(unsigned int start, unsigned int cnt, int node, @@ -521,7 +551,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node, irq_sysfs_add(start + i, desc); irq_add_debugfs_entry(start + i, desc); } - bitmap_set(allocated_irqs, start, cnt); return start; err: @@ -532,8 +561,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node, static int irq_expand_nr_irqs(unsigned int nr) { - if (nr > IRQ_BITMAP_BITS) - return -ENOMEM; nr_irqs = nr; return 0; } @@ -542,6 +569,7 @@ int __init early_irq_init(void) { int i, initcnt, node = first_online_node; struct irq_desc *desc; + int irq; init_irq_default_affinity(); @@ -550,19 +578,10 @@ int __init early_irq_init(void) printk(KERN_INFO "NR_IRQS: %d, nr_irqs: %d, preallocated irqs: %d\n", NR_IRQS, nr_irqs, initcnt); - if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS)) - nr_irqs = IRQ_BITMAP_BITS; - - if (WARN_ON(initcnt > IRQ_BITMAP_BITS)) - initcnt = IRQ_BITMAP_BITS; - - if (initcnt > nr_irqs) - nr_irqs = initcnt; - for (i = 0; i < initcnt; i++) { - desc = alloc_desc(i, node, 0, NULL, NULL); - set_bit(i, allocated_irqs); - irq_insert_desc(i, desc); + irq = irq_alloc_descs_ids(0, 1); + desc = alloc_desc(irq, node, 0, NULL, NULL); + irq_insert_desc(irq, desc); } return 
arch_early_irq_init(); } @@ -855,6 +874,8 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node, goto unlock; } ret = alloc_descs(start, cnt, node, affinity, owner); + if (ret != start) + irq_free_descs_ids(start, cnt); unlock: mutex_unlock(&sparse_irq_lock); return ret; diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c index 0c46e9fe3a89..1c9db8e03fba 100644 --- a/kernel/irq/resend.c +++ b/kernel/irq/resend.c @@ -21,8 +21,8 @@ #ifdef CONFIG_HARDIRQS_SW_RESEND -/* Bitmap to handle software resend of interrupts: */ -static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS); +/* IDR map to handle software resend of interrupts: */ +static DEFINE_IDR(irqs_resend); /* * Run software resends of IRQ's @@ -30,14 +30,11 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS); static void resend_irqs(struct tasklet_struct *unused) { struct irq_desc *desc; - int irq; - - while (!bitmap_empty(irqs_resend, nr_irqs)) { - irq = find_first_bit(irqs_resend, nr_irqs); - clear_bit(irq, irqs_resend); - desc = irq_to_desc(irq); - if (!desc) - continue; + int id; + + idr_for_each_entry(&irqs_resend, desc, id) { + idr_replace(&irqs_resend, NULL, id); + idr_remove(&irqs_resend, id); local_irq_disable(); desc->handle_irq(desc); local_irq_enable(); @@ -49,7 +46,7 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs); static int irq_sw_resend(struct irq_desc *desc) { - unsigned int irq = irq_desc_get_irq(desc); + int id; /* * Validate whether this interrupt can be safely injected from @@ -70,11 +67,13 @@ static int irq_sw_resend(struct irq_desc *desc) */ if (!desc->parent_irq) return -EINVAL; - irq = desc->parent_irq; } /* Set it pending and activate the softirq: */ - set_bit(irq, irqs_resend); + id = idr_alloc(&irqs_resend, desc, 0, 0, GFP_ATOMIC); + if (id < 0) + return id; + tasklet_schedule(&resend_tasklet); return 0; }
On 1/10/23 02:20, Marc Zyngier wrote: > External email: Use caution opening links or attachments > > > On Mon, 09 Jan 2023 17:13:25 +0000, > Shanker Donthineni <sdonthineni@nvidia.com> wrote: >> >>>>> I'm happy to help with it, but I'm certainly not willing to accept any >>>>> sort of new compile-time limit. >>>> >>>> Thanks for helping with a scalable solution instead of static >>>> allocation. Please include me whenever patches posted to LKML. I'm >>>> happy to verify on NVIDIA server platforms and provide test >>>> feedback. >>>> >>> >>> I offered to help you. I didn't offer to do the work for you! ;-) >>> >> >> I've looked at the IDR/IDA API. There is no suitable function for >> allocating contiguous IDs to replace bitmap API. >> >> __irq_alloc_descs(): >> >> mutex_lock(&sparse_irq_lock); >> >> start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, >> from, cnt, 0); >> ret = -EEXIST; >> >> Is there any existing API that I can use for allocating contiguous IDs? > > I think you should address the problem the other way around, as there > are lower hanging fruits: > > - turn the irq_desc_tree radix tree into a XArray > > - use the XArray mark feature to reimplement the irqs_resend bitmap > > Once you have done that, you have already halved the memory usage. > To implement the allocated_irqs bitmap functionality, you have a > bunch of options: > > - make the XArray an allocating XArray, and iterate over XA_FREE_MARK > to find the free range (see how the infiniband subsystem is doing > exactly that) > > - use another Xarray mark to annotate the allocated IRQs, find the > distance between two allocations, and use this range if the request > fits (a poor man's variation of the above) > > - use a sideband data structure such as the GICv3 LPI allocator, which > is already dealing with range allocation (I'd rather avoid that) > > - something else? > I'll also prototype using XArray data structure instead of IDR based.
Shanker! On Tue, Jan 10 2023 at 08:22, Shanker Donthineni wrote: > On 1/10/23 02:20, Marc Zyngier wrote: >> I think you should address the problem the other way around, as there >> are lower hanging fruits: >> >> - turn the irq_desc_tree radix tree into a XArray >> >> - use the XArray mark feature to reimplement the irqs_resend bitmap and then you go and do: > genirq: Use IDR API for Linux-IRQ IDs allocation But let me look at your preparation patch first: > +static void irq_free_descs_ids(unsigned int from, unsigned int cnt) > +{ > + bitmap_clear(allocated_irqs, from, cnt); > +} > + > +static int irq_alloc_descs_ids(unsigned int from, unsigned int cnt) > +{ > + return bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, > + from, cnt, 0); This is a complete misnomer simply because this does not allocate anything. It finds an appropriate sized empty area. The actual bitmap update happens later which you then remove in the second patch: > - bitmap_set(allocated_irqs, start, cnt); thereby breaking SPARSEIRQ=n configs.... > +} > + > +static unsigned int irq_get_next_id(unsigned int offset) > +{ > + return find_next_bit(allocated_irqs, nr_irqs, offset); > +} That's a misnomer too. This is not about getting an arbitrary next ID starting at @offset. This is about finding the next allocated interrupt number starting at @offset. Naming matters. This code is hard enough to read already. No need for further confusion. > The build time config paramter IRQ_BITMAP_BITS (NR_IRQS+8196) > may not be sufficient for some architectures. The interrupt ID > sparse is huge for ARM-GIC architecture ~32 bits. Static bitmap > memory for managing IDs is not optimal when NR_IRQS is set to > a high value. > > It uses the IDR API for the IRQ ID allocation/deallocation and > its descriptors management insertion/deletion/search. No other > references to macro IRQ_BITMAP_BITS hence remove it. Changelogs should tell the WHY and not the WHAT. 
I can see that it uses IDR from the patch, but there is _ZERO_ justification why IDR is the right choice for this. > static int irq_alloc_descs_ids(unsigned int from, unsigned int cnt) > { > - return bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS, > - from, cnt, 0); > + int start, id, i; > + > + do { > + /* Allocate starting ID */ > + start = idr_alloc(&idr_irq_descs, NULL, from, 0, GFP_ATOMIC); Why does this require GFP_ATOMIC? The allocation is serialized by a mutex and is fully preemptible. Can you find a single GPF_ATOMIC in the irqdesc code? If you had at least read through the changelogs of that file you would have found a series of commits which worked towards making the irqdesc allocation use GFP_KERNEL. But sure, it's way simpler to throw GFP_ATOMIC at the code just because... > + if (start < 0) > + return start; > + idr_set_cursor(&idr_irq_descs, start); > + > + /* Allocate contiguous IDs */ > + for (i = 1; i < cnt; i++) { > + id = idr_alloc_cyclic(&idr_irq_descs, NULL, start + i, > + start + i, GFP_ATOMIC); > + if (id < 0) { > + irq_free_descs_ids(from, i); So if there is not enough room, then you start over. *Shudder* Just assume a halfways dense populated IDR with tons of small holes and then try to allocate 128 MSI vectors. That'll take ages... You can simply use a maple_tree for this. static MTREE_INIT_EXT(sparse_irqs, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN, sparse_irq_lock); And the functions become: static int irq_find_free_area(unsigned int from, unsigned int cnt) { MA_STATE(mas, &sparse_irqs, 0, 0); if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt)) return -ENOSPC; return mas.index; } static unsigned int irq_find_next_irq(unsigned int offset) { MA_STATE(mas, &sparse_irqs, offset, nr_irqs); struct irq_desc *desc = mas_next(&mas, nr_irqs); return desc ? 
irq_desc_get_irq(desc) : nr_irqs; } static int irq_insert_desc(irq, desc) { MA_STATE(mas, @sparse_irqs, irq, irq); return mas_store_gfp(&mas, desc, GFP_KERNEL); } static void irq_remove_desc(irq) { MA_STATE(mas, @sparse_irqs, irq, irq); return mas_erase(&mas); } or something like that. Coming back to SPARSEIRQ=n. I'm more than tempted to take this opportunity to get rid of this distinction. There is no real reason to have the duplicated code. We can simply get rid of the statically allocated irq descriptor arrays and just do the preallocation in early_irq_init(). Now for the pending bits: > ... The irq_resend change will be simple, IDR will fit perfectly. You wish... > /* Set it pending and activate the softirq: */ > - set_bit(irq, irqs_resend); > + id = idr_alloc(&irqs_resend, desc, 0, 0, GFP_ATOMIC); This breaks PREEMPT_RT as this code runs under a raw spinlock with interrupts and preemption disabled and _cannot_ do any allocations. Again, the changelogs of the interrupt code contain enough information to figure these things out. But sure it's simpler to throw some half baken stuff at the kernel and see what sticks... Marc's suggestion to utilize XARRAY and the mark feature would trivialy avoid this because there is no allocation required in that code path. The descriptor already exists in the XARRAY. But that can't work either on PREEMPT_RT because for setting the mark the xarray code needs to acquire xarray::xa_lock which is a regular spinlock, which nest inside of a raw spinlock. So this needs a completely different approach. Let's look at the functionality of the resend code: It's a crutch which tries to handle the inability of (legacy) interrupt chips to reinject interrupts at the hardware level. There is absolutely no reason to care about performance for that, but using IDR (or anything like that) instead of the bitmap is just hillarious. So what else can be done? 
The obvious: static DEFINE_RAW_SPINLOCK(irq_resend_lock); static struct hlist_head irq_resend_list; static int irq_sw_resend(struct irq_desc *desc) { .... raw_spin_lock(&irq_resend_lock); hlist_add_head(&desc->resend_node, &irq_resend_list); raw_spin_lock(&irq_resend_lock); tasklet_schedule(&resend_tasklet); } and the resend side: static void resend_irqs(struct tasklet_struct *unused) { struct irq_desc *desc; int irq; raw_spin_lock_irq(&irq_resend_lock); while (!hlist_empty(&irqs_resend_list)) { desc = hlist_entry(irqs_resend_list.first, ....); hlist_del_init(&desc->resend_node); desc->handle_irq(desc); } raw_spin_unlock_irq(&irq_resend_lock); } Plus the proper mechanics to handle the hlist entry when an interrupt is torn down, which is not rocket science either. Thanks, tglx
Hi Thomas & Marc, On 1/10/23 16:36, Thomas Gleixner wrote: > You can simply use a maple_tree for this. > > static MTREE_INIT_EXT(sparse_irqs, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN, > sparse_irq_lock); > > And the functions become: > > static int irq_find_free_area(unsigned int from, unsigned int cnt) > { > MA_STATE(mas, &sparse_irqs, 0, 0); > > if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt)) > return -ENOSPC; > return mas.index; > } > > static unsigned int irq_find_next_irq(unsigned int offset) > { > MA_STATE(mas, &sparse_irqs, offset, nr_irqs); > struct irq_desc *desc = mas_next(&mas, nr_irqs); > > return desc ? irq_desc_get_irq(desc) : nr_irqs; > } > > static int irq_insert_desc(irq, desc) > { > MA_STATE(mas, @sparse_irqs, irq, irq); > > return mas_store_gfp(&mas, desc, GFP_KERNEL); > } > > static void irq_remove_desc(irq) > { > MA_STATE(mas, @sparse_irqs, irq, irq); > > return mas_erase(&mas); > } Thank you for providing the necessary functions, they have been extremely useful in getting started with implementing patches. However, I have encountered corruption in the maple data structure within mtree_load() when multiple virtual machines are being shut down simultaneously. To address this, I have added the flag MT_FLAGS_USE_RCU to ensure safe concurrent access during reads and writes. Please review patch series https://lore.kernel.org/all/20230130005725.3517597-1-sdonthineni@nvidia.com/ I have applied 6 patches from https://lore.kernel.org/all/20230109205336.3665937-2-surenb@google.com/#r to resolve RCU mode issues. The patches were tested on an ARM64 server and underwent several hours of evaluation with multiple virtual machines, yielding positive results. Thanks, Shanker