Message ID | 20231121115251.588436-1-lishifeng1992@126.com |
---|---|
State | New |
Headers | From: Shifeng Li <lishifeng1992@126.com>; To: saeedm@nvidia.com, leon@kernel.org, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, jackm@dev.mellanox.co.il, ogerlitz@mellanox.com, roland@purestorage.com, eli@mellanox.com; Cc: dinghui@sangfor.com.cn, netdev@vger.kernel.org, linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org, Shifeng Li <lishifeng1992@126.com>; Subject: [PATCH v2] net/mlx5e: Fix a race in command alloc flow; Date: Tue, 21 Nov 2023 03:52:51 -0800; Message-Id: <20231121115251.588436-1-lishifeng1992@126.com>; X-Mailer: git-send-email 2.25.1 |
Series | [v2] net/mlx5e: Fix a race in command alloc flow |
Commit Message
Shifeng Li
Nov. 21, 2023, 11:52 a.m. UTC
Fix a cmd->ent use after free due to a race on command entry.
The race occurs when one command releases its last refcount and frees
its index and entry while another process, running the command flush
flow, takes a refcount on the same command entry. The process handling
the command flush may see this command as needing to be flushed if the
other process has allocated an ent->idx but has not yet stored ent in
cmd->ent_arr in cmd_work_handler(). Fix it by moving the assignment to
cmd->ent_arr under the spin lock.
[70013.081955] BUG: KASAN: use-after-free in mlx5_cmd_trigger_completions+0x1e2/0x4c0 [mlx5_core]
[70013.081967] Write of size 4 at addr ffff88880b1510b4 by task kworker/26:1/1433361
[70013.081968]
[70013.081989] CPU: 26 PID: 1433361 Comm: kworker/26:1 Kdump: loaded Tainted: G OE 4.19.90-25.17.v2101.osc.sfc.6.10.0.0030.ky10.x86_64+debug #1
[70013.082001] Hardware name: SANGFOR 65N32-US/ASERVER-G-2605, BIOS SSSS5203 08/19/2020
[70013.082028] Workqueue: events aer_isr
[70013.082053] Call Trace:
[70013.082067] dump_stack+0x8b/0xbb
[70013.082086] print_address_description+0x6a/0x270
[70013.082102] kasan_report+0x179/0x2c0
[70013.082133] ? mlx5_cmd_trigger_completions+0x1e2/0x4c0 [mlx5_core]
[70013.082173] mlx5_cmd_trigger_completions+0x1e2/0x4c0 [mlx5_core]
[70013.082213] ? mlx5_cmd_use_polling+0x20/0x20 [mlx5_core]
[70013.082223] ? kmem_cache_free+0x1ad/0x1e0
[70013.082267] mlx5_cmd_flush+0x80/0x180 [mlx5_core]
[70013.082304] mlx5_enter_error_state+0x106/0x1d0 [mlx5_core]
[70013.082338] mlx5_try_fast_unload+0x2ea/0x4d0 [mlx5_core]
[70013.082377] remove_one+0x200/0x2b0 [mlx5_core]
[70013.082390] ? __pm_runtime_resume+0x58/0x70
[70013.082409] pci_device_remove+0xf3/0x280
[70013.082426] ? pcibios_free_irq+0x10/0x10
[70013.082439] device_release_driver_internal+0x1c3/0x470
[70013.082453] pci_stop_bus_device+0x109/0x160
[70013.082468] pci_stop_and_remove_bus_device+0xe/0x20
[70013.082485] pcie_do_fatal_recovery+0x167/0x550
[70013.082493] aer_isr+0x7d2/0x960
[70013.082510] ? aer_get_device_error_info+0x420/0x420
[70013.082526] ? __schedule+0x821/0x2040
[70013.082536] ? strscpy+0x85/0x180
[70013.082543] process_one_work+0x65f/0x12d0
[70013.082556] worker_thread+0x87/0xb50
[70013.082563] ? __kthread_parkme+0x82/0xf0
[70013.082569] ? process_one_work+0x12d0/0x12d0
[70013.082571] kthread+0x2e9/0x3a0
[70013.082579] ? kthread_create_worker_on_cpu+0xc0/0xc0
[70013.082592] ret_from_fork+0x1f/0x40
Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Shifeng Li <lishifeng1992@126.com>
---
drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
---
v1->v2: fix code conflicts.
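
The fix described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the driver code: `struct cmd_model`, `struct work_ent`, `MAX_SLOTS`, `cmd_model_init()` and `model_alloc_index()` are hypothetical stand-ins, a pthread mutex replaces the `cmd->alloc_lock` spin lock, and `-1` replaces `-ENOMEM`. The point it shows is that the slot is claimed and the entry pointer is published inside one critical section.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_SLOTS 8	/* hypothetical; stands in for cmd->vars.max_reg_cmds */

struct work_ent {	/* simplified stand-in for struct mlx5_cmd_work_ent */
	int idx;
};

struct cmd_model {	/* simplified stand-in for struct mlx5_cmd */
	pthread_mutex_t alloc_lock;		/* models cmd->alloc_lock */
	unsigned long bitmask;			/* bit set = slot is free */
	struct work_ent *ent_arr[MAX_SLOTS];
};

static void cmd_model_init(struct cmd_model *cmd)
{
	pthread_mutex_init(&cmd->alloc_lock, NULL);
	cmd->bitmask = (1UL << MAX_SLOTS) - 1;	/* all slots start free */
	for (int i = 0; i < MAX_SLOTS; i++)
		cmd->ent_arr[i] = NULL;
}

/*
 * Claim the lowest free slot and publish ent in the same critical
 * section, so that anyone scanning under alloc_lock never sees a
 * claimed slot that still points at a previous entry.
 * Returns the slot index, or -1 if no slot is free.
 */
static int model_alloc_index(struct cmd_model *cmd, struct work_ent *ent)
{
	int ret = -1;

	pthread_mutex_lock(&cmd->alloc_lock);
	for (int i = 0; i < MAX_SLOTS; i++) {
		if (cmd->bitmask & (1UL << i)) {
			cmd->bitmask &= ~(1UL << i);	/* models clear_bit() */
			ent->idx = i;
			cmd->ent_arr[i] = ent;		/* publish under the lock */
			ret = i;
			break;
		}
	}
	pthread_mutex_unlock(&cmd->alloc_lock);

	return ret;
}
```

In this model, a caller that gets a non-negative index back can rely on `ent_arr[idx]` already pointing at its entry, which is the property the patch aims to give `cmd_work_handler()`.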
Comments
On Tue, Nov 21, 2023 at 03:52:51AM -0800, Shifeng Li wrote:
> Fix a cmd->ent use after free due to a race on command entry.
> [...]
> [70013.082592] ret_from_fork+0x1f/0x40

I'm curious how did you get this error? I would expect to see some sort
of lock in upper level which prevents it.

Thanks
On 2023/11/22 20:02, Leon Romanovsky wrote:
> On Tue, Nov 21, 2023 at 03:52:51AM -0800, Shifeng Li wrote:
>> Fix a cmd->ent use after free due to a race on command entry.
>> [...]
>> [70013.082592] ret_from_fork+0x1f/0x40
>
> I'm curious how did you get this error? I would expect to see some sort
> of lock in upper level which prevents it.
>

I get this error by repeatedly injecting an AER unrecoverable error into
the PCI device (BDF) that corresponds to the network card.

Thanks

> Thanks
On 2023/11/22 20:02, Leon Romanovsky wrote:
> On Tue, Nov 21, 2023 at 03:52:51AM -0800, Shifeng Li wrote:
>> Fix a cmd->ent use after free due to a race on command entry.
>> [...]
>> [70013.082592] ret_from_fork+0x1f/0x40
>
> I'm curious how did you get this error? I would expect to see some sort
> of lock in upper level which prevents it.
>

The logical relationship of this error is as follows:

aer_recover_work                          | ent->work
------------------------------------------+------------------------------
aer_recover_work_func                     |
|- pcie_do_recovery                       |
|- report_error_detected                  |
|- mlx5_pci_err_detected                  | cmd_work_handler
|- mlx5_enter_error_state                 | |- cmd_alloc_index
|- enter_error_state                      | |- lock cmd->alloc_lock
|- mlx5_cmd_flush                         | |- clear_bit
|- mlx5_cmd_trigger_completions           | |- unlock cmd->alloc_lock
|- lock cmd->alloc_lock                   |
|- vector = ~dev->cmd.vars.bitmask        |
|- for_each_set_bit                       |
|- cmd_ent_get(cmd->ent_arr[i]) (UAF)     |
|- unlock cmd->alloc_lock                 |
                                          | |- cmd->ent_arr[ent->idx] = ent

The cmd->ent_arr[ent->idx] assignment and the bit clearing are not
protected by the cmd->alloc_lock in cmd_work_handler().

> Thanks
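
The interleaving in the table above can be replayed deterministically in a single-threaded userspace model. This is a sketch under stated assumptions, not driver code: `struct cmd_model`, `struct work_ent`, `flush_count_stale()` and `replay()` are hypothetical names, the `alive` flag stands in for "the entry has been freed" so that no real use-after-free happens, and the leftover pointer in slot 0 is assumed to come from a previous user of that index. Lock and unlock points are marked in comments only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 4	/* hypothetical slot count for the model */

struct work_ent {
	int idx;
	int alive;	/* 0 models an entry that has already been freed */
};

struct cmd_model {
	unsigned long bitmask;			/* bit set = slot is free */
	struct work_ent *ent_arr[MAX_SLOTS];
};

/* Flush side: visit every claimed slot and count stale pointers. */
static int flush_count_stale(const struct cmd_model *cmd)
{
	unsigned long vector = ~cmd->bitmask & ((1UL << MAX_SLOTS) - 1);
	int stale = 0;

	for (int i = 0; i < MAX_SLOTS; i++) {
		if (!(vector & (1UL << i)))
			continue;
		/* the driver would take a reference on ent_arr[i] here */
		if (cmd->ent_arr[i] == NULL || !cmd->ent_arr[i]->alive)
			stale++;
	}

	return stale;
}

/* Replay the interleaving; returns the stale count seen by the flush. */
static int replay(int publish_under_lock)
{
	struct work_ent old_ent = { .idx = 0, .alive = 0 };	/* already released */
	struct work_ent new_ent = { .idx = 0, .alive = 1 };
	struct cmd_model cmd = { .bitmask = (1UL << MAX_SLOTS) - 1 };
	int stale;

	cmd.ent_arr[0] = &old_ent;	/* assumed leftover from the previous user */

	/* worker: lock; clear_bit; (patched ordering: publish); unlock */
	cmd.bitmask &= ~1UL;
	if (publish_under_lock)
		cmd.ent_arr[0] = &new_ent;

	/* flush: lock; scan; unlock -- runs before the worker's late store */
	stale = flush_count_stale(&cmd);

	/* worker (unpatched ordering): store after the lock was dropped */
	if (!publish_under_lock)
		cmd.ent_arr[0] = &new_ent;

	return stale;
}
```

Calling `replay(0)` models the ordering in the table, where the flush finds a claimed slot that still points at the released entry; `replay(1)` models the ordering after the patch, where the flush only ever sees the new entry.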
On 11/21/2023 1:52 PM, Shifeng Li wrote:
> Fix a cmd->ent use after free due to a race on command entry.
> [...]
> [70013.082592] ret_from_fork+0x1f/0x40
>
> Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
> Signed-off-by: Shifeng Li <lishifeng1992@126.com>

Fixes tag should be:
Fixes: 50b2412b7e78 ("net/mlx5: Avoid possible free of command entry while timeout comp handler")

Reviewed-by: Moshe Shemesh <moshe@nvidia.com>

Thanks!
On 2023/11/27 0:13, Moshe Shemesh wrote:
> On 11/21/2023 1:52 PM, Shifeng Li wrote:
>> Fix a cmd->ent use after free due to a race on command entry.
>> [...]
>> [70013.082592] ret_from_fork+0x1f/0x40
>>
>> Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
>> Signed-off-by: Shifeng Li <lishifeng1992@126.com>
>
> Fixes tag should be:
> Fixes: 50b2412b7e78 ("net/mlx5: Avoid possible free of command entry while timeout comp handler")
>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
>

I have sent v3.

Thanks!

> Thanks!
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index f8f0a712c943..a7b1f9686c09 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -156,15 +156,18 @@ static u8 alloc_token(struct mlx5_cmd *cmd)
 	return token;
 }
 
-static int cmd_alloc_index(struct mlx5_cmd *cmd)
+static int cmd_alloc_index(struct mlx5_cmd *cmd, struct mlx5_cmd_work_ent *ent)
 {
 	unsigned long flags;
 	int ret;
 
 	spin_lock_irqsave(&cmd->alloc_lock, flags);
 	ret = find_first_bit(&cmd->vars.bitmask, cmd->vars.max_reg_cmds);
-	if (ret < cmd->vars.max_reg_cmds)
+	if (ret < cmd->vars.max_reg_cmds) {
 		clear_bit(ret, &cmd->vars.bitmask);
+		ent->idx = ret;
+		cmd->ent_arr[ent->idx] = ent;
+	}
 	spin_unlock_irqrestore(&cmd->alloc_lock, flags);
 
 	return ret < cmd->vars.max_reg_cmds ? ret : -ENOMEM;
@@ -979,7 +982,7 @@ static void cmd_work_handler(struct work_struct *work)
 	sem = ent->page_queue ? &cmd->vars.pages_sem : &cmd->vars.sem;
 	down(sem);
 	if (!ent->page_queue) {
-		alloc_ret = cmd_alloc_index(cmd);
+		alloc_ret = cmd_alloc_index(cmd, ent);
 		if (alloc_ret < 0) {
 			mlx5_core_err_rl(dev, "failed to allocate command entry\n");
 			if (ent->callback) {
@@ -994,15 +997,14 @@ static void cmd_work_handler(struct work_struct *work)
 			up(sem);
 			return;
 		}
-		ent->idx = alloc_ret;
 	} else {
 		ent->idx = cmd->vars.max_reg_cmds;
 		spin_lock_irqsave(&cmd->alloc_lock, flags);
 		clear_bit(ent->idx, &cmd->vars.bitmask);
+		cmd->ent_arr[ent->idx] = ent;
 		spin_unlock_irqrestore(&cmd->alloc_lock, flags);
 	}
 
-	cmd->ent_arr[ent->idx] = ent;
 	lay = get_inst(cmd, ent->idx);
 	ent->lay = lay;
 	memset(lay, 0, sizeof(*lay));