Message ID | 20230725110622.129361-1-ptyadav@amazon.de |
---|---|
State | New |
Headers |
From: Pratyush Yadav <ptyadav@amazon.de>
To: Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>
Cc: Pratyush Yadav <ptyadav@amazon.de>, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
Date: Tue, 25 Jul 2023 13:06:22 +0200
Message-ID: <20230725110622.129361-1-ptyadav@amazon.de>
Series | nvme-pci: do not set the NUMA node of device if it has none |
Commit Message
Pratyush Yadav
July 25, 2023, 11:06 a.m. UTC
If a device has no NUMA node information associated with it, the driver
puts the device in node first_memory_node (say node 0). As a side
effect, this gives an indication to userspace IRQ balancing programs
that the device is in node 0 so they prefer CPUs in node 0 to handle the
IRQs associated with the queues. For example, irqbalance will only let
CPUs in node 0 handle the interrupts. This reduces random access
performance on CPUs in node 1 since the interrupt for command completion
will fire on node 0.
For example, AWS EC2's i3.16xlarge instance does not expose NUMA
information for the NVMe devices. This means all NVMe devices have
NUMA_NO_NODE by default. Without this patch, random 4k read performance
measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
both nodes get similar performance (around 315k IOPS).
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---
drivers/nvme/host/pci.c | 3 ---
1 file changed, 3 deletions(-)
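
As a rough illustration of what the removal changes (a minimal sketch, not the actual nvme probe code; the helper name is made up): a device without firmware NUMA information keeps NUMA_NO_NODE, so node-aware allocations express no node preference instead of being pinned to first_memory_node (typically node 0).

/*
 * Sketch only: how a node-aware allocation behaves once the device
 * keeps NUMA_NO_NODE. The function name is illustrative, not a real
 * nvme helper.
 */
#include <linux/device.h>
#include <linux/numa.h>
#include <linux/slab.h>

static void *example_alloc_per_device(struct device *dev, size_t size)
{
	/* dev_to_node() returns NUMA_NO_NODE (-1) if firmware gave no node. */
	int node = dev_to_node(dev);

	/*
	 * With NUMA_NO_NODE this behaves like plain kzalloc(), i.e. no
	 * particular node is preferred. Before this patch, the probe path
	 * forced the device onto first_memory_node, so allocations and the
	 * IRQ/tagset setup keyed off the device node favoured node 0.
	 */
	return kzalloc_node(size, GFP_KERNEL, node);
}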
Comments
On Tue, Jul 25, 2023 at 01:06:22PM +0200, Pratyush Yadav wrote:
> If a device has no NUMA node information associated with it, the driver
> puts the device in node first_memory_node (say node 0). As a side
> effect, this gives an indication to userspace IRQ balancing programs
> that the device is in node 0 so they prefer CPUs in node 0 to handle the
> IRQs associated with the queues. For example, irqbalance will only let
> CPUs in node 0 handle the interrupts. This reduces random access
> performance on CPUs in node 1 since the interrupt for command completion
> will fire on node 0.
>
> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
> information for the NVMe devices. This means all NVMe devices have
> NUMA_NO_NODE by default. Without this patch, random 4k read performance
> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
> both nodes get similar performance (around 315k IOPS).

irqbalance doesn't work with this driver though: the interrupts are
managed by the kernel. Is there some other reason to explain the perf
difference?
>> If a device has no NUMA node information associated with it, the driver
>> puts the device in node first_memory_node (say node 0). As a side
>> effect, this gives an indication to userspace IRQ balancing programs
>> that the device is in node 0 so they prefer CPUs in node 0 to handle the
>> IRQs associated with the queues. For example, irqbalance will only let
>> CPUs in node 0 handle the interrupts. This reduces random access
>> performance on CPUs in node 1 since the interrupt for command completion
>> will fire on node 0.
>>
>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>> information for the NVMe devices. This means all NVMe devices have
>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>> both nodes get similar performance (around 315k IOPS).
>
> irqbalance doesn't work with this driver though: the interrupts are
> managed by the kernel. Is there some other reason to explain the perf
> difference?

Maybe its because the numa_node goes to the tagset which allocates
stuff based on that numa-node ?
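
A minimal sketch of the mechanism Sagi is pointing at (simplified; example_setup_tag_set() is a made-up name, not the real nvme code path): the device's NUMA node is copied into the blk-mq tag set, and blk-mq then allocates its per-queue structures on that node.

#include <linux/blk-mq.h>
#include <linux/device.h>

/*
 * Sketch only: the caller is assumed to have filled in ops, nr_hw_queues,
 * queue_depth and friends; only the numa_node handling is shown.
 */
static int example_setup_tag_set(struct blk_mq_tag_set *set, struct device *dev)
{
	/* node 0 with the old first_memory_node fallback, NUMA_NO_NODE after the patch */
	set->numa_node = dev_to_node(dev);

	/* blk-mq allocates tags and requests on set->numa_node */
	return blk_mq_alloc_tag_set(set);
}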
On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote:
>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>>> information for the NVMe devices. This means all NVMe devices have
>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>>> both nodes get similar performance (around 315k IOPS).
>>
>> irqbalance doesn't work with this driver though: the interrupts are
>> managed by the kernel. Is there some other reason to explain the perf
>> difference?
>
> Maybe its because the numa_node goes to the tagset which allocates
> stuff based on that numa-node ?

Yeah, the only explanation I could come up with is that without this
the allocations gets spread, and that somehow helps. All of this
is a little obscure, but so is the NVMe practice of setting the node id
to first_memory_node, which no other driver does. I'd really like to
understand what's going on here first. After that this patch probably
is the right thing, I'd just like to understand why.
On Wed, Jul 26 2023, Christoph Hellwig wrote:

Hi all,

> On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote:
>>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>>>> information for the NVMe devices. This means all NVMe devices have
>>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>>>> both nodes get similar performance (around 315k IOPS).
>>>
>>> irqbalance doesn't work with this driver though: the interrupts are
>>> managed by the kernel. Is there some other reason to explain the perf
>>> difference?

Hmm, I did not know that. I have not gone and looked at the code but I
think the same reasoning should hold, just with s/irqbalance/kernel. If
the kernel IRQ balancer sees the device is on node 0, it would deliver
its interrupts to CPUs on node 0.

In my tests I can see that the interrupts for NVME queues are sent only
to CPUs from node 0 without this patch. With this patch CPUs from both
nodes get the interrupts.

>>
>> Maybe its because the numa_node goes to the tagset which allocates
>> stuff based on that numa-node ?
>
> Yeah, the only explanation I could come up with is that without this
> the allocations gets spread, and that somehow helps. All of this
> is a little obscure, but so is the NVMe practice of setting the node id
> to first_memory_node, which no other driver does. I'd really like to
> understand what's going on here first. After that this patch probably
> is the right thing, I'd just like to understand why.

See above for my conjecture on why this happens.

More specifically, I discovered this when running an application pinned
to a node 1 CPU reading from an NVME device. I noticed it was performing
worse than when it was pinned to node 0.

If the process is free to move around it might not see such a large
performance difference since it could move to a node 0 CPU. But if it is
pinned to a CPU in node 1 then the interrupt will always hit a node 0
CPU and create higher latency for the reads.

I have a simple fio test that can reproduce this. Save this [1] as
fio.txt and then run numactl --cpunodebind 1 fio ./fio.txt. You can run
it on any host with an NVME device that has no NUMA node. I have tested
this on AWS EC2's i3.16xlarge instance type.

[1]

[global]
ioengine=libaio
filename=/dev/nvme0n1
group_reporting=1
direct=1
verify=0
norandommap=0
size=10%
time_based=1
runtime=30
ramp_time=0
randrepeat=0
log_max_value=1
unified_rw_reporting=1
percentile_list=50:99:99.9:99.99:99.999
bwavgtime=10000

[4k_randread_qd16_4w]
stonewall
bs=4k
rw=randread
iodepth=32
numjobs=1
On Wed, Jul 26, 2023 at 05:30:33PM +0200, Pratyush Yadav wrote: > On Wed, Jul 26 2023, Christoph Hellwig wrote: > > On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote: > >>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA > >>>> information for the NVMe devices. This means all NVMe devices have > >>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance > >>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% > >>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on > >>>> both nodes get similar performance (around 315k IOPS). > >>> > >>> irqbalance doesn't work with this driver though: the interrupts are > >>> managed by the kernel. Is there some other reason to explain the perf > >>> difference? > > Hmm, I did not know that. I have not gone and looked at the code but I > think the same reasoning should hold, just with s/irqbalance/kernel. If > the kernel IRQ balancer sees the device is on node 0, it would deliver > its interrupts to CPUs on node 0. > > In my tests I can see that the interrupts for NVME queues are sent only > to CPUs from node 0 without this patch. With this patch CPUs from both > nodes get the interrupts. Could you send the output of: numactl --hardware and then with and without your patch: for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ cat /proc/irq/$i/{smp,effective}_affinity_list; \ done ?
On Wed, Jul 26 2023, Keith Busch wrote: > On Wed, Jul 26, 2023 at 05:30:33PM +0200, Pratyush Yadav wrote: >> On Wed, Jul 26 2023, Christoph Hellwig wrote: >> > On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote: >> >>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA >> >>>> information for the NVMe devices. This means all NVMe devices have >> >>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance >> >>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% >> >>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on >> >>>> both nodes get similar performance (around 315k IOPS). >> >>> >> >>> irqbalance doesn't work with this driver though: the interrupts are >> >>> managed by the kernel. Is there some other reason to explain the perf >> >>> difference? >> >> Hmm, I did not know that. I have not gone and looked at the code but I >> think the same reasoning should hold, just with s/irqbalance/kernel. If >> the kernel IRQ balancer sees the device is on node 0, it would deliver >> its interrupts to CPUs on node 0. >> >> In my tests I can see that the interrupts for NVME queues are sent only >> to CPUs from node 0 without this patch. With this patch CPUs from both >> nodes get the interrupts. > > Could you send the output of: > > numactl --hardware $ numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 0 size: 245847 MB node 0 free: 245211 MB node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 245932 MB node 1 free: 245328 MB node distances: node 0 1 0: 10 21 1: 21 10 > > and then with and without your patch: > > for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > done Without my patch: $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > done 40 40 33 33 44 44 9 9 32 32 2 2 6 6 11 11 1 1 35 35 39 39 13 13 42 42 46 46 41 41 46 46 15 15 5 5 43 43 0 0 14 14 8 8 12 12 7 7 10 10 47 47 38 38 36 36 3 3 34 34 45 45 5 5 With my patch: $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > done 9 9 15 15 5 5 23 23 38 38 52 52 21 21 36 36 13 13 56 56 44 44 42 42 31 31 48 48 5 5 3 3 1 1 11 11 28 28 18 18 34 34 29 29 58 58 46 46 54 54 59 59 32 32 7 7 56 56 62 62 49 49 57 57
On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote: > On Wed, Jul 26 2023, Keith Busch wrote: > > Could you send the output of: > > > > numactl --hardware > > $ numactl --hardware > available: 2 nodes (0-1) > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 > node 0 size: 245847 MB > node 0 free: 245211 MB > node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 > node 1 size: 245932 MB > node 1 free: 245328 MB > node distances: > node 0 1 > 0: 10 21 > 1: 21 10 > > > > > and then with and without your patch: > > > > for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done > > Without my patch: > > $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done Hm, I wonder if there's something wrong with my script. All the cpu's should be accounted for in the smp_affinity_list, assuming it captured all the vectors of the nvme device, but both examples are missing half the CPUs. It looks like you have 32 vectors. Does that sound right? This does show the effective affinity is indeed always on node 0 without your patch. I don't see why, though: the "group_cpus_evenly()" function that spreads the interrupts doesn't know anything about the device the resource is being grouped for, so it shouldn't even take its NUMA node into consideration. It's just supposed to ensure all CPUs have a shared resource, preferring to not share across numa nodes. I'll emulate a similar CPU topology with similar nvme vector count and see if I can find anything suspicious. I'm a little concerned we may have the same problem for devices that have an associated NUMA node that your patch isn't addressing. > 41 > 40 > 33 > 33 > 44 > 44 > 9 > 9 > 32 > 32 > 2 > 2 > 6 > 6 > 11 > 11 > 1 > 1 > 35 > 35 > 39 > 39 > 13 > 13 > 42 > 42 > 46 > 46 > 41 > 41 > 46 > 46 > 15 > 15 > 5 > 5 > 43 > 43 > 0 > 0 > 14 > 14 > 8 > 8 > 12 > 12 > 7 > 7 > 10 > 10 > 47 > 47 > 38 > 38 > 36 > 36 > 3 > 3 > 34 > 34 > 45 > 45 > 5 > 5 > > With my patch: > > $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done > 9 > 9 > 15 > 15 > 5 > 5 > 23 > 23 > 38 > 38 > 52 > 52 > 21 > 21 > 36 > 36 > 13 > 13 > 56 > 56 > 44 > 44 > 42 > 42 > 31 > 31 > 48 > 48 > 5 > 5 > 3 > 3 > 1 > 1 > 11 > 11 > 28 > 28 > 18 > 18 > 34 > 34 > 29 > 29 > 58 > 58 > 46 > 46 > 54 > 54 > 59 > 59 > 32 > 32 > 7 > 7 > 56 > 56 > 62 > 62 > 49 > 49 > 57 > 57
Hi, On Wed, Jul 26 2023, Keith Busch wrote: > On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote: >> On Wed, Jul 26 2023, Keith Busch wrote: >> > Could you send the output of: >> > >> > numactl --hardware >> >> $ numactl --hardware >> available: 2 nodes (0-1) >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 >> node 0 size: 245847 MB >> node 0 free: 245211 MB >> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 >> node 1 size: 245932 MB >> node 1 free: 245328 MB >> node distances: >> node 0 1 >> 0: 10 21 >> 1: 21 10 >> >> > >> > and then with and without your patch: >> > >> > for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ >> > cat /proc/irq/$i/{smp,effective}_affinity_list; \ >> > done >> >> Without my patch: >> >> $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ >> > cat /proc/irq/$i/{smp,effective}_affinity_list; \ >> > done > > Hm, I wonder if there's something wrong with my script. All the cpu's > should be accounted for in the smp_affinity_list, assuming it captured > all the vectors of the nvme device, but both examples are missing half > the CPUs. It looks like you have 32 vectors. Does that sound right? Yes, there are 32 vectors, from nvme0q0 to nvme0q31. Should there be one vector for each CPU? Perhaps the device does not support that many queues? FWIW, $ sudo nvme get-feature /dev/nvme0n1 -f 7 -H get-feature:0x7 (Number of Queues), Current value:0x1e001e Number of IO Completion Queues Allocated (NCQA): 31 Number of IO Submission Queues Allocated (NSQA): 31 > > This does show the effective affinity is indeed always on node 0 without > your patch. I don't see why, though: the "group_cpus_evenly()" function > that spreads the interrupts doesn't know anything about the device the > resource is being grouped for, so it shouldn't even take its NUMA node > into consideration. It's just supposed to ensure all CPUs have a shared > resource, preferring to not share across numa nodes. I am guessing you are looking at irq_create_affinity_masks(). Yeah, It does not take into account the NUMA information. In fact, even if it did, the NUMA node associated with the IRQ is NUMA_NO_NODE (/proc/$irq/node == -1). I did some more digging over the week to figure out what is going on. It seems like the kernel _does_ in fact allow all CPUs in the affinity. I added some prints in set_affinity_irq() in drivers/xen/events/events_base.c (since that is the irqchip for the interrupt). I see it being called with mask: ffffffff,ffffffff. But I later see the function being called again with a different mask: 00000000,00008000. The stack trace shows the call is coming from ksys_write(). The process doing the write is irqbalance. So I think your earlier statement was incorrect. irqbalance does in fact balance these interrupts and it probably looks at the NUMA information of the device to make that decision. My original reasoning holds and irqbalance is the one picking the affinity. With this explanation, do you think the patch is good to go? BTW, could you please also add the below when applying? I forgot to add it when sending the patch. Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq") > > I'll emulate a similar CPU topology with similar nvme vector count and > see if I can find anything suspicious. 
I'm a little concerned we may > have the same problem for devices that have an associated NUMA node that > your patch isn't addressing. > [...] -- Regards, Pratyush Yadav Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Fri, Jul 28, 2023 at 08:09:32PM +0200, Pratyush Yadav wrote:
>
> I am guessing you are looking at irq_create_affinity_masks(). Yeah, It
> does not take into account the NUMA information. In fact, even if it
> did, the NUMA node associated with the IRQ is NUMA_NO_NODE
> (/proc/$irq/node == -1).
>
> I did some more digging over the week to figure out what is going on. It
> seems like the kernel _does_ in fact allow all CPUs in the affinity. I
> added some prints in set_affinity_irq() in
> drivers/xen/events/events_base.c (since that is the irqchip for the
> interrupt). I see it being called with mask: ffffffff,ffffffff.
>
> But I later see the function being called again with a different mask:
> 00000000,00008000. The stack trace shows the call is coming from
> ksys_write(). The process doing the write is irqbalance.
>
> So I think your earlier statement was incorrect. irqbalance does in fact
> balance these interrupts and it probably looks at the NUMA information
> of the device to make that decision. My original reasoning holds and
> irqbalance is the one picking the affinity.
>
> With this explanation, do you think the patch is good to go?

irqbalance still writes to the /proc/<irq>/smp_affinity to change it,
right? That's just getting I/O errors on my machines because it fails
irq_can_set_affinity_usr() for nvme's kernel managed interrupts (except
the first vector, but that one is not used for I/O). Is there another
path irqbalance is using that's somehow getting past the appropriate
checks? Or perhaps is your xen irq_chip somehow bypassing the managed
irq property?
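
As a rough sketch of the gate Keith is referring to (simplified, not the exact kernel code; the function name is illustrative): a write to /proc/irq/<n>/smp_affinity is supposed to be rejected when the interrupt is kernel-managed.

#include <linux/irq.h>

/* Sketch only: managed interrupts should refuse userspace affinity writes. */
static bool example_userspace_may_set_affinity(struct irq_data *d)
{
	return !irqd_affinity_is_managed(d);
}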
On Fri, Jul 28 2023, Keith Busch wrote:
> On Fri, Jul 28, 2023 at 08:09:32PM +0200, Pratyush Yadav wrote:
>>
>> I am guessing you are looking at irq_create_affinity_masks(). Yeah, It
>> does not take into account the NUMA information. In fact, even if it
>> did, the NUMA node associated with the IRQ is NUMA_NO_NODE
>> (/proc/$irq/node == -1).
>>
>> I did some more digging over the week to figure out what is going on. It
>> seems like the kernel _does_ in fact allow all CPUs in the affinity. I
>> added some prints in set_affinity_irq() in
>> drivers/xen/events/events_base.c (since that is the irqchip for the
>> interrupt). I see it being called with mask: ffffffff,ffffffff.
>>
>> But I later see the function being called again with a different mask:
>> 00000000,00008000. The stack trace shows the call is coming from
>> ksys_write(). The process doing the write is irqbalance.
>>
>> So I think your earlier statement was incorrect. irqbalance does in fact
>> balance these interrupts and it probably looks at the NUMA information
>> of the device to make that decision. My original reasoning holds and
>> irqbalance is the one picking the affinity.
>>
>> With this explanation, do you think the patch is good to go?
>
> irqbalance still writes to the /proc/<irq>/smp_affinity to change it,
> right? That's just getting I/O errors on my machines because it fails
> irq_can_set_affinity_usr() for nvme's kernel managed interrupts (except
> the first vector, but that one is not used for I/O). Is there another
> path irqbalance is using that's somehow getting past the appropriate
> checks? Or perhaps is your xen irq_chip somehow bypassing the managed
> irq property?

I picked the interrupt "nvme4q26" as example. The call stack is (printed
via WARN_ON(1)):

  ? __warn+0x7d/0x140
  ? set_affinity_irq+0xf0/0x220
  ? report_bug+0xf8/0x1e0
  ? handle_bug+0x44/0x80
  ? exc_invalid_op+0x13/0x60
  ? asm_exc_invalid_op+0x16/0x20
  ? set_affinity_irq+0xf0/0x220
  ? set_affinity_irq+0xf0/0x220
  irq_do_set_affinity+0x135/0x1e0
  irq_set_affinity_locked+0x186/0x1f0
  __irq_set_affinity+0x41/0x70
  write_irq_affinity.isra.8+0xf6/0x120
  proc_reg_write+0x59/0x80
  vfs_write+0xc7/0x3c0
  ? __do_sys_newfstat+0x35/0x60
  ? __fget_light+0xcb/0x120
  ksys_write+0xa5/0xe0
  do_syscall_64+0x42/0x90
  entry_SYSCALL_64_after_hwframe+0x63/0xcd

The check you mention is in write_irq_affinity(). I added some prints
there and it turns out that __irqd_can_set_affinity() returns true and
irqd_affinity_is_managed() is false.

I did some more digging and it turns out that the masks are created by
irq_create_affinity_masks() which sets is_managed to the IRQ affinity
descriptors. This is then passed down to __msi_domain_alloc_locked().

On a non-Xen system you would end up calling __msi_domain_alloc_irqs()
next since ops->domain_alloc_irqs() is only implemented by Xen. This
function finds the masks created earlier and passes them down to
__irq_domain_alloc_irqs(). This then eventually lands in alloc_descs()
which checks is_managed and sets IRQD_AFFINITY_MANAGED.

On Xen though, xen_msi_domain_alloc_irqs() is called. This eventually
lands in xen_allocate_irqs_dynamic() which calls irq_alloc_descs(). This
macro calls __irq_alloc_descs() with affinity set to NULL. This leads to
us losing the is_managed flag and the affinities created by
irq_create_affinity_masks() via group_cpus_evenly().

As a result of this, MSI IRQs on Xen can never be managed by the kernel.
They are marked as userspace manageable and irqbalance can set their
affinity.
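
For reference, the macro in question (simplified; the real definition lives in include/linux/irq.h, as I understand it) hard-codes a NULL affinity descriptor array, which is why is_managed never reaches alloc_descs() on the Xen path:

/* Simplified reproduction of the stock macro. */
#define irq_alloc_descs(irq, from, cnt, node)	\
	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL)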
Applying the (hacky) patch below fixes this problem and lets the kernel
manage the affinities.

------ 8< ------

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd4522..15f36e34e28b4 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -743,9 +743,10 @@ static void xen_irq_init(unsigned irq)
 	list_add_tail(&info->list, &xen_irq_list_head);
 }
 
-static int __must_check xen_allocate_irqs_dynamic(int nvec)
+static int __must_check xen_allocate_irqs_dynamic(int nvec,
+						  struct irq_affinity_desc *affd)
 {
-	int i, irq = irq_alloc_descs(-1, 0, nvec, -1);
+	int i, irq = __irq_alloc_descs(-1, 0, nvec, -1, THIS_MODULE, affd);
 
 	if (irq >= 0) {
 		for (i = 0; i < nvec; i++)
@@ -758,7 +759,7 @@ static int __must_check xen_allocate_irqs_dynamic(int nvec)
 
 static inline int __must_check xen_allocate_irq_dynamic(void)
 {
-	return xen_allocate_irqs_dynamic(1);
+	return xen_allocate_irqs_dynamic(1, NULL);
 }
 
 static int __must_check xen_allocate_irq_gsi(unsigned gsi)
@@ -1108,7 +1109,7 @@ int xen_bind_pirq_msi_to_irq(struct pci_dev *dev, struct msi_desc *msidesc,
 
 	mutex_lock(&irq_mapping_update_lock);
 
-	irq = xen_allocate_irqs_dynamic(nvec);
+	irq = xen_allocate_irqs_dynamic(nvec, msidesc->affinity);
 	if (irq < 0)
 		goto out;

------ >8 ------

With this patch, I get the below affinities:

$ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>     cat /proc/irq/$i/{smp,effective}_affinity_list; \
> done
8
8
16-17,48,65,67,69

18-19,50,71,73,75

20,52,77,79

21,53,81,83

22,54,85,87

23,55,89,91

24,56,93,95

25,57,97,99

26,58,101,103

27,59,105,107

28,60,109,111

29,61,113,115

30,62,117,119

31,63,121,123

49,51,125,127

0,32,64,66

1,33,68,70

2,34,72,74

3,35,76,78

4,36,80,82

5,37,84,86

6,38,88,90

7,39,92,94

8,40,96,98

9,41,100,102

10,42,104,106

11,43,108,110

12,44,112,114

13,45,116,118

14,46,120,122

15,47,124,126

The blank lines are because effective_smp_affinity is blank for all but
the first interrupt.

The problem is, even with this I still get the same performance
difference when running on Node 1 vs Node 0. I am not sure why. Any
pointers?
On Fri, Aug 04, 2023 at 04:50:16PM +0200, Pratyush Yadav wrote: > With this patch, I get the below affinities: Something still seems off without effective_affinity set. That attribute should always reflect one CPU from the smp_affinity_list. At least with your patch, the smp_affinity_list looks as expected: every CPU is accounted for, and no vector appears to share the resource among CPUs in different nodes. > $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done > 8 > 8 > 16-17,48,65,67,69 > > 18-19,50,71,73,75 > > 20,52,77,79 > > 21,53,81,83 > > 22,54,85,87 > > 23,55,89,91 > > 24,56,93,95 > > 25,57,97,99 > > 26,58,101,103 > > 27,59,105,107 > > 28,60,109,111 > > 29,61,113,115 > > 30,62,117,119 > > 31,63,121,123 > > 49,51,125,127 > > 0,32,64,66 > > 1,33,68,70 > > 2,34,72,74 > > 3,35,76,78 > > 4,36,80,82 > > 5,37,84,86 > > 6,38,88,90 > > 7,39,92,94 > > 8,40,96,98 > > 9,41,100,102 > > 10,42,104,106 > > 11,43,108,110 > > 12,44,112,114 > > 13,45,116,118 > > 14,46,120,122 > > 15,47,124,126 > > The blank lines are because effective_smp_affinity is blank for all but the first interrupt. > > The problem is, even with this I still get the same performance > difference when running on Node 1 vs Node 0. I am not sure why. Any > pointers? I suspect effective_affinity isn't getting set and interrupts are triggering on unexpected CPUs. If you check /proc/interrupts, can you confirm if the interrupts are firing on CPUs within the smp_affinity_list or some other CPU?
On Fri, Aug 04 2023, Keith Busch wrote: > On Fri, Aug 04, 2023 at 04:50:16PM +0200, Pratyush Yadav wrote: >> With this patch, I get the below affinities: > > Something still seems off without effective_affinity set. That attribute > should always reflect one CPU from the smp_affinity_list. > > At least with your patch, the smp_affinity_list looks as expected: every > CPU is accounted for, and no vector appears to share the resource among > CPUs in different nodes. > >> $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ >> > cat /proc/irq/$i/{smp,effective}_affinity_list; \ >> > done >> 8 >> 8 >> 16-17,48,65,67,69 >> >> 18-19,50,71,73,75 >> >> 20,52,77,79 >> >> 21,53,81,83 >> >> 22,54,85,87 >> >> 23,55,89,91 >> >> 24,56,93,95 >> >> 25,57,97,99 >> >> 26,58,101,103 >> >> 27,59,105,107 >> >> 28,60,109,111 >> >> 29,61,113,115 >> >> 30,62,117,119 >> >> 31,63,121,123 >> >> 49,51,125,127 >> >> 0,32,64,66 >> >> 1,33,68,70 >> >> 2,34,72,74 >> >> 3,35,76,78 >> >> 4,36,80,82 >> >> 5,37,84,86 >> >> 6,38,88,90 >> >> 7,39,92,94 >> >> 8,40,96,98 >> >> 9,41,100,102 >> >> 10,42,104,106 >> >> 11,43,108,110 >> >> 12,44,112,114 >> >> 13,45,116,118 >> >> 14,46,120,122 >> >> 15,47,124,126 >> >> The blank lines are because effective_smp_affinity is blank for all but the first interrupt. >> >> The problem is, even with this I still get the same performance >> difference when running on Node 1 vs Node 0. I am not sure why. Any >> pointers? > > I suspect effective_affinity isn't getting set and interrupts are > triggering on unexpected CPUs. If you check /proc/interrupts, can you > confirm if the interrupts are firing on CPUs within the > smp_affinity_list or some other CPU? All interrupts are hitting CPU0. I did some more digging. Xen sets the effective affinity via the function set_affinity_irq(). But it only sets it if info->evtchn is valid. But info->evtchn is set by startup_pirq(), which is called _after_ set_affinity_irq(). This worked before because irqbalance was coming in after all this happened and set the affinity, causing set_affinity_irq() to be called again. By that time startup_pirq() has been called and info->evtchn is set. This whole thing needs some refactoring to work properly. I wrote up a hacky patch (on top of my previous hacky patch) that fixes it. I will write up a proper patch when I find some more time next week. With this, I am now seeing good performance on node 1. I am seeing slightly slower performance on node 1 but that might be due to some other reasons and I do not want to dive into that for now. Thanks for your help with this :-) BTW, do you think I should re-send the patch with updated reasoning, since Cristoph thinks we should do this regardless of the performance reasons [0]? [0] https://lore.kernel.org/linux-nvme/20230726131408.GA15909@lst.de/ ----- 8< ----- diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index c7715f8bd4522..ed33d33a2f1c9 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -108,6 +108,7 @@ struct irq_info { unsigned irq; evtchn_port_t evtchn; /* event channel */ unsigned short cpu; /* cpu bound */ + unsigned short suggested_cpu; unsigned short eoi_cpu; /* EOI must happen on this cpu-1 */ unsigned int irq_epoch; /* If eoi_cpu valid: irq_epoch of event */ u64 eoi_time; /* Time in jiffies when to EOI. 
*/
@@ -871,6 +873,8 @@ static void mask_ack_pirq(struct irq_data *data)
 	eoi_pirq(data);
 }
 
+static int xen_rebind_evtchn_to_cpu(struct irq_info *info, unsigned int tcpu);
+
 static unsigned int __startup_pirq(unsigned int irq)
 {
 	struct evtchn_bind_pirq bind_pirq;
@@ -903,6 +907,14 @@ static unsigned int __startup_pirq(unsigned int irq)
 	info->evtchn = evtchn;
 	bind_evtchn_to_cpu(evtchn, 0, false);
 
+	if (info->suggested_cpu) {
+		int ret;
+		ret = xen_rebind_evtchn_to_cpu(info, info->suggested_cpu);
+		if (!ret)
+			irq_data_update_effective_affinity(irq_get_irq_data(irq),
+							   cpumask_of(info->suggested_cpu));
+	}
+
 	rc = xen_evtchn_port_setup(evtchn);
 	if (rc)
 		goto err;
@@ -1856,12 +1872,17 @@ static unsigned int select_target_cpu(const struct cpumask *dest)
 static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest,
 			    bool force)
 {
+	struct irq_info *info = info_for_irq(data->irq);
 	unsigned int tcpu = select_target_cpu(dest);
 	int ret;
 
-	ret = xen_rebind_evtchn_to_cpu(info_for_irq(data->irq), tcpu);
-	if (!ret)
+	ret = xen_rebind_evtchn_to_cpu(info, tcpu);
+	if (!ret) {
 		irq_data_update_effective_affinity(data, cpumask_of(tcpu));
+	} else {
+		if (info)
+			info->suggested_cpu = tcpu;
+	}
 
 	return ret;
 }
On Tue, Aug 08, 2023 at 05:51:01PM +0200, Pratyush Yadav wrote:
> BTW, do you think I should re-send the patch with updated reasoning,
> since Cristoph thinks we should do this regardless of the performance
> reasons [0]?

Thanks for the investigation and finding the cause! Yeah, we should go
with your original patch as-is, just with an updated changelog.

While that may work around your immediate issue, I hope your xen patches
from this analysis will be considered for upstream as well :)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index baf69af7ea78e..f5ba2d7102eae 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2916,9 +2916,6 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	struct nvme_dev *dev;
 	int ret = -ENOMEM;
 
-	if (node == NUMA_NO_NODE)
-		set_dev_node(&pdev->dev, first_memory_node);
-
 	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return ERR_PTR(-ENOMEM);