nvme-pci: do not set the NUMA node of device if it has none

Message ID 20230725110622.129361-1-ptyadav@amazon.de
State New
Series nvme-pci: do not set the NUMA node of device if it has none

Commit Message

Pratyush Yadav July 25, 2023, 11:06 a.m. UTC
If a device has no NUMA node information associated with it, the driver
puts the device in node first_memory_node (say node 0). As a side
effect, this gives an indication to userspace IRQ balancing programs
that the device is in node 0 so they prefer CPUs in node 0 to handle the
IRQs associated with the queues. For example, irqbalance will only let
CPUs in node 0 handle the interrupts. This reduces random access
performance on CPUs in node 1 since the interrupt for command completion
will fire on node 0.

For example, AWS EC2's i3.16xlarge instance does not expose NUMA
information for the NVMe devices. This means all NVMe devices have
NUMA_NO_NODE by default. Without this patch, random 4k read performance
measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
both nodes get similar performance (around 315k IOPS).

Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---
 drivers/nvme/host/pci.c | 3 ---
 1 file changed, 3 deletions(-)
  

Comments

Keith Busch July 25, 2023, 2:35 p.m. UTC | #1
On Tue, Jul 25, 2023 at 01:06:22PM +0200, Pratyush Yadav wrote:
> If a device has no NUMA node information associated with it, the driver
> puts the device in node first_memory_node (say node 0). As a side
> effect, this gives an indication to userspace IRQ balancing programs
> that the device is in node 0 so they prefer CPUs in node 0 to handle the
> IRQs associated with the queues. For example, irqbalance will only let
> CPUs in node 0 handle the interrupts. This reduces random access
> performance on CPUs in node 1 since the interrupt for command completion
> will fire on node 0.
>
> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
> information for the NVMe devices. This means all NVMe devices have
> NUMA_NO_NODE by default. Without this patch, random 4k read performance
> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
> both nodes get similar performance (around 315k IOPS).

irqbalance doesn't work with this driver though: the interrupts are
managed by the kernel. Is there some other reason to explain the perf
difference?
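
(For context: the I/O-queue interrupts are "managed" because nvme-pci asks
the PCI/MSI core to spread and pin them at allocation time. A minimal
sketch of that request is below; it is simplified, not the exact driver
code, and details vary by kernel version.)

    /* Simplified sketch of how nvme-pci ends up with kernel-managed IRQs;
     * see nvme_setup_irqs() in drivers/nvme/host/pci.c for the real code. */
    struct irq_affinity affd = {
            .pre_vectors = 1,       /* the admin vector stays unmanaged */
    };
    int result;

    /* PCI_IRQ_AFFINITY makes the MSI core build per-vector affinity masks
     * (via irq_create_affinity_masks()) and mark the I/O vectors as
     * managed, so userspace normally cannot rewrite their affinity. */
    result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
                    PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);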
  
Sagi Grimberg July 26, 2023, 7:58 a.m. UTC | #2
>> If a device has no NUMA node information associated with it, the driver
>> puts the device in node first_memory_node (say node 0). As a side
>> effect, this gives an indication to userspace IRQ balancing programs
>> that the device is in node 0 so they prefer CPUs in node 0 to handle the
>> IRQs associated with the queues. For example, irqbalance will only let
>> CPUs in node 0 handle the interrupts. This reduces random access
>> performance on CPUs in node 1 since the interrupt for command completion
>> will fire on node 0.
>>
>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>> information for the NVMe devices. This means all NVMe devices have
>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>> both nodes get similar performance (around 315k IOPS).
> 
> irqbalance doesn't work with this driver though: the interrupts are
> managed by the kernel. Is there some other reason to explain the perf
> difference?

Maybe it's because the numa_node goes to the tagset which allocates
stuff based on that numa-node?
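
(Rough sketch of that path, for context; simplified, with call sites that
vary by kernel version: the node recorded by the driver becomes the tag
set's numa_node, and blk-mq falls back to it for per-queue allocations.)

    /* 1) nvme-pci records the PCI device's node on the controller: */
    dev->ctrl.numa_node = dev_to_node(dev->dev);    /* node 0 today, -1 with the patch */

    /* 2) the blk-mq tag set inherits that node: */
    set->numa_node = dev->ctrl.numa_node;

    /* 3) blk-mq uses set->numa_node as the fallback node when allocating
     *    tags and requests for a hardware queue: */
    if (node == NUMA_NO_NODE)
            node = set->numa_node;
    tags = kzalloc_node(sizeof(*tags), GFP_KERNEL, node);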
  
Christoph Hellwig July 26, 2023, 1:14 p.m. UTC | #3
On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote:
>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>>> information for the NVMe devices. This means all NVMe devices have
>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>>> both nodes get similar performance (around 315k IOPS).
>>
>> irqbalance doesn't work with this driver though: the interrupts are
>> managed by the kernel. Is there some other reason to explain the perf
>> difference?
>
> Maybe its because the numa_node goes to the tagset which allocates
> stuff based on that numa-node ?

Yeah, the only explanation I could come up with is that without this
the allocations get spread, and that somehow helps.  All of this
is a little obscure, but so is the NVMe practice of setting the node id
to first_memory_node, which no other driver does.  I'd really like to
understand what's going on here first.  After that this patch probably
is the right thing, I'd just like to understand why.
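
(For reference, first_memory_node is simply the lowest-numbered node that
has memory, defined in include/linux/nodemask.h:)

    /* include/linux/nodemask.h (evaluates to 0 on !NUMA builds) */
    #define first_memory_node       first_node(node_states[N_MEMORY])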
  
Pratyush Yadav July 26, 2023, 3:30 p.m. UTC | #4
On Wed, Jul 26 2023, Christoph Hellwig wrote:

Hi all,

> On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote:
>>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>>>> information for the NVMe devices. This means all NVMe devices have
>>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>>>> both nodes get similar performance (around 315k IOPS).
>>>
>>> irqbalance doesn't work with this driver though: the interrupts are
>>> managed by the kernel. Is there some other reason to explain the perf
>>> difference?

Hmm, I did not know that. I have not gone and looked at the code but I
think the same reasoning should hold, just with s/irqbalance/kernel. If
the kernel IRQ balancer sees the device is on node 0, it would deliver
its interrupts to CPUs on node 0.

In my tests I can see that the interrupts for NVMe queues are sent only
to CPUs from node 0 without this patch. With this patch CPUs from both
nodes get the interrupts.

>>
>> Maybe its because the numa_node goes to the tagset which allocates
>> stuff based on that numa-node ?
>
> Yeah, the only explanation I could come up with is that without this
> the allocations gets spread, and that somehow helps.  All of this
> is a little obscure, but so is the NVMe practice of setting the node id
> to first_memory_node, which no other driver does.  I'd really like to
> understand what's going on here first.  After that this patch probably
> is the right thing, I'd just like to understand why.

See above for my conjecture on why this happens.

More specifically, I discovered this when running an application pinned
to a node 1 CPU reading from an NVMe device. I noticed it was performing
worse than when it was pinned to node 0.

If the process is free to move around it might not see such a large
performance difference since it could move to a node 0 CPU. But if it is
pinned to a CPU in node 1 then the interrupt will always hit a node 0
CPU and create higher latency for the reads.

I have a simple fio test that can reproduce this. Save this [1] as
fio.txt and then run numactl --cpunodebind 1 fio ./fio.txt. You can run
it on any host with an NVMe device that has no NUMA node. I have tested
this on AWS EC2's i3.16xlarge instance type.

[1]
    [global]
    ioengine=libaio
    filename=/dev/nvme0n1
    group_reporting=1
    direct=1
    verify=0
    norandommap=0
    size=10%
    time_based=1
    runtime=30
    ramp_time=0
    randrepeat=0
    log_max_value=1
    unified_rw_reporting=1
    percentile_list=50:99:99.9:99.99:99.999
    bwavgtime=10000

    [4k_randread_qd16_4w]
    stonewall
    bs=4k
    rw=randread
    iodepth=32
    numjobs=1
  
Keith Busch July 26, 2023, 4:17 p.m. UTC | #5
On Wed, Jul 26, 2023 at 05:30:33PM +0200, Pratyush Yadav wrote:
> On Wed, Jul 26 2023, Christoph Hellwig wrote:
> > On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote:
> >>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
> >>>> information for the NVMe devices. This means all NVMe devices have
> >>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
> >>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
> >>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
> >>>> both nodes get similar performance (around 315k IOPS).
> >>>
> >>> irqbalance doesn't work with this driver though: the interrupts are
> >>> managed by the kernel. Is there some other reason to explain the perf
> >>> difference?
> 
> Hmm, I did not know that. I have not gone and looked at the code but I
> think the same reasoning should hold, just with s/irqbalance/kernel. If
> the kernel IRQ balancer sees the device is on node 0, it would deliver
> its interrupts to CPUs on node 0.
> 
> In my tests I can see that the interrupts for NVME queues are sent only
> to CPUs from node 0 without this patch. With this patch CPUs from both
> nodes get the interrupts.

Could you send the output of:

  numactl --hardware

and then with and without your patch:

  for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
    cat /proc/irq/$i/{smp,effective}_affinity_list; \
  done

?
  
Pratyush Yadav July 26, 2023, 7:32 p.m. UTC | #6
On Wed, Jul 26 2023, Keith Busch wrote:

> On Wed, Jul 26, 2023 at 05:30:33PM +0200, Pratyush Yadav wrote:
>> On Wed, Jul 26 2023, Christoph Hellwig wrote:
>> > On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote:
>> >>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>> >>>> information for the NVMe devices. This means all NVMe devices have
>> >>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>> >>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>> >>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>> >>>> both nodes get similar performance (around 315k IOPS).
>> >>>
>> >>> irqbalance doesn't work with this driver though: the interrupts are
>> >>> managed by the kernel. Is there some other reason to explain the perf
>> >>> difference?
>>
>> Hmm, I did not know that. I have not gone and looked at the code but I
>> think the same reasoning should hold, just with s/irqbalance/kernel. If
>> the kernel IRQ balancer sees the device is on node 0, it would deliver
>> its interrupts to CPUs on node 0.
>>
>> In my tests I can see that the interrupts for NVME queues are sent only
>> to CPUs from node 0 without this patch. With this patch CPUs from both
>> nodes get the interrupts.
>
> Could you send the output of:
>
>   numactl --hardware

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 245847 MB
node 0 free: 245211 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 245932 MB
node 1 free: 245328 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

>
> and then with and without your patch:
>
>   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>   done

Without my patch:

    $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
    >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
    >   done
    40
    40
    33
    33
    44
    44
    9
    9
    32
    32
    2
    2
    6
    6
    11
    11
    1
    1
    35
    35
    39
    39
    13
    13
    42
    42
    46
    46
    41
    41
    46
    46
    15
    15
    5
    5
    43
    43
    0
    0
    14
    14
    8
    8
    12
    12
    7
    7
    10
    10
    47
    47
    38
    38
    36
    36
    3
    3
    34
    34
    45
    45
    5
    5

With my patch:

    $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
    >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
    >   done
    9
    9
    15
    15
    5
    5
    23
    23
    38
    38
    52
    52
    21
    21
    36
    36
    13
    13
    56
    56
    44
    44
    42
    42
    31
    31
    48
    48
    5
    5
    3
    3
    1
    1
    11
    11
    28
    28
    18
    18
    34
    34
    29
    29
    58
    58
    46
    46
    54
    54
    59
    59
    32
    32
    7
    7
    56
    56
    62
    62
    49
    49
    57
    57
  
Keith Busch July 26, 2023, 10:25 p.m. UTC | #7
On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote:
> On Wed, Jul 26 2023, Keith Busch wrote:
> > Could you send the output of:
> >
> >   numactl --hardware
> 
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> node 0 size: 245847 MB
> node 0 free: 245211 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 1 size: 245932 MB
> node 1 free: 245328 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> >
> > and then with and without your patch:
> >
> >   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
> >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
> >   done
> 
> Without my patch:
> 
>     $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>     >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>     >   done

Hm, I wonder if there's something wrong with my script. All the CPUs
should be accounted for in the smp_affinity_list, assuming it captured
all the vectors of the nvme device, but both examples are missing half
the CPUs. It looks like you have 32 vectors. Does that sound right?

This does show the effective affinity is indeed always on node 0 without
your patch. I don't see why, though: the "group_cpus_evenly()" function
that spreads the interrupts doesn't know anything about the device the
resource is being grouped for, so it shouldn't even take its NUMA node
into consideration. It's just supposed to ensure all CPUs have a shared
resource, preferring to not share across numa nodes.
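
(Its prototype backs that up: the grouping code is only told how many
groups to produce and never sees the device or its node; declaration
below for reference, which may differ slightly across kernel versions.)

    /* include/linux/group_cpus.h -- called from irq_create_affinity_masks()
     * to spread interrupt vectors evenly over all possible CPUs. */
    struct cpumask *group_cpus_evenly(unsigned int numgrps);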

I'll emulate a similar CPU topology with similar nvme vector count and
see if I can find anything suspicious. I'm a little concerned we may
have the same problem for devices that have an associated NUMA node that
your patch isn't addressing.

>     41
>     40
>     33
>     33
>     44
>     44
>     9
>     9
>     32
>     32
>     2
>     2
>     6
>     6
>     11
>     11
>     1
>     1
>     35
>     35
>     39
>     39
>     13
>     13
>     42
>     42
>     46
>     46
>     41
>     41
>     46
>     46
>     15
>     15
>     5
>     5
>     43
>     43
>     0
>     0
>     14
>     14
>     8
>     8
>     12
>     12
>     7
>     7
>     10
>     10
>     47
>     47
>     38
>     38
>     36
>     36
>     3
>     3
>     34
>     34
>     45
>     45
>     5
>     5
> 
> With my patch:
> 
>     $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>     >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>     >   done
>     9
>     9
>     15
>     15
>     5
>     5
>     23
>     23
>     38
>     38
>     52
>     52
>     21
>     21
>     36
>     36
>     13
>     13
>     56
>     56
>     44
>     44
>     42
>     42
>     31
>     31
>     48
>     48
>     5
>     5
>     3
>     3
>     1
>     1
>     11
>     11
>     28
>     28
>     18
>     18
>     34
>     34
>     29
>     29
>     58
>     58
>     46
>     46
>     54
>     54
>     59
>     59
>     32
>     32
>     7
>     7
>     56
>     56
>     62
>     62
>     49
>     49
>     57
>     57
  
Pratyush Yadav July 28, 2023, 6:09 p.m. UTC | #8
Hi,

On Wed, Jul 26 2023, Keith Busch wrote:

> On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote:
>> On Wed, Jul 26 2023, Keith Busch wrote:
>> > Could you send the output of:
>> >
>> >   numactl --hardware
>>
>> $ numactl --hardware
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 0 size: 245847 MB
>> node 0 free: 245211 MB
>> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
>> node 1 size: 245932 MB
>> node 1 free: 245328 MB
>> node distances:
>> node   0   1
>>   0:  10  21
>>   1:  21  10
>>
>> >
>> > and then with and without your patch:
>> >
>> >   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>> >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>> >   done
>>
>> Without my patch:
>>
>>     $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>>     >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>>     >   done
>
> Hm, I wonder if there's something wrong with my script. All the cpu's
> should be accounted for in the smp_affinity_list, assuming it captured
> all the vectors of the nvme device, but both examples are missing half
> the CPUs. It looks like you have 32 vectors. Does that sound right?

Yes, there are 32 vectors, from nvme0q0 to nvme0q31. Should there be one
vector for each CPU? Perhaps the device does not support that many
queues?

FWIW,

    $ sudo nvme get-feature /dev/nvme0n1 -f 7 -H
    get-feature:0x7 (Number of Queues), Current value:0x1e001e
            Number of IO Completion Queues Allocated (NCQA): 31
            Number of IO Submission Queues Allocated (NSQA): 31

>
> This does show the effective affinity is indeed always on node 0 without
> your patch. I don't see why, though: the "group_cpus_evenly()" function
> that spreads the interrupts doesn't know anything about the device the
> resource is being grouped for, so it shouldn't even take its NUMA node
> into consideration. It's just supposed to ensure all CPUs have a shared
> resource, preferring to not share across numa nodes.

I am guessing you are looking at irq_create_affinity_masks(). Yeah, it
does not take into account the NUMA information. In fact, even if it
did, the NUMA node associated with the IRQ is NUMA_NO_NODE
(/proc/$irq/node == -1).

I did some more digging over the week to figure out what is going on. It
seems like the kernel _does_ in fact allow all CPUs in the affinity. I
added some prints in set_affinity_irq() in
drivers/xen/events/events_base.c (since that is the irqchip for the
interrupt). I see it being called with mask: ffffffff,ffffffff.

But I later see the function being called again with a different mask:
00000000,00008000. The stack trace shows the call is coming from
ksys_write(). The process doing the write is irqbalance.
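
(For illustration, the debug hack described above would look roughly like
this; hypothetical instrumentation, not an actual submitted change:)

    /* drivers/xen/events/events_base.c, debug-only instrumentation */
    static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest,
                                bool force)
    {
            /* log who asks for which mask; the second call came from irqbalance */
            pr_info("xen: set_affinity_irq irq=%u mask=%*pb comm=%s\n",
                    data->irq, cpumask_pr_args(dest), current->comm);
            dump_stack();

            /* ... existing body unchanged ... */
    }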

So I think your earlier statement was incorrect. irqbalance does in fact
balance these interrupts and it probably looks at the NUMA information
of the device to make that decision. My original reasoning holds and
irqbalance is the one picking the affinity.

With this explanation, do you think the patch is good to go?

BTW, could you please also add the below when applying? I forgot to add
it when sending the patch.

  Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq")

>
> I'll emulate a similar CPU topology with similar nvme vector count and
> see if I can find anything suspicious. I'm a little concerned we may
> have the same problem for devices that have an associated NUMA node that
> your patch isn't addressing.
>
[...]

--
Regards,
Pratyush Yadav



  
Keith Busch July 28, 2023, 7:34 p.m. UTC | #9
On Fri, Jul 28, 2023 at 08:09:32PM +0200, Pratyush Yadav wrote:
> 
> I am guessing you are looking at irq_create_affinity_masks(). Yeah, It
> does not take into account the NUMA information. In fact, even if it
> did, the NUMA node associated with the IRQ is NUMA_NO_NODE
> (/proc/$irq/node == -1).
> 
> I did some more digging over the week to figure out what is going on. It
> seems like the kernel _does_ in fact allow all CPUs in the affinity. I
> added some prints in set_affinity_irq() in
> drivers/xen/events/events_base.c (since that is the irqchip for the
> interrupt). I see it being called with mask: ffffffff,ffffffff.
> 
> But I later see the function being called again with a different mask:
> 00000000,00008000. The stack trace shows the call is coming from
> ksys_write(). The process doing the write is irqbalance.
> 
> So I think your earlier statement was incorrect. irqbalance does in fact
> balance these interrupts and it probably looks at the NUMA information
> of the device to make that decision. My original reasoning holds and
> irqbalance is the one picking the affinity.
> 
> With this explanation, do you think the patch is good to go?

irqbalance still writes to the /proc/<irq>/smp_affinity to change it,
right? That's just getting I/O errors on my machines because it fails
irq_can_set_affinity_usr() for nvme's kernel managed interrupts (except
the first vector, but that one is not used for I/O). Is there another
path irqbalance is using that's somehow getting past the appropriate
checks? Or perhaps is your xen irq_chip somehow bypassing the managed
irq property?
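
(The check in question, paraphrased from kernel/irq/proc.c and
kernel/irq/manage.c; it is why writes to smp_affinity normally return
-EIO for managed vectors:)

    /* write handler for /proc/irq/<n>/smp_affinity (paraphrased) */
    if (!irq_can_set_affinity_usr(irq))
            return -EIO;

    /* kernel/irq/manage.c */
    bool irq_can_set_affinity_usr(unsigned int irq)
    {
            struct irq_desc *desc = irq_to_desc(irq);

            return __irq_can_set_affinity(desc) &&
                    !irqd_affinity_is_managed(&desc->irq_data);
    }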
  
Pratyush Yadav Aug. 4, 2023, 2:50 p.m. UTC | #10
On Fri, Jul 28 2023, Keith Busch wrote:

> On Fri, Jul 28, 2023 at 08:09:32PM +0200, Pratyush Yadav wrote:
>>
>> I am guessing you are looking at irq_create_affinity_masks(). Yeah, It
>> does not take into account the NUMA information. In fact, even if it
>> did, the NUMA node associated with the IRQ is NUMA_NO_NODE
>> (/proc/$irq/node == -1).
>>
>> I did some more digging over the week to figure out what is going on. It
>> seems like the kernel _does_ in fact allow all CPUs in the affinity. I
>> added some prints in set_affinity_irq() in
>> drivers/xen/events/events_base.c (since that is the irqchip for the
>> interrupt). I see it being called with mask: ffffffff,ffffffff.
>>
>> But I later see the function being called again with a different mask:
>> 00000000,00008000. The stack trace shows the call is coming from
>> ksys_write(). The process doing the write is irqbalance.
>>
>> So I think your earlier statement was incorrect. irqbalance does in fact
>> balance these interrupts and it probably looks at the NUMA information
>> of the device to make that decision. My original reasoning holds and
>> irqbalance is the one picking the affinity.
>>
>> With this explanation, do you think the patch is good to go?
>
> irqbalance still writes to the /proc/<irq>/smp_affinity to change it,
> right? That's just getting I/O errors on my machines because it fails
> irq_can_set_affinity_usr() for nvme's kernel managed interrupts (except
> the first vector, but that one is not used for I/O). Is there another
> path irqbalance is using that's somehow getting past the appropriate
> checks? Or perhaps is your xen irq_chip somehow bypassing the managed
> irq property?

I picked the interrupt "nvme4q26" as example. The call stack is (printed
via WARN_ON(1)):

    ? __warn+0x7d/0x140
    ? set_affinity_irq+0xf0/0x220
    ? report_bug+0xf8/0x1e0
    ? handle_bug+0x44/0x80
    ? exc_invalid_op+0x13/0x60
    ? asm_exc_invalid_op+0x16/0x20
    ? set_affinity_irq+0xf0/0x220
    ? set_affinity_irq+0xf0/0x220
    irq_do_set_affinity+0x135/0x1e0
    irq_set_affinity_locked+0x186/0x1f0
    __irq_set_affinity+0x41/0x70
    write_irq_affinity.isra.8+0xf6/0x120
    proc_reg_write+0x59/0x80
    vfs_write+0xc7/0x3c0
    ? __do_sys_newfstat+0x35/0x60
    ? __fget_light+0xcb/0x120
    ksys_write+0xa5/0xe0
    do_syscall_64+0x42/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

The check you mention is in write_irq_affinity(). I added some prints
there and it turns out that __irq_can_set_affinity() returns true and
irqd_affinity_is_managed() is false.

I did some more digging and it turns out that the masks are called by
irq_create_affinity_masks() which sets is_managed to the IRQ affinity
descriptors.

This is then passed down to __msi_domain_alloc_locked(). On a non-Xen
system you would end up calling __msi_domain_alloc_irqs() next since
ops->domain_alloc_irqs() is only implemented by Xen. This function finds
the masks created earlier and passes them down to
__irq_domain_alloc_irqs(). This then eventually lands in alloc_descs()
which checks is_managed and sets IRQD_AFFINITY_MANAGED.

On Xen though, xen_msi_domain_alloc_irqs() is called. This eventually
lands in xen_allocate_irqs_dynamic() which calls irq_alloc_descs(). This
macro calls __irq_alloc_descs() with affinity set to NULL. This leads to
us losing the is_managed flag and the affinities created by
irq_create_affinity_masks() via group_cpus_evenly().

As a result of this, MSI IRQs on Xen can never be managed by the kernel.
They are marked as userspace manageable and irqbalance can set their
affinity. Applying the (hacky) patch below fixes this problem and lets
the kernel manage the affinities.

------ 8< ------
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd4522..15f36e34e28b4 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -743,9 +743,10 @@ static void xen_irq_init(unsigned irq)
 	list_add_tail(&info->list, &xen_irq_list_head);
 }
 
-static int __must_check xen_allocate_irqs_dynamic(int nvec)
+static int __must_check xen_allocate_irqs_dynamic(int nvec,
+						  struct irq_affinity_desc *affd)
 {
-	int i, irq = irq_alloc_descs(-1, 0, nvec, -1);
+	int i, irq = __irq_alloc_descs(-1, 0, nvec, -1, THIS_MODULE, affd);
 
 	if (irq >= 0) {
 		for (i = 0; i < nvec; i++)
@@ -758,7 +759,7 @@ static int __must_check xen_allocate_irqs_dynamic(int nvec)
 static inline int __must_check xen_allocate_irq_dynamic(void)
 {
 
-	return xen_allocate_irqs_dynamic(1);
+	return xen_allocate_irqs_dynamic(1, NULL);
 }
 
 static int __must_check xen_allocate_irq_gsi(unsigned gsi)
@@ -1108,7 +1109,7 @@ int xen_bind_pirq_msi_to_irq(struct pci_dev *dev, struct msi_desc *msidesc,
 
 	mutex_lock(&irq_mapping_update_lock);
 
-	irq = xen_allocate_irqs_dynamic(nvec);
+	irq = xen_allocate_irqs_dynamic(nvec, msidesc->affinity);
 	if (irq < 0)
 		goto out;
------ >8 ------ 

With this patch, I get the below affinities:

    $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
    >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
    >   done
    8
    8
    16-17,48,65,67,69

    18-19,50,71,73,75

    20,52,77,79

    21,53,81,83

    22,54,85,87

    23,55,89,91

    24,56,93,95

    25,57,97,99

    26,58,101,103

    27,59,105,107

    28,60,109,111

    29,61,113,115

    30,62,117,119

    31,63,121,123

    49,51,125,127

    0,32,64,66

    1,33,68,70

    2,34,72,74

    3,35,76,78

    4,36,80,82

    5,37,84,86

    6,38,88,90

    7,39,92,94

    8,40,96,98

    9,41,100,102

    10,42,104,106

    11,43,108,110

    12,44,112,114

    13,45,116,118

    14,46,120,122

    15,47,124,126

The blank lines are because effective_affinity_list is blank for all but the first interrupt.

The problem is, even with this I still get the same performance
difference when running on Node 1 vs Node 0. I am not sure why. Any
pointers?
  
Keith Busch Aug. 4, 2023, 3:19 p.m. UTC | #11
On Fri, Aug 04, 2023 at 04:50:16PM +0200, Pratyush Yadav wrote:
> With this patch, I get the below affinities:

Something still seems off without effective_affinity set. That attribute
should always reflect one CPU from the smp_affinity_list.

At least with your patch, the smp_affinity_list looks as expected: every
CPU is accounted for, and no vector appears to share the resource among
CPUs in different nodes.
 
>     $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>     >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>     >   done
>     8
>     8
>     16-17,48,65,67,69
> 
>     18-19,50,71,73,75
> 
>     20,52,77,79
> 
>     21,53,81,83
> 
>     22,54,85,87
> 
>     23,55,89,91
> 
>     24,56,93,95
> 
>     25,57,97,99
> 
>     26,58,101,103
> 
>     27,59,105,107
> 
>     28,60,109,111
> 
>     29,61,113,115
> 
>     30,62,117,119
> 
>     31,63,121,123
> 
>     49,51,125,127
> 
>     0,32,64,66
> 
>     1,33,68,70
> 
>     2,34,72,74
> 
>     3,35,76,78
> 
>     4,36,80,82
> 
>     5,37,84,86
> 
>     6,38,88,90
> 
>     7,39,92,94
> 
>     8,40,96,98
> 
>     9,41,100,102
> 
>     10,42,104,106
> 
>     11,43,108,110
> 
>     12,44,112,114
> 
>     13,45,116,118
> 
>     14,46,120,122
> 
>     15,47,124,126
> 
> The blank lines are because effective_smp_affinity is blank for all but the first interrupt.
> 
> The problem is, even with this I still get the same performance
> difference when running on Node 1 vs Node 0. I am not sure why. Any
> pointers?

I suspect effective_affinity isn't getting set and interrupts are
triggering on unexpected CPUs. If you check /proc/interrupts, can you
confirm if the interrupts are firing on CPUs within the
smp_affinity_list or some other CPU?
  
Pratyush Yadav Aug. 8, 2023, 3:51 p.m. UTC | #12
On Fri, Aug 04 2023, Keith Busch wrote:

> On Fri, Aug 04, 2023 at 04:50:16PM +0200, Pratyush Yadav wrote:
>> With this patch, I get the below affinities:
>
> Something still seems off without effective_affinity set. That attribute
> should always reflect one CPU from the smp_affinity_list.
>
> At least with your patch, the smp_affinity_list looks as expected: every
> CPU is accounted for, and no vector appears to share the resource among
> CPUs in different nodes.
>
>>     $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>>     >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>>     >   done
>>     8
>>     8
>>     16-17,48,65,67,69
>>
>>     18-19,50,71,73,75
>>
>>     20,52,77,79
>>
>>     21,53,81,83
>>
>>     22,54,85,87
>>
>>     23,55,89,91
>>
>>     24,56,93,95
>>
>>     25,57,97,99
>>
>>     26,58,101,103
>>
>>     27,59,105,107
>>
>>     28,60,109,111
>>
>>     29,61,113,115
>>
>>     30,62,117,119
>>
>>     31,63,121,123
>>
>>     49,51,125,127
>>
>>     0,32,64,66
>>
>>     1,33,68,70
>>
>>     2,34,72,74
>>
>>     3,35,76,78
>>
>>     4,36,80,82
>>
>>     5,37,84,86
>>
>>     6,38,88,90
>>
>>     7,39,92,94
>>
>>     8,40,96,98
>>
>>     9,41,100,102
>>
>>     10,42,104,106
>>
>>     11,43,108,110
>>
>>     12,44,112,114
>>
>>     13,45,116,118
>>
>>     14,46,120,122
>>
>>     15,47,124,126
>>
>> The blank lines are because effective_smp_affinity is blank for all but the first interrupt.
>>
>> The problem is, even with this I still get the same performance
>> difference when running on Node 1 vs Node 0. I am not sure why. Any
>> pointers?
>
> I suspect effective_affinity isn't getting set and interrupts are
> triggering on unexpected CPUs. If you check /proc/interrupts, can you
> confirm if the interrupts are firing on CPUs within the
> smp_affinity_list or some other CPU?

All interrupts are hitting CPU0.

I did some more digging. Xen sets the effective affinity via the
function set_affinity_irq(). But it only sets it if info->evtchn is
valid. But info->evtchn is set by startup_pirq(), which is called _after_
set_affinity_irq().

This worked before because irqbalance was coming in after all this
happened and set the affinity, causing set_affinity_irq() to be called
again. By that time startup_pirq() has been called and info->evtchn is
set.

This whole thing needs some refactoring to work properly. I wrote up a
hacky patch (on top of my previous hacky patch) that fixes it. I will
write up a proper patch when I find some more time next week.

With this, I am now seeing good performance on node 1. It is still
slightly slower than node 0, but that might be due to some other reasons
and I do not want to dive into that for now.

Thanks for your help with this :-)

BTW, do you think I should re-send the patch with updated reasoning,
since Christoph thinks we should do this regardless of the performance
reasons [0]?

[0] https://lore.kernel.org/linux-nvme/20230726131408.GA15909@lst.de/

----- 8< -----
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd4522..ed33d33a2f1c9 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -108,6 +108,7 @@ struct irq_info {
 	unsigned irq;
 	evtchn_port_t evtchn;   /* event channel */
 	unsigned short cpu;     /* cpu bound */
+	unsigned short suggested_cpu;
 	unsigned short eoi_cpu; /* EOI must happen on this cpu-1 */
 	unsigned int irq_epoch; /* If eoi_cpu valid: irq_epoch of event */
 	u64 eoi_time;           /* Time in jiffies when to EOI. */
@@ -871,6 +873,8 @@ static void mask_ack_pirq(struct irq_data *data)
 	eoi_pirq(data);
 }
 
+static int xen_rebind_evtchn_to_cpu(struct irq_info *info, unsigned int tcpu);
+
 static unsigned int __startup_pirq(unsigned int irq)
 {
 	struct evtchn_bind_pirq bind_pirq;
@@ -903,6 +907,14 @@ static unsigned int __startup_pirq(unsigned int irq)
 	info->evtchn = evtchn;
 	bind_evtchn_to_cpu(evtchn, 0, false);
 
+	if (info->suggested_cpu) {
+		int ret;
+		ret = xen_rebind_evtchn_to_cpu(info, info->suggested_cpu);
+		if (!ret)
+			irq_data_update_effective_affinity(irq_get_irq_data(irq),
+							cpumask_of(info->suggested_cpu));
+	}
+
 	rc = xen_evtchn_port_setup(evtchn);
 	if (rc)
 		goto err;
@@ -1856,12 +1872,17 @@ static unsigned int select_target_cpu(const struct cpumask *dest)
 static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest,
 			    bool force)
 {
+	struct irq_info *info = info_for_irq(data->irq);
 	unsigned int tcpu = select_target_cpu(dest);
 	int ret;
 
-	ret = xen_rebind_evtchn_to_cpu(info_for_irq(data->irq), tcpu);
-	if (!ret)
+	ret = xen_rebind_evtchn_to_cpu(info, tcpu);
+	if (!ret) {
 		irq_data_update_effective_affinity(data, cpumask_of(tcpu));
+	} else {
+		if (info)
+			info->suggested_cpu = tcpu;
+	}
 
 	return ret;
 }
  
Keith Busch Aug. 8, 2023, 4:35 p.m. UTC | #13
On Tue, Aug 08, 2023 at 05:51:01PM +0200, Pratyush Yadav wrote:
> BTW, do you think I should re-send the patch with updated reasoning,
> since Cristoph thinks we should do this regardless of the performance
> reasons [0]?

Thanks for the investigation and finding the cause!

Yeah, we should go with your original patch as-is, just with an updated
changelog. While that may work around your immediate issue, I hope your
xen patches from this analysis will be considered for upstream as well :)
  

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index baf69af7ea78e..f5ba2d7102eae 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2916,9 +2916,6 @@  static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	struct nvme_dev *dev;
 	int ret = -ENOMEM;
 
-	if (node == NUMA_NO_NODE)
-		set_dev_node(&pdev->dev, first_memory_node);
-
 	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return ERR_PTR(-ENOMEM);