[v2,0/2] Attach DT nodes to existing PCI devices

Message ID 20231130165700.685764-1-herve.codina@bootlin.com
Headers
Series Attach DT nodes to existing PCI devices |

Message

Herve Codina Nov. 30, 2023, 4:56 p.m. UTC
  Hi,

The commit 407d1a51921e ("PCI: Create device tree node for bridge")
creates of_node for PCI devices.
During the insertion handling of these new DT nodes done by of_platform,
new devices (struct device) are created.
For each PCI devices a struct device is already present (created and
handled by the PCI core).
Creating a new device from a DT node leads to some kind of wrong struct
device duplication to represent the exact same PCI device.

This patch series first introduces device_{add,remove}_of_node() in
order to add or remove a newly created of_node to an already existing
device.
Then it fixes the DT node creation for PCI devices to add or remove the
created node to the existing PCI device without any new device creation.

Best regards,
Hervé

Changes v1 -> v2
  - Patch 1
    Add 'Cc: stable@vger.kernel.org'

Herve Codina (2):
  driver core: Introduce device_{add,remove}_of_node()
  PCI: of: Attach created of_node to existing device

 drivers/base/core.c    | 74 ++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/of.c       | 15 +++++++--
 include/linux/device.h |  2 ++
 3 files changed, 89 insertions(+), 2 deletions(-)
  

Comments

Rob Herring Dec. 1, 2023, 10:26 p.m. UTC | #1
On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <herve.codina@bootlin.com> wrote:
>
> Hi,
>
> The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> creates of_node for PCI devices.
> During the insertion handling of these new DT nodes done by of_platform,
> new devices (struct device) are created.
> For each PCI devices a struct device is already present (created and
> handled by the PCI core).
> Creating a new device from a DT node leads to some kind of wrong struct
> device duplication to represent the exact same PCI device.
>
> This patch series first introduces device_{add,remove}_of_node() in
> order to add or remove a newly created of_node to an already existing
> device.
> Then it fixes the DT node creation for PCI devices to add or remove the
> created node to the existing PCI device without any new device creation.

I think the simpler solution is to get the DT node created earlier. We
are just asking for pain if the DT node is set for the device at
different times compared to static DT nodes.

The following fixes the lack of of_node link. The DT unittest fails
with the change though and I don't see why.

Also, no idea if the bridge part works because my qemu setup doesn't
create bridges (anyone got a magic cmdline to create them?).

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 9c2137dae429..46b252bbe500 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
         */
        pcibios_bus_add_device(dev);
        pci_fixup_device(pci_fixup_final, dev);
-       if (pci_is_bridge(dev))
-               of_pci_make_dev_node(dev);
        pci_create_sysfs_dev_files(dev);
        pci_proc_attach_device(dev);
        pci_bridge_d3_update(dev);
diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 51e3dd0ea5ab..e15eaf0127fc 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
                return 0;

        node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
+       if (!node && pci_is_bridge(dev))
+               of_pci_make_dev_node(dev);
        if (!node)
                return 0;

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index ea476252280a..e50b07fe5a63 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6208,9 +6208,9 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,
0x9a31, dpc_log_size);
  * before driver probing, it might need to add a device tree node as the final
  * fixup.
  */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);

 /*
  * Devices known to require a longer delay before first config space access
  
Bjorn Helgaas Dec. 1, 2023, 10:45 p.m. UTC | #2
On Fri, Dec 01, 2023 at 04:26:45PM -0600, Rob Herring wrote:
> On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <herve.codina@bootlin.com> wrote:
> ...

> Also, no idea if the bridge part works because my qemu setup doesn't
> create bridges (anyone got a magic cmdline to create them?).

I probably copied this from somewhere and certainly couldn't construct
it from scratch, but it did create a hierarchy like this:

  00:04.0 bridge to [bus 01-04] (Root Port)
  01:00.0 bridge to [bus 02-04] (Switch Upstream Port)
  02:00.0 bridge to [bus 03] (Switch Downstream Port)
  02:01.0 bridge to [bus 04] (Switch Downstream Port)
  03:00.0 endpoint
  04:00.0 endpoint

  IMAGE=ubuntu.img
  KERNEL=~/linux/arch/x86/boot/bzImage
  IMGDIR=~/virt/img/

  qemu-system-x86_64 -enable-kvm -s -m 2048 $IMAGE \
      -device pcie-root-port,id=root_port1,chassis=1,slot=1 \
      -device x3130-upstream,id=upstream_port1,bus=root_port1 \
      -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=2,slot=1 \
      -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=2,slot=2 \
      -drive file=${IMGDIR}/nvme.qcow2,if=none,id=nvme1,snapshot=on \
      -device nvme,drive=nvme1,serial=nvme1,cmb_size_mb=2048,bus=downstream_port1 \
      -drive file=${IMGDIR}/nvme2.qcow2,if=none,id=nvme2,snapshot=on \
      -device nvme,drive=nvme2,serial=nvme1,bus=downstream_port2 \
      -virtfs local,id=home,path=/home/,security_model=mapped,mount_tag=home \
      -nographic \
      -kernel $KERNEL \
      -append "root=/dev/sda2 rootfstype=ext4 console=ttyS0,38400n8"
  
Herve Codina Dec. 4, 2023, 12:43 p.m. UTC | #3
Hi Rob,

On Fri, 1 Dec 2023 16:26:45 -0600
Rob Herring <robh@kernel.org> wrote:

> On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <herve.codina@bootlin.com> wrote:
> >
> > Hi,
> >
> > The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > creates of_node for PCI devices.
> > During the insertion handling of these new DT nodes done by of_platform,
> > new devices (struct device) are created.
> > For each PCI devices a struct device is already present (created and
> > handled by the PCI core).
> > Creating a new device from a DT node leads to some kind of wrong struct
> > device duplication to represent the exact same PCI device.
> >
> > This patch series first introduces device_{add,remove}_of_node() in
> > order to add or remove a newly created of_node to an already existing
> > device.
> > Then it fixes the DT node creation for PCI devices to add or remove the
> > created node to the existing PCI device without any new device creation.  
> 
> I think the simpler solution is to get the DT node created earlier. We
> are just asking for pain if the DT node is set for the device at
> different times compared to static DT nodes.
> 
> The following fixes the lack of of_node link. The DT unittest fails
> with the change though and I don't see why.
> 
> Also, no idea if the bridge part works because my qemu setup doesn't
> create bridges (anyone got a magic cmdline to create them?).
> 
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 9c2137dae429..46b252bbe500 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
>          */
>         pcibios_bus_add_device(dev);
>         pci_fixup_device(pci_fixup_final, dev);
> -       if (pci_is_bridge(dev))
> -               of_pci_make_dev_node(dev);
>         pci_create_sysfs_dev_files(dev);
>         pci_proc_attach_device(dev);
>         pci_bridge_d3_update(dev);
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index 51e3dd0ea5ab..e15eaf0127fc 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
>                 return 0;
> 
>         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> +       if (!node && pci_is_bridge(dev))
> +               of_pci_make_dev_node(dev);
>         if (!node)
>                 return 0;

Maybe it is too early.
of_pci_make_dev_node() creates a node and fills some properties based on 
some already set values available in the PCI device such as its struct resource
values.
We need to have some values set by the PCI infra in order to create our DT node
with correct values.

Best regards,
Hervé

> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index ea476252280a..e50b07fe5a63 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -6208,9 +6208,9 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,
> 0x9a31, dpc_log_size);
>   * before driver probing, it might need to add a device tree node as the final
>   * fixup.
>   */
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);
> 
>  /*
>   * Devices known to require a longer delay before first config space access
  
Rob Herring Dec. 4, 2023, 1:59 p.m. UTC | #4
On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <herve.codina@bootlin.com> wrote:
>
> Hi Rob,
>
> On Fri, 1 Dec 2023 16:26:45 -0600
> Rob Herring <robh@kernel.org> wrote:
>
> > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <herve.codina@bootlin.com> wrote:
> > >
> > > Hi,
> > >
> > > The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > creates of_node for PCI devices.
> > > During the insertion handling of these new DT nodes done by of_platform,
> > > new devices (struct device) are created.
> > > For each PCI devices a struct device is already present (created and
> > > handled by the PCI core).
> > > Creating a new device from a DT node leads to some kind of wrong struct
> > > device duplication to represent the exact same PCI device.
> > >
> > > This patch series first introduces device_{add,remove}_of_node() in
> > > order to add or remove a newly created of_node to an already existing
> > > device.
> > > Then it fixes the DT node creation for PCI devices to add or remove the
> > > created node to the existing PCI device without any new device creation.
> >
> > I think the simpler solution is to get the DT node created earlier. We
> > are just asking for pain if the DT node is set for the device at
> > different times compared to static DT nodes.
> >
> > The following fixes the lack of of_node link. The DT unittest fails
> > with the change though and I don't see why.
> >
> > Also, no idea if the bridge part works because my qemu setup doesn't
> > create bridges (anyone got a magic cmdline to create them?).
> >
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index 9c2137dae429..46b252bbe500 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> >          */
> >         pcibios_bus_add_device(dev);
> >         pci_fixup_device(pci_fixup_final, dev);
> > -       if (pci_is_bridge(dev))
> > -               of_pci_make_dev_node(dev);
> >         pci_create_sysfs_dev_files(dev);
> >         pci_proc_attach_device(dev);
> >         pci_bridge_d3_update(dev);
> > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > --- a/drivers/pci/of.c
> > +++ b/drivers/pci/of.c
> > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> >                 return 0;
> >
> >         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > +       if (!node && pci_is_bridge(dev))
> > +               of_pci_make_dev_node(dev);
> >         if (!node)
> >                 return 0;
>
> Maybe it is too early.
> of_pci_make_dev_node() creates a node and fills some properties based on
> some already set values available in the PCI device such as its struct resource
> values.
> We need to have some values set by the PCI infra in order to create our DT node
> with correct values.

Indeed, that's probably the issue I'm having. In that case,
DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
device_add().

I think modifying sysfs after device_add() is going to race with
userspace. Userspace is notified of a new device, and then the of_node
link may or may not be there when it reads sysfs. Also, not sure if
we'll need DT modaliases with PCI devices, but they won't work if the
DT node is not set before device_add().

Rob
  
Herve Codina Dec. 4, 2023, 3:30 p.m. UTC | #5
Hi Rob,

On Mon, 4 Dec 2023 07:59:09 -0600
Rob Herring <robh@kernel.org> wrote:

[...]

> > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > index 9c2137dae429..46b252bbe500 100644
> > > --- a/drivers/pci/bus.c
> > > +++ b/drivers/pci/bus.c
> > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > >          */
> > >         pcibios_bus_add_device(dev);
> > >         pci_fixup_device(pci_fixup_final, dev);
> > > -       if (pci_is_bridge(dev))
> > > -               of_pci_make_dev_node(dev);
> > >         pci_create_sysfs_dev_files(dev);
> > >         pci_proc_attach_device(dev);
> > >         pci_bridge_d3_update(dev);
> > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > --- a/drivers/pci/of.c
> > > +++ b/drivers/pci/of.c
> > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > >                 return 0;
> > >
> > >         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > +       if (!node && pci_is_bridge(dev))
> > > +               of_pci_make_dev_node(dev);
> > >         if (!node)
> > >                 return 0;  
> >
> > Maybe it is too early.
> > of_pci_make_dev_node() creates a node and fills some properties based on
> > some already set values available in the PCI device such as its struct resource
> > values.
> > We need to have some values set by the PCI infra in order to create our DT node
> > with correct values.  
> 
> Indeed, that's probably the issue I'm having. In that case,
> DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> device_add().
> 
> I think modifying sysfs after device_add() is going to race with
> userspace. Userspace is notified of a new device, and then the of_node
> link may or may not be there when it reads sysfs. Also, not sure if
> we'll need DT modaliases with PCI devices, but they won't work if the
> DT node is not set before device_add().

Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
fix your QEMU unittest ?

We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
and the device_add() call, the call to pci_set_msi_domain() is present.
MSIs are not supported currently but in the future ...

Related to DT modaliases, I don't think they are needed.
All drivers related to PCI device should be declared as pci_driver.
Correct me if I am wrong but I think that the core PCI will load the correct
module without any DT modalias.

Best regards,
Hervé
  
Lizhi Hou Dec. 4, 2023, 4:48 p.m. UTC | #6
Here is the link I put in patch set cover letter

https://github.com/houlz0507/xoclv2/blob/pci-dt-0329/pci-dt-patch-0329/README

It has the steps I used to create and start qemu with pci bridges. 
Hopefully this helps.


Lizhi

On 12/1/23 14:45, Bjorn Helgaas wrote:
> On Fri, Dec 01, 2023 at 04:26:45PM -0600, Rob Herring wrote:
>> On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <herve.codina@bootlin.com> wrote:
>> ...
>> Also, no idea if the bridge part works because my qemu setup doesn't
>> create bridges (anyone got a magic cmdline to create them?).
> I probably copied this from somewhere and certainly couldn't construct
> it from scratch, but it did create a hierarchy like this:
>
>    00:04.0 bridge to [bus 01-04] (Root Port)
>    01:00.0 bridge to [bus 02-04] (Switch Upstream Port)
>    02:00.0 bridge to [bus 03] (Switch Downstream Port)
>    02:01.0 bridge to [bus 04] (Switch Downstream Port)
>    03:00.0 endpoint
>    04:00.0 endpoint
>
>    IMAGE=ubuntu.img
>    KERNEL=~/linux/arch/x86/boot/bzImage
>    IMGDIR=~/virt/img/
>
>    qemu-system-x86_64 -enable-kvm -s -m 2048 $IMAGE \
>        -device pcie-root-port,id=root_port1,chassis=1,slot=1 \
>        -device x3130-upstream,id=upstream_port1,bus=root_port1 \
>        -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=2,slot=1 \
>        -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=2,slot=2 \
>        -drive file=${IMGDIR}/nvme.qcow2,if=none,id=nvme1,snapshot=on \
>        -device nvme,drive=nvme1,serial=nvme1,cmb_size_mb=2048,bus=downstream_port1 \
>        -drive file=${IMGDIR}/nvme2.qcow2,if=none,id=nvme2,snapshot=on \
>        -device nvme,drive=nvme2,serial=nvme1,bus=downstream_port2 \
>        -virtfs local,id=home,path=/home/,security_model=mapped,mount_tag=home \
>        -nographic \
>        -kernel $KERNEL \
>        -append "root=/dev/sda2 rootfstype=ext4 console=ttyS0,38400n8"
  
Rob Herring Dec. 4, 2023, 11:03 p.m. UTC | #7
On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <herve.codina@bootlin.com> wrote:
>
> Hi Rob,
>
> On Mon, 4 Dec 2023 07:59:09 -0600
> Rob Herring <robh@kernel.org> wrote:
>
> [...]
>
> > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > index 9c2137dae429..46b252bbe500 100644
> > > > --- a/drivers/pci/bus.c
> > > > +++ b/drivers/pci/bus.c
> > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > >          */
> > > >         pcibios_bus_add_device(dev);
> > > >         pci_fixup_device(pci_fixup_final, dev);
> > > > -       if (pci_is_bridge(dev))
> > > > -               of_pci_make_dev_node(dev);
> > > >         pci_create_sysfs_dev_files(dev);
> > > >         pci_proc_attach_device(dev);
> > > >         pci_bridge_d3_update(dev);
> > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > --- a/drivers/pci/of.c
> > > > +++ b/drivers/pci/of.c
> > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > >                 return 0;
> > > >
> > > >         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > +       if (!node && pci_is_bridge(dev))
> > > > +               of_pci_make_dev_node(dev);
> > > >         if (!node)
> > > >                 return 0;
> > >
> > > Maybe it is too early.
> > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > some already set values available in the PCI device such as its struct resource
> > > values.
> > > We need to have some values set by the PCI infra in order to create our DT node
> > > with correct values.
> >
> > Indeed, that's probably the issue I'm having. In that case,
> > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > device_add().
> >
> > I think modifying sysfs after device_add() is going to race with
> > userspace. Userspace is notified of a new device, and then the of_node
> > link may or may not be there when it reads sysfs. Also, not sure if
> > we'll need DT modaliases with PCI devices, but they won't work if the
> > DT node is not set before device_add().
>
> Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> fix your QEMU unittest ?

No...

And testing the bridge part crashes. That's because there's a
dependency on the bridge->subordinate to write out bus-range and
interrupt-map. I think the fix there is we should just not write those
properties. The bus range isn't needed because the kernel does its own
assignments. For interrupt-map, it is only needed if "interrupts" is
present in the child devices. If not present, then the standard PCI
swizzling is used. Alternatively, I think the interrupt mapping could
be simplified to just implement the standard swizzling at each level
which isn't dependent on any of the devices on the bus. I gave that a
go where each interrupt-map just points to the parent bridge, but ran
into an issue that the bridge nodes don't have a phandle. That should
be fixable, but I'd rather go with the first option. I suppose that
depends on how the interrupts downstream of the PCI device need to get
resolved. It could be that the PCI device serves as the interrupt
controller and can resolve the parent interrupt on its own (which may
be needed for ACPI host anyways).

> We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
> and the device_add() call, the call to pci_set_msi_domain() is present.
> MSIs are not supported currently but in the future ...

MSI's aren't ever described in PCI nodes. Only the host bridge. So I
don't think we should have problems there.

> Related to DT modaliases, I don't think they are needed.
> All drivers related to PCI device should be declared as pci_driver.
> Correct me if I am wrong but I think that the core PCI will load the correct
> module without any DT modalias.

Yes, you are probably right.

Rob
  
Herve Codina Dec. 5, 2023, 8:04 a.m. UTC | #8
On Mon, 4 Dec 2023 17:03:21 -0600
Rob Herring <robh@kernel.org> wrote:

> On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <herve.codina@bootlin.com> wrote:
> >
> > Hi Rob,
> >
> > On Mon, 4 Dec 2023 07:59:09 -0600
> > Rob Herring <robh@kernel.org> wrote:
> >
> > [...]
> >  
> > > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > > index 9c2137dae429..46b252bbe500 100644
> > > > > --- a/drivers/pci/bus.c
> > > > > +++ b/drivers/pci/bus.c
> > > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > > >          */
> > > > >         pcibios_bus_add_device(dev);
> > > > >         pci_fixup_device(pci_fixup_final, dev);
> > > > > -       if (pci_is_bridge(dev))
> > > > > -               of_pci_make_dev_node(dev);
> > > > >         pci_create_sysfs_dev_files(dev);
> > > > >         pci_proc_attach_device(dev);
> > > > >         pci_bridge_d3_update(dev);
> > > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > > --- a/drivers/pci/of.c
> > > > > +++ b/drivers/pci/of.c
> > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > >                 return 0;
> > > > >
> > > > >         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > +       if (!node && pci_is_bridge(dev))
> > > > > +               of_pci_make_dev_node(dev);
> > > > >         if (!node)
> > > > >                 return 0;  
> > > >
> > > > Maybe it is too early.
> > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > some already set values available in the PCI device such as its struct resource
> > > > values.
> > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > with correct values.  
> > >
> > > Indeed, that's probably the issue I'm having. In that case,
> > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > device_add().
> > >
> > > I think modifying sysfs after device_add() is going to race with
> > > userspace. Userspace is notified of a new device, and then the of_node
> > > link may or may not be there when it reads sysfs. Also, not sure if
> > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > DT node is not set before device_add().  
> >
> > Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> > On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> > fix your QEMU unittest ?  
> 
> No...
> 
> And testing the bridge part crashes. That's because there's a
> dependency on the bridge->subordinate to write out bus-range and
> interrupt-map. I think the fix there is we should just not write those
> properties. The bus range isn't needed because the kernel does its own
> assignments. For interrupt-map, it is only needed if "interrupts" is
> present in the child devices. If not present, then the standard PCI
> swizzling is used. Alternatively, I think the interrupt mapping could
> be simplified to just implement the standard swizzling at each level
> which isn't dependent on any of the devices on the bus. I gave that a
> go where each interrupt-map just points to the parent bridge, but ran
> into an issue that the bridge nodes don't have a phandle. That should
> be fixable, but I'd rather go with the first option. I suppose that
> depends on how the interrupts downstream of the PCI device need to get
> resolved. It could be that the PCI device serves as the interrupt
> controller and can resolve the parent interrupt on its own (which may
> be needed for ACPI host anyways).

About interrupt, I am a bit stuck on my side.
My dtso (applied at PCI device level) contains the following:
	fragment@0 {
		target-path="";
		__overlay__ {
			pci-ep-bus@0 {
				compatible = "simple-bus";
				#address-cells = <1>;
				#size-cells = <1>;

				/*
				 * map @0xe2000000 (32MB) to BAR0 (CPU)
				 * map @0xe0000000 (16MB) to BAR1 (AMBA)
				 */
				ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
				          0xe0000000 0x01 0x00 0x00 0x1000000>;

				itc: itc {
					compatible = "microchip,lan966x-itc";
					#interrupt-cells = <1>;
					interrupt-controller;
					reg = <0xe00c0120 0x190>;
				};
				
				...
			 };
		};
	};

I have a 'simple-bus' with a 'ranges' property to translate the BAR addresses
then several devices. Among them a interrupt controller (itc). Its parent
interrupt is the one used by the PCI device (INTA).
I cannot describe this parent interrupt in the dtso because to that I need the
parent interrupt phandle which will be know only at runtime.
Of course, I can modified the overlay applied to tweak the 'interrupt' and
'interrupt-parent' in the itc node from my PCI device driver at runtime but I
would like to avoid this kind of tweak in the PCI device driver.
This kind of tweak is overlay dependent and needs to be done by each PCI
device driver that need to work with interrupts.

For BAR addresses translation, we use the 'ranges' property at the PCI device
node to translate 0 0 0 to BAR0, 1 0 0 to BAR1, ...
What do you think about a new 'irq-ranges' property to translate the irq number
and get the irq parent controller base.

irq-ranges = <child_irq_spec parent_irq_spec length>;

The idea would be to translate the child irq specifier (irq number +
possible flags) parent DT node.
if 'irq-ranges' present in parent translate irq spec then:
1) 'interrupt' present in parent.
   In that case, if the translation match, we use the translated irq spec
   and resolve the parent interrupt controller using existing ways from this
   node. If it does not match: error.
2) 'interrupt-map' present in parent.
   We use existing ways from this node with the translated irq spec to get the
   parent interrupt controller.
3) 'interrupt-controller' present in parent.
   We use this node as parent interrupt controller and the translated irq spec
4) no specific property present, update parent taking the parent's parent and
   go again.

With this the overlay becomes:
	pci-ep-bus@0 {
		compatible = "simple-bus";
		#address-cells = <1>;
		#size-cells = <1>;
		#interrupt-cells = <1>;

		/*
		 * map @0xe2000000 (32MB) to BAR0 (CPU)
		 * map @0xe0000000 (16MB) to BAR1 (AMBA)
		 */
		ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
		          0xe0000000 0x01 0x00 0x00 0x1000000>;

		/* Translate PCI IRQ*/
		irq-ranges = <1 1 1>;

		itc: itc {
			compatible = "microchip,lan966x-itc";
			#interrupt-cells = <1>;
			interrupts = <1>;
			interrupt-controller;
			reg = <0xe00c0120 0x190>;
		};
		
		...
	 };

And the PCI device node built at runtime:
	dev@0,0 {
		#address-cells = <0x03>;
		interrupts = <0x01>;  <-- use parent interrupt-map
		#size-cells = <0x02>;
		compatible = "pci1055,9660\0pciclass,020000\0pciclass,0200";
		ranges = ...;
		irq-ranges = <1 0x01 1>
		reg = <0x10000 0x00 0x00 0x00 0x00>;
	};
or
	dev@0,0 {
		#address-cells = <0x03>;
		interrupts = <NN>;
		interrupts-parent = <phandle'
		#size-cells = <0x02>;
		compatible = "pci1055,9660\0pciclass,020000\0pciclass,0200";
		ranges = ...;
		irq-ranges = <1 NN 1>
		reg = <0x10000 0x00 0x00 0x00 0x00>;
	};

In the second case, interrup-map in bridges nodes is still needed to build the
'interrupts' / 'interrupts-parent' in devices nodes.
and irq-ranges (if it makes sense on your side) allow to get and interrupt from
the overlay without the need to know the irq parent phandle.

Regards,
Hervé

> 
> > We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
> > and the device_add() call, the call to pci_set_msi_domain() is present.
> > MSIs are not supported currently but in the future ...  
> 
> MSI's aren't ever described in PCI nodes. Only the host bridge. So I
> don't think we should have problems there.
> 
> > Related to DT modaliases, I don't think they are needed.
> > All drivers related to PCI device should be declared as pci_driver.
> > Correct me if I am wrong but I think that the core PCI will load the correct
> > module without any DT modalias.  
> 
> Yes, you are probably right.
> 
> Rob
  
Lizhi Hou Dec. 5, 2023, 6:53 p.m. UTC | #9
On 12/4/23 15:03, Rob Herring wrote:
> On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <herve.codina@bootlin.com> wrote:
>> Hi Rob,
>>
>> On Mon, 4 Dec 2023 07:59:09 -0600
>> Rob Herring <robh@kernel.org> wrote:
>>
>> [...]
>>
>>>>> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
>>>>> index 9c2137dae429..46b252bbe500 100644
>>>>> --- a/drivers/pci/bus.c
>>>>> +++ b/drivers/pci/bus.c
>>>>> @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
>>>>>           */
>>>>>          pcibios_bus_add_device(dev);
>>>>>          pci_fixup_device(pci_fixup_final, dev);
>>>>> -       if (pci_is_bridge(dev))
>>>>> -               of_pci_make_dev_node(dev);
>>>>>          pci_create_sysfs_dev_files(dev);
>>>>>          pci_proc_attach_device(dev);
>>>>>          pci_bridge_d3_update(dev);
>>>>> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
>>>>> index 51e3dd0ea5ab..e15eaf0127fc 100644
>>>>> --- a/drivers/pci/of.c
>>>>> +++ b/drivers/pci/of.c
>>>>> @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
>>>>>                  return 0;
>>>>>
>>>>>          node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
>>>>> +       if (!node && pci_is_bridge(dev))
>>>>> +               of_pci_make_dev_node(dev);
>>>>>          if (!node)
>>>>>                  return 0;
>>>> Maybe it is too early.
>>>> of_pci_make_dev_node() creates a node and fills some properties based on
>>>> some already set values available in the PCI device such as its struct resource
>>>> values.
>>>> We need to have some values set by the PCI infra in order to create our DT node
>>>> with correct values.
>>> Indeed, that's probably the issue I'm having. In that case,
>>> DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
>>> device_add().
>>>
>>> I think modifying sysfs after device_add() is going to race with
>>> userspace. Userspace is notified of a new device, and then the of_node
>>> link may or may not be there when it reads sysfs. Also, not sure if
>>> we'll need DT modaliases with PCI devices, but they won't work if the
>>> DT node is not set before device_add().
>> Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
>> On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
>> fix your QEMU unittest ?
> No...
>
> And testing the bridge part crashes. That's because there's a
> dependency on the bridge->subordinate to write out bus-range and
> interrupt-map. I think the fix there is we should just not write those
> properties. The bus range isn't needed because the kernel does its own
> assignments. For interrupt-map, it is only needed if "interrupts" is
Without 'bus-range', dtc will output a warning while compiling the 
generated node.
> present in the child devices. If not present, then the standard PCI
> swizzling is used. Alternatively, I think the interrupt mapping could
> be simplified to just implement the standard swizzling at each level
> which isn't dependent on any of the devices on the bus. I gave that a
> go where each interrupt-map just points to the parent bridge, but ran
> into an issue that the bridge nodes don't have a phandle. That should
> be fixable, but I'd rather go with the first option. I suppose that

Do you mean it might be something similar with I posted in V12?

https://lore.kernel.org/lkml/97ae2eda-f712-0b83-dc90-0f29edd5db8b@amd.com/


Thanks,

Lizhi

> depends on how the interrupts downstream of the PCI device need to get
> resolved. It could be that the PCI device serves as the interrupt
> controller and can resolve the parent interrupt on its own (which may
> be needed for ACPI host anyways).
>
>> We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
>> and the device_add() call, the call to pci_set_msi_domain() is present.
>> MSIs are not supported currently but in the future ...
> MSI's aren't ever described in PCI nodes. Only the host bridge. So I
> don't think we should have problems there.
>
>> Related to DT modaliases, I don't think they are needed.
>> All drivers related to PCI device should be declared as pci_driver.
>> Correct me if I am wrong but I think that the core PCI will load the correct
>> module without any DT modalias.
> Yes, you are probably right.
>
> Rob
  
Rob Herring Dec. 7, 2023, 10:51 p.m. UTC | #10
On Tue, Dec 5, 2023 at 2:05 AM Herve Codina <herve.codina@bootlin.com> wrote:
>
> On Mon, 4 Dec 2023 17:03:21 -0600
> Rob Herring <robh@kernel.org> wrote:
>
> > On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <herve.codina@bootlin.com> wrote:
> > >
> > > Hi Rob,
> > >
> > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > Rob Herring <robh@kernel.org> wrote:
> > >
> > > [...]
> > >
> > > > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > > > index 9c2137dae429..46b252bbe500 100644
> > > > > > --- a/drivers/pci/bus.c
> > > > > > +++ b/drivers/pci/bus.c
> > > > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > > > >          */
> > > > > >         pcibios_bus_add_device(dev);
> > > > > >         pci_fixup_device(pci_fixup_final, dev);
> > > > > > -       if (pci_is_bridge(dev))
> > > > > > -               of_pci_make_dev_node(dev);
> > > > > >         pci_create_sysfs_dev_files(dev);
> > > > > >         pci_proc_attach_device(dev);
> > > > > >         pci_bridge_d3_update(dev);
> > > > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > > > --- a/drivers/pci/of.c
> > > > > > +++ b/drivers/pci/of.c
> > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > >                 return 0;
> > > > > >
> > > > > >         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > +       if (!node && pci_is_bridge(dev))
> > > > > > +               of_pci_make_dev_node(dev);
> > > > > >         if (!node)
> > > > > >                 return 0;
> > > > >
> > > > > Maybe it is too early.
> > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > some already set values available in the PCI device such as its struct resource
> > > > > values.
> > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > with correct values.
> > > >
> > > > Indeed, that's probably the issue I'm having. In that case,
> > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > device_add().
> > > >
> > > > I think modifying sysfs after device_add() is going to race with
> > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > DT node is not set before device_add().
> > >
> > > Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> > > On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> > > fix your QEMU unittest ?
> >
> > No...

I think the problem is we aren't setting the fwnode, just the of_node
ptr, but I haven't had a chance to verify that.

> > And testing the bridge part crashes. That's because there's a
> > dependency on the bridge->subordinate to write out bus-range and
> > interrupt-map. I think the fix there is we should just not write those
> > properties. The bus range isn't needed because the kernel does its own
> > assignments. For interrupt-map, it is only needed if "interrupts" is
> > present in the child devices. If not present, then the standard PCI
> > swizzling is used. Alternatively, I think the interrupt mapping could
> > be simplified to just implement the standard swizzling at each level
> > which isn't dependent on any of the devices on the bus. I gave that a
> > go where each interrupt-map just points to the parent bridge, but ran
> > into an issue that the bridge nodes don't have a phandle. That should
> > be fixable, but I'd rather go with the first option. I suppose that
> > depends on how the interrupts downstream of the PCI device need to get
> > resolved. It could be that the PCI device serves as the interrupt
> > controller and can resolve the parent interrupt on its own (which may
> > be needed for ACPI host anyways).
>
> About interrupt, I am a bit stuck on my side.
> My dtso (applied at PCI device level) contains the following:
>         fragment@0 {
>                 target-path="";
>                 __overlay__ {
>                         pci-ep-bus@0 {
>                                 compatible = "simple-bus";
>                                 #address-cells = <1>;
>                                 #size-cells = <1>;
>
>                                 /*
>                                  * map @0xe2000000 (32MB) to BAR0 (CPU)
>                                  * map @0xe0000000 (16MB) to BAR1 (AMBA)
>                                  */
>                                 ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
>                                           0xe0000000 0x01 0x00 0x00 0x1000000>;
>
>                                 itc: itc {
>                                         compatible = "microchip,lan966x-itc";
>                                         #interrupt-cells = <1>;
>                                         interrupt-controller;
>                                         reg = <0xe00c0120 0x190>;
>                                 };
>
>                                 ...
>                          };
>                 };
>         };
>
> I have a 'simple-bus' with a 'ranges' property to translate the BAR addresses
> then several devices. Among them a interrupt controller (itc). Its parent
> interrupt is the one used by the PCI device (INTA).
> I cannot describe this parent interrupt in the dtso because to that I need the
> parent interrupt phandle which will be know only at runtime.

But you don't. The logic to find the interrupt parent is walk up the
parent nodes until you find 'interrupt-parent' or
'#interrupt-controller' (and interrupt-map always has
#interrupt-controller). So your overlay just needs 'interrupts = <1>'
for INTA and it should all just work.

That of course implies that we need interrupt properties in all the
bridges which I was hoping to avoid. In the ACPI case, for DT
interrupt parsing to work, we're going to need to end up in an
'interrupt-controller' node somewhere. I think the options are either
we walk interrupt-map properties up to the host bridge which then
points to something or the PCI device is the interrupt controller. I
think the PCI device is the right place. How the downstream interrupts
are routed to PCI interrupts are defined by the device. That would
work the same way for both DT and ACPI. If you are concerned about
implementing in each driver needing this, some library functions can
mitigate that.

I'm trying to play around with the IRQ domains and get this to work,
but not having any luck yet.

> Of course, I can modified the overlay applied to tweak the 'interrupt' and
> 'interrupt-parent' in the itc node from my PCI device driver at runtime but I
> would like to avoid this kind of tweak in the PCI device driver.
> This kind of tweak is overlay dependent and needs to be done by each PCI
> device driver that need to work with interrupts.
>
> For BAR addresses translation, we use the 'ranges' property at the PCI device
> node to translate 0 0 0 to BAR0, 1 0 0 to BAR1, ...
> What do you think about a new 'irq-ranges' property to translate the irq number
> and get the irq parent controller base.
>
> irq-ranges = <child_irq_spec parent_irq_spec length>;

Seems fragile as you have to know something about the parent (the # of
cells), but you don't have the phandle. If you needed multiple
entries, you couldn't parse this.

Rob
  
Herve Codina Dec. 8, 2023, 8:48 a.m. UTC | #11
Hi Rob,

On Thu, 7 Dec 2023 16:51:56 -0600
Rob Herring <robh@kernel.org> wrote:

> On Tue, Dec 5, 2023 at 2:05 AM Herve Codina <herve.codina@bootlin.com> wrote:
> >
> > On Mon, 4 Dec 2023 17:03:21 -0600
> > Rob Herring <robh@kernel.org> wrote:
> >  
> > > On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <herve.codina@bootlin.com> wrote:  
> > > >
> > > > Hi Rob,
> > > >
> > > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > > Rob Herring <robh@kernel.org> wrote:
> > > >
> > > > [...]
> > > >  
> > > > > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > > > > index 9c2137dae429..46b252bbe500 100644
> > > > > > > --- a/drivers/pci/bus.c
> > > > > > > +++ b/drivers/pci/bus.c
> > > > > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > > > > >          */
> > > > > > >         pcibios_bus_add_device(dev);
> > > > > > >         pci_fixup_device(pci_fixup_final, dev);
> > > > > > > -       if (pci_is_bridge(dev))
> > > > > > > -               of_pci_make_dev_node(dev);
> > > > > > >         pci_create_sysfs_dev_files(dev);
> > > > > > >         pci_proc_attach_device(dev);
> > > > > > >         pci_bridge_d3_update(dev);
> > > > > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > > > > --- a/drivers/pci/of.c
> > > > > > > +++ b/drivers/pci/of.c
> > > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > > >                 return 0;
> > > > > > >
> > > > > > >         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > > +       if (!node && pci_is_bridge(dev))
> > > > > > > +               of_pci_make_dev_node(dev);
> > > > > > >         if (!node)
> > > > > > >                 return 0;  
> > > > > >
> > > > > > Maybe it is too early.
> > > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > > some already set values available in the PCI device such as its struct resource
> > > > > > values.
> > > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > > with correct values.  
> > > > >
> > > > > Indeed, that's probably the issue I'm having. In that case,
> > > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > > device_add().
> > > > >
> > > > > I think modifying sysfs after device_add() is going to race with
> > > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > > DT node is not set before device_add().  
> > > >
> > > > Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> > > > On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> > > > fix your QEMU unittest ?  
> > >
> > > No...  
> 
> I think the problem is we aren't setting the fwnode, just the of_node
> ptr, but I haven't had a chance to verify that.
> 
> > > And testing the bridge part crashes. That's because there's a
> > > dependency on the bridge->subordinate to write out bus-range and
> > > interrupt-map. I think the fix there is we should just not write those
> > > properties. The bus range isn't needed because the kernel does its own
> > > assignments. For interrupt-map, it is only needed if "interrupts" is
> > > present in the child devices. If not present, then the standard PCI
> > > swizzling is used. Alternatively, I think the interrupt mapping could
> > > be simplified to just implement the standard swizzling at each level
> > > which isn't dependent on any of the devices on the bus. I gave that a
> > > go where each interrupt-map just points to the parent bridge, but ran
> > > into an issue that the bridge nodes don't have a phandle. That should
> > > be fixable, but I'd rather go with the first option. I suppose that
> > > depends on how the interrupts downstream of the PCI device need to get
> > > resolved. It could be that the PCI device serves as the interrupt
> > > controller and can resolve the parent interrupt on its own (which may
> > > be needed for ACPI host anyways).  
> >
> > About interrupt, I am a bit stuck on my side.
> > My dtso (applied at PCI device level) contains the following:
> >         fragment@0 {
> >                 target-path="";
> >                 __overlay__ {
> >                         pci-ep-bus@0 {
> >                                 compatible = "simple-bus";
> >                                 #address-cells = <1>;
> >                                 #size-cells = <1>;
> >
> >                                 /*
> >                                  * map @0xe2000000 (32MB) to BAR0 (CPU)
> >                                  * map @0xe0000000 (16MB) to BAR1 (AMBA)
> >                                  */
> >                                 ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
> >                                           0xe0000000 0x01 0x00 0x00 0x1000000>;
> >
> >                                 itc: itc {
> >                                         compatible = "microchip,lan966x-itc";
> >                                         #interrupt-cells = <1>;
> >                                         interrupt-controller;
> >                                         reg = <0xe00c0120 0x190>;
> >                                 };
> >
> >                                 ...
> >                          };
> >                 };
> >         };
> >
> > I have a 'simple-bus' with a 'ranges' property to translate the BAR addresses
> > then several devices. Among them a interrupt controller (itc). Its parent
> > interrupt is the one used by the PCI device (INTA).
> > I cannot describe this parent interrupt in the dtso because to that I need the
> > parent interrupt phandle which will be know only at runtime.  
> 
> But you don't. The logic to find the interrupt parent is walk up the
> parent nodes until you find 'interrupt-parent' or
> '#interrupt-controller' (and interrupt-map always has
> #interrupt-controller). So your overlay just needs 'interrupts = <1>'
> for INTA and it should all just work.

Yes, I tried some stuffs in that way...
> 
> That of course implies that we need interrupt properties in all the
> bridges which I was hoping to avoid. In the ACPI case, for DT
> interrupt parsing to work, we're going to need to end up in an
> 'interrupt-controller' node somewhere. I think the options are either

... and I went up to that point.
Further more with that way, we need to update the addr value retrieved
from the device requesting the interrupt.
  https://elixir.bootlin.com/linux/latest/source/drivers/of/irq.c#L343
Indeed, with the current 'interrupt-map' at bridges level, a addr value
update is needed at the PCI device level if the interrupt is requested
from some PCI device children.
This is were my (not so good) interrupt-ranges property could play a
role.

> we walk interrupt-map properties up to the host bridge which then
> points to something or the PCI device is the interrupt controller. I
> think the PCI device is the right place. How the downstream interrupts

Agree, the PCI device seems to be the best candidate.

> are routed to PCI interrupts are defined by the device. That would
> work the same way for both DT and ACPI. If you are concerned about
> implementing in each driver needing this, some library functions can
> mitigate that.
> 
> I'm trying to play around with the IRQ domains and get this to work,
> but not having any luck yet.

May I help you on some points?

Got a system with a real hardware (Lan966x) and a setup using an ARM platform (full
DT) and an other one using an x86 platform (ACPI).
Also, I have some piece of code to create the PCI host bridge node.
Of course, this need to be first working on a full DT system but I can do some test
to be sure that can be still ok on a ACPI system.

> 
> > Of course, I can modified the overlay applied to tweak the 'interrupt' and
> > 'interrupt-parent' in the itc node from my PCI device driver at runtime but I
> > would like to avoid this kind of tweak in the PCI device driver.
> > This kind of tweak is overlay dependent and needs to be done by each PCI
> > device driver that need to work with interrupts.
> >
> > For BAR addresses translation, we use the 'ranges' property at the PCI device
> > node to translate 0 0 0 to BAR0, 1 0 0 to BAR1, ...
> > What do you think about a new 'irq-ranges' property to translate the irq number
> > and get the irq parent controller base.
> >
> > irq-ranges = <child_irq_spec parent_irq_spec length>;  
> 
> Seems fragile as you have to know something about the parent (the # of
> cells), but you don't have the phandle. If you needed multiple
> entries, you couldn't parse this.
> 

Right.
With the PCI device seen as an interrupt controller, there is no more need of
this interrup-ranges property.

Best regards,
Hervé
  
Herve Codina Dec. 14, 2023, 2:31 p.m. UTC | #12
Hi Rob,

On Fri, 8 Dec 2023 09:48:40 +0100
Herve Codina <herve.codina@bootlin.com> wrote:

> > 
> > But you don't. The logic to find the interrupt parent is walk up the
> > parent nodes until you find 'interrupt-parent' or
> > '#interrupt-controller' (and interrupt-map always has
> > #interrupt-controller). So your overlay just needs 'interrupts = <1>'
> > for INTA and it should all just work.  
> 
> Yes, I tried some stuffs in that way...
> > 
> > That of course implies that we need interrupt properties in all the
> > bridges which I was hoping to avoid. In the ACPI case, for DT
> > interrupt parsing to work, we're going to need to end up in an
> > 'interrupt-controller' node somewhere. I think the options are either  
> 
> ... and I went up to that point.
> Further more with that way, we need to update the addr value retrieved
> from the device requesting the interrupt.
>   https://elixir.bootlin.com/linux/latest/source/drivers/of/irq.c#L343
> Indeed, with the current 'interrupt-map' at bridges level, a addr value
> update is needed at the PCI device level if the interrupt is requested
> from some PCI device children.
> This is were my (not so good) interrupt-ranges property could play a
> role.
> 
> > we walk interrupt-map properties up to the host bridge which then
> > points to something or the PCI device is the interrupt controller. I
> > think the PCI device is the right place. How the downstream interrupts  
> 
> Agree, the PCI device seems to be the best candidate.
> 
> > are routed to PCI interrupts are defined by the device. That would
> > work the same way for both DT and ACPI. If you are concerned about
> > implementing in each driver needing this, some library functions can
> > mitigate that.
> > 
> > I'm trying to play around with the IRQ domains and get this to work,
> > but not having any luck yet.  
> 

Got some piece of code creating an interrupt controller at PCI device level.
To have it working, '#interrupt-cell = <1>' and 'interrupt-controller'
properties need to be set in the PCI device DT node.

I can set them when the PCI device DT node is created (add them in
of_pci_add_properties()) but as this interrupt controller is specific to the
PCI device driver (it needs to be created after the pci_enable_device() call
and will probably be device specific with MSI), it would make sense to have
these properties set by the PCI device driver itself or in the overlay it
applies.

But these properties creation done by the device driver itself (or the
overlay) lead to memory leaks.
Indeed, we cannot add properties to an existing node without memory
leaks. When a property is removed, the property is not freed but moved
to the node->deadprops list (of_remove_property()).
The properties present in the list will be free once the node itself is
removed.
In our use-case, the node (PCI device node) is not removed when the driver
is removed and probe again (rmmod/insmod).

Do you have any opinion about the '#interrupt-cell' and
'interrupt-controller' properties creation:

- Created by of_pci_add_properties():
  No mem leak but done outside the specific device driver itself and done for
  all PCI devices.
  Maybe the driver will not create the interrupt controller, maybe it will
  prefer an other value for '#interrupt-cell', maybe it will handle MSI and
  will need to set other properties, ... All of these are device specific.

- Created by the device driver itself (or the overlay it applies):
  The mem leak is present. Any idea about a way to fix that or at least having
  a fix for some properties ?
  I was thinking about avoiding to export properties (of_find_property()) and
  so avoid references properties or introducing refcount on properties but
  these modifications touch a lot of drivers and subsystems and are subject
  to errors.
  That's said, checking a property presence based on its name can be done without
  exporting the property, as well as getting a single value. Iterating over array
  values need some more complicated code.


Best regards,
Hervé
  
Herve Codina Dec. 15, 2023, 1:52 p.m. UTC | #13
Hi Rob,

On Mon, 4 Dec 2023 07:59:09 -0600
Rob Herring <robh@kernel.org> wrote:

> On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <herve.codina@bootlin.com> wrote:
> >
> > Hi Rob,
> >
> > On Fri, 1 Dec 2023 16:26:45 -0600
> > Rob Herring <robh@kernel.org> wrote:
> >  
> > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <herve.codina@bootlin.com> wrote:  
> > > >
> > > > Hi,
> > > >
> > > > The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > > creates of_node for PCI devices.
> > > > During the insertion handling of these new DT nodes done by of_platform,
> > > > new devices (struct device) are created.
> > > > For each PCI devices a struct device is already present (created and
> > > > handled by the PCI core).
> > > > Creating a new device from a DT node leads to some kind of wrong struct
> > > > device duplication to represent the exact same PCI device.
> > > >
> > > > This patch series first introduces device_{add,remove}_of_node() in
> > > > order to add or remove a newly created of_node to an already existing
> > > > device.
> > > > Then it fixes the DT node creation for PCI devices to add or remove the
> > > > created node to the existing PCI device without any new device creation.  
> > >
> > > I think the simpler solution is to get the DT node created earlier. We
> > > are just asking for pain if the DT node is set for the device at
> > > different times compared to static DT nodes.
> > >
> > > The following fixes the lack of of_node link. The DT unittest fails
> > > with the change though and I don't see why.
> > >
> > > Also, no idea if the bridge part works because my qemu setup doesn't
> > > create bridges (anyone got a magic cmdline to create them?).
> > >
> > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > index 9c2137dae429..46b252bbe500 100644
> > > --- a/drivers/pci/bus.c
> > > +++ b/drivers/pci/bus.c
> > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > >          */
> > >         pcibios_bus_add_device(dev);
> > >         pci_fixup_device(pci_fixup_final, dev);
> > > -       if (pci_is_bridge(dev))
> > > -               of_pci_make_dev_node(dev);
> > >         pci_create_sysfs_dev_files(dev);
> > >         pci_proc_attach_device(dev);
> > >         pci_bridge_d3_update(dev);
> > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > --- a/drivers/pci/of.c
> > > +++ b/drivers/pci/of.c
> > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > >                 return 0;
> > >
> > >         node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > +       if (!node && pci_is_bridge(dev))
> > > +               of_pci_make_dev_node(dev);
> > >         if (!node)
> > >                 return 0;  
> >
> > Maybe it is too early.
> > of_pci_make_dev_node() creates a node and fills some properties based on
> > some already set values available in the PCI device such as its struct resource
> > values.
> > We need to have some values set by the PCI infra in order to create our DT node
> > with correct values.  
> 
> Indeed, that's probably the issue I'm having. In that case,
> DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> device_add().
> 
> I think modifying sysfs after device_add() is going to race with
> userspace. Userspace is notified of a new device, and then the of_node
> link may or may not be there when it reads sysfs. Also, not sure if
> we'll need DT modaliases with PCI devices, but they won't work if the
> DT node is not set before device_add().
> 
> Rob

DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
just before the device_add() call.
Indeed, in order to fill the DT properties, resources need to be assigned
(needed for the 'ranges' property used for addresses translation).
The resources assignment is done after the call to device_add().

Some PCI sysfs files are already created after adding the device by the
pci_create_sysfs_dev_files() call:
  https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347

Is it really an issue to add the of_node link to sysfs on an already present
device ?

Maybe we can add the of_node link before the device_add() call and add the
'ranges' property in the DT node later, once resources are assigned.
In that case, the race condition is not fixed but moved from the PCI device
to the DT node the device is pointing to.
With DT overlays and of_changeset_*(), modifying nodes is a normal behavior.
Is that acceptable for the 'ranges' property in this use-case ?

Best regards,
Hervé