[v11,07/14] irqchip: Add RISC-V incoming MSI controller early driver

Message ID: 20231023172800.315343-8-apatel@ventanamicro.com
State: New
Series: Linux RISC-V AIA Support

Commit Message

Anup Patel Oct. 23, 2023, 5:27 p.m. UTC
  The RISC-V advanced interrupt architecture (AIA) specification
defines a new MSI controller called the incoming message signalled
interrupt controller (IMSIC), which manages MSIs on a per-HART (or
per-CPU) basis. It also supports IPIs as software-injected MSIs.
(For more details refer to https://github.com/riscv/riscv-aia)

Let us add an early irqchip driver for the RISC-V IMSIC which sets
up the IMSIC state and provides IPIs.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
---
 drivers/irqchip/Kconfig                 |   6 +
 drivers/irqchip/Makefile                |   1 +
 drivers/irqchip/irq-riscv-imsic-early.c | 235 ++++++
 drivers/irqchip/irq-riscv-imsic-state.c | 962 ++++++++++++++++++++++++
 drivers/irqchip/irq-riscv-imsic-state.h | 109 +++
 include/linux/irqchip/riscv-imsic.h     |  87 +++
 6 files changed, 1400 insertions(+)
 create mode 100644 drivers/irqchip/irq-riscv-imsic-early.c
 create mode 100644 drivers/irqchip/irq-riscv-imsic-state.c
 create mode 100644 drivers/irqchip/irq-riscv-imsic-state.h
 create mode 100644 include/linux/irqchip/riscv-imsic.h
  

Comments

Conor Dooley Oct. 24, 2023, 9:25 a.m. UTC | #1
On Mon, Oct 23, 2023 at 10:57:53PM +0530, Anup Patel wrote:

> +#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> +void imsic_vector_debug_show(struct seq_file *m,
> +			     struct imsic_vector *vec, int ind)
> +{
> +	unsigned int mcpu = 0, mlocal_id = 0;
> +	struct imsic_local_priv *lpriv;
> +	bool move_in_progress = false;
> +	struct imsic_vector *mvec;
> +	bool is_enabled = false;
> +	unsigned long flags;
> +
> +	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> +	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> +		return;
> +
> +	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +	if (test_bit(vec->local_id, lpriv->ids_enabled_bitmap))
> +		is_enabled = true;
> +	mvec = lpriv->ids_move[vec->local_id];
> +	if (mvec) {
> +		move_in_progress = true;
> +		mcpu = mvec->cpu;
> +		mlocal_id = mvec->local_id;
> +	}
> +	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +
> +	seq_printf(m, "%*starget_cpu      : %5u\n", ind, "", vec->cpu);
> +	seq_printf(m, "%*starget_local_id : %5u\n", ind, "", vec->local_id);
> +	seq_printf(m, "%*sis_reserved     : %5u\n", ind, "",
> +		   (vec->local_id <= IMSIC_IPI_ID) ? 1 : 0);

> +	seq_printf(m, "%*sis_enabled      : %5u\n", ind, "",
> +		   (move_in_progress) ? 1 : 0);

gcc & clang report:
drivers/irqchip/irq-riscv-imsic-state.c:288:14: warning: variable 'is_enabled' set but not used [-Wunused-but-set-variable]

This looks to be a copy-pasta issue, and the move_in_progress here
should be is_enabled?
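
For reference, a minimal corrected line (assuming the intent is to print
the enabled state computed above) would presumably be:

	seq_printf(m, "%*sis_enabled      : %5u\n", ind, "",
		   is_enabled ? 1 : 0);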

> +	seq_printf(m, "%*sis_move_pending : %5u\n", ind, "",
> +		   (move_in_progress) ? 1 : 0);
> +	if (move_in_progress) {
> +		seq_printf(m, "%*smove_cpu        : %5u\n", ind, "", mcpu);
> +		seq_printf(m, "%*smove_local_id   : %5u\n", ind, "", mlocal_id);
> +	}
> +}
  
Anup Patel Oct. 24, 2023, 12:08 p.m. UTC | #2
On Tue, Oct 24, 2023 at 2:55 PM Conor Dooley <conor@kernel.org> wrote:
>
> On Mon, Oct 23, 2023 at 10:57:53PM +0530, Anup Patel wrote:
>
> > +#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> > +void imsic_vector_debug_show(struct seq_file *m,
> > +                          struct imsic_vector *vec, int ind)
> > +{
> > +     unsigned int mcpu = 0, mlocal_id = 0;
> > +     struct imsic_local_priv *lpriv;
> > +     bool move_in_progress = false;
> > +     struct imsic_vector *mvec;
> > +     bool is_enabled = false;
> > +     unsigned long flags;
> > +
> > +     lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> > +     if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> > +             return;
> > +
> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> > +     if (test_bit(vec->local_id, lpriv->ids_enabled_bitmap))
> > +             is_enabled = true;
> > +     mvec = lpriv->ids_move[vec->local_id];
> > +     if (mvec) {
> > +             move_in_progress = true;
> > +             mcpu = mvec->cpu;
> > +             mlocal_id = mvec->local_id;
> > +     }
> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> > +
> > +     seq_printf(m, "%*starget_cpu      : %5u\n", ind, "", vec->cpu);
> > +     seq_printf(m, "%*starget_local_id : %5u\n", ind, "", vec->local_id);
> > +     seq_printf(m, "%*sis_reserved     : %5u\n", ind, "",
> > +                (vec->local_id <= IMSIC_IPI_ID) ? 1 : 0);
>
> > +     seq_printf(m, "%*sis_enabled      : %5u\n", ind, "",
> > +                (move_in_progress) ? 1 : 0);
>
> gcc & clang report:
> drivers/irqchip/irq-riscv-imsic-state.c:288:14: warning: variable 'is_enabled' set but not used [-Wunused-but-set-variable]
>
> This looks to be a copy-pasta issue, and the move_in_progress here
> should be is_enabled?

Thanks for catching this. Strangely, I did not see this warning with
the toolchain I use.

I will fix it in the next patch revision.

Regards,
Anup

>
> > +     seq_printf(m, "%*sis_move_pending : %5u\n", ind, "",
> > +                (move_in_progress) ? 1 : 0);
> > +     if (move_in_progress) {
> > +             seq_printf(m, "%*smove_cpu        : %5u\n", ind, "", mcpu);
> > +             seq_printf(m, "%*smove_local_id   : %5u\n", ind, "", mlocal_id);
> > +     }
> > +}
  
Björn Töpel Oct. 24, 2023, 1:05 p.m. UTC | #3
Hi Anup!

Wow, I'm really happy to see that you're moving towards the 1-1 model!

Anup Patel <apatel@ventanamicro.com> writes:

> The RISC-V advanced interrupt architecture (AIA) specification
> defines a new MSI controller called incoming message signalled
> interrupt controller (IMSIC) which manages MSI on per-HART (or
> per-CPU) basis. It also supports IPIs as software injected MSIs.
> (For more details refer https://github.com/riscv/riscv-aia)
>
> Let us add an early irqchip driver for RISC-V IMSIC which sets
> up the IMSIC state and provide IPIs.

It would help (reviewers, and future bugfixers) if you add (here or in
the cover) what design decisions you've taken instead of just saying
that you're now supporting IMSIC.

> Signed-off-by: Anup Patel <apatel@ventanamicro.com>
> ---
>  drivers/irqchip/Kconfig                 |   6 +
>  drivers/irqchip/Makefile                |   1 +
>  drivers/irqchip/irq-riscv-imsic-early.c | 235 ++++++
>  drivers/irqchip/irq-riscv-imsic-state.c | 962 ++++++++++++++++++++++++
>  drivers/irqchip/irq-riscv-imsic-state.h | 109 +++
>  include/linux/irqchip/riscv-imsic.h     |  87 +++
>  6 files changed, 1400 insertions(+)
>  create mode 100644 drivers/irqchip/irq-riscv-imsic-early.c
>  create mode 100644 drivers/irqchip/irq-riscv-imsic-state.c
>  create mode 100644 drivers/irqchip/irq-riscv-imsic-state.h
>  create mode 100644 include/linux/irqchip/riscv-imsic.h
>
> diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
> index f7149d0f3d45..bdd80716114d 100644
> --- a/drivers/irqchip/Kconfig
> +++ b/drivers/irqchip/Kconfig
> @@ -546,6 +546,12 @@ config SIFIVE_PLIC
>  	select IRQ_DOMAIN_HIERARCHY
>  	select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP
>  
> +config RISCV_IMSIC
> +	bool
> +	depends on RISCV
> +	select IRQ_DOMAIN_HIERARCHY
> +	select GENERIC_MSI_IRQ
> +
>  config EXYNOS_IRQ_COMBINER
>  	bool "Samsung Exynos IRQ combiner support" if COMPILE_TEST
>  	depends on (ARCH_EXYNOS && ARM) || COMPILE_TEST
> diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
> index ffd945fe71aa..d714724387ce 100644
> --- a/drivers/irqchip/Makefile
> +++ b/drivers/irqchip/Makefile
> @@ -95,6 +95,7 @@ obj-$(CONFIG_QCOM_MPM)			+= irq-qcom-mpm.o
>  obj-$(CONFIG_CSKY_MPINTC)		+= irq-csky-mpintc.o
>  obj-$(CONFIG_CSKY_APB_INTC)		+= irq-csky-apb-intc.o
>  obj-$(CONFIG_RISCV_INTC)		+= irq-riscv-intc.o
> +obj-$(CONFIG_RISCV_IMSIC)		+= irq-riscv-imsic-state.o irq-riscv-imsic-early.o
>  obj-$(CONFIG_SIFIVE_PLIC)		+= irq-sifive-plic.o
>  obj-$(CONFIG_IMX_IRQSTEER)		+= irq-imx-irqsteer.o
>  obj-$(CONFIG_IMX_INTMUX)		+= irq-imx-intmux.o
> diff --git a/drivers/irqchip/irq-riscv-imsic-early.c b/drivers/irqchip/irq-riscv-imsic-early.c
> new file mode 100644
> index 000000000000..23f689ff5807
> --- /dev/null
> +++ b/drivers/irqchip/irq-riscv-imsic-early.c
> @@ -0,0 +1,235 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2021 Western Digital Corporation or its affiliates.
> + * Copyright (C) 2022 Ventana Micro Systems Inc.
> + */
> +
> +#define pr_fmt(fmt) "riscv-imsic: " fmt
> +#include <linux/cpu.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/irq.h>
> +#include <linux/irqchip.h>
> +#include <linux/irqchip/chained_irq.h>
> +#include <linux/module.h>
> +#include <linux/spinlock.h>
> +#include <linux/smp.h>
> +
> +#include "irq-riscv-imsic-state.h"
> +
> +static int imsic_parent_irq;
> +
> +#ifdef CONFIG_SMP
> +static irqreturn_t imsic_local_sync_handler(int irq, void *data)
> +{
> +	imsic_local_sync();
> +	return IRQ_HANDLED;
> +}
> +
> +static void imsic_ipi_send(unsigned int cpu)
> +{
> +	struct imsic_local_config *local =
> +				per_cpu_ptr(imsic->global.local, cpu);

General nit for the series: there are a lot of line breaks in the series.
Try to use the full 100 characters per line.

> +
> +	writel(IMSIC_IPI_ID, local->msi_va);

Do you need the barriers here? If so, please document. If not, use the
_relaxed() version.
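
For illustration only (not part of the patch), the two alternatives being
suggested here are roughly:

	/*
	 * 1) Ordering against prior normal-memory stores is needed:
	 *    keep writel() and document why the implied barrier matters.
	 */
	writel(IMSIC_IPI_ID, local->msi_va);

	/* 2) No such ordering is needed: use the relaxed accessor. */
	writel_relaxed(IMSIC_IPI_ID, local->msi_va);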

> +}
> +
> +static void imsic_ipi_starting_cpu(void)
> +{
> +	/* Enable IPIs for current CPU. */
> +	__imsic_id_set_enable(IMSIC_IPI_ID);
> +
> +	/* Enable virtual IPI used for IMSIC ID synchronization */
> +	enable_percpu_irq(imsic->ipi_virq, 0);

Maybe pass IRQ_TYPE_NONE instead of 0, so it's clearer?

> +}
> +
> +static void imsic_ipi_dying_cpu(void)
> +{
> +	/*
> +	 * Disable virtual IPI used for IMSIC ID synchronization so
> +	 * that we don't receive ID synchronization requests.
> +	 */
> +	disable_percpu_irq(imsic->ipi_virq);
> +}
> +
> +static int __init imsic_ipi_domain_init(void)
> +{
> +	int virq;
> +
> +	/* Create IMSIC IPI multiplexing */
> +	virq = ipi_mux_create(IMSIC_NR_IPI, imsic_ipi_send);
> +	if (virq <= 0)
> +		return (virq < 0) ? virq : -ENOMEM;
> +	imsic->ipi_virq = virq;
> +
> +	/* First vIRQ is used for IMSIC ID synchronization */
> +	virq = request_percpu_irq(imsic->ipi_virq, imsic_local_sync_handler,
> +				  "riscv-imsic-lsync", imsic->global.local);
> +	if (virq)
> +		return virq;
> +	irq_set_status_flags(imsic->ipi_virq, IRQ_HIDDEN);
> +	imsic->ipi_lsync_desc = irq_to_desc(imsic->ipi_virq);
> +
> +	/* Set vIRQ range */
> +	riscv_ipi_set_virq_range(imsic->ipi_virq + 1, IMSIC_NR_IPI - 1, true);
> +
> +	/* Announce that IMSIC is providing IPIs */
> +	pr_info("%pfwP: providing IPIs using interrupt %d\n",
> +		imsic->fwnode, IMSIC_IPI_ID);
> +
> +	return 0;
> +}
> +#else
> +static void imsic_ipi_starting_cpu(void)
> +{
> +}
> +
> +static void imsic_ipi_dying_cpu(void)
> +{
> +}
> +
> +static int __init imsic_ipi_domain_init(void)
> +{
> +	return 0;
> +}
> +#endif
> +
> +/*
> + * To handle an interrupt, we read the TOPEI CSR and write zero in one
> + * instruction. If TOPEI CSR is non-zero then we translate TOPEI.ID to
> + * Linux interrupt number and let Linux IRQ subsystem handle it.
> + */
> +static void imsic_handle_irq(struct irq_desc *desc)
> +{
> +	struct irq_chip *chip = irq_desc_get_chip(desc);
> +	int err, cpu = smp_processor_id();
> +	struct imsic_vector *vec;
> +	unsigned long local_id;
> +
> +	chained_irq_enter(chip, desc);
> +
> +	while ((local_id = csr_swap(CSR_TOPEI, 0))) {
> +		local_id = local_id >> TOPEI_ID_SHIFT;
> +
> +		if (local_id == IMSIC_IPI_ID) {
> +#ifdef CONFIG_SMP
> +			ipi_mux_process();
> +#endif
> +			continue;
> +		}
> +
> +		if (unlikely(!imsic->base_domain))
> +			continue;
> +
> +		vec = imsic_vector_from_local_id(cpu, local_id);
> +		if (!vec) {
> +			pr_warn_ratelimited(
> +				"vector not found for local ID 0x%lx\n",
> +				local_id);
> +			continue;
> +		}
> +
> +		err = generic_handle_domain_irq(imsic->base_domain,
> +						vec->hwirq);
> +		if (unlikely(err))
> +			pr_warn_ratelimited(
> +				"hwirq 0x%x mapping not found\n",
> +				vec->hwirq);
> +	}
> +
> +	chained_irq_exit(chip, desc);
> +}
> +
> +static int imsic_starting_cpu(unsigned int cpu)
> +{
> +	/* Enable per-CPU parent interrupt */
> +	enable_percpu_irq(imsic_parent_irq,
> +			  irq_get_trigger_type(imsic_parent_irq));
> +
> +	/* Setup IPIs */
> +	imsic_ipi_starting_cpu();
> +
> +	/*
> +	 * Interrupts identities might have been enabled/disabled while
> +	 * this CPU was not running so sync-up local enable/disable state.
> +	 */
> +	imsic_local_sync();
> +
> +	/* Enable local interrupt delivery */
> +	imsic_local_delivery(true);
> +
> +	return 0;
> +}
> +
> +static int imsic_dying_cpu(unsigned int cpu)
> +{
> +	/* Cleanup IPIs */
> +	imsic_ipi_dying_cpu();
> +
> +	return 0;
> +}
> +
> +static int __init imsic_early_probe(struct fwnode_handle *fwnode)
> +{
> +	int rc;
> +	struct irq_domain *domain;
> +
> +	/* Find parent domain and register chained handler */
> +	domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(),
> +					  DOMAIN_BUS_ANY);
> +	if (!domain) {
> +		pr_err("%pfwP: Failed to find INTC domain\n", fwnode);
> +		return -ENOENT;
> +	}
> +	imsic_parent_irq = irq_create_mapping(domain, RV_IRQ_EXT);
> +	if (!imsic_parent_irq) {
> +		pr_err("%pfwP: Failed to create INTC mapping\n", fwnode);
> +		return -ENOENT;
> +	}
> +	irq_set_chained_handler(imsic_parent_irq, imsic_handle_irq);
> +
> +	/* Initialize IPI domain */
> +	rc = imsic_ipi_domain_init();
> +	if (rc) {
> +		pr_err("%pfwP: Failed to initialize IPI domain\n", fwnode);
> +		return rc;
> +	}
> +
> +	/*
> +	 * Setup cpuhp state (must be done after setting imsic_parent_irq)
> +	 *
> +	 * Don't disable per-CPU IMSIC file when CPU goes offline
> +	 * because this affects IPI and the masking/unmasking of
> +	 * virtual IPIs is done via generic IPI-Mux
> +	 */
> +	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> +			  "irqchip/riscv/imsic:starting",
> +			  imsic_starting_cpu, imsic_dying_cpu);
> +
> +	return 0;
> +}
> +
> +static int __init imsic_early_dt_init(struct device_node *node,
> +				      struct device_node *parent)
> +{
> +	int rc;
> +	struct fwnode_handle *fwnode = &node->fwnode;
> +
> +	/* Setup IMSIC state */
> +	rc = imsic_setup_state(fwnode);
> +	if (rc) {
> +		pr_err("%pfwP: failed to setup state (error %d)\n",
> +			fwnode, rc);
> +		return rc;
> +	}
> +
> +	/* Do early setup of IPIs */
> +	rc = imsic_early_probe(fwnode);
> +	if (rc)
> +		return rc;
> +
> +	/* Ensure that OF platform device gets probed */
> +	of_node_clear_flag(node, OF_POPULATED);
> +	return 0;
> +}
> +IRQCHIP_DECLARE(riscv_imsic, "riscv,imsics", imsic_early_dt_init);
> diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
> new file mode 100644
> index 000000000000..54465e47851c
> --- /dev/null
> +++ b/drivers/irqchip/irq-riscv-imsic-state.c
> @@ -0,0 +1,962 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2021 Western Digital Corporation or its affiliates.
> + * Copyright (C) 2022 Ventana Micro Systems Inc.
> + */
> +
> +#define pr_fmt(fmt) "riscv-imsic: " fmt
> +#include <linux/cpu.h>
> +#include <linux/bitmap.h>
> +#include <linux/interrupt.h>
> +#include <linux/irq.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_address.h>
> +#include <linux/of_irq.h>
> +#include <linux/seq_file.h>
> +#include <linux/spinlock.h>
> +#include <linux/smp.h>
> +#include <asm/hwcap.h>
> +
> +#include "irq-riscv-imsic-state.h"
> +
> +#define IMSIC_DISABLE_EIDELIVERY		0
> +#define IMSIC_ENABLE_EIDELIVERY			1
> +#define IMSIC_DISABLE_EITHRESHOLD		1
> +#define IMSIC_ENABLE_EITHRESHOLD		0
> +
> +#define imsic_csr_write(__c, __v)		\
> +do {						\
> +	csr_write(CSR_ISELECT, __c);		\
> +	csr_write(CSR_IREG, __v);		\
> +} while (0)
> +
> +#define imsic_csr_read(__c)			\
> +({						\
> +	unsigned long __v;			\
> +	csr_write(CSR_ISELECT, __c);		\
> +	__v = csr_read(CSR_IREG);		\
> +	__v;					\
> +})
> +
> +#define imsic_csr_read_clear(__c, __v)		\
> +({						\
> +	unsigned long __r;			\
> +	csr_write(CSR_ISELECT, __c);		\
> +	__r = csr_read_clear(CSR_IREG, __v);	\
> +	__r;					\
> +})
> +
> +#define imsic_csr_set(__c, __v)			\
> +do {						\
> +	csr_write(CSR_ISELECT, __c);		\
> +	csr_set(CSR_IREG, __v);			\
> +} while (0)
> +
> +#define imsic_csr_clear(__c, __v)		\
> +do {						\
> +	csr_write(CSR_ISELECT, __c);		\
> +	csr_clear(CSR_IREG, __v);		\
> +} while (0)
> +
> +struct imsic_priv *imsic;
> +
> +const struct imsic_global_config *imsic_get_global_config(void)
> +{
> +	return (imsic) ? &imsic->global : NULL;

Nit: No need for the parenthesis.

> +}
> +EXPORT_SYMBOL_GPL(imsic_get_global_config);
> +
> +static bool __imsic_eix_read_clear(unsigned long id, bool pend)
> +{
> +	unsigned long isel, imask;
> +
> +	isel = id / BITS_PER_LONG;
> +	isel *= BITS_PER_LONG / IMSIC_EIPx_BITS;
> +	isel += (pend) ? IMSIC_EIP0 : IMSIC_EIE0;
> +	imask = BIT(id & (__riscv_xlen - 1));
> +
> +	return (imsic_csr_read_clear(isel, imask) & imask) ? true : false;
> +}
> +
> +#define __imsic_id_read_clear_enabled(__id)		\
> +	__imsic_eix_read_clear((__id), false)
> +#define __imsic_id_read_clear_pending(__id)		\
> +	__imsic_eix_read_clear((__id), true)
> +
> +void __imsic_eix_update(unsigned long base_id,
> +			unsigned long num_id, bool pend, bool val)
> +{
> +	unsigned long i, isel, ireg;
> +	unsigned long id = base_id, last_id = base_id + num_id;
> +
> +	while (id < last_id) {
> +		isel = id / BITS_PER_LONG;
> +		isel *= BITS_PER_LONG / IMSIC_EIPx_BITS;
> +		isel += (pend) ? IMSIC_EIP0 : IMSIC_EIE0;

Parenthesis nit.

> +
> +		ireg = 0;
> +		for (i = id & (__riscv_xlen - 1);
> +		     (id < last_id) && (i < __riscv_xlen); i++) {
> +			ireg |= BIT(i);
> +			id++;
> +		}
> +
> +		/*
> +		 * The IMSIC EIEx and EIPx registers are indirectly
> +		 * accessed via using ISELECT and IREG CSRs so we
> +		 * need to access these CSRs without getting preempted.
> +		 *
> +		 * All existing users of this function call this
> +		 * function with local IRQs disabled so we don't
> +		 * need to do anything special here.
> +		 */
> +		if (val)
> +			imsic_csr_set(isel, ireg);
> +		else
> +			imsic_csr_clear(isel, ireg);
> +	}
> +}
> +
> +void imsic_local_sync(void)
> +{
> +	struct imsic_local_priv *lpriv = this_cpu_ptr(imsic->lpriv);
> +	struct imsic_local_config *mlocal;
> +	struct imsic_vector *mvec;
> +	unsigned long flags;
> +	int i;
> +
> +	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +	for (i = 1; i <= imsic->global.nr_ids; i++) {
> +		if (i == IMSIC_IPI_ID)
> +			continue;
> +
> +		if (test_bit(i, lpriv->ids_enabled_bitmap))
> +			__imsic_id_set_enable(i);
> +		else
> +			__imsic_id_clear_enable(i);
> +
> +		mvec = lpriv->ids_move[i];
> +		lpriv->ids_move[i] = NULL;
> +		if (mvec) {
> +			if (__imsic_id_read_clear_pending(i)) {
> +				mlocal = per_cpu_ptr(imsic->global.local,
> +						     mvec->cpu);
> +				writel(mvec->local_id, mlocal->msi_va);

Again, do you need all the barriers? If yes, document them. If not, relax
the call.

> +			}
> +
> +			lpriv->vectors[i].hwirq = UINT_MAX;
> +			lpriv->vectors[i].order = UINT_MAX;
> +			clear_bit(i, lpriv->ids_used_bitmap);
> +		}
> +
> +	}
> +	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +}
> +
> +void imsic_local_delivery(bool enable)
> +{
> +	if (enable) {
> +		imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_ENABLE_EITHRESHOLD);
> +		imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_ENABLE_EIDELIVERY);
> +	} else {
> +		imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_DISABLE_EIDELIVERY);
> +		imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_DISABLE_EITHRESHOLD);
> +	}

My regular "early exit" nit. I guess I really dislike indentation. ;-)

> +}
> +
> +#ifdef CONFIG_SMP
> +static void imsic_remote_sync(unsigned int cpu)
> +{
> +	/*
> +	 * We simply inject ID synchronization IPI to a target CPU
> +	 * if it is not same as the current CPU. The ipi_send_mask()
> +	 * implementation of IPI mux will inject ID synchronization
> +	 * IPI only for CPUs that have enabled it so offline CPUs
> +	 * won't receive IPI. An offline CPU will unconditionally
> +	 * synchronize IDs through imsic_starting_cpu() when the
> +	 * CPU is brought up.
> +	 */
> +	if (cpu_online(cpu)) {
> +		if (cpu != smp_processor_id())
> +			__ipi_send_mask(imsic->ipi_lsync_desc, cpumask_of(cpu));
> +		else
> +			imsic_local_sync();
> +	}
> +}
> +#else
> +static inline void imsic_remote_sync(unsigned int cpu)

Remove inline.

> +{
> +	imsic_local_sync();
> +}
> +#endif
> +
> +void imsic_vector_mask(struct imsic_vector *vec)
> +{
> +	struct imsic_local_priv *lpriv;
> +	unsigned long flags;
> +
> +	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> +	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> +		return;
> +
> +	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +	bitmap_clear(lpriv->ids_enabled_bitmap, vec->local_id, 1);
> +	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +
> +	imsic_remote_sync(vec->cpu);

x86 seems to set a timer instead for the remote CPU cleanup, which can
be much cheaper and less intrusive. Is that applicable here?

> +}
> +
> +void imsic_vector_unmask(struct imsic_vector *vec)
> +{
> +	struct imsic_local_priv *lpriv;
> +	unsigned long flags;
> +
> +	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> +	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> +		return;
> +
> +	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +	bitmap_set(lpriv->ids_enabled_bitmap, vec->local_id, 1);
> +	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +
> +	imsic_remote_sync(vec->cpu);
> +}
> +
> +void imsic_vector_move(struct imsic_vector *old_vec,
> +			struct imsic_vector *new_vec)
> +{
> +	struct imsic_local_priv *old_lpriv, *new_lpriv;
> +	struct imsic_vector *ovec, *nvec;
> +	unsigned long flags, flags1;
> +	unsigned int i;
> +
> +	if (WARN_ON(old_vec->cpu == new_vec->cpu ||
> +		    old_vec->order != new_vec->order ||
> +		    (old_vec->local_id & IMSIC_VECTOR_MASK(old_vec)) ||
> +		    (new_vec->local_id & IMSIC_VECTOR_MASK(new_vec))))
> +		return;
> +
> +	old_lpriv = per_cpu_ptr(imsic->lpriv, old_vec->cpu);
> +	if (WARN_ON(&old_lpriv->vectors[old_vec->local_id] != old_vec))
> +		return;
> +
> +	new_lpriv = per_cpu_ptr(imsic->lpriv, new_vec->cpu);
> +	if (WARN_ON(&new_lpriv->vectors[new_vec->local_id] != new_vec))
> +		return;
> +
> +	raw_spin_lock_irqsave(&old_lpriv->ids_lock, flags);
> +	raw_spin_lock_irqsave(&new_lpriv->ids_lock, flags1);
> +
> +	/* Move the state of each vector entry */
> +	for (i = 0; i < BIT(old_vec->order); i++) {
> +		ovec = old_vec + i;
> +		nvec = new_vec + i;
> +
> +		/* Unmask the new vector entry */
> +		if (test_bit(ovec->local_id, old_lpriv->ids_enabled_bitmap))
> +			bitmap_set(new_lpriv->ids_enabled_bitmap,
> +				   nvec->local_id, 1);
> +
> +		/* Mask the old vector entry */
> +		bitmap_clear(old_lpriv->ids_enabled_bitmap, ovec->local_id, 1);
> +
> +		/*
> +		 * Move and re-trigger the new vector entry based on the
> +		 * pending state of the old vector entry because we might
> +		 * get a device interrupt on the old vector entry while
> +		 * device was being moved to the new vector entry.
> +		 */
> +		old_lpriv->ids_move[ovec->local_id] = nvec;
> +	}

Hmm, nested spinlocks, and reimplementing what the irq matrix allocator
does.

Convince me why the irq matrix is not a good fit to track the interrupt IDs
*and* get handling/tracking for managed/unmanaged interrupts. You said
that it was the power-of-two blocks for MSI, but can't that be enforced
on matrix alloc? Where are you doing the special handling of MSI?

The reason I'm asking is because I'm pretty certain that x86 has proper
MSI support (Thomas Gleixner can answer for sure! ;-))

IMSIC smells a lot like the LAPIC. The implementation could probably
be *very* close to what arch/x86/kernel/apic/vector.c does.

Am I completely off here?


Björn
  
Anup Patel Oct. 25, 2023, 5:08 a.m. UTC | #4
On Tue, Oct 24, 2023 at 6:35 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Hi Anup!
>
> Wow, I'm really happy to see that you're moving towards the 1-1 model!
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > The RISC-V advanced interrupt architecture (AIA) specification
> > defines a new MSI controller called incoming message signalled
> > interrupt controller (IMSIC) which manages MSI on per-HART (or
> > per-CPU) basis. It also supports IPIs as software injected MSIs.
> > (For more details refer https://github.com/riscv/riscv-aia)
> >
> > Let us add an early irqchip driver for RISC-V IMSIC which sets
> > up the IMSIC state and provide IPIs.
>
> It would help (reviewers, and future bugfixers) if you add (here or in
> the cover) what design decisions you've taken instead of just saying
> that you're now supporting IMSIC.

I agree with the suggestion, but this kind of information should be
in the source itself rather than in the commit description.

>
> > Signed-off-by: Anup Patel <apatel@ventanamicro.com>
> > ---
> >  drivers/irqchip/Kconfig                 |   6 +
> >  drivers/irqchip/Makefile                |   1 +
> >  drivers/irqchip/irq-riscv-imsic-early.c | 235 ++++++
> >  drivers/irqchip/irq-riscv-imsic-state.c | 962 ++++++++++++++++++++++++
> >  drivers/irqchip/irq-riscv-imsic-state.h | 109 +++
> >  include/linux/irqchip/riscv-imsic.h     |  87 +++
> >  6 files changed, 1400 insertions(+)
> >  create mode 100644 drivers/irqchip/irq-riscv-imsic-early.c
> >  create mode 100644 drivers/irqchip/irq-riscv-imsic-state.c
> >  create mode 100644 drivers/irqchip/irq-riscv-imsic-state.h
> >  create mode 100644 include/linux/irqchip/riscv-imsic.h
> >
> > diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
> > index f7149d0f3d45..bdd80716114d 100644
> > --- a/drivers/irqchip/Kconfig
> > +++ b/drivers/irqchip/Kconfig
> > @@ -546,6 +546,12 @@ config SIFIVE_PLIC
> >       select IRQ_DOMAIN_HIERARCHY
> >       select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP
> >
> > +config RISCV_IMSIC
> > +     bool
> > +     depends on RISCV
> > +     select IRQ_DOMAIN_HIERARCHY
> > +     select GENERIC_MSI_IRQ
> > +
> >  config EXYNOS_IRQ_COMBINER
> >       bool "Samsung Exynos IRQ combiner support" if COMPILE_TEST
> >       depends on (ARCH_EXYNOS && ARM) || COMPILE_TEST
> > diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
> > index ffd945fe71aa..d714724387ce 100644
> > --- a/drivers/irqchip/Makefile
> > +++ b/drivers/irqchip/Makefile
> > @@ -95,6 +95,7 @@ obj-$(CONFIG_QCOM_MPM)                      += irq-qcom-mpm.o
> >  obj-$(CONFIG_CSKY_MPINTC)            += irq-csky-mpintc.o
> >  obj-$(CONFIG_CSKY_APB_INTC)          += irq-csky-apb-intc.o
> >  obj-$(CONFIG_RISCV_INTC)             += irq-riscv-intc.o
> > +obj-$(CONFIG_RISCV_IMSIC)            += irq-riscv-imsic-state.o irq-riscv-imsic-early.o
> >  obj-$(CONFIG_SIFIVE_PLIC)            += irq-sifive-plic.o
> >  obj-$(CONFIG_IMX_IRQSTEER)           += irq-imx-irqsteer.o
> >  obj-$(CONFIG_IMX_INTMUX)             += irq-imx-intmux.o
> > diff --git a/drivers/irqchip/irq-riscv-imsic-early.c b/drivers/irqchip/irq-riscv-imsic-early.c
> > new file mode 100644
> > index 000000000000..23f689ff5807
> > --- /dev/null
> > +++ b/drivers/irqchip/irq-riscv-imsic-early.c
> > @@ -0,0 +1,235 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2021 Western Digital Corporation or its affiliates.
> > + * Copyright (C) 2022 Ventana Micro Systems Inc.
> > + */
> > +
> > +#define pr_fmt(fmt) "riscv-imsic: " fmt
> > +#include <linux/cpu.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/io.h>
> > +#include <linux/irq.h>
> > +#include <linux/irqchip.h>
> > +#include <linux/irqchip/chained_irq.h>
> > +#include <linux/module.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/smp.h>
> > +
> > +#include "irq-riscv-imsic-state.h"
> > +
> > +static int imsic_parent_irq;
> > +
> > +#ifdef CONFIG_SMP
> > +static irqreturn_t imsic_local_sync_handler(int irq, void *data)
> > +{
> > +     imsic_local_sync();
> > +     return IRQ_HANDLED;
> > +}
> > +
> > +static void imsic_ipi_send(unsigned int cpu)
> > +{
> > +     struct imsic_local_config *local =
> > +                             per_cpu_ptr(imsic->global.local, cpu);
>
> General nit for the series; There's a lot line breaks in the series. Try
> to use the full 100 chars for a line.

I prefer the 80-column limit.

>
> > +
> > +     writel(IMSIC_IPI_ID, local->msi_va);
>
> Do you need the barriers here? If so, please document. If not, use the
> _releaxed() version.

We can't assume that the _relaxed() versions of MMIO operations
will work on every RISC-V implementation, so we conservatively
use regular MMIO operations without _relaxed().

>
> > +}
> > +
> > +static void imsic_ipi_starting_cpu(void)
> > +{
> > +     /* Enable IPIs for current CPU. */
> > +     __imsic_id_set_enable(IMSIC_IPI_ID);
> > +
> > +     /* Enable virtual IPI used for IMSIC ID synchronization */
> > +     enable_percpu_irq(imsic->ipi_virq, 0);
>
> Maybe pass IRQ_TYPE_NONE instead of 0, so it's clearer?

Okay, I will update.

>
> > +}
> > +
> > +static void imsic_ipi_dying_cpu(void)
> > +{
> > +     /*
> > +      * Disable virtual IPI used for IMSIC ID synchronization so
> > +      * that we don't receive ID synchronization requests.
> > +      */
> > +     disable_percpu_irq(imsic->ipi_virq);
> > +}
> > +
> > +static int __init imsic_ipi_domain_init(void)
> > +{
> > +     int virq;
> > +
> > +     /* Create IMSIC IPI multiplexing */
> > +     virq = ipi_mux_create(IMSIC_NR_IPI, imsic_ipi_send);
> > +     if (virq <= 0)
> > +             return (virq < 0) ? virq : -ENOMEM;
> > +     imsic->ipi_virq = virq;
> > +
> > +     /* First vIRQ is used for IMSIC ID synchronization */
> > +     virq = request_percpu_irq(imsic->ipi_virq, imsic_local_sync_handler,
> > +                               "riscv-imsic-lsync", imsic->global.local);
> > +     if (virq)
> > +             return virq;
> > +     irq_set_status_flags(imsic->ipi_virq, IRQ_HIDDEN);
> > +     imsic->ipi_lsync_desc = irq_to_desc(imsic->ipi_virq);
> > +
> > +     /* Set vIRQ range */
> > +     riscv_ipi_set_virq_range(imsic->ipi_virq + 1, IMSIC_NR_IPI - 1, true);
> > +
> > +     /* Announce that IMSIC is providing IPIs */
> > +     pr_info("%pfwP: providing IPIs using interrupt %d\n",
> > +             imsic->fwnode, IMSIC_IPI_ID);
> > +
> > +     return 0;
> > +}
> > +#else
> > +static void imsic_ipi_starting_cpu(void)
> > +{
> > +}
> > +
> > +static void imsic_ipi_dying_cpu(void)
> > +{
> > +}
> > +
> > +static int __init imsic_ipi_domain_init(void)
> > +{
> > +     return 0;
> > +}
> > +#endif
> > +
> > +/*
> > + * To handle an interrupt, we read the TOPEI CSR and write zero in one
> > + * instruction. If TOPEI CSR is non-zero then we translate TOPEI.ID to
> > + * Linux interrupt number and let Linux IRQ subsystem handle it.
> > + */
> > +static void imsic_handle_irq(struct irq_desc *desc)
> > +{
> > +     struct irq_chip *chip = irq_desc_get_chip(desc);
> > +     int err, cpu = smp_processor_id();
> > +     struct imsic_vector *vec;
> > +     unsigned long local_id;
> > +
> > +     chained_irq_enter(chip, desc);
> > +
> > +     while ((local_id = csr_swap(CSR_TOPEI, 0))) {
> > +             local_id = local_id >> TOPEI_ID_SHIFT;
> > +
> > +             if (local_id == IMSIC_IPI_ID) {
> > +#ifdef CONFIG_SMP
> > +                     ipi_mux_process();
> > +#endif
> > +                     continue;
> > +             }
> > +
> > +             if (unlikely(!imsic->base_domain))
> > +                     continue;
> > +
> > +             vec = imsic_vector_from_local_id(cpu, local_id);
> > +             if (!vec) {
> > +                     pr_warn_ratelimited(
> > +                             "vector not found for local ID 0x%lx\n",
> > +                             local_id);
> > +                     continue;
> > +             }
> > +
> > +             err = generic_handle_domain_irq(imsic->base_domain,
> > +                                             vec->hwirq);
> > +             if (unlikely(err))
> > +                     pr_warn_ratelimited(
> > +                             "hwirq 0x%x mapping not found\n",
> > +                             vec->hwirq);
> > +     }
> > +
> > +     chained_irq_exit(chip, desc);
> > +}
> > +
> > +static int imsic_starting_cpu(unsigned int cpu)
> > +{
> > +     /* Enable per-CPU parent interrupt */
> > +     enable_percpu_irq(imsic_parent_irq,
> > +                       irq_get_trigger_type(imsic_parent_irq));
> > +
> > +     /* Setup IPIs */
> > +     imsic_ipi_starting_cpu();
> > +
> > +     /*
> > +      * Interrupts identities might have been enabled/disabled while
> > +      * this CPU was not running so sync-up local enable/disable state.
> > +      */
> > +     imsic_local_sync();
> > +
> > +     /* Enable local interrupt delivery */
> > +     imsic_local_delivery(true);
> > +
> > +     return 0;
> > +}
> > +
> > +static int imsic_dying_cpu(unsigned int cpu)
> > +{
> > +     /* Cleanup IPIs */
> > +     imsic_ipi_dying_cpu();
> > +
> > +     return 0;
> > +}
> > +
> > +static int __init imsic_early_probe(struct fwnode_handle *fwnode)
> > +{
> > +     int rc;
> > +     struct irq_domain *domain;
> > +
> > +     /* Find parent domain and register chained handler */
> > +     domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(),
> > +                                       DOMAIN_BUS_ANY);
> > +     if (!domain) {
> > +             pr_err("%pfwP: Failed to find INTC domain\n", fwnode);
> > +             return -ENOENT;
> > +     }
> > +     imsic_parent_irq = irq_create_mapping(domain, RV_IRQ_EXT);
> > +     if (!imsic_parent_irq) {
> > +             pr_err("%pfwP: Failed to create INTC mapping\n", fwnode);
> > +             return -ENOENT;
> > +     }
> > +     irq_set_chained_handler(imsic_parent_irq, imsic_handle_irq);
> > +
> > +     /* Initialize IPI domain */
> > +     rc = imsic_ipi_domain_init();
> > +     if (rc) {
> > +             pr_err("%pfwP: Failed to initialize IPI domain\n", fwnode);
> > +             return rc;
> > +     }
> > +
> > +     /*
> > +      * Setup cpuhp state (must be done after setting imsic_parent_irq)
> > +      *
> > +      * Don't disable per-CPU IMSIC file when CPU goes offline
> > +      * because this affects IPI and the masking/unmasking of
> > +      * virtual IPIs is done via generic IPI-Mux
> > +      */
> > +     cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > +                       "irqchip/riscv/imsic:starting",
> > +                       imsic_starting_cpu, imsic_dying_cpu);
> > +
> > +     return 0;
> > +}
> > +
> > +static int __init imsic_early_dt_init(struct device_node *node,
> > +                                   struct device_node *parent)
> > +{
> > +     int rc;
> > +     struct fwnode_handle *fwnode = &node->fwnode;
> > +
> > +     /* Setup IMSIC state */
> > +     rc = imsic_setup_state(fwnode);
> > +     if (rc) {
> > +             pr_err("%pfwP: failed to setup state (error %d)\n",
> > +                     fwnode, rc);
> > +             return rc;
> > +     }
> > +
> > +     /* Do early setup of IPIs */
> > +     rc = imsic_early_probe(fwnode);
> > +     if (rc)
> > +             return rc;
> > +
> > +     /* Ensure that OF platform device gets probed */
> > +     of_node_clear_flag(node, OF_POPULATED);
> > +     return 0;
> > +}
> > +IRQCHIP_DECLARE(riscv_imsic, "riscv,imsics", imsic_early_dt_init);
> > diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
> > new file mode 100644
> > index 000000000000..54465e47851c
> > --- /dev/null
> > +++ b/drivers/irqchip/irq-riscv-imsic-state.c
> > @@ -0,0 +1,962 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2021 Western Digital Corporation or its affiliates.
> > + * Copyright (C) 2022 Ventana Micro Systems Inc.
> > + */
> > +
> > +#define pr_fmt(fmt) "riscv-imsic: " fmt
> > +#include <linux/cpu.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/irq.h>
> > +#include <linux/module.h>
> > +#include <linux/of.h>
> > +#include <linux/of_address.h>
> > +#include <linux/of_irq.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/smp.h>
> > +#include <asm/hwcap.h>
> > +
> > +#include "irq-riscv-imsic-state.h"
> > +
> > +#define IMSIC_DISABLE_EIDELIVERY             0
> > +#define IMSIC_ENABLE_EIDELIVERY                      1
> > +#define IMSIC_DISABLE_EITHRESHOLD            1
> > +#define IMSIC_ENABLE_EITHRESHOLD             0
> > +
> > +#define imsic_csr_write(__c, __v)            \
> > +do {                                         \
> > +     csr_write(CSR_ISELECT, __c);            \
> > +     csr_write(CSR_IREG, __v);               \
> > +} while (0)
> > +
> > +#define imsic_csr_read(__c)                  \
> > +({                                           \
> > +     unsigned long __v;                      \
> > +     csr_write(CSR_ISELECT, __c);            \
> > +     __v = csr_read(CSR_IREG);               \
> > +     __v;                                    \
> > +})
> > +
> > +#define imsic_csr_read_clear(__c, __v)               \
> > +({                                           \
> > +     unsigned long __r;                      \
> > +     csr_write(CSR_ISELECT, __c);            \
> > +     __r = csr_read_clear(CSR_IREG, __v);    \
> > +     __r;                                    \
> > +})
> > +
> > +#define imsic_csr_set(__c, __v)                      \
> > +do {                                         \
> > +     csr_write(CSR_ISELECT, __c);            \
> > +     csr_set(CSR_IREG, __v);                 \
> > +} while (0)
> > +
> > +#define imsic_csr_clear(__c, __v)            \
> > +do {                                         \
> > +     csr_write(CSR_ISELECT, __c);            \
> > +     csr_clear(CSR_IREG, __v);               \
> > +} while (0)
> > +
> > +struct imsic_priv *imsic;
> > +
> > +const struct imsic_global_config *imsic_get_global_config(void)
> > +{
> > +     return (imsic) ? &imsic->global : NULL;
>
> Nit: No need for the parenthesis.

Okay, I will update.

>
> > +}
> > +EXPORT_SYMBOL_GPL(imsic_get_global_config);
> > +
> > +static bool __imsic_eix_read_clear(unsigned long id, bool pend)
> > +{
> > +     unsigned long isel, imask;
> > +
> > +     isel = id / BITS_PER_LONG;
> > +     isel *= BITS_PER_LONG / IMSIC_EIPx_BITS;
> > +     isel += (pend) ? IMSIC_EIP0 : IMSIC_EIE0;
> > +     imask = BIT(id & (__riscv_xlen - 1));
> > +
> > +     return (imsic_csr_read_clear(isel, imask) & imask) ? true : false;
> > +}
> > +
> > +#define __imsic_id_read_clear_enabled(__id)          \
> > +     __imsic_eix_read_clear((__id), false)
> > +#define __imsic_id_read_clear_pending(__id)          \
> > +     __imsic_eix_read_clear((__id), true)
> > +
> > +void __imsic_eix_update(unsigned long base_id,
> > +                     unsigned long num_id, bool pend, bool val)
> > +{
> > +     unsigned long i, isel, ireg;
> > +     unsigned long id = base_id, last_id = base_id + num_id;
> > +
> > +     while (id < last_id) {
> > +             isel = id / BITS_PER_LONG;
> > +             isel *= BITS_PER_LONG / IMSIC_EIPx_BITS;
> > +             isel += (pend) ? IMSIC_EIP0 : IMSIC_EIE0;
>
> Parenthesis nit.

Okay, I will update.

>
> > +
> > +             ireg = 0;
> > +             for (i = id & (__riscv_xlen - 1);
> > +                  (id < last_id) && (i < __riscv_xlen); i++) {
> > +                     ireg |= BIT(i);
> > +                     id++;
> > +             }
> > +
> > +             /*
> > +              * The IMSIC EIEx and EIPx registers are indirectly
> > +              * accessed via using ISELECT and IREG CSRs so we
> > +              * need to access these CSRs without getting preempted.
> > +              *
> > +              * All existing users of this function call this
> > +              * function with local IRQs disabled so we don't
> > +              * need to do anything special here.
> > +              */
> > +             if (val)
> > +                     imsic_csr_set(isel, ireg);
> > +             else
> > +                     imsic_csr_clear(isel, ireg);
> > +     }
> > +}
> > +
> > +void imsic_local_sync(void)
> > +{
> > +     struct imsic_local_priv *lpriv = this_cpu_ptr(imsic->lpriv);
> > +     struct imsic_local_config *mlocal;
> > +     struct imsic_vector *mvec;
> > +     unsigned long flags;
> > +     int i;
> > +
> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> > +     for (i = 1; i <= imsic->global.nr_ids; i++) {
> > +             if (i == IMSIC_IPI_ID)
> > +                     continue;
> > +
> > +             if (test_bit(i, lpriv->ids_enabled_bitmap))
> > +                     __imsic_id_set_enable(i);
> > +             else
> > +                     __imsic_id_clear_enable(i);
> > +
> > +             mvec = lpriv->ids_move[i];
> > +             lpriv->ids_move[i] = NULL;
> > +             if (mvec) {
> > +                     if (__imsic_id_read_clear_pending(i)) {
> > +                             mlocal = per_cpu_ptr(imsic->global.local,
> > +                                                  mvec->cpu);
> > +                             writel(mvec->local_id, mlocal->msi_va);
>
> Again, do you need all the barriers? If yes, document. No, then relax
> the call.

Same comment as above.

>
> > +                     }
> > +
> > +                     lpriv->vectors[i].hwirq = UINT_MAX;
> > +                     lpriv->vectors[i].order = UINT_MAX;
> > +                     clear_bit(i, lpriv->ids_used_bitmap);
> > +             }
> > +
> > +     }
> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> > +}
> > +
> > +void imsic_local_delivery(bool enable)
> > +{
> > +     if (enable) {
> > +             imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_ENABLE_EITHRESHOLD);
> > +             imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_ENABLE_EIDELIVERY);
> > +     } else {
> > +             imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_DISABLE_EIDELIVERY);
> > +             imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_DISABLE_EITHRESHOLD);
> > +     }
>
> My regular "early exit" nit. I guess I really dislike indentation. ;-)

-ENOPARSE

>
> > +}
> > +
> > +#ifdef CONFIG_SMP
> > +static void imsic_remote_sync(unsigned int cpu)
> > +{
> > +     /*
> > +      * We simply inject ID synchronization IPI to a target CPU
> > +      * if it is not same as the current CPU. The ipi_send_mask()
> > +      * implementation of IPI mux will inject ID synchronization
> > +      * IPI only for CPUs that have enabled it so offline CPUs
> > +      * won't receive IPI. An offline CPU will unconditionally
> > +      * synchronize IDs through imsic_starting_cpu() when the
> > +      * CPU is brought up.
> > +      */
> > +     if (cpu_online(cpu)) {
> > +             if (cpu != smp_processor_id())
> > +                     __ipi_send_mask(imsic->ipi_lsync_desc, cpumask_of(cpu));
> > +             else
> > +                     imsic_local_sync();
> > +     }
> > +}
> > +#else
> > +static inline void imsic_remote_sync(unsigned int cpu)
>
> Remove inline.

Okay, I will update.

>
> > +{
> > +     imsic_local_sync();
> > +}
> > +#endif
> > +
> > +void imsic_vector_mask(struct imsic_vector *vec)
> > +{
> > +     struct imsic_local_priv *lpriv;
> > +     unsigned long flags;
> > +
> > +     lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> > +     if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> > +             return;
> > +
> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> > +     bitmap_clear(lpriv->ids_enabled_bitmap, vec->local_id, 1);
> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> > +
> > +     imsic_remote_sync(vec->cpu);
>
> x86 seems to set a timer instead, for the remote cpu cleanup, which can
> be much cheaper, and less in instrusive. Is that applicable here?

The issue with that approach is deciding the right duration
for the timer. There might be platforms that need an
immediate mask/unmask response. We can certainly
keep improving/tuning this over time.

>
> > +}
> > +
> > +void imsic_vector_unmask(struct imsic_vector *vec)
> > +{
> > +     struct imsic_local_priv *lpriv;
> > +     unsigned long flags;
> > +
> > +     lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> > +     if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> > +             return;
> > +
> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> > +     bitmap_set(lpriv->ids_enabled_bitmap, vec->local_id, 1);
> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> > +
> > +     imsic_remote_sync(vec->cpu);
> > +}
> > +
> > +void imsic_vector_move(struct imsic_vector *old_vec,
> > +                     struct imsic_vector *new_vec)
> > +{
> > +     struct imsic_local_priv *old_lpriv, *new_lpriv;
> > +     struct imsic_vector *ovec, *nvec;
> > +     unsigned long flags, flags1;
> > +     unsigned int i;
> > +
> > +     if (WARN_ON(old_vec->cpu == new_vec->cpu ||
> > +                 old_vec->order != new_vec->order ||
> > +                 (old_vec->local_id & IMSIC_VECTOR_MASK(old_vec)) ||
> > +                 (new_vec->local_id & IMSIC_VECTOR_MASK(new_vec))))
> > +             return;
> > +
> > +     old_lpriv = per_cpu_ptr(imsic->lpriv, old_vec->cpu);
> > +     if (WARN_ON(&old_lpriv->vectors[old_vec->local_id] != old_vec))
> > +             return;
> > +
> > +     new_lpriv = per_cpu_ptr(imsic->lpriv, new_vec->cpu);
> > +     if (WARN_ON(&new_lpriv->vectors[new_vec->local_id] != new_vec))
> > +             return;
> > +
> > +     raw_spin_lock_irqsave(&old_lpriv->ids_lock, flags);
> > +     raw_spin_lock_irqsave(&new_lpriv->ids_lock, flags1);
> > +
> > +     /* Move the state of each vector entry */
> > +     for (i = 0; i < BIT(old_vec->order); i++) {
> > +             ovec = old_vec + i;
> > +             nvec = new_vec + i;
> > +
> > +             /* Unmask the new vector entry */
> > +             if (test_bit(ovec->local_id, old_lpriv->ids_enabled_bitmap))
> > +                     bitmap_set(new_lpriv->ids_enabled_bitmap,
> > +                                nvec->local_id, 1);
> > +
> > +             /* Mask the old vector entry */
> > +             bitmap_clear(old_lpriv->ids_enabled_bitmap, ovec->local_id, 1);
> > +
> > +             /*
> > +              * Move and re-trigger the new vector entry based on the
> > +              * pending state of the old vector entry because we might
> > +              * get a device interrupt on the old vector entry while
> > +              * device was being moved to the new vector entry.
> > +              */
> > +             old_lpriv->ids_move[ovec->local_id] = nvec;
> > +     }
>
> Hmm, nested spinlocks, and reimplementing what the irq matrix allocator
> does.
>
> Convince me why irq matrix is not a good fit to track the interrupts IDs
> *and* get handling/tracking for managed/unmanaged interrupts. You said
> that it was the power-of-two blocks for MSI, but can't that be enfored
> on matrix alloc? Where are you doing the special handling of MSI?
>
> The reason I'm asking is because I'm pretty certain that x86 has proper
> MSI support (Thomas Gleixner can answer for sure! ;-))
>
> IMSIC smells a lot like the the LAPIC. The implementation could probably
> be *very* close to what arch/x86/kernel/apic/vector.c does.
>
> Am I completly off here?
>

The x86 APIC driver only supports MSI-X, which is why the IRQ matrix
allocator only supports ID/vector allocation suitable for MSI-X, whereas
the ARM GICv3 driver supports both legacy MSI and MSI-X. In the absence
of legacy MSI support, Linux x86 falls back to INTx for PCI devices
that only have legacy MSI, but on RISC-V platforms we can't assume that
INTx is available because we might be dealing with an IMSIC-only
platform.

Refer, x86_vector_msi_parent_ops in arch/x86/kernel/apic/msi.c and
X86_VECTOR_MSI_FLAGS_SUPPORTED in arch/x86/include/asm/msi.h

Refer, its_pci_msi_domain_info in drivers/irqchip/irq-gic-v3-its-pci-msi.c

The changes which I think are needed in the IRQ matrix allocator before
integrating it in the IMSIC driver are the following:
1) The IRQ matrix allocator assumes NR_VECTORS to be a fixed define
    provided by the arch code, but in the RISC-V world the number of
    IDs is discovered from DT or ACPI.
2) The IRQ matrix allocator needs to support allocating multiple vectors
    in power-of-2 blocks, which will allow the IMSIC driver to support
    both legacy MSI and MSI-X. This will involve changing the way the
    best CPU is found, the way bitmap APIs are used, and adding some
    new APIs for allocating vectors in power-of-2 blocks.
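
For item 2, a hypothetical interface (this does not exist in the kernel
today, it is only meant to illustrate the shape of the change) could be
something like:

	/*
	 * Allocate a naturally aligned block of 2^order vectors on the
	 * best CPU in @msk; hypothetical extension of irq_matrix_alloc().
	 */
	int irq_matrix_alloc_order(struct irq_matrix *m, const struct cpumask *msk,
				   unsigned int order, unsigned int *mapped_cpu);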

Based on the above, I suggest we keep the integration of the IRQ matrix
allocator in the IMSIC driver as a separate series, which will allow
us to unblock other series (such as AIA ACPI support, power
management related changes in the AIA drivers, etc).

Regards,
Anup
  
Björn Töpel Oct. 25, 2023, 4:05 p.m. UTC | #5
Hi!

Anup Patel <apatel@ventanamicro.com> writes:

> On Tue, Oct 24, 2023 at 6:35 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> Hi Anup!
>>
>> Wow, I'm really happy to see that you're moving towards the 1-1 model!
>>
>> Anup Patel <apatel@ventanamicro.com> writes:
>>
>> > The RISC-V advanced interrupt architecture (AIA) specification
>> > defines a new MSI controller called incoming message signalled
>> > interrupt controller (IMSIC) which manages MSI on per-HART (or
>> > per-CPU) basis. It also supports IPIs as software injected MSIs.
>> > (For more details refer https://github.com/riscv/riscv-aia)
>> >
>> > Let us add an early irqchip driver for RISC-V IMSIC which sets
>> > up the IMSIC state and provide IPIs.
>>
>> It would help (reviewers, and future bugfixers) if you add (here or in
>> the cover) what design decisions you've taken instead of just saying
>> that you're now supporting IMSIC.
>
> I agree with the suggestion but this kind of information should be
> in the source itself rather than commit description.

I think the high-level flow, and why you made certain design decisions
should be in the commit message.

The "how" in the code, the "why" in the commit message. Regardless -- it
would make it easier for reviewers to get into your code faster.

[...]

>> > +
>> > +     writel(IMSIC_IPI_ID, local->msi_va);
>>
>> Do you need the barriers here? If so, please document. If not, use the
>> _releaxed() version.
>
> We can't assume that _relaxed version of MMIO operations
> will work for RISC-V implementation so we conservatively
> use regular MMIO operations without _releaxed().

You'll need to expand on your thinking here, Anup. We can't just
sprinkle fences everywhere because of "we can't assume it'll work". Do
you need proper barriers for IPIs or not?

[...]

>> > +             mvec = lpriv->ids_move[i];
>> > +             lpriv->ids_move[i] = NULL;
>> > +             if (mvec) {
>> > +                     if (__imsic_id_read_clear_pending(i)) {
>> > +                             mlocal = per_cpu_ptr(imsic->global.local,
>> > +                                                  mvec->cpu);
>> > +                             writel(mvec->local_id, mlocal->msi_va);
>>
>> Again, do you need all the barriers? If yes, document. No, then relax
>> the call.
>
> Same comment as above.

Dito for me! ;-)

>> > +                     }
>> > +
>> > +                     lpriv->vectors[i].hwirq = UINT_MAX;
>> > +                     lpriv->vectors[i].order = UINT_MAX;
>> > +                     clear_bit(i, lpriv->ids_used_bitmap);
>> > +             }
>> > +
>> > +     }
>> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
>> > +}
>> > +
>> > +void imsic_local_delivery(bool enable)
>> > +{
>> > +     if (enable) {
>> > +             imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_ENABLE_EITHRESHOLD);
>> > +             imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_ENABLE_EIDELIVERY);
>> > +     } else {
>> > +             imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_DISABLE_EIDELIVERY);
>> > +             imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_DISABLE_EITHRESHOLD);
>> > +     }
>>
>> My regular "early exit" nit. I guess I really dislike indentation. ;-)
>
> -ENOPARSE

if (...) {
  a();
  b();
  c();
} else {
  d();
  e();
}

vs

if (...) {
  a();
  b();
  c();
  return;
}

d();
e();
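
Applied to imsic_local_delivery() from the patch, the suggestion would
look something like this (sketch):

	void imsic_local_delivery(bool enable)
	{
		if (enable) {
			imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_ENABLE_EITHRESHOLD);
			imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_ENABLE_EIDELIVERY);
			return;
		}

		imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_DISABLE_EIDELIVERY);
		imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_DISABLE_EITHRESHOLD);
	}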

[...]

>> > +void imsic_vector_mask(struct imsic_vector *vec)
>> > +{
>> > +     struct imsic_local_priv *lpriv;
>> > +     unsigned long flags;
>> > +
>> > +     lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
>> > +     if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
>> > +             return;
>> > +
>> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
>> > +     bitmap_clear(lpriv->ids_enabled_bitmap, vec->local_id, 1);
>> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
>> > +
>> > +     imsic_remote_sync(vec->cpu);
>>
>> x86 seems to set a timer instead, for the remote cpu cleanup, which can
>> be much cheaper, and less in instrusive. Is that applicable here?
>
> The issue with that approach is deciding the right duration
> of timer interrupt. There might be platforms who need
> immediate mask/unmask response. We can certainely
> keep improving/tuning this over-time.

Any concrete examples where this is an actual problem?

[...]

>> > +void imsic_vector_move(struct imsic_vector *old_vec,
>> > +                     struct imsic_vector *new_vec)
>> > +{
>> > +     struct imsic_local_priv *old_lpriv, *new_lpriv;
>> > +     struct imsic_vector *ovec, *nvec;
>> > +     unsigned long flags, flags1;
>> > +     unsigned int i;
>> > +
>> > +     if (WARN_ON(old_vec->cpu == new_vec->cpu ||
>> > +                 old_vec->order != new_vec->order ||
>> > +                 (old_vec->local_id & IMSIC_VECTOR_MASK(old_vec)) ||
>> > +                 (new_vec->local_id & IMSIC_VECTOR_MASK(new_vec))))
>> > +             return;
>> > +
>> > +     old_lpriv = per_cpu_ptr(imsic->lpriv, old_vec->cpu);
>> > +     if (WARN_ON(&old_lpriv->vectors[old_vec->local_id] != old_vec))
>> > +             return;
>> > +
>> > +     new_lpriv = per_cpu_ptr(imsic->lpriv, new_vec->cpu);
>> > +     if (WARN_ON(&new_lpriv->vectors[new_vec->local_id] != new_vec))
>> > +             return;
>> > +
>> > +     raw_spin_lock_irqsave(&old_lpriv->ids_lock, flags);
>> > +     raw_spin_lock_irqsave(&new_lpriv->ids_lock, flags1);
>> > +
>> > +     /* Move the state of each vector entry */
>> > +     for (i = 0; i < BIT(old_vec->order); i++) {
>> > +             ovec = old_vec + i;
>> > +             nvec = new_vec + i;
>> > +
>> > +             /* Unmask the new vector entry */
>> > +             if (test_bit(ovec->local_id, old_lpriv->ids_enabled_bitmap))
>> > +                     bitmap_set(new_lpriv->ids_enabled_bitmap,
>> > +                                nvec->local_id, 1);
>> > +
>> > +             /* Mask the old vector entry */
>> > +             bitmap_clear(old_lpriv->ids_enabled_bitmap, ovec->local_id, 1);
>> > +
>> > +             /*
>> > +              * Move and re-trigger the new vector entry based on the
>> > +              * pending state of the old vector entry because we might
>> > +              * get a device interrupt on the old vector entry while
>> > +              * device was being moved to the new vector entry.
>> > +              */
>> > +             old_lpriv->ids_move[ovec->local_id] = nvec;
>> > +     }
>>
>> Hmm, nested spinlocks, and reimplementing what the irq matrix allocator
>> does.
>>
>> Convince me why irq matrix is not a good fit to track the interrupts IDs
>> *and* get handling/tracking for managed/unmanaged interrupts. You said
>> that it was the power-of-two blocks for MSI, but can't that be enfored
>> on matrix alloc? Where are you doing the special handling of MSI?
>>
>> The reason I'm asking is because I'm pretty certain that x86 has proper
>> MSI support (Thomas Gleixner can answer for sure! ;-))
>>
>> IMSIC smells a lot like the the LAPIC. The implementation could probably
>> be *very* close to what arch/x86/kernel/apic/vector.c does.
>>
>> Am I completly off here?
>>
>
> The x86 APIC driver only supports MSI-X due to which the IRQ matrix
> allocator only supports ID/Vector allocation suitable for MSI-X whereas
> the ARM GICv3 driver supports both legacy MSI and MSI-X. In absence
> of legacy MSI support, Linux x86 will fallback to INTx for PCI devices
> with legacy MSI support but for RISC-V platforms we can't assume that
> INTx is available because we might be dealing with an IMSIC-only
> platform.

You're mixing up MSI and *multi-MSI* (multiple MSI vectors).

x86 supports MSI-X, MSI, and multi-MSI with an IOMMU.

Gleixner has a some insights on why one probably should *not* jump
through hoops to support multi-MSI:
https://lore.kernel.org/all/877d7yhve7.ffs@tglx/

Will we really see HW requiring multi-MSI support on RISC-V systems
without IOMMU? To me this sounds like a theoretical exercise.

> Refer, x86_vector_msi_parent_ops in arch/x86/kernel/apic/msi.c and
> X86_VECTOR_MSI_FLAGS_SUPPORTED in arch/x86/include/asm/msi.h
>
> Refer, its_pci_msi_domain_info in drivers/irqchip/irq-gic-v3-its-pci-msi.c
>
> The changes which I think are need in the IRQ matrix allocator before
> integrating it in the IMSIC driver are the following:
> 1) IRQ matrix allocator assumed NR_VECTORS to be a fixed define
>     which the arch code provides but in RISC-V world the number of
>     IDs are discovered from DT or ACPI.

Ok, let's try to be a bit more explicit. Have you had a look at
kernel/irq/matrix.c?

You need to define the IRQ_MATRIX_BITS (which x86 sets to NR_VECTORS).
This is the size of the bitmap. For IMSIC this would be 2047.

The matrix allocator is an excellent fit, modulo multi-MSI. It's battle
proven code.
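
Something along these lines -- completely untested, just to illustrate;
"imsic->matrix" is a made-up field, and it assumes the compile-time
matrix_bits cap is lifted (e.g. by sizing the bitmaps at allocation
time):

static int __init imsic_matrix_init(void)
{
	struct imsic_global_config *global = &imsic->global;

	/* Valid IDs are 1..nr_ids; ID 0 is never implemented */
	imsic->matrix = irq_alloc_matrix(global->nr_ids + 1, 1,
					 global->nr_ids + 1);
	if (!imsic->matrix)
		return -ENOMEM;

	/* The IPI ID is used internally and never handed to devices */
	irq_matrix_assign_system(imsic->matrix, IMSIC_IPI_ID, false);

	return 0;
}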

> 2) IRQ matrix allocator needs to be support allocator multiple vectors
>     in power-of-2 which will allow IMSIC driver to support both legacy
>     MSI and MSI-X. This will involve changing the way best CPU is
>     found, the way bitmap APIs are used and adding some new APIs
>     for allocate vectors in power-of-2

...and all the other things multi-MSI requires.

> Based on above, I suggest we keep the integration of IRQ matrix
> allocator in the IMSIC driver as a separate series which will allow
> us to unblock other series (such as AIA ACPI support, power
> managment related changes in AIA drivers, etc).

I suggest removing the multi-MSI support and using the matrix allocator.
We have something that looks like what x86 has (IMSIC). We have a
battle-proven implementation and helper functions. In my view it would
be just weird not to piggy-back on that work, and benefit from years of
bugfixes/things we haven't thought of.

Finally; I don't see that you're handling managed interrupts in the
series (Oh, the matrix allocator has support for that! ;-)).

I realize it's some changes, but the interrupt handling is a central
piece.

If you agree with my input, LMK if you're time/work-constrained, and I
can take a stab at integrating it in the series.


Björn
  
Anup Patel Oct. 25, 2023, 5:25 p.m. UTC | #6
On Wed, Oct 25, 2023 at 9:35 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Hi!
>
> Anup Patel <apatel@ventanamicro.com> writes:
>
> > On Tue, Oct 24, 2023 at 6:35 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>
> >> Hi Anup!
> >>
> >> Wow, I'm really happy to see that you're moving towards the 1-1 model!
> >>
> >> Anup Patel <apatel@ventanamicro.com> writes:
> >>
> >> > The RISC-V advanced interrupt architecture (AIA) specification
> >> > defines a new MSI controller called incoming message signalled
> >> > interrupt controller (IMSIC) which manages MSI on per-HART (or
> >> > per-CPU) basis. It also supports IPIs as software injected MSIs.
> >> > (For more details refer https://github.com/riscv/riscv-aia)
> >> >
> >> > Let us add an early irqchip driver for RISC-V IMSIC which sets
> >> > up the IMSIC state and provide IPIs.
> >>
> >> It would help (reviewers, and future bugfixers) if you add (here or in
> >> the cover) what design decisions you've taken instead of just saying
> >> that you're now supporting IMSIC.
> >
> > I agree with the suggestion but this kind of information should be
> > in the source itself rather than commit description.
>
> I think the high-level flow, and why you made certain design decisions
> should be in the commit message.
>
> The "how" in the code, the "why" in the commit message. Regardless -- it
> would make it easier for reviewers to get into your code faster.
>
> [...]
>
> >> > +
> >> > +     writel(IMSIC_IPI_ID, local->msi_va);
> >>
> >> Do you need the barriers here? If so, please document. If not, use the
> >> _releaxed() version.
> >
> > We can't assume that _relaxed version of MMIO operations
> > will work for RISC-V implementation so we conservatively
> > use regular MMIO operations without _releaxed().
>
> You'll need to expand on your thinking here, Anup. We can't just
> sprinkle fences everywhere because of "we can't assume it'll work". Do
> you need proper barriers for IPIs or not?

For IPIs, we use the generic IPI-mux which has its own barriers. We
certainly need matching read and write barriers for the data being
passed for synchronization.

>
> [...]
>
> >> > +             mvec = lpriv->ids_move[i];
> >> > +             lpriv->ids_move[i] = NULL;
> >> > +             if (mvec) {
> >> > +                     if (__imsic_id_read_clear_pending(i)) {
> >> > +                             mlocal = per_cpu_ptr(imsic->global.local,
> >> > +                                                  mvec->cpu);
> >> > +                             writel(mvec->local_id, mlocal->msi_va);
> >>
> >> Again, do you need all the barriers? If yes, document. No, then relax
> >> the call.
> >
> > Same comment as above.
>
> Dito for me! ;-)
>
> >> > +                     }
> >> > +
> >> > +                     lpriv->vectors[i].hwirq = UINT_MAX;
> >> > +                     lpriv->vectors[i].order = UINT_MAX;
> >> > +                     clear_bit(i, lpriv->ids_used_bitmap);
> >> > +             }
> >> > +
> >> > +     }
> >> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> >> > +}
> >> > +
> >> > +void imsic_local_delivery(bool enable)
> >> > +{
> >> > +     if (enable) {
> >> > +             imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_ENABLE_EITHRESHOLD);
> >> > +             imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_ENABLE_EIDELIVERY);
> >> > +     } else {
> >> > +             imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_DISABLE_EIDELIVERY);
> >> > +             imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_DISABLE_EITHRESHOLD);
> >> > +     }
> >>
> >> My regular "early exit" nit. I guess I really dislike indentation. ;-)
> >
> > -ENOPARSE
>
> if (...) {
>   a();
>   b();
>   c();
> } else {
>   d();
>   e();
> }
>
> vs
>
> if (...) {
>   a();
>   b();
>   c();
>   return;
> }
>
> d();
> e();
>
> [...]
>
> >> > +void imsic_vector_mask(struct imsic_vector *vec)
> >> > +{
> >> > +     struct imsic_local_priv *lpriv;
> >> > +     unsigned long flags;
> >> > +
> >> > +     lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> >> > +     if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> >> > +             return;
> >> > +
> >> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> >> > +     bitmap_clear(lpriv->ids_enabled_bitmap, vec->local_id, 1);
> >> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> >> > +
> >> > +     imsic_remote_sync(vec->cpu);
> >>
> >> x86 seems to set a timer instead, for the remote cpu cleanup, which can
> >> be much cheaper, and less in instrusive. Is that applicable here?
> >
> > The issue with that approach is deciding the right duration
> > of timer interrupt. There might be platforms who need
> > immediate mask/unmask response. We can certainely
> > keep improving/tuning this over-time.
>
> Any concrete examples where this is an actual problem?

Do you have a concrete timer duration with proper justification?

>
> [...]
>
> >> > +void imsic_vector_move(struct imsic_vector *old_vec,
> >> > +                     struct imsic_vector *new_vec)
> >> > +{
> >> > +     struct imsic_local_priv *old_lpriv, *new_lpriv;
> >> > +     struct imsic_vector *ovec, *nvec;
> >> > +     unsigned long flags, flags1;
> >> > +     unsigned int i;
> >> > +
> >> > +     if (WARN_ON(old_vec->cpu == new_vec->cpu ||
> >> > +                 old_vec->order != new_vec->order ||
> >> > +                 (old_vec->local_id & IMSIC_VECTOR_MASK(old_vec)) ||
> >> > +                 (new_vec->local_id & IMSIC_VECTOR_MASK(new_vec))))
> >> > +             return;
> >> > +
> >> > +     old_lpriv = per_cpu_ptr(imsic->lpriv, old_vec->cpu);
> >> > +     if (WARN_ON(&old_lpriv->vectors[old_vec->local_id] != old_vec))
> >> > +             return;
> >> > +
> >> > +     new_lpriv = per_cpu_ptr(imsic->lpriv, new_vec->cpu);
> >> > +     if (WARN_ON(&new_lpriv->vectors[new_vec->local_id] != new_vec))
> >> > +             return;
> >> > +
> >> > +     raw_spin_lock_irqsave(&old_lpriv->ids_lock, flags);
> >> > +     raw_spin_lock_irqsave(&new_lpriv->ids_lock, flags1);
> >> > +
> >> > +     /* Move the state of each vector entry */
> >> > +     for (i = 0; i < BIT(old_vec->order); i++) {
> >> > +             ovec = old_vec + i;
> >> > +             nvec = new_vec + i;
> >> > +
> >> > +             /* Unmask the new vector entry */
> >> > +             if (test_bit(ovec->local_id, old_lpriv->ids_enabled_bitmap))
> >> > +                     bitmap_set(new_lpriv->ids_enabled_bitmap,
> >> > +                                nvec->local_id, 1);
> >> > +
> >> > +             /* Mask the old vector entry */
> >> > +             bitmap_clear(old_lpriv->ids_enabled_bitmap, ovec->local_id, 1);
> >> > +
> >> > +             /*
> >> > +              * Move and re-trigger the new vector entry based on the
> >> > +              * pending state of the old vector entry because we might
> >> > +              * get a device interrupt on the old vector entry while
> >> > +              * device was being moved to the new vector entry.
> >> > +              */
> >> > +             old_lpriv->ids_move[ovec->local_id] = nvec;
> >> > +     }
> >>
> >> Hmm, nested spinlocks, and reimplementing what the irq matrix allocator
> >> does.
> >>
> >> Convince me why irq matrix is not a good fit to track the interrupts IDs
> >> *and* get handling/tracking for managed/unmanaged interrupts. You said
> >> that it was the power-of-two blocks for MSI, but can't that be enfored
> >> on matrix alloc? Where are you doing the special handling of MSI?
> >>
> >> The reason I'm asking is because I'm pretty certain that x86 has proper
> >> MSI support (Thomas Gleixner can answer for sure! ;-))
> >>
> >> IMSIC smells a lot like the the LAPIC. The implementation could probably
> >> be *very* close to what arch/x86/kernel/apic/vector.c does.
> >>
> >> Am I completly off here?
> >>
> >
> > The x86 APIC driver only supports MSI-X due to which the IRQ matrix
> > allocator only supports ID/Vector allocation suitable for MSI-X whereas
> > the ARM GICv3 driver supports both legacy MSI and MSI-X. In absence
> > of legacy MSI support, Linux x86 will fallback to INTx for PCI devices
> > with legacy MSI support but for RISC-V platforms we can't assume that
> > INTx is available because we might be dealing with an IMSIC-only
> > platform.
>
> You're mixing up MSI and *multi-MSI* (multiple MSI vectors).

So now you are doubting my understanding of MSI?

>
> x86 support MSI-X, MSI, and multi-MSI with IOMMU.
>
> Gleixner has a some insights on why one probably should *not* jump
> through hoops to support multi-MSI:
> https://lore.kernel.org/all/877d7yhve7.ffs@tglx/

This is a fair justification for why x86 does not support
legacy-MSI or "multi-MSI".

>
> Will we really see HW requiring multi-MSI support on RISC-V systems
> without IOMMU? To me this sounds like a theoretical exercise.
>
> > Refer, x86_vector_msi_parent_ops in arch/x86/kernel/apic/msi.c and
> > X86_VECTOR_MSI_FLAGS_SUPPORTED in arch/x86/include/asm/msi.h
> >
> > Refer, its_pci_msi_domain_info in drivers/irqchip/irq-gic-v3-its-pci-msi.c
> >
> > The changes which I think are need in the IRQ matrix allocator before
> > integrating it in the IMSIC driver are the following:
> > 1) IRQ matrix allocator assumed NR_VECTORS to be a fixed define
> >     which the arch code provides but in RISC-V world the number of
> >     IDs are discovered from DT or ACPI.
>
> Ok, let's try to be bit more explicit. Have you had a look at
> kernel/irq/matrix.c?

Why do you doubt it?

>
> You need to define the IRQ_MATRIX_BITS (which x86 sets to NR_VECTORS).
> This is the size of the bitmap. For IMSIC this would be 2047.

Wow, let's just create large bitmaps even when the underlying HW has
fewer per-CPU IDs!!!

>
> The matrix allocator is an excellent fit, modulo multi-MSI. It's battle
> proven code.
>
> > 2) IRQ matrix allocator needs to be support allocator multiple vectors
> >     in power-of-2 which will allow IMSIC driver to support both legacy
> >     MSI and MSI-X. This will involve changing the way best CPU is
> >     found, the way bitmap APIs are used and adding some new APIs
> >     for allocate vectors in power-of-2
>
> ...and all the other things multi-MSI requires.
>
> > Based on above, I suggest we keep the integration of IRQ matrix
> > allocator in the IMSIC driver as a separate series which will allow
> > us to unblock other series (such as AIA ACPI support, power
> > managment related changes in AIA drivers, etc).
>
> I suggest removing the multi-MSI support, and use the matrix allocator.
> We have something that looks like what x86 has (IMSIC). We have a
> battle-proven implementation, and helper function. In my view it would
> be just weird not to piggy-back on that work, and benefit from years of
> bugfixes/things we haven't thought of.
>
> Finally; I don't see that you're handling managed interrupt in the
> series (Oh, the matrix allocator has support for that! ;-)).

We don't need managed interrupts like x86 does. We are using
IPI-mux to create multiple virtual IPIs on top of a single ID and we
use some of these virtual IPIs for internal management.

>
> I realize it's some changes, but the interrupt handling is a central
> piece.
>
> If you agree with my input, LMK if you're time/work-constrained, and I
> can take a stab at integrating it in the series.
>
>
> Björn

Regards,
Anup
  
Björn Töpel Oct. 26, 2023, 8:51 a.m. UTC | #7
Hi Anup,

I'm getting the vibes that you are upset. Just to clarify; I want AIA
support as much as the next guy. I'm not here to pick fights or argue
about non-technical things. I'm just here for
questions/clarifications/suggestions, so we can move the implementation
forward.

If I for some reason offended you, please let me know. If that was the
case, that was not on purpose; please accept my apologies.

Now, please let's continue the technical discussion.

Anup Patel <apatel@ventanamicro.com> writes:

>> >> > +
>> >> > +     writel(IMSIC_IPI_ID, local->msi_va);
>> >>
>> >> Do you need the barriers here? If so, please document. If not, use the
>> >> _releaxed() version.
>> >
>> > We can't assume that _relaxed version of MMIO operations
>> > will work for RISC-V implementation so we conservatively
>> > use regular MMIO operations without _releaxed().
>>
>> You'll need to expand on your thinking here, Anup. We can't just
>> sprinkle fences everywhere because of "we can't assume it'll work". Do
>> you need proper barriers for IPIs or not?
>
> For IPIs, we use generic IPI-mux which has its own barriers. We
> certainly need matching read and write barrier for the data being
> passed for synchronization.

Ok! If the IPI-mux has the barriers, it seems like a writel_relaxed will
do just fine.
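
I.e. something like this in imsic_ipi_send() -- sketch only, assuming
the IPI-mux ordering guarantee you describe above holds:

static void imsic_ipi_send(unsigned int cpu)
{
	struct imsic_local_config *local =
				per_cpu_ptr(imsic->global.local, cpu);

	/* Ordering vs. the IPI data is provided by the generic IPI-mux */
	writel_relaxed(IMSIC_IPI_ID, local->msi_va);
}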

>> >> > +void imsic_vector_mask(struct imsic_vector *vec)
>> >> > +{
>> >> > +     struct imsic_local_priv *lpriv;
>> >> > +     unsigned long flags;
>> >> > +
>> >> > +     lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
>> >> > +     if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
>> >> > +             return;
>> >> > +
>> >> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
>> >> > +     bitmap_clear(lpriv->ids_enabled_bitmap, vec->local_id, 1);
>> >> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
>> >> > +
>> >> > +     imsic_remote_sync(vec->cpu);
>> >>
>> >> x86 seems to set a timer instead, for the remote cpu cleanup, which can
>> >> be much cheaper, and less in instrusive. Is that applicable here?
>> >
>> > The issue with that approach is deciding the right duration
>> > of timer interrupt. There might be platforms who need
>> > immediate mask/unmask response. We can certainely
>> > keep improving/tuning this over-time.
>>
>> Any concrete examples where this is an actual problem?
>
> Do you have a concrete timer duration with proper justification ?

I would simply mimic what x86 does for now -- jiffies + 1.

No biggie for me, and this can, as you say, be improved later.

>> >> > +void imsic_vector_move(struct imsic_vector *old_vec,
>> >> > +                     struct imsic_vector *new_vec)
>> >> > +{
>> >> > +     struct imsic_local_priv *old_lpriv, *new_lpriv;
>> >> > +     struct imsic_vector *ovec, *nvec;
>> >> > +     unsigned long flags, flags1;
>> >> > +     unsigned int i;
>> >> > +
>> >> > +     if (WARN_ON(old_vec->cpu == new_vec->cpu ||
>> >> > +                 old_vec->order != new_vec->order ||
>> >> > +                 (old_vec->local_id & IMSIC_VECTOR_MASK(old_vec)) ||
>> >> > +                 (new_vec->local_id & IMSIC_VECTOR_MASK(new_vec))))
>> >> > +             return;
>> >> > +
>> >> > +     old_lpriv = per_cpu_ptr(imsic->lpriv, old_vec->cpu);
>> >> > +     if (WARN_ON(&old_lpriv->vectors[old_vec->local_id] != old_vec))
>> >> > +             return;
>> >> > +
>> >> > +     new_lpriv = per_cpu_ptr(imsic->lpriv, new_vec->cpu);
>> >> > +     if (WARN_ON(&new_lpriv->vectors[new_vec->local_id] != new_vec))
>> >> > +             return;
>> >> > +
>> >> > +     raw_spin_lock_irqsave(&old_lpriv->ids_lock, flags);
>> >> > +     raw_spin_lock_irqsave(&new_lpriv->ids_lock, flags1);
>> >> > +
>> >> > +     /* Move the state of each vector entry */
>> >> > +     for (i = 0; i < BIT(old_vec->order); i++) {
>> >> > +             ovec = old_vec + i;
>> >> > +             nvec = new_vec + i;
>> >> > +
>> >> > +             /* Unmask the new vector entry */
>> >> > +             if (test_bit(ovec->local_id, old_lpriv->ids_enabled_bitmap))
>> >> > +                     bitmap_set(new_lpriv->ids_enabled_bitmap,
>> >> > +                                nvec->local_id, 1);
>> >> > +
>> >> > +             /* Mask the old vector entry */
>> >> > +             bitmap_clear(old_lpriv->ids_enabled_bitmap, ovec->local_id, 1);
>> >> > +
>> >> > +             /*
>> >> > +              * Move and re-trigger the new vector entry based on the
>> >> > +              * pending state of the old vector entry because we might
>> >> > +              * get a device interrupt on the old vector entry while
>> >> > +              * device was being moved to the new vector entry.
>> >> > +              */
>> >> > +             old_lpriv->ids_move[ovec->local_id] = nvec;
>> >> > +     }
>> >>
>> >> Hmm, nested spinlocks, and reimplementing what the irq matrix allocator
>> >> does.
>> >>
>> >> Convince me why irq matrix is not a good fit to track the interrupts IDs
>> >> *and* get handling/tracking for managed/unmanaged interrupts. You said
>> >> that it was the power-of-two blocks for MSI, but can't that be enfored
>> >> on matrix alloc? Where are you doing the special handling of MSI?
>> >>
>> >> The reason I'm asking is because I'm pretty certain that x86 has proper
>> >> MSI support (Thomas Gleixner can answer for sure! ;-))
>> >>
>> >> IMSIC smells a lot like the the LAPIC. The implementation could probably
>> >> be *very* close to what arch/x86/kernel/apic/vector.c does.
>> >>
>> >> Am I completly off here?
>> >>
>> >
>> > The x86 APIC driver only supports MSI-X due to which the IRQ matrix
>> > allocator only supports ID/Vector allocation suitable for MSI-X whereas
>> > the ARM GICv3 driver supports both legacy MSI and MSI-X. In absence
>> > of legacy MSI support, Linux x86 will fallback to INTx for PCI devices
>> > with legacy MSI support but for RISC-V platforms we can't assume that
>> > INTx is available because we might be dealing with an IMSIC-only
>> > platform.
>>
>> You're mixing up MSI and *multi-MSI* (multiple MSI vectors).
>
> So now you are doubting my understanding of MSI ?

I'm not doubting anything. Maybe we need to clarify so that we
understand each other.

You said: "The x86 APIC driver only supports MSI-X..." And that made me
think that you didn't have all the details. Sorry for making that
assumption.

Let's clear up the terminology, for our own sake:

  * legacy-MSI: MSI (non-MSIX!), with *only one vector*.
  * multi-MSI: MSI (non-MSIX!), with multiple vectors
  * MSI-X

"MSI" can also refer to all of the above.

x86 supports legacy-MSI and MSI-X for non-remapped MSIs, and multi-MSI
with IOMMU support.

>> x86 support MSI-X, MSI, and multi-MSI with IOMMU.
>>
>> Gleixner has a some insights on why one probably should *not* jump
>> through hoops to support multi-MSI:
>> https://lore.kernel.org/all/877d7yhve7.ffs@tglx/
>
> This is a fair justification to drop why x86 does not support
> the legacy-MSI or "multi-MSI".

My claim is that x86 does support legacy-MSI, but as a design decision,
has avoided multi-MSI.

AFAIU, there are few multi-MSI devices out there.

>> Will we really see HW requiring multi-MSI support on RISC-V systems
>> without IOMMU? To me this sounds like a theoretical exercise.
>>
>> > Refer, x86_vector_msi_parent_ops in arch/x86/kernel/apic/msi.c and
>> > X86_VECTOR_MSI_FLAGS_SUPPORTED in arch/x86/include/asm/msi.h
>> >
>> > Refer, its_pci_msi_domain_info in drivers/irqchip/irq-gic-v3-its-pci-msi.c
>> >
>> > The changes which I think are need in the IRQ matrix allocator before
>> > integrating it in the IMSIC driver are the following:
>> > 1) IRQ matrix allocator assumed NR_VECTORS to be a fixed define
>> >     which the arch code provides but in RISC-V world the number of
>> >     IDs are discovered from DT or ACPI.
>>
>> Ok, let's try to be bit more explicit. Have you had a look at
>> kernel/irq/matrix.c?
>
> Why do you doubt it ?

Again, no doubts -- I'm just trying to clarify. Sorry if that touched a
nerve!

>> You need to define the IRQ_MATRIX_BITS (which x86 sets to NR_VECTORS).
>> This is the size of the bitmap. For IMSIC this would be 2047.
>
> Wow, let's just create large bitmaps even when underlying HW has
> fewer per-CPU IDs !!!

Yeah, fair argument. It's a bit too much. Here's a patch to the matrix
allocator that fixes that. Note that it's only compile tested:

--8<--
From 2be4093a39b0560247289f8f4c8214cdacda7870 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= <bjorn@rivosinc.com>
Date: Thu, 26 Oct 2023 10:17:21 +0200
Subject: [PATCH] genirq/matrix: Dynamic bitmap allocation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Some (future) users of the irq matrix allocator do not know the size
of the bitmaps at compile time.

To avoid wasting memory on unnecessarily large bitmaps, size the bitmap
at matrix allocation time.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/x86/include/asm/hw_irq.h |  2 --
 kernel/irq/matrix.c           | 33 ++++++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 551829884734..dcfaa3812306 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -16,8 +16,6 @@
 
 #include <asm/irq_vectors.h>
 
-#define IRQ_MATRIX_BITS		NR_VECTORS
-
 #ifndef __ASSEMBLY__
 
 #include <linux/percpu.h>
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 1698e77645ac..16ce956935ca 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -8,8 +8,6 @@
 #include <linux/cpu.h>
 #include <linux/irq.h>
 
-#define IRQ_MATRIX_SIZE	(BITS_TO_LONGS(IRQ_MATRIX_BITS))
-
 struct cpumap {
 	unsigned int		available;
 	unsigned int		allocated;
@@ -17,8 +15,9 @@ struct cpumap {
 	unsigned int		managed_allocated;
 	bool			initialized;
 	bool			online;
-	unsigned long		alloc_map[IRQ_MATRIX_SIZE];
-	unsigned long		managed_map[IRQ_MATRIX_SIZE];
+	unsigned long		*alloc_map;
+	unsigned long		*managed_map;
+	unsigned long		bitmap_storage[];
 };
 
 struct irq_matrix {
@@ -32,8 +31,10 @@ struct irq_matrix {
 	unsigned int		total_allocated;
 	unsigned int		online_maps;
 	struct cpumap __percpu	*maps;
-	unsigned long		scratch_map[IRQ_MATRIX_SIZE];
-	unsigned long		system_map[IRQ_MATRIX_SIZE];
+	unsigned long		*scratch_map;
+	unsigned long		*system_map;
+	unsigned long		bitmap_storage[];
+
 };
 
 #define CREATE_TRACE_POINTS
@@ -50,24 +51,34 @@ __init struct irq_matrix *irq_alloc_matrix(unsigned int matrix_bits,
 					   unsigned int alloc_start,
 					   unsigned int alloc_end)
 {
+	unsigned int cpu, matrix_size = BITS_TO_LONGS(matrix_bits);
 	struct irq_matrix *m;
 
-	if (matrix_bits > IRQ_MATRIX_BITS)
-		return NULL;
-
-	m = kzalloc(sizeof(*m), GFP_KERNEL);
+	m = kzalloc(struct_size(m, bitmap_storage, matrix_size * 2), GFP_KERNEL);
 	if (!m)
 		return NULL;
 
+	m->scratch_map = &m->bitmap_storage[0];
+	m->system_map = &m->bitmap_storage[matrix_size];
+
 	m->matrix_bits = matrix_bits;
 	m->alloc_start = alloc_start;
 	m->alloc_end = alloc_end;
 	m->alloc_size = alloc_end - alloc_start;
-	m->maps = alloc_percpu(*m->maps);
+	m->maps = __alloc_percpu(struct_size(m->maps, bitmap_storage, matrix_size * 2),
+				 __alignof__(*m->maps));
 	if (!m->maps) {
 		kfree(m);
 		return NULL;
 	}
+
+	for_each_possible_cpu(cpu){
+		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
+
+		cm->alloc_map = &cm->bitmap_storage[0];
+		cm->managed_map = &cm->bitmap_storage[matrix_size];
+	}
+
 	return m;
 }
 

base-commit: 611da07b89fdd53f140d7b33013f255bf0ed8f34
  
Thomas Gleixner Oct. 28, 2023, 6:18 p.m. UTC | #8
On Thu, Oct 26 2023 at 10:51, Björn Töpel wrote:
>>> >> > +     raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
>>> >> > +     bitmap_clear(lpriv->ids_enabled_bitmap, vec->local_id, 1);
>>> >> > +     raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
>>> >> > +
>>> >> > +     imsic_remote_sync(vec->cpu);
>>> >>
>>> >> x86 seems to set a timer instead, for the remote cpu cleanup, which can
>>> >> be much cheaper, and less in instrusive. Is that applicable here?
>>> >
>>> > The issue with that approach is deciding the right duration
>>> > of timer interrupt. There might be platforms who need
>>> > immediate mask/unmask response. We can certainely
>>> > keep improving/tuning this over-time.
>>>
>>> Any concrete examples where this is an actual problem?
>>
>> Do you have a concrete timer duration with proper justification ?
>
> I would simply mimic what x86 does for now -- jiffies + 1.

That's good enough. The point is that the interrupt might still end up
on the old target CPU depending on timing, but the next one is
guaranteed to be targeted to the new target CPU.

So you can't clean up the vector on the old target immediately, but it
does not matter at all whether you clean it up 10ms or 10s later. It's
just wasting a vector on the old target.

Doing it with an IPI (as x86 did before) only works when the IPI vector
is of lower priority than the vector which got moved. Otherwise the IPI
will be served first, find the vector pending and then it's up a creek
without a paddle because it can't retrigger the IPI as that would again
be served first. So it can't clean up ever...

The timer just avoids this and as I said the delay is completely
irrelevant.
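
A minimal sketch of that for IMSIC (none of these names exist in the
posted driver; the per-CPU timer would be set up once in
imsic_local_init() via timer_setup() with TIMER_PINNED, and this would
replace the immediate lsync IPI in imsic_remote_sync()):

static DEFINE_PER_CPU(struct timer_list, imsic_cleanup_timer);

static void imsic_cleanup_timer_fn(struct timer_list *t)
{
	/* Pinned to the CPU owning the stale IDs, so the sync is local */
	imsic_local_sync();
}

static void imsic_schedule_cleanup(unsigned int cpu)
{
	struct timer_list *t = per_cpu_ptr(&imsic_cleanup_timer, cpu);

	/* An offline CPU resyncs via imsic_starting_cpu() on bring-up */
	if (!cpu_online(cpu))
		return;

	if (cpu == smp_processor_id()) {
		imsic_local_sync();
		return;
	}

	/*
	 * jiffies + 1 is good enough. Until the timer fires the stale
	 * ID is merely wasted on the old CPU; correctness does not
	 * depend on the delay.
	 */
	if (!timer_pending(t)) {
		t->expires = jiffies + 1;
		add_timer_on(t, cpu);
	}
}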

>>> >> The reason I'm asking is because I'm pretty certain that x86 has proper
>>> >> MSI support (Thomas Gleixner can answer for sure! ;-))

It has proper MSI support with some limitations.

>>> >> IMSIC smells a lot like the the LAPIC.

Eeew. :)

> My claim is that x86 does support legacy-MSI, but for design decision,
> has avoided multi-MSI.

There are two variants of PCI/MSI:

  1) MSI
  2) MSI-X

Neither of them is legacy and both support multiple vectors at the
device hardware level.

  #1 MSI

      Affinity setting requires to move all vectors to the new target in
      one go because the device gets only the base vector in the MSI
      message and uses the lower bits as index.

      So that's of limited use anyway because it's impossible to use
      that for multi-queue or other purposes where the main point is to
      spread the interrupts across CPUs.

      It does not have mandatory masking which makes affinity changes
      even more problematic at least on x86 because the update to the
      message store in the PCI config space is non-atomic. See the dance
      which is required for a single vector in msi_set_affinity().

      IOW, if the MSI message is directly delivered to the target CPU
      and the device does not support masking, then a single vector is
      already complex and multi-MSI support becomes a horror show.

      Another issue, especially on x86 with its limit of about 200
      device vectors per CPU, is the requirement to allocate a
      consecutive, power-of-2 aligned vector space. That gets you to
      vector exhaustion pretty fast.

      Those _are_ the reasons why x86 does not support multi-MSI without
      interrupt remapping. It just does the only sane thing and limits
      it to one vector per device.

      Interrupt remapping avoids the problem because it allows to steer
      the vectors individually and the affinity update is atomic. It
      obviously also lifts the requirement for a consecutive vector
      space.

      Seriously, w/o interrupt remapping or an equivalent translation
      mechanism which allows steering the vectors individually, multi-MSI
      is absolutely pointless and not worth the trouble to support.


  #2 MSI-X

       Has a message store per vector and mandatory per-vector masking
       which makes multi-vector support trivial even w/o interrupt
       remapping. It does not require a consecutive vector space either.

So if AIA is similar to the APIC, then single MSI needs the same dance
and multi-MSI needs that theatre ^ N.
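
To make the consecutive, power-of-2 aligned requirement concrete: a
multi-MSI device gets exactly one data value and is permitted to modify
its low log2(nvec) bits, so the ID raised for sub-vector 'n' is
effectively (illustration only, not from any driver):

static u32 multi_msi_id(u32 base_id, unsigned int n, unsigned int nvec)
{
	/* base_id must be aligned to nvec, which is a power of two */
	return base_id | (n & (nvec - 1));
}

Which is why the allocator has to hand out nvec consecutive IDs aligned
to nvec, and why an affinity change has to move the whole block in one
go.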

> AFAIU, there are few multi-MSI devices out there.

You wish. MSI-X is "more expensive" (probably 0.5 Cent). Now that
interrupt remapping is pretty much always available on x86, the problem
is "fixed" indirectly. So especially x86 on-chip devices still use MSI
and not MSI-X. MSI-X is primarily used in multi-queue devices as
multi-MSI is limited to 32 vectors.

Thanks,

        tglx
  
Thomas Gleixner Oct. 28, 2023, 6:34 p.m. UTC | #9
On Mon, Oct 23 2023 at 22:57, Anup Patel wrote:
> +#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> +void imsic_vector_debug_show(struct seq_file *m,
> +			     struct imsic_vector *vec, int ind)
> +{
> +	unsigned int mcpu = 0, mlocal_id = 0;
> +	struct imsic_local_priv *lpriv;
> +	bool move_in_progress = false;
> +	struct imsic_vector *mvec;
> +	bool is_enabled = false;
> +	unsigned long flags;
> +
> +	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
> +	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
> +		return;
> +
> +	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +	if (test_bit(vec->local_id, lpriv->ids_enabled_bitmap))
> +		is_enabled = true;
> +	mvec = lpriv->ids_move[vec->local_id];
> +	if (mvec) {
> +		move_in_progress = true;
> +		mcpu = mvec->cpu;
> +		mlocal_id = mvec->local_id;
> +	}
> +	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +
> +	seq_printf(m, "%*starget_cpu      : %5u\n", ind, "", vec->cpu);
> +	seq_printf(m, "%*starget_local_id : %5u\n", ind, "", vec->local_id);
> +	seq_printf(m, "%*sis_reserved     : %5u\n", ind, "",
> +		   (vec->local_id <= IMSIC_IPI_ID) ? 1 : 0);
> +	seq_printf(m, "%*sis_enabled      : %5u\n", ind, "",
> +		   (move_in_progress) ? 1 : 0);
> +	seq_printf(m, "%*sis_move_pending : %5u\n", ind, "",
> +		   (move_in_progress) ? 1 : 0);
> +	if (move_in_progress) {
> +		seq_printf(m, "%*smove_cpu        : %5u\n", ind, "", mcpu);
> +		seq_printf(m, "%*smove_local_id   : %5u\n", ind, "", mlocal_id);
> +	}
> +}
> +
> +void imsic_vector_debug_show_summary(struct seq_file *m, int ind)
> +{
> +	unsigned int cpu, total_avail = 0, total_used = 0;
> +	struct imsic_global_config *global = &imsic->global;
> +	struct imsic_local_priv *lpriv;
> +	unsigned long flags;
> +
> +	for_each_possible_cpu(cpu) {
> +		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
> +
> +		total_avail += global->nr_ids;
> +
> +		raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +		total_used += bitmap_weight(lpriv->ids_used_bitmap,
> +					    global->nr_ids + 1) - 1;
> +		raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +	}
> +
> +	seq_printf(m, "%*stotal : %5u\n", ind, "", total_avail);
> +	seq_printf(m, "%*sused  : %5u\n", ind, "", total_used);
> +	seq_printf(m, "%*s| CPU | tot | usd | vectors\n", ind, " ");
> +
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
> +
> +		raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +		total_used = bitmap_weight(lpriv->ids_used_bitmap,
> +					   global->nr_ids + 1) - 1;
> +		seq_printf(m, "%*s %4d  %4u  %4u  %*pbl\n", ind, " ",
> +			   cpu, global->nr_ids, total_used,
> +			   global->nr_ids + 1, lpriv->ids_used_bitmap);
> +		raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +	}
> +	cpus_read_unlock();

This looks very close to the matrix allocator information, just done differently.

> +static unsigned int imsic_vector_best_cpu(const struct cpumask *mask,
> +					  unsigned int order)
> +{
> +	struct imsic_global_config *global = &imsic->global;
> +	unsigned int cpu, best_cpu, free, maxfree = 0;
> +	struct imsic_local_priv *lpriv;
> +	unsigned long flags;
> +
> +	best_cpu = UINT_MAX;
> +	for_each_cpu(cpu, mask) {
> +		if (!cpu_online(cpu))
> +			continue;
> +
> +		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
> +		raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
> +		free = bitmap_weight(lpriv->ids_used_bitmap,
> +				     global->nr_ids + 1);
> +		free = (global->nr_ids + 1) - free;
> +		raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
> +		if (free < BIT(order) || free <= maxfree)
> +			continue;
> +
> +		best_cpu = cpu;
> +		maxfree = free;
> +	}
> +
> +	return best_cpu;

Looks very much like what the matrix allocator provides, right?

What's the actual reason that you can't use it?

Thanks,

        tglx
  

Patch

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index f7149d0f3d45..bdd80716114d 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -546,6 +546,12 @@  config SIFIVE_PLIC
 	select IRQ_DOMAIN_HIERARCHY
 	select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP
 
+config RISCV_IMSIC
+	bool
+	depends on RISCV
+	select IRQ_DOMAIN_HIERARCHY
+	select GENERIC_MSI_IRQ
+
 config EXYNOS_IRQ_COMBINER
 	bool "Samsung Exynos IRQ combiner support" if COMPILE_TEST
 	depends on (ARCH_EXYNOS && ARM) || COMPILE_TEST
diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index ffd945fe71aa..d714724387ce 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -95,6 +95,7 @@  obj-$(CONFIG_QCOM_MPM)			+= irq-qcom-mpm.o
 obj-$(CONFIG_CSKY_MPINTC)		+= irq-csky-mpintc.o
 obj-$(CONFIG_CSKY_APB_INTC)		+= irq-csky-apb-intc.o
 obj-$(CONFIG_RISCV_INTC)		+= irq-riscv-intc.o
+obj-$(CONFIG_RISCV_IMSIC)		+= irq-riscv-imsic-state.o irq-riscv-imsic-early.o
 obj-$(CONFIG_SIFIVE_PLIC)		+= irq-sifive-plic.o
 obj-$(CONFIG_IMX_IRQSTEER)		+= irq-imx-irqsteer.o
 obj-$(CONFIG_IMX_INTMUX)		+= irq-imx-intmux.o
diff --git a/drivers/irqchip/irq-riscv-imsic-early.c b/drivers/irqchip/irq-riscv-imsic-early.c
new file mode 100644
index 000000000000..23f689ff5807
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-imsic-early.c
@@ -0,0 +1,235 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#define pr_fmt(fmt) "riscv-imsic: " fmt
+#include <linux/cpu.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/irqchip.h>
+#include <linux/irqchip/chained_irq.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+
+#include "irq-riscv-imsic-state.h"
+
+static int imsic_parent_irq;
+
+#ifdef CONFIG_SMP
+static irqreturn_t imsic_local_sync_handler(int irq, void *data)
+{
+	imsic_local_sync();
+	return IRQ_HANDLED;
+}
+
+static void imsic_ipi_send(unsigned int cpu)
+{
+	struct imsic_local_config *local =
+				per_cpu_ptr(imsic->global.local, cpu);
+
+	writel(IMSIC_IPI_ID, local->msi_va);
+}
+
+static void imsic_ipi_starting_cpu(void)
+{
+	/* Enable IPIs for current CPU. */
+	__imsic_id_set_enable(IMSIC_IPI_ID);
+
+	/* Enable virtual IPI used for IMSIC ID synchronization */
+	enable_percpu_irq(imsic->ipi_virq, 0);
+}
+
+static void imsic_ipi_dying_cpu(void)
+{
+	/*
+	 * Disable virtual IPI used for IMSIC ID synchronization so
+	 * that we don't receive ID synchronization requests.
+	 */
+	disable_percpu_irq(imsic->ipi_virq);
+}
+
+static int __init imsic_ipi_domain_init(void)
+{
+	int virq;
+
+	/* Create IMSIC IPI multiplexing */
+	virq = ipi_mux_create(IMSIC_NR_IPI, imsic_ipi_send);
+	if (virq <= 0)
+		return (virq < 0) ? virq : -ENOMEM;
+	imsic->ipi_virq = virq;
+
+	/* First vIRQ is used for IMSIC ID synchronization */
+	virq = request_percpu_irq(imsic->ipi_virq, imsic_local_sync_handler,
+				  "riscv-imsic-lsync", imsic->global.local);
+	if (virq)
+		return virq;
+	irq_set_status_flags(imsic->ipi_virq, IRQ_HIDDEN);
+	imsic->ipi_lsync_desc = irq_to_desc(imsic->ipi_virq);
+
+	/* Set vIRQ range */
+	riscv_ipi_set_virq_range(imsic->ipi_virq + 1, IMSIC_NR_IPI - 1, true);
+
+	/* Announce that IMSIC is providing IPIs */
+	pr_info("%pfwP: providing IPIs using interrupt %d\n",
+		imsic->fwnode, IMSIC_IPI_ID);
+
+	return 0;
+}
+#else
+static void imsic_ipi_starting_cpu(void)
+{
+}
+
+static void imsic_ipi_dying_cpu(void)
+{
+}
+
+static int __init imsic_ipi_domain_init(void)
+{
+	return 0;
+}
+#endif
+
+/*
+ * To handle an interrupt, we read the TOPEI CSR and write zero in one
+ * instruction. If TOPEI CSR is non-zero then we translate TOPEI.ID to
+ * Linux interrupt number and let Linux IRQ subsystem handle it.
+ */
+static void imsic_handle_irq(struct irq_desc *desc)
+{
+	struct irq_chip *chip = irq_desc_get_chip(desc);
+	int err, cpu = smp_processor_id();
+	struct imsic_vector *vec;
+	unsigned long local_id;
+
+	chained_irq_enter(chip, desc);
+
+	while ((local_id = csr_swap(CSR_TOPEI, 0))) {
+		local_id = local_id >> TOPEI_ID_SHIFT;
+
+		if (local_id == IMSIC_IPI_ID) {
+#ifdef CONFIG_SMP
+			ipi_mux_process();
+#endif
+			continue;
+		}
+
+		if (unlikely(!imsic->base_domain))
+			continue;
+
+		vec = imsic_vector_from_local_id(cpu, local_id);
+		if (!vec) {
+			pr_warn_ratelimited(
+				"vector not found for local ID 0x%lx\n",
+				local_id);
+			continue;
+		}
+
+		err = generic_handle_domain_irq(imsic->base_domain,
+						vec->hwirq);
+		if (unlikely(err))
+			pr_warn_ratelimited(
+				"hwirq 0x%x mapping not found\n",
+				vec->hwirq);
+	}
+
+	chained_irq_exit(chip, desc);
+}
+
+static int imsic_starting_cpu(unsigned int cpu)
+{
+	/* Enable per-CPU parent interrupt */
+	enable_percpu_irq(imsic_parent_irq,
+			  irq_get_trigger_type(imsic_parent_irq));
+
+	/* Setup IPIs */
+	imsic_ipi_starting_cpu();
+
+	/*
+	 * Interrupts identities might have been enabled/disabled while
+	 * this CPU was not running so sync-up local enable/disable state.
+	 */
+	imsic_local_sync();
+
+	/* Enable local interrupt delivery */
+	imsic_local_delivery(true);
+
+	return 0;
+}
+
+static int imsic_dying_cpu(unsigned int cpu)
+{
+	/* Cleanup IPIs */
+	imsic_ipi_dying_cpu();
+
+	return 0;
+}
+
+static int __init imsic_early_probe(struct fwnode_handle *fwnode)
+{
+	int rc;
+	struct irq_domain *domain;
+
+	/* Find parent domain and register chained handler */
+	domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(),
+					  DOMAIN_BUS_ANY);
+	if (!domain) {
+		pr_err("%pfwP: Failed to find INTC domain\n", fwnode);
+		return -ENOENT;
+	}
+	imsic_parent_irq = irq_create_mapping(domain, RV_IRQ_EXT);
+	if (!imsic_parent_irq) {
+		pr_err("%pfwP: Failed to create INTC mapping\n", fwnode);
+		return -ENOENT;
+	}
+	irq_set_chained_handler(imsic_parent_irq, imsic_handle_irq);
+
+	/* Initialize IPI domain */
+	rc = imsic_ipi_domain_init();
+	if (rc) {
+		pr_err("%pfwP: Failed to initialize IPI domain\n", fwnode);
+		return rc;
+	}
+
+	/*
+	 * Setup cpuhp state (must be done after setting imsic_parent_irq)
+	 *
+	 * Don't disable per-CPU IMSIC file when CPU goes offline
+	 * because this affects IPI and the masking/unmasking of
+	 * virtual IPIs is done via generic IPI-Mux
+	 */
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+			  "irqchip/riscv/imsic:starting",
+			  imsic_starting_cpu, imsic_dying_cpu);
+
+	return 0;
+}
+
+static int __init imsic_early_dt_init(struct device_node *node,
+				      struct device_node *parent)
+{
+	int rc;
+	struct fwnode_handle *fwnode = &node->fwnode;
+
+	/* Setup IMSIC state */
+	rc = imsic_setup_state(fwnode);
+	if (rc) {
+		pr_err("%pfwP: failed to setup state (error %d)\n",
+			fwnode, rc);
+		return rc;
+	}
+
+	/* Do early setup of IPIs */
+	rc = imsic_early_probe(fwnode);
+	if (rc)
+		return rc;
+
+	/* Ensure that OF platform device gets probed */
+	of_node_clear_flag(node, OF_POPULATED);
+	return 0;
+}
+IRQCHIP_DECLARE(riscv_imsic, "riscv,imsics", imsic_early_dt_init);
diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
new file mode 100644
index 000000000000..54465e47851c
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-imsic-state.c
@@ -0,0 +1,962 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#define pr_fmt(fmt) "riscv-imsic: " fmt
+#include <linux/cpu.h>
+#include <linux/bitmap.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <asm/hwcap.h>
+
+#include "irq-riscv-imsic-state.h"
+
+#define IMSIC_DISABLE_EIDELIVERY		0
+#define IMSIC_ENABLE_EIDELIVERY			1
+#define IMSIC_DISABLE_EITHRESHOLD		1
+#define IMSIC_ENABLE_EITHRESHOLD		0
+
+#define imsic_csr_write(__c, __v)		\
+do {						\
+	csr_write(CSR_ISELECT, __c);		\
+	csr_write(CSR_IREG, __v);		\
+} while (0)
+
+#define imsic_csr_read(__c)			\
+({						\
+	unsigned long __v;			\
+	csr_write(CSR_ISELECT, __c);		\
+	__v = csr_read(CSR_IREG);		\
+	__v;					\
+})
+
+#define imsic_csr_read_clear(__c, __v)		\
+({						\
+	unsigned long __r;			\
+	csr_write(CSR_ISELECT, __c);		\
+	__r = csr_read_clear(CSR_IREG, __v);	\
+	__r;					\
+})
+
+#define imsic_csr_set(__c, __v)			\
+do {						\
+	csr_write(CSR_ISELECT, __c);		\
+	csr_set(CSR_IREG, __v);			\
+} while (0)
+
+#define imsic_csr_clear(__c, __v)		\
+do {						\
+	csr_write(CSR_ISELECT, __c);		\
+	csr_clear(CSR_IREG, __v);		\
+} while (0)
+
+struct imsic_priv *imsic;
+
+const struct imsic_global_config *imsic_get_global_config(void)
+{
+	return (imsic) ? &imsic->global : NULL;
+}
+EXPORT_SYMBOL_GPL(imsic_get_global_config);
+
+static bool __imsic_eix_read_clear(unsigned long id, bool pend)
+{
+	unsigned long isel, imask;
+
+	isel = id / BITS_PER_LONG;
+	isel *= BITS_PER_LONG / IMSIC_EIPx_BITS;
+	isel += (pend) ? IMSIC_EIP0 : IMSIC_EIE0;
+	imask = BIT(id & (__riscv_xlen - 1));
+
+	return (imsic_csr_read_clear(isel, imask) & imask) ? true : false;
+}
+
+#define __imsic_id_read_clear_enabled(__id)		\
+	__imsic_eix_read_clear((__id), false)
+#define __imsic_id_read_clear_pending(__id)		\
+	__imsic_eix_read_clear((__id), true)
+
+void __imsic_eix_update(unsigned long base_id,
+			unsigned long num_id, bool pend, bool val)
+{
+	unsigned long i, isel, ireg;
+	unsigned long id = base_id, last_id = base_id + num_id;
+
+	while (id < last_id) {
+		isel = id / BITS_PER_LONG;
+		isel *= BITS_PER_LONG / IMSIC_EIPx_BITS;
+		isel += (pend) ? IMSIC_EIP0 : IMSIC_EIE0;
+
+		ireg = 0;
+		for (i = id & (__riscv_xlen - 1);
+		     (id < last_id) && (i < __riscv_xlen); i++) {
+			ireg |= BIT(i);
+			id++;
+		}
+
+		/*
+		 * The IMSIC EIEx and EIPx registers are indirectly
+		 * accessed via using ISELECT and IREG CSRs so we
+		 * need to access these CSRs without getting preempted.
+		 *
+		 * All existing users of this function call this
+		 * function with local IRQs disabled so we don't
+		 * need to do anything special here.
+		 */
+		if (val)
+			imsic_csr_set(isel, ireg);
+		else
+			imsic_csr_clear(isel, ireg);
+	}
+}
+
+void imsic_local_sync(void)
+{
+	struct imsic_local_priv *lpriv = this_cpu_ptr(imsic->lpriv);
+	struct imsic_local_config *mlocal;
+	struct imsic_vector *mvec;
+	unsigned long flags;
+	int i;
+
+	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+	for (i = 1; i <= imsic->global.nr_ids; i++) {
+		if (i == IMSIC_IPI_ID)
+			continue;
+
+		if (test_bit(i, lpriv->ids_enabled_bitmap))
+			__imsic_id_set_enable(i);
+		else
+			__imsic_id_clear_enable(i);
+
+		mvec = lpriv->ids_move[i];
+		lpriv->ids_move[i] = NULL;
+		if (mvec) {
+			if (__imsic_id_read_clear_pending(i)) {
+				mlocal = per_cpu_ptr(imsic->global.local,
+						     mvec->cpu);
+				writel(mvec->local_id, mlocal->msi_va);
+			}
+
+			lpriv->vectors[i].hwirq = UINT_MAX;
+			lpriv->vectors[i].order = UINT_MAX;
+			clear_bit(i, lpriv->ids_used_bitmap);
+		}
+
+	}
+	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+}
+
+void imsic_local_delivery(bool enable)
+{
+	if (enable) {
+		imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_ENABLE_EITHRESHOLD);
+		imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_ENABLE_EIDELIVERY);
+	} else {
+		imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_DISABLE_EIDELIVERY);
+		imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_DISABLE_EITHRESHOLD);
+	}
+}
+
+#ifdef CONFIG_SMP
+static void imsic_remote_sync(unsigned int cpu)
+{
+	/*
+	 * We simply inject ID synchronization IPI to a target CPU
+	 * if it is not same as the current CPU. The ipi_send_mask()
+	 * implementation of IPI mux will inject ID synchronization
+	 * IPI only for CPUs that have enabled it so offline CPUs
+	 * won't receive IPI. An offline CPU will unconditionally
+	 * synchronize IDs through imsic_starting_cpu() when the
+	 * CPU is brought up.
+	 */
+	if (cpu_online(cpu)) {
+		if (cpu != smp_processor_id())
+			__ipi_send_mask(imsic->ipi_lsync_desc, cpumask_of(cpu));
+		else
+			imsic_local_sync();
+	}
+}
+#else
+static inline void imsic_remote_sync(unsigned int cpu)
+{
+	imsic_local_sync();
+}
+#endif
+
+void imsic_vector_mask(struct imsic_vector *vec)
+{
+	struct imsic_local_priv *lpriv;
+	unsigned long flags;
+
+	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
+	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
+		return;
+
+	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+	bitmap_clear(lpriv->ids_enabled_bitmap, vec->local_id, 1);
+	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+
+	imsic_remote_sync(vec->cpu);
+}
+
+void imsic_vector_unmask(struct imsic_vector *vec)
+{
+	struct imsic_local_priv *lpriv;
+	unsigned long flags;
+
+	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
+	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
+		return;
+
+	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+	bitmap_set(lpriv->ids_enabled_bitmap, vec->local_id, 1);
+	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+
+	imsic_remote_sync(vec->cpu);
+}
+
+void imsic_vector_move(struct imsic_vector *old_vec,
+			struct imsic_vector *new_vec)
+{
+	struct imsic_local_priv *old_lpriv, *new_lpriv;
+	struct imsic_vector *ovec, *nvec;
+	unsigned long flags, flags1;
+	unsigned int i;
+
+	if (WARN_ON(old_vec->cpu == new_vec->cpu ||
+		    old_vec->order != new_vec->order ||
+		    (old_vec->local_id & IMSIC_VECTOR_MASK(old_vec)) ||
+		    (new_vec->local_id & IMSIC_VECTOR_MASK(new_vec))))
+		return;
+
+	old_lpriv = per_cpu_ptr(imsic->lpriv, old_vec->cpu);
+	if (WARN_ON(&old_lpriv->vectors[old_vec->local_id] != old_vec))
+		return;
+
+	new_lpriv = per_cpu_ptr(imsic->lpriv, new_vec->cpu);
+	if (WARN_ON(&new_lpriv->vectors[new_vec->local_id] != new_vec))
+		return;
+
+	raw_spin_lock_irqsave(&old_lpriv->ids_lock, flags);
+	raw_spin_lock_irqsave(&new_lpriv->ids_lock, flags1);
+
+	/* Move the state of each vector entry */
+	for (i = 0; i < BIT(old_vec->order); i++) {
+		ovec = old_vec + i;
+		nvec = new_vec + i;
+
+		/* Unmask the new vector entry */
+		if (test_bit(ovec->local_id, old_lpriv->ids_enabled_bitmap))
+			bitmap_set(new_lpriv->ids_enabled_bitmap,
+				   nvec->local_id, 1);
+
+		/* Mask the old vector entry */
+		bitmap_clear(old_lpriv->ids_enabled_bitmap, ovec->local_id, 1);
+
+		/*
+		 * Move and re-trigger the new vector entry based on the
+		 * pending state of the old vector entry because we might
+		 * get a device interrupt on the old vector entry while
+		 * device was being moved to the new vector entry.
+		 */
+		old_lpriv->ids_move[ovec->local_id] = nvec;
+	}
+
+	raw_spin_unlock_irqrestore(&new_lpriv->ids_lock, flags1);
+	raw_spin_unlock_irqrestore(&old_lpriv->ids_lock, flags);
+
+	imsic_remote_sync(old_vec->cpu);
+	imsic_remote_sync(new_vec->cpu);
+}
+
+#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
+void imsic_vector_debug_show(struct seq_file *m,
+			     struct imsic_vector *vec, int ind)
+{
+	unsigned int mcpu = 0, mlocal_id = 0;
+	struct imsic_local_priv *lpriv;
+	bool move_in_progress = false;
+	struct imsic_vector *mvec;
+	bool is_enabled = false;
+	unsigned long flags;
+
+	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
+	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
+		return;
+
+	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+	if (test_bit(vec->local_id, lpriv->ids_enabled_bitmap))
+		is_enabled = true;
+	mvec = lpriv->ids_move[vec->local_id];
+	if (mvec) {
+		move_in_progress = true;
+		mcpu = mvec->cpu;
+		mlocal_id = mvec->local_id;
+	}
+	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+
+	seq_printf(m, "%*starget_cpu      : %5u\n", ind, "", vec->cpu);
+	seq_printf(m, "%*starget_local_id : %5u\n", ind, "", vec->local_id);
+	seq_printf(m, "%*sis_reserved     : %5u\n", ind, "",
+		   (vec->local_id <= IMSIC_IPI_ID) ? 1 : 0);
+	seq_printf(m, "%*sis_enabled      : %5u\n", ind, "",
+		   (move_in_progress) ? 1 : 0);
+	seq_printf(m, "%*sis_move_pending : %5u\n", ind, "",
+		   (move_in_progress) ? 1 : 0);
+	if (move_in_progress) {
+		seq_printf(m, "%*smove_cpu        : %5u\n", ind, "", mcpu);
+		seq_printf(m, "%*smove_local_id   : %5u\n", ind, "", mlocal_id);
+	}
+}
+
+void imsic_vector_debug_show_summary(struct seq_file *m, int ind)
+{
+	unsigned int cpu, total_avail = 0, total_used = 0;
+	struct imsic_global_config *global = &imsic->global;
+	struct imsic_local_priv *lpriv;
+	unsigned long flags;
+
+	for_each_possible_cpu(cpu) {
+		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
+
+		total_avail += global->nr_ids;
+
+		raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+		total_used += bitmap_weight(lpriv->ids_used_bitmap,
+					    global->nr_ids + 1) - 1;
+		raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+	}
+
+	seq_printf(m, "%*stotal : %5u\n", ind, "", total_avail);
+	seq_printf(m, "%*sused  : %5u\n", ind, "", total_used);
+	seq_printf(m, "%*s| CPU | tot | usd | vectors\n", ind, " ");
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
+
+		raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+		total_used = bitmap_weight(lpriv->ids_used_bitmap,
+					   global->nr_ids + 1) - 1;
+		seq_printf(m, "%*s %4d  %4u  %4u  %*pbl\n", ind, " ",
+			   cpu, global->nr_ids, total_used,
+			   global->nr_ids + 1, lpriv->ids_used_bitmap);
+		raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+	}
+	cpus_read_unlock();
+}
+#endif
+
+struct imsic_vector *imsic_vector_from_local_id(unsigned int cpu,
+						unsigned int local_id)
+{
+	struct imsic_local_priv *lpriv = per_cpu_ptr(imsic->lpriv, cpu);
+
+	if (!lpriv || imsic->global.nr_ids < local_id)
+		return NULL;
+
+	return &lpriv->vectors[local_id];
+}
+
+static unsigned int imsic_vector_best_cpu(const struct cpumask *mask,
+					  unsigned int order)
+{
+	struct imsic_global_config *global = &imsic->global;
+	unsigned int cpu, best_cpu, free, maxfree = 0;
+	struct imsic_local_priv *lpriv;
+	unsigned long flags;
+
+	best_cpu = UINT_MAX;
+	for_each_cpu(cpu, mask) {
+		if (!cpu_online(cpu))
+			continue;
+
+		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
+		raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+		free = bitmap_weight(lpriv->ids_used_bitmap,
+				     global->nr_ids + 1);
+		free = (global->nr_ids + 1) - free;
+		raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+
+		if (free < BIT(order) || free <= maxfree)
+			continue;
+
+		best_cpu = cpu;
+		maxfree = free;
+	}
+
+	return best_cpu;
+}
+
+struct imsic_vector *imsic_vector_alloc(unsigned int hwirq,
+					const struct cpumask *mask,
+					unsigned int order)
+{
+	struct imsic_vector *vec = NULL;
+	struct imsic_local_priv *lpriv;
+	unsigned long flags;
+	unsigned int cpu;
+	int i, local_id;
+
+	if (!mask || cpumask_empty(mask))
+		return NULL;
+
+	cpu = imsic_vector_best_cpu(mask, order);
+	if (cpu == UINT_MAX)
+		return NULL;
+
+	lpriv = per_cpu_ptr(imsic->lpriv, cpu);
+	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+	local_id = bitmap_find_free_region(lpriv->ids_used_bitmap,
+					   imsic->global.nr_ids + 1,
+					   order);
+	if (local_id > 0) {
+		for (i = 0; i < BIT(order); i++) {
+			vec = &lpriv->vectors[local_id + i];
+			vec->hwirq = hwirq + i;
+			vec->order = order;
+		}
+		vec = &lpriv->vectors[local_id];
+	}
+	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+
+	return vec;
+}
+
+void imsic_vector_free(struct imsic_vector *vec)
+{
+	unsigned int i, local_id, order;
+	struct imsic_local_priv *lpriv;
+	struct imsic_vector *tvec;
+	unsigned long flags;
+
+	if (WARN_ON(vec->hwirq == UINT_MAX || vec->order == UINT_MAX))
+		return;
+
+	lpriv = per_cpu_ptr(imsic->lpriv, vec->cpu);
+	if (WARN_ON(&lpriv->vectors[vec->local_id] != vec))
+		return;
+
+	order = vec->order;
+	local_id = IMSIC_VECTOR_BASE_LOCAL_ID(vec);
+
+	raw_spin_lock_irqsave(&lpriv->ids_lock, flags);
+	for (i = 0; i < BIT(order); i++) {
+		tvec = &lpriv->vectors[local_id + i];
+		tvec->hwirq = UINT_MAX;
+		tvec->order = UINT_MAX;
+	}
+	bitmap_release_region(lpriv->ids_used_bitmap, local_id, order);
+	raw_spin_unlock_irqrestore(&lpriv->ids_lock, flags);
+}
+
+static void __init imsic_local_cleanup(void)
+{
+	int cpu;
+	struct imsic_local_priv *lpriv;
+
+	for_each_possible_cpu(cpu) {
+		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
+
+		bitmap_free(lpriv->ids_enabled_bitmap);
+		bitmap_free(lpriv->ids_used_bitmap);
+		kfree(lpriv->ids_move);
+		kfree(lpriv->vectors);
+	}
+
+	free_percpu(imsic->lpriv);
+}
+
+static int __init imsic_local_init(void)
+{
+	struct imsic_global_config *global = &imsic->global;
+	struct imsic_local_priv *lpriv;
+	struct imsic_vector *vec;
+	int cpu, i;
+
+	/* Allocate per-CPU private state */
+	imsic->lpriv = alloc_percpu(typeof(*(imsic->lpriv)));
+	if (!imsic->lpriv)
+		return -ENOMEM;
+
+	/* Setup per-CPU private state */
+	for_each_possible_cpu(cpu) {
+		lpriv = per_cpu_ptr(imsic->lpriv, cpu);
+
+		raw_spin_lock_init(&lpriv->ids_lock);
+
+		/* Allocate used bitmap */
+		lpriv->ids_used_bitmap = bitmap_zalloc(global->nr_ids + 1,
+						       GFP_KERNEL);
+		if (!lpriv->ids_used_bitmap) {
+			imsic_local_cleanup();
+			return -ENOMEM;
+		}
+
+		/* Allocate enabled bitmap */
+		lpriv->ids_enabled_bitmap = bitmap_zalloc(global->nr_ids + 1,
+							  GFP_KERNEL);
+		if (!lpriv->ids_enabled_bitmap) {
+			imsic_local_cleanup();
+			return -ENOMEM;
+		}
+
+		/* Allocate move array */
+		lpriv->ids_move = kcalloc(global->nr_ids + 1,
+					sizeof(*lpriv->ids_move), GFP_KERNEL);
+		if (!lpriv->ids_move) {
+			imsic_local_cleanup();
+			return -ENOMEM;
+		}
+
+		/* Reserve ID#0 because it is special and never implemented */
+		bitmap_set(lpriv->ids_used_bitmap, 0, 1);
+
+		/* Reserve IPI ID because it is special and used internally */
+		bitmap_set(lpriv->ids_used_bitmap, IMSIC_IPI_ID, 1);
+
+		/* Allocate vector array */
+		lpriv->vectors = kcalloc(global->nr_ids + 1,
+					 sizeof(*lpriv->vectors), GFP_KERNEL);
+		if (!lpriv->vectors) {
+			imsic_local_cleanup();
+			return -ENOMEM;
+		}
+
+		/* Setup vector array */
+		for (i = 0; i <= global->nr_ids; i++) {
+			vec = &lpriv->vectors[i];
+			vec->cpu = cpu;
+			vec->local_id = i;
+			vec->hwirq = UINT_MAX;
+			vec->order = UINT_MAX;
+		}
+	}
+
+	return 0;
+}
+
+int imsic_hwirqs_alloc(unsigned int order)
+{
+	int ret;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&imsic->hwirqs_lock, flags);
+	ret = bitmap_find_free_region(imsic->hwirqs_used_bitmap,
+				      imsic->nr_hwirqs, order);
+	raw_spin_unlock_irqrestore(&imsic->hwirqs_lock, flags);
+
+	return ret;
+}
+
+void imsic_hwirqs_free(unsigned int base_hwirq, unsigned int order)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&imsic->hwirqs_lock, flags);
+	bitmap_release_region(imsic->hwirqs_used_bitmap, base_hwirq, order);
+	raw_spin_unlock_irqrestore(&imsic->hwirqs_lock, flags);
+}
+
+static int __init imsic_hwirqs_init(void)
+{
+	struct imsic_global_config *global = &imsic->global;
+
+	imsic->nr_hwirqs = num_possible_cpus() * global->nr_ids;
+
+	raw_spin_lock_init(&imsic->hwirqs_lock);
+
+	imsic->hwirqs_used_bitmap = bitmap_zalloc(imsic->nr_hwirqs,
+						  GFP_KERNEL);
+	if (!imsic->hwirqs_used_bitmap)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __init imsic_hwirqs_cleanup(void)
+{
+	bitmap_free(imsic->hwirqs_used_bitmap);
+}
+
+static int __init imsic_get_parent_hartid(struct fwnode_handle *fwnode,
+					  u32 index, unsigned long *hartid)
+{
+	int rc;
+	struct of_phandle_args parent;
+
+	/*
+	 * Currently, only OF fwnode is supported, so extend this
+	 * function for ACPI support.
+	 */
+	if (!is_of_node(fwnode))
+		return -EINVAL;
+
+	rc = of_irq_parse_one(to_of_node(fwnode), index, &parent);
+	if (rc)
+		return rc;
+
+	/*
+	 * Skip interrupts other than external interrupts for
+	 * current privilege level.
+	 */
+	if (parent.args[0] != RV_IRQ_EXT)
+		return -EINVAL;
+
+	return riscv_of_parent_hartid(parent.np, hartid);
+}
+
+static int __init imsic_get_mmio_resource(struct fwnode_handle *fwnode,
+					  u32 index, struct resource *res)
+{
+	/*
+	 * Currently, only OF fwnode is supported, so extend this
+	 * function for ACPI support.
+	 */
+	if (!is_of_node(fwnode))
+		return -EINVAL;
+
+	return of_address_to_resource(to_of_node(fwnode), index, res);
+}
+
+static int __init imsic_parse_fwnode(struct fwnode_handle *fwnode,
+				     struct imsic_global_config *global,
+				     u32 *nr_parent_irqs,
+				     u32 *nr_mmios)
+{
+	unsigned long hartid;
+	struct resource res;
+	int rc;
+	u32 i;
+
+	/*
+	 * Currently, only OF fwnode is supported, so extend this
+	 * function for ACPI support.
+	 */
+	if (!is_of_node(fwnode))
+		return -EINVAL;
+
+	*nr_parent_irqs = 0;
+	*nr_mmios = 0;
+
+	/* Find number of parent interrupts */
+	while (!imsic_get_parent_hartid(fwnode, *nr_parent_irqs, &hartid))
+		(*nr_parent_irqs)++;
+	if (!(*nr_parent_irqs)) {
+		pr_err("%pfwP: no parent irqs available\n", fwnode);
+		return -EINVAL;
+	}
+
+	/* Find number of guest index bits in MSI address */
+	rc = of_property_read_u32(to_of_node(fwnode),
+				  "riscv,guest-index-bits",
+				  &global->guest_index_bits);
+	if (rc)
+		global->guest_index_bits = 0;
+
+	/* Find number of HART index bits */
+	rc = of_property_read_u32(to_of_node(fwnode),
+				  "riscv,hart-index-bits",
+				  &global->hart_index_bits);
+	if (rc) {
+		/* Assume default value */
+		global->hart_index_bits = __fls(*nr_parent_irqs);
+		if (BIT(global->hart_index_bits) < *nr_parent_irqs)
+			global->hart_index_bits++;
+	}
+
+	/* Find number of group index bits */
+	rc = of_property_read_u32(to_of_node(fwnode),
+				  "riscv,group-index-bits",
+				  &global->group_index_bits);
+	if (rc)
+		global->group_index_bits = 0;
+
+	/*
+	 * Find first bit position of group index.
+	 * If not specified, assume the default APLIC-IMSIC configuration.
+	 */
+	rc = of_property_read_u32(to_of_node(fwnode),
+				  "riscv,group-index-shift",
+				  &global->group_index_shift);
+	if (rc)
+		global->group_index_shift = IMSIC_MMIO_PAGE_SHIFT * 2;
+
+	/* Find number of interrupt identities */
+	rc = of_property_read_u32(to_of_node(fwnode),
+				  "riscv,num-ids",
+				  &global->nr_ids);
+	if (rc) {
+		pr_err("%pfwP: number of interrupt identities not found\n",
+			fwnode);
+		return rc;
+	}
+
+	/* Find number of guest interrupt identities */
+	rc = of_property_read_u32(to_of_node(fwnode),
+				  "riscv,num-guest-ids",
+				  &global->nr_guest_ids);
+	if (rc)
+		global->nr_guest_ids = global->nr_ids;
+
+	/* Sanity check guest index bits */
+	i = BITS_PER_LONG - IMSIC_MMIO_PAGE_SHIFT;
+	if (i < global->guest_index_bits) {
+		pr_err("%pfwP: guest index bits too big\n", fwnode);
+		return -EINVAL;
+	}
+
+	/* Sanity check HART index bits */
+	i = BITS_PER_LONG - IMSIC_MMIO_PAGE_SHIFT - global->guest_index_bits;
+	if (i < global->hart_index_bits) {
+		pr_err("%pfwP: HART index bits too big\n", fwnode);
+		return -EINVAL;
+	}
+
+	/* Sanity check group index bits */
+	i = BITS_PER_LONG - IMSIC_MMIO_PAGE_SHIFT -
+	    global->guest_index_bits - global->hart_index_bits;
+	if (i < global->group_index_bits) {
+		pr_err("%pfwP: group index bits too big\n", fwnode);
+		return -EINVAL;
+	}
+
+	/* Sanity check group index shift */
+	i = global->group_index_bits + global->group_index_shift - 1;
+	if (i >= BITS_PER_LONG) {
+		pr_err("%pfwP: group index shift too big\n", fwnode);
+		return -EINVAL;
+	}
+
+	/* Sanity check number of interrupt identities */
+	if ((global->nr_ids < IMSIC_MIN_ID) ||
+	    (global->nr_ids >= IMSIC_MAX_ID) ||
+	    ((global->nr_ids & IMSIC_MIN_ID) != IMSIC_MIN_ID)) {
+		pr_err("%pfwP: invalid number of interrupt identities\n",
+			fwnode);
+		return -EINVAL;
+	}
+
+	/* Sanity check number of guest interrupt identities */
+	if ((global->nr_guest_ids < IMSIC_MIN_ID) ||
+	    (global->nr_guest_ids >= IMSIC_MAX_ID) ||
+	    ((global->nr_guest_ids & IMSIC_MIN_ID) != IMSIC_MIN_ID)) {
+		pr_err("%pfwP: invalid number of guest interrupt identities\n",
+			fwnode);
+		return -EINVAL;
+	}
+
+	/* Compute base address */
+	rc = imsic_get_mmio_resource(fwnode, 0, &res);
+	if (rc) {
+		pr_err("%pfwP: first MMIO resource not found\n", fwnode);
+		return -EINVAL;
+	}
+	global->base_addr = res.start;
+	global->base_addr &= ~(BIT(global->guest_index_bits +
+				   global->hart_index_bits +
+				   IMSIC_MMIO_PAGE_SHIFT) - 1);
+	global->base_addr &= ~((BIT(global->group_index_bits) - 1) <<
+			       global->group_index_shift);
+
+	/* Find number of MMIO register sets */
+	while (!imsic_get_mmio_resource(fwnode, *nr_mmios, &res))
+		(*nr_mmios)++;
+
+	return 0;
+}
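
For example (numbers assumed), a platform with 6 parent interrupts and no
riscv,hart-index-bits property ends up with hart_index_bits = 3, since
__fls(6) = 2 and BIT(2) = 4 cannot index 6 harts. A standalone equivalent of
that fallback:

#include <stdio.h>

/* ceil(log2(n)) for n >= 1, equivalent to the __fls()-based fallback above. */
static unsigned int default_hart_index_bits(unsigned int nr_parent_irqs)
{
	unsigned int bits = 0;

	while ((1U << bits) < nr_parent_irqs)
		bits++;

	return bits;
}

int main(void)
{
	printf("%u\n", default_hart_index_bits(6));	/* prints 3 */
	return 0;
}
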
+
+int __init imsic_setup_state(struct fwnode_handle *fwnode)
+{
+	int rc, cpu;
+	phys_addr_t base_addr;
+	void __iomem **mmios_va = NULL;
+	struct resource *mmios = NULL;
+	struct imsic_local_config *local;
+	struct imsic_global_config *global;
+	unsigned long reloff, hartid;
+	u32 i, j, index, nr_parent_irqs, nr_mmios, nr_handlers = 0;
+
+	/*
+	 * Only one IMSIC instance is allowed in a platform for a clean
+	 * implementation of SMP IRQ affinity and per-CPU IPIs.
+	 *
+	 * This means on a multi-socket (or multi-die) platform we
+	 * will have multiple MMIO regions for one IMSIC instance.
+	 */
+	if (imsic) {
+		pr_err("%pfwP: already initialized hence ignoring\n",
+			fwnode);
+		return -EALREADY;
+	}
+
+	if (!riscv_isa_extension_available(NULL, SxAIA)) {
+		pr_err("%pfwP: AIA support not available\n", fwnode);
+		return -ENODEV;
+	}
+
+	imsic = kzalloc(sizeof(*imsic), GFP_KERNEL);
+	if (!imsic)
+		return -ENOMEM;
+	imsic->fwnode = fwnode;
+	global = &imsic->global;
+
+	global->local = alloc_percpu(typeof(*(global->local)));
+	if (!global->local) {
+		rc = -ENOMEM;
+		goto out_free_priv;
+	}
+
+	/* Parse IMSIC fwnode */
+	rc = imsic_parse_fwnode(fwnode, global, &nr_parent_irqs, &nr_mmios);
+	if (rc)
+		goto out_free_local;
+
+	/* Allocate MMIO resource array */
+	mmios = kcalloc(nr_mmios, sizeof(*mmios), GFP_KERNEL);
+	if (!mmios) {
+		rc = -ENOMEM;
+		goto out_free_local;
+	}
+
+	/* Allocate MMIO virtual address array */
+	mmios_va = kcalloc(nr_mmios, sizeof(*mmios_va), GFP_KERNEL);
+	if (!mmios_va) {
+		rc = -ENOMEM;
+		goto out_iounmap;
+	}
+
+	/* Parse and map MMIO register sets */
+	for (i = 0; i < nr_mmios; i++) {
+		rc = imsic_get_mmio_resource(fwnode, i, &mmios[i]);
+		if (rc) {
+			pr_err("%pfwP: unable to parse MMIO regset %d\n",
+				fwnode, i);
+			goto out_iounmap;
+		}
+
+		base_addr = mmios[i].start;
+		base_addr &= ~(BIT(global->guest_index_bits +
+				   global->hart_index_bits +
+				   IMSIC_MMIO_PAGE_SHIFT) - 1);
+		base_addr &= ~((BIT(global->group_index_bits) - 1) <<
+			       global->group_index_shift);
+		if (base_addr != global->base_addr) {
+			rc = -EINVAL;
+			pr_err("%pfwP: address mismatch for regset %d\n",
+				fwnode, i);
+			goto out_iounmap;
+		}
+
+		mmios_va[i] = ioremap(mmios[i].start, resource_size(&mmios[i]));
+		if (!mmios_va[i]) {
+			rc = -EIO;
+			pr_err("%pfwP: unable to map MMIO regset %d\n",
+				fwnode, i);
+			goto out_iounmap;
+		}
+	}
+
+	/* Initialize HW interrupt numbers */
+	rc = imsic_hwirqs_init();
+	if (rc) {
+		pr_err("%pfwP: failed to initialize HW interrupts numbers\n",
+		       fwnode);
+		goto out_iounmap;
+	}
+
+	/* Initialize local (or per-CPU) state */
+	rc = imsic_local_init();
+	if (rc) {
+		pr_err("%pfwP: failed to initialize local state\n",
+		       fwnode);
+		goto out_hwirqs_cleanup;
+	}
+
+	/* Configure handlers for target CPUs */
+	for (i = 0; i < nr_parent_irqs; i++) {
+		rc = imsic_get_parent_hartid(fwnode, i, &hartid);
+		if (rc) {
+			pr_warn("%pfwP: hart ID for parent irq%d not found\n",
+				fwnode, i);
+			continue;
+		}
+
+		cpu = riscv_hartid_to_cpuid(hartid);
+		if (cpu < 0) {
+			pr_warn("%pfwP: invalid cpuid for parent irq%d\n",
+				fwnode, i);
+			continue;
+		}
+
+		/* Find MMIO location of MSI page */
+		index = nr_mmios;
+		reloff = i * BIT(global->guest_index_bits) *
+			 IMSIC_MMIO_PAGE_SZ;
+		for (j = 0; j < nr_mmios; j++) {
+			if (reloff < resource_size(&mmios[j])) {
+				index = j;
+				break;
+			}
+
+			/*
+			 * MMIO region size may not be aligned to
+			 * BIT(global->guest_index_bits) * IMSIC_MMIO_PAGE_SZ
+			 * if holes are present.
+			 */
+			reloff -= ALIGN(resource_size(&mmios[j]),
+			BIT(global->guest_index_bits) * IMSIC_MMIO_PAGE_SZ);
+		}
+		if (index >= nr_mmios) {
+			pr_warn("%pfwP: MMIO not found for parent irq%d\n",
+				fwnode, i);
+			continue;
+		}
+
+		local = per_cpu_ptr(global->local, cpu);
+		local->msi_pa = mmios[index].start + reloff;
+		local->msi_va = mmios_va[index] + reloff;
+
+		nr_handlers++;
+	}
+
+	/* If no CPU handlers were found then we can't take interrupts */
+	if (!nr_handlers) {
+		pr_err("%pfwP: No CPU handlers found\n", fwnode);
+		rc = -ENODEV;
+		goto out_local_cleanup;
+	}
+
+	/* We don't need the MMIO arrays anymore so free them up */
+	kfree(mmios_va);
+	kfree(mmios);
+
+	return 0;
+
+out_local_cleanup:
+	imsic_local_cleanup();
+out_hwirqs_cleanup:
+	imsic_hwirqs_cleanup();
+out_iounmap:
+	for (i = 0; i < nr_mmios; i++) {
+		if (mmios_va && mmios_va[i])
+			iounmap(mmios_va[i]);
+	}
+	kfree(mmios_va);
+	kfree(mmios);
+out_free_local:
+	free_percpu(imsic->global.local);
+out_free_priv:
+	kfree(imsic);
+	imsic = NULL;
+	return rc;
+}
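
Not part of the diff above, just for reference: a minimal standalone sketch of
how one hart's MSI page address is composed under the layout that
imsic_setup_state() masks out of the MMIO resources (and that the
struct imsic_global_config comment documents further below). The base address,
field widths, and hart/guest numbers here are made up for illustration.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12	/* same value as IMSIC_MMIO_PAGE_SHIFT */

/*
 * Compose one hart's MSI page address from hypothetical global config
 * fields: guest index in the lowest page-granule field, hart index above
 * it, group index at group_index_shift.
 */
static uint64_t imsic_msi_addr(uint64_t base_addr, uint32_t guest_index_bits,
			       uint32_t group_index_shift,
			       uint32_t group, uint32_t hart, uint32_t guest)
{
	uint64_t addr = base_addr;

	addr |= (uint64_t)group << group_index_shift;
	addr |= (uint64_t)hart << (guest_index_bits + PAGE_SHIFT);
	addr |= (uint64_t)guest << PAGE_SHIFT;

	return addr;
}

int main(void)
{
	/* Assumed config: base 0x28000000, 2 guest index bits, default
	 * group shift of 2 * PAGE_SHIFT; target hart 3, guest file 0. */
	uint64_t addr = imsic_msi_addr(0x28000000ULL, 2, 2 * PAGE_SHIFT, 0, 3, 0);

	printf("hart 3 MSI page at 0x%llx\n", (unsigned long long)addr);

	return 0;
}

This matches the per-hart offset computed above: with a single MMIO register
set, parent irq i lands at base_addr + i * BIT(guest_index_bits) *
IMSIC_MMIO_PAGE_SZ, which is exactly the hart field of the address.
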
diff --git a/drivers/irqchip/irq-riscv-imsic-state.h b/drivers/irqchip/irq-riscv-imsic-state.h
new file mode 100644
index 000000000000..82911b8b08b4
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-imsic-state.h
@@ -0,0 +1,109 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#ifndef _IRQ_RISCV_IMSIC_STATE_H
+#define _IRQ_RISCV_IMSIC_STATE_H
+
+#include <linux/irqchip/riscv-imsic.h>
+#include <linux/irqdomain.h>
+#include <linux/fwnode.h>
+
+/*
+ * The IMSIC driver uses 1 IPI for ID synchronization and
+ * arch/riscv/kernel/smp.c requires 6 IPIs, so we fix the
+ * total number of IPIs to 8.
+ */
+#define IMSIC_IPI_ID				1
+#define IMSIC_NR_IPI				8
+
+struct imsic_vector {
+	/* Fixed details of the vector */
+	unsigned int cpu;
+	unsigned int local_id;
+	/* Details saved by driver in the vector */
+	unsigned int hwirq;
+	unsigned int order;
+};
+
+#define IMSIC_VECTOR_MASK(__v)			\
+	(BIT((__v)->order) - 1UL)
+#define IMSIC_VECTOR_BASE_LOCAL_ID(__v)		\
+	((__v)->local_id & ~IMSIC_VECTOR_MASK(__v))
+#define IMSIC_VECTOR_BASE_HWIRQ(__v)		\
+	((__v)->hwirq & ~IMSIC_VECTOR_MASK(__v))
+
+struct imsic_local_priv {
+	/* Local state of interrupt identities */
+	raw_spinlock_t ids_lock;
+	unsigned long *ids_used_bitmap;
+	unsigned long *ids_enabled_bitmap;
+	struct imsic_vector **ids_move;
+
+	/* Local vector table */
+	struct imsic_vector *vectors;
+};
+
+struct imsic_priv {
+	/* Device details */
+	struct fwnode_handle *fwnode;
+
+	/* Global configuration common for all HARTs */
+	struct imsic_global_config global;
+
+	/* Dummy HW interrupt numbers */
+	unsigned int nr_hwirqs;
+	raw_spinlock_t hwirqs_lock;
+	unsigned long *hwirqs_used_bitmap;
+
+	/* Per-CPU state */
+	struct imsic_local_priv __percpu *lpriv;
+
+	/* IPI interrupt identity and synchronization */
+	int ipi_virq;
+	struct irq_desc *ipi_lsync_desc;
+
+	/* IRQ domains (created by platform driver) */
+	struct irq_domain *base_domain;
+	struct irq_domain *plat_domain;
+};
+
+extern struct imsic_priv *imsic;
+
+void __imsic_eix_update(unsigned long base_id,
+			unsigned long num_id, bool pend, bool val);
+
+#define __imsic_id_set_enable(__id)		\
+	__imsic_eix_update((__id), 1, false, true)
+#define __imsic_id_clear_enable(__id)	\
+	__imsic_eix_update((__id), 1, false, false)
+
+void imsic_local_sync(void);
+void imsic_local_delivery(bool enable);
+
+void imsic_vector_mask(struct imsic_vector *vec);
+void imsic_vector_unmask(struct imsic_vector *vec);
+void imsic_vector_move(struct imsic_vector *old_vec,
+			struct imsic_vector *new_vec);
+
+struct imsic_vector *imsic_vector_from_local_id(unsigned int cpu,
+						unsigned int local_id);
+
+struct imsic_vector *imsic_vector_alloc(unsigned int hwirq,
+					const struct cpumask *mask,
+					unsigned int order);
+void imsic_vector_free(struct imsic_vector *vector);
+
+void imsic_vector_debug_show(struct seq_file *m,
+			     struct imsic_vector *vec, int ind);
+
+void imsic_vector_debug_show_summary(struct seq_file *m, int ind);
+
+int imsic_hwirqs_alloc(unsigned int order);
+void imsic_hwirqs_free(unsigned int base_hwirq, unsigned int order);
+
+int imsic_setup_state(struct fwnode_handle *fwnode);
+
+#endif
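
A quick standalone illustration (mock struct and values are made up, not taken
from the driver) of what the IMSIC_VECTOR_* helpers above evaluate to for a
multi-MSI block: both local IDs and hwirqs come from bitmap_find_free_region(),
so a block of order N is aligned to BIT(N), and clearing the low N bits of any
member recovers the base of the block.

#include <stdio.h>

struct imsic_vector {
	unsigned int cpu;
	unsigned int local_id;
	unsigned int hwirq;
	unsigned int order;
};

#define BIT(nr)				(1UL << (nr))
#define IMSIC_VECTOR_MASK(__v)		(BIT((__v)->order) - 1UL)
#define IMSIC_VECTOR_BASE_LOCAL_ID(__v)	((__v)->local_id & ~IMSIC_VECTOR_MASK(__v))
#define IMSIC_VECTOR_BASE_HWIRQ(__v)	((__v)->hwirq & ~IMSIC_VECTOR_MASK(__v))

int main(void)
{
	/* Third vector of a hypothetical order-2 (4 MSI) block: local IDs
	 * 36..39 on CPU 1, hwirqs 128..131. */
	struct imsic_vector v = {
		.cpu = 1, .local_id = 38, .hwirq = 130, .order = 2,
	};

	printf("base local_id = %lu, base hwirq = %lu\n",
	       IMSIC_VECTOR_BASE_LOCAL_ID(&v), IMSIC_VECTOR_BASE_HWIRQ(&v));

	return 0;
}
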
diff --git a/include/linux/irqchip/riscv-imsic.h b/include/linux/irqchip/riscv-imsic.h
new file mode 100644
index 000000000000..cbb7bcd0e4dd
--- /dev/null
+++ b/include/linux/irqchip/riscv-imsic.h
@@ -0,0 +1,87 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+#ifndef __LINUX_IRQCHIP_RISCV_IMSIC_H
+#define __LINUX_IRQCHIP_RISCV_IMSIC_H
+
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <asm/csr.h>
+
+#define IMSIC_MMIO_PAGE_SHIFT		12
+#define IMSIC_MMIO_PAGE_SZ		BIT(IMSIC_MMIO_PAGE_SHIFT)
+#define IMSIC_MMIO_PAGE_LE		0x00
+#define IMSIC_MMIO_PAGE_BE		0x04
+
+#define IMSIC_MIN_ID			63
+#define IMSIC_MAX_ID			2048
+
+#define IMSIC_EIDELIVERY		0x70
+
+#define IMSIC_EITHRESHOLD		0x72
+
+#define IMSIC_EIP0			0x80
+#define IMSIC_EIP63			0xbf
+#define IMSIC_EIPx_BITS			32
+
+#define IMSIC_EIE0			0xc0
+#define IMSIC_EIE63			0xff
+#define IMSIC_EIEx_BITS			32
+
+#define IMSIC_FIRST			IMSIC_EIDELIVERY
+#define IMSIC_LAST			IMSIC_EIE63
+
+#define IMSIC_MMIO_SETIPNUM_LE		0x00
+#define IMSIC_MMIO_SETIPNUM_BE		0x04
+
+struct imsic_local_config {
+	phys_addr_t msi_pa;
+	void __iomem *msi_va;
+};
+
+struct imsic_global_config {
+	/*
+	 * MSI Target Address Scheme
+	 *
+	 * XLEN-1                                                12     0
+	 * |                                                     |     |
+	 * -------------------------------------------------------------
+	 * |xxxxxx|Group Index|xxxxxxxxxxx|HART Index|Guest Index|  0  |
+	 * -------------------------------------------------------------
+	 */
+
+	/* Bits representing Guest index, HART index, and Group index */
+	u32 guest_index_bits;
+	u32 hart_index_bits;
+	u32 group_index_bits;
+	u32 group_index_shift;
+
+	/* Global base address matching all target MSI addresses */
+	phys_addr_t base_addr;
+
+	/* Number of interrupt identities */
+	u32 nr_ids;
+
+	/* Number of guest interrupt identities */
+	u32 nr_guest_ids;
+
+	/* Per-CPU IMSIC addresses */
+	struct imsic_local_config __percpu *local;
+};
+
+#ifdef CONFIG_RISCV_IMSIC
+
+extern const struct imsic_global_config *imsic_get_global_config(void);
+
+#else
+
+static inline const struct imsic_global_config *imsic_get_global_config(void)
+{
+	return NULL;
+}
+
+#endif
+
+#endif
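
To close, a hedged sketch (not something this patch adds) of how a consumer of
the exported imsic_get_global_config() could raise an interrupt on a given CPU
by hand, using the same mechanism the driver relies on for software-injected
IPIs: write the target interrupt identity to the little-endian setipnum word of
that CPU's MSI page. The helper name is hypothetical; real users are expected
to go through the MSI irqdomains created by the platform driver later in this
series.

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/percpu.h>
#include <linux/irqchip/riscv-imsic.h>

/* Hypothetical helper, for illustration only. */
static int imsic_demo_inject(unsigned int cpu, u32 id)
{
	const struct imsic_global_config *global = imsic_get_global_config();
	const struct imsic_local_config *local;

	if (!global || !global->local)
		return -ENODEV;

	local = per_cpu_ptr(global->local, cpu);
	if (!local->msi_va)
		return -ENODEV;

	/* The MSI data is simply the target interrupt identity. */
	writel(id, local->msi_va + IMSIC_MMIO_SETIPNUM_LE);

	return 0;
}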