[v5,04/11] iommu/vt-d: Add helper to setup pasid nested translation
Commit Message
From: Lu Baolu <baolu.lu@linux.intel.com>
The configurations are passed in from the user when the user domain is
allocated. This helper interprets these configurations according to the
data structure defined in uapi/linux/iommufd.h. The EINVAL error will be
returned if any of configurations are not compatible with the hardware
capabilities. The caller can retry with another compatible user domain.
The encoding of fields of each pasid entry is defined in section 9.6 of
the VT-d spec.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
drivers/iommu/intel/pasid.c | 125 ++++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
2 files changed, 127 insertions(+)
Comments
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, September 21, 2023 3:54 PM
> +
> +/**
> + * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
> + * @iommu: IOMMU which the device belong to
> + * @dev: Device to be set up for translation
> + * @pasid: PASID to be programmed in the device PASID table
> + * @domain: User first-level domain nested on a s2 domain
let's use stage-1/stage-2 consistently
> + *
> + * This is used for nested translation based vIOMMU. e.g. guest IOVA and
> + * guest shared virtual address. In this case, the first-level page tables
> + * are used for GVA/GIOVA-GPA translation in the guest, the second-level
> + * page tables are used for GPA-HPA translation.
No need to mention guest or vIOMMU. Just stick to the fact of
nested configuration.
> + */
> +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device
> *dev,
> + u32 pasid, struct dmar_domain *domain)
> +{
> + struct iommu_hwpt_vtd_s1 *s1_cfg = &domain->s1_cfg;
> + pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
> + struct dmar_domain *s2_domain = domain->s2_domain;
> + u16 did = domain_id_iommu(domain, iommu);
> + struct dma_pte *pgd = s2_domain->pgd;
> + struct pasid_entry *pte;
> +
> + if (!ecap_nest(iommu->ecap)) {
> + pr_err_ratelimited("%s: No nested translation support\n",
> + iommu->name);
> + return -ENODEV;
> + }
this check has been done when creating the nest_parent s2 and iommufd
shouldn't request a nested setup when the specified s2 is not nest_parent.
If you still want to keep this check then a WARN_ON.
> +
> + /*
> + * Address width should match in two dimensions: CPU vs. IOMMU,
> + * guest vs. host.
> + */
No 'guest'. Just that the user requested addr width must be supported by
the hardware.
> + switch (s1_cfg->addr_width) {
> + case ADDR_WIDTH_4LEVEL:
> + break;
> +#ifdef CONFIG_X86
> + case ADDR_WIDTH_5LEVEL:
> + if (!cpu_feature_enabled(X86_FEATURE_LA57) ||
> + !cap_fl5lp_support(iommu->cap)) {
> + dev_err_ratelimited(dev,
> + "5-level paging not supported\n");
> + return -EINVAL;
> + }
> + break;
I wonder whether this check is too strict. a nested configuration doesn't
mandate vSVM. e.g. for guest IOVA the s1 page table is not walked by
the CPU. Adding a CPU capability check here doesn't make much sense.
Of course doing so may not cause a problem reality as very likely all
Intel platforms have same 5level support in both CPU/IOMMU. But
code-wise it's still better to do it right.
Ideally the guest will be notified with 5level support only when the CPU
supports it then it's the guest's business to choose the right format to
match the CPU capability. Misconfiguration will be caught by the CPU
virtualization path?
> +#endif
> + default:
> + dev_err_ratelimited(dev, "Invalid guest address width %d\n",
> + s1_cfg->addr_width);
ditto. use 'stage-1 address width' instead of 'guest'.
> + return -EINVAL;
> + }
> +
> + if ((s1_cfg->flags & IOMMU_VTD_S1_SRE) && !ecap_srs(iommu-
> >ecap)) {
> + pr_err_ratelimited("No supervisor request support on %s\n",
> + iommu->name);
> + return -EINVAL;
> + }
> +
> + if ((s1_cfg->flags & IOMMU_VTD_S1_EAFE) && !ecap_eafs(iommu-
> >ecap)) {
> + pr_err_ratelimited("No extended access flag support
> on %s\n",
> + iommu->name);
> + return -EINVAL;
> + }
> +
> + spin_lock(&iommu->lock);
> + pte = intel_pasid_get_entry(dev, pasid);
> + if (!pte) {
> + spin_unlock(&iommu->lock);
> + return -ENODEV;
> + }
> + if (pasid_pte_is_present(pte)) {
> + spin_unlock(&iommu->lock);
> + return -EBUSY;
> + }
> +
> + pasid_clear_entry(pte);
> +
> + if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL)
> + pasid_set_flpm(pte, 1);
> +
> + pasid_set_flptr(pte, (uintptr_t)s1_gpgd);
> +
> + if (s1_cfg->flags & IOMMU_VTD_S1_SRE) {
> + pasid_set_sre(pte);
> + if (s1_cfg->flags & IOMMU_VTD_S1_WPE)
> + pasid_set_wpe(pte);
> + }
> +
> + if (s1_cfg->flags & IOMMU_VTD_S1_EAFE)
> + pasid_set_eafe(pte);
> +
> + if (s2_domain->force_snooping)
> + pasid_set_pgsnp(pte);
> +
> + pasid_set_slptr(pte, virt_to_phys(pgd));
> + pasid_set_fault_enable(pte);
> + pasid_set_domain_id(pte, did);
> + pasid_set_address_width(pte, s2_domain->agaw);
> + pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
> + pasid_set_present(pte);
> + spin_unlock(&iommu->lock);
All changes within iommu->lock are specific to the device specific
PASID entry. Probably this is one potential cleanup TODO to
use a per-device lock instead.
> +
> + pasid_flush_caches(iommu, pte, pasid, did);
> +
> + return 0;
> +}
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index 4e9e68c3c388..7906d73f4ded 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -109,6 +109,8 @@ int intel_pasid_setup_second_level(struct
> intel_iommu *iommu,
> int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
> struct dmar_domain *domain,
> struct device *dev, u32 pasid);
> +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device
> *dev,
> + u32 pasid, struct dmar_domain *domain);
> void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
> struct device *dev, u32 pasid,
> bool fault_ignore);
> --
> 2.34.1
On 2023/9/27 14:37, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, September 21, 2023 3:54 PM
>> +
>> +/**
>> + * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
>> + * @iommu: IOMMU which the device belong to
>> + * @dev: Device to be set up for translation
>> + * @pasid: PASID to be programmed in the device PASID table
>> + * @domain: User first-level domain nested on a s2 domain
>
> let's use stage-1/stage-2 consistently
ok
>> + *
>> + * This is used for nested translation based vIOMMU. e.g. guest IOVA and
>> + * guest shared virtual address. In this case, the first-level page tables
>> + * are used for GVA/GIOVA-GPA translation in the guest, the second-level
>> + * page tables are used for GPA-HPA translation.
>
> No need to mention guest or vIOMMU. Just stick to the fact of
> nested configuration.
how about below?
* This is used for nested translation. The input domain should be
* nested type and nested on a parent with 'is_nested_parent' flag
* set.
>> + */
>> +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device
>> *dev,
>> + u32 pasid, struct dmar_domain *domain)
>> +{
>> + struct iommu_hwpt_vtd_s1 *s1_cfg = &domain->s1_cfg;
>> + pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
>> + struct dmar_domain *s2_domain = domain->s2_domain;
>> + u16 did = domain_id_iommu(domain, iommu);
>> + struct dma_pte *pgd = s2_domain->pgd;
>> + struct pasid_entry *pte;
>> +
>> + if (!ecap_nest(iommu->ecap)) {
>> + pr_err_ratelimited("%s: No nested translation support\n",
>> + iommu->name);
>> + return -ENODEV;
>> + }
>
> this check has been done when creating the nest_parent s2 and iommufd
> shouldn't request a nested setup when the specified s2 is not nest_parent.
yes, it has been checked when allocating nest_parent s2. Current iommufd
nesting series does not check nested setup specifically. All the attach
logic is common across the domain types. So it seems no need for iommufd
to check it.
> If you still want to keep this check then a WARN_ON.
may just remove it.
>> +
>> + /*
>> + * Address width should match in two dimensions: CPU vs. IOMMU,
>> + * guest vs. host.
>> + */
>
> No 'guest'. Just that the user requested addr width must be supported by
> the hardware.
sure.
>
>> + switch (s1_cfg->addr_width) {
>> + case ADDR_WIDTH_4LEVEL:
>> + break;
>> +#ifdef CONFIG_X86
>> + case ADDR_WIDTH_5LEVEL:
>> + if (!cpu_feature_enabled(X86_FEATURE_LA57) ||
>> + !cap_fl5lp_support(iommu->cap)) {
>> + dev_err_ratelimited(dev,
>> + "5-level paging not supported\n");
>> + return -EINVAL;
>> + }
>> + break;
>
> I wonder whether this check is too strict. a nested configuration doesn't
> mandate vSVM. e.g. for guest IOVA the s1 page table is not walked by
> the CPU. Adding a CPU capability check here doesn't make much sense.
>
> Of course doing so may not cause a problem reality as very likely all
> Intel platforms have same 5level support in both CPU/IOMMU. But
> code-wise it's still better to do it right.
>
> Ideally the guest will be notified with 5level support only when the CPU
> supports it then it's the guest's business to choose the right format to
> match the CPU capability. Misconfiguration will be caught by the CPU
> virtualization path?
yes, it may make more sense that CPU and IOMMU side take its own check to
avoid misuage of 5lvl. For vSVM, it's guest's bussiness to ensure its vCPU
and vIOMMU both have 5lvl. will relax the check here.
>> +#endif
>> + default:
>> + dev_err_ratelimited(dev, "Invalid guest address width %d\n",
>> + s1_cfg->addr_width);
>
> ditto. use 'stage-1 address width' instead of 'guest'.
sure.
>
>> + return -EINVAL;
>> + }
>> +
>> + if ((s1_cfg->flags & IOMMU_VTD_S1_SRE) && !ecap_srs(iommu-
>>> ecap)) {
>> + pr_err_ratelimited("No supervisor request support on %s\n",
>> + iommu->name);
>> + return -EINVAL;
>> + }
>> +
>> + if ((s1_cfg->flags & IOMMU_VTD_S1_EAFE) && !ecap_eafs(iommu-
>>> ecap)) {
>> + pr_err_ratelimited("No extended access flag support
>> on %s\n",
>> + iommu->name);
>> + return -EINVAL;
>> + }
>> +
>> + spin_lock(&iommu->lock);
>> + pte = intel_pasid_get_entry(dev, pasid);
>> + if (!pte) {
>> + spin_unlock(&iommu->lock);
>> + return -ENODEV;
>> + }
>> + if (pasid_pte_is_present(pte)) {
>> + spin_unlock(&iommu->lock);
>> + return -EBUSY;
>> + }
>> +
>> + pasid_clear_entry(pte);
>> +
>> + if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL)
>> + pasid_set_flpm(pte, 1);
>> +
>> + pasid_set_flptr(pte, (uintptr_t)s1_gpgd);
>> +
>> + if (s1_cfg->flags & IOMMU_VTD_S1_SRE) {
>> + pasid_set_sre(pte);
>> + if (s1_cfg->flags & IOMMU_VTD_S1_WPE)
>> + pasid_set_wpe(pte);
>> + }
>> +
>> + if (s1_cfg->flags & IOMMU_VTD_S1_EAFE)
>> + pasid_set_eafe(pte);
>> +
>> + if (s2_domain->force_snooping)
>> + pasid_set_pgsnp(pte);
>> +
>> + pasid_set_slptr(pte, virt_to_phys(pgd));
>> + pasid_set_fault_enable(pte);
>> + pasid_set_domain_id(pte, did);
>> + pasid_set_address_width(pte, s2_domain->agaw);
>> + pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>> + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
>> + pasid_set_present(pte);
>> + spin_unlock(&iommu->lock);
>
> All changes within iommu->lock are specific to the device specific
> PASID entry. Probably this is one potential cleanup TODO to
> use a per-device lock instead.
yeah, a separate cleanup. is it, @Baolu?
>> +
>> + pasid_flush_caches(iommu, pte, pasid, did);
>> +
>> + return 0;
>> +}
>> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
>> index 4e9e68c3c388..7906d73f4ded 100644
>> --- a/drivers/iommu/intel/pasid.h
>> +++ b/drivers/iommu/intel/pasid.h
>> @@ -109,6 +109,8 @@ int intel_pasid_setup_second_level(struct
>> intel_iommu *iommu,
>> int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>> struct dmar_domain *domain,
>> struct device *dev, u32 pasid);
>> +int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device
>> *dev,
>> + u32 pasid, struct dmar_domain *domain);
>> void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
>> struct device *dev, u32 pasid,
>> bool fault_ignore);
>> --
>> 2.34.1
>
On 2023/10/13 20:40, Yi Liu wrote:
>>
>>> + return -EINVAL;
>>> + }
>>> +
>>> + if ((s1_cfg->flags & IOMMU_VTD_S1_SRE) && !ecap_srs(iommu-
>>>> ecap)) {
>>> + pr_err_ratelimited("No supervisor request support on %s\n",
>>> + iommu->name);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + if ((s1_cfg->flags & IOMMU_VTD_S1_EAFE) && !ecap_eafs(iommu-
>>>> ecap)) {
>>> + pr_err_ratelimited("No extended access flag support
>>> on %s\n",
>>> + iommu->name);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + spin_lock(&iommu->lock);
>>> + pte = intel_pasid_get_entry(dev, pasid);
>>> + if (!pte) {
>>> + spin_unlock(&iommu->lock);
>>> + return -ENODEV;
>>> + }
>>> + if (pasid_pte_is_present(pte)) {
>>> + spin_unlock(&iommu->lock);
>>> + return -EBUSY;
>>> + }
>>> +
>>> + pasid_clear_entry(pte);
>>> +
>>> + if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL)
>>> + pasid_set_flpm(pte, 1);
>>> +
>>> + pasid_set_flptr(pte, (uintptr_t)s1_gpgd);
>>> +
>>> + if (s1_cfg->flags & IOMMU_VTD_S1_SRE) {
>>> + pasid_set_sre(pte);
>>> + if (s1_cfg->flags & IOMMU_VTD_S1_WPE)
>>> + pasid_set_wpe(pte);
>>> + }
>>> +
>>> + if (s1_cfg->flags & IOMMU_VTD_S1_EAFE)
>>> + pasid_set_eafe(pte);
>>> +
>>> + if (s2_domain->force_snooping)
>>> + pasid_set_pgsnp(pte);
>>> +
>>> + pasid_set_slptr(pte, virt_to_phys(pgd));
>>> + pasid_set_fault_enable(pte);
>>> + pasid_set_domain_id(pte, did);
>>> + pasid_set_address_width(pte, s2_domain->agaw);
>>> + pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>>> + pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
>>> + pasid_set_present(pte);
>>> + spin_unlock(&iommu->lock);
>>
>> All changes within iommu->lock are specific to the device specific
>> PASID entry. Probably this is one potential cleanup TODO to
>> use a per-device lock instead.
>
> yeah, a separate cleanup. is it, @Baolu?
Sure. I'd like to take some time to consider this further. Please keep
it as-is for the time being.
Best regards,
baolu
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, October 13, 2023 8:40 PM
>
> >> + *
> >> + * This is used for nested translation based vIOMMU. e.g. guest IOVA and
> >> + * guest shared virtual address. In this case, the first-level page tables
> >> + * are used for GVA/GIOVA-GPA translation in the guest, the second-level
> >> + * page tables are used for GPA-HPA translation.
> >
> > No need to mention guest or vIOMMU. Just stick to the fact of
> > nested configuration.
>
> how about below?
>
> * This is used for nested translation. The input domain should be
> * nested type and nested on a parent with 'is_nested_parent' flag
> * set.
* This is used for nested translation. The input domain should be
* the nested type with the parent having 'is_nested_parent' flag set.
@@ -335,6 +335,15 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe)
pasid_set_bits(&pe->val[0], 1 << 1, 0);
}
+/*
+ * Setup the SRE(Supervisor Request Enable) field (Bit 128) of a
+ * scalable mode PASID entry.
+ */
+static inline void pasid_set_sre(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[2], 1 << 0, 1);
+}
+
/*
* Setup the WPE(Write Protect Enable) field (Bit 132) of a
* scalable mode PASID entry.
@@ -402,6 +411,15 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
}
+/*
+ * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
+ * of a scalable mode PASID entry.
+ */
+static inline void pasid_set_eafe(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
+}
+
static void
pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
u16 did, u32 pasid)
@@ -713,3 +731,110 @@ void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
if (!cap_caching_mode(iommu->cap))
devtlb_invalidation_with_pasid(iommu, dev, pasid);
}
+
+/**
+ * intel_pasid_setup_nested() - Set up PASID entry for nested translation.
+ * @iommu: IOMMU which the device belong to
+ * @dev: Device to be set up for translation
+ * @pasid: PASID to be programmed in the device PASID table
+ * @domain: User first-level domain nested on a s2 domain
+ *
+ * This is used for nested translation based vIOMMU. e.g. guest IOVA and
+ * guest shared virtual address. In this case, the first-level page tables
+ * are used for GVA/GIOVA-GPA translation in the guest, the second-level
+ * page tables are used for GPA-HPA translation.
+ */
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
+ u32 pasid, struct dmar_domain *domain)
+{
+ struct iommu_hwpt_vtd_s1 *s1_cfg = &domain->s1_cfg;
+ pgd_t *s1_gpgd = (pgd_t *)(uintptr_t)domain->s1_pgtbl;
+ struct dmar_domain *s2_domain = domain->s2_domain;
+ u16 did = domain_id_iommu(domain, iommu);
+ struct dma_pte *pgd = s2_domain->pgd;
+ struct pasid_entry *pte;
+
+ if (!ecap_nest(iommu->ecap)) {
+ pr_err_ratelimited("%s: No nested translation support\n",
+ iommu->name);
+ return -ENODEV;
+ }
+
+ /*
+ * Address width should match in two dimensions: CPU vs. IOMMU,
+ * guest vs. host.
+ */
+ switch (s1_cfg->addr_width) {
+ case ADDR_WIDTH_4LEVEL:
+ break;
+#ifdef CONFIG_X86
+ case ADDR_WIDTH_5LEVEL:
+ if (!cpu_feature_enabled(X86_FEATURE_LA57) ||
+ !cap_fl5lp_support(iommu->cap)) {
+ dev_err_ratelimited(dev,
+ "5-level paging not supported\n");
+ return -EINVAL;
+ }
+ break;
+#endif
+ default:
+ dev_err_ratelimited(dev, "Invalid guest address width %d\n",
+ s1_cfg->addr_width);
+ return -EINVAL;
+ }
+
+ if ((s1_cfg->flags & IOMMU_VTD_S1_SRE) && !ecap_srs(iommu->ecap)) {
+ pr_err_ratelimited("No supervisor request support on %s\n",
+ iommu->name);
+ return -EINVAL;
+ }
+
+ if ((s1_cfg->flags & IOMMU_VTD_S1_EAFE) && !ecap_eafs(iommu->ecap)) {
+ pr_err_ratelimited("No extended access flag support on %s\n",
+ iommu->name);
+ return -EINVAL;
+ }
+
+ spin_lock(&iommu->lock);
+ pte = intel_pasid_get_entry(dev, pasid);
+ if (!pte) {
+ spin_unlock(&iommu->lock);
+ return -ENODEV;
+ }
+ if (pasid_pte_is_present(pte)) {
+ spin_unlock(&iommu->lock);
+ return -EBUSY;
+ }
+
+ pasid_clear_entry(pte);
+
+ if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL)
+ pasid_set_flpm(pte, 1);
+
+ pasid_set_flptr(pte, (uintptr_t)s1_gpgd);
+
+ if (s1_cfg->flags & IOMMU_VTD_S1_SRE) {
+ pasid_set_sre(pte);
+ if (s1_cfg->flags & IOMMU_VTD_S1_WPE)
+ pasid_set_wpe(pte);
+ }
+
+ if (s1_cfg->flags & IOMMU_VTD_S1_EAFE)
+ pasid_set_eafe(pte);
+
+ if (s2_domain->force_snooping)
+ pasid_set_pgsnp(pte);
+
+ pasid_set_slptr(pte, virt_to_phys(pgd));
+ pasid_set_fault_enable(pte);
+ pasid_set_domain_id(pte, did);
+ pasid_set_address_width(pte, s2_domain->agaw);
+ pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+ pasid_set_translation_type(pte, PASID_ENTRY_PGTT_NESTED);
+ pasid_set_present(pte);
+ spin_unlock(&iommu->lock);
+
+ pasid_flush_caches(iommu, pte, pasid, did);
+
+ return 0;
+}
@@ -109,6 +109,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
struct dmar_domain *domain,
struct device *dev, u32 pasid);
+int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
+ u32 pasid, struct dmar_domain *domain);
void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
struct device *dev, u32 pasid,
bool fault_ignore);