diff mbox series

[1/4] iommu/amd: Introduce Protection-domain flag VFIO

Message ID	20230110143137.54517-2-suravee.suthikulpanit@amd.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C From: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> To: <linux-kernel@vger.kernel.org>, <iommu@lists.linux.dev> CC: <joro@8bytes.org>, <robin.murphy@arm.com>, <ashish.kalra@amd.com>, <thomas.lendacky@amd.com>, <vasant.hegde@amd.com>, <jon.grimm@amd.com>, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Subject: [PATCH 1/4] iommu/amd: Introduce Protection-domain flag VFIO Date: Tue, 10 Jan 2023 08:31:34 -0600 Message-ID: <20230110143137.54517-2-suravee.suthikulpanit@amd.com> In-Reply-To: <20230110143137.54517-1-suravee.suthikulpanit@amd.com> References: <20230110143137.54517-1-suravee.suthikulpanit@amd.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk
Series	iommu/amd: Force SNP-enabled VFIO domain to 4K page size \| [0/4] iommu/amd: Force SNP-enabled VFIO domain to 4K page size [1/4] iommu/amd: Introduce Protection-domain flag VFIO [2/4] iommu/amd: Introduce structure amd_iommu_svm_ops.is_snp_guest() [3/4] iommu: Introduce IOMMU call-back for processing struct KVM assigned to VFIO [4/4] iommu/amd: Force SNP-enabled VFIO domain to 4K page size

Commit Message

Suravee Suthikulpanit Jan. 10, 2023, 2:31 p.m. UTC

  Currently, to detect if a domain is enabled with VFIO support, the driver
checks if the domain has devices attached and check if the domain type is
IOMMU_DOMAIN_UNMANAGED.

To be more explicit, introduce protection-domain flag PD_VFIO_MASK
to signify an VFIO-enabled  domain is enabled with VFIO support.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
 drivers/iommu/amd/amd_iommu_types.h | 1 +
 drivers/iommu/amd/iommu.c           | 7 +++++--
 2 files changed, 6 insertions(+), 2 deletions(-)

Comments

kernel test robot Jan. 11, 2023, 3:31 a.m. UTC | #1

Hi Suravee,

I love your patch! Perhaps something to improve:

[auto build test WARNING on awilliam-vfio/for-linus]
[also build test WARNING on linus/master v6.2-rc3 next-20230110]
[cannot apply to joro-iommu/next awilliam-vfio/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Suravee-Suthikulpanit/iommu-amd-Introduce-Protection-domain-flag-VFIO/20230110-223527
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/20230110143137.54517-2-suravee.suthikulpanit%40amd.com
patch subject: [PATCH 1/4] iommu/amd: Introduce Protection-domain flag VFIO
config: x86_64-defconfig
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/ea80bf7e918395d3445c0114fee20997512a4ccf
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Suravee-Suthikulpanit/iommu-amd-Introduce-Protection-domain-flag-VFIO/20230110-223527
        git checkout ea80bf7e918395d3445c0114fee20997512a4ccf
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/iommu/amd/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/iommu/amd/iommu.c: In function 'amd_iommu_detach_device':
>> drivers/iommu/amd/iommu.c:2136:35: warning: unused variable 'domain' [-Wunused-variable]
    2136 |         struct protection_domain *domain = to_pdomain(dom);
         |                                   ^~~~~~


vim +/domain +2136 drivers/iommu/amd/iommu.c

  2131	
  2132	static void amd_iommu_detach_device(struct iommu_domain *dom,
  2133					    struct device *dev)
  2134	{
  2135		struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
> 2136		struct protection_domain *domain = to_pdomain(dom);
  2137		struct amd_iommu *iommu;
  2138	
  2139		if (!check_device(dev))
  2140			return;
  2141	
  2142		if (dev_data->domain != NULL)
  2143			detach_device(dev);
  2144	
  2145		iommu = rlookup_amd_iommu(dev);
  2146		if (!iommu)
  2147			return;
  2148

Jason Gunthorpe Jan. 13, 2023, 3:33 p.m. UTC | #2

On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
> Currently, to detect if a domain is enabled with VFIO support, the driver
> checks if the domain has devices attached and check if the domain type is
> IOMMU_DOMAIN_UNMANAGED.

NAK

If you need weird HW specific stuff like this then please implement it
properly in iommufd, not try and randomly guess what things need from
the domain type.

All this confidential computing stuff needs a comprehensive solution,
not some piecemeal mess. How can you even use a CC guest with VFIO in
the upstream kernel? Hmm?

Jason

Kalra, Ashish Jan. 19, 2023, 8:54 a.m. UTC | #3

Hello Jason,

On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
> On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
>> Currently, to detect if a domain is enabled with VFIO support, the driver
>> checks if the domain has devices attached and check if the domain type is
>> IOMMU_DOMAIN_UNMANAGED.
> 
> NAK
> 
> If you need weird HW specific stuff like this then please implement it
> properly in iommufd, not try and randomly guess what things need from
> the domain type.
> 
> All this confidential computing stuff needs a comprehensive solution,
> not some piecemeal mess. How can you even use a CC guest with VFIO in
> the upstream kernel? Hmm?
> 

Currently all guest devices are untrusted - whether they are emulated, 
virtio or passthrough. In the current use case of VFIO 
device-passthrough to an SNP guest, the pass-through device will perform 
DMA to un-encrypted or shared guest memory, in the same way as virtio or 
emulated devices.

This fix is prompted by an issue reported by Nvidia, they are trying to 
do PCIe device passthrough to SNP guest. The memory allocated for DMA is 
through dma_alloc_coherent() in the SNP guest and during DMA I/O an 
RMP_PAGE_FAULT is observed on the host.

These dma_alloc_coherent() calls map into page state change hypercalls 
into the host to change guest page state from encrypted to shared in the 
RMP table.

Following is a link to issue discussed above:
https://github.com/AMDESE/AMDSEV/issues/109

Now, to set individual 4K entries to different shared/private mappings 
in NPT or host page tables for large page entries, the RMP and NPT/host 
page table large page entries are split to 4K pte’s.

The same split is required in iommu page table entries to remain in sync 
with the corresponding RMP table entry. If the iommu entry is covering a 
range with a large page entry and the individual 4K mappings in that 
range have now changed, the RMP checks during IOMMU page table walk will 
cause a RMP page fault to be signaled.

The fix is to force 4K page size for IOMMU page tables for SNP guests.

This patch-set adds support to detect if a domain belongs to an 
SNP-enabled guest. This way it can set default page size of a domain to 
4K only for SNP-enabled guest and allow non-SNP guest to use larger page 
size.

Hopefully, this explains the usage case for this patch-set.

Thanks,
Ashish

Jason Gunthorpe Jan. 19, 2023, 5:44 p.m. UTC | #4

On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
> Hello Jason,
> 
> On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
> > On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
> > > Currently, to detect if a domain is enabled with VFIO support, the driver
> > > checks if the domain has devices attached and check if the domain type is
> > > IOMMU_DOMAIN_UNMANAGED.
> > 
> > NAK
> > 
> > If you need weird HW specific stuff like this then please implement it
> > properly in iommufd, not try and randomly guess what things need from
> > the domain type.
> > 
> > All this confidential computing stuff needs a comprehensive solution,
> > not some piecemeal mess. How can you even use a CC guest with VFIO in
> > the upstream kernel? Hmm?
> > 
> 
> Currently all guest devices are untrusted - whether they are emulated,
> virtio or passthrough. In the current use case of VFIO device-passthrough to
> an SNP guest, the pass-through device will perform DMA to un-encrypted or
> shared guest memory, in the same way as virtio or emulated devices.
> 
> This fix is prompted by an issue reported by Nvidia, they are trying to do
> PCIe device passthrough to SNP guest. The memory allocated for DMA is
> through dma_alloc_coherent() in the SNP guest and during DMA I/O an
> RMP_PAGE_FAULT is observed on the host.
> 
> These dma_alloc_coherent() calls map into page state change hypercalls into
> the host to change guest page state from encrypted to shared in the RMP
> table.
> 
> Following is a link to issue discussed above:
> https://github.com/AMDESE/AMDSEV/issues/109

Wow you should really write all of this in the commmit message

> Now, to set individual 4K entries to different shared/private
> mappings in NPT or host page tables for large page entries, the RMP
> and NPT/host page table large page entries are split to 4K pte’s.

Why are mappings to private pages even in the iommu in the first
place - and how did they even get there?

I thought the design for the private memory was walling it off in a
memfd and making it un-gup'able?

This seems to be your actual problem, somehow the iommu is being
loaded with private memory PFNs instead of only being loaded with
shared PFNs when shared mappings are created?

If the IOMMU mappings actually only extend to the legitimate shared
pages then you don't have a problem with large IOPTEs spanning a
mixture of page types.

> The fix is to force 4K page size for IOMMU page tables for SNP guests.

But even if you want to persue this as the fix, it should not be done
in this way.

> This patch-set adds support to detect if a domain belongs to an SNP-enabled
> guest. This way it can set default page size of a domain to 4K only for
> SNP-enabled guest and allow non-SNP guest to use larger page size.

As I said, the KVM has nothing to do with the iommu and I want to
laregly keep it that way.

If the VMM needs to request a 4k page size only iommu_domain because
it is somehow mapping mixtures of private and public pages, then the
VMM knows it is doing this crazy thing and it needs to ask iommufd
directly for customized iommu_domain from the driver.

No KVM interconnection.

In fact, we already have a way to do this in iommufd generically, have
the VMM set IOMMU_OPTION_HUGE_PAGES = 0.

Jason

Kalra, Ashish Jan. 20, 2023, 3:12 p.m. UTC | #5

On 1/19/2023 11:44 AM, Jason Gunthorpe wrote:
> On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
>> Hello Jason,
>>
>> On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
>>> On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
>>>> Currently, to detect if a domain is enabled with VFIO support, the driver
>>>> checks if the domain has devices attached and check if the domain type is
>>>> IOMMU_DOMAIN_UNMANAGED.
>>>
>>> NAK
>>>
>>> If you need weird HW specific stuff like this then please implement it
>>> properly in iommufd, not try and randomly guess what things need from
>>> the domain type.
>>>
>>> All this confidential computing stuff needs a comprehensive solution,
>>> not some piecemeal mess. How can you even use a CC guest with VFIO in
>>> the upstream kernel? Hmm?
>>>
>>
>> Currently all guest devices are untrusted - whether they are emulated,
>> virtio or passthrough. In the current use case of VFIO device-passthrough to
>> an SNP guest, the pass-through device will perform DMA to un-encrypted or
>> shared guest memory, in the same way as virtio or emulated devices.
>>
>> This fix is prompted by an issue reported by Nvidia, they are trying to do
>> PCIe device passthrough to SNP guest. The memory allocated for DMA is
>> through dma_alloc_coherent() in the SNP guest and during DMA I/O an
>> RMP_PAGE_FAULT is observed on the host.
>>
>> These dma_alloc_coherent() calls map into page state change hypercalls into
>> the host to change guest page state from encrypted to shared in the RMP
>> table.
>>
>> Following is a link to issue discussed above:
>> https://github.com/AMDESE/AMDSEV/issues/109
> 
> Wow you should really write all of this in the commmit message
> 
>> Now, to set individual 4K entries to different shared/private
>> mappings in NPT or host page tables for large page entries, the RMP
>> and NPT/host page table large page entries are split to 4K pte’s.
> 
> Why are mappings to private pages even in the iommu in the first
> place - and how did they even get there?
>

You seem to be confusing between host/NPT page tables and IOMMU page 
tables.

There are no private page mappings in the IOMMU page tables, as i 
mentioned above currently all DMA to SNP guest is to/from shared memory.

> I thought the design for the private memory was walling it off in a
> memfd and making it un-gup'able?
> 
> This seems to be your actual problem, somehow the iommu is being
> loaded with private memory PFNs instead of only being loaded with
> shared PFNs when shared mappings are created?
> 

The IOMMU page tables are loaded with shared PFNs and not private memory 
PFNs.

> If the IOMMU mappings actually only extend to the legitimate shared
> pages then you don't have a problem with large IOPTEs spanning a
> mixture of page types.
> 
>> The fix is to force 4K page size for IOMMU page tables for SNP guests.
> 
> But even if you want to persue this as the fix, it should not be done
> in this way.
> 
>> This patch-set adds support to detect if a domain belongs to an SNP-enabled
>> guest. This way it can set default page size of a domain to 4K only for
>> SNP-enabled guest and allow non-SNP guest to use larger page size.
> 
> As I said, the KVM has nothing to do with the iommu and I want to
> laregly keep it that way.
> 
> If the VMM needs to request a 4k page size only iommu_domain because
> it is somehow mapping mixtures of private and public pages, 

Again, there is no mixture of private and public pages, the IOMMU only 
has shared page mappings.

Thanks,
Ashish

>then the
> VMM knows it is doing this crazy thing and it needs to ask iommufd
> directly for customized iommu_domain from the driver.
> 
> No KVM interconnection.
> 
> In fact, we already have a way to do this in iommufd generically, have
> the VMM set IOMMU_OPTION_HUGE_PAGES = 0.
> 
> Jason
>

Jason Gunthorpe Jan. 20, 2023, 4:13 p.m. UTC | #6

On Fri, Jan 20, 2023 at 09:12:26AM -0600, Kalra, Ashish wrote:
> On 1/19/2023 11:44 AM, Jason Gunthorpe wrote:
> > On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
> > > Hello Jason,
> > > 
> > > On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
> > > > On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
> > > > > Currently, to detect if a domain is enabled with VFIO support, the driver
> > > > > checks if the domain has devices attached and check if the domain type is
> > > > > IOMMU_DOMAIN_UNMANAGED.
> > > > 
> > > > NAK
> > > > 
> > > > If you need weird HW specific stuff like this then please implement it
> > > > properly in iommufd, not try and randomly guess what things need from
> > > > the domain type.
> > > > 
> > > > All this confidential computing stuff needs a comprehensive solution,
> > > > not some piecemeal mess. How can you even use a CC guest with VFIO in
> > > > the upstream kernel? Hmm?
> > > > 
> > > 
> > > Currently all guest devices are untrusted - whether they are emulated,
> > > virtio or passthrough. In the current use case of VFIO device-passthrough to
> > > an SNP guest, the pass-through device will perform DMA to un-encrypted or
> > > shared guest memory, in the same way as virtio or emulated devices.
> > > 
> > > This fix is prompted by an issue reported by Nvidia, they are trying to do
> > > PCIe device passthrough to SNP guest. The memory allocated for DMA is
> > > through dma_alloc_coherent() in the SNP guest and during DMA I/O an
> > > RMP_PAGE_FAULT is observed on the host.
> > > 
> > > These dma_alloc_coherent() calls map into page state change hypercalls into
> > > the host to change guest page state from encrypted to shared in the RMP
> > > table.
> > > 
> > > Following is a link to issue discussed above:
> > > https://github.com/AMDESE/AMDSEV/issues/109
> > 
> > Wow you should really write all of this in the commmit message
> > 
> > > Now, to set individual 4K entries to different shared/private
> > > mappings in NPT or host page tables for large page entries, the RMP
> > > and NPT/host page table large page entries are split to 4K pte’s.
> > 
> > Why are mappings to private pages even in the iommu in the first
> > place - and how did they even get there?
> > 
> 
> You seem to be confusing between host/NPT page tables and IOMMU page tables.

No, I haven't. I'm repeating what was said:

 during DMA I/O an RMP_PAGE_FAULT is observed on the host.

So, I'm interested to hear how you can get a RMP_PAGE_FAULT from the
IOMMU if the IOMMU is only programmed with shared pages that, by (my)
definition, are accessible to the CPU and should not generate a
RMP_PAGE_FAULT?

I think you are confusing my use of the word private with some AMD
architecture deatils. When I say private I mean that the host CPU will
generate a violation if it tries to access the memory.

I think the conclusion is logical - if the IOMMU is experiencing a
protection violation it is because the IOMMU was programed with PFNs
it is not allowed to access - and so why was that even done in the
first place?

I suppose what is going on is you program the IOPTEs with PFNs of
unknown state and when the PFN changes access protections the IOMMU
can simply use it without needing to synchronize with the access
protection change. And your problem is that the granularity of access
protection change does not match the IOPTE granularity in the IOMMU.

But this seems very wasteful as the IOMMU will be using IOPTEs and
also will pin the memory when the systems *knows* this memory cannot
be accessed through the IOMMU. It seems much better to dynamically
establish IOMMU mappings only when you learn that the memory is
actually accesisble to the IOMMU.

Also, I thought the leading plan for CC was to use the memfd approach here:

https://lore.kernel.org/kvm/20220915142913.2213336-1-chao.p.peng@linux.intel.com/

Which prevents mmaping the memory to userspace - so how did it get
into the IOMMU in the first place?

Jason

Kalra, Ashish Jan. 20, 2023, 5:01 p.m. UTC | #7

Hello Jason,

On 1/20/2023 10:13 AM, Jason Gunthorpe wrote:
> On Fri, Jan 20, 2023 at 09:12:26AM -0600, Kalra, Ashish wrote:
>> On 1/19/2023 11:44 AM, Jason Gunthorpe wrote:
>>> On Thu, Jan 19, 2023 at 02:54:43AM -0600, Kalra, Ashish wrote:
>>>> Hello Jason,
>>>>
>>>> On 1/13/2023 9:33 AM, Jason Gunthorpe wrote:
>>>>> On Tue, Jan 10, 2023 at 08:31:34AM -0600, Suravee Suthikulpanit wrote:
>>>>>> Currently, to detect if a domain is enabled with VFIO support, the driver
>>>>>> checks if the domain has devices attached and check if the domain type is
>>>>>> IOMMU_DOMAIN_UNMANAGED.
>>>>>
>>>>> NAK
>>>>>
>>>>> If you need weird HW specific stuff like this then please implement it
>>>>> properly in iommufd, not try and randomly guess what things need from
>>>>> the domain type.
>>>>>
>>>>> All this confidential computing stuff needs a comprehensive solution,
>>>>> not some piecemeal mess. How can you even use a CC guest with VFIO in
>>>>> the upstream kernel? Hmm?
>>>>>
>>>>
>>>> Currently all guest devices are untrusted - whether they are emulated,
>>>> virtio or passthrough. In the current use case of VFIO device-passthrough to
>>>> an SNP guest, the pass-through device will perform DMA to un-encrypted or
>>>> shared guest memory, in the same way as virtio or emulated devices.
>>>>
>>>> This fix is prompted by an issue reported by Nvidia, they are trying to do
>>>> PCIe device passthrough to SNP guest. The memory allocated for DMA is
>>>> through dma_alloc_coherent() in the SNP guest and during DMA I/O an
>>>> RMP_PAGE_FAULT is observed on the host.
>>>>
>>>> These dma_alloc_coherent() calls map into page state change hypercalls into
>>>> the host to change guest page state from encrypted to shared in the RMP
>>>> table.
>>>>
>>>> Following is a link to issue discussed above:
>>>> https://github.com/AMDESE/AMDSEV/issues/109
>>>
>>> Wow you should really write all of this in the commmit message
>>>
>>>> Now, to set individual 4K entries to different shared/private
>>>> mappings in NPT or host page tables for large page entries, the RMP
>>>> and NPT/host page table large page entries are split to 4K pte’s.
>>>
>>> Why are mappings to private pages even in the iommu in the first
>>> place - and how did they even get there?
>>>
>>
>> You seem to be confusing between host/NPT page tables and IOMMU page tables.
> 
> No, I haven't. I'm repeating what was said:
> 
>   during DMA I/O an RMP_PAGE_FAULT is observed on the host.
> 
> So, I'm interested to hear how you can get a RMP_PAGE_FAULT from the
> IOMMU if the IOMMU is only programmed with shared pages that, by (my)
> definition, are accessible to the CPU and should not generate a
> RMP_PAGE_FAULT?

Yes, sorry i got confused with your use of the word private as you 
mention below.

We basically get the RMP #PF from the IOMMU because there is a page size 
mismatch between the RMP table and the IOMMU page table. The RMP table's 
large page entry has been smashed to 4K PTEs to handle page state change 
to shared on 4K mappings, so this change has to be synced up with the 
IOMMU page table, otherwise there is now a page size mismatch between 
RMP table and IOMMU page table which causes the RMP #PF.

Thanks,
Ashish

> 
> I think you are confusing my use of the word private with some AMD
> architecture deatils. When I say private I mean that the host CPU will
> generate a violation if it tries to access the memory.
> 
> I think the conclusion is logical - if the IOMMU is experiencing a
> protection violation it is because the IOMMU was programed with PFNs
> it is not allowed to access - and so why was that even done in the
> first place?
> 
> I suppose what is going on is you program the IOPTEs with PFNs of
> unknown state and when the PFN changes access protections the IOMMU
> can simply use it without needing to synchronize with the access
> protection change. And your problem is that the granularity of access
> protection change does not match the IOPTE granularity in the IOMMU.
> 
> But this seems very wasteful as the IOMMU will be using IOPTEs and
> also will pin the memory when the systems *knows* this memory cannot
> be accessed through the IOMMU. It seems much better to dynamically
> establish IOMMU mappings only when you learn that the memory is
> actually accesisble to the IOMMU.
> 
> Also, I thought the leading plan for CC was to use the memfd approach here:
> 
> https://lore.kernel.org/kvm/20220915142913.2213336-1-chao.p.peng@linux.intel.com/
> 
> Which prevents mmaping the memory to userspace - so how did it get
> into the IOMMU in the first place?
> 
> Jason
>

Jason Gunthorpe Jan. 20, 2023, 5:50 p.m. UTC | #8

On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:

> We basically get the RMP #PF from the IOMMU because there is a page size
> mismatch between the RMP table and the IOMMU page table. The RMP table's
> large page entry has been smashed to 4K PTEs to handle page state change to
> shared on 4K mappings, so this change has to be synced up with the IOMMU
> page table, otherwise there is now a page size mismatch between RMP table
> and IOMMU page table which causes the RMP #PF.

I understand that, you haven't answered my question:

Why is the IOMMU being programmed with pages it cannot access in the
first place?

Don't do that is the obvious solution there, and preserves huge page
IO performance.

Jason

Kalra, Ashish Jan. 20, 2023, 7:55 p.m. UTC | #9

On 1/20/2023 11:50 AM, Jason Gunthorpe wrote:
> On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:
> 
>> We basically get the RMP #PF from the IOMMU because there is a page size
>> mismatch between the RMP table and the IOMMU page table. The RMP table's
>> large page entry has been smashed to 4K PTEs to handle page state change to
>> shared on 4K mappings, so this change has to be synced up with the IOMMU
>> page table, otherwise there is now a page size mismatch between RMP table
>> and IOMMU page table which causes the RMP #PF.
> 
> I understand that, you haven't answered my question:
> 
> Why is the IOMMU being programmed with pages it cannot access in the
> first place?
>

I believe the IOMMU page tables are setup as part of device pass-through 
to be able to do DMA to all of the guest memory, but i am not an IOMMU 
expert, so i will let Suravee elaborate on this.

Thanks,
Ashish

> Don't do that is the obvious solution there, and preserves huge page
> IO performance.
> 
> Jason
>

Tom Lendacky Jan. 20, 2023, 10:42 p.m. UTC | #10

On 1/20/23 13:55, Kalra, Ashish wrote:
> On 1/20/2023 11:50 AM, Jason Gunthorpe wrote:
>> On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:
>>
>>> We basically get the RMP #PF from the IOMMU because there is a page size
>>> mismatch between the RMP table and the IOMMU page table. The RMP table's
>>> large page entry has been smashed to 4K PTEs to handle page state 
>>> change to
>>> shared on 4K mappings, so this change has to be synced up with the IOMMU
>>> page table, otherwise there is now a page size mismatch between RMP table
>>> and IOMMU page table which causes the RMP #PF.
>>
>> I understand that, you haven't answered my question:
>>
>> Why is the IOMMU being programmed with pages it cannot access in the
>> first place?
>>
> 
> I believe the IOMMU page tables are setup as part of device pass-through 
> to be able to do DMA to all of the guest memory, but i am not an IOMMU 
> expert, so i will let Suravee elaborate on this.

Right. And what I believe Jason is saying is, that for SNP, since we know 
we can only DMA into shared pages, there is no need to setup the initial 
IOMMU page tables for all of guest memory. Instead, wait and set up IOMMU 
mappings when we do a page state change to shared and remove any mappings 
when we do a page state change to private.

Thanks,
Tom

> 
> Thanks,
> Ashish
> 
>> Don't do that is the obvious solution there, and preserves huge page
>> IO performance.
>>
>> Jason
>>

Jason Gunthorpe Jan. 21, 2023, 12:09 a.m. UTC | #11

On Fri, Jan 20, 2023 at 04:42:26PM -0600, Tom Lendacky wrote:
> On 1/20/23 13:55, Kalra, Ashish wrote:
> > On 1/20/2023 11:50 AM, Jason Gunthorpe wrote:
> > > On Fri, Jan 20, 2023 at 11:01:21AM -0600, Kalra, Ashish wrote:
> > > 
> > > > We basically get the RMP #PF from the IOMMU because there is a page size
> > > > mismatch between the RMP table and the IOMMU page table. The RMP table's
> > > > large page entry has been smashed to 4K PTEs to handle page
> > > > state change to
> > > > shared on 4K mappings, so this change has to be synced up with the IOMMU
> > > > page table, otherwise there is now a page size mismatch between RMP table
> > > > and IOMMU page table which causes the RMP #PF.
> > > 
> > > I understand that, you haven't answered my question:
> > > 
> > > Why is the IOMMU being programmed with pages it cannot access in the
> > > first place?
> > > 
> > 
> > I believe the IOMMU page tables are setup as part of device pass-through
> > to be able to do DMA to all of the guest memory, but i am not an IOMMU
> > expert, so i will let Suravee elaborate on this.
> 
> Right. And what I believe Jason is saying is, that for SNP, since we know we
> can only DMA into shared pages, there is no need to setup the initial IOMMU
> page tables for all of guest memory. Instead, wait and set up IOMMU mappings
> when we do a page state change to shared and remove any mappings when we do
> a page state change to private.

Correct.

I don't know the details of how the shared/private works on AMD, eg if
the hypervisor even knows of the private/shared transformation..

At the very worst I suppose if you just turn on the vIOMMU it should
just start working as the vIOMMU mode should make the paging
dynamic. eg virtio-iommu or something general might even do the job.

Pinning pages is expensive, breaks swap, defragmentation and wastes a
lot of iommu memory. Given that the bounce buffers really shouldn't be
reallocated constantly I'd expect vIOMMU to be performance OK.

This solves the page size mistmatch issue because the iommu never has
a PFN installed that would generate a RMP #PF. The iommu can continue
to use large page sizes whenever possible.

It seems to me the current approach of just stuffing all memory into
the iommu is a just a shortcut to get something working.

Otherwise, what I said before is still the case: only the VMM knows
what it is doing. It knows if it is using a model where it programs
private memory into the IOMMU so it knows if it should ask the kernel
to use a 4k iommu page size.

Trying to have the kernel guess what userspace is doing based on the
kvm is simply architecturally wrong.

Jason

diff mbox series

Patch

diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 1d0a70c85333..ad124959a26a 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -439,6 +439,7 @@ 
 					      translation */
 #define PD_IOMMUV2_MASK		(1UL << 3) /* domain has gcr3 table */
 #define PD_GIOV_MASK		(1UL << 4) /* domain enable GIOV support */
+#define PD_VFIO_MASK		(1UL << 5) /* domain enable VFIO support */
 
 extern bool amd_iommu_dump;
 #define DUMP_printk(format, arg...)				\
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3847f3bdc568..681ab1fdb7d5 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2056,6 +2056,8 @@  static struct protection_domain *protection_domain_alloc(unsigned int type)
 		mode = PAGE_MODE_NONE;
 	} else if (type == IOMMU_DOMAIN_UNMANAGED) {
 		pgtable = AMD_IOMMU_V1;
+		/* Mark unmanaged domain for VFIO */
+		domain->flags |= PD_VFIO_MASK;
 	}
 
 	switch (pgtable) {
@@ -2130,6 +2132,7 @@  static void amd_iommu_detach_device(struct iommu_domain *dom,
 				    struct device *dev)
 {
 	struct iommu_dev_data *dev_data = dev_iommu_priv_get(dev);
+	struct protection_domain *domain = to_pdomain(dom);
 	struct amd_iommu *iommu;
 
 	if (!check_device(dev))
@@ -2144,7 +2147,7 @@  static void amd_iommu_detach_device(struct iommu_domain *dom,
 
 #ifdef CONFIG_IRQ_REMAP
 	if (AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) &&
-	    (dom->type == IOMMU_DOMAIN_UNMANAGED))
+	    (domain->flags & PD_VFIO_MASK))
 		dev_data->use_vapic = 0;
 #endif
 
@@ -2176,7 +2179,7 @@  static int amd_iommu_attach_device(struct iommu_domain *dom,
 
 #ifdef CONFIG_IRQ_REMAP
 	if (AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)) {
-		if (dom->type == IOMMU_DOMAIN_UNMANAGED)
+		if (domain->flags & PD_VFIO_MASK)
 			dev_data->use_vapic = 1;
 		else
 			dev_data->use_vapic = 0;