kvm: x86/mmu: Simplify pte_list_{add|remove}
Commit Message
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Simplify pte_list_{add|remove} by ensuring that all non-head pte_list_descs
are always full and that addition/removal are always performed on the head.
To make pte_list_add() return a count as before, @tail_count is also
added to struct pte_list_desc.
No visible performance change in tests, but pte_list_add() no longer
shows up in perf results for COWed pages, even when the guest forks
millions of tasks.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/kvm/mmu/mmu.c | 77 +++++++++++++++++++++---------------------
1 file changed, 38 insertions(+), 39 deletions(-)
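For context, a minimal userspace model of the invariant the patch establishes
(my sketch, not the kernel code; PTE_LIST_EXT and the field names mirror
arch/x86/kvm/mmu/mmu.c): only the head pte_list_desc may be partially filled,
so insertion always touches the head and allocates a new head once the
current one fills up.

#include <stdlib.h>

#define PTE_LIST_EXT 14

struct desc {
	struct desc *more;		/* older descriptors, always full */
	unsigned int spte_count;	/* entries used in this descriptor */
	unsigned int tail_count;	/* total entries in the ->more chain */
	unsigned long *sptes[PTE_LIST_EXT];
};

/* Add an entry; returns how many entries existed before the addition. */
static unsigned int model_add(struct desc **head, unsigned long *spte)
{
	struct desc *d = *head;
	unsigned int count = d->tail_count + d->spte_count;

	if (d->spte_count == PTE_LIST_EXT) {
		/* Head full => every descriptor full: push a new head. */
		struct desc *new_head = calloc(1, sizeof(*new_head));
		/* allocation-failure handling elided for brevity */

		new_head->more = d;
		new_head->tail_count = count;	/* all entries are below */
		*head = d = new_head;
	}
	d->sptes[d->spte_count++] = spte;
	return count;
}

Removal mirrors this: it backfills the vacated slot with the head's last
entry and frees the head when it empties, so the tails stay full without
ever walking the list.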
Comments
On Fri, 13 Jan 2023 20:29:10 +0800, Lai Jiangshan wrote:
> Simplify pte_list_{add|remove} by ensuring that all non-head pte_list_descs
> are always full and that addition/removal are always performed on the head.
>
> To make pte_list_add() return a count as before, @tail_count is also
> added to struct pte_list_desc.
>
> No visible performance change in tests, but pte_list_add() no longer
> shows up in perf results for COWed pages, even when the guest forks
> millions of tasks.
>
> [...]
Applied to kvm-x86 mmu, thanks! I added quite a few comments and a BUG_ON() to
sanity check that the head is never empty when trying to remove an entry, but I
didn't make any changes to the code itself.
[1/1] kvm: x86/mmu: Simplify pte_list_{add|remove}
https://github.com/kvm-x86/linux/commit/141705b78381
--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes
On Thu, Mar 23, 2023 at 3:51 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, 13 Jan 2023 20:29:10 +0800, Lai Jiangshan wrote:
> > Simplify pte_list_{add|remove} by ensuring that all non-head pte_list_descs
> > are always full and that addition/removal are always performed on the head.
> >
> > To make pte_list_add() return a count as before, @tail_count is also
> > added to struct pte_list_desc.
> >
> > No visible performance change in tests, but pte_list_add() no longer
> > shows up in perf results for COWed pages, even when the guest forks
> > millions of tasks.
> >
> > [...]
>
> Applied to kvm-x86 mmu, thanks! I added quite a few comments and a BUG_ON() to
> sanity check that the head is never empty when trying to remove an entry, but I
> didn't make any changes to the code itself.
>
> [1/1] kvm: x86/mmu: Simplify pte_list_{add|remove}
> https://github.com/kvm-x86/linux/commit/141705b78381
>
I am not sure if it is possible, but spte_count is now u32 and so is
tail_count. I wonder whether an attacker could trigger the potential
integer overflow, e.g., by creating a huge number of small L1 EPTs that
map many nGPAs to one GPA. I think that could overflow tail_count;
please double check.
spte_count is u32, but it is assigned to a (signed) int j with a BUG_ON(j <
0)? Please don't add more BUG_ON()s in the KVM MMU... and please change
either 'spte_count' to 'int' or 'j' to u32.
In general, please, no BUG_ON(), at least no more BUG_ON()s in our nested MMU...
Please give this one a second thought before merging!
Thanks.
-Mingwei
On Fri, Mar 24, 2023, Mingwei Zhang wrote:
> On Thu, Mar 23, 2023 at 3:51 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, 13 Jan 2023 20:29:10 +0800, Lai Jiangshan wrote:
> > > Simplify pte_list_{add|remove} by ensuring that all non-head pte_list_descs
> > > are always full and that addition/removal are always performed on the head.
> > >
> > > To make pte_list_add() return a count as before, @tail_count is also
> > > added to struct pte_list_desc.
> > >
> > > No visible performance change in tests, but pte_list_add() no longer
> > > shows up in perf results for COWed pages, even when the guest forks
> > > millions of tasks.
> > >
> > > [...]
> >
> > Applied to kvm-x86 mmu, thanks! I added quite a few comments and a BUG_ON() to
> > sanity check that the head is never empty when trying to remove an entry, but I
> > didn't make any changes to the code itself.
> >
> > [1/1] kvm: x86/mmu: Simplify pte_list_{add|remove}
> > https://github.com/kvm-x86/linux/commit/141705b78381
> >
>
> I am not sure if it is possible, but spte_count is now u32 and so is
> tail_count. I wonder whether an attacker could trigger the potential
> integer overflow, e.g., by creating a huge number of small L1 EPTs that
> map many nGPAs to one GPA. I think that could overflow tail_count;
> please double check.
Heh, I had the same reaction and even started reworking the patch, but convinced
myself that there are no new issues as __rmap_add() artificially limits the number
of rmap entries to 1000. But looking again, mmu_page_add_parent_pte() doesn't
have the same safeguard, so in theory, a VM could overflow tail_count.
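For scale, a rough estimate (assuming PTE_LIST_EXT = 14 and the 128-byte
pte_list_desc from this patch): overflowing the u32 tail_count takes more
than 2^32 entries on a single rmap head, i.e. roughly 300 million descriptors
at 128 bytes each, about 39 GiB of pte_list_desc allocations alone, before
counting the shadow pages that hold the parent SPTEs themselves.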
I'll stare at this some more next week; I'd really prefer to avoid taking on any
amount of complexity in KVM to handle something that should never occur in
practice. But being buried deep in __link_shadow_page() makes the error handling
annoying.
> spte_count is u32, but it is assigned to a (signed) int j with a BUG_ON(j <
> 0)? Please don't add more BUG_ON()s in the KVM MMU... and please change
> either 'spte_count' to 'int' or 'j' to u32.
Why? Signed vs. unsigned doesn't change anything. spte_count can't be greater
than PTE_LIST_EXT, so overflow is not a concern. The BUG_ON() is purely to ensure
there's no underflow, i.e. that spte_count > 0, so that the array can be safely
accessed.
> In general, please, no BUG_ON(), at least no more BUG_ON()s in our nested MMU...
In this case, the alternative is worse, as writing "head_desc->sptes[j]" with
"j < 0", i.e. a very large value if j is stored as an unsigned, would write
arbitrary kernel memory.
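A standalone toy (not kernel code) showing the difference:

#include <stdio.h>

int main(void)
{
	unsigned int spte_count = 0;	/* an (impossible) empty head */
	int j = (int)spte_count - 1;	/* -1: BUG_ON(j < 0) can catch it */
	unsigned int uj = spte_count - 1;	/* wraps to 4294967295 */

	/* Indexing sptes[uj] would be a wild out-of-bounds write. */
	printf("signed j = %d, unsigned j = %u\n", j, uj);
	return 0;
}

With a signed index the corruption is detectable before the array access;
with an unsigned index the bogus value just looks like a valid, huge offset.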
I completely agree that the BUG_ON() is gross, but that's an existing issue that
needs to be remedied. And to do that, I really want to resurrect and utilize
the KVM_BUG_ON_DATA_CORRUPTION() idea[*].
[*] https://lore.kernel.org/all/Y5dax8XJV0F5adUw@google.com
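For reference, the list shape the patch below establishes (PTE_LIST_EXT is 14
in the current code; the entry counts here are illustrative): only the head
descriptor may be partially filled, every descriptor behind it is full.

rmap_head --> [head: 3/14 entries, tail_count = 28] --more--> [14/14] --more--> [14/14] --> NULL

With that invariant, pte_list_count() is O(1): it returns tail_count +
spte_count (28 + 3 = 31 in this example) instead of walking the chain.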
@@ -131,10 +131,16 @@ module_param(dbg, bool, 0644);
 struct pte_list_desc {
 	struct pte_list_desc *more;
 	/*
-	 * Stores number of entries stored in the pte_list_desc. No need to be
-	 * u64 but just for easier alignment. When PTE_LIST_EXT, means full.
+	 * @spte_count: Stores number of entries stored in the pte_list_desc.
+	 * When PTE_LIST_EXT, means full. All the non-head pte_list_desc must
+	 * be full.
+	 *
+	 * @tail_count: Stores number of entries stored in its tail descriptors.
+	 *
+	 * No need to be u32, but just for easier alignment.
 	 */
-	u64 spte_count;
+	u32 spte_count;
+	u32 tail_count;
 	u64 *sptes[PTE_LIST_EXT];
 };
@@ -917,22 +923,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
+		desc->tail_count = 0;
 		rmap_head->val = (unsigned long)desc | 1;
 		++count;
 	} else {
 		rmap_printk("%p %llx many->many\n", spte, *spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		while (desc->spte_count == PTE_LIST_EXT) {
-			count += PTE_LIST_EXT;
-			if (!desc->more) {
-				desc->more = kvm_mmu_memory_cache_alloc(cache);
-				desc = desc->more;
-				desc->spte_count = 0;
-				break;
-			}
-			desc = desc->more;
+		count = desc->tail_count + desc->spte_count;
+		/*
+		 * When the head pte_list_desc is full, the whole list must
+		 * be full since all the non-head pte_list_desc are full.
+		 * So just allocate a new head.
+		 */
+		if (desc->spte_count == PTE_LIST_EXT) {
+			desc = kvm_mmu_memory_cache_alloc(cache);
+			desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+			desc->spte_count = 0;
+			desc->tail_count = count;
+			rmap_head->val = (unsigned long)desc | 1;
 		}
-		count += desc->spte_count;
 		desc->sptes[desc->spte_count++] = spte;
 	}
 	return count;
@@ -940,30 +949,30 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 static void
 pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
-			   struct pte_list_desc *desc, int i,
-			   struct pte_list_desc *prev_desc)
+			   struct pte_list_desc *desc, int i)
 {
-	int j = desc->spte_count - 1;
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	int j = head_desc->spte_count - 1;
 
-	desc->sptes[i] = desc->sptes[j];
-	desc->sptes[j] = NULL;
-	desc->spte_count--;
-	if (desc->spte_count)
+	/*
+	 * Grab an entry from the head pte_list_desc to ensure that
+	 * the non-head pte_list_desc are full.
+	 */
+	desc->sptes[i] = head_desc->sptes[j];
+	head_desc->sptes[j] = NULL;
+	head_desc->spte_count--;
+	if (head_desc->spte_count)
 		return;
-	if (!prev_desc && !desc->more)
+	if (!head_desc->more)
 		rmap_head->val = 0;
 	else
-		if (prev_desc)
-			prev_desc->more = desc->more;
-		else
-			rmap_head->val = (unsigned long)desc->more | 1;
-	mmu_free_pte_list_desc(desc);
+		rmap_head->val = (unsigned long)head_desc->more | 1;
+	mmu_free_pte_list_desc(head_desc);
 }
 
 static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	struct pte_list_desc *prev_desc;
 	int i;
 
 	if (!rmap_head->val) {
@@ -979,16 +988,13 @@ static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
 	} else {
 		rmap_printk("%p many->many\n", spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		prev_desc = NULL;
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(rmap_head,
-							desc, i, prev_desc);
+					pte_list_desc_remove_entry(rmap_head, desc, i);
 					return;
 				}
 			}
-			prev_desc = desc;
 			desc = desc->more;
 		}
 		pr_err("%s: %p many->many\n", __func__, spte);
@@ -1035,7 +1041,6 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	unsigned int count = 0;
 
 	if (!rmap_head->val)
 		return 0;
@@ -1043,13 +1048,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 		return 1;
 
 	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-
-	while (desc) {
-		count += desc->spte_count;
-		desc = desc->more;
-	}
-
-	return count;
+	return desc->tail_count + desc->spte_count;
 }
 
 static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,