[v9,02/10] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()

Message ID 20231207161211.2374093-3-ryan.roberts@arm.com
State New
Series Multi-size THP for anonymous memory

Commit Message

Ryan Roberts Dec. 7, 2023, 4:12 p.m. UTC
  In preparation for supporting anonymous multi-size THP, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
passed to it. In this case, all contained pages are accounted using the
order-0 folio (or base page) scheme.
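
For example, a caller could then do something like the following
(illustrative sketch only, not part of this series; addr is assumed to be
aligned to the folio size):

	struct folio *folio;

	/*
	 * Illustration: fault in an order-2 anonymous folio (4 pages).
	 * folio_add_new_anon_rmap() now accounts each of the 4 pages like
	 * an order-0 page instead of treating the folio as a PMD-mapped THP.
	 */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 2, vma, addr, true);
	if (folio) {
		__folio_mark_uptodate(folio);
		folio_add_new_anon_rmap(folio, vma, addr);
		folio_add_lru_vma(folio, vma);
		/* ...then map the pages, e.g. with set_ptes()... */
	}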

Reviewed-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <v-songbaohua@oppo.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/rmap.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
  

Comments

David Hildenbrand Jan. 14, 2024, 5:33 p.m. UTC | #1
On 13.01.24 23:42, Jiri Olsa wrote:
> On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote:
>> In preparation for supporting anonymous multi-size THP, improve
>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>> passed to it. In this case, all contained pages are accounted using the
>> order-0 folio (or base page) scheme.
>>
>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   mm/rmap.c | 28 ++++++++++++++++++++--------
>>   1 file changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2a1e45e6419f..846fc79f3ca9 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>>    * This means the inc-and-test can be bypassed.
>>    * The folio does not have to be locked.
>>    *
>> - * If the folio is large, it is accounted as a THP.  As the folio
>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>    * is new, it's assumed to be mapped exclusively by a single process.
>>    */
>>   void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>   		unsigned long address)
>>   {
>> -	int nr;
>> +	int nr = folio_nr_pages(folio);
>>   
>> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> +	VM_BUG_ON_VMA(address < vma->vm_start ||
>> +			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> 
> hi,
> I'm hitting this bug (console output below) with adding uprobe
> on simple program like:
> 
>    $ cat up.c
>    int main(void)
>    {
>       return 0;
>    }
> 
>    # bpftrace -e 'uprobe:/home/jolsa/up:_start {}'
> 
>    $ ./up
> 
> it's on top of current linus tree master:
>    052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
> 
> before this patch it seems to work, I can send my .config if needed

bpf only inserts a small folio, so no magic there.

It was:
	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
And now it is
	VM_BUG_ON_VMA(address < vma->vm_start || address + (nr << PAGE_SHIFT) > vma->vm_end, vma);

I think this change is sane, as long as the address is aligned to full pages
(which it better should be).

Staring at uprobe_write_opcode, likely vaddr isn't aligned ...

Likely (hopefully) that is not an issue for __folio_set_anon(), because linear_page_index()
will mask these bits off.
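
To illustrate with made-up numbers (4K pages and an order-0 folio, so
nr == 1; the addresses are purely hypothetical):

	unsigned long vm_end = 0x403000;	/* VMA ends on this page boundary   */
	unsigned long vaddr  = 0x402004;	/* probed address, not page-aligned */

	/* old check: vaddr >= vm_end                     -> 0x402004 >= 0x403000, false, passes */
	/* new check: vaddr + (1 << PAGE_SHIFT) > vm_end  -> 0x403004 >  0x403000, true,  BUG    */
	/* aligned:   (vaddr & PAGE_MASK) + PAGE_SIZE     -> 0x403000 >  0x403000, false, passes */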


Would the following change fix it for you?

 From c640a8363e47bc96965a35115a040b5f876c4320 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Sun, 14 Jan 2024 18:32:57 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  kernel/events/uprobes.c | 2 +-
  mm/rmap.c               | 1 +
  2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 485bb0389b488..929e98c629652 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -537,7 +537,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
  		}
  	}
  
-	ret = __replace_page(vma, vaddr, old_page, new_page);
+	ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page);
  	if (new_page)
  		put_page(new_page);
  put_old:
diff --git a/mm/rmap.c b/mm/rmap.c
index f5d43edad529a..a903db4df6b97 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1408,6 +1408,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
  {
  	int nr = folio_nr_pages(folio);
  
+	VM_WARN_ON_FOLIO(!IS_ALIGNED(address, PAGE_SIZE), folio);
  	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  	VM_BUG_ON_VMA(address < vma->vm_start ||
  			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
  
David Hildenbrand Jan. 15, 2024, 9:38 a.m. UTC | #2
>>> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
>>> index 485bb0389b488..929e98c629652 100644
>>> --- a/kernel/events/uprobes.c
>>> +++ b/kernel/events/uprobes.c
>>> @@ -537,7 +537,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>>   		}
>>>   	}
>>> -	ret = __replace_page(vma, vaddr, old_page, new_page);
>>> +	ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page);
>>>   	if (new_page)
>>>   		put_page(new_page);
>>>   put_old:
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index f5d43edad529a..a903db4df6b97 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1408,6 +1408,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>>   {
>>>   	int nr = folio_nr_pages(folio);
>>> +	VM_WARN_ON_FOLIO(!IS_ALIGNED(address, PAGE_SIZE), folio);
> 
> nit: Is it worth also adding this to __folio_add_anon_rmap() so that
> folio_add_anon_rmap_ptes() and folio_add_anon_rmap_pmd() also benefit?
> 

Yes, same thoughts. I just included it here so we would catch it if
something still goes wrong.

I'll split that change out either way.


> Regardless:
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

Thanks!
  
Sven Schnelle Jan. 24, 2024, 11:15 a.m. UTC | #3
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 14/01/2024 20:55, Jiri Olsa wrote:
>> On Sun, Jan 14, 2024 at 06:33:56PM +0100, David Hildenbrand wrote:
>>> On 13.01.24 23:42, Jiri Olsa wrote:
>>>> On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote:
>>>>> In preparation for supporting anonymous multi-size THP, improve
>>>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>>>>> passed to it. In this case, all contained pages are accounted using the
>>>>> order-0 folio (or base page) scheme.
>>>>>
>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>>>> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>   mm/rmap.c | 28 ++++++++++++++++++++--------
>>>>>   1 file changed, 20 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2a1e45e6419f..846fc79f3ca9 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>>>>>    * This means the inc-and-test can be bypassed.
>>>>>    * The folio does not have to be locked.
>>>>>    *
>>>>> - * If the folio is large, it is accounted as a THP.  As the folio
>>>>> + * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
>>>>>    * is new, it's assumed to be mapped exclusively by a single process.
>>>>>    */
>>>>>   void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>>>>   		unsigned long address)
>>>>>   {
>>>>> -	int nr;
>>>>> +	int nr = folio_nr_pages(folio);
>>>>> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>>>>> +	VM_BUG_ON_VMA(address < vma->vm_start ||
>>>>> +			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>>>>
>>>> hi,
>>>> I'm hitting this bug (console output below) with adding uprobe
>>>> on simple program like:
>>>>
>>>>    $ cat up.c
>>>>    int main(void)
>>>>    {
>>>>       return 0;
>>>>    }
>>>>
>>>>    # bpftrace -e 'uprobe:/home/jolsa/up:_start {}'
>>>>
>>>>    $ ./up
>>>>
>>>> it's on top of current linus tree master:
>>>>    052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
>>>>
>>>> before this patch it seems to work, I can send my .config if needed
>
> Thanks for the bug report!

I just hit the same bug in our CI, but can't find the fix in -next. Is
this in the queue somewhere?

Thanks
Sven
  
Sven Schnelle Jan. 24, 2024, 12:28 p.m. UTC | #4
Hi Ryan,

Ryan Roberts <ryan.roberts@arm.com> writes:

>>>>>>>>> I'm hitting this bug (console output below) with adding uprobe
>>>>>>>>> on simple program like:
>>>>>>>>>
>>>>>>>>>    $ cat up.c
>>>>>>>>>    int main(void)
>>>>>>>>>    {
>>>>>>>>>       return 0;
>>>>>>>>>    }
>>>>>>>>>
>>>>>>>>>    # bpftrace -e 'uprobe:/home/jolsa/up:_start {}'
>>>>>>>>>
>>>>>>>>>    $ ./up
>>>>>>>>>
>>>>>>>>> it's on top of current linus tree master:
>>>>>>>>>    052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
>>>>>>>>>
>>>>>>>>> before this patch it seems to work, I can send my .config if needed
>>>>>>
>>>>>> Thanks for the bug report!
>>>>>
>>>>> I just hit the same bug in our CI, but can't find the fix in -next. Is
>>>>> this in the queue somewhere?
>>>>
>>>> we hit it as well, but I can see the fix in linux-next/master
>>>>
>>>>   4c137bc28064 uprobes: use pagesize-aligned virtual address when replacing pages
>>>
>>> Yes that's the one. Just to confirm: you are still hitting the VM_BUG_ON despite
>>> having this change in your kernel? Could you please send over the full bug log?
>> 
>> ah sorry.. I meant the change fixes the problem for us, it just did not
>> yet propagate through the merge cycle into bpf trees.. but I can see it
>> in the linux-next tree, so it's probably just a matter of time
>
> OK great! How about you, Sven? Do you have this change in your kernel? Hopefully
> it should fix your problem.

Same here - the fix makes uprobes work again; I just didn't see it in
torvalds-master or in today's linux-next. But Jiri is right, it's in
linux-next/master - I just missed it there. So everything should be ok.
  

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 2a1e45e6419f..846fc79f3ca9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1335,32 +1335,44 @@  void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
  * This means the inc-and-test can be bypassed.
  * The folio does not have to be locked.
  *
- * If the folio is large, it is accounted as a THP.  As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP.  As the folio
  * is new, it's assumed to be mapped exclusively by a single process.
  */
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address)
 {
-	int nr;
+	int nr = folio_nr_pages(folio);
 
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
 	__folio_set_swapbacked(folio);
+	__folio_set_anon(folio, vma, address, true);
 
-	if (likely(!folio_test_pmd_mappable(folio))) {
+	if (likely(!folio_test_large(folio))) {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_mapcount, 0);
-		nr = 1;
+		SetPageAnonExclusive(&folio->page);
+	} else if (!folio_test_pmd_mappable(folio)) {
+		int i;
+
+		for (i = 0; i < nr; i++) {
+			struct page *page = folio_page(folio, i);
+
+			/* increment count (starts at -1) */
+			atomic_set(&page->_mapcount, 0);
+			SetPageAnonExclusive(page);
+		}
+
+		atomic_set(&folio->_nr_pages_mapped, nr);
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-		nr = folio_nr_pages(folio);
+		SetPageAnonExclusive(&folio->page);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
 	}
 
 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-	__folio_set_anon(folio, vma, address, true);
-	SetPageAnonExclusive(&folio->page);
 }
 
 /**