mm: fix hugetlb page unmap count balance issue

Message ID 20230512072036.1027784-1-junxiao.chang@intel.com
State New
Series mm: fix hugetlb page unmap count balance issue

Commit Message

Chang, Junxiao May 12, 2023, 7:20 a.m. UTC
A hugetlb page is usually mapped with a PMD, but occasionally it may be
mapped with PTEs. QEMU can use udmabuf to create host dma-bufs for guest
framebuffers. When QEMU is launched with the parameter "hugetlb=on", the
udmabuf driver maps hugetlb pages with PTEs in its page fault handler.
The call chain looks like:

page_add_file_rmap
do_set_pte
finish_fault
__do_fault -> udmabuf_vm_fault, it maps hugetlb page here.
do_read_fault

In page_add_file_rmap(), compound is false since it is a PTE mapping.

When QEMU exits and the page is unmapped in page_remove_rmap(), the
hugetlb page should not be handled as a PMD mapping.

This change checks the compound parameter as well as the hugetlb flag. It
fixes the kernel BUG below, reproduced with a 6.3 kernel:

[  114.027754] BUG: Bad page cache in process qemu-system-x86  pfn:37aa00
[  114.034288] page:000000000dd2153b refcount:514 mapcount:-4 mapping:000000004b01ca30 index:0x13800 pfn:0x37aa00
[  114.044277] head:000000000dd2153b order:9 entire_mapcount:-4 nr_pages_mapped:4 pincount:512
[  114.052623] aops:hugetlbfs_aops ino:6f93
[  114.056552] flags: 0x17ffffc0010001(locked|head|node=0|zone=2|lastcpupid=0x1fffff)
[  114.064115] raw: 0017ffffc0010001 fffff7338deb0008 fffff7338dea0008 ffff98dc855ea870
[  114.071847] raw: 000000000000009c 0000000000000002 00000202ffffffff 0000000000000000
[  114.079572] page dumped because: still mapped when deleted
[  114.085048] CPU: 0 PID: 3122 Comm: qemu-system-x86 Tainted: G    BU  W   E      6.3.0-v3+ #62
[  114.093566] Hardware name: Intel Corporation Alder Lake Client Platform DDR5 SODIMM SBS RVP, BIOS ADLPFWI1.R00.3084.D89.2303211034 03/21/2023
[  114.106839] Call Trace:
[  114.109291]  <TASK>
[  114.111405]  dump_stack_lvl+0x4c/0x70
[  114.115073]  dump_stack+0x14/0x20
[  114.118395]  filemap_unaccount_folio+0x159/0x220
[  114.123021]  filemap_remove_folio+0x54/0x110
[  114.127295]  remove_inode_hugepages+0x111/0x5b0
[  114.131834]  hugetlbfs_evict_inode+0x23/0x50
[  114.136111]  evict+0xcd/0x1e0
[  114.139083]  iput.part.0+0x183/0x1e0
[  114.142663]  iput+0x20/0x30
[  114.145466]  dentry_unlink_inode+0xcc/0x130
[  114.149655]  __dentry_kill+0xec/0x1a0
[  114.153325]  dput+0x1ca/0x3c0
[  114.156293]  __fput+0xf4/0x280
[  114.159357]  ____fput+0x12/0x20
[  114.162502]  task_work_run+0x62/0xa0
[  114.166088]  do_exit+0x352/0xae0
[  114.169321]  do_group_exit+0x39/0x90
[  114.172892]  get_signal+0xa09/0xa30
[  114.176391]  arch_do_signal_or_restart+0x33/0x280
[  114.181098]  exit_to_user_mode_prepare+0x11f/0x190
[  114.185893]  syscall_exit_to_user_mode+0x2a/0x50
[  114.190509]  do_syscall_64+0x4c/0x90
[  114.194095]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
---
 mm/rmap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
  

Comments

Andrew Morton May 12, 2023, 9:03 p.m. UTC | #1
On Fri, 12 May 2023 15:20:36 +0800 Junxiao Chang <junxiao.chang@intel.com> wrote:

> hugetlb page usually is mapped with pmd, but occasionally it might be
> mapped with pte. QEMU can use udma-buf to create host dmabufs for guest
> framebuffers. When QEMU is launched with parameter "hugetlb=on",
> udmabuffer driver maps hugetlb page with pte in page fault handler.

Are there any other situations in which a hugetlb page is mapped in
this fashion?

If not, can QEMU be changed to map with a pmd?

So we get one less weird special case in MM.
  
James Houghton May 12, 2023, 9:26 p.m. UTC | #2
On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
>
> hugetlb page usually is mapped with pmd, but occasionally it might be
> mapped with pte. QEMU can use udma-buf to create host dmabufs for guest
> framebuffers. When QEMU is launched with parameter "hugetlb=on",
> udmabuffer driver maps hugetlb page with pte in page fault handler.
> Call chain looks like:
>
> page_add_file_rmap
> do_set_pte
> finish_fault
> __do_fault -> udmabuf_vm_fault, it maps hugetlb page here.
> do_read_fault
>
> In function page_add_file_rmap, compound is false since it is pte mapping.
>
> When qemu exits and page is unmapped in function page_remove_rmap, the
> hugetlb page should not be handled in pmd way.
>
> This change is to check compound parameter as well as hugetlb flag. It
> fixes below kernel bug which is reproduced with 6.3 kernel:
>
> [  114.027754] BUG: Bad page cache in process qemu-system-x86  pfn:37aa00
> [  114.034288] page:000000000dd2153b refcount:514 mapcount:-4 mapping:000000004b01ca30 index:0x13800 pfn:0x37aa00
> [  114.044277] head:000000000dd2153b order:9 entire_mapcount:-4 nr_pages_mapped:4 pincount:512
> [  114.052623] aops:hugetlbfs_aops ino:6f93
> [  114.056552] flags: 0x17ffffc0010001(locked|head|node=0|zone=2|lastcpupid=0x1fffff)
> [  114.064115] raw: 0017ffffc0010001 fffff7338deb0008 fffff7338dea0008 ffff98dc855ea870
> [  114.071847] raw: 000000000000009c 0000000000000002 00000202ffffffff 0000000000000000
> [  114.079572] page dumped because: still mapped when deleted
> [  114.085048] CPU: 0 PID: 3122 Comm: qemu-system-x86 Tainted: G    BU  W   E      6.3.0-v3+ #62
> [  114.093566] Hardware name: Intel Corporation Alder Lake Client Platform DDR5 SODIMM SBS RVP, BIOS ADLPFWI1.R00.3084.D89.2303211034 03/21/2023
> [  114.106839] Call Trace:
> [  114.109291]  <TASK>
> [  114.111405]  dump_stack_lvl+0x4c/0x70
> [  114.115073]  dump_stack+0x14/0x20
> [  114.118395]  filemap_unaccount_folio+0x159/0x220
> [  114.123021]  filemap_remove_folio+0x54/0x110
> [  114.127295]  remove_inode_hugepages+0x111/0x5b0
> [  114.131834]  hugetlbfs_evict_inode+0x23/0x50
> [  114.136111]  evict+0xcd/0x1e0
> [  114.139083]  iput.part.0+0x183/0x1e0
> [  114.142663]  iput+0x20/0x30
> [  114.145466]  dentry_unlink_inode+0xcc/0x130
> [  114.149655]  __dentry_kill+0xec/0x1a0
> [  114.153325]  dput+0x1ca/0x3c0
> [  114.156293]  __fput+0xf4/0x280
> [  114.159357]  ____fput+0x12/0x20
> [  114.162502]  task_work_run+0x62/0xa0
> [  114.166088]  do_exit+0x352/0xae0
> [  114.169321]  do_group_exit+0x39/0x90
> [  114.172892]  get_signal+0xa09/0xa30
> [  114.176391]  arch_do_signal_or_restart+0x33/0x280
> [  114.181098]  exit_to_user_mode_prepare+0x11f/0x190
> [  114.185893]  syscall_exit_to_user_mode+0x2a/0x50
> [  114.190509]  do_syscall_64+0x4c/0x90
> [  114.194095]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
> Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
> ---
>  mm/rmap.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 19392e090bec6..b42fc0389c243 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1377,9 +1377,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>
>         VM_BUG_ON_PAGE(compound && !PageHead(page), page);
>
> -       /* Hugetlb pages are not counted in NR_*MAPPED */
> -       if (unlikely(folio_test_hugetlb(folio))) {
> -               /* hugetlb pages are always mapped with pmds */
> +       /* Hugetlb pages usually are not counted in NR_*MAPPED */
> +       if (unlikely(folio_test_hugetlb(folio) && compound)) {
> +               /* hugetlb pages are mapped with pmds */
>                 atomic_dec(&folio->_entire_mapcount);
>                 return;
>         }

This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
need something like [1]. I can resend it if that's what we should be
doing, but this mapcounting scheme doesn't work when the page structs
have been freed.

It seems like it was a mistake to include support for hugetlb memfds in udmabuf.

[1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/

- James
  
Mike Kravetz May 12, 2023, 11:29 p.m. UTC | #3
On 05/12/23 14:26, James Houghton wrote:
> On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
> 
> This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
> need something like [1]. I can resend it if that's what we should be
> doing, but this mapcounting scheme doesn't work when the page structs
> have been freed.
> 
> It seems like it was a mistake to include support for hugetlb memfds in udmabuf.

IIUC, it was added with commit 16c243e99d33 ("udmabuf: Add support for mapping
hugepages (v4)").  Looks like it was never sent to linux-mm?  That is unfortunate,
as hugetlb vmemmap freeing went in at about the same time.  And, as you have
noted, udmabuf will not work if hugetlb vmemmap freeing is enabled.

Sigh!

Trying to think of a way forward.
  
Chang, Junxiao May 15, 2023, 12:08 a.m. UTC | #4
Thank you for reviewing it! We only reproduced this hugetlb mapping issue with a QEMU command that opens the udmabuf driver (drivers/dma-buf/udmabuf.c).
I agree with you; it would be better to map the huge page with a PMD instead of a PTE. 😊

-----Original Message-----
From: Andrew Morton <akpm@linux-foundation.org> 
Sent: Saturday, May 13, 2023 5:04 AM
To: Chang, Junxiao <junxiao.chang@intel.com>
Cc: kirill.shutemov@linux.intel.com; Hocko, Michal <mhocko@suse.com>; jmarchan@redhat.com; linux-mm@kvack.org; linux-kernel@vger.kernel.org; mike.kravetz@oracle.com; muchun.song@linux.dev
Subject: Re: [PATCH] mm: fix hugetlb page unmap count balance issue

On Fri, 12 May 2023 15:20:36 +0800 Junxiao Chang <junxiao.chang@intel.com> wrote:

> hugetlb page usually is mapped with pmd, but occasionally it might be 
> mapped with pte. QEMU can use udma-buf to create host dmabufs for 
> guest framebuffers. When QEMU is launched with parameter "hugetlb=on", 
> udmabuffer driver maps hugetlb page with pte in page fault handler.

Are there any other situations in which a hugetlb page is mapped in this fashion?

If not, can QEMU be changed to map with a pmd?

So we get one less weird special case in MM.
  
Chang, Junxiao May 15, 2023, 12:44 a.m. UTC | #5
With James' patch, might the udmabuf driver hit this VM_BUG_ON? Its compound is false and folio_test_hugetlb(folio) is true:
	if (likely(!compound)) {
+		if (unlikely(folio_test_hugetlb(folio)))
+			VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
+				       page);

Would it be better to notify the udmabuf driver owner before merging James's https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/?

Thanks,
Junxiao

-----Original Message-----
From: Mike Kravetz <mike.kravetz@oracle.com> 
Sent: Saturday, May 13, 2023 7:30 AM
To: James Houghton <jthoughton@google.com>
Cc: Chang, Junxiao <junxiao.chang@intel.com>; akpm@linux-foundation.org; kirill.shutemov@linux.intel.com; Hocko, Michal <mhocko@suse.com>; jmarchan@redhat.com; linux-mm@kvack.org; linux-kernel@vger.kernel.org; muchun.song@linux.dev
Subject: Re: [PATCH] mm: fix hugetlb page unmap count balance issue

On 05/12/23 14:26, James Houghton wrote:
> On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
> 
> This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You 
> need something like [1]. I can resend it if that's what we should be 
> doing, but this mapcounting scheme doesn't work when the page structs 
> have been freed.
> 
> It seems like it was a mistake to include support for hugetlb memfds in udmabuf.

IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for mapping hugepages (v4).  Looks like it was never sent to linux-mm?  That is unfortunate as hugetlb vmemmap freeing went in at about the same time.  And, as you have noted udmabuf will not work if hugetlb vmemmap freeing is enabled.

Sigh!

Trying to think of a way forward.
--
Mike Kravetz

> 
> [1]: 
> https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@g
> oogle.com/
> 
> - James
  
Mike Kravetz May 15, 2023, 5:04 p.m. UTC | #6
On 05/12/23 16:29, Mike Kravetz wrote:
> On 05/12/23 14:26, James Houghton wrote:
> > On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
> > 
> > This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
> > need something like [1]. I can resend it if that's what we should be
> > doing, but this mapcounting scheme doesn't work when the page structs
> > have been freed.
> > 
> > It seems like it was a mistake to include support for hugetlb memfds in udmabuf.
> 
> IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for mapping
> hugepages (v4).  Looks like it was never sent to linux-mm?  That is unfortunate
> as hugetlb vmemmap freeing went in at about the same time.  And, as you have
> noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
> 
> Sigh!
> 
> Trying to think of a way forward.
> -- 
> Mike Kravetz
> 
> > 
> > [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/
> > 
> > - James

Adding people and list on Cc: involved with commit 16c243e99d33.

There are several issues with trying to map tail pages of hugetlb pages
that are not taken into account by udmabuf.  James spent quite a bit of time
trying to understand and address all the issues with the HGM code.  While
using the scheme proposed by James may be an approach to the mapcount issue,
there are also other issues that need attention.  For example, I do not see
how the fault code checks the state of the hugetlb page (such as poison), as
none of that state is carried in tail pages.

The more I think about it, the more I think udmabuf should treat hugetlb
pages as hugetlb pages.  They should be mapped at the appropriate level
in the page table.  Of course, this would impose new restrictions on the
API (mmap and ioctl) that may break existing users.  I have no idea how
extensively udmabuf is being used with hugetlb mappings.
  
Mike Kravetz May 16, 2023, 10:34 p.m. UTC | #7
On 05/15/23 10:04, Mike Kravetz wrote:
> On 05/12/23 16:29, Mike Kravetz wrote:
> > On 05/12/23 14:26, James Houghton wrote:
> > > On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
> > > 
> > > This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
> > > need something like [1]. I can resend it if that's what we should be
> > > doing, but this mapcounting scheme doesn't work when the page structs
> > > have been freed.
> > > 
> > > It seems like it was a mistake to include support for hugetlb memfds in udmabuf.
> > 
> > IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for mapping
> > hugepages (v4).  Looks like it was never sent to linux-mm?  That is unfortunate
> > as hugetlb vmemmap freeing went in at about the same time.  And, as you have
> > noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
> > 
> > Sigh!
> > 
> > Trying to think of a way forward.
> > -- 
> > Mike Kravetz
> > 
> > > 
> > > [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/
> > > 
> > > - James
> 
> Adding people and list on Cc: involved with commit 16c243e99d33.
> 
> There are several issues with trying to map tail pages of hugetllb pages
> not taken into account with udmabuf.  James spent quite a bit of time trying
> to understand and address all the issues with the HGM code.  While using
> the scheme proposed by James, may be an approach to the mapcount issue there
> are also other issues that need attention.  For example, I do not see how
> the fault code checks the state of the hugetlb page (such as poison) as none
> of that state is carried in tail pages.
> 
> The more I think about it, the more I think udmabuf should treat hugetlb
> pages as hugetlb pages.  They should be mapped at the appropriate level
> in the page table.  Of course, this would impose new restrictions on the
> API (mmap and ioctl) that may break existing users.  I have no idea how
> extensively udmabuf is being used with hugetlb mappings.

Verified that using udmabuf on a hugetlb mapping with vmemmap optimization will
BUG as:

[14106.812312] BUG: unable to handle page fault for address: ffffea000a7c4030
[14106.813704] #PF: supervisor write access in kernel mode
[14106.814791] #PF: error_code(0x0003) - permissions violation
[14106.815921] PGD 27fff9067 P4D 27fff9067 PUD 27fff8067 PMD 17ec34067 PTE 8000000285dab021
[14106.818489] Oops: 0003 [#1] PREEMPT SMP PTI
[14106.819345] CPU: 2 PID: 2313 Comm: udmabuf Not tainted 6.4.0-rc1-next-20230508+ #44
[14106.820906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[14106.822679] RIP: 0010:page_add_file_rmap+0x2e/0x270

I started looking more closely at the driver and I do not fully understand the
usage model.  I took clues from the selftest and driver.  It seems the first
step is to create a buffer via the UDMABUF_CREATE ioctl.  This will copy 4K
pages from the page cache to an array associated with a file.  I did note that
hugetlb and shm behavior is different here as the driver can not add missing
hugetlb pages to the cache as it does with shm.  However, what seems more
concerning is that there is nothing to prevent the pages from being replaced
in the cache before being added to a udmabuf mapping.  This means udmabuf
mapping and original memfd could be operating on a different set of pages.
Is this acceptable, or somehow prevented?

In my role, I am more interested in udmabuf handling of hugetlb pages.
Trying to use individual 4K pages of hugetlb pages is something that
should be avoided here.  Would it be acceptable to change code so that
only whole hugetlb pages are used by udmabuf?  If not, then perhaps the
existing hugetlb support can be removed?
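
[For readers unfamiliar with the driver, here is a minimal userspace sketch of
that first step, loosely following the selftest. Paths, sizes, names and error
handling are illustrative assumptions, not the exact selftest code.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int main(void)
{
	size_t size = 2UL << 20;			/* one 2MB chunk */
	int devfd = open("/dev/udmabuf", O_RDWR);

	/* QEMU would back guest RAM with a hugetlb memfd (MFD_HUGETLB); a
	 * plain memfd keeps the sketch simple. F_SEAL_SHRINK is required
	 * by the driver before it accepts the memfd. */
	int memfd = memfd_create("udmabuf-sketch", MFD_ALLOW_SEALING);

	ftruncate(memfd, size);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	struct udmabuf_create create = {
		.memfd  = memfd,
		.flags  = UDMABUF_FLAGS_CLOEXEC,
		.offset = 0,
		.size   = size,
	};
	int buf = ioctl(devfd, UDMABUF_CREATE, &create);
	if (buf < 0)
		perror("UDMABUF_CREATE");

	/* 'buf' is a dma-buf fd; mmap()ing it is what eventually reaches
	 * udmabuf_vm_fault() in the call chain from the commit message. */
	close(buf);
	close(memfd);
	close(devfd);
	return 0;
}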
  
Andrew Morton June 7, 2023, 7:03 p.m. UTC | #8
On Tue, 16 May 2023 15:34:40 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 05/15/23 10:04, Mike Kravetz wrote:
> > On 05/12/23 16:29, Mike Kravetz wrote:
> > > On 05/12/23 14:26, James Houghton wrote:
> > > > On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
> > > > 
> > > > This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
> > > > need something like [1]. I can resend it if that's what we should be
> > > > doing, but this mapcounting scheme doesn't work when the page structs
> > > > have been freed.
> > > > 
> > > > It seems like it was a mistake to include support for hugetlb memfds in udmabuf.
> > > 
> > > IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for mapping
> > > hugepages (v4).  Looks like it was never sent to linux-mm?  That is unfortunate
> > > as hugetlb vmemmap freeing went in at about the same time.  And, as you have
> > > noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
> > > 
> > > Sigh!
> > > 
> > > Trying to think of a way forward.
> > > -- 
> > > Mike Kravetz
> > > 
> > > > 
> > > > [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/
> > > > 
> > > > - James
> > 
> > Adding people and list on Cc: involved with commit 16c243e99d33.
> > 
> > There are several issues with trying to map tail pages of hugetllb pages
> > not taken into account with udmabuf.  James spent quite a bit of time trying
> > to understand and address all the issues with the HGM code.  While using
> > the scheme proposed by James, may be an approach to the mapcount issue there
> > are also other issues that need attention.  For example, I do not see how
> > the fault code checks the state of the hugetlb page (such as poison) as none
> > of that state is carried in tail pages.
> > 
> > The more I think about it, the more I think udmabuf should treat hugetlb
> > pages as hugetlb pages.  They should be mapped at the appropriate level
> > in the page table.  Of course, this would impose new restrictions on the
> > API (mmap and ioctl) that may break existing users.  I have no idea how
> > extensively udmabuf is being used with hugetlb mappings.
> 
> Verified that using udmabug on a hugetlb mapping with vmemmap optimization will
> BUG as:

BUGs aren't good.  Can we please find a way to push this along?

Have we heard anything from any udmabuf people?
  
David Hildenbrand June 7, 2023, 7:27 p.m. UTC | #9
On 17.05.23 00:34, Mike Kravetz wrote:
> On 05/15/23 10:04, Mike Kravetz wrote:
>> On 05/12/23 16:29, Mike Kravetz wrote:
>>> On 05/12/23 14:26, James Houghton wrote:
>>>> On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
>>>>
>>>> This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
>>>> need something like [1]. I can resend it if that's what we should be
>>>> doing, but this mapcounting scheme doesn't work when the page structs
>>>> have been freed.
>>>>
>>>> It seems like it was a mistake to include support for hugetlb memfds in udmabuf.
>>>
>>> IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for mapping
>>> hugepages (v4).  Looks like it was never sent to linux-mm?  That is unfortunate
>>> as hugetlb vmemmap freeing went in at about the same time.  And, as you have
>>> noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
>>>
>>> Sigh!
>>>
>>> Trying to think of a way forward.
>>> -- 
>>> Mike Kravetz
>>>
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/
>>>>
>>>> - James
>>
>> Adding people and list on Cc: involved with commit 16c243e99d33.
>>
>> There are several issues with trying to map tail pages of hugetllb pages
>> not taken into account with udmabuf.  James spent quite a bit of time trying
>> to understand and address all the issues with the HGM code.  While using
>> the scheme proposed by James, may be an approach to the mapcount issue there
>> are also other issues that need attention.  For example, I do not see how
>> the fault code checks the state of the hugetlb page (such as poison) as none
>> of that state is carried in tail pages.
>>
>> The more I think about it, the more I think udmabuf should treat hugetlb
>> pages as hugetlb pages.  They should be mapped at the appropriate level
>> in the page table.  Of course, this would impose new restrictions on the
>> API (mmap and ioctl) that may break existing users.  I have no idea how
>> extensively udmabuf is being used with hugetlb mappings.
> 
> Verified that using udmabug on a hugetlb mapping with vmemmap optimization will
> BUG as:
> 
> [14106.812312] BUG: unable to handle page fault for address: ffffea000a7c4030
> [14106.813704] #PF: supervisor write access in kernel mode
> [14106.814791] #PF: error_code(0x0003) - permissions violation
> [14106.815921] PGD 27fff9067 P4D 27fff9067 PUD 27fff8067 PMD 17ec34067 PTE 8000000285dab021
> [14106.818489] Oops: 0003 [#1] PREEMPT SMP PTI
> [14106.819345] CPU: 2 PID: 2313 Comm: udmabuf Not tainted 6.4.0-rc1-next-20230508+ #44
> [14106.820906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [14106.822679] RIP: 0010:page_add_file_rmap+0x2e/0x270
> 
> I started looking more closely at the driver and I do not fully understand the
> usage model.  I took clues from the selftest and driver.  It seems the first
> step is to create a buffer via the UDMABUF_CREATE ioctl.  This will copy 4K
> pages from the page cache to an array associated with a file.  I did note that
> hugetlb and shm behavior is different here as the driver can not add missing
> hugetlb pages to the cache as it does with shm.  However, what seems more
> concerning is that there is nothing to prevent the pages from being replaced
> in the cache before being added to a udmabuf mapping.  This means udmabuf
> mapping and original memfd could be operating on a different set of pages.
> Is this acceptable, or somehow prevented?
> 
> In my role, I am more interested in udmabuf handling of hugetlb pages.
> Trying to use individual 4K pages of hugetlb pages is something that
> should be avoided here.  Would it be acceptable to change code so that
> only whole hugetlb pages are used by udmabuf?  If not, then perhaps the
> existing hugetlb support can be removed?

I'm wondering if that VMA shouldn't be some kind of special mapping 
(VM_PFNMAP), such that the struct page is entirely ignored?

I'm quite confused and concerned when I read that code (what the hell is 
it doing with shmem/hugetlb pages? why does the mapcount even get adjusted?)

This all has a bad smell to it, I hope I'm missing something important ...
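
[For illustration only, a PFNMAP-style fault path might look roughly like the
sketch below. This is a sketch of the suggestion, not a proposal: the
udmabuf_vm_fault_pfn/udmabuf_mmap_pfn names are made up, and ubuf->pages /
ubuf->pagecount are assumed to be the arrays the driver already keeps.]

/*
 * Hypothetical sketch: insert raw PFNs instead of struct pages so that
 * rmap/mapcount never see the hugetlb subpages at all.
 */
static vm_fault_t udmabuf_vm_fault_pfn(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct udmabuf *ubuf = vma->vm_private_data;

	if (vmf->pgoff >= ubuf->pagecount)
		return VM_FAULT_SIGBUS;

	return vmf_insert_pfn(vma, vmf->address,
			      page_to_pfn(ubuf->pages[vmf->pgoff]));
}

static int udmabuf_mmap_pfn(struct udmabuf *ubuf, struct vm_area_struct *vma)
{
	vma->vm_ops = &udmabuf_vm_ops;	/* assumes .fault = udmabuf_vm_fault_pfn */
	vma->vm_private_data = ubuf;
	/* VM_PFNMAP: no struct page accounting, no rmap */
	vm_flags_set(vma, VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
	return 0;
}

[One consequence worth flagging: a VM_IO | VM_PFNMAP mapping cannot be pinned
with get_user_pages(), which may or may not matter to existing udmabuf users.]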
  
Mike Kravetz June 7, 2023, 8:53 p.m. UTC | #10
On 06/07/23 12:03, Andrew Morton wrote:
> On Tue, 16 May 2023 15:34:40 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> > On 05/15/23 10:04, Mike Kravetz wrote:
> > > On 05/12/23 16:29, Mike Kravetz wrote:
> > > > On 05/12/23 14:26, James Houghton wrote:
> > > > > On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
> > > > > 
> > > > > This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
> > > > > need something like [1]. I can resend it if that's what we should be
> > > > > doing, but this mapcounting scheme doesn't work when the page structs
> > > > > have been freed.
> > > > > 
> > > > > It seems like it was a mistake to include support for hugetlb memfds in udmabuf.
> > > > 
> > > > IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for mapping
> > > > hugepages (v4).  Looks like it was never sent to linux-mm?  That is unfortunate
> > > > as hugetlb vmemmap freeing went in at about the same time.  And, as you have
> > > > noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
> > > > 
> > > > Sigh!
> > > > 
> > > > Trying to think of a way forward.
> > > > -- 
> > > > Mike Kravetz
> > > > 
> > > > > 
> > > > > [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/
> > > > > 
> > > > > - James
> > > 
> > > Adding people and list on Cc: involved with commit 16c243e99d33.
> > > 
> > > There are several issues with trying to map tail pages of hugetllb pages
> > > not taken into account with udmabuf.  James spent quite a bit of time trying
> > > to understand and address all the issues with the HGM code.  While using
> > > the scheme proposed by James, may be an approach to the mapcount issue there
> > > are also other issues that need attention.  For example, I do not see how
> > > the fault code checks the state of the hugetlb page (such as poison) as none
> > > of that state is carried in tail pages.
> > > 
> > > The more I think about it, the more I think udmabuf should treat hugetlb
> > > pages as hugetlb pages.  They should be mapped at the appropriate level
> > > in the page table.  Of course, this would impose new restrictions on the
> > > API (mmap and ioctl) that may break existing users.  I have no idea how
> > > extensively udmabuf is being used with hugetlb mappings.
> > 
> > Verified that using udmabug on a hugetlb mapping with vmemmap optimization will
> > BUG as:
> 
> BUGs aren't good.  Can we please find a way to push this along?
> 
> Have we heard anything from any udmabuf people?
> 

I have not heard anything.  When this issue popped up, it took me by surprise.

udmabuf maintainer (Gerd Hoffmann), the people who added hugetlb support and
the list where udmabuf was developed (dri-devel@lists.freedesktop.org) have
been on cc.

My 'gut reaction' would be to remove hugetlb support from udmabuf.  From a
quick look, if we really want this support then there will need to be some
API changes.  For example, UDMABUF_CREATE should be hugetlb-page aligned
and a multiple of the hugetlb page size if using a hugetlb mapping.

It would be good to know about users of the driver.
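
[To make the suggested restriction concrete, a hypothetical check in the
UDMABUF_CREATE path could look like the following. is_file_hugepages() and
hstate_file() are the existing hugetlb helpers; the surrounding memfd/list
variables are only assumed from the driver's create loop.]

	/* Hypothetical: only accept whole, hugetlb-page-aligned ranges when
	 * the backing memfd is hugetlbfs, so udmabuf never deals in tail
	 * pages. */
	if (is_file_hugepages(memfd)) {
		unsigned long hpage_size = huge_page_size(hstate_file(memfd));

		if (!IS_ALIGNED(list[i].offset, hpage_size) ||
		    !IS_ALIGNED(list[i].size, hpage_size))
			return -EINVAL;
	}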
  
Andrew Morton June 7, 2023, 9 p.m. UTC | #11
On Wed, 7 Jun 2023 13:53:10 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > 
> > BUGs aren't good.  Can we please find a way to push this along?
> > 
> > Have we heard anything from any udmabuf people?
> > 
> 
> I have not heard anything.  When this issue popped up, it took me by surprise.
> 
> udmabuf maintainer (Gerd Hoffmann), the people who added hugetlb support and
> the list where udmabuf was developed (dri-devel@lists.freedesktop.org) have
> been on cc.

Maybe Greg can suggest a way forward.

> My 'gut reaction' would be to remove hugetlb support from udmabuf.  From a
> quick look, if we really want this support then there will need to be some
> API changes.  For example UDMABUF_CREATE should be hugetlb page aligned
> and a multiple of hugetlb page size if using a hugetlb mapping.
> 
> It would be good to know about users of the driver.

So disabling "hugetlb=on" (and adding an explanatory printk) would
suffice for now?
  
Mike Kravetz June 7, 2023, 9:16 p.m. UTC | #12
On 06/07/23 14:00, Andrew Morton wrote:
> On Wed, 7 Jun 2023 13:53:10 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> > > 
> > > BUGs aren't good.  Can we please find a way to push this along?
> > > 
> > > Have we heard anything from any udmabuf people?
> > > 
> > 
> > I have not heard anything.  When this issue popped up, it took me by surprise.
> > 
> > udmabuf maintainer (Gerd Hoffmann), the people who added hugetlb support and
> > the list where udmabuf was developed (dri-devel@lists.freedesktop.org) have
> > been on cc.
> 
> Maybe Greg can suggest a way forward.
> 
> > My 'gut reaction' would be to remove hugetlb support from udmabuf.  From a
> > quick look, if we really want this support then there will need to be some
> > API changes.  For example UDMABUF_CREATE should be hugetlb page aligned
> > and a multiple of hugetlb page size if using a hugetlb mapping.
> > 
> > It would be good to know about users of the driver.
> 
> So disabling "hugetlb=on" (and adding an explanatory printk) would
> suffice for now?

I can put together a patch to do that.
  
Greg KH June 8, 2023, 7:59 a.m. UTC | #13
On Wed, Jun 07, 2023 at 02:00:01PM -0700, Andrew Morton wrote:
> On Wed, 7 Jun 2023 13:53:10 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> > > 
> > > BUGs aren't good.  Can we please find a way to push this along?
> > > 
> > > Have we heard anything from any udmabuf people?
> > > 
> > 
> > I have not heard anything.  When this issue popped up, it took me by surprise.
> > 
> > udmabuf maintainer (Gerd Hoffmann), the people who added hugetlb support and
> > the list where udmabuf was developed (dri-devel@lists.freedesktop.org) have
> > been on cc.
> 
> Maybe Greg can suggest a way forward.

I'm guessing that no one is using this code then, so why don't we just
remove it entirely?

thanks,

greg k-h
  
Gerd Hoffmann June 19, 2023, 12:27 p.m. UTC | #14
On Mon, May 15, 2023 at 10:04:42AM -0700, Mike Kravetz wrote:
> On 05/12/23 16:29, Mike Kravetz wrote:
> > On 05/12/23 14:26, James Houghton wrote:
> > > On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
> > > 
> > > This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
> > > need something like [1]. I can resend it if that's what we should be
> > > doing, but this mapcounting scheme doesn't work when the page structs
> > > have been freed.
> > > 
> > > It seems like it was a mistake to include support for hugetlb memfds in udmabuf.
> > 
> > IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for mapping
> > hugepages (v4).  Looks like it was never sent to linux-mm?  That is unfortunate
> > as hugetlb vmemmap freeing went in at about the same time.  And, as you have
> > noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
> > 
> > Sigh!
> > 
> > Trying to think of a way forward.
> > -- 
> > Mike Kravetz
> > 
> > > 
> > > [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/
> > > 
> > > - James
> 
> Adding people and list on Cc: involved with commit 16c243e99d33.
> 
> There are several issues with trying to map tail pages of hugetllb pages
> not taken into account with udmabuf.  James spent quite a bit of time trying
> to understand and address all the issues with the HGM code.  While using
> the scheme proposed by James, may be an approach to the mapcount issue there
> are also other issues that need attention.  For example, I do not see how
> the fault code checks the state of the hugetlb page (such as poison) as none
> of that state is carried in tail pages.
> 
> The more I think about it, the more I think udmabuf should treat hugetlb
> pages as hugetlb pages.  They should be mapped at the appropriate level
> in the page table.  Of course, this would impose new restrictions on the
> API (mmap and ioctl) that may break existing users.  I have no idea how
> extensively udmabuf is being used with hugetlb mappings.

User of this is qemu.  It can use the udmabuf driver to create host
dma-bufs for guest resources (virtio-gpu buffers), to avoid copying
data when showing the guest display in a host window.

hugetlb support is needed in case qemu guest memory is backed by
hugetlbfs.  That does not imply the virtio-gpu buffers are hugepage
aligned though, udmabuf would still need to operate on smaller chunks
of memory.  So with additional restrictions this will not work any
more for qemu.  I'd suggest to just revert hugetlb support instead
and go back to the drawing board.

Also not sure why hugetlbfs is used for guest memory in the first place.
It used to be a thing years ago, but with the arrival of transparent
hugepages there is, as far as I know, little reason to still use hugetlbfs.

Vivek? Dongwon?

take care,
  Gerd
  
Kasireddy, Vivek June 20, 2023, 6:23 a.m. UTC | #15
Hi Gerd,

> 
> On Mon, May 15, 2023 at 10:04:42AM -0700, Mike Kravetz wrote:
> > On 05/12/23 16:29, Mike Kravetz wrote:
> > > On 05/12/23 14:26, James Houghton wrote:
> > > > On Fri, May 12, 2023 at 12:20 AM Junxiao Chang
> <junxiao.chang@intel.com> wrote:
> > > >
> > > > This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages.
> You
> > > > need something like [1]. I can resend it if that's what we should be
> > > > doing, but this mapcounting scheme doesn't work when the page
> structs
> > > > have been freed.
> > > >
> > > > It seems like it was a mistake to include support for hugetlb memfds in
> udmabuf.
> > >
> > > IIUC, it was added with commit 16c243e99d33 udmabuf: Add support for
> mapping
> > > hugepages (v4).  Looks like it was never sent to linux-mm?  That is
> unfortunate
> > > as hugetlb vmemmap freeing went in at about the same time.  And, as
> you have
> > > noted udmabuf will not work if hugetlb vmemmap freeing is enabled.
> > >
> > > Sigh!
> > >
> > > Trying to think of a way forward.
> > > --
> > > Mike Kravetz
> > >
> > > >
> > > > [1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-
> jthoughton@google.com/
> > > >
> > > > - James
> >
> > Adding people and list on Cc: involved with commit 16c243e99d33.
> >
> > There are several issues with trying to map tail pages of hugetllb pages
> > not taken into account with udmabuf.  James spent quite a bit of time
> trying
> > to understand and address all the issues with the HGM code.  While using
> > the scheme proposed by James, may be an approach to the mapcount
> issue there
> > are also other issues that need attention.  For example, I do not see how
> > the fault code checks the state of the hugetlb page (such as poison) as none
> > of that state is carried in tail pages.
> >
> > The more I think about it, the more I think udmabuf should treat hugetlb
> > pages as hugetlb pages.  They should be mapped at the appropriate level
> > in the page table.  Of course, this would impose new restrictions on the
> > API (mmap and ioctl) that may break existing users.  I have no idea how
> > extensively udmabuf is being used with hugetlb mappings.
> 
> User of this is qemu.  It can use the udmabuf driver to create host
> dma-bufs for guest resources (virtio-gpu buffers), to avoid copying
> data when showing the guest display in a host window.
> 
> hugetlb support is needed in case qemu guest memory is backed by
> hugetlbfs.  That does not imply the virtio-gpu buffers are hugepage
> aligned though, udmabuf would still need to operate on smaller chunks
> of memory.  So with additional restrictions this will not work any
> more for qemu.  I'd suggest to just revert hugetlb support instead
> and go back to the drawing board.
> 
> Also not sure why hugetlbfs is used for guest memory in the first place.
> It used to be a thing years ago, but with the arrival of transparent
> hugepages there is as far I know little reason to still use hugetlbfs.
The main reason we are interested in using hugetlbfs for guest memory
is that we observed a non-trivial performance improvement while running
certain 3D-heavy workloads in the guest. And we noticed this by only
switching the guest memory backend to include hugepages (i.e., hugetlb=on),
with no other changes.

To address the current situation, I am readying a patch for the udmabuf
driver that would add back support for mapping hugepages, but without
making use of the subpages directly.

Thanks,
Vivek

> 
> Vivek? Dongwon?
> 
> take care,
>   Gerd
  

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 19392e090bec6..b42fc0389c243 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1377,9 +1377,9 @@  void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 
 	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
 
-	/* Hugetlb pages are not counted in NR_*MAPPED */
-	if (unlikely(folio_test_hugetlb(folio))) {
-		/* hugetlb pages are always mapped with pmds */
+	/* Hugetlb pages usually are not counted in NR_*MAPPED */
+	if (unlikely(folio_test_hugetlb(folio) && compound)) {
+		/* hugetlb pages are mapped with pmds */
 		atomic_dec(&folio->_entire_mapcount);
 		return;
 	}
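
[Finally, as a self-contained illustration of the counter imbalance the hunk
above targets, here is a userspace toy model. It is not kernel code; the
fields merely mirror _entire_mapcount and a tail page's _mapcount. As James
notes above, balancing these counters alone does not make PTE-mapped hugetlb
fully correct; the sketch only shows why the removal path must look at
`compound` and not just the hugetlb flag.]

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the two counters involved; not the real struct folio. */
struct toy_folio {
	int entire_mapcount;	/* mirrors folio->_entire_mapcount */
	int subpage_mapcount;	/* mirrors one tail page's _mapcount */
	bool is_hugetlb;
};

/* Add side: a PTE mapping (compound == false) bumps the subpage count. */
static void add_rmap(struct toy_folio *f, bool compound)
{
	if (compound)
		f->entire_mapcount++;
	else
		f->subpage_mapcount++;
}

/* Remove side: with check_compound == false (old code), hugetlb always
 * takes the "entire" branch; with true (patched), only PMD maps do. */
static void remove_rmap(struct toy_folio *f, bool compound, bool check_compound)
{
	if (f->is_hugetlb && (!check_compound || compound)) {
		f->entire_mapcount--;
		return;
	}
	if (compound)
		f->entire_mapcount--;
	else
		f->subpage_mapcount--;
}

int main(void)
{
	struct toy_folio old = { .is_hugetlb = true };
	struct toy_folio new = { .is_hugetlb = true };

	add_rmap(&old, false);			/* udmabuf PTE-maps a subpage */
	remove_rmap(&old, false, false);	/* old: entire=-1, subpage=1 */

	add_rmap(&new, false);
	remove_rmap(&new, false, true);		/* patched: both back to 0 */

	printf("old: entire=%d subpage=%d\n", old.entire_mapcount, old.subpage_mapcount);
	printf("new: entire=%d subpage=%d\n", new.entire_mapcount, new.subpage_mapcount);
	return 0;
}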