Message ID | 20230512072036.1027784-1-junxiao.chang@intel.com |
---|---|
State | New |
Series | mm: fix hugetlb page unmap count balance issue |
Commit Message
Chang, Junxiao
May 12, 2023, 7:20 a.m. UTC
A hugetlb page is usually mapped with a pmd, but occasionally it may be
mapped with a pte. QEMU can use udmabuf to create host dma-bufs for guest
framebuffers. When QEMU is launched with the parameter "hugetlb=on", the
udmabuf driver maps hugetlb pages with ptes in its page fault handler (a
simplified sketch of that handler follows the call chain below). The call
chain looks like:
page_add_file_rmap
  do_set_pte
    finish_fault
      __do_fault -> udmabuf_vm_fault, it maps hugetlb page here.
        do_read_fault
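
For reference, here is a simplified sketch of that fault handler, abridged
from drivers/dma-buf/udmabuf.c around v6.3 (field names and details may
differ slightly from the exact source):

static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct udmabuf *ubuf = vma->vm_private_data;
	pgoff_t pgoff = vmf->pgoff;

	if (pgoff >= ubuf->pagecount)
		return VM_FAULT_SIGBUS;

	/*
	 * ubuf->pages[] can hold individual 4K subpages of a hugetlb
	 * folio.  Handing one back here makes the generic fault path
	 * (do_set_pte -> page_add_file_rmap) install a pte mapping,
	 * i.e. page_add_file_rmap() is called with compound == false.
	 */
	vmf->page = ubuf->pages[pgoff];
	get_page(vmf->page);
	return 0;
}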
In page_add_file_rmap, compound is false since this is a pte mapping.
When QEMU exits and the page is unmapped in page_remove_rmap, the hugetlb
page must therefore not be handled as if it were pmd-mapped.
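
The resulting asymmetry can be summarized with the following simplified
pseudo-C (illustrative only, not the exact 6.3 mm/rmap.c code; the field
names _mapcount and _entire_mapcount follow mm/rmap.c):

/* map side -- reached from do_set_pte() for the udmabuf fault */
void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
			bool compound)
{
	struct folio *folio = page_folio(page);

	if (!compound)
		atomic_inc(&page->_mapcount);		/* per-subpage count */
	else
		atomic_inc(&folio->_entire_mapcount);	/* whole-folio count */
}

/* unmap side, before this patch */
void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
		      bool compound)
{
	struct folio *folio = page_folio(page);

	if (folio_test_hugetlb(folio)) {
		/* assumes hugetlb is always pmd-mapped -- not true here */
		atomic_dec(&folio->_entire_mapcount);
		return;
	}
	/* ... per-subpage accounting that the pte mapping actually needs ... */
}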
This change checks the compound parameter as well as the hugetlb flag, so
that only a pmd-mapped hugetlb page takes the _entire_mapcount path. It
fixes the kernel bug below, which was reproduced with a 6.3 kernel:
[ 114.027754] BUG: Bad page cache in process qemu-system-x86 pfn:37aa00
[ 114.034288] page:000000000dd2153b refcount:514 mapcount:-4 mapping:000000004b01ca30 index:0x13800 pfn:0x37aa00
[ 114.044277] head:000000000dd2153b order:9 entire_mapcount:-4 nr_pages_mapped:4 pincount:512
[ 114.052623] aops:hugetlbfs_aops ino:6f93
[ 114.056552] flags: 0x17ffffc0010001(locked|head|node=0|zone=2|lastcpupid=0x1fffff)
[ 114.064115] raw: 0017ffffc0010001 fffff7338deb0008 fffff7338dea0008 ffff98dc855ea870
[ 114.071847] raw: 000000000000009c 0000000000000002 00000202ffffffff 0000000000000000
[ 114.079572] page dumped because: still mapped when deleted
[ 114.085048] CPU: 0 PID: 3122 Comm: qemu-system-x86 Tainted: G BU W E 6.3.0-v3+ #62
[ 114.093566] Hardware name: Intel Corporation Alder Lake Client Platform DDR5 SODIMM SBS RVP, BIOS ADLPFWI1.R00.3084.D89.2303211034 03/21/2023
[ 114.106839] Call Trace:
[ 114.109291] <TASK>
[ 114.111405] dump_stack_lvl+0x4c/0x70
[ 114.115073] dump_stack+0x14/0x20
[ 114.118395] filemap_unaccount_folio+0x159/0x220
[ 114.123021] filemap_remove_folio+0x54/0x110
[ 114.127295] remove_inode_hugepages+0x111/0x5b0
[ 114.131834] hugetlbfs_evict_inode+0x23/0x50
[ 114.136111] evict+0xcd/0x1e0
[ 114.139083] iput.part.0+0x183/0x1e0
[ 114.142663] iput+0x20/0x30
[ 114.145466] dentry_unlink_inode+0xcc/0x130
[ 114.149655] __dentry_kill+0xec/0x1a0
[ 114.153325] dput+0x1ca/0x3c0
[ 114.156293] __fput+0xf4/0x280
[ 114.159357] ____fput+0x12/0x20
[ 114.162502] task_work_run+0x62/0xa0
[ 114.166088] do_exit+0x352/0xae0
[ 114.169321] do_group_exit+0x39/0x90
[ 114.172892] get_signal+0xa09/0xa30
[ 114.176391] arch_do_signal_or_restart+0x33/0x280
[ 114.181098] exit_to_user_mode_prepare+0x11f/0x190
[ 114.185893] syscall_exit_to_user_mode+0x2a/0x50
[ 114.190509] do_syscall_64+0x4c/0x90
[ 114.194095] entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
---
mm/rmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Comments
On Fri, 12 May 2023 15:20:36 +0800 Junxiao Chang <junxiao.chang@intel.com> wrote:

> hugetlb page usually is mapped with pmd, but occasionally it might be
> mapped with pte. QEMU can use udma-buf to create host dmabufs for guest
> framebuffers. When QEMU is launched with parameter "hugetlb=on",
> udmabuffer driver maps hugetlb page with pte in page fault handler.

Are there any other situations in which a hugetlb page is mapped in this
fashion?  If not, can QEMU be changed to map with a pmd?  So we get one
less weird special case in MM.
On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
>
> [...]
>
> -	/* Hugetlb pages are not counted in NR_*MAPPED */
> -	if (unlikely(folio_test_hugetlb(folio))) {
> -		/* hugetlb pages are always mapped with pmds */
> +	/* Hugetlb pages usually are not counted in NR_*MAPPED */
> +	if (unlikely(folio_test_hugetlb(folio) && compound)) {
> +		/* hugetlb pages are mapped with pmds */
> 		atomic_dec(&folio->_entire_mapcount);
> 		return;
> 	}

This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
need something like [1]. I can resend it if that's what we should be
doing, but this mapcounting scheme doesn't work when the page structs
have been freed.

It seems like it was a mistake to include support for hugetlb memfds in
udmabuf.

[1]: https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/

- James
On 05/12/23 14:26, James Houghton wrote:
> On Fri, May 12, 2023 at 12:20 AM Junxiao Chang <junxiao.chang@intel.com> wrote:
>
> This alone doesn't fix mapcounting for PTE-mapped HugeTLB pages. You
> need something like [1]. I can resend it if that's what we should be
> doing, but this mapcounting scheme doesn't work when the page structs
> have been freed.
>
> It seems like it was a mistake to include support for hugetlb memfds in
> udmabuf.

IIUC, it was added with commit 16c243e99d33 "udmabuf: Add support for
mapping hugepages (v4)".  Looks like it was never sent to linux-mm?  That
is unfortunate as hugetlb vmemmap freeing went in at about the same time.
And, as you have noted, udmabuf will not work if hugetlb vmemmap freeing
is enabled.

Sigh!

Trying to think of a way forward.
Thank you for reviewing it! We only reproduced this hugetlb mapping issue
with a QEMU command that opens the udmabuf driver (drivers/dma-buf/udmabuf.c).
I agree with you, it is better to map the huge page with a pmd instead of a
pte. 😊

-----Original Message-----
From: Andrew Morton <akpm@linux-foundation.org>
Sent: Saturday, May 13, 2023 5:04 AM
To: Chang, Junxiao <junxiao.chang@intel.com>
Subject: Re: [PATCH] mm: fix hugetlb page unmap count balance issue

> [...]

Are there any other situations in which a hugetlb page is mapped in this
fashion? If not, can QEMU be changed to map with a pmd? So we get one less
weird special case in MM.
With James' patch, might the udmabuf driver hit this VM_BUG_ON? Its compound
is false and folio_test_hugetlb(folio) is true:

 	if (likely(!compound)) {
+		if (unlikely(folio_test_hugetlb(folio)))
+			VM_BUG_ON_PAGE(HPageVmemmapOptimized(&folio->page),
+				       page);

It would be better to notify the udmabuf driver owner before merging James's
https://lore.kernel.org/linux-mm/20230306230004.1387007-2-jthoughton@google.com/

Thanks,
Junxiao

-----Original Message-----
From: Mike Kravetz <mike.kravetz@oracle.com>
Sent: Saturday, May 13, 2023 7:30 AM
To: James Houghton <jthoughton@google.com>
Subject: Re: [PATCH] mm: fix hugetlb page unmap count balance issue

> [...]
On 05/12/23 16:29, Mike Kravetz wrote:
> On 05/12/23 14:26, James Houghton wrote:
> > It seems like it was a mistake to include support for hugetlb memfds in
> > udmabuf.
>
> IIUC, it was added with commit 16c243e99d33 "udmabuf: Add support for
> mapping hugepages (v4)".  Looks like it was never sent to linux-mm?  That
> is unfortunate as hugetlb vmemmap freeing went in at about the same time.
> And, as you have noted, udmabuf will not work if hugetlb vmemmap freeing
> is enabled.
>
> Sigh!
>
> Trying to think of a way forward.

Adding people and list on Cc: involved with commit 16c243e99d33.

There are several issues with trying to map tail pages of hugetlb pages
that are not taken into account in udmabuf.  James spent quite a bit of
time trying to understand and address all the issues with the HGM code.
While the scheme proposed by James may be an approach to the mapcount
issue, there are also other issues that need attention.  For example, I do
not see how the fault code checks the state of the hugetlb page (such as
poison), as none of that state is carried in tail pages.

The more I think about it, the more I think udmabuf should treat hugetlb
pages as hugetlb pages.  They should be mapped at the appropriate level in
the page table.  Of course, this would impose new restrictions on the API
(mmap and ioctl) that may break existing users.  I have no idea how
extensively udmabuf is being used with hugetlb mappings.
On 05/15/23 10:04, Mike Kravetz wrote:
> On 05/12/23 16:29, Mike Kravetz wrote:
> > [...]
>
> The more I think about it, the more I think udmabuf should treat hugetlb
> pages as hugetlb pages.  They should be mapped at the appropriate level
> in the page table.  Of course, this would impose new restrictions on the
> API (mmap and ioctl) that may break existing users.  I have no idea how
> extensively udmabuf is being used with hugetlb mappings.

Verified that using udmabuf on a hugetlb mapping with vmemmap optimization
will BUG as:

[14106.812312] BUG: unable to handle page fault for address: ffffea000a7c4030
[14106.813704] #PF: supervisor write access in kernel mode
[14106.814791] #PF: error_code(0x0003) - permissions violation
[14106.815921] PGD 27fff9067 P4D 27fff9067 PUD 27fff8067 PMD 17ec34067 PTE 8000000285dab021
[14106.818489] Oops: 0003 [#1] PREEMPT SMP PTI
[14106.819345] CPU: 2 PID: 2313 Comm: udmabuf Not tainted 6.4.0-rc1-next-20230508+ #44
[14106.820906] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[14106.822679] RIP: 0010:page_add_file_rmap+0x2e/0x270

I started looking more closely at the driver and I do not fully understand
the usage model.  I took clues from the selftest and driver.  It seems the
first step is to create a buffer via the UDMABUF_CREATE ioctl.  This will
copy 4K pages from the page cache to an array associated with a file.  I
did note that hugetlb and shm behavior is different here, as the driver can
not add missing hugetlb pages to the cache as it does with shm.  However,
what seems more concerning is that there is nothing to prevent the pages
from being replaced in the cache before being added to a udmabuf mapping.
This means the udmabuf mapping and the original memfd could be operating on
a different set of pages.  Is this acceptable, or somehow prevented?

In my role, I am more interested in udmabuf handling of hugetlb pages.
Trying to use individual 4K pages of hugetlb pages is something that should
be avoided here.  Would it be acceptable to change code so that only whole
hugetlb pages are used by udmabuf?  If not, then perhaps the existing
hugetlb support can be removed?
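
For reference, the UDMABUF_CREATE flow described above looks roughly like
the following minimal userspace sketch, loosely modeled on the kernel
selftest (tools/testing/selftests/drivers/dma-buf/udmabuf.c); error
handling is omitted and the details are illustrative rather than a
definitive usage example:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int main(void)
{
	size_t size = 2 * 1024 * 1024;

	/* 1. Create and size the "original memfd"; udmabuf requires it to be
	 *    sealed against shrinking.  Using MFD_HUGETLB here instead is
	 *    what exercises the hugetlb path discussed in this thread (the
	 *    hugetlb pages must already be populated, since the driver will
	 *    not add missing hugetlb pages to the cache).
	 */
	int memfd = memfd_create("udmabuf-demo", MFD_ALLOW_SEALING);
	ftruncate(memfd, size);
	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

	/* 2. Ask the udmabuf driver to wrap those pages in a dma-buf fd. */
	struct udmabuf_create create = {
		.memfd  = memfd,
		.flags  = UDMABUF_FLAGS_CLOEXEC,
		.offset = 0,
		.size   = size,
	};
	int devfd = open("/dev/udmabuf", O_RDWR);
	int buffd = ioctl(devfd, UDMABUF_CREATE, &create);

	/* 3. mmap of the returned dma-buf fd is what eventually reaches
	 *    udmabuf_vm_fault() and the rmap accounting discussed above.
	 */
	void *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 buffd, 0);
	memset(map, 0, size);
	printf("udmabuf fd %d mapped at %p\n", buffd, map);
	return 0;
}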
On Tue, 16 May 2023 15:34:40 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 05/15/23 10:04, Mike Kravetz wrote:
> > [...]
> >
> > The more I think about it, the more I think udmabuf should treat hugetlb
> > pages as hugetlb pages.  They should be mapped at the appropriate level
> > in the page table.  Of course, this would impose new restrictions on the
> > API (mmap and ioctl) that may break existing users.  I have no idea how
> > extensively udmabuf is being used with hugetlb mappings.
>
> Verified that using udmabuf on a hugetlb mapping with vmemmap optimization
> will BUG as:

BUGs aren't good.  Can we please find a way to push this along?

Have we heard anything from any udmabuf people?
On 17.05.23 00:34, Mike Kravetz wrote:
> [...]
>
> In my role, I am more interested in udmabuf handling of hugetlb pages.
> Trying to use individual 4K pages of hugetlb pages is something that
> should be avoided here.  Would it be acceptable to change code so that
> only whole hugetlb pages are used by udmabuf?  If not, then perhaps the
> existing hugetlb support can be removed?

I'm wondering if that VMA shouldn't be some kind of special mapping
(VM_PFNMAP), such that the struct page is entirely ignored?

I'm quite confused and concerned when I read that code (what the hell is
it doing with shmem/hugetlb pages? why does the mapcount even get
adjusted?)

This all has a bad smell to it, I hope I'm missing something important ...
On 06/07/23 12:03, Andrew Morton wrote:
> On Tue, 16 May 2023 15:34:40 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > [...]
> >
> > Verified that using udmabuf on a hugetlb mapping with vmemmap
> > optimization will BUG as:
>
> BUGs aren't good.  Can we please find a way to push this along?
>
> Have we heard anything from any udmabuf people?

I have not heard anything.  When this issue popped up, it took me by
surprise.

udmabuf maintainer (Gerd Hoffmann), the people who added hugetlb support
and the list where udmabuf was developed (dri-devel@lists.freedesktop.org)
have been on cc.

My 'gut reaction' would be to remove hugetlb support from udmabuf.  From a
quick look, if we really want this support then there will need to be some
API changes.  For example, UDMABUF_CREATE should be hugetlb page aligned
and a multiple of hugetlb page size if using a hugetlb mapping.

It would be good to know about users of the driver.
On Wed, 7 Jun 2023 13:53:10 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > BUGs aren't good.  Can we please find a way to push this along?
> >
> > Have we heard anything from any udmabuf people?
>
> I have not heard anything.  When this issue popped up, it took me by
> surprise.
>
> udmabuf maintainer (Gerd Hoffmann), the people who added hugetlb support
> and the list where udmabuf was developed (dri-devel@lists.freedesktop.org)
> have been on cc.

Maybe Greg can suggest a way forward.

> My 'gut reaction' would be to remove hugetlb support from udmabuf.  From a
> quick look, if we really want this support then there will need to be some
> API changes.  For example UDMABUF_CREATE should be hugetlb page aligned
> and a multiple of hugetlb page size if using a hugetlb mapping.
>
> It would be good to know about users of the driver.

So disabling "hugetlb=on" (and adding an explanatory printk) would suffice
for now?
On 06/07/23 14:00, Andrew Morton wrote:
> On Wed, 7 Jun 2023 13:53:10 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > [...]
> >
> > My 'gut reaction' would be to remove hugetlb support from udmabuf.  From
> > a quick look, if we really want this support then there will need to be
> > some API changes.  For example UDMABUF_CREATE should be hugetlb page
> > aligned and a multiple of hugetlb page size if using a hugetlb mapping.
> >
> > It would be good to know about users of the driver.
>
> So disabling "hugetlb=on" (and adding an explanatory printk) would suffice
> for now?

I can put together a patch to do that.
On Wed, Jun 07, 2023 at 02:00:01PM -0700, Andrew Morton wrote:
> On Wed, 7 Jun 2023 13:53:10 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > [...]
> >
> > udmabuf maintainer (Gerd Hoffmann), the people who added hugetlb support
> > and the list where udmabuf was developed (dri-devel@lists.freedesktop.org)
> > have been on cc.
>
> Maybe Greg can suggest a way forward.

I'm guessing that no one is using this code then, so why don't we just
remove it entirely?

thanks,

greg k-h
On Mon, May 15, 2023 at 10:04:42AM -0700, Mike Kravetz wrote:
> [...]
>
> The more I think about it, the more I think udmabuf should treat hugetlb
> pages as hugetlb pages.  They should be mapped at the appropriate level
> in the page table.  Of course, this would impose new restrictions on the
> API (mmap and ioctl) that may break existing users.  I have no idea how
> extensively udmabuf is being used with hugetlb mappings.

User of this is qemu.  It can use the udmabuf driver to create host
dma-bufs for guest resources (virtio-gpu buffers), to avoid copying data
when showing the guest display in a host window.

hugetlb support is needed in case qemu guest memory is backed by
hugetlbfs.  That does not imply the virtio-gpu buffers are hugepage
aligned though; udmabuf would still need to operate on smaller chunks of
memory.  So with additional restrictions this will not work any more for
qemu.  I'd suggest to just revert hugetlb support instead and go back to
the drawing board.

Also not sure why hugetlbfs is used for guest memory in the first place.
It used to be a thing years ago, but with the arrival of transparent
hugepages there is, as far as I know, little reason to still use hugetlbfs.

Vivek? Dongwon?

take care,
  Gerd
Hi Gerd,

> On Mon, May 15, 2023 at 10:04:42AM -0700, Mike Kravetz wrote:
> > [...]
>
> User of this is qemu.  It can use the udmabuf driver to create host
> dma-bufs for guest resources (virtio-gpu buffers), to avoid copying data
> when showing the guest display in a host window.
>
> hugetlb support is needed in case qemu guest memory is backed by
> hugetlbfs.  That does not imply the virtio-gpu buffers are hugepage
> aligned though; udmabuf would still need to operate on smaller chunks of
> memory.  So with additional restrictions this will not work any more for
> qemu.  I'd suggest to just revert hugetlb support instead and go back to
> the drawing board.
>
> Also not sure why hugetlbfs is used for guest memory in the first place.
> It used to be a thing years ago, but with the arrival of transparent
> hugepages there is, as far as I know, little reason to still use hugetlbfs.

The main reason why we are interested in using hugetlbfs for guest memory
is that we observed a non-trivial performance improvement while running
certain 3D-heavy workloads in the guest.  We noticed this by only switching
the guest memory backend to include hugepages (i.e., hugetlb=on), with no
other changes.

To address the current situation, I am readying a patch for the udmabuf
driver that would add back support for mapping hugepages, but without
making use of the subpages directly.

Thanks,
Vivek

> Vivek? Dongwon?
>
> take care,
>   Gerd
diff --git a/mm/rmap.c b/mm/rmap.c
index 19392e090bec6..b42fc0389c243 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1377,9 +1377,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 
 	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
 
-	/* Hugetlb pages are not counted in NR_*MAPPED */
-	if (unlikely(folio_test_hugetlb(folio))) {
-		/* hugetlb pages are always mapped with pmds */
+	/* Hugetlb pages usually are not counted in NR_*MAPPED */
+	if (unlikely(folio_test_hugetlb(folio) && compound)) {
+		/* hugetlb pages are mapped with pmds */
 		atomic_dec(&folio->_entire_mapcount);
 		return;
 	}