Message ID | bc27af32b0418ed1138a1c3a41e46f54559025a5.1665991453.git.baolin.wang@linux.alibaba.com
---|---
State | New
Headers | From: Baolin Wang <baolin.wang@linux.alibaba.com>; To: akpm@linux-foundation.org; Cc: arnd@arndb.de, baolin.wang@linux.alibaba.com, jingshan@linux.alibaba.com, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org; Date: Mon, 17 Oct 2022 15:32:01 +0800
Series | [RFC] mm: Introduce new MADV_NOMOVABLE behavior
Commit Message
Baolin Wang
Oct. 17, 2022, 7:32 a.m. UTC
When creating a virtual machine, we use memfd_create() to get a file
descriptor that can be used to create shared memory mappings with
mmap(); the mmap() call also passes the MAP_POPULATE flag to allocate
physical pages for the virtual machine up front.
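As a concrete illustration (the buffer name, size, and error handling
below are made up, not taken from our code), the setup boils down to:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 1UL << 30;	/* e.g. 1 GiB of guest memory */

	int fd = memfd_create("guest-ram", 0);
	if (fd < 0) {
		perror("memfd_create");
		return EXIT_FAILURE;
	}
	if (ftruncate(fd, size) < 0) {
		perror("ftruncate");
		return EXIT_FAILURE;
	}

	/*
	 * MAP_POPULATE prefaults the mapping, so physical pages are
	 * allocated for the whole range right here.
	 */
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_POPULATE, fd, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* ... hand 'mem' to the VMM as guest RAM ... */
	return EXIT_SUCCESS;
}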
When allocating physical pages for the guest, the host can fall back
to allocating CMA pages for the guest when over half of the zone's
free memory is in the CMA area.
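For reference, this fallback heuristic lives in __rmqueue() in
mm/page_alloc.c; paraphrased from recent kernels (a sketch of the
upstream logic, not verbatim source):

	if (IS_ENABLED(CONFIG_CMA)) {
		/*
		 * Balance movable allocations between regular and CMA
		 * areas by allocating from CMA when over half of the
		 * zone's free memory is in the CMA area.
		 */
		if (alloc_flags & ALLOC_CMA &&
		    zone_page_state(zone, NR_FREE_CMA_PAGES) >
		    zone_page_state(zone, NR_FREE_PAGES) / 2) {
			page = __rmqueue_cma_fallback(zone, order);
			if (page)
				return page;
		}
	}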
In the guest OS, when an application wants to do DMA transactions, our
QEMU calls the VFIO_IOMMU_MAP_DMA ioctl to longterm-pin the DMA pages
and create IOMMU mappings for them. However, we found that this
longterm-pin sometimes fails.
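The pinning step reduces to roughly the following sketch
('container_fd' is assumed to be an already-configured VFIO container;
group attachment and VFIO_SET_IOMMU setup are omitted):

#include <linux/vfio.h>
#include <string.h>
#include <sys/ioctl.h>

int map_dma(int container_fd, void *vaddr, __u64 iova, __u64 size)
{
	struct vfio_iommu_type1_dma_map map;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (__u64)(unsigned long)vaddr;
	map.iova  = iova;
	map.size  = size;

	/*
	 * The longterm-pin happens here: if any backing page sits in
	 * CMA/ZONE_MOVABLE and cannot be migrated away, the ioctl fails.
	 */
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}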
After some investigation, we found that the pages used for the DMA
mapping can include CMA pages, and these CMA pages make the
longterm-pin prone to failure because migrating them can fail. The
migration can fail due to a temporary reference count on a page or a
memory allocation failure. The VFIO_IOMMU_MAP_DMA ioctl then returns
an error, and the application fails to start.
To fix this issue, this patch introduces a new madvise behavior, named
MADV_NOMOVABLE, to avoid allocating CMA and movable pages when the
user intends to longterm-pin the memory, which removes the possible
failure of migrating movable or CMA pages.
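From user space, the intended usage would look like the sketch below.
MADV_NOMOVABLE (value 26) exists only in this patch, and the
populate_pinnable() helper is purely illustrative. Pairing the advice
with MADV_POPULATE_WRITE instead of MAP_POPULATE ensures the flag is
set before the pages are allocated:

#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_NOMOVABLE
#define MADV_NOMOVABLE		26	/* from this patch only */
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE	23
#endif

static int populate_pinnable(void *addr, size_t len)
{
	/*
	 * Mark the VMA first, so the page fault path avoids
	 * CMA/ZONE_MOVABLE pages ...
	 */
	if (madvise(addr, len, MADV_NOMOVABLE))
		return -errno;

	/*
	 * ... then prefault: MADV_POPULATE_WRITE triggers write faults,
	 * and the allocations now honor the VM_NOMOVABLE flag.
	 */
	if (madvise(addr, len, MADV_POPULATE_WRITE))
		return -errno;

	return 0;
}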
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/mm.h                     | 6 ++++++
 include/uapi/asm-generic/mman-common.h | 2 ++
 mm/madvise.c                           | 6 ++++++
 mm/memory.c                            | 6 ++++++
 4 files changed, 20 insertions(+)
Comments
On 17.10.22 09:32, Baolin Wang wrote:
> [full patch description snipped]

Sorry to say, but that sounds like a hack to work around a kernel
implementation detail (how often we retry to migrate pages).

If there are CMA/ZONE_MOVABLE issues, please fix them instead, and
avoid leaking these details to user space.

ALSO, with MAP_POPULATE as described by you this madvise flag doesn't
make too much sense, because it will get set after all memory already
was allocated ...

NAK
On 10/17/2022 4:41 PM, David Hildenbrand wrote:
> On 17.10.22 09:32, Baolin Wang wrote:
>> [full patch description snipped]
>
> Sorry to say, but that sounds like a hack to work around a kernel
> implementation detail (how often we retry to migrate pages).

IMO, in our case a single migration failure makes our application fail
to start, which is not a trivial problem. So mitigating migration
failures is important in this case.

> If there are CMA/ZONE_MOVABLE issues, please fix them instead, and
> avoid leaking these details to user space.

Right now we cannot forbid the fallback to CMA allocation when there
is enough free CMA memory in the zone, right? So adding a hint that
helps to disable the ALLOC_CMA flag seems reasonable?

As for the CMA/ZONE_MOVABLE details, yes, they are not suitable to
leak to user space, so how about renaming the madvise behavior to
MADV_PINNABLE, meaning we will longterm-pin the memory after
allocation, and no CMA/ZONE_MOVABLE pages should be allocated for it?
Or do you have a better idea? Thanks.

> ALSO, with MAP_POPULATE as described by you this madvise flag doesn't
> make too much sense, because it will get set after all memory already
> was allocated ...

This is not a problem, I think; we can switch to MADV_POPULATE_XXX to
preallocate the physical pages after the MADV_NOMOVABLE madvise.
On 17.10.22 11:09, Baolin Wang wrote:
> IMO, in our case a single migration failure makes our application
> fail to start, which is not a trivial problem. So mitigating
> migration failures is important in this case.

The right thing to do is to understand why these migrations fail and
see if we can improve the migration code.

> Right now we cannot forbid the fallback to CMA allocation when there
> is enough free CMA memory in the zone, right? So adding a hint that
> helps to disable the ALLOC_CMA flag seems reasonable?
>
> As for the CMA/ZONE_MOVABLE details, yes, they are not suitable to
> leak to user space, so how about renaming the madvise behavior to
> MADV_PINNABLE, meaning we will longterm-pin the memory after
> allocation, and no CMA/ZONE_MOVABLE pages should be allocated for it?

I really don't think any of these new user-visible madv modes with
questionable semantics to work around kernel implementation issues are
a good idea.

Especially MADV_PINNABLE has a *very* misleading name.

I understand that something like "MADV_MIGHT_PIN" *might* be helpful
to minimize page migration. But IMHO that could only be a pure
optimization; it wouldn't stop us from allocating (or migrating to)
CMA/ZONE_MOVABLE in the kernel on all code paths. It would be best
effort only.

It's not user space's decision how/where the kernel allocates memory.
No hacking around that.

> Or do you have a better idea? Thanks.

Investigate why migration of these pages fails and how we can improve
the code to make migration of these pages work more reliably.

I am not completely against having a kernel parameter that would
disable allocating from CMA areas completely, even though it defeats
the purpose of CMA. But it wouldn't apply to ZONE_MOVABLE, so it would
be just another hackish approach.

>>> ALSO, with MAP_POPULATE as described by you this madvise flag
>>> doesn't make too much sense, because it will get set after all
>>> memory already was allocated ...
>>
>> This is not a problem, I think; we can switch to MADV_POPULATE_XXX
>> to preallocate the physical pages after the MADV_NOMOVABLE madvise.

Yes, I know; I'm pointing out that your patch description is
inconsistent.
On 10/17/2022 7:27 PM, David Hildenbrand wrote:
> The right thing to do is to understand why these migrations fail and
> see if we can improve the migration code.

OK. See below.

> I really don't think any of these new user-visible madv modes with
> questionable semantics to work around kernel implementation issues
> are a good idea.
>
> Especially MADV_PINNABLE has a *very* misleading name.
> [...]
> Investigate why migration of these pages fails and how we can improve
> the code to make migration of these pages work more reliably.

I observed that in one migration failure case (which is not easy to
reproduce) the 'thp_migration_fail' count is 1 and the
'thp_split_page_failed' count is also 1.

That means that when migrating a THP located in a CMA area, a new THP
could not be allocated due to memory fragmentation, so the THP had to
be split. However, the THP split also failed, probably because of a
temporary reference count on the THP. The temporary reference count
can be caused by dropping page caches (I observed a drop-caches
operation on the system), but the shmem page caches could not be
dropped because they were already dirty at that time.

So could we try again in migrate_pages() when a THP split fails, to
mitigate the migration failure, especially when the failure reason is
a temporary reference count? Does this sound reasonable to you?

However, I am still worried that there are other possible causes of
migration failure, so avoiding CMA allocation entirely seems more
stable for our case, IMO.

> I am not completely against having a kernel parameter that would
> disable allocating from CMA areas completely, even though it defeats
> the purpose of CMA. But it wouldn't apply to ZONE_MOVABLE, so it
> would be just another hackish approach.
>
>> This is not a problem, I think; we can switch to MADV_POPULATE_XXX
>> to preallocate the physical pages after the MADV_NOMOVABLE madvise.
>
> Yes, I know; I'm pointing out that your patch description is
> inconsistent.

OK. Thanks.
> I observed that in one migration failure case (which is not easy to
> reproduce) the 'thp_migration_fail' count is 1 and the
> 'thp_split_page_failed' count is also 1.
>
> That means that when migrating a THP located in a CMA area, a new THP
> could not be allocated due to memory fragmentation, so the THP had to
> be split. However, the THP split also failed, probably because of a
> temporary reference count on the THP. The temporary reference count
> can be caused by dropping page caches (I observed a drop-caches
> operation on the system), but the shmem page caches could not be
> dropped because they were already dirty at that time.
>
> So could we try again in migrate_pages() when a THP split fails, to
> mitigate the migration failure, especially when the failure reason is
> a temporary reference count? Does this sound reasonable to you?

It sounds reasonable, and I understand that debugging these issues is
tricky. But we really have to figure out the root cause, so that pages
which are indeed movable (but temporarily unmovable for reason XYZ)
can actually be migrated.

We'd need some indication to retry migration longer / again.

> However, I am still worried that there are other possible causes of
> migration failure, so avoiding CMA allocation entirely seems more
> stable for our case, IMO.

Yes, I can understand that. But as one example, your approach doesn't
handle the case where a page that was allocated on !CMA/!ZONE_MOVABLE
gets migrated to CMA/ZONE_MOVABLE just before you try pinning it (to
migrate it off CMA/ZONE_MOVABLE again).

We really have to fix the root cause.
On 10/19/2022 11:17 PM, David Hildenbrand wrote:
> It sounds reasonable, and I understand that debugging these issues is
> tricky. But we really have to figure out the root cause, so that
> pages which are indeed movable (but temporarily unmovable for reason
> XYZ) can actually be migrated.
>
> We'd need some indication to retry migration longer / again.

OK. Let me try this and see if there are other possible failure cases
in our products.

> Yes, I can understand that. But as one example, your approach doesn't
> handle the case where a page that was allocated on !CMA/!ZONE_MOVABLE
> gets migrated to CMA/ZONE_MOVABLE just before you try pinning it (to
> migrate it off CMA/ZONE_MOVABLE again).

Indeed; as you said before, it would only help to minimize page
migration. Maybe I can also take MADV_PINNABLE into account when
allocating new pages, such as in alloc_migration_target(). Anyway, let
me first try to fix the root cause and see if that solves our problem.

> We really have to fix the root cause.

OK. Thanks for your input.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c63dfc804f1e..c9b2ab6e96fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -307,6 +307,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_NOMOVABLE	0x100000000	/* Avoid movable pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
 #define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
@@ -661,6 +662,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool vma_no_movable(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_NOMOVABLE;
+}
+
 static inline
 struct vm_area_struct *vma_find(struct vma_iterator *vmi, unsigned long max)
 {
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..d6e64eda28b6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_NOMOVABLE	26		/* Avoid movable pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..fc59d4f1f123 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1045,6 +1045,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
+	case MADV_NOMOVABLE:
+		new_flags |= VM_NOMOVABLE;
+		break;
 	case MADV_DODUMP:
 		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
 			return -EINVAL;
@@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_PAGEOUT:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_NOMOVABLE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
@@ -1360,6 +1364,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		triggering read faults if required
  *  MADV_POPULATE_WRITE - populate (prefault) page tables writable by
  *		triggering write faults if required
+ *  MADV_NOMOVABLE - avoid movable pages allocation in the page fault path
+ *		due to longterm-pin required.
  *
  * return values:
  *  zero    - success
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..5b75be6ba659 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5189,6 +5189,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
+	unsigned long pf_flags;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -5203,6 +5204,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 					    flags & FAULT_FLAG_REMOTE))
 		return VM_FAULT_SIGSEGV;
 
+	if (vma_no_movable(vma))
+		pf_flags = memalloc_pin_save();
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
 	 * space.  Kernel faults are handled more gracefully.
@@ -5231,6 +5234,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		mem_cgroup_oom_synchronize(false);
 	}
 
+	if (vma_no_movable(vma))
+		memalloc_pin_restore(pf_flags);
+
 	mm_account_fault(regs, address, flags, ret);
 
 	return ret;
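For context, memalloc_pin_save()/memalloc_pin_restore() used above are
existing helpers in include/linux/sched/mm.h; they set PF_MEMALLOC_PIN
on the current task, which makes the page allocator drop ALLOC_CMA and
avoid ZONE_MOVABLE. Paraphrased as a sketch:

static inline unsigned int memalloc_pin_save(void)
{
	unsigned int flags = current->flags & PF_MEMALLOC_PIN;

	/* Any allocation on this task now avoids CMA/ZONE_MOVABLE. */
	current->flags |= PF_MEMALLOC_PIN;
	return flags;
}

static inline void memalloc_pin_restore(unsigned int flags)
{
	/* Restore the previous PF_MEMALLOC_PIN state. */
	current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
}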