[v7,0/4] Implement IOCTL to get and/or the clear info about PTEs

Message ID 20230109064519.3555250-1-usama.anjum@collabora.com
Headers
Series Implement IOCTL to get and/or the clear info about PTEs |

Message

Muhammad Usama Anjum Jan. 9, 2023, 6:45 a.m. UTC
  *Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
  flags

Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [A][B] as this is how the feature was written years ago.
It shouldn't be updated to changed behaviour. Peter Xu has suggested
using the async version of the UFFD WP [C] as it is based inherently
on the PTEs.

So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.

[A] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
[B] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
[C] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n

*Changes in v6:*
- Updated the interface and made cosmetic changes

*Cover Letter in v5:*
Hello,

This patch series implements IOCTL on the pagemap procfs file to get the
information about the page table entries (PTEs). The following operations
are supported in this ioctl:
- Get the information if the pages are soft-dirty, file mapped, present
  or swapped.
- Clear the soft-dirty PTE bit of the pages.
- Get and clear the soft-dirty PTE bit of the pages atomically.

Soft-dirty PTE bit of the memory pages can be read by using the pagemap
procfs file. The soft-dirty PTE bit for the whole memory range of the
process can be cleared by writing to the clear_refs file. There are other
methods to mimic this information entirely in userspace with poor
performance:
- The mprotect syscall and SIGSEGV handler for bookkeeping
- The userfaultfd syscall with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty PTE bit status and clear operation
  possible.
- The soft-dirty PTE bit of only a part of memory cannot be cleared.

Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows. This syscall is used by games to
keep track of dirty pages to process only the dirty pages.

The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project[2][3]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project[2].

The IOCTL returns the addresses of the pages which match the specific masks.
The page addresses are returned in struct page_region in a compact form.
The max_pages is needed to support a use case where user only wants to get
a specific number of pages. So there is no need to find all the pages of
interest in the range when max_pages is specified. The IOCTL returns when
the maximum number of the pages are found. The max_pages is optional. If
max_pages is specified, it must be equal or greater than the vec_size.
This restriction is needed to handle worse case when one page_region only
contains info of one page and it cannot be compacted. This is needed to
emulate the Windows getWriteWatch() syscall.

Some non-dirty pages get marked as dirty because of the kernel's
internal activity (such as VMA merging as soft-dirty bit difference isn't
considered while deciding to merge VMAs). The dirty bit of the pages is
stored in the VMA flags and in the per page flags. If any of these two bits
are set, the page is considered to be soft dirty. Suppose you have cleared
the soft dirty bit of half of VMA which will be done by splitting the VMA
and clearing soft dirty bit flag in the half VMA and the pages in it. Now
kernel may decide to merge the VMAs again. So the half VMA becomes dirty
again. This splitting/merging costs performance. The application receives
a lot of pages which aren't dirty in reality but marked as dirty.
Performance is lost again here. Also sometimes user doesn't want the newly
allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
solves both the problems. It is used to not depend on the soft dirty flag
in the VMA flags. So VMA splitting and merging doesn't happen. It only
depends on the soft dirty bit of the individual pages. Thus by using this
flag, there may be a scenerio such that the new memory regions which are
just created, doesn't look dirty when seen with the IOCTL, but look dirty
when seen from procfs. This seems okay as the user of this flag know the
implication of using it.

[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
[2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (4):
  userfaultfd: Add UFFD WP Async support
  userfaultfd: split mwriteprotect_range()
  fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
    PTEs
  selftests: vm: add pagemap ioctl tests

 fs/proc/task_mmu.c                         | 300 +++++++
 fs/userfaultfd.c                           | 161 ++--
 include/linux/userfaultfd_k.h              |  10 +
 include/uapi/linux/fs.h                    |  50 ++
 include/uapi/linux/userfaultfd.h           |   6 +
 mm/userfaultfd.c                           |  40 +-
 tools/include/uapi/linux/fs.h              |  50 ++
 tools/testing/selftests/vm/.gitignore      |   1 +
 tools/testing/selftests/vm/Makefile        |   5 +-
 tools/testing/selftests/vm/pagemap_ioctl.c | 884 +++++++++++++++++++++
 10 files changed, 1424 insertions(+), 83 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
  

Comments

Muhammad Usama Anjum Jan. 18, 2023, 6:55 a.m. UTC | #1
On 1/9/23 11:45 AM, Muhammad Usama Anjum wrote:
> *Changes in v7:*
> - Add uffd wp async
> - Update the IOCTL to use uffd under the hood instead of soft-dirty
>   flags
> 
> Stop using the soft-dirty flags for finding which pages have been
> written to. It is too delicate and wrong as it shows more soft-dirty
> pages than the actual soft-dirty pages. There is no interest in
> correcting it [A][B] as this is how the feature was written years ago.
> It shouldn't be updated to changed behaviour. Peter Xu has suggested
> using the async version of the UFFD WP [C] as it is based inherently
> on the PTEs.
> 
> So in this patch series, I've added a new mode to the UFFD which is
> asynchronous version of the write protect. When this variant of the
> UFFD WP is used, the page faults are resolved automatically by the
> kernel. The pages which have been written-to can be found by reading
> pagemap file (!PM_UFFD_WP). This feature can be used successfully to
> find which pages have been written to from the time the pages were
> write protected. This works just like the soft-dirty flag without
> showing any extra pages which aren't soft-dirty in reality.
Any thoughts on this version are highly welcome. Please review.

> 
> [A] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
> [B] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
> [C] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
> 
> *Changes in v6:*
> - Updated the interface and made cosmetic changes
> 
> *Cover Letter in v5:*
> Hello,
> 
> This patch series implements IOCTL on the pagemap procfs file to get the
> information about the page table entries (PTEs). The following operations
> are supported in this ioctl:
> - Get the information if the pages are soft-dirty, file mapped, present
>   or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages atomically.
> 
> Soft-dirty PTE bit of the memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are other
> methods to mimic this information entirely in userspace with poor
> performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen here[1]. This series adds features that weren't
> present earlier:
> - There is no atomic get soft-dirty PTE bit status and clear operation
>   possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have the use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages to process only the dirty pages.
> 
> The information related to pages if the page is file mapped, present and
> swapped is required for the CRIU project[2][3]. The addition of the
> required mask, any mask, excluded mask and return masks are also required
> for the CRIU project[2].
> 
> The IOCTL returns the addresses of the pages which match the specific masks.
> The page addresses are returned in struct page_region in a compact form.
> The max_pages is needed to support a use case where user only wants to get
> a specific number of pages. So there is no need to find all the pages of
> interest in the range when max_pages is specified. The IOCTL returns when
> the maximum number of the pages are found. The max_pages is optional. If
> max_pages is specified, it must be equal or greater than the vec_size.
> This restriction is needed to handle worse case when one page_region only
> contains info of one page and it cannot be compacted. This is needed to
> emulate the Windows getWriteWatch() syscall.
> 
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity (such as VMA merging as soft-dirty bit difference isn't
> considered while deciding to merge VMAs). The dirty bit of the pages is
> stored in the VMA flags and in the per page flags. If any of these two bits
> are set, the page is considered to be soft dirty. Suppose you have cleared
> the soft dirty bit of half of VMA which will be done by splitting the VMA
> and clearing soft dirty bit flag in the half VMA and the pages in it. Now
> kernel may decide to merge the VMAs again. So the half VMA becomes dirty
> again. This splitting/merging costs performance. The application receives
> a lot of pages which aren't dirty in reality but marked as dirty.
> Performance is lost again here. Also sometimes user doesn't want the newly
> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
> solves both the problems. It is used to not depend on the soft dirty flag
> in the VMA flags. So VMA splitting and merging doesn't happen. It only
> depends on the soft dirty bit of the individual pages. Thus by using this
> flag, there may be a scenerio such that the new memory regions which are
> just created, doesn't look dirty when seen with the IOCTL, but look dirty
> when seen from procfs. This seems okay as the user of this flag know the
> implication of using it.
> 
> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
> [2] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
> [3] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
> 
> Regards,
> Muhammad Usama Anjum
> 
> Muhammad Usama Anjum (4):
>   userfaultfd: Add UFFD WP Async support
>   userfaultfd: split mwriteprotect_range()
>   fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
>     PTEs
>   selftests: vm: add pagemap ioctl tests
> 
>  fs/proc/task_mmu.c                         | 300 +++++++
>  fs/userfaultfd.c                           | 161 ++--
>  include/linux/userfaultfd_k.h              |  10 +
>  include/uapi/linux/fs.h                    |  50 ++
>  include/uapi/linux/userfaultfd.h           |   6 +
>  mm/userfaultfd.c                           |  40 +-
>  tools/include/uapi/linux/fs.h              |  50 ++
>  tools/testing/selftests/vm/.gitignore      |   1 +
>  tools/testing/selftests/vm/Makefile        |   5 +-
>  tools/testing/selftests/vm/pagemap_ioctl.c | 884 +++++++++++++++++++++
>  10 files changed, 1424 insertions(+), 83 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
>
  
Peter Xu Jan. 18, 2023, 10:12 p.m. UTC | #2
On Mon, Jan 09, 2023 at 11:45:15AM +0500, Muhammad Usama Anjum wrote:
> *Changes in v7:*
> - Add uffd wp async
> - Update the IOCTL to use uffd under the hood instead of soft-dirty
>   flags
> 
> Stop using the soft-dirty flags for finding which pages have been
> written to. It is too delicate and wrong as it shows more soft-dirty
> pages than the actual soft-dirty pages. There is no interest in
> correcting it [A][B] as this is how the feature was written years ago.
> It shouldn't be updated to changed behaviour. Peter Xu has suggested
> using the async version of the UFFD WP [C] as it is based inherently
> on the PTEs.
> 
> So in this patch series, I've added a new mode to the UFFD which is
> asynchronous version of the write protect. When this variant of the
> UFFD WP is used, the page faults are resolved automatically by the
> kernel. The pages which have been written-to can be found by reading
> pagemap file (!PM_UFFD_WP). This feature can be used successfully to
> find which pages have been written to from the time the pages were
> write protected. This works just like the soft-dirty flag without
> showing any extra pages which aren't soft-dirty in reality.
> 
> [A] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
> [B] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
> [C] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
> 
> *Changes in v6:*
> - Updated the interface and made cosmetic changes
> 
> *Cover Letter in v5:*
> Hello,

Please consider either drop the cover letter below this point or rephrase,
otherwise many of them are not true anymore and it can confuse the
reviewers.

I have a few high level comments/questions here, please bare with me if any
of them are already discussed by others in the old versions; I'd be happy
to read them when there's a pointer to the relevant answers.

Firstly, doc update is more than welcomed to explain the new interface
first (before throwing the code..).  That can be done in pagemap.rst on
pagemap changes, or userfaultfd.rst on userfaultfd.

Besides, can you provide more justification on the new pagemap-side
interface design?

It seems it came from the Windows API GetWriteWatch(), but it's definitely
not exactly that.  Let me spell some points out..

There're four kinds of masks (required/anyof/excluded/return).  Are they
all needed?  Why this is a good interface design?

I saw you used page_region structure to keep the information.  I think you
wanted to have a densed output, especially if counting in the "return mask"
above it starts to make more sense. If with a very limited return mask it
means many of the (continuous) page information can be merged into a single
page_region struct when the kernel is scanning.

However, at the meantime the other three masks (required/anyof/excluded)
made me quite confused - it means you wanted to somehow filter the pages
and only some of them will get collected.  The thing is for a continuous
page range if any of the page got skipped due to the masks (e.g. not in
"required" or in "excluded") it also means it can never be merged into
previous page_region either.  That seems to be against the principle of
having densed output.

I hope you can help clarify what's the major use case here.

There's also the new interface to do atomic "fetch + update" on wrprotected
pages.  Is that just for efficiency or is the accuracy required in some of
the applications?

Thanks,
  
Muhammad Usama Anjum Jan. 23, 2023, 1:15 p.m. UTC | #3
On 1/19/23 3:12 AM, Peter Xu wrote:
> On Mon, Jan 09, 2023 at 11:45:15AM +0500, Muhammad Usama Anjum wrote:
>> *Changes in v7:*
>> - Add uffd wp async
>> - Update the IOCTL to use uffd under the hood instead of soft-dirty
>>   flags
>>
>> Stop using the soft-dirty flags for finding which pages have been
>> written to. It is too delicate and wrong as it shows more soft-dirty
>> pages than the actual soft-dirty pages. There is no interest in
>> correcting it [A][B] as this is how the feature was written years ago.
>> It shouldn't be updated to changed behaviour. Peter Xu has suggested
>> using the async version of the UFFD WP [C] as it is based inherently
>> on the PTEs.
>>
>> So in this patch series, I've added a new mode to the UFFD which is
>> asynchronous version of the write protect. When this variant of the
>> UFFD WP is used, the page faults are resolved automatically by the
>> kernel. The pages which have been written-to can be found by reading
>> pagemap file (!PM_UFFD_WP). This feature can be used successfully to
>> find which pages have been written to from the time the pages were
>> write protected. This works just like the soft-dirty flag without
>> showing any extra pages which aren't soft-dirty in reality.
>>
>> [A] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
>> [B] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
>> [C] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
>>
>> *Changes in v6:*
>> - Updated the interface and made cosmetic changes
>>
>> *Cover Letter in v5:*
>> Hello,
> 
> Please consider either drop the cover letter below this point or rephrase,
> otherwise many of them are not true anymore and it can confuse the
> reviewers.
I'll remove.

> 
> I have a few high level comments/questions here, please bare with me if any
> of them are already discussed by others in the old versions; I'd be happy
> to read them when there's a pointer to the relevant answers.
> 
> Firstly, doc update is more than welcomed to explain the new interface
> first (before throwing the code..).  That can be done in pagemap.rst on
> pagemap changes, or userfaultfd.rst on userfaultfd.
Okay. I'll add the documentation in next version or after the series has
been accepted. Initially I'd added the documentation. But the code kept on
changing so much that I had to spend considerable time on updating the
documentation. I know it is better to add documentation with the patches.
I'll try to add it.

> 
> Besides, can you provide more justification on the new pagemap-side
> interface design?
> 
> It seems it came from the Windows API GetWriteWatch(), but it's definitely
> not exactly that.  Let me spell some points out..
Initially, we just wanted a way to emulate Windows API GetWriteWatch(). So
we had added `max_pages` in the IOCTL arguments which is optional and can
be used to specify how many pages we want to find of our interest. There
was only one set of flags to be matched with the pages.

> 
> There're four kinds of masks (required/anyof/excluded/return).  Are they
> all needed?  Why this is a good interface design?
Then, CRIU developers Andrea [1] and Danylo [2], asked to include all these
different kinds of masks. I'd thought of these masks as fancy filter inside
the kernel. But there wasn't anyone else to review. So I'd included them to
move forward. Please let me know your thoughts after reading emails from [1].

> 
> I saw you used page_region structure to keep the information.  I think you
> wanted to have a densed output, especially if counting in the "return mask"
> above it starts to make more sense. If with a very limited return mask it
> means many of the (continuous) page information can be merged into a single
> page_region struct when the kernel is scanning.
Correct.

> 
> However, at the meantime the other three masks (required/anyof/excluded)
> made me quite confused - it means you wanted to somehow filter the pages
> and only some of them will get collected.  The thing is for a continuous
> page range if any of the page got skipped due to the masks (e.g. not in
> "required" or in "excluded") it also means it can never be merged into
> previous page_region either.  That seems to be against the principle of
> having densed output.
The filtering is being done. But the output can still be condensed
regardless. There isn't that randomness in the page flags of the
consecutive pages.

> 
> I hope you can help clarify what's the major use case here.
> 
> There's also the new interface to do atomic "fetch + update" on wrprotected
> pages.  Is that just for efficiency or is the accuracy required in some of
> the applications?
"Atomic fetch and update/clear" or "Atomic fetch Written-to status and
clear it" is needed to support GetWriteWatch() and there is no already
present way to perform this operation atomically. We want efficiency and
accuracy both to get good performance/speed. So this IOCTL is needed to
achieve:
1) New functionality which isn't already present
2) Most efficient and accurate method to perform the operation (it isn't
possible through soft-dirty feature)

> 
> Thanks,
> 

[1] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
  
Peter Xu Jan. 24, 2023, 7:49 p.m. UTC | #4
On Mon, Jan 23, 2023 at 06:15:00PM +0500, Muhammad Usama Anjum wrote:
> > Firstly, doc update is more than welcomed to explain the new interface
> > first (before throwing the code..).  That can be done in pagemap.rst on
> > pagemap changes, or userfaultfd.rst on userfaultfd.
> Okay. I'll add the documentation in next version or after the series has
> been accepted. Initially I'd added the documentation. But the code kept on
> changing so much that I had to spend considerable time on updating the
> documentation. I know it is better to add documentation with the patches.
> I'll try to add it.

Yes, logically it should be the thing people start looking with.  It'll
help reviewers to understand how does it work in general if relevant
description is not in the cover letter, so it can matter even before the
series is merged.

> > There're four kinds of masks (required/anyof/excluded/return).  Are they
> > all needed?  Why this is a good interface design?
> Then, CRIU developers Andrea [1] and Danylo [2], asked to include all these
> different kinds of masks. I'd thought of these masks as fancy filter inside
> the kernel. But there wasn't anyone else to review. So I'd included them to
> move forward. Please let me know your thoughts after reading emails from [1].

The idea makes sense to me, thanks.  I just hope "moving it forward" is not
the only reason that you included it.

Please also consider to attach relevant links to your next cover letter so
new reviewers can be aware of why the interface is proposed like that.

IMHO it would be also great if the CRIU people can acknowledge the
interface at some point to make sure it satisfies the needs.  An POC would
be even better on CRIU, but maybe that's asking too much.
  
Danylo Mocherniuk Jan. 25, 2023, 2:45 p.m. UTC | #5
On Tue, Jan 24, 2023 at 8:49 PM Peter Xu <peterx@redhat.com> wrote:
> On Mon, Jan 23, 2023 at 06:15:00PM +0500, Muhammad Usama Anjum wrote:
> > > Firstly, doc update is more than welcomed to explain the new interface
> > first (before throwing the code..).  That can be done in pagemap.rst on
> > > pagemap changes, or userfaultfd.rst on userfaultfd.
> > Okay. I'll add the documentation in next version or after the series has
> > been accepted. Initially I'd added the documentation. But the code kept on
> > changing so much that I had to spend considerable time on updating the
> > documentation. I know it is better to add documentation with the patches.
> > I'll try to add it.
>
> Yes, logically it should be the thing people start looking with.  It'll
> help reviewers to understand how does it work in general if relevant
> description is not in the cover letter, so it can matter even before the
> series is merged.
> > > There're four kinds of masks (required/anyof/excluded/return).  Are they
> > > all needed?  Why this is a good interface design?
> > Then, CRIU developers Andrea [1] and Danylo [2], asked to include all these
> > different kinds of masks. I'd thought of these masks as fancy filter inside
> > the kernel. But there wasn't anyone else to review. So I'd included them to
> > move forward. Please let me know your thoughts after reading emails from [1].
> The idea makes sense to me, thanks.  I just hope "moving it forward" is not
> the only reason that you included it.
> Please also consider to attach relevant links to your next cover letter so
> new reviewers can be aware of why the interface is proposed like that.
> IMHO it would be also great if the CRIU people can acknowledge the
> interface at some point to make sure it satisfies the needs.

I acknowledge that this interface looks good for my use case to get interesting
pages from a big sparse mapping. For Andrei's use case to iteratively dump memory
it also looks good IMO. 

> An POC would be even better on CRIU, but maybe that's asking too much.

Can't promise now, but happy to do this when there'll be a clear signal that this
patchset is about to be merged. Meanwhile, I'll make some smaller tests with this
patchset rebased and will get back if there are some problems with that. 

> -- 
> Peter Xu