[v2] mm/uffd: UFFD_FEATURE_WP_UNPOPULATED

Message ID 20230227230044.1596744-1-peterx@redhat.com
State New
Headers
Series [v2] mm/uffd: UFFD_FEATURE_WP_UNPOPULATED |

Commit Message

Peter Xu Feb. 27, 2023, 11 p.m. UTC
  This is a new feature that controls how uffd-wp handles none ptes.  When
it's set, the kernel will handle anonymous memory the same way as file
memory, by allowing the user to wr-protect unpopulated ptes.

File memories handles none ptes consistently by allowing wr-protecting of
none ptes because of the unawareness of page cache being exist or not.  For
anonymous it was not as persistent because we used to assume that we don't
need protections on none ptes or known zero pages.

One use case of such a feature bit was VM live snapshot, where if without
wr-protecting empty ptes the snapshot can contain random rubbish in the
holes of the anonymous memory, which can cause misbehave of the guest when
the guest OS assumes the pages should be all zeros.

QEMU worked it around by pre-populate the section with reads to fill in
zero page entries before starting the whole snapshot process [1].

Recently there's another need raised on using userfaultfd wr-protect for
detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
case if without being able to wr-protect none ptes by default, the dirty
info can get lost, since we cannot treat every none pte to be dirty (the
current design is identify a page dirty based on uffd-wp bit being cleared).

In general, we want to be able to wr-protect empty ptes too even for
anonymous.

This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
uffd-wp handling on none ptes being consistent no matter what the memory
type is underneath.  It doesn't have any impact on file memories so far
because we already have pte markers taking care of that.  So it only
affects anonymous.

The feature bit is by default off, so the old behavior will be maintained.
Sometimes it may be wanted because the wr-protect of none ptes will contain
overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
anonymous), but also on creating the pgtables to store the pte markers. So
there's potentially less chance of using thp on the first fault for a none
pmd or larger than a pmd.

The major implementation part is teaching the whole kernel to understand
pte markers even for anonymously mapped ranges, meanwhile allowing the
UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
new feature bit is set.

Note that even if the patch subject starts with mm/uffd, there're a few
small refactors to major mm path of handling anonymous page faults. But
they should be straightforward.

So far, add a very light smoke test within the userfaultfd kselftest
pagemap unit test to make sure anon pte markers work.

[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
[1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/

Signed-off-by: Peter Xu <peterx@redhat.com>
---
v1->v2:
- Use pte markers rather than populate zero pages when protect [David]
- Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
---
 fs/userfaultfd.c                         | 14 ++++++
 include/linux/mm_inline.h                |  6 +++
 include/linux/userfaultfd_k.h            |  6 +++
 include/uapi/linux/userfaultfd.h         | 10 +++-
 mm/memory.c                              | 54 ++++++++++++++++------
 mm/mprotect.c                            | 59 ++++++++++++++++++++----
 tools/testing/selftests/mm/userfaultfd.c | 22 ++++++++-
 7 files changed, 146 insertions(+), 25 deletions(-)
  

Comments

Peter Xu Feb. 28, 2023, 12:36 a.m. UTC | #1
On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
> This is a new feature that controls how uffd-wp handles none ptes.  When
> it's set, the kernel will handle anonymous memory the same way as file
> memory, by allowing the user to wr-protect unpopulated ptes.
> 
> File memories handles none ptes consistently by allowing wr-protecting of
> none ptes because of the unawareness of page cache being exist or not.  For
> anonymous it was not as persistent because we used to assume that we don't
> need protections on none ptes or known zero pages.
> 
> One use case of such a feature bit was VM live snapshot, where if without
> wr-protecting empty ptes the snapshot can contain random rubbish in the
> holes of the anonymous memory, which can cause misbehave of the guest when
> the guest OS assumes the pages should be all zeros.
> 
> QEMU worked it around by pre-populate the section with reads to fill in
> zero page entries before starting the whole snapshot process [1].
> 
> Recently there's another need raised on using userfaultfd wr-protect for
> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
> case if without being able to wr-protect none ptes by default, the dirty
> info can get lost, since we cannot treat every none pte to be dirty (the
> current design is identify a page dirty based on uffd-wp bit being cleared).
> 
> In general, we want to be able to wr-protect empty ptes too even for
> anonymous.
> 
> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
> uffd-wp handling on none ptes being consistent no matter what the memory
> type is underneath.  It doesn't have any impact on file memories so far
> because we already have pte markers taking care of that.  So it only
> affects anonymous.
> 
> The feature bit is by default off, so the old behavior will be maintained.
> Sometimes it may be wanted because the wr-protect of none ptes will contain
> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
> anonymous), but also on creating the pgtables to store the pte markers. So
> there's potentially less chance of using thp on the first fault for a none
> pmd or larger than a pmd.
> 
> The major implementation part is teaching the whole kernel to understand
> pte markers even for anonymously mapped ranges, meanwhile allowing the
> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
> new feature bit is set.
> 
> Note that even if the patch subject starts with mm/uffd, there're a few
> small refactors to major mm path of handling anonymous page faults. But
> they should be straightforward.
> 
> So far, add a very light smoke test within the userfaultfd kselftest
> pagemap unit test to make sure anon pte markers work.
> 
> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v1->v2:
> - Use pte markers rather than populate zero pages when protect [David]
> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]

Some very initial performance numbers (I only ran in a VM but it should be
similar, unit is "us") below as requested.  The measurement is about time
spent when wr-protecting 10G range of empty but mapped memory.  It's done
in a VM, assuming we'll get similar results on bare metal.

Four test cases:

        - default UFFDIO_WP
        - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
        - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
        - UFFDIO_WP with WP_UNPOPULATED

Results:

        Test DEFAULT: 2
        Test PRE-READ: 3277099 (pre-fault 3253826)
        Test MADVISE: 2250361 (pre-fault 2226310)
        Test WP-UNPOPULATE: 20850

I'll add these information into the commit message when there's a new
version.

[1] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c
  
Muhammad Usama Anjum Feb. 28, 2023, 7:21 a.m. UTC | #2
Hi Peter,

Thank you so much for sending.

On 2/28/23 5:36 AM, Peter Xu wrote:
> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>> This is a new feature that controls how uffd-wp handles none ptes.  When
>> it's set, the kernel will handle anonymous memory the same way as file
>> memory, by allowing the user to wr-protect unpopulated ptes.
>>
>> File memories handles none ptes consistently by allowing wr-protecting of
>> none ptes because of the unawareness of page cache being exist or not.  For
>> anonymous it was not as persistent because we used to assume that we don't
>> need protections on none ptes or known zero pages.
>>
>> One use case of such a feature bit was VM live snapshot, where if without
>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>> holes of the anonymous memory, which can cause misbehave of the guest when
>> the guest OS assumes the pages should be all zeros.
>>
>> QEMU worked it around by pre-populate the section with reads to fill in
>> zero page entries before starting the whole snapshot process [1].
>>
>> Recently there's another need raised on using userfaultfd wr-protect for
>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>> case if without being able to wr-protect none ptes by default, the dirty
>> info can get lost, since we cannot treat every none pte to be dirty (the
>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>
>> In general, we want to be able to wr-protect empty ptes too even for
>> anonymous.
>>
>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>> uffd-wp handling on none ptes being consistent no matter what the memory
>> type is underneath.  It doesn't have any impact on file memories so far
>> because we already have pte markers taking care of that.  So it only
>> affects anonymous.
>>
>> The feature bit is by default off, so the old behavior will be maintained.
>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>> anonymous), but also on creating the pgtables to store the pte markers. So
>> there's potentially less chance of using thp on the first fault for a none
>> pmd or larger than a pmd.
>>
>> The major implementation part is teaching the whole kernel to understand
>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>> new feature bit is set.
>>
>> Note that even if the patch subject starts with mm/uffd, there're a few
>> small refactors to major mm path of handling anonymous page faults. But
>> they should be straightforward.
>>
>> So far, add a very light smoke test within the userfaultfd kselftest
>> pagemap unit test to make sure anon pte markers work.
>>
>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>> v1->v2:
>> - Use pte markers rather than populate zero pages when protect [David]
>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
> 
> Some very initial performance numbers (I only ran in a VM but it should be
> similar, unit is "us") below as requested.  The measurement is about time
> spent when wr-protecting 10G range of empty but mapped memory.  It's done
> in a VM, assuming we'll get similar results on bare metal.
> 
> Four test cases:
> 
>         - default UFFDIO_WP
>         - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>         - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>         - UFFDIO_WP with WP_UNPOPULATED
> 
> Results:
> 
>         Test DEFAULT: 2
>         Test PRE-READ: 3277099 (pre-fault 3253826)
>         Test MADVISE: 2250361 (pre-fault 2226310)
>         Test WP-UNPOPULATE: 20850
> 
> I'll add these information into the commit message when there's a new
> version.
I'm hitting a bug where I'm unable to write to the memory after adding this
patch and wp the memory. I'm hitting this case in your test and my tests as
well. Please apply the following diff to your test to reproduce on your end:

--- uffd_wp_perf.c.orig 2023-02-28 12:09:38.971820791 +0500
+++ uffd_wp_perf.c      2023-02-28 12:13:11.077827160 +0500
@@ -114,6 +114,7 @@
         start1 = get_usec();
     }
     wp_range(uffd, buffer, SIZE, true);
+    buffer[0] = 'a';
     if (start1 == 0)
          printf("%"PRIu64"\n", get_usec() - start);
     else


> 
> [1] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c
>
  
Peter Xu Feb. 28, 2023, 3:58 p.m. UTC | #3
On Tue, Feb 28, 2023 at 12:21:45PM +0500, Muhammad Usama Anjum wrote:
> Hi Peter,

Hi, Muhammad,

> 
> Thank you so much for sending.
> 
> On 2/28/23 5:36 AM, Peter Xu wrote:
> > On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
> >> This is a new feature that controls how uffd-wp handles none ptes.  When
> >> it's set, the kernel will handle anonymous memory the same way as file
> >> memory, by allowing the user to wr-protect unpopulated ptes.
> >>
> >> File memories handles none ptes consistently by allowing wr-protecting of
> >> none ptes because of the unawareness of page cache being exist or not.  For
> >> anonymous it was not as persistent because we used to assume that we don't
> >> need protections on none ptes or known zero pages.
> >>
> >> One use case of such a feature bit was VM live snapshot, where if without
> >> wr-protecting empty ptes the snapshot can contain random rubbish in the
> >> holes of the anonymous memory, which can cause misbehave of the guest when
> >> the guest OS assumes the pages should be all zeros.
> >>
> >> QEMU worked it around by pre-populate the section with reads to fill in
> >> zero page entries before starting the whole snapshot process [1].
> >>
> >> Recently there's another need raised on using userfaultfd wr-protect for
> >> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
> >> case if without being able to wr-protect none ptes by default, the dirty
> >> info can get lost, since we cannot treat every none pte to be dirty (the
> >> current design is identify a page dirty based on uffd-wp bit being cleared).
> >>
> >> In general, we want to be able to wr-protect empty ptes too even for
> >> anonymous.
> >>
> >> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
> >> uffd-wp handling on none ptes being consistent no matter what the memory
> >> type is underneath.  It doesn't have any impact on file memories so far
> >> because we already have pte markers taking care of that.  So it only
> >> affects anonymous.
> >>
> >> The feature bit is by default off, so the old behavior will be maintained.
> >> Sometimes it may be wanted because the wr-protect of none ptes will contain
> >> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
> >> anonymous), but also on creating the pgtables to store the pte markers. So
> >> there's potentially less chance of using thp on the first fault for a none
> >> pmd or larger than a pmd.
> >>
> >> The major implementation part is teaching the whole kernel to understand
> >> pte markers even for anonymously mapped ranges, meanwhile allowing the
> >> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
> >> new feature bit is set.
> >>
> >> Note that even if the patch subject starts with mm/uffd, there're a few
> >> small refactors to major mm path of handling anonymous page faults. But
> >> they should be straightforward.
> >>
> >> So far, add a very light smoke test within the userfaultfd kselftest
> >> pagemap unit test to make sure anon pte markers work.
> >>
> >> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
> >> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
> >>
> >> Signed-off-by: Peter Xu <peterx@redhat.com>
> >> ---
> >> v1->v2:
> >> - Use pte markers rather than populate zero pages when protect [David]
> >> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
> > 
> > Some very initial performance numbers (I only ran in a VM but it should be
> > similar, unit is "us") below as requested.  The measurement is about time
> > spent when wr-protecting 10G range of empty but mapped memory.  It's done
> > in a VM, assuming we'll get similar results on bare metal.
> > 
> > Four test cases:
> > 
> >         - default UFFDIO_WP
> >         - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
> >         - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
> >         - UFFDIO_WP with WP_UNPOPULATED
> > 
> > Results:
> > 
> >         Test DEFAULT: 2
> >         Test PRE-READ: 3277099 (pre-fault 3253826)
> >         Test MADVISE: 2250361 (pre-fault 2226310)
> >         Test WP-UNPOPULATE: 20850
> > 
> > I'll add these information into the commit message when there's a new
> > version.
> I'm hitting a bug where I'm unable to write to the memory after adding this
> patch and wp the memory. I'm hitting this case in your test and my tests as
> well. Please apply the following diff to your test to reproduce on your end:
> 
> --- uffd_wp_perf.c.orig 2023-02-28 12:09:38.971820791 +0500
> +++ uffd_wp_perf.c      2023-02-28 12:13:11.077827160 +0500
> @@ -114,6 +114,7 @@
>          start1 = get_usec();
>      }
>      wp_range(uffd, buffer, SIZE, true);
> +    buffer[0] = 'a';
>      if (start1 == 0)
>           printf("%"PRIu64"\n", get_usec() - start);
>      else

This is expected, because the test didn't start any fault resolving thread,
so the write will block until someone unprotects the page.

But it shouldn't happen to your use case if you applied both WP_UNPOPULATED
& WP_ASYNC.

Could you try "cat /proc/$PID/stack" to see where does your thread stuck
at?
  
Muhammad Usama Anjum Feb. 28, 2023, 4:24 p.m. UTC | #4
On 2/28/23 8:58 PM, Peter Xu wrote:
> On Tue, Feb 28, 2023 at 12:21:45PM +0500, Muhammad Usama Anjum wrote:
>> Hi Peter,
> 
> Hi, Muhammad,
> 
>>
>> Thank you so much for sending.
>>
>> On 2/28/23 5:36 AM, Peter Xu wrote:
>>> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>>>> This is a new feature that controls how uffd-wp handles none ptes.  When
>>>> it's set, the kernel will handle anonymous memory the same way as file
>>>> memory, by allowing the user to wr-protect unpopulated ptes.
>>>>
>>>> File memories handles none ptes consistently by allowing wr-protecting of
>>>> none ptes because of the unawareness of page cache being exist or not.  For
>>>> anonymous it was not as persistent because we used to assume that we don't
>>>> need protections on none ptes or known zero pages.
>>>>
>>>> One use case of such a feature bit was VM live snapshot, where if without
>>>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>>>> holes of the anonymous memory, which can cause misbehave of the guest when
>>>> the guest OS assumes the pages should be all zeros.
>>>>
>>>> QEMU worked it around by pre-populate the section with reads to fill in
>>>> zero page entries before starting the whole snapshot process [1].
>>>>
>>>> Recently there's another need raised on using userfaultfd wr-protect for
>>>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>>>> case if without being able to wr-protect none ptes by default, the dirty
>>>> info can get lost, since we cannot treat every none pte to be dirty (the
>>>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>>>
>>>> In general, we want to be able to wr-protect empty ptes too even for
>>>> anonymous.
>>>>
>>>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>>>> uffd-wp handling on none ptes being consistent no matter what the memory
>>>> type is underneath.  It doesn't have any impact on file memories so far
>>>> because we already have pte markers taking care of that.  So it only
>>>> affects anonymous.
>>>>
>>>> The feature bit is by default off, so the old behavior will be maintained.
>>>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>>>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>>>> anonymous), but also on creating the pgtables to store the pte markers. So
>>>> there's potentially less chance of using thp on the first fault for a none
>>>> pmd or larger than a pmd.
>>>>
>>>> The major implementation part is teaching the whole kernel to understand
>>>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>>>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>>>> new feature bit is set.
>>>>
>>>> Note that even if the patch subject starts with mm/uffd, there're a few
>>>> small refactors to major mm path of handling anonymous page faults. But
>>>> they should be straightforward.
>>>>
>>>> So far, add a very light smoke test within the userfaultfd kselftest
>>>> pagemap unit test to make sure anon pte markers work.
>>>>
>>>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>>>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>>>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>> ---
>>>> v1->v2:
>>>> - Use pte markers rather than populate zero pages when protect [David]
>>>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
>>>
>>> Some very initial performance numbers (I only ran in a VM but it should be
>>> similar, unit is "us") below as requested.  The measurement is about time
>>> spent when wr-protecting 10G range of empty but mapped memory.  It's done
>>> in a VM, assuming we'll get similar results on bare metal.
>>>
>>> Four test cases:
>>>
>>>         - default UFFDIO_WP
>>>         - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>>>         - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>>>         - UFFDIO_WP with WP_UNPOPULATED
>>>
>>> Results:
>>>
>>>         Test DEFAULT: 2
>>>         Test PRE-READ: 3277099 (pre-fault 3253826)
>>>         Test MADVISE: 2250361 (pre-fault 2226310)
>>>         Test WP-UNPOPULATE: 20850
>>>
>>> I'll add these information into the commit message when there's a new
>>> version.
>> I'm hitting a bug where I'm unable to write to the memory after adding this
>> patch and wp the memory. I'm hitting this case in your test and my tests as
>> well. Please apply the following diff to your test to reproduce on your end:
>>
>> --- uffd_wp_perf.c.orig 2023-02-28 12:09:38.971820791 +0500
>> +++ uffd_wp_perf.c      2023-02-28 12:13:11.077827160 +0500
>> @@ -114,6 +114,7 @@
>>          start1 = get_usec();
>>      }
>>      wp_range(uffd, buffer, SIZE, true);
>> +    buffer[0] = 'a';
>>      if (start1 == 0)
>>           printf("%"PRIu64"\n", get_usec() - start);
>>      else
> 
> This is expected, because the test didn't start any fault resolving thread,
> so the write will block until someone unprotects the page.
Ohh.. sorry. Wrong reproducer.

> 
> But it shouldn't happen to your use case if you applied both WP_UNPOPULATED
> & WP_ASYNC.
I'm using both WP_UNPOPULATED and ASYNC. The program gets stuck at right time:


1..57
ok 1 sanity_tests_sd wrong flag specified
ok 2 sanity_tests_sd wrong mask specified
ok 3 sanity_tests_sd wrong return mask specified
ok 4 sanity_tests_sd mixture of correct and wrong flag
ok 5 sanity_tests_sd PM_SCAN_OP_WP cannot be used without get
ok 6 sanity_tests_sd Clear area with larger vec size
^C
Program received signal SIGINT, Interrupt.
0x000000000040220c in sanity_tests_sd () at pagemap_ioctl.c:198
198                     mem[i]++;
(gdb) bt
#0  0x000000000040220c in sanity_tests_sd () at pagemap_ioctl.c:198
#1  0x0000000000404e14 in main () at pagemap_ioctl.c:846
()

/proc/$PID/stack is empty. Not sure why. I can see stack trace of other
applications, but not this one's.

Let me send better reproducer for you.

> 
> Could you try "cat /proc/$PID/stack" to see where does your thread stuck
> at?
>
  
Muhammad Usama Anjum March 1, 2023, 7:55 a.m. UTC | #5
Hi Peter,

Finally I've found the bug.

On 2/28/23 9:24 PM, Muhammad Usama Anjum wrote:
> On 2/28/23 8:58 PM, Peter Xu wrote:
>> On Tue, Feb 28, 2023 at 12:21:45PM +0500, Muhammad Usama Anjum wrote:
>>> Hi Peter,
>>
>> Hi, Muhammad,
>>
>>>
>>> Thank you so much for sending.
>>>
>>> On 2/28/23 5:36 AM, Peter Xu wrote:
>>>> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>>>>> This is a new feature that controls how uffd-wp handles none ptes.  When
>>>>> it's set, the kernel will handle anonymous memory the same way as file
>>>>> memory, by allowing the user to wr-protect unpopulated ptes.
>>>>>
>>>>> File memories handles none ptes consistently by allowing wr-protecting of
>>>>> none ptes because of the unawareness of page cache being exist or not.  For
>>>>> anonymous it was not as persistent because we used to assume that we don't
>>>>> need protections on none ptes or known zero pages.
>>>>>
>>>>> One use case of such a feature bit was VM live snapshot, where if without
>>>>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>>>>> holes of the anonymous memory, which can cause misbehave of the guest when
>>>>> the guest OS assumes the pages should be all zeros.
>>>>>
>>>>> QEMU worked it around by pre-populate the section with reads to fill in
>>>>> zero page entries before starting the whole snapshot process [1].
>>>>>
>>>>> Recently there's another need raised on using userfaultfd wr-protect for
>>>>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>>>>> case if without being able to wr-protect none ptes by default, the dirty
>>>>> info can get lost, since we cannot treat every none pte to be dirty (the
>>>>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>>>>
>>>>> In general, we want to be able to wr-protect empty ptes too even for
>>>>> anonymous.
>>>>>
>>>>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>>>>> uffd-wp handling on none ptes being consistent no matter what the memory
>>>>> type is underneath.  It doesn't have any impact on file memories so far
>>>>> because we already have pte markers taking care of that.  So it only
>>>>> affects anonymous.
>>>>>
>>>>> The feature bit is by default off, so the old behavior will be maintained.
>>>>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>>>>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>>>>> anonymous), but also on creating the pgtables to store the pte markers. So
>>>>> there's potentially less chance of using thp on the first fault for a none
>>>>> pmd or larger than a pmd.
>>>>>
>>>>> The major implementation part is teaching the whole kernel to understand
>>>>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>>>>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>>>>> new feature bit is set.
>>>>>
>>>>> Note that even if the patch subject starts with mm/uffd, there're a few
>>>>> small refactors to major mm path of handling anonymous page faults. But
>>>>> they should be straightforward.
>>>>>
>>>>> So far, add a very light smoke test within the userfaultfd kselftest
>>>>> pagemap unit test to make sure anon pte markers work.
>>>>>
>>>>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>>>>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>> v1->v2:
>>>>> - Use pte markers rather than populate zero pages when protect [David]
>>>>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
>>>>
>>>> Some very initial performance numbers (I only ran in a VM but it should be
>>>> similar, unit is "us") below as requested.  The measurement is about time
>>>> spent when wr-protecting 10G range of empty but mapped memory.  It's done
>>>> in a VM, assuming we'll get similar results on bare metal.
>>>>
>>>> Four test cases:
>>>>
>>>>         - default UFFDIO_WP
>>>>         - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>>>>         - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>>>>         - UFFDIO_WP with WP_UNPOPULATED
>>>>
>>>> Results:
>>>>
>>>>         Test DEFAULT: 2
>>>>         Test PRE-READ: 3277099 (pre-fault 3253826)
>>>>         Test MADVISE: 2250361 (pre-fault 2226310)
>>>>         Test WP-UNPOPULATE: 20850
>>>>
>>>> I'll add these information into the commit message when there's a new
>>>> version.
>>> I'm hitting a bug where I'm unable to write to the memory after adding this
>>> patch and wp the memory. I'm hitting this case in your test and my tests as
>>> well. Please apply the following diff to your test to reproduce on your end:
>>>
>>> --- uffd_wp_perf.c.orig 2023-02-28 12:09:38.971820791 +0500
>>> +++ uffd_wp_perf.c      2023-02-28 12:13:11.077827160 +0500
>>> @@ -114,6 +114,7 @@
>>>          start1 = get_usec();
>>>      }
>>>      wp_range(uffd, buffer, SIZE, true);
>>> +    buffer[0] = 'a';
While using WP_UNPOPULATED, we get stuck if newly allocated memory is read
without initialization. This can be reproduced by either of the following
statements:
    printf("%c", buffer[0]);
    buffer[0]++;

This bug has start to appear on this patch. How are you handling reading
newly allocated memory when WP_UNPOPULATED is defined?



Running my pagemap_ioctl selftest as benchmark in a VM:
without zeropage / wp_unpopulated (decide from pte_none() if page is dirty
or not, buggy and wrong implementation, just for reference)
26.608 seconds
with zeropage
39.203 seconds
with wp_unpopulated
62.907 seconds

136% worse performance overall
60% worse performance of unpopulated than zeropage

>>>      if (start1 == 0)
>>>           printf("%"PRIu64"\n", get_usec() - start);
>>>      else
>>
>> This is expected, because the test didn't start any fault resolving thread,
>> so the write will block until someone unprotects the page.
> Ohh.. sorry. Wrong reproducer.
> 
>>
>> But it shouldn't happen to your use case if you applied both WP_UNPOPULATED
>> & WP_ASYNC.
> I'm using both WP_UNPOPULATED and ASYNC. The program gets stuck at right time:
> 
> 
> 1..57
> ok 1 sanity_tests_sd wrong flag specified
> ok 2 sanity_tests_sd wrong mask specified
> ok 3 sanity_tests_sd wrong return mask specified
> ok 4 sanity_tests_sd mixture of correct and wrong flag
> ok 5 sanity_tests_sd PM_SCAN_OP_WP cannot be used without get
> ok 6 sanity_tests_sd Clear area with larger vec size
> ^C
> Program received signal SIGINT, Interrupt.
> 0x000000000040220c in sanity_tests_sd () at pagemap_ioctl.c:198
> 198                     mem[i]++;
> (gdb) bt
> #0  0x000000000040220c in sanity_tests_sd () at pagemap_ioctl.c:198
> #1  0x0000000000404e14 in main () at pagemap_ioctl.c:846
> ()
> 
> /proc/$PID/stack is empty. Not sure why. I can see stack trace of other
> applications, but not this one's.
> 
> Let me send better reproducer for you.
> 
>>
>> Could you try "cat /proc/$PID/stack" to see where does your thread stuck
>> at?
>>
>
  
Peter Xu March 1, 2023, 3:19 p.m. UTC | #6
On Wed, Mar 01, 2023 at 12:55:51PM +0500, Muhammad Usama Anjum wrote:
> Hi Peter,

Hi, Muhammad,

> While using WP_UNPOPULATED, we get stuck if newly allocated memory is read
> without initialization. This can be reproduced by either of the following
> statements:
>     printf("%c", buffer[0]);
>     buffer[0]++;
> 
> This bug has start to appear on this patch. How are you handling reading
> newly allocated memory when WP_UNPOPULATED is defined?

Yes it's a bug, thanks for the reproducer. You're right I missed a trivial
but important detail.  Could you try apply below on top?

---8<---
diff --git a/mm/memory.c b/mm/memory.c
index 46934133bd0b..2f4b3892948b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4062,7 +4062,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
                                                vma->vm_page_prot));
                vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                vmf->address, &vmf->ptl);
-               if (!pte_none(*vmf->pte)) {
+               if (vmf_pte_changed(vmf)) {
                        update_mmu_tlb(vma, vmf->address, vmf->pte);
                        goto unlock;
                }
---8<---

I can send a new version after you confirmed it at least works on your
side. I'll also add some more test to cover that in the next version.

The current smoke test within this patch is really light; I somehow rely on
you on this patch on the testing side, and thanks for that.

> Running my pagemap_ioctl selftest as benchmark in a VM:
> without zeropage / wp_unpopulated (decide from pte_none() if page is dirty
> or not, buggy and wrong implementation, just for reference)
> 26.608 seconds
> with zeropage
> 39.203 seconds
> with wp_unpopulated
> 62.907 seconds
> 
> 136% worse performance overall
> 60% worse performance of unpopulated than zeropage

Yes this is unfortunate, because we're protecting more things than before
when with WP_ZEROPAGE / WP_UNPOPULATED but that's what it is for (when we
want to make sure that accuracy on the holes).

I didn't look closer to your whole test suite yet, but my pure test on
protection above should mean that it's still much better for such a use
case than either (1) pre-read or (2) MADV_POPULATE_READ.

Again, I hope the performance result is not a concern to you.  If it is,
please let us know.

Thanks,
  
Muhammad Usama Anjum March 1, 2023, 5:13 p.m. UTC | #7
On 3/1/23 8:19 PM, Peter Xu wrote:
> On Wed, Mar 01, 2023 at 12:55:51PM +0500, Muhammad Usama Anjum wrote:
>> Hi Peter,
> 
> Hi, Muhammad,
> 
>> While using WP_UNPOPULATED, we get stuck if newly allocated memory is read
>> without initialization. This can be reproduced by either of the following
>> statements:
>>     printf("%c", buffer[0]);
>>     buffer[0]++;
>>
>> This bug has start to appear on this patch. How are you handling reading
>> newly allocated memory when WP_UNPOPULATED is defined?
> 
> Yes it's a bug, thanks for the reproducer. You're right I missed a trivial
> but important detail.  Could you try apply below on top?
> 
> ---8<---
> diff --git a/mm/memory.c b/mm/memory.c
> index 46934133bd0b..2f4b3892948b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4062,7 +4062,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                                                 vma->vm_page_prot));
>                 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>                                 vmf->address, &vmf->ptl);
> -               if (!pte_none(*vmf->pte)) {
> +               if (vmf_pte_changed(vmf)) {
>                         update_mmu_tlb(vma, vmf->address, vmf->pte);
>                         goto unlock;
>                 }
> ---8<---
This patch works. Thank you so much!

> 
> I can send a new version after you confirmed it at least works on your
> side. I'll also add some more test to cover that in the next version.
> 
> The current smoke test within this patch is really light; I somehow rely on
> you on this patch on the testing side, and thanks for that.
> 
>> Running my pagemap_ioctl selftest as benchmark in a VM:
>> without zeropage / wp_unpopulated (decide from pte_none() if page is dirty
>> or not, buggy and wrong implementation, just for reference)
>> 26.608 seconds
>> with zeropage
>> 39.203 seconds
>> with wp_unpopulated
>> 62.907 seconds
>>
>> 136% worse performance overall
>> 60% worse performance of unpopulated than zeropage
> 
> Yes this is unfortunate, because we're protecting more things than before
> when with WP_ZEROPAGE / WP_UNPOPULATED but that's what it is for (when we
> want to make sure that accuracy on the holes).
> 
> I didn't look closer to your whole test suite yet, but my pure test on
> protection above should mean that it's still much better for such a use
> case than either (1) pre-read or (2) MADV_POPULATE_READ.
Ohh... I should stop comparing UNPOPULATE with buggy implementation and
compare with pre-read. I've compared apples with oranges.

I'll do better benchmark for the comparison sake. I'll let you know if the
performance is becoming an issue. Overall we need pagemap_ioctl + UFFD to
correctly emulate Windows syscall. Secondly we also need good performance
(more the better).

> 
> Again, I hope the performance result is not a concern to you.  If it is,
> please let us know.
> 
> Thanks,
>
  
David Hildenbrand March 2, 2023, 9:37 a.m. UTC | #8
On 01.03.23 18:13, Muhammad Usama Anjum wrote:
> On 3/1/23 8:19 PM, Peter Xu wrote:
>> On Wed, Mar 01, 2023 at 12:55:51PM +0500, Muhammad Usama Anjum wrote:
>>> Hi Peter,
>>
>> Hi, Muhammad,
>>
>>> While using WP_UNPOPULATED, we get stuck if newly allocated memory is read
>>> without initialization. This can be reproduced by either of the following
>>> statements:
>>>      printf("%c", buffer[0]);
>>>      buffer[0]++;
>>>
>>> This bug has start to appear on this patch. How are you handling reading
>>> newly allocated memory when WP_UNPOPULATED is defined?
>>
>> Yes it's a bug, thanks for the reproducer. You're right I missed a trivial
>> but important detail.  Could you try apply below on top?
>>
>> ---8<---
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46934133bd0b..2f4b3892948b 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4062,7 +4062,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                                                  vma->vm_page_prot));
>>                  vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>                                  vmf->address, &vmf->ptl);
>> -               if (!pte_none(*vmf->pte)) {
>> +               if (vmf_pte_changed(vmf)) {
>>                          update_mmu_tlb(vma, vmf->address, vmf->pte);
>>                          goto unlock;
>>                  }
>> ---8<---
> This patch works. Thank you so much!
> 
>>
>> I can send a new version after you confirmed it at least works on your
>> side. I'll also add some more test to cover that in the next version.
>>
>> The current smoke test within this patch is really light; I somehow rely on
>> you on this patch on the testing side, and thanks for that.
>>
>>> Running my pagemap_ioctl selftest as benchmark in a VM:
>>> without zeropage / wp_unpopulated (decide from pte_none() if page is dirty
>>> or not, buggy and wrong implementation, just for reference)
>>> 26.608 seconds
>>> with zeropage
>>> 39.203 seconds
>>> with wp_unpopulated
>>> 62.907 seconds
>>>
>>> 136% worse performance overall
>>> 60% worse performance of unpopulated than zeropage
>>
>> Yes this is unfortunate, because we're protecting more things than before
>> when with WP_ZEROPAGE / WP_UNPOPULATED but that's what it is for (when we
>> want to make sure that accuracy on the holes).
>>
>> I didn't look closer to your whole test suite yet, but my pure test on
>> protection above should mean that it's still much better for such a use
>> case than either (1) pre-read or (2) MADV_POPULATE_READ.
> Ohh... I should stop comparing UNPOPULATE with buggy implementation and
> compare with pre-read. I've compared apples with oranges.

Note that I think there are ways to avoid that overhead (as raised in 
reply to Peter's reply), so IMHO it's still valuable to know which 
benefit we could have without allocating pagetables and placing the 
shared zeropages.

> 
> I'll do better benchmark for the comparison sake. I'll let you know if the
> performance is becoming an issue. Overall we need pagemap_ioctl + UFFD to
> correctly emulate Windows syscall. Secondly we also need good performance
> (more the better).

I'm curious, are you eventually applying UFFD to possibly large (sparse) 
VMAs, that are eventually even only PROT_READ or PROT_WRITE?

Especially for such large sparse VMAs, the current way of allocating 
pagetables to place markers/zeropages is far from optimal.
  
Peter Xu March 2, 2023, 1:57 p.m. UTC | #9
On Thu, Mar 02, 2023 at 10:37:44AM +0100, David Hildenbrand wrote:
> Especially for such large sparse VMAs, the current way of allocating
> pagetables to place markers/zeropages is far from optimal.

IMHO that's not a generic workload.  As mentioned in the reply there, I
would suggest we go with simple then we have space to optimize it in the
future if necessary, because the API will be the same.

If I want to monitor a super sparse VMA on dirty, one thing I can already
try is register with MISSING + WP (!WP_UNPOPULATED).  It may be a good
intermediate solution to avoid any marker ovearheads.
  
David Hildenbrand March 2, 2023, 2:01 p.m. UTC | #10
On 02.03.23 14:57, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 10:37:44AM +0100, David Hildenbrand wrote:
>> Especially for such large sparse VMAs, the current way of allocating
>> pagetables to place markers/zeropages is far from optimal.
> 
> IMHO that's not a generic workload.  As mentioned in the reply there, I
> would suggest we go with simple then we have space to optimize it in the
> future if necessary, because the API will be the same.
> 

I disagree with "generic workload", we use sparse mmaps all over the 
place, and when blindly used by e.g., CRIU, we'll simply end up wasting 
memory and time.

But I already agreed that this optimization that is a separate thing to 
implement.
  
Muhammad Usama Anjum March 2, 2023, 3:14 p.m. UTC | #11
On 3/2/23 7:01 PM, David Hildenbrand wrote:
> On 02.03.23 14:57, Peter Xu wrote:
>> On Thu, Mar 02, 2023 at 10:37:44AM +0100, David Hildenbrand wrote:
>>> Especially for such large sparse VMAs, the current way of allocating
>>> pagetables to place markers/zeropages is far from optimal.
>>
>> IMHO that's not a generic workload.  As mentioned in the reply there, I
>> would suggest we go with simple then we have space to optimize it in the
>> future if necessary, because the API will be the same.
This is a good idea.

I'm trying to understand why aren't we going with most optimized
implementation. Why aren't we targeting it at this point in time?

>>
> 
> I disagree with "generic workload", we use sparse mmaps all over the place,
> and when blindly used by e.g., CRIU, we'll simply end up wasting memory and
> time.
I've heard about a use case where a file of size 10s of GBs can be mapped
to the memory and then accessed off and on. We need to handle this
correctly and efficiently.

> 
> But I already agreed that this optimization that is a separate thing to
> implement.
>
  
Muhammad Usama Anjum March 2, 2023, 5:19 p.m. UTC | #12
On 2/28/23 5:36 AM, Peter Xu wrote:
> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>> This is a new feature that controls how uffd-wp handles none ptes.  When
>> it's set, the kernel will handle anonymous memory the same way as file
>> memory, by allowing the user to wr-protect unpopulated ptes.
>>
>> File memories handles none ptes consistently by allowing wr-protecting of
>> none ptes because of the unawareness of page cache being exist or not.  For
>> anonymous it was not as persistent because we used to assume that we don't
>> need protections on none ptes or known zero pages.
>>
>> One use case of such a feature bit was VM live snapshot, where if without
>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>> holes of the anonymous memory, which can cause misbehave of the guest when
>> the guest OS assumes the pages should be all zeros.
>>
>> QEMU worked it around by pre-populate the section with reads to fill in
>> zero page entries before starting the whole snapshot process [1].
>>
>> Recently there's another need raised on using userfaultfd wr-protect for
>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>> case if without being able to wr-protect none ptes by default, the dirty
>> info can get lost, since we cannot treat every none pte to be dirty (the
>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>
>> In general, we want to be able to wr-protect empty ptes too even for
>> anonymous.
>>
>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>> uffd-wp handling on none ptes being consistent no matter what the memory
>> type is underneath.  It doesn't have any impact on file memories so far
>> because we already have pte markers taking care of that.  So it only
>> affects anonymous.
>>
>> The feature bit is by default off, so the old behavior will be maintained.
>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>> anonymous), but also on creating the pgtables to store the pte markers. So
>> there's potentially less chance of using thp on the first fault for a none
>> pmd or larger than a pmd.
>>
>> The major implementation part is teaching the whole kernel to understand
>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>> new feature bit is set.
>>
>> Note that even if the patch subject starts with mm/uffd, there're a few
>> small refactors to major mm path of handling anonymous page faults. But
>> they should be straightforward.
>>
>> So far, add a very light smoke test within the userfaultfd kselftest
>> pagemap unit test to make sure anon pte markers work.
>>
>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>> v1->v2:
>> - Use pte markers rather than populate zero pages when protect [David]
>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
> 
> Some very initial performance numbers (I only ran in a VM but it should be
> similar, unit is "us") below as requested.  The measurement is about time
> spent when wr-protecting 10G range of empty but mapped memory.  It's done
> in a VM, assuming we'll get similar results on bare metal.
> 
> Four test cases:
> 
>         - default UFFDIO_WP
>         - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>         - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>         - UFFDIO_WP with WP_UNPOPULATED
> 
> Results:
> 
>         Test DEFAULT: 2
>         Test PRE-READ: 3277099 (pre-fault 3253826)
>         Test MADVISE: 2250361 (pre-fault 2226310)
>         Test WP-UNPOPULATE: 20850
In your case:
Default < WP-UNPOPULATE < MADVISE < PRE-READ


In my testing on next-20230228 with this patch and my uffd async patch:

Test DEFAULT: 6
Test PRE-READ: 37157 (pre-fault 37006)
Test MADVISE: 4884 (pre-fault 4465)
Test WP-UNPOPULATE: 17794

DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ

On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
I'm not sure why I'm getting this discrepancy here. I've liked your results
to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
can be done to get consistent benchmarks over your and my side?

> 
> I'll add these information into the commit message when there's a new
> version.
> 
> [1] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c
>
  
David Hildenbrand March 2, 2023, 5:38 p.m. UTC | #13
On 02.03.23 18:19, Muhammad Usama Anjum wrote:
> On 2/28/23 5:36 AM, Peter Xu wrote:
>> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>>> This is a new feature that controls how uffd-wp handles none ptes.  When
>>> it's set, the kernel will handle anonymous memory the same way as file
>>> memory, by allowing the user to wr-protect unpopulated ptes.
>>>
>>> File memories handles none ptes consistently by allowing wr-protecting of
>>> none ptes because of the unawareness of page cache being exist or not.  For
>>> anonymous it was not as persistent because we used to assume that we don't
>>> need protections on none ptes or known zero pages.
>>>
>>> One use case of such a feature bit was VM live snapshot, where if without
>>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>>> holes of the anonymous memory, which can cause misbehave of the guest when
>>> the guest OS assumes the pages should be all zeros.
>>>
>>> QEMU worked it around by pre-populate the section with reads to fill in
>>> zero page entries before starting the whole snapshot process [1].
>>>
>>> Recently there's another need raised on using userfaultfd wr-protect for
>>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>>> case if without being able to wr-protect none ptes by default, the dirty
>>> info can get lost, since we cannot treat every none pte to be dirty (the
>>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>>
>>> In general, we want to be able to wr-protect empty ptes too even for
>>> anonymous.
>>>
>>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>>> uffd-wp handling on none ptes being consistent no matter what the memory
>>> type is underneath.  It doesn't have any impact on file memories so far
>>> because we already have pte markers taking care of that.  So it only
>>> affects anonymous.
>>>
>>> The feature bit is by default off, so the old behavior will be maintained.
>>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>>> anonymous), but also on creating the pgtables to store the pte markers. So
>>> there's potentially less chance of using thp on the first fault for a none
>>> pmd or larger than a pmd.
>>>
>>> The major implementation part is teaching the whole kernel to understand
>>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>>> new feature bit is set.
>>>
>>> Note that even if the patch subject starts with mm/uffd, there're a few
>>> small refactors to major mm path of handling anonymous page faults. But
>>> they should be straightforward.
>>>
>>> So far, add a very light smoke test within the userfaultfd kselftest
>>> pagemap unit test to make sure anon pte markers work.
>>>
>>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> v1->v2:
>>> - Use pte markers rather than populate zero pages when protect [David]
>>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
>>
>> Some very initial performance numbers (I only ran in a VM but it should be
>> similar, unit is "us") below as requested.  The measurement is about time
>> spent when wr-protecting 10G range of empty but mapped memory.  It's done
>> in a VM, assuming we'll get similar results on bare metal.
>>
>> Four test cases:
>>
>>          - default UFFDIO_WP
>>          - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>>          - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>>          - UFFDIO_WP with WP_UNPOPULATED
>>
>> Results:
>>
>>          Test DEFAULT: 2
>>          Test PRE-READ: 3277099 (pre-fault 3253826)
>>          Test MADVISE: 2250361 (pre-fault 2226310)
>>          Test WP-UNPOPULATE: 20850
> In your case:
> Default < WP-UNPOPULATE < MADVISE < PRE-READ
> 
> 
> In my testing on next-20230228 with this patch and my uffd async patch:
> 
> Test DEFAULT: 6
> Test PRE-READ: 37157 (pre-fault 37006)
> Test MADVISE: 4884 (pre-fault 4465)
> Test WP-UNPOPULATE: 17794
> 
> DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ
> 
> On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
> I'm not sure why I'm getting this discrepancy here. I've liked your results
> to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
> can be done to get consistent benchmarks over your and my side?

Probably because the current approach from Peter uses uffd-wp markers, 
and these markers can currently only reside on the PTE level, not on the 
PMD level yet.

With MADVISE you get a huge zeropage and avoid dealing with PTEs.
  
Peter Xu March 2, 2023, 10 p.m. UTC | #14
On Thu, Mar 02, 2023 at 08:14:05PM +0500, Muhammad Usama Anjum wrote:
> I've heard about a use case where a file of size 10s of GBs can be mapped
> to the memory and then accessed off and on. We need to handle this
> correctly and efficiently.

Note again that besides any slowness it'll stop working for generic FSs.

One way is we can consider enabling uffd-wp async for all archs, but I need
to think more about it, and it will not be compatible with any other modes.
  
Peter Xu March 2, 2023, 10:21 p.m. UTC | #15
On Thu, Mar 02, 2023 at 06:38:20PM +0100, David Hildenbrand wrote:
> On 02.03.23 18:19, Muhammad Usama Anjum wrote:
> > On 2/28/23 5:36 AM, Peter Xu wrote:
> > > On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
> > > > This is a new feature that controls how uffd-wp handles none ptes.  When
> > > > it's set, the kernel will handle anonymous memory the same way as file
> > > > memory, by allowing the user to wr-protect unpopulated ptes.
> > > > 
> > > > File memories handles none ptes consistently by allowing wr-protecting of
> > > > none ptes because of the unawareness of page cache being exist or not.  For
> > > > anonymous it was not as persistent because we used to assume that we don't
> > > > need protections on none ptes or known zero pages.
> > > > 
> > > > One use case of such a feature bit was VM live snapshot, where if without
> > > > wr-protecting empty ptes the snapshot can contain random rubbish in the
> > > > holes of the anonymous memory, which can cause misbehave of the guest when
> > > > the guest OS assumes the pages should be all zeros.
> > > > 
> > > > QEMU worked it around by pre-populate the section with reads to fill in
> > > > zero page entries before starting the whole snapshot process [1].
> > > > 
> > > > Recently there's another need raised on using userfaultfd wr-protect for
> > > > detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
> > > > case if without being able to wr-protect none ptes by default, the dirty
> > > > info can get lost, since we cannot treat every none pte to be dirty (the
> > > > current design is identify a page dirty based on uffd-wp bit being cleared).
> > > > 
> > > > In general, we want to be able to wr-protect empty ptes too even for
> > > > anonymous.
> > > > 
> > > > This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
> > > > uffd-wp handling on none ptes being consistent no matter what the memory
> > > > type is underneath.  It doesn't have any impact on file memories so far
> > > > because we already have pte markers taking care of that.  So it only
> > > > affects anonymous.
> > > > 
> > > > The feature bit is by default off, so the old behavior will be maintained.
> > > > Sometimes it may be wanted because the wr-protect of none ptes will contain
> > > > overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
> > > > anonymous), but also on creating the pgtables to store the pte markers. So
> > > > there's potentially less chance of using thp on the first fault for a none
> > > > pmd or larger than a pmd.
> > > > 
> > > > The major implementation part is teaching the whole kernel to understand
> > > > pte markers even for anonymously mapped ranges, meanwhile allowing the
> > > > UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
> > > > new feature bit is set.
> > > > 
> > > > Note that even if the patch subject starts with mm/uffd, there're a few
> > > > small refactors to major mm path of handling anonymous page faults. But
> > > > they should be straightforward.
> > > > 
> > > > So far, add a very light smoke test within the userfaultfd kselftest
> > > > pagemap unit test to make sure anon pte markers work.
> > > > 
> > > > [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
> > > > [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > > v1->v2:
> > > > - Use pte markers rather than populate zero pages when protect [David]
> > > > - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
> > > 
> > > Some very initial performance numbers (I only ran in a VM but it should be
> > > similar, unit is "us") below as requested.  The measurement is about time
> > > spent when wr-protecting 10G range of empty but mapped memory.  It's done
> > > in a VM, assuming we'll get similar results on bare metal.
> > > 
> > > Four test cases:
> > > 
> > >          - default UFFDIO_WP
> > >          - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
> > >          - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
> > >          - UFFDIO_WP with WP_UNPOPULATED
> > > 
> > > Results:
> > > 
> > >          Test DEFAULT: 2
> > >          Test PRE-READ: 3277099 (pre-fault 3253826)
> > >          Test MADVISE: 2250361 (pre-fault 2226310)
> > >          Test WP-UNPOPULATE: 20850
> > In your case:
> > Default < WP-UNPOPULATE < MADVISE < PRE-READ
> > 
> > 
> > In my testing on next-20230228 with this patch and my uffd async patch:
> > 
> > Test DEFAULT: 6
> > Test PRE-READ: 37157 (pre-fault 37006)
> > Test MADVISE: 4884 (pre-fault 4465)
> > Test WP-UNPOPULATE: 17794
> > 
> > DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ
> > 
> > On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
> > I'm not sure why I'm getting this discrepancy here. I've liked your results
> > to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
> > can be done to get consistent benchmarks over your and my side?
> 
> Probably because the current approach from Peter uses uffd-wp markers, and
> these markers can currently only reside on the PTE level, not on the PMD
> level yet.
> 
> With MADVISE you get a huge zeropage and avoid dealing with PTEs.

Yes, probably.  But then when write happens it'll be done there when split,
so the overhead was delayed.

Meanwhile I'll retest again (probably tomorrow..) with bare metals with THP
on/off to double check.

Muhammad, do you think the current performance will work for you?

Especially I want to double check with you again on whether
XFS/EXT4/... will be needed for the tracking purpose so you can reply here
together.  We shouldn't merge anything that doesn't have at least one
existing good use case, and we may need to rethink if it's not.

For performance, one approach is probably making uffd-wp async separate
from other features, where we can revert the meaning of uffd-wp bit to
mimic what soft-dirty does (I think this will look closer to what David
mentioned in the other thread), by defining uffd-wp=1 as "written" and
uffd-wp=0 as clean.

IIUC that'll make it one bit and work as fast as soft-dirty, meanwhile all
uffd-wp marker things can hopefully still be maintained.  However I really
don't like it to violate a lot of things, e.g., when UFFDIO_WRITEPROTECT
another round we'll need to DROP uffd-wp if async, but APPLY uffd-wp if
sync...  So in general it'll need more thoughts and slower to do.
  
Muhammad Usama Anjum March 3, 2023, 6:42 a.m. UTC | #16
On 3/3/23 3:21 AM, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 06:38:20PM +0100, David Hildenbrand wrote:
>> On 02.03.23 18:19, Muhammad Usama Anjum wrote:
>>> On 2/28/23 5:36 AM, Peter Xu wrote:
>>>> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>>>>> This is a new feature that controls how uffd-wp handles none ptes.  When
>>>>> it's set, the kernel will handle anonymous memory the same way as file
>>>>> memory, by allowing the user to wr-protect unpopulated ptes.
>>>>>
>>>>> File memories handles none ptes consistently by allowing wr-protecting of
>>>>> none ptes because of the unawareness of page cache being exist or not.  For
>>>>> anonymous it was not as persistent because we used to assume that we don't
>>>>> need protections on none ptes or known zero pages.
>>>>>
>>>>> One use case of such a feature bit was VM live snapshot, where if without
>>>>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>>>>> holes of the anonymous memory, which can cause misbehave of the guest when
>>>>> the guest OS assumes the pages should be all zeros.
>>>>>
>>>>> QEMU worked it around by pre-populate the section with reads to fill in
>>>>> zero page entries before starting the whole snapshot process [1].
>>>>>
>>>>> Recently there's another need raised on using userfaultfd wr-protect for
>>>>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>>>>> case if without being able to wr-protect none ptes by default, the dirty
>>>>> info can get lost, since we cannot treat every none pte to be dirty (the
>>>>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>>>>
>>>>> In general, we want to be able to wr-protect empty ptes too even for
>>>>> anonymous.
>>>>>
>>>>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>>>>> uffd-wp handling on none ptes being consistent no matter what the memory
>>>>> type is underneath.  It doesn't have any impact on file memories so far
>>>>> because we already have pte markers taking care of that.  So it only
>>>>> affects anonymous.
>>>>>
>>>>> The feature bit is by default off, so the old behavior will be maintained.
>>>>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>>>>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>>>>> anonymous), but also on creating the pgtables to store the pte markers. So
>>>>> there's potentially less chance of using thp on the first fault for a none
>>>>> pmd or larger than a pmd.
>>>>>
>>>>> The major implementation part is teaching the whole kernel to understand
>>>>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>>>>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>>>>> new feature bit is set.
>>>>>
>>>>> Note that even if the patch subject starts with mm/uffd, there're a few
>>>>> small refactors to major mm path of handling anonymous page faults. But
>>>>> they should be straightforward.
>>>>>
>>>>> So far, add a very light smoke test within the userfaultfd kselftest
>>>>> pagemap unit test to make sure anon pte markers work.
>>>>>
>>>>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>>>>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>> v1->v2:
>>>>> - Use pte markers rather than populate zero pages when protect [David]
>>>>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
>>>>
>>>> Some very initial performance numbers (I only ran in a VM but it should be
>>>> similar, unit is "us") below as requested.  The measurement is about time
>>>> spent when wr-protecting 10G range of empty but mapped memory.  It's done
>>>> in a VM, assuming we'll get similar results on bare metal.
>>>>
>>>> Four test cases:
>>>>
>>>>          - default UFFDIO_WP
>>>>          - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>>>>          - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>>>>          - UFFDIO_WP with WP_UNPOPULATED
>>>>
>>>> Results:
>>>>
>>>>          Test DEFAULT: 2
>>>>          Test PRE-READ: 3277099 (pre-fault 3253826)
>>>>          Test MADVISE: 2250361 (pre-fault 2226310)
>>>>          Test WP-UNPOPULATE: 20850
>>> In your case:
>>> Default < WP-UNPOPULATE < MADVISE < PRE-READ
>>>
>>>
>>> In my testing on next-20230228 with this patch and my uffd async patch:
>>>
>>> Test DEFAULT: 6
>>> Test PRE-READ: 37157 (pre-fault 37006)
>>> Test MADVISE: 4884 (pre-fault 4465)
>>> Test WP-UNPOPULATE: 17794
>>>
>>> DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ
>>>
>>> On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
>>> I'm not sure why I'm getting this discrepancy here. I've liked your results
>>> to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
>>> can be done to get consistent benchmarks over your and my side?
>>
>> Probably because the current approach from Peter uses uffd-wp markers, and
>> these markers can currently only reside on the PTE level, not on the PMD
>> level yet.
>>
>> With MADVISE you get a huge zeropage and avoid dealing with PTEs.
> 
> Yes, probably.  But then when write happens it'll be done there when split,
> so the overhead was delayed.
> 
> Meanwhile I'll retest again (probably tomorrow..) with bare metals with THP
> on/off to double check.
Turning on/off THP has effect.

(1) With huge page disabled
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
./uffd_wp_perf
Test DEFAULT: 4
Test PRE-READ: 1111453 (pre-fault 1101011)
Test MADVISE: 278276 (pre-fault 266378)
Test WP-UNPOPULATE: 11712

(2) With Huge page enabled
echo always > /sys/kernel/mm/transparent_hugepage/enabled
./uffd_wp_perf
Test DEFAULT: 4
Test PRE-READ: 22521 (pre-fault 22348)
Test MADVISE: 4909 (pre-fault 4743)
Test WP-UNPOPULATE: 14448

> 
> Muhammad, do you think the current performance will work for you?
Yes, the current performance is good enough for now. We'll be using normal
pages for memory on which PAGEMAP_IOCTL will be used for emulating Windows
syscall. When normal size pages are used, (1) show that WP-UNPOPULATE has
really good performance (278276 --> 11712). Lets proceed with it.

> 
> Especially I want to double check with you again on whether
> XFS/EXT4/... will be needed for the tracking purpose so you can reply here
> together.  We shouldn't merge anything that doesn't have at least one
> existing good use case, and we may need to rethink if it's not.
We have been trying to find soft dirty pages correctly on anon memory.
Right now with UN_POPULATED, ASYNC and PAGEMAP_IOCTL, we'll be able to
achieve this correctly.

We want to track the file mapped memory the same way for another Windows
api translation in which file memory will be tracked to find out soft-dirty
pages. I believe once it is supported in the mm, filesystem doesn't matter
as we'll be tracking memory mapped files.

So we have use cases for both soft-dirty tracking on anon and file backed
memories.

> 
> For performance, one approach is probably making uffd-wp async separate
> from other features, where we can revert the meaning of uffd-wp bit to
> mimic what soft-dirty does (I think this will look closer to what David
> mentioned in the other thread), by defining uffd-wp=1 as "written" and
> uffd-wp=0 as clean.
I'm happy with current state of things. WP Async is just a feature where
faults are resolved automatically. The page is WP or not perfectly
translates if page is written (dirty) or not with just not (!) operation.
If we go and change the meanings of uffd_wp=1 as written, it would cause
confusion and probably we'll need a lot more changes to add async feature
which is right now quite small changeset.

> 
> IIUC that'll make it one bit and work as fast as soft-dirty, meanwhile all
> uffd-wp marker things can hopefully still be maintained.  However I really
> don't like it to violate a lot of things, e.g., when UFFDIO_WRITEPROTECT
> another round we'll need to DROP uffd-wp if async, but APPLY uffd-wp if
> sync...  So in general it'll need more thoughts and slower to do.
Sorry, I'm unable to understand your example here. Can you please elaborate?

>
  
Peter Xu March 3, 2023, 4:47 p.m. UTC | #17
On Fri, Mar 03, 2023 at 11:42:46AM +0500, Muhammad Usama Anjum wrote:
> On 3/3/23 3:21 AM, Peter Xu wrote:
> > On Thu, Mar 02, 2023 at 06:38:20PM +0100, David Hildenbrand wrote:
> >> On 02.03.23 18:19, Muhammad Usama Anjum wrote:
> >>> On 2/28/23 5:36 AM, Peter Xu wrote:
> >>>> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
> >>>>> This is a new feature that controls how uffd-wp handles none ptes.  When
> >>>>> it's set, the kernel will handle anonymous memory the same way as file
> >>>>> memory, by allowing the user to wr-protect unpopulated ptes.
> >>>>>
> >>>>> File memories handles none ptes consistently by allowing wr-protecting of
> >>>>> none ptes because of the unawareness of page cache being exist or not.  For
> >>>>> anonymous it was not as persistent because we used to assume that we don't
> >>>>> need protections on none ptes or known zero pages.
> >>>>>
> >>>>> One use case of such a feature bit was VM live snapshot, where if without
> >>>>> wr-protecting empty ptes the snapshot can contain random rubbish in the
> >>>>> holes of the anonymous memory, which can cause misbehave of the guest when
> >>>>> the guest OS assumes the pages should be all zeros.
> >>>>>
> >>>>> QEMU worked it around by pre-populate the section with reads to fill in
> >>>>> zero page entries before starting the whole snapshot process [1].
> >>>>>
> >>>>> Recently there's another need raised on using userfaultfd wr-protect for
> >>>>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
> >>>>> case if without being able to wr-protect none ptes by default, the dirty
> >>>>> info can get lost, since we cannot treat every none pte to be dirty (the
> >>>>> current design is identify a page dirty based on uffd-wp bit being cleared).
> >>>>>
> >>>>> In general, we want to be able to wr-protect empty ptes too even for
> >>>>> anonymous.
> >>>>>
> >>>>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
> >>>>> uffd-wp handling on none ptes being consistent no matter what the memory
> >>>>> type is underneath.  It doesn't have any impact on file memories so far
> >>>>> because we already have pte markers taking care of that.  So it only
> >>>>> affects anonymous.
> >>>>>
> >>>>> The feature bit is by default off, so the old behavior will be maintained.
> >>>>> Sometimes it may be wanted because the wr-protect of none ptes will contain
> >>>>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
> >>>>> anonymous), but also on creating the pgtables to store the pte markers. So
> >>>>> there's potentially less chance of using thp on the first fault for a none
> >>>>> pmd or larger than a pmd.
> >>>>>
> >>>>> The major implementation part is teaching the whole kernel to understand
> >>>>> pte markers even for anonymously mapped ranges, meanwhile allowing the
> >>>>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
> >>>>> new feature bit is set.
> >>>>>
> >>>>> Note that even if the patch subject starts with mm/uffd, there're a few
> >>>>> small refactors to major mm path of handling anonymous page faults. But
> >>>>> they should be straightforward.
> >>>>>
> >>>>> So far, add a very light smoke test within the userfaultfd kselftest
> >>>>> pagemap unit test to make sure anon pte markers work.
> >>>>>
> >>>>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
> >>>>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
> >>>>>
> >>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>> ---
> >>>>> v1->v2:
> >>>>> - Use pte markers rather than populate zero pages when protect [David]
> >>>>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
> >>>>
> >>>> Some very initial performance numbers (I only ran in a VM but it should be
> >>>> similar, unit is "us") below as requested.  The measurement is about time
> >>>> spent when wr-protecting 10G range of empty but mapped memory.  It's done
> >>>> in a VM, assuming we'll get similar results on bare metal.
> >>>>
> >>>> Four test cases:
> >>>>
> >>>>          - default UFFDIO_WP
> >>>>          - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
> >>>>          - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
> >>>>          - UFFDIO_WP with WP_UNPOPULATED
> >>>>
> >>>> Results:
> >>>>
> >>>>          Test DEFAULT: 2
> >>>>          Test PRE-READ: 3277099 (pre-fault 3253826)
> >>>>          Test MADVISE: 2250361 (pre-fault 2226310)
> >>>>          Test WP-UNPOPULATE: 20850
> >>> In your case:
> >>> Default < WP-UNPOPULATE < MADVISE < PRE-READ
> >>>
> >>>
> >>> In my testing on next-20230228 with this patch and my uffd async patch:
> >>>
> >>> Test DEFAULT: 6
> >>> Test PRE-READ: 37157 (pre-fault 37006)
> >>> Test MADVISE: 4884 (pre-fault 4465)
> >>> Test WP-UNPOPULATE: 17794
> >>>
> >>> DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ
> >>>
> >>> On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
> >>> I'm not sure why I'm getting this discrepancy here. I've liked your results
> >>> to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
> >>> can be done to get consistent benchmarks over your and my side?
> >>
> >> Probably because the current approach from Peter uses uffd-wp markers, and
> >> these markers can currently only reside on the PTE level, not on the PMD
> >> level yet.
> >>
> >> With MADVISE you get a huge zeropage and avoid dealing with PTEs.
> > 
> > Yes, probably.  But then when write happens it'll be done there when split,
> > so the overhead was delayed.
> > 
> > Meanwhile I'll retest again (probably tomorrow..) with bare metals with THP
> > on/off to double check.
> Turning on/off THP has effect.
> 
> (1) With huge page disabled
> echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
> ./uffd_wp_perf
> Test DEFAULT: 4
> Test PRE-READ: 1111453 (pre-fault 1101011)
> Test MADVISE: 278276 (pre-fault 266378)
> Test WP-UNPOPULATE: 11712
> 
> (2) With Huge page enabled
> echo always > /sys/kernel/mm/transparent_hugepage/enabled
> ./uffd_wp_perf
> Test DEFAULT: 4
> Test PRE-READ: 22521 (pre-fault 22348)
> Test MADVISE: 4909 (pre-fault 4743)
> Test WP-UNPOPULATE: 14448

Thanks. Yes then this is explaine-able to me.  As mentioned previously, thp
will drastically speedup everything during prefault+ioctl(UFFDIO_WP),
though that overhead will be applied when the 1st write triggers later,
afaik.

> 
> > 
> > Muhammad, do you think the current performance will work for you?
> Yes, the current performance is good enough for now. We'll be using normal
> pages for memory on which PAGEMAP_IOCTL will be used for emulating Windows
> syscall. When normal size pages are used, (1) show that WP-UNPOPULATE has
> really good performance (278276 --> 11712). Lets proceed with it.

OK.

> 
> > 
> > Especially I want to double check with you again on whether
> > XFS/EXT4/... will be needed for the tracking purpose so you can reply here
> > together.  We shouldn't merge anything that doesn't have at least one
> > existing good use case, and we may need to rethink if it's not.
> We have been trying to find soft dirty pages correctly on anon memory.
> Right now with UN_POPULATED, ASYNC and PAGEMAP_IOCTL, we'll be able to
> achieve this correctly.
> 
> We want to track the file mapped memory the same way for another Windows
> api translation in which file memory will be tracked to find out soft-dirty
> pages. I believe once it is supported in the mm, filesystem doesn't matter
> as we'll be tracking memory mapped files.

Some more elaboration would be great here.

For example, what is the other windows API?  What will you do with the
dirty information?  What is the impact of supporting only shmem/hugetlbfs
for "!anon"?

> 
> So we have use cases for both soft-dirty tracking on anon and file backed
> memories.
> 
> > 
> > For performance, one approach is probably making uffd-wp async separate
> > from other features, where we can revert the meaning of uffd-wp bit to
> > mimic what soft-dirty does (I think this will look closer to what David
> > mentioned in the other thread), by defining uffd-wp=1 as "written" and
> > uffd-wp=0 as clean.
> I'm happy with current state of things. WP Async is just a feature where
> faults are resolved automatically. The page is WP or not perfectly
> translates if page is written (dirty) or not with just not (!) operation.
> If we go and change the meanings of uffd_wp=1 as written, it would cause
> confusion and probably we'll need a lot more changes to add async feature
> which is right now quite small changeset.

Yes, that's also why I wanted to keep it like this unless necessary.  It
makes all uffd features still more aligned with everything so more maintainable.

> 
> > 
> > IIUC that'll make it one bit and work as fast as soft-dirty, meanwhile all
> > uffd-wp marker things can hopefully still be maintained.  However I really
> > don't like it to violate a lot of things, e.g., when UFFDIO_WRITEPROTECT
> > another round we'll need to DROP uffd-wp if async, but APPLY uffd-wp if
> > sync...  So in general it'll need more thoughts and slower to do.
> Sorry, I'm unable to understand your example here. Can you please elaborate?

I think it's mostly what you stated above, the changeset will just get more
involved because of either the redefinition of uffd-wp depending on feature
enabled, or using yet another bit (e.g., if you still remember the very
original reply I was planning to reuse soft-dirty bit but used together
with VM_UFFD_WP; I think that resolves the degraded perf on unpopulated
memory too because it'll work more like soft-dirty).

This is some detail which, IMHO, we can leave that for later.  To me I'm
more curious on knowing more details of your use case, and how uffd could
suite that the best way it could, assuming that (or part of it like the
unpopulate work here) can also be leveraged by others.  Your answers to
above questions may help.

Thanks,
  
Muhammad Usama Anjum March 6, 2023, 9:03 a.m. UTC | #18
On 3/3/23 9:47 PM, Peter Xu wrote:
> On Fri, Mar 03, 2023 at 11:42:46AM +0500, Muhammad Usama Anjum wrote:
>> On 3/3/23 3:21 AM, Peter Xu wrote:
>>> On Thu, Mar 02, 2023 at 06:38:20PM +0100, David Hildenbrand wrote:
>>>> On 02.03.23 18:19, Muhammad Usama Anjum wrote:
>>>>> On 2/28/23 5:36 AM, Peter Xu wrote:
>>>>>> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>>>>>>> This is a new feature that controls how uffd-wp handles none ptes.  When
>>>>>>> it's set, the kernel will handle anonymous memory the same way as file
>>>>>>> memory, by allowing the user to wr-protect unpopulated ptes.
>>>>>>>
>>>>>>> File memories handles none ptes consistently by allowing wr-protecting of
>>>>>>> none ptes because of the unawareness of page cache being exist or not.  For
>>>>>>> anonymous it was not as persistent because we used to assume that we don't
>>>>>>> need protections on none ptes or known zero pages.
>>>>>>>
>>>>>>> One use case of such a feature bit was VM live snapshot, where if without
>>>>>>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>>>>>>> holes of the anonymous memory, which can cause misbehave of the guest when
>>>>>>> the guest OS assumes the pages should be all zeros.
>>>>>>>
>>>>>>> QEMU worked it around by pre-populate the section with reads to fill in
>>>>>>> zero page entries before starting the whole snapshot process [1].
>>>>>>>
>>>>>>> Recently there's another need raised on using userfaultfd wr-protect for
>>>>>>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>>>>>>> case if without being able to wr-protect none ptes by default, the dirty
>>>>>>> info can get lost, since we cannot treat every none pte to be dirty (the
>>>>>>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>>>>>>
>>>>>>> In general, we want to be able to wr-protect empty ptes too even for
>>>>>>> anonymous.
>>>>>>>
>>>>>>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>>>>>>> uffd-wp handling on none ptes being consistent no matter what the memory
>>>>>>> type is underneath.  It doesn't have any impact on file memories so far
>>>>>>> because we already have pte markers taking care of that.  So it only
>>>>>>> affects anonymous.
>>>>>>>
>>>>>>> The feature bit is by default off, so the old behavior will be maintained.
>>>>>>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>>>>>>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>>>>>>> anonymous), but also on creating the pgtables to store the pte markers. So
>>>>>>> there's potentially less chance of using thp on the first fault for a none
>>>>>>> pmd or larger than a pmd.
>>>>>>>
>>>>>>> The major implementation part is teaching the whole kernel to understand
>>>>>>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>>>>>>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>>>>>>> new feature bit is set.
>>>>>>>
>>>>>>> Note that even if the patch subject starts with mm/uffd, there're a few
>>>>>>> small refactors to major mm path of handling anonymous page faults. But
>>>>>>> they should be straightforward.
>>>>>>>
>>>>>>> So far, add a very light smoke test within the userfaultfd kselftest
>>>>>>> pagemap unit test to make sure anon pte markers work.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>>>>>>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>>>>>>
>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>> ---
>>>>>>> v1->v2:
>>>>>>> - Use pte markers rather than populate zero pages when protect [David]
>>>>>>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
>>>>>>
>>>>>> Some very initial performance numbers (I only ran in a VM but it should be
>>>>>> similar, unit is "us") below as requested.  The measurement is about time
>>>>>> spent when wr-protecting 10G range of empty but mapped memory.  It's done
>>>>>> in a VM, assuming we'll get similar results on bare metal.
>>>>>>
>>>>>> Four test cases:
>>>>>>
>>>>>>          - default UFFDIO_WP
>>>>>>          - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>>>>>>          - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>>>>>>          - UFFDIO_WP with WP_UNPOPULATED
>>>>>>
>>>>>> Results:
>>>>>>
>>>>>>          Test DEFAULT: 2
>>>>>>          Test PRE-READ: 3277099 (pre-fault 3253826)
>>>>>>          Test MADVISE: 2250361 (pre-fault 2226310)
>>>>>>          Test WP-UNPOPULATE: 20850
>>>>> In your case:
>>>>> Default < WP-UNPOPULATE < MADVISE < PRE-READ
>>>>>
>>>>>
>>>>> In my testing on next-20230228 with this patch and my uffd async patch:
>>>>>
>>>>> Test DEFAULT: 6
>>>>> Test PRE-READ: 37157 (pre-fault 37006)
>>>>> Test MADVISE: 4884 (pre-fault 4465)
>>>>> Test WP-UNPOPULATE: 17794
>>>>>
>>>>> DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ
>>>>>
>>>>> On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
>>>>> I'm not sure why I'm getting this discrepancy here. I've liked your results
>>>>> to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
>>>>> can be done to get consistent benchmarks over your and my side?
>>>>
>>>> Probably because the current approach from Peter uses uffd-wp markers, and
>>>> these markers can currently only reside on the PTE level, not on the PMD
>>>> level yet.
>>>>
>>>> With MADVISE you get a huge zeropage and avoid dealing with PTEs.
>>>
>>> Yes, probably.  But then when write happens it'll be done there when split,
>>> so the overhead was delayed.
>>>
>>> Meanwhile I'll retest again (probably tomorrow..) with bare metals with THP
>>> on/off to double check.
>> Turning on/off THP has effect.
>>
>> (1) With huge page disabled
>> echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>> ./uffd_wp_perf
>> Test DEFAULT: 4
>> Test PRE-READ: 1111453 (pre-fault 1101011)
>> Test MADVISE: 278276 (pre-fault 266378)
>> Test WP-UNPOPULATE: 11712
Hi Paul,

We were reading, touching or faulting memory before engaging wp. This
reading is called pre-read in this benchmark. We can fault the memory
through madvise as well, it is called madvise here. WP-Unpopulated is being
added in this patch which will remove the need to pre-fault the memory and
it is way faster as shown by the numbers.

>>
>> (2) With Huge page enabled
>> echo always > /sys/kernel/mm/transparent_hugepage/enabled
>> ./uffd_wp_perf
>> Test DEFAULT: 4
>> Test PRE-READ: 22521 (pre-fault 22348)
>> Test MADVISE: 4909 (pre-fault 4743)
>> Test WP-UNPOPULATE: 14448
> 
> Thanks. Yes then this is explaine-able to me.  As mentioned previously, thp
> will drastically speedup everything during prefault+ioctl(UFFDIO_WP),
> though that overhead will be applied when the 1st write triggers later,
> afaik.
> 
>>
>>>
>>> Muhammad, do you think the current performance will work for you?
>> Yes, the current performance is good enough for now. We'll be using normal
>> pages for memory on which PAGEMAP_IOCTL will be used for emulating Windows
>> syscall. When normal size pages are used, (1) show that WP-UNPOPULATE has
>> really good performance (278276 --> 11712). Lets proceed with it.
> 
> OK.
> 
>>
>>>
>>> Especially I want to double check with you again on whether
>>> XFS/EXT4/... will be needed for the tracking purpose so you can reply here
>>> together.  We shouldn't merge anything that doesn't have at least one
>>> existing good use case, and we may need to rethink if it's not.
>> We have been trying to find soft dirty pages correctly on anon memory.
>> Right now with UN_POPULATED, ASYNC and PAGEMAP_IOCTL, we'll be able to
>> achieve this correctly.
>>
>> We want to track the file mapped memory the same way for another Windows
>> api translation in which file memory will be tracked to find out soft-dirty
>> pages. I believe once it is supported in the mm, filesystem doesn't matter
>> as we'll be tracking memory mapped files.
> 
> Some more elaboration would be great here.
There is something called "writecopy" in Windows. Paul has given me some
details.

@Paul can you share more details that:
1) What is writecopy?
2) If there is no documentation or documented behavior, Can you please
share the current reference implementation in Wine even though it is buggy
at the time? (I can implement correct implementation with this patch and my
wip v11 pagemap_ioctl patches for testing purpose.)

> 
> For example, what is the other windows API?  What will you do with the
> dirty information?  What is the impact of supporting only shmem/hugetlbfs
> for "!anon"?
> 
>>
>> So we have use cases for both soft-dirty tracking on anon and file backed
>> memories.
>>
>>>
>>> For performance, one approach is probably making uffd-wp async separate
>>> from other features, where we can revert the meaning of uffd-wp bit to
>>> mimic what soft-dirty does (I think this will look closer to what David
>>> mentioned in the other thread), by defining uffd-wp=1 as "written" and
>>> uffd-wp=0 as clean.
>> I'm happy with current state of things. WP Async is just a feature where
>> faults are resolved automatically. The page is WP or not perfectly
>> translates if page is written (dirty) or not with just not (!) operation.
>> If we go and change the meanings of uffd_wp=1 as written, it would cause
>> confusion and probably we'll need a lot more changes to add async feature
>> which is right now quite small changeset.
> 
> Yes, that's also why I wanted to keep it like this unless necessary.  It
> makes all uffd features still more aligned with everything so more maintainable.
> 
>>
>>>
>>> IIUC that'll make it one bit and work as fast as soft-dirty, meanwhile all
>>> uffd-wp marker things can hopefully still be maintained.  However I really
>>> don't like it to violate a lot of things, e.g., when UFFDIO_WRITEPROTECT
>>> another round we'll need to DROP uffd-wp if async, but APPLY uffd-wp if
>>> sync...  So in general it'll need more thoughts and slower to do.
>> Sorry, I'm unable to understand your example here. Can you please elaborate?
> 
> I think it's mostly what you stated above, the changeset will just get more
> involved because of either the redefinition of uffd-wp depending on feature
> enabled, or using yet another bit (e.g., if you still remember the very
> original reply I was planning to reuse soft-dirty bit but used together
> with VM_UFFD_WP; I think that resolves the degraded perf on unpopulated
> memory too because it'll work more like soft-dirty).
> 
> This is some detail which, IMHO, we can leave that for later.  To me I'm
> more curious on knowing more details of your use case, and how uffd could
> suite that the best way it could, assuming that (or part of it like the
> unpopulate work here) can also be leveraged by others.  Your answers to
> above questions may help.
> 
> Thanks,
>
  
Muhammad Usama Anjum March 6, 2023, 4:09 p.m. UTC | #19
On 3/6/23 2:03 PM, Muhammad Usama Anjum wrote:
> On 3/3/23 9:47 PM, Peter Xu wrote:
>> On Fri, Mar 03, 2023 at 11:42:46AM +0500, Muhammad Usama Anjum wrote:
>>> On 3/3/23 3:21 AM, Peter Xu wrote:
>>>> On Thu, Mar 02, 2023 at 06:38:20PM +0100, David Hildenbrand wrote:
>>>>> On 02.03.23 18:19, Muhammad Usama Anjum wrote:
>>>>>> On 2/28/23 5:36 AM, Peter Xu wrote:
>>>>>>> On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
>>>>>>>> This is a new feature that controls how uffd-wp handles none ptes.  When
>>>>>>>> it's set, the kernel will handle anonymous memory the same way as file
>>>>>>>> memory, by allowing the user to wr-protect unpopulated ptes.
>>>>>>>>
>>>>>>>> File memories handles none ptes consistently by allowing wr-protecting of
>>>>>>>> none ptes because of the unawareness of page cache being exist or not.  For
>>>>>>>> anonymous it was not as persistent because we used to assume that we don't
>>>>>>>> need protections on none ptes or known zero pages.
>>>>>>>>
>>>>>>>> One use case of such a feature bit was VM live snapshot, where if without
>>>>>>>> wr-protecting empty ptes the snapshot can contain random rubbish in the
>>>>>>>> holes of the anonymous memory, which can cause misbehave of the guest when
>>>>>>>> the guest OS assumes the pages should be all zeros.
>>>>>>>>
>>>>>>>> QEMU worked it around by pre-populate the section with reads to fill in
>>>>>>>> zero page entries before starting the whole snapshot process [1].
>>>>>>>>
>>>>>>>> Recently there's another need raised on using userfaultfd wr-protect for
>>>>>>>> detecting dirty pages (to replace soft-dirty in some cases) [2].  In that
>>>>>>>> case if without being able to wr-protect none ptes by default, the dirty
>>>>>>>> info can get lost, since we cannot treat every none pte to be dirty (the
>>>>>>>> current design is identify a page dirty based on uffd-wp bit being cleared).
>>>>>>>>
>>>>>>>> In general, we want to be able to wr-protect empty ptes too even for
>>>>>>>> anonymous.
>>>>>>>>
>>>>>>>> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
>>>>>>>> uffd-wp handling on none ptes being consistent no matter what the memory
>>>>>>>> type is underneath.  It doesn't have any impact on file memories so far
>>>>>>>> because we already have pte markers taking care of that.  So it only
>>>>>>>> affects anonymous.
>>>>>>>>
>>>>>>>> The feature bit is by default off, so the old behavior will be maintained.
>>>>>>>> Sometimes it may be wanted because the wr-protect of none ptes will contain
>>>>>>>> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
>>>>>>>> anonymous), but also on creating the pgtables to store the pte markers. So
>>>>>>>> there's potentially less chance of using thp on the first fault for a none
>>>>>>>> pmd or larger than a pmd.
>>>>>>>>
>>>>>>>> The major implementation part is teaching the whole kernel to understand
>>>>>>>> pte markers even for anonymously mapped ranges, meanwhile allowing the
>>>>>>>> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
>>>>>>>> new feature bit is set.
>>>>>>>>
>>>>>>>> Note that even if the patch subject starts with mm/uffd, there're a few
>>>>>>>> small refactors to major mm path of handling anonymous page faults. But
>>>>>>>> they should be straightforward.
>>>>>>>>
>>>>>>>> So far, add a very light smoke test within the userfaultfd kselftest
>>>>>>>> pagemap unit test to make sure anon pte markers work.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
>>>>>>>> [1] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
>>>>>>>>
>>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>>> ---
>>>>>>>> v1->v2:
>>>>>>>> - Use pte markers rather than populate zero pages when protect [David]
>>>>>>>> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
>>>>>>>
>>>>>>> Some very initial performance numbers (I only ran in a VM but it should be
>>>>>>> similar, unit is "us") below as requested.  The measurement is about time
>>>>>>> spent when wr-protecting 10G range of empty but mapped memory.  It's done
>>>>>>> in a VM, assuming we'll get similar results on bare metal.
>>>>>>>
>>>>>>> Four test cases:
>>>>>>>
>>>>>>>          - default UFFDIO_WP
>>>>>>>          - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
>>>>>>>          - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
>>>>>>>          - UFFDIO_WP with WP_UNPOPULATED
>>>>>>>
>>>>>>> Results:
>>>>>>>
>>>>>>>          Test DEFAULT: 2
>>>>>>>          Test PRE-READ: 3277099 (pre-fault 3253826)
>>>>>>>          Test MADVISE: 2250361 (pre-fault 2226310)
>>>>>>>          Test WP-UNPOPULATE: 20850
>>>>>> In your case:
>>>>>> Default < WP-UNPOPULATE < MADVISE < PRE-READ
>>>>>>
>>>>>>
>>>>>> In my testing on next-20230228 with this patch and my uffd async patch:
>>>>>>
>>>>>> Test DEFAULT: 6
>>>>>> Test PRE-READ: 37157 (pre-fault 37006)
>>>>>> Test MADVISE: 4884 (pre-fault 4465)
>>>>>> Test WP-UNPOPULATE: 17794
>>>>>>
>>>>>> DEFAULT < MADVISE < WP-UNPOPULATE < PRE-READ
>>>>>>
>>>>>> On my setup, MADVISE is performing better than WP-UNPOPULATE consistently.
>>>>>> I'm not sure why I'm getting this discrepancy here. I've liked your results
>>>>>> to be honest where we perform better with WP-UNPOPULATE than MADVISE. What
>>>>>> can be done to get consistent benchmarks over your and my side?
>>>>>
>>>>> Probably because the current approach from Peter uses uffd-wp markers, and
>>>>> these markers can currently only reside on the PTE level, not on the PMD
>>>>> level yet.
>>>>>
>>>>> With MADVISE you get a huge zeropage and avoid dealing with PTEs.
>>>>
>>>> Yes, probably.  But then when write happens it'll be done there when split,
>>>> so the overhead was delayed.
>>>>
>>>> Meanwhile I'll retest again (probably tomorrow..) with bare metals with THP
>>>> on/off to double check.
>>> Turning on/off THP has effect.
>>>
>>> (1) With huge page disabled
>>> echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>>> ./uffd_wp_perf
>>> Test DEFAULT: 4
>>> Test PRE-READ: 1111453 (pre-fault 1101011)
>>> Test MADVISE: 278276 (pre-fault 266378)
>>> Test WP-UNPOPULATE: 11712
> Hi Paul,
> 
> We were reading, touching or faulting memory before engaging wp. This
> reading is called pre-read in this benchmark. We can fault the memory
> through madvise as well, it is called madvise here. WP-Unpopulated is being
> added in this patch which will remove the need to pre-fault the memory and
> it is way faster as shown by the numbers.
> 
>>>
>>> (2) With Huge page enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/enabled
>>> ./uffd_wp_perf
>>> Test DEFAULT: 4
>>> Test PRE-READ: 22521 (pre-fault 22348)
>>> Test MADVISE: 4909 (pre-fault 4743)
>>> Test WP-UNPOPULATE: 14448
>>
>> Thanks. Yes then this is explaine-able to me.  As mentioned previously, thp
>> will drastically speedup everything during prefault+ioctl(UFFDIO_WP),
>> though that overhead will be applied when the 1st write triggers later,
>> afaik.
>>
>>>
>>>>
>>>> Muhammad, do you think the current performance will work for you?
>>> Yes, the current performance is good enough for now. We'll be using normal
>>> pages for memory on which PAGEMAP_IOCTL will be used for emulating Windows
>>> syscall. When normal size pages are used, (1) show that WP-UNPOPULATE has
>>> really good performance (278276 --> 11712). Lets proceed with it.
>>
>> OK.
>>
>>>
>>>>
>>>> Especially I want to double check with you again on whether
>>>> XFS/EXT4/... will be needed for the tracking purpose so you can reply here
>>>> together.  We shouldn't merge anything that doesn't have at least one
>>>> existing good use case, and we may need to rethink if it's not.
>>> We have been trying to find soft dirty pages correctly on anon memory.
>>> Right now with UN_POPULATED, ASYNC and PAGEMAP_IOCTL, we'll be able to
>>> achieve this correctly.
>>>
>>> We want to track the file mapped memory the same way for another Windows
>>> api translation in which file memory will be tracked to find out soft-dirty
>>> pages. I believe once it is supported in the mm, filesystem doesn't matter
>>> as we'll be tracking memory mapped files.
>>
>> Some more elaboration would be great here.
> There is something called "writecopy" in Windows. Paul has given me some
> details.
> 
> @Paul can you share more details that:
> 1) What is writecopy?
> 2) If there is no documentation or documented behavior, Can you please
> share the current reference implementation in Wine even though it is buggy
> at the time? (I can implement correct implementation with this patch and my
> wip v11 pagemap_ioctl patches for testing purpose.)
Hi Peter,

Just had some offline discussion with Paul, lets leave this second use case
for future. We don't want to make discussions more entangled. There isn't
much clarity on it right now. Lets only complete the first use case.

The current patch seems good. Please can you post the next version by
including the missing part. Then I'll also be able to send the v11 of
pagemap_ioctl by basing on it.

> 
>>
>> For example, what is the other windows API?  What will you do with the
>> dirty information?  What is the impact of supporting only shmem/hugetlbfs
>> for "!anon"?
>>
>>>
>>> So we have use cases for both soft-dirty tracking on anon and file backed
>>> memories.
>>>
>>>>
>>>> For performance, one approach is probably making uffd-wp async separate
>>>> from other features, where we can revert the meaning of uffd-wp bit to
>>>> mimic what soft-dirty does (I think this will look closer to what David
>>>> mentioned in the other thread), by defining uffd-wp=1 as "written" and
>>>> uffd-wp=0 as clean.
>>> I'm happy with current state of things. WP Async is just a feature where
>>> faults are resolved automatically. The page is WP or not perfectly
>>> translates if page is written (dirty) or not with just not (!) operation.
>>> If we go and change the meanings of uffd_wp=1 as written, it would cause
>>> confusion and probably we'll need a lot more changes to add async feature
>>> which is right now quite small changeset.
>>
>> Yes, that's also why I wanted to keep it like this unless necessary.  It
>> makes all uffd features still more aligned with everything so more maintainable.
>>
>>>
>>>>
>>>> IIUC that'll make it one bit and work as fast as soft-dirty, meanwhile all
>>>> uffd-wp marker things can hopefully still be maintained.  However I really
>>>> don't like it to violate a lot of things, e.g., when UFFDIO_WRITEPROTECT
>>>> another round we'll need to DROP uffd-wp if async, but APPLY uffd-wp if
>>>> sync...  So in general it'll need more thoughts and slower to do.
>>> Sorry, I'm unable to understand your example here. Can you please elaborate?
>>
>> I think it's mostly what you stated above, the changeset will just get more
>> involved because of either the redefinition of uffd-wp depending on feature
>> enabled, or using yet another bit (e.g., if you still remember the very
>> original reply I was planning to reuse soft-dirty bit but used together
>> with VM_UFFD_WP; I think that resolves the degraded perf on unpopulated
>> memory too because it'll work more like soft-dirty).
>>
>> This is some detail which, IMHO, we can leave that for later.  To me I'm
>> more curious on knowing more details of your use case, and how uffd could
>> suite that the best way it could, assuming that (or part of it like the
>> unpopulate work here) can also be leveraged by others.  Your answers to
>> above questions may help.
>>
>> Thanks,
>>
>
  

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..a2499908985a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -108,6 +108,19 @@  static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
 	return ctx->features & UFFD_FEATURE_INITIALIZED;
 }
 
+bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
+{
+	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
+
+	if (!userfaultfd_wp(vma))
+		return false;
+
+	if (!ctx)
+		return false;
+
+	return ctx->features & UFFD_FEATURE_WP_UNPOPULATED;
+}
+
 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
 				     vm_flags_t flags)
 {
@@ -1971,6 +1984,7 @@  static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 #endif
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
+	uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
 #endif
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index de1e622dd366..0e1d239a882c 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -557,6 +557,12 @@  pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 	/* The current status of the pte should be "cleared" before calling */
 	WARN_ON_ONCE(!pte_none(*pte));
 
+	/*
+	 * NOTE: userfaultfd_wp_unpopulated() doesn't need this whole
+	 * thing, because when zapping either it means it's dropping the
+	 * page, or in TTU where the present pte will be quickly replaced
+	 * with a swap pte.  There's no way of leaking the bit.
+	 */
 	if (vma_is_anonymous(vma) || !userfaultfd_wp(vma))
 		return;
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3767f18114ef..9d43c1e88175 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -179,6 +179,7 @@  extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
 				  unsigned long end, struct list_head *uf);
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
+extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
@@ -274,6 +275,11 @@  static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..90c958952bfc 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -38,7 +38,8 @@ 
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_WP_UNPOPULATED)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -203,6 +204,12 @@  struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_WP_UNPOPULATED indicates that userfaultfd
+	 * write-protection mode will always apply to unpopulated pages
+	 * (i.e. empty ptes).  This will be the default behavior for shmem
+	 * & hugetlbfs, so this flag only affects anonymous memory behavior
+	 * when userfault write-protection mode is registered.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -217,6 +224,7 @@  struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/memory.c b/mm/memory.c
index bfa3100ec5a3..46934133bd0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -104,6 +104,20 @@  EXPORT_SYMBOL(mem_map);
 #endif
 
 static vm_fault_t do_fault(struct vm_fault *vmf);
+static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
+static bool vmf_pte_changed(struct vm_fault *vmf);
+
+/*
+ * Return true if the original pte was a uffd-wp pte marker (so the pte was
+ * wr-protected).
+ */
+static bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
+{
+	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
+		return false;
+
+	return pte_marker_uffd_wp(vmf->orig_pte);
+}
 
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
@@ -1346,6 +1360,10 @@  zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *pte,
 			      struct zap_details *details, pte_t pteval)
 {
+	/* Zap on anonymous always means dropping everything */
+	if (vma_is_anonymous(vma))
+		return;
+
 	if (zap_drop_file_uffd_wp(details))
 		return;
 
@@ -1452,8 +1470,12 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			rss[mm_counter(page)]--;
 		} else if (pte_marker_entry_uffd_wp(entry)) {
-			/* Only drop the uffd-wp marker if explicitly requested */
-			if (!zap_drop_file_uffd_wp(details))
+			/*
+			 * For anon: always drop the marker; for file: only
+			 * drop the marker if explicitly requested.
+			 */
+			if (!vma_is_anonymous(vma) &&
+			    !zap_drop_file_uffd_wp(details))
 				continue;
 		} else if (is_hwpoison_entry(entry) ||
 			   is_swapin_error_entry(entry)) {
@@ -3620,6 +3642,14 @@  static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 	return 0;
 }
 
+static vm_fault_t handle_pte_missing(struct vm_fault *vmf)
+{
+	if (vma_is_anonymous(vmf->vma))
+		return do_anonymous_page(vmf);
+	else
+		return do_fault(vmf);
+}
+
 /*
  * This is actually a page-missing access, but with uffd-wp special pte
  * installed.  It means this pte was wr-protected before being unmapped.
@@ -3630,11 +3660,10 @@  static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
 	 * Just in case there're leftover special ptes even after the region
 	 * got unregistered - we can simply clear them.
 	 */
-	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+	if (unlikely(!userfaultfd_wp(vmf->vma)))
 		return pte_marker_clear(vmf);
 
-	/* do_fault() can handle pte markers too like none pte */
-	return do_fault(vmf);
+	return handle_pte_missing(vmf);
 }
 
 static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
@@ -3999,6 +4028,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
 	vm_fault_t ret = 0;
@@ -4072,7 +4102,7 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
-	if (!pte_none(*vmf->pte)) {
+	if (vmf_pte_changed(vmf)) {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
 		goto release;
 	}
@@ -4092,6 +4122,8 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	folio_add_new_anon_rmap(folio, vma, vmf->address);
 	folio_add_lru_vma(folio, vma);
 setpte:
+	if (uffd_wp)
+		entry = pte_mkuffd_wp(entry);
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -4259,7 +4291,7 @@  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
+	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = vmf->address != addr;
 	pte_t entry;
@@ -4903,12 +4935,8 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		}
 	}
 
-	if (!vmf->pte) {
-		if (vma_is_anonymous(vmf->vma))
-			return do_anonymous_page(vmf);
-		else
-			return do_fault(vmf);
-	}
+	if (!vmf->pte)
+		return handle_pte_missing(vmf);
 
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 231929f119d9..6a2df93158ee 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -276,7 +276,16 @@  static long change_pte_range(struct mmu_gather *tlb,
 		} else {
 			/* It must be an none page, or what else?.. */
 			WARN_ON_ONCE(!pte_none(oldpte));
-			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+
+			/*
+			 * Nobody plays with any none ptes besides
+			 * userfaultfd when applying the protections.
+			 */
+			if (likely(!uffd_wp))
+				continue;
+
+			if (!vma_is_anonymous(vma) ||
+			    userfaultfd_wp_unpopulated(vma)) {
 				/*
 				 * For file-backed mem, we need to be able to
 				 * wr-protect a none pte, because even if the
@@ -320,23 +329,53 @@  static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
-/* Return true if we're uffd wr-protecting file-backed memory, or false */
+/*
+ * Return true if we want to split huge thps in change protection
+ * procedure, false otherwise.
+ */
 static inline bool
-uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 {
+	/*
+	 * pte markers only resides in pte level, if we need pte markers,
+	 * we need to split.  We cannot wr-protect shmem thp because file
+	 * thp is handled differently when split by erasing the pmd so far.
+	 */
 	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
 }
 
 /*
- * If wr-protecting the range for file-backed, populate pgtable for the case
- * when pgtable is empty but page cache exists.  When {pte|pmd|...}_alloc()
- * failed we treat it the same way as pgtable allocation failures during
- * page faults by kicking OOM and returning error.
+ * Return true if we want to populate pgtables in change protection
+ * procedure, false otherwise
+ */
+static inline bool
+pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+	/* If not within ioctl(UFFDIO_WRITEPROTECT), then don't bother */
+	if (!(cp_flags & MM_CP_UFFD_WP))
+		return false;
+
+	/* Either if this is file-based, we need it for pte markers */
+	if (!vma_is_anonymous(vma))
+		return true;
+
+	/*
+	 * Or anonymous, we only need this if WP_ZEROPAGE enabled (to
+	 * install zero pages).
+	 */
+	return userfaultfd_wp_unpopulated(vma);
+}
+
+/*
+ * Populate the pgtable underneath for whatever reason if requested.
+ * When {pte|pmd|...}_alloc() failed we treat it the same way as pgtable
+ * allocation failures during page faults by kicking OOM and returning
+ * error.
  */
 #define  change_pmd_prepare(vma, pmd, cp_flags)				\
 	({								\
 		long err = 0;						\
-		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+		if (unlikely(pgtable_populate_needed(vma, cp_flags))) {	\
 			if (pte_alloc(vma->vm_mm, pmd))			\
 				err = -ENOMEM;				\
 		}							\
@@ -351,7 +390,7 @@  uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
 #define  change_prepare(vma, high, low, addr, cp_flags)			\
 	  ({								\
 		long err = 0;						\
-		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+		if (unlikely(pgtable_populate_needed(vma, cp_flags))) {	\
 			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
 			if (p == NULL)					\
 				err = -ENOMEM;				\
@@ -404,7 +443,7 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
-			    uffd_wp_protect_file(vma, cp_flags)) {
+			    pgtable_split_needed(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				/*
 				 * For file-backed, the pmd could have been
diff --git a/tools/testing/selftests/mm/userfaultfd.c b/tools/testing/selftests/mm/userfaultfd.c
index 7f22844ed704..122695a63cf1 100644
--- a/tools/testing/selftests/mm/userfaultfd.c
+++ b/tools/testing/selftests/mm/userfaultfd.c
@@ -1462,7 +1462,7 @@  static void userfaultfd_pagemap_test(unsigned int test_pgsize)
 	/* Flush so it doesn't flush twice in parent/child later */
 	fflush(stdout);
 
-	uffd_test_ctx_init(0);
+	uffd_test_ctx_init(UFFD_FEATURE_WP_UNPOPULATED);
 
 	if (test_pgsize > page_size) {
 		/* This is a thp test */
@@ -1482,6 +1482,26 @@  static void userfaultfd_pagemap_test(unsigned int test_pgsize)
 
 	pagemap_fd = pagemap_open();
 
+	/* Smoke test WP_UNPOPULATED first when it's still empty */
+	if (test_pgsize == page_size) {
+		/* Test applying pte marker to anon unpopulated */
+		wp_range(uffd, (uint64_t)area_dst, test_pgsize, true);
+		value = pagemap_read_vaddr(pagemap_fd, area_dst);
+		pagemap_check_wp(value, true);
+
+		/* Test unprotect with anon pte marker */
+		wp_range(uffd, (uint64_t)area_dst, test_pgsize, false);
+		value = pagemap_read_vaddr(pagemap_fd, area_dst);
+		pagemap_check_wp(value, false);
+
+		/* Re-apply, test zap on anon which should drop the marker */
+		wp_range(uffd, (uint64_t)area_dst, test_pgsize, true);
+		if (madvise(area_dst, test_pgsize, MADV_DONTNEED))
+			err("madvise(MADV_DONTNEED) failed");
+		value = pagemap_read_vaddr(pagemap_fd, area_dst);
+		pagemap_check_wp(value, false);
+	}
+
 	/* Touch the page */
 	*area_dst = 1;
 	wp_range(uffd, (uint64_t)area_dst, test_pgsize, true);