[04/47] LoongArch: Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty()

Message ID 20221117042532.4064448-1-chenhuacai@loongson.cn
State New
Headers
Series None |

Commit Message

Huacai Chen Nov. 17, 2022, 4:25 a.m. UTC
  Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
over dirty bit when thp splits on pmd").

The reason is: when fork(), parent process use pmd_wrprotect() to clear
huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW); then pte_mkdirty() set
_PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages;
once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW
machanism fails; and at last memory corruption occurred between parent
and child processes.

So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_
mkdirty().

Cc: stable@vger.kernel.org
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
Note: CC sparc maillist because they have similar issues.
 
 arch/loongarch/include/asm/pgtable.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
  

Comments

Xi Ruoyao Nov. 17, 2022, 1:59 p.m. UTC | #1
Hi Huacai,

On Thu, 2022-11-17 at 12:25 +0800, Huacai Chen wrote:
> Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
> random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
> over dirty bit when thp splits on pmd").

Hmm, the pte_mkdirty call is already removed in commit 624a2c94f5b7a081
("Partly revert \"mm/thp: carry over dirty bit when thp splits on
pmd\"").

Not sure if this issue is related to some random segfaults I've observed
recently though.  My last kernel build contains 0ccf7f168e17bb7e but
does not contain 624a2c94f5b7a081.

> 
> The reason is: when fork(), parent process use pmd_wrprotect() to clear
> huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW); then pte_mkdirty() set
> _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages;
> once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW
> machanism fails; and at last memory corruption occurred between parent
> and child processes.
> 
> So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_
> mkdirty().
> 
> Cc: stable@vger.kernel.org
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> ---
> Note: CC sparc maillist because they have similar issues.
>  
>  arch/loongarch/include/asm/pgtable.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
> index 946704bee599..debbe116f105 100644
> --- a/arch/loongarch/include/asm/pgtable.h
> +++ b/arch/loongarch/include/asm/pgtable.h
> @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte)
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> -       pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> +       pte_val(pte) |= _PAGE_MODIFIED;
> +       if (pte_val(pte) & _PAGE_WRITE)
> +               pte_val(pte) |= _PAGE_DIRTY;
>         return pte;
>  }
>  
> @@ -478,7 +480,9 @@ static inline pmd_t pmd_mkclean(pmd_t pmd)
>  
>  static inline pmd_t pmd_mkdirty(pmd_t pmd)
>  {
> -       pmd_val(pmd) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> +       pmd_val(pmd) |= _PAGE_MODIFIED;
> +       if (pmd_val(pmd) & _PAGE_WRITE)
> +               pmd_val(pmd) |= _PAGE_DIRTY;
>         return pmd;
>  }
>
  
Peter Xu Nov. 17, 2022, 3:12 p.m. UTC | #2
Hi, Huacai,

On Thu, Nov 17, 2022 at 12:25:32PM +0800, Huacai Chen wrote:
> Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
> random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
> over dirty bit when thp splits on pmd").
> 
> The reason is: when fork(), parent process use pmd_wrprotect() to clear
> huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW);

Is it safe to drop dirty bit when wr-protect?  It means the mm can reclaim
the page directly assuming the page contains rubbish.

Consider after fork() and memory pressure kicks the kswapd, I don't see
anything stops the kswapd from recycling the pages and lose the data in
both processes.

> then pte_mkdirty() set
> _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages;
> once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW
> machanism fails; and at last memory corruption occurred between parent
> and child processes.
> 
> So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_
> mkdirty().
> 
> Cc: stable@vger.kernel.org
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> ---
> Note: CC sparc maillist because they have similar issues.

I also had a look on sparc64, it seems to not do the same as loongarch
here (not removing dirty in wr-protect):

static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
	pte_t pte = __pte(pmd_val(pmd));

	pte = pte_wrprotect(pte);

	return __pmd(pte_val(pte));
}

static inline pte_t pte_wrprotect(pte_t pte)
{
	unsigned long val = pte_val(pte), tmp;

	__asm__ __volatile__(
	"\n661:	andn		%0, %3, %0\n"
	"	nop\n"
	"\n662:	nop\n"
	"	nop\n"
	"	.section	.sun4v_2insn_patch, \"ax\"\n"
	"	.word		661b\n"
	"	sethi		%%uhi(%4), %1\n"
	"	sllx		%1, 32, %1\n"
	"	.word		662b\n"
	"	or		%1, %%lo(%4), %1\n"
	"	andn		%0, %1, %0\n"
	"	.previous\n"
	: "=r" (val), "=r" (tmp)
	: "0" (val), "i" (_PAGE_WRITE_4U | _PAGE_W_4U),
	  "i" (_PAGE_WRITE_4V | _PAGE_W_4V));

	return __pte(val);
}

>  
>  arch/loongarch/include/asm/pgtable.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
> index 946704bee599..debbe116f105 100644
> --- a/arch/loongarch/include/asm/pgtable.h
> +++ b/arch/loongarch/include/asm/pgtable.h
> @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte)
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> -	pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> +	pte_val(pte) |= _PAGE_MODIFIED;
> +	if (pte_val(pte) & _PAGE_WRITE)
> +		pte_val(pte) |= _PAGE_DIRTY;

I'm not sure whether mm has rule to always set write bit then set dirty
bit, need to be careful here because the outcome may differ when use:

  pte_mkdirty(pte_mkwrite(pte))
  (expected)

VS:

  pte_mkwrite(pte_mkdirty(pte))
  (dirty not set)

I had a feeling I miss some arch-specific details here on why loongarch
needs such implementation, but I can't quickly tell.

Thanks,

>  	return pte;
>  }
>  
> @@ -478,7 +480,9 @@ static inline pmd_t pmd_mkclean(pmd_t pmd)
>  
>  static inline pmd_t pmd_mkdirty(pmd_t pmd)
>  {
> -	pmd_val(pmd) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> +	pmd_val(pmd) |= _PAGE_MODIFIED;
> +	if (pmd_val(pmd) & _PAGE_WRITE)
> +		pmd_val(pmd) |= _PAGE_DIRTY;
>  	return pmd;
>  }
>  
> -- 
> 2.31.1
>
  
Xi Ruoyao Nov. 17, 2022, 3:42 p.m. UTC | #3
On Thu, 2022-11-17 at 21:59 +0800, Xi Ruoyao wrote:
> Hi Huacai,
> 
> On Thu, 2022-11-17 at 12:25 +0800, Huacai Chen wrote:
> > Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
> > random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
> > over dirty bit when thp splits on pmd").
> 
> Hmm, the pte_mkdirty call is already removed in commit 624a2c94f5b7a081
> ("Partly revert \"mm/thp: carry over dirty bit when thp splits on
> pmd\"").
> 
> Not sure if this issue is related to some random segfaults I've observed
> recently though.  My last kernel build contains 0ccf7f168e17bb7e but
> does not contain 624a2c94f5b7a081.

I can confirm this patch alone (without 624a2c94f5b7a081) fixes the
random segfaults I've recently encountered running GCC testsuite.

Link: https://gcc.gnu.org/pipermail/gcc/2022-November/239857.html
Tested-by: Xi Ruoyao <xry111@xry111.site>
  
Peter Xu Nov. 17, 2022, 6:53 p.m. UTC | #4
On Thu, Nov 17, 2022 at 10:12:07AM -0500, Peter Xu wrote:
> Hi, Huacai,
> 
> On Thu, Nov 17, 2022 at 12:25:32PM +0800, Huacai Chen wrote:
> > Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
> > random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
> > over dirty bit when thp splits on pmd").
> > 
> > The reason is: when fork(), parent process use pmd_wrprotect() to clear
> > huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW);
> 
> Is it safe to drop dirty bit when wr-protect?  It means the mm can reclaim
> the page directly assuming the page contains rubbish.
> 
> Consider after fork() and memory pressure kicks the kswapd, I don't see
> anything stops the kswapd from recycling the pages and lose the data in
> both processes.

Feel free to ignore this question..  I think I got an answer from Hev (and
I then got a follow up question):

https://lore.kernel.org/all/Y3Z9Zf0jARMOkFBq@x1n/

> 
> > then pte_mkdirty() set
> > _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages;
> > once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW
> > machanism fails; and at last memory corruption occurred between parent
> > and child processes.
> > 
> > So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_
> > mkdirty().
> > 
> > Cc: stable@vger.kernel.org
> > Cc: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > ---
> > Note: CC sparc maillist because they have similar issues.
> 
> I also had a look on sparc64, it seems to not do the same as loongarch
> here (not removing dirty in wr-protect):
> 
> static inline pmd_t pmd_wrprotect(pmd_t pmd)
> {
> 	pte_t pte = __pte(pmd_val(pmd));
> 
> 	pte = pte_wrprotect(pte);
> 
> 	return __pmd(pte_val(pte));
> }
> 
> static inline pte_t pte_wrprotect(pte_t pte)
> {
> 	unsigned long val = pte_val(pte), tmp;
> 
> 	__asm__ __volatile__(
> 	"\n661:	andn		%0, %3, %0\n"
> 	"	nop\n"
> 	"\n662:	nop\n"
> 	"	nop\n"
> 	"	.section	.sun4v_2insn_patch, \"ax\"\n"
> 	"	.word		661b\n"
> 	"	sethi		%%uhi(%4), %1\n"
> 	"	sllx		%1, 32, %1\n"
> 	"	.word		662b\n"
> 	"	or		%1, %%lo(%4), %1\n"
> 	"	andn		%0, %1, %0\n"
> 	"	.previous\n"
> 	: "=r" (val), "=r" (tmp)
> 	: "0" (val), "i" (_PAGE_WRITE_4U | _PAGE_W_4U),
> 	  "i" (_PAGE_WRITE_4V | _PAGE_W_4V));
> 
> 	return __pte(val);
> }

(Same here; I just overlooked what does _PAGE_W_4U meant..)

> 
> >  
> >  arch/loongarch/include/asm/pgtable.h | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
> > index 946704bee599..debbe116f105 100644
> > --- a/arch/loongarch/include/asm/pgtable.h
> > +++ b/arch/loongarch/include/asm/pgtable.h
> > @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte)
> >  
> >  static inline pte_t pte_mkdirty(pte_t pte)
> >  {
> > -	pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> > +	pte_val(pte) |= _PAGE_MODIFIED;
> > +	if (pte_val(pte) & _PAGE_WRITE)
> > +		pte_val(pte) |= _PAGE_DIRTY;
> 
> I'm not sure whether mm has rule to always set write bit then set dirty
> bit, need to be careful here because the outcome may differ when use:
> 
>   pte_mkdirty(pte_mkwrite(pte))
>   (expected)
> 
> VS:
> 
>   pte_mkwrite(pte_mkdirty(pte))
>   (dirty not set)
> 
> I had a feeling I miss some arch-specific details here on why loongarch
> needs such implementation, but I can't quickly tell.

After a closer look I think it's fine for loongarch as pte_mkwrite will
also set the dirty bit unconditionally, so at least the two ways will still
generate the same pte (DIRTY+MODIFIED+WRITE).

But this whole thing is still confusing to me.  It'll still be great if
anyone can help explain why the _DIRTY cannot be set only in pte_mkwrite()
if that's the solo place in charge of "whether the pte is writable".

The other follow up question is: how do we mark "this pte contains valid
data" (the common definition of "dirty bit"), while "this pte is not
writable" on loongarch?

It can happen when we're installing a page with non-zero data meanwhile
wr-protected.  That's actually a valid case for userfaultfd wr-protect mode
where user specified UFFDIO_COPY ioctl with flag UFFDIO_COPY_MODE_WP, where
we'll install a non-zero page from user buffer but don't grant write bit.

From code-wise, I think it can be done currently with this on loongarch:

  pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))

Where pte_wrprotect(pte_mkwrite(pte)) is not a no-op but applying MODIFIED.

While on many other archs it'll be as simple as:

  pte_mkdirty(pte)

But that's really error-prone and not obvious.

Copying Hev too.

Thanks,
  
hev Nov. 19, 2022, 2:20 p.m. UTC | #5
Hi, Peter,

On Fri, Nov 18, 2022 at 2:53 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Nov 17, 2022 at 10:12:07AM -0500, Peter Xu wrote:
> > Hi, Huacai,
> >
> > On Thu, Nov 17, 2022 at 12:25:32PM +0800, Huacai Chen wrote:
> > > Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
> > > random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
> > > over dirty bit when thp splits on pmd").
> > >
> > > The reason is: when fork(), parent process use pmd_wrprotect() to clear
> > > huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW);
> >
> > Is it safe to drop dirty bit when wr-protect?  It means the mm can reclaim
> > the page directly assuming the page contains rubbish.
> >
> > Consider after fork() and memory pressure kicks the kswapd, I don't see
> > anything stops the kswapd from recycling the pages and lose the data in
> > both processes.
>
> Feel free to ignore this question..  I think I got an answer from Hev (and
> I then got a follow up question):
>
> https://lore.kernel.org/all/Y3Z9Zf0jARMOkFBq@x1n/
>
> >
> > > then pte_mkdirty() set
> > > _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages;
> > > once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW
> > > machanism fails; and at last memory corruption occurred between parent
> > > and child processes.
> > >
> > > So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_
> > > mkdirty().
> > >
> > > Cc: stable@vger.kernel.org
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > > ---
> > > Note: CC sparc maillist because they have similar issues.
> >
> > I also had a look on sparc64, it seems to not do the same as loongarch
> > here (not removing dirty in wr-protect):
> >
> > static inline pmd_t pmd_wrprotect(pmd_t pmd)
> > {
> >       pte_t pte = __pte(pmd_val(pmd));
> >
> >       pte = pte_wrprotect(pte);
> >
> >       return __pmd(pte_val(pte));
> > }
> >
> > static inline pte_t pte_wrprotect(pte_t pte)
> > {
> >       unsigned long val = pte_val(pte), tmp;
> >
> >       __asm__ __volatile__(
> >       "\n661: andn            %0, %3, %0\n"
> >       "       nop\n"
> >       "\n662: nop\n"
> >       "       nop\n"
> >       "       .section        .sun4v_2insn_patch, \"ax\"\n"
> >       "       .word           661b\n"
> >       "       sethi           %%uhi(%4), %1\n"
> >       "       sllx            %1, 32, %1\n"
> >       "       .word           662b\n"
> >       "       or              %1, %%lo(%4), %1\n"
> >       "       andn            %0, %1, %0\n"
> >       "       .previous\n"
> >       : "=r" (val), "=r" (tmp)
> >       : "0" (val), "i" (_PAGE_WRITE_4U | _PAGE_W_4U),
> >         "i" (_PAGE_WRITE_4V | _PAGE_W_4V));
> >
> >       return __pte(val);
> > }
>
> (Same here; I just overlooked what does _PAGE_W_4U meant..)
>
> >
> > >
> > >  arch/loongarch/include/asm/pgtable.h | 8 ++++++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
> > > index 946704bee599..debbe116f105 100644
> > > --- a/arch/loongarch/include/asm/pgtable.h
> > > +++ b/arch/loongarch/include/asm/pgtable.h
> > > @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte)
> > >
> > >  static inline pte_t pte_mkdirty(pte_t pte)
> > >  {
> > > -   pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> > > +   pte_val(pte) |= _PAGE_MODIFIED;
> > > +   if (pte_val(pte) & _PAGE_WRITE)
> > > +           pte_val(pte) |= _PAGE_DIRTY;
> >
> > I'm not sure whether mm has rule to always set write bit then set dirty
> > bit, need to be careful here because the outcome may differ when use:
> >
> >   pte_mkdirty(pte_mkwrite(pte))
> >   (expected)
> >
> > VS:
> >
> >   pte_mkwrite(pte_mkdirty(pte))
> >   (dirty not set)
> >
> > I had a feeling I miss some arch-specific details here on why loongarch
> > needs such implementation, but I can't quickly tell.
>
> After a closer look I think it's fine for loongarch as pte_mkwrite will
> also set the dirty bit unconditionally, so at least the two ways will still
> generate the same pte (DIRTY+MODIFIED+WRITE).
>
> But this whole thing is still confusing to me.  It'll still be great if
> anyone can help explain why the _DIRTY cannot be set only in pte_mkwrite()
> if that's the solo place in charge of "whether the pte is writable".
>
> The other follow up question is: how do we mark "this pte contains valid
> data" (the common definition of "dirty bit"), while "this pte is not
> writable" on loongarch?
>
> It can happen when we're installing a page with non-zero data meanwhile
> wr-protected.  That's actually a valid case for userfaultfd wr-protect mode
> where user specified UFFDIO_COPY ioctl with flag UFFDIO_COPY_MODE_WP, where
> we'll install a non-zero page from user buffer but don't grant write bit.
>
> From code-wise, I think it can be done currently with this on loongarch:
>
>   pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))
>
> Where pte_wrprotect(pte_mkwrite(pte)) is not a no-op but applying MODIFIED.

We would like to note that on LoongArch (for misunderstanding naming):
* _PAGE_DIRTY meaning hardware writable.
* _PAGE_WRITE meaning software writable.
* _PAGE_MODIFIED meaning software dirty, this page contains updated valid data.

PTE APIs:
* pte_mkwrite: Allow to write, only needs set _PAGE_WRITE.
* pte_mkdirty: Mark as dirty, only needs set _PAGE_MODIFIED.
* pte_dirty: Test is dirty, only test _PAGE_MODIFIED.
* pte_wrprotect: Clear both writable, force to raise exception to
handle_mm_fault.

If a pte is only set _PAGE_WRITE without _PAGE_DIRTY by pte_mkwrite,
then a write memory access will cause mmu exception, and the
(_PAGE_DIRTY|_PAGE_MODIFIED) will be set in this exception handler. I
think the _PAGE_DIRTY is also possible to set in pte_mkwrite for
speedup, then _PAGE_MODIFIED must be set at the same time. To avoid
the page data being modified but not detected by pte_dirty. (Current
code may needs to fix

pte_mkdirty mark pte as dirty is the main function, It can also make
pte writeable by hardware(_PAGE_DIRTY) for speedup (too) if and only
if the pte is writable(_PAGE_WRITE). (mkdirty sets _PAGE_DIRTY
unconditionally is the root cause of the huge page COW issue.

For write-protection, pte_wrprotect will clear both writable(software
and hardware) in pte to force a MMU exception to handle_mm_fault.

So yeah, the pte marked as dirty(_PAGE_MODIFIED) and without any
writable in the following code:

  pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))

Regards,
Ray

>
> While on many other archs it'll be as simple as:
>
>   pte_mkdirty(pte)
>
> But that's really error-prone and not obvious.
>
> Copying Hev too.
>
> Thanks,
>
> --
> Peter Xu
>
  
Peter Xu Nov. 21, 2022, 6:31 p.m. UTC | #6
On Sat, Nov 19, 2022 at 10:20:20PM +0800, hev wrote:
> Hi, Peter,

Hev,

> 
> On Fri, Nov 18, 2022 at 2:53 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Nov 17, 2022 at 10:12:07AM -0500, Peter Xu wrote:
> > > Hi, Huacai,
> > >
> > > On Thu, Nov 17, 2022 at 12:25:32PM +0800, Huacai Chen wrote:
> > > > Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
> > > > random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
> > > > over dirty bit when thp splits on pmd").
> > > >
> > > > The reason is: when fork(), parent process use pmd_wrprotect() to clear
> > > > huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW);
> > >
> > > Is it safe to drop dirty bit when wr-protect?  It means the mm can reclaim
> > > the page directly assuming the page contains rubbish.
> > >
> > > Consider after fork() and memory pressure kicks the kswapd, I don't see
> > > anything stops the kswapd from recycling the pages and lose the data in
> > > both processes.
> >
> > Feel free to ignore this question..  I think I got an answer from Hev (and
> > I then got a follow up question):
> >
> > https://lore.kernel.org/all/Y3Z9Zf0jARMOkFBq@x1n/
> >
> > >
> > > > then pte_mkdirty() set
> > > > _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages;
> > > > once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW
> > > > machanism fails; and at last memory corruption occurred between parent
> > > > and child processes.
> > > >
> > > > So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_
> > > > mkdirty().
> > > >
> > > > Cc: stable@vger.kernel.org
> > > > Cc: Peter Xu <peterx@redhat.com>
> > > > Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> > > > ---
> > > > Note: CC sparc maillist because they have similar issues.
> > >
> > > I also had a look on sparc64, it seems to not do the same as loongarch
> > > here (not removing dirty in wr-protect):
> > >
> > > static inline pmd_t pmd_wrprotect(pmd_t pmd)
> > > {
> > >       pte_t pte = __pte(pmd_val(pmd));
> > >
> > >       pte = pte_wrprotect(pte);
> > >
> > >       return __pmd(pte_val(pte));
> > > }
> > >
> > > static inline pte_t pte_wrprotect(pte_t pte)
> > > {
> > >       unsigned long val = pte_val(pte), tmp;
> > >
> > >       __asm__ __volatile__(
> > >       "\n661: andn            %0, %3, %0\n"
> > >       "       nop\n"
> > >       "\n662: nop\n"
> > >       "       nop\n"
> > >       "       .section        .sun4v_2insn_patch, \"ax\"\n"
> > >       "       .word           661b\n"
> > >       "       sethi           %%uhi(%4), %1\n"
> > >       "       sllx            %1, 32, %1\n"
> > >       "       .word           662b\n"
> > >       "       or              %1, %%lo(%4), %1\n"
> > >       "       andn            %0, %1, %0\n"
> > >       "       .previous\n"
> > >       : "=r" (val), "=r" (tmp)
> > >       : "0" (val), "i" (_PAGE_WRITE_4U | _PAGE_W_4U),
> > >         "i" (_PAGE_WRITE_4V | _PAGE_W_4V));
> > >
> > >       return __pte(val);
> > > }
> >
> > (Same here; I just overlooked what does _PAGE_W_4U meant..)
> >
> > >
> > > >
> > > >  arch/loongarch/include/asm/pgtable.h | 8 ++++++--
> > > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
> > > > index 946704bee599..debbe116f105 100644
> > > > --- a/arch/loongarch/include/asm/pgtable.h
> > > > +++ b/arch/loongarch/include/asm/pgtable.h
> > > > @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte)
> > > >
> > > >  static inline pte_t pte_mkdirty(pte_t pte)
> > > >  {
> > > > -   pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> > > > +   pte_val(pte) |= _PAGE_MODIFIED;
> > > > +   if (pte_val(pte) & _PAGE_WRITE)
> > > > +           pte_val(pte) |= _PAGE_DIRTY;
> > >
> > > I'm not sure whether mm has rule to always set write bit then set dirty
> > > bit, need to be careful here because the outcome may differ when use:
> > >
> > >   pte_mkdirty(pte_mkwrite(pte))
> > >   (expected)
> > >
> > > VS:
> > >
> > >   pte_mkwrite(pte_mkdirty(pte))
> > >   (dirty not set)
> > >
> > > I had a feeling I miss some arch-specific details here on why loongarch
> > > needs such implementation, but I can't quickly tell.
> >
> > After a closer look I think it's fine for loongarch as pte_mkwrite will
> > also set the dirty bit unconditionally, so at least the two ways will still
> > generate the same pte (DIRTY+MODIFIED+WRITE).
> >
> > But this whole thing is still confusing to me.  It'll still be great if
> > anyone can help explain why the _DIRTY cannot be set only in pte_mkwrite()
> > if that's the solo place in charge of "whether the pte is writable".
> >
> > The other follow up question is: how do we mark "this pte contains valid
> > data" (the common definition of "dirty bit"), while "this pte is not
> > writable" on loongarch?
> >
> > It can happen when we're installing a page with non-zero data meanwhile
> > wr-protected.  That's actually a valid case for userfaultfd wr-protect mode
> > where user specified UFFDIO_COPY ioctl with flag UFFDIO_COPY_MODE_WP, where
> > we'll install a non-zero page from user buffer but don't grant write bit.
> >
> > From code-wise, I think it can be done currently with this on loongarch:
> >
> >   pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))
> >
> > Where pte_wrprotect(pte_mkwrite(pte)) is not a no-op but applying MODIFIED.
> 
> We would like to note that on LoongArch (for misunderstanding naming):
> * _PAGE_DIRTY meaning hardware writable.
> * _PAGE_WRITE meaning software writable.
> * _PAGE_MODIFIED meaning software dirty, this page contains updated valid data.
> 
> PTE APIs:
> * pte_mkwrite: Allow to write, only needs set _PAGE_WRITE.
> * pte_mkdirty: Mark as dirty, only needs set _PAGE_MODIFIED.
> * pte_dirty: Test is dirty, only test _PAGE_MODIFIED.
> * pte_wrprotect: Clear both writable, force to raise exception to
> handle_mm_fault.
> 
> If a pte is only set _PAGE_WRITE without _PAGE_DIRTY by pte_mkwrite,
> then a write memory access will cause mmu exception, and the
> (_PAGE_DIRTY|_PAGE_MODIFIED) will be set in this exception handler. I
> think the _PAGE_DIRTY is also possible to set in pte_mkwrite for
> speedup, then _PAGE_MODIFIED must be set at the same time. To avoid
> the page data being modified but not detected by pte_dirty. (Current
> code may needs to fix
> 
> pte_mkdirty mark pte as dirty is the main function, It can also make
> pte writeable by hardware(_PAGE_DIRTY) for speedup (too) if and only
> if the pte is writable(_PAGE_WRITE). (mkdirty sets _PAGE_DIRTY
> unconditionally is the root cause of the huge page COW issue.

Yes, and that's why Huacai's patch can fix this issue for Loongarch, but
sparc64 still suffers from it so far.

> 
> For write-protection, pte_wrprotect will clear both writable(software
> and hardware) in pte to force a MMU exception to handle_mm_fault.
> 
> So yeah, the pte marked as dirty(_PAGE_MODIFIED) and without any
> writable in the following code:
> 
>   pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))

Thanks for the further explanation, Hev.

I think so far I keep the questioning on whether it's a good optimization
to apply _DIRTY in pte_mkdirty here.

Since we have pte_mkwrite() API after all, even if there's an optimization
IMHO it should be done by the kernel generic code already, I don't
immediately see why it needs to be arch-specific.  IOW, I'm not sure
whether such optimization for loongarch will bring any real performance
benefit?

The thing is, generic mm code should already have called pte_mkwrite()
along with pte_mkdirty() when proper, so _DIRTY will be properly applied
even if pte_mkdirty() only applies _MODIFIED in loongarch code without
extra mmu faults - if you see the code base that's really the major cases,
except a few very special conditions where we want to only set _MODIFIED
but may want to keep the pte read-only, but in that case it's never wise to
set _DIRTY in pte_mkdirty anyway.

I just don't know whether there's anything else I could have overlooked, so
maybe removing "set _DIRTY" in pte_mkdirty() will regress in some other way.

For now, I think I'll go ahead and try to propose another trivial fix for
the pmd split generic code so we'll have the pte_mkdirty back (I think I
need to reorder pte_wrprotect() so that it needs to appear after
pte_mkdirty()) but hopefully also work for sparc64; loongarch will also
benefit if before this patch lands.

(Sorry to be off-topic on this thread)
  

Patch

diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 946704bee599..debbe116f105 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -349,7 +349,9 @@  static inline pte_t pte_mkclean(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
+	pte_val(pte) |= _PAGE_MODIFIED;
+	if (pte_val(pte) & _PAGE_WRITE)
+		pte_val(pte) |= _PAGE_DIRTY;
 	return pte;
 }
 
@@ -478,7 +480,9 @@  static inline pmd_t pmd_mkclean(pmd_t pmd)
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	pmd_val(pmd) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
+	pmd_val(pmd) |= _PAGE_MODIFIED;
+	if (pmd_val(pmd) & _PAGE_WRITE)
+		pmd_val(pmd) |= _PAGE_DIRTY;
 	return pmd;
 }