[0/3] mm,thp,rmap: rework the use of subpages_mapcount

Message ID c4b8485b-1f26-1a5f-bdf-c6c22611f610@google.com
Series mm,thp,rmap: rework the use of subpages_mapcount

Message

Hugh Dickins Nov. 18, 2022, 9:08 a.m. UTC
  Linus was underwhelmed by the earlier compound mapcounts series:
this series builds on top of it (as in next-20221117) to follow
up on his suggestions - except rmap.c still using lock_page_memcg(),
since I hesitate to steal the pleasure of deletion from Johannes.

1/3 mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
2/3 mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
3/3 mm,thp,rmap: clean up the end of __split_huge_pmd_locked()

 Documentation/mm/transhuge.rst |  10 +-
 include/linux/mm.h             |  65 +++++++----
 include/linux/rmap.h           |  12 +-
 mm/debug.c                     |   2 +-
 mm/huge_memory.c               |  15 +--
 mm/rmap.c                      | 213 ++++++++++-------------------------
 6 files changed, 119 insertions(+), 198 deletions(-)

Hugh
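
For anyone reading the cover letter without the patches to hand, the heart
of 1/3 and 2/3 is that subpages_mapcount does double duty: its low bits
count PTE-mapped subpages, and each PMD mapping adds a high COMPOUND_MAPPED
bit, so both facts live in one atomic and no bit_spin_lock is needed to keep
them consistent. Below is a tiny userspace model of that encoding (the names
and the 0x800000 value follow the series as I read it; the kernel code
itself is of course more involved):

#include <stdatomic.h>
#include <stdio.h>

#define COMPOUND_MAPPED		0x800000
#define SUBPAGES_MAPPED		(COMPOUND_MAPPED - 1)

static atomic_int subpages_mapcount;	/* stands in for the THP's field */

/* a PTE maps (for the first time) one subpage of the THP */
static void pte_map_subpage(void)   { atomic_fetch_add(&subpages_mapcount, 1); }
static void pte_unmap_subpage(void) { atomic_fetch_sub(&subpages_mapcount, 1); }

/* a PMD maps / unmaps the whole THP */
static void pmd_map_thp(void)   { atomic_fetch_add(&subpages_mapcount, COMPOUND_MAPPED); }
static void pmd_unmap_thp(void) { atomic_fetch_sub(&subpages_mapcount, COMPOUND_MAPPED); }

static void show(const char *when)
{
	int v = atomic_load(&subpages_mapcount);

	printf("%-24s pmd-mapped=%-3s pte-mapped subpages=%d\n", when,
	       v >= COMPOUND_MAPPED ? "yes" : "no", v & SUBPAGES_MAPPED);
}

int main(void)
{
	pte_map_subpage();   show("one PTE mapping:");
	pmd_map_thp();       show("add a PMD mapping:");
	pte_unmap_subpage(); show("drop the PTE mapping:");
	pmd_unmap_thp();     show("drop the PMD mapping:");
	return 0;
}

Readers of rmap.c can decide, from a single atomic read, both whether the
THP is PMD-mapped and how many of its subpages are PTE-mapped.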
  

Comments

Linus Torvalds Nov. 18, 2022, 8:18 p.m. UTC | #1
On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote:
>
> Linus was underwhelmed by the earlier compound mapcounts series:
> this series builds on top of it (as in next-20221117) to follow
> up on his suggestions - except rmap.c still using lock_page_memcg(),
> since I hesitate to steal the pleasure of deletion from Johannes.

This looks good to me. Particularly 2/3 made me go "Aww, yes" but the
overall line removal stats look good too.

That said, I only looked at the patches, and not the end result
itself. But not having the bit spin lock is, I think, a huge
improvement.

I do wonder if this should be now just merged with your previous
series - it looks a bit odd how your previous series adds that
bitlock, only for it to be immediately removed.

But if you think the logic ends up being easier to follow this way as
two separate patch series, I guess I don't care.

And the memcg locking is entirely a separate issue, and I hope
Johannes will deal with that.

Thanks,
              Linus
  
Johannes Weiner Nov. 18, 2022, 8:42 p.m. UTC | #2
On Fri, Nov 18, 2022 at 12:18:42PM -0800, Linus Torvalds wrote:
> On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > Linus was underwhelmed by the earlier compound mapcounts series:
> > this series builds on top of it (as in next-20221117) to follow
> > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > since I hesitate to steal the pleasure of deletion from Johannes.
> 
> This looks good to me. Particularly 2/3 made me go "Aww, yes" but the
> overall line removal stats look good too.
> 
> That said, I only looked at the patches, and not the end result
> itself. But not having the bit spin lock is, I think, a huge
> improvement.
> 
> I do wonder if this should be now just merged with your previous
> series - it looks a bit odd how your previous series adds that
> bitlock, only for it to be immediately removed.
> 
> But if you think the logic ends up being easier to follow this way as
> two separate patch series, I guess I don't care.
> 
> And the memcg locking is entirely a separate issue, and I hope
> Johannes will deal with that.

Yeah, I'll redo the removal on top of this series and resend it.

Thanks
  
Hugh Dickins Nov. 18, 2022, 8:51 p.m. UTC | #3
On Fri, 18 Nov 2022, Linus Torvalds wrote:
> On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > Linus was underwhelmed by the earlier compound mapcounts series:
> > this series builds on top of it (as in next-20221117) to follow
> > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > since I hesitate to steal the pleasure of deletion from Johannes.
> 
> This looks good to me. Particularly 2/3 made me go "Aww, yes" but the
> overall line removal stats look good too.
> 
> That said, I only looked at the patches, and not the end result
> itself. But not having the bit spin lock is, I think, a huge
> improvement.

Great, thanks a lot for looking through.

> 
> I do wonder if this should be now just merged with your previous
> series - it looks a bit odd how your previous series adds that
> bitlock, only for it to be immediately removed.
> 
> But if you think the logic ends up being easier to follow this way as
> two separate patch series, I guess I don't care.

I rather like having its evolution on record there, but that might just
be my sentimentality + laziness.  Kirill did a grand job of reviewing
the first series: I think that, at least for now, it would be easier
for people to review the changes if the two series are not recombined.

But the first series has not yet graduated from mm-unstable,
so if Andrew and/or Kirill also prefer to have them combined into one
bit_spin_lock-less series, that I can do.  (And the end result should be
identical, so would not complicate Johannes's lock_page_memcg() excision.)

Hugh

> 
> And the memcg locking is entirely a separate issue, and I hope
> Johannes will deal with that.
> 
> Thanks,
>               Linus
  
Andrew Morton Nov. 18, 2022, 10:03 p.m. UTC | #4
On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:

> But the first series has not yet graduated from mm-unstable,
> so if Andrew and/or Kirill also prefer to have them combined into one
> bit_spin_lock-less series, that I can do.  (And the end result should be
> identical, so would not complicate Johannes's lock_page_memcg() excision.)

I'd prefer that approach.  It's -rc5 and the earlier "mm,huge,rmap:
unify and speed up compound mapcounts" series has had some testing. 
I'd prefer not to toss it all out and start again.
  
Linus Torvalds Nov. 18, 2022, 10:07 p.m. UTC | #5
On Fri, Nov 18, 2022 at 2:03 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> I'd prefer that approach.

The "that approach" is a bit ambiguous here, particularly considering
how you quoted things.

But I think from the context you meant "keep them as two separate
series, even if the second undoes part of the first and does it
differently".

And that's fine. Even if it's maybe a bit odd to introduce that
locking that then goes away, I can't argue with "the first series was
already reviewed and has gone through a fair amount of testing".

             Linus
  
Hugh Dickins Nov. 18, 2022, 10:10 p.m. UTC | #6
On Fri, 18 Nov 2022, Andrew Morton wrote:
> On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:
> 
> > But the first series has not yet graduated from mm-unstable,
> > so if Andrew and/or Kirill also prefer to have them combined into one
> > bit_spin_lock-less series, that I can do.  (And the end result should be
> > identical, so would not complicate Johannes's lock_page_memcg() excision.)
> 
> I'd prefer that approach.

I think you're saying that you prefer the other approach, to keep the
two series separate (second immediately after the first, or not, doesn't
matter), rather than combined into one bit_spin_lock-less series.
Please clarify! Thanks,

Hugh

> It's -rc5 and the earlier "mm,huge,rmap:
> unify and speed up compound mapcounts" series has had some testing. 
> I'd prefer not to toss it all out and start again.
  
Andrew Morton Nov. 18, 2022, 10:23 p.m. UTC | #7
On Fri, 18 Nov 2022 14:10:32 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:

> On Fri, 18 Nov 2022, Andrew Morton wrote:
> > On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:
> > 
> > > But the first series has not yet graduated from mm-unstable,
> > > so if Andrew and/or Kirill also prefer to have them combined into one
> > > bit_spin_lock-less series, that I can do.  (And the end result should be
> > > identical, so would not complicate Johannes's lock_page_memcg() excision.)
> > 
> > I'd prefer that approach.
> 
> I think you're saying that you prefer the other approach, to keep the
> two series separate (second immediately after the first, or not, doesn't
> matter), rather than combined into one bit_spin_lock-less series.
> Please clarify! Thanks,

Yes, two separate series.   Apologies for the confuddling.
  
Shakeel Butt Nov. 21, 2022, 4:59 p.m. UTC | #8
On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote:
> Linus was underwhelmed by the earlier compound mapcounts series:
> this series builds on top of it (as in next-20221117) to follow
> up on his suggestions - except rmap.c still using lock_page_memcg(),
> since I hesitate to steal the pleasure of deletion from Johannes.
> 

Is there a plan to remove lock_page_memcg() altogether which I missed? I
am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
up in the perf profile on exit path) but if we are removing it then I
should just wait.
  
Linus Torvalds Nov. 21, 2022, 5:16 p.m. UTC | #9
On Mon, Nov 21, 2022 at 8:59 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Is there a plan to remove lock_page_memcg() altogether which I missed? I
> am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> up in the perf profile on exit path)

Yay. It seems I'm not the only one hating it.

> but if we are removing it then I should just wait.

Well, I think Johannes was saying that at least the case I disliked
(the rmap removal from the page table tear-down - I strongly suspect
it's the one you're seeing on your perf profile too) can be removed
entirely as long as it's done under the page table lock (which my
final version of the rmap delaying still was).

See

    https://lore.kernel.org/all/Y2llcRiDLHc2kg%2FN@cmpxchg.org/

for his preliminary patch.

That said, if you have some patch to make it a no-op for _other_
reasons, and it could be done away with _entirely_ (not just for rmap),
then that would be even better. I am not a fan of that lock in
general, but in the teardown rmap path it's actively horrifying
because it is taken one page at a time. So it's taken a *lot*
(although you might not see it if all you run is long-running
benchmarks - it's mainly the "run lots of small scripts" case that
really hits it).

The reason it seems to be so horrifyingly noticeable on the exit path
is that the fork() side already does the rmap stuff (mainly
__page_dup_rmap()) _without_ having to do the lock_page_memcg() dance.

So I really hate that lock. It's completely inconsistent, and it all
feels very wrong. It seemed entirely pointless when I was looking at
the rmap removal path for a single page. The fact that both you and
Johannes seem to be more than ready to just remove it makes me much
happier, because I've never actually known the memcg code enough to do
anything about my simmering hatred.

              Linus
  
Johannes Weiner Nov. 21, 2022, 6:52 p.m. UTC | #10
On Mon, Nov 21, 2022 at 04:59:38PM +0000, Shakeel Butt wrote:
> On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote:
> > Linus was underwhelmed by the earlier compound mapcounts series:
> > this series builds on top of it (as in next-20221117) to follow
> > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > since I hesitate to steal the pleasure of deletion from Johannes.
> 
> Is there a plan to remove lock_page_memcg() altogether which I missed? I
> am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> up in the perf profile on exit path) but if we are removing it then I
> should just wait.

We can remove it for rmap at least, but we might be able to do more.

Besides rmap, we're left with the dirty and writeback page transitions
that wrt cgroups need to be atomic with NR_FILE_DIRTY and NR_WRITEBACK.

Looking through the various callsites, I think we can delete it from
setting and clearing dirty state, as we always hold the page lock (or
the pte lock in some instances of folio_mark_dirty). Both of these are
taken from the cgroup side, so we're good there.

I think we can also remove it when setting writeback, because those
sites have the page locked as well.

That leaves clearing writeback. This can't hold the page lock due to
the atomic context, so currently we need to take lock_page_memcg() as
the lock of last resort.

I wonder if we can have cgroup take the xalock instead: writeback
ending on file pages always acquires the xarray lock. Swap writeback
currently doesn't, but we could make it so (swap_address_space).

The only thing that gives me pause is the !mapping check in
__folio_end_writeback. File and swapcache pages usually have mappings,
and truncation waits for writeback to finish before axing
page->mapping. So AFAICS this can only happen if we call end_writeback
on something that isn't under writeback - in which case the test_clear
will fail and we don't update the stats anyway. But I want to be sure.

Does anybody know from the top of their heads if a page under
writeback could be without a mapping in some weird cornercase?

If we could ensure that the NR_WRITEBACK decs are always protected by
the xalock, we could grab it from mem_cgroup_move_account(), and then
kill lock_page_memcg() altogether.
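
To make that concrete, here is an abridged sketch of where those locks sit
in __folio_end_writeback() today (paraphrased from memory, not the verbatim
source; the tag clearing and per-wb accounting are elided, and
folio_memcg_lock() is the folio spelling of lock_page_memcg()):

bool __folio_end_writeback(struct folio *folio)
{
	long nr = folio_nr_pages(folio);
	struct address_space *mapping = folio_mapping(folio);
	bool ret;

	folio_memcg_lock(folio);	/* the lock of last resort */
	if (mapping && mapping_use_writeback_tags(mapping)) {
		unsigned long flags;

		/* file pages: writeback already cleared under the xarray lock */
		xa_lock_irqsave(&mapping->i_pages, flags);
		ret = folio_test_clear_writeback(folio);
		/* (clear PAGECACHE_TAG_WRITEBACK, per-wb accounting, ...) */
		xa_unlock_irqrestore(&mapping->i_pages, flags);
	} else {
		/* swap writeback (and any !mapping case): no xarray lock today */
		ret = folio_test_clear_writeback(folio);
	}
	if (ret) {
		/* the NR_WRITEBACK / NR_ZONE_WRITE_PENDING decrements */
		lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr);
		zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr);
	}
	folio_memcg_unlock(folio);
	return ret;
}

If the swap branch also took the swap_address_space() xarray lock, then
mem_cgroup_move_account() could take that lock instead, and the memcg lock
above would no longer be needed here.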
  
Hugh Dickins Nov. 22, 2022, 1:32 a.m. UTC | #11
On Mon, 21 Nov 2022, Johannes Weiner wrote:
> On Mon, Nov 21, 2022 at 04:59:38PM +0000, Shakeel Butt wrote:
> > On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote:
> > > Linus was underwhelmed by the earlier compound mapcounts series:
> > > this series builds on top of it (as in next-20221117) to follow
> > > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > > since I hesitate to steal the pleasure of deletion from Johannes.
> > 
> > Is there a plan to remove lock_page_memcg() altogether which I missed? I
> > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> > up in the perf profile on exit path) but if we are removing it then I
> > should just wait.
> 
> We can remove it for rmap at least, but we might be able to do more.

I hope the calls from mm/rmap.c can be deleted before deciding the
bigger picture for lock_page_memcg() itself; getting rid of it would
be very nice, but it has always had a difficult job to do (and you've
devoted lots of good effort to minimizing it).

> 
> Besides rmap, we're left with the dirty and writeback page transitions
> that wrt cgroups need to be atomic with NR_FILE_DIRTY and NR_WRITEBACK.
> 
> Looking through the various callsites, I think we can delete it from
> setting and clearing dirty state, as we always hold the page lock (or
> the pte lock in some instances of folio_mark_dirty). Both of these are
> taken from the cgroup side, so we're good there.
> 
> I think we can also remove it when setting writeback, because those
> sites have the page locked as well.
> 
> That leaves clearing writeback. This can't hold the page lock due to
> the atomic context, so currently we need to take lock_page_memcg() as
> the lock of last resort.
> 
> I wonder if we can have cgroup take the xalock instead: writeback
> ending on file pages always acquires the xarray lock. Swap writeback
> currently doesn't, but we could make it so (swap_address_space).

It's a little bit of a regression to have to take that lock when
ending writeback on swap (compared with the rcu_read_lock() of almost
every lock_page_memcg()); but I suppose if swap had been doing that
all along, like the normal page cache case, I would not be complaining.

> 
> The only thing that gives me pause is the !mapping check in
> __folio_end_writeback. File and swapcache pages usually have mappings,
> and truncation waits for writeback to finish before axing
> page->mapping. So AFAICS this can only happen if we call end_writeback
> on something that isn't under writeback - in which case the test_clear
> will fail and we don't update the stats anyway. But I want to be sure.
> 
> Does anybody know from the top of their heads if a page under
> writeback could be without a mapping in some weird cornercase?

End of writeback has been a persistent troublemaker, in several ways;
I forget whether we are content with it now or not.

I would not trust whatever I think offhand about that !mapping case, but I
was deeper into it two years ago, and find myself saying "Can mapping be
NULL? I don't see how, but allow for that with a WARN_ON_ONCE()" in a
patch I posted then (but it didn't go in, we went in another direction).

I'm pretty sure it never warned once for me, but I probably wasn't doing
enough to test it.  And IIRC I did also think that the !mapping check had
perhaps been copied from a related function, one where it made more sense.

It's also worth noting that the two stats which get decremented there,
NR_WRITEBACK and NR_ZONE_WRITE_PENDING, are two of the three which we
have commented "Skip checking stats known to go negative occasionally"
in mm/vmstat.c: I never did come up with a convincing explanation for
that (Roman had his explanation, but I wasn't quite convinced).
Maybe it would just be wrong to touch them if mapping were NULL.

> 
> If we could ensure that the NR_WRITEBACK decs are always protected by
> the xalock, we could grab it from mem_cgroup_move_account(), and then
> kill lock_page_memcg() altogether.

I suppose so (but I still feel grudging about the xalock for swap).

Hugh
  
Matthew Wilcox Nov. 22, 2022, 5:57 a.m. UTC | #12
On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote:
> That leaves clearing writeback. This can't hold the page lock due to
> the atomic context, so currently we need to take lock_page_memcg() as
> the lock of last resort.
> 
> I wonder if we can have cgroup take the xalock instead: writeback
> ending on file pages always acquires the xarray lock. Swap writeback
> currently doesn't, but we could make it so (swap_address_space).
> 
> The only thing that gives me pause is the !mapping check in
> __folio_end_writeback. File and swapcache pages usually have mappings,
> and truncation waits for writeback to finish before axing
> page->mapping. So AFAICS this can only happen if we call end_writeback
> on something that isn't under writeback - in which case the test_clear
> will fail and we don't update the stats anyway. But I want to be sure.
> 
> Does anybody know from the top of their heads if a page under
> writeback could be without a mapping in some weird cornercase?

I can't think of such a corner case.  We should always wait for
writeback to finish before removing the page from the page cache;
the writeback bit used to be (and kind of still is) an implicit
reference to the page, which means that we can't remove the page
cache's reference to the page without waiting for writeback.

> If we could ensure that the NR_WRITEBACK decs are always protected by
> the xalock, we could grab it from mem_cgroup_move_account(), and then
> kill lock_page_memcg() altogether.

I'm not thrilled by this idea, but I'm not going to veto it.
  
Johannes Weiner Nov. 22, 2022, 6:55 a.m. UTC | #13
On Tue, Nov 22, 2022 at 05:57:42AM +0000, Matthew Wilcox wrote:
> On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote:
> > That leaves clearing writeback. This can't hold the page lock due to
> > the atomic context, so currently we need to take lock_page_memcg() as
> > the lock of last resort.
> > 
> > I wonder if we can have cgroup take the xalock instead: writeback
> > ending on file pages always acquires the xarray lock. Swap writeback
> > currently doesn't, but we could make it so (swap_address_space).
> > 
> > The only thing that gives me pause is the !mapping check in
> > __folio_end_writeback. File and swapcache pages usually have mappings,
> > and truncation waits for writeback to finish before axing
> > page->mapping. So AFAICS this can only happen if we call end_writeback
> > on something that isn't under writeback - in which case the test_clear
> > will fail and we don't update the stats anyway. But I want to be sure.
> > 
> > Does anybody know from the top of their heads if a page under
> > writeback could be without a mapping in some weird cornercase?
> 
> I can't think of such a corner case.  We should always wait for
> writeback to finish before removing the page from the page cache;
> the writeback bit used to be (and kind of still is) an implicit
> reference to the page, which means that we can't remove the page
> cache's reference to the page without waiting for writeback.

Great, thanks!

> > If we could ensure that the NR_WRITEBACK decs are always protected by
> > the xalock, we could grab it from mem_cgroup_move_account(), and then
> > kill lock_page_memcg() altogether.
> 
> I'm not thrilled by this idea, but I'm not going to veto it.

Ok, I'm also happy to drop this one.

Certainly, the rmap one is the lowest-hanging fruit. I have the patch
rebased against Hugh's series in mm-unstable; I'll wait for that to
settle down, and then send an updated version to Andrew.
  
Shakeel Butt Nov. 22, 2022, 4:27 p.m. UTC | #14
On Mon, Nov 21, 2022 at 09:16:58AM -0800, Linus Torvalds wrote:
> On Mon, Nov 21, 2022 at 8:59 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > Is there a plan to remove lock_page_memcg() altogether which I missed? I
> > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> > up in the perf profile on exit path)
> 
> Yay. It seems I'm not the only one hating it.
> 
> > but if we are removing it then I should just wait.
> 
> Well, I think Johannes was saying that at least the case I disliked
> (the rmap removal from the page table tear-down - I strongly suspect
> it's the one you're seeing on your perf profile too)

Yes indeed that is the one.

-   99.89%     0.00%  fork-large-mmap  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     entry_SYSCALL_64_after_hwframe                               
   - do_syscall_64                                                
      - 48.94% __x64_sys_exit_group                               
           do_group_exit                                          
         - do_exit                                                
            - 48.94% exit_mm                                      
                 mmput                                            
               - __mmput                                          
                  - exit_mmap                                     
                     - 48.61% unmap_vmas                          
                        - 48.61% unmap_single_vma                 
                           - unmap_page_range                     
                              - 48.60% zap_p4d_range              
                                 - 44.66% zap_pte_range           
                                    + 12.61% tlb_flush_mmu        
                                    - 9.38% page_remove_rmap      
                                         2.50% lock_page_memcg    
                                         2.37% unlock_page_memcg  
                                         0.61% PageHuge           
                                      4.80% vm_normal_page        
                                      2.56% __tlb_remove_page_size
                                      0.85% lock_page_memcg       
                                      0.53% PageHuge              
                                   2.22% __tlb_remove_page_size   
                                   0.93% vm_normal_page           
                                   0.72% page_remove_rmap

> can be removed
> entirely as long as it's done under the page table lock (which my
> final version of the rmap delaying still was).
> 
> See
> 
>     https://lore.kernel.org/all/Y2llcRiDLHc2kg%2FN@cmpxchg.org/
> 
> for his preliminary patch.
> 
> That said, if you have some patch to make it a no-op for _other_
> reasons, and could be done away with _entirely_ (not just for rmap),
> then that would be even better.

I am actually looking at deprecating the whole "move charge"
functionality of cgroup-v1, i.e. the underlying reason lock_page_memcg
exists. That already does not work for a couple of cases like partially
mapped THP and madv_free'd pages. Though that deprecation process would
take some time. In the meantime I was looking at whether we can make these
functions a nop for cgroup-v2.

thanks,
Shakeel
  
Shakeel Butt Nov. 22, 2022, 4:30 p.m. UTC | #15
On Tue, Nov 22, 2022 at 01:55:39AM -0500, Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 05:57:42AM +0000, Matthew Wilcox wrote:
> > On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote:
> > > That leaves clearing writeback. This can't hold the page lock due to
> > > the atomic context, so currently we need to take lock_page_memcg() as
> > > the lock of last resort.
> > > 
> > > I wonder if we can have cgroup take the xalock instead: writeback
> > > ending on file pages always acquires the xarray lock. Swap writeback
> > > currently doesn't, but we could make it so (swap_address_space).
> > > 
> > > The only thing that gives me pause is the !mapping check in
> > > __folio_end_writeback. File and swapcache pages usually have mappings,
> > > and truncation waits for writeback to finish before axing
> > > page->mapping. So AFAICS this can only happen if we call end_writeback
> > > on something that isn't under writeback - in which case the test_clear
> > > will fail and we don't update the stats anyway. But I want to be sure.
> > > 
> > > Does anybody know from the top of their heads if a page under
> > > writeback could be without a mapping in some weird cornercase?
> > 
> > I can't think of such a corner case.  We should always wait for
> > writeback to finish before removing the page from the page cache;
> > the writeback bit used to be (and kind of still is) an implicit
> > reference to the page, which means that we can't remove the page
> > cache's reference to the page without waiting for writeback.
> 
> Great, thanks!
> 
> > > If we could ensure that the NR_WRITEBACK decs are always protected by
> > > the xalock, we could grab it from mem_cgroup_move_account(), and then
> > > kill lock_page_memcg() altogether.
> > 
> > I'm not thrilled by this idea, but I'm not going to veto it.
> 
> Ok, I'm also happy to drop this one.
> 
> Certainly, the rmap one is the lowest-hanging fruit. I have the patch
> rebased against Hugh's series in mm-unstable; I'll wait for that to
> settle down, and then send an updated version to Andrew.

I am planning to initiate the deprecation of the move charge
functionality of v1. So I would say let's go with the low-hanging fruit for
now and let the slow process of deprecation remove the remaining cases.