[RFC,0/6] mm: improve page allocator scalability via splitting zones

Message ID 20230511065607.37407-1-ying.huang@intel.com
Series mm: improve page allocator scalability via splitting zones

Message

Huang, Ying May 11, 2023, 6:56 a.m. UTC
  The patchset is based on upstream v6.3.

More and more cores are being put into one physical CPU (usually one NUMA
node too).  In 2023, a high-end server CPU has 56, 64, or more cores, and
even more cores per physical CPU are planned for future CPUs.  In most
cases, all cores in one physical CPU contend for page allocation on one
zone.  This causes heavy zone lock contention in some workloads, and the
situation will only become worse in the future.

For example, on a 2-socket Intel server machine with 224 logical CPUs,
when the kernel is built with `make -j224`, the zone lock contention
cycles% can reach about 12.7%.

To improve page allocation scalability, this series generally creates
one zone instance for roughly every 256 GB of memory of a zone type.
That is, one large zone type is split into multiple zone instances.
Different logical CPUs then prefer different zone instances based on
the logical CPU number, so fewer logical CPUs contend on any one zone
and scalability improves.

With the series, the zone lock contention cycles% drops to less than
1.6% in the above kbuild test case when 4 zone instances are created
for ZONE_NORMAL.

We also tested the series with will-it-scale/page_fault1 using 16
processes.  With the optimization, the benchmark score increases by up
to 18.2% and the zone lock contention drops from 13.01% to 0.56%.

An alternative way to create multiple zone instances for a zone type
would be to base the number of instances on the total number of logical
CPUs.  We chose memory size because it is easier to implement, and in
most cases the more cores a system has, the more memory it has.  Also,
on systems with more memory, the performance requirements on the page
allocator are usually higher.
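
To make the memory-size-based split and the CPU-to-instance mapping
concrete, here is a minimal sketch of the idea (the names and constants
below are illustrative assumptions for this cover letter, not the actual
interfaces used in the patches):

/*
 * Illustrative sketch only: split each zone type into instances of
 * roughly 256 GB each, and let each logical CPU prefer one instance
 * based on its number.  All names here are made up for this example.
 */
#define ZONE_INSTANCE_BYTES (256ULL << 30)      /* ~256 GB per instance */

static inline unsigned int nr_zone_instances(unsigned long long zone_bytes)
{
        unsigned long long n;

        n = (zone_bytes + ZONE_INSTANCE_BYTES - 1) / ZONE_INSTANCE_BYTES;
        return n ? (unsigned int)n : 1;         /* at least one instance */
}

static inline unsigned int preferred_zone_instance(unsigned int cpu,
                                                   unsigned int nr_instances)
{
        /* spread logical CPUs round-robin over the zone instances */
        return cpu % nr_instances;
}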

Best Regards,
Huang, Ying
  

Comments

Jonathan Cameron May 11, 2023, 10:30 a.m. UTC | #1
On Thu, 11 May 2023 14:56:01 +0800
Huang Ying <ying.huang@intel.com> wrote:

> The patchset is based on upstream v6.3.
> 
> More and more cores are put in one physical CPU (usually one NUMA node
> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
> Even more cores per physical CPU are planned for future CPUs.  While
> all cores in one physical CPU will contend for the page allocation on
> one zone in most cases.  This causes heavy zone lock contention in
> some workloads.  And the situation will become worse and worse in the
> future.
> 
> For example, on an 2-socket Intel server machine with 224 logical
> CPUs, if the kernel is built with `make -j224`, the zone lock
> contention cycles% can reach up to about 12.7%.
> 
> To improve the scalability of the page allocation, in this series, we
> will create one zone instance for each about 256 GB memory of a zone
> type generally.  That is, one large zone type will be split into
> multiple zone instances.  Then, different logical CPUs will prefer
> different zone instances based on the logical CPU No.  So the total
> number of logical CPUs contend on one zone will be reduced.  Thus the
> scalability is improved.
> 
> With the series, the zone lock contention cycles% reduces to less than
> 1.6% in the above kbuild test case when 4 zone instances are created
> for ZONE_NORMAL.
> 
> Also tested the series with the will-it-scale/page_fault1 with 16
> processes.  With the optimization, the benchmark score increases up to
> 18.2% and the zone lock contention reduces from 13.01% to 0.56%.
> 
> To create multiple zone instances for a zone type, another choice is
> to create zone instances based on the total number of logical CPUs.
> We choose to use memory size because it is easier to be implemented.
> In most cases, the more the cores, the larger the memory size is.
> And, on system with larger memory size, the performance requirement of
> the page allocator is usually higher.
> 
> Best Regards,
> Huang, Ying
> 
Hi,

Interesting idea.  I'm curious, though, whether this can suffer from
imbalance problems where, due to uneven allocations from particular CPUs,
you can end up with all page faults happening in one zone and the original
contention problem coming back?  Or am I missing some process that will
result in that imbalance being corrected?

Jonathan
  
Arjan van de Ven May 11, 2023, 1:07 p.m. UTC | #2
On 5/11/2023 3:30 AM, Jonathan Cameron wrote:

> Hi,
> 
> Interesting idea.  I'm curious though on whether this can suffer from
> imbalance problems where due to uneven allocations from particular CPUs
> you can end up with all page faults happening in one zone and the original
> contention problem coming back?  Or am I missing some process that will
> result in that imbalance being corrected?
> 
> Jonathan

Well, the first line of defense is the per-CPU page lists...
it can well be that a couple of CPUs, all in the same zone, hit some
high-frequency pattern... that by itself isn't the real issue.  Note the
"a couple".  It becomes a problem if "a high number" start hitting this...
And by splitting the total into smaller pieces, that is going to be much,
much less likely, since the number of CPUs per zone is just smaller.
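
As a rough sketch of the two levels being talked about here (simplified
and illustrative only; the two helpers stand in for the real PCP and
buddy free-list paths in mm/page_alloc.c and are not actual kernel
functions):

#include <linux/mmzone.h>
#include <linux/spinlock.h>

/* invented helpers, standing in for the real PCP and buddy paths */
struct page *take_from_pcp_list(struct zone *zone, unsigned int order);
struct page *take_from_zone_freelist(struct zone *zone, unsigned int order);

static struct page *alloc_page_sketch(struct zone *zone, unsigned int order)
{
        struct page *page;

        /* level 1: per-CPU page lists, no zone->lock needed */
        page = take_from_pcp_list(zone, order);
        if (page)
                return page;

        /* level 2: shared per-zone free lists, serialized by zone->lock */
        spin_lock(&zone->lock);
        page = take_from_zone_freelist(zone, order);
        spin_unlock(&zone->lock);

        return page;    /* NULL would mean falling back to reclaim etc. */
}

The zone lock is only touched on a PCP miss; contention becomes visible
when many CPUs miss at the same time on the same zone.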
  
Dave Hansen May 11, 2023, 2:23 p.m. UTC | #3
On 5/10/23 23:56, Huang Ying wrote:
> To improve the scalability of the page allocation, in this series, we
> will create one zone instance for each about 256 GB memory of a zone
> type generally.  That is, one large zone type will be split into
> multiple zone instances. 

A few anecdotes for why I think _some_ people will like this:

Some Intel hardware has a "RAM" caching mechanism.  It either caches
DRAM in High-Bandwidth Memory or Persistent Memory in DRAM.  This cache
is direct-mapped and can have lots of collisions.  One way to prevent
collisions is to chop up the physical memory into cache-sized zones and
let users choose to allocate from one zone.  That fixes the conflicts.

Some other Intel hardware has ways to chop a NUMA node representing a
single socket into slices.  Usually one slice gets a memory controller
and its closest cores.  Intel calls these approaches Cluster on Die or
Sub-NUMA Clustering, and users can select them from the BIOS.

In both of these cases, users have reported scalability improvements.
We've gone as far as to suggest the socket-splitting options to folks
today who are hitting zone scalability issues on that hardware.

That said, those _same_ users sometimes come back and say something
along the lines of: "So... we've got this app that allocates a big hunk
of memory.  It's going slower than before."  They're filling up one of
the chopped-up zones, hitting _some_ kind of undesirable reclaim
behavior and they want their humpty-dumpty zones put back together again
... without hurting scalability.  Some people will never be happy. :)

Anyway, _if_ you do this, you might also consider being able to
dynamically adjust a CPU's zonelists somehow.  That would relieve
pressure on one zone for those uneven allocations.  That wasn't an
option in the two cases above because users had ulterior motives for
sticking inside a single zone.  But, in your case, the zones really do
have equivalent performance.
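
To make the dynamic-adjustment suggestion a bit more concrete, one way
such a rebalance could look (purely hypothetical sketch; the
set_preferred_zone_instance() helper does not exist and is invented for
illustration):

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/* hypothetical helper: rebuild this CPU's zonelist to prefer "zone" */
void set_preferred_zone_instance(int cpu, struct zone *zone);

/*
 * Hypothetical: point a CPU's zonelist at the zone instance (of the
 * same type and node) that currently has the most free pages, to
 * relieve pressure caused by uneven allocations.
 */
static void rebalance_cpu_zonelist(int cpu, struct zone **instances, int nr)
{
        unsigned long most_free = 0;
        int i, best = 0;

        for (i = 0; i < nr; i++) {
                unsigned long free = zone_page_state(instances[i],
                                                     NR_FREE_PAGES);
                if (free > most_free) {
                        most_free = free;
                        best = i;
                }
        }

        set_preferred_zone_instance(cpu, instances[best]);
}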
  
Michal Hocko May 11, 2023, 3:05 p.m. UTC | #4
On Thu 11-05-23 14:56:01, Huang Ying wrote:
> The patchset is based on upstream v6.3.
> 
> More and more cores are put in one physical CPU (usually one NUMA node
> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
> Even more cores per physical CPU are planned for future CPUs.  While
> all cores in one physical CPU will contend for the page allocation on
> one zone in most cases.  This causes heavy zone lock contention in
> some workloads.  And the situation will become worse and worse in the
> future.
> 
> For example, on an 2-socket Intel server machine with 224 logical
> CPUs, if the kernel is built with `make -j224`, the zone lock
> contention cycles% can reach up to about 12.7%.
> 
> To improve the scalability of the page allocation, in this series, we
> will create one zone instance for each about 256 GB memory of a zone
> type generally.  That is, one large zone type will be split into
> multiple zone instances.  Then, different logical CPUs will prefer
> different zone instances based on the logical CPU No.  So the total
> number of logical CPUs contend on one zone will be reduced.  Thus the
> scalability is improved.

It is not really clear to me why you need a new zone for all this rather
than partitioning free lists internally within the zone.  Essentially,
that would increase the current two-level system to three: per-CPU
caches, per-CPU arenas and a global fallback.

I am also missing some information on why PCP cache tuning is not
sufficient.
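
For concreteness, the extra level described above might look roughly
like the following (entirely hypothetical; no such structure exists
today and all names are invented):

#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * Hypothetical "per-CPU arena" layer sitting between the per-CPU page
 * lists and the global zone free lists.  One arena would be shared by
 * a small group of CPUs (e.g. the SMT siblings of one core), so its
 * lock is contended by only a few CPUs, while refills from and spills
 * to the zone happen in larger batches under zone->lock.
 */
struct page_arena {
        spinlock_t       lock;          /* shared by a handful of CPUs */
        struct list_head free_pages;    /* pages cached at this level */
        unsigned long    count;         /* pages currently in the arena */
        unsigned long    high;          /* spill to the zone above this */
        unsigned long    batch;         /* refill/spill granularity */
};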
  
Huang, Ying May 12, 2023, 2:55 a.m. UTC | #5
Hi, Michal,

Thanks for comments!

Michal Hocko <mhocko@suse.com> writes:

> On Thu 11-05-23 14:56:01, Huang Ying wrote:
>> The patchset is based on upstream v6.3.
>> 
>> More and more cores are put in one physical CPU (usually one NUMA node
>> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
>> Even more cores per physical CPU are planned for future CPUs.  While
>> all cores in one physical CPU will contend for the page allocation on
>> one zone in most cases.  This causes heavy zone lock contention in
>> some workloads.  And the situation will become worse and worse in the
>> future.
>> 
>> For example, on an 2-socket Intel server machine with 224 logical
>> CPUs, if the kernel is built with `make -j224`, the zone lock
>> contention cycles% can reach up to about 12.7%.
>> 
>> To improve the scalability of the page allocation, in this series, we
>> will create one zone instance for each about 256 GB memory of a zone
>> type generally.  That is, one large zone type will be split into
>> multiple zone instances.  Then, different logical CPUs will prefer
>> different zone instances based on the logical CPU No.  So the total
>> number of logical CPUs contend on one zone will be reduced.  Thus the
>> scalability is improved.
>
> It is not really clear to me why you need a new zone for all this rather
> than partition free lists internally within the zone? Essentially to
> increase the current two level system to 3: per cpu caches, per cpu
> arenas and global fallback.

Sorry, I didn't get your idea here.  What are per-CPU arenas?  What's
the difference between them and the per-CPU caches (PCP)?

> I am also missing some information why pcp caches tunning is not
> sufficient.

PCP does improve page allocation scalability greatly!  But it doesn't
help much for workloads that allocate pages on one CPU and free them on
different CPUs.  PCP tuning can improve page allocation scalability for
a given workload greatly, but it's not trivial to find the best tuning
parameters for various workloads and workload runtime states (workloads
may have different loads and memory requirements at different times).
And we may run different workloads on different logical CPUs of the
system, which also makes it hard to find the best PCP tuning globally.
It would be better to find a solution that improves page allocation
scalability out of the box or automatically.  Do you agree?

Best Regards,
Huang, Ying
  
Huang, Ying May 12, 2023, 3:08 a.m. UTC | #6
Hi, Dave,

Dave Hansen <dave.hansen@intel.com> writes:

> On 5/10/23 23:56, Huang Ying wrote:
>> To improve the scalability of the page allocation, in this series, we
>> will create one zone instance for each about 256 GB memory of a zone
>> type generally.  That is, one large zone type will be split into
>> multiple zone instances. 
>
> A few anecdotes for why I think _some_ people will like this:
>
> Some Intel hardware has a "RAM" caching mechanism.  It either caches
> DRAM in High-Bandwidth Memory or Persistent Memory in DRAM.  This cache
> is direct-mapped and can have lots of collisions.  One way to prevent
> collisions is to chop up the physical memory into cache-sized zones and
> let users choose to allocate from one zone.  That fixes the conflicts.
>
> Some other Intel hardware a ways to chop a NUMA node representing a
> single socket into slices.  Usually one slice gets a memory controller
> and its closest cores.  Intel calls these approaches Cluster on Die or
> Sub-NUMA Clustering and users can select it from the BIOS.
>
> In both of these cases, users have reported scalability improvements.
> We've gone as far as to suggest the socket-splitting options to folks
> today who are hitting zone scalability issues on that hardware.
>
> That said, those _same_ users sometimes come back and say something
> along the lines of: "So... we've got this app that allocates a big hunk
> of memory.  It's going slower than before."  They're filling up one of
> the chopped-up zones, hitting _some_ kind of undesirable reclaim
> behavior and they want their humpty-dumpty zones put back together again
> ... without hurting scalability.  Some people will never be happy. :)

Thanks a lot for your valuable input!

> Anyway, _if_ you do this, you might also consider being able to
> dynamically adjust a CPU's zonelists somehow.  That would relieve
> pressure on one zone for those uneven allocations.  That wasn't an
> option in the two cases above because users had ulterior motives for
> sticking inside a single zone.  But, in your case, the zones really do
> have equivalent performance.

Yes.  For the requirements you mentioned above, we need a mechanism to
adjust a CPU's zonelists dynamically.  I will not implement that in this
series, but I think it's doable on top of the multiple zone instances
per zone type implementation in this series.

Best Regards,
Huang, Ying
  
Michal Hocko May 15, 2023, 11:14 a.m. UTC | #7
On Fri 12-05-23 10:55:21, Huang, Ying wrote:
> Hi, Michal,
> 
> Thanks for comments!
> 
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 11-05-23 14:56:01, Huang Ying wrote:
> >> The patchset is based on upstream v6.3.
> >> 
> >> More and more cores are put in one physical CPU (usually one NUMA node
> >> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
> >> Even more cores per physical CPU are planned for future CPUs.  While
> >> all cores in one physical CPU will contend for the page allocation on
> >> one zone in most cases.  This causes heavy zone lock contention in
> >> some workloads.  And the situation will become worse and worse in the
> >> future.
> >> 
> >> For example, on an 2-socket Intel server machine with 224 logical
> >> CPUs, if the kernel is built with `make -j224`, the zone lock
> >> contention cycles% can reach up to about 12.7%.
> >> 
> >> To improve the scalability of the page allocation, in this series, we
> >> will create one zone instance for each about 256 GB memory of a zone
> >> type generally.  That is, one large zone type will be split into
> >> multiple zone instances.  Then, different logical CPUs will prefer
> >> different zone instances based on the logical CPU No.  So the total
> >> number of logical CPUs contend on one zone will be reduced.  Thus the
> >> scalability is improved.
> >
> > It is not really clear to me why you need a new zone for all this rather
> > than partition free lists internally within the zone? Essentially to
> > increase the current two level system to 3: per cpu caches, per cpu
> > arenas and global fallback.
> 
> Sorry, I didn't get your idea here.  What is per cpu arenas?  What's the
> difference between it and per cpu caches (PCP)?

Sorry, I didn't give this much more thought than the above.  Essentially,
we have a 2-level system right now.  PCP caches should reduce contention
at the per-CPU level, and that should work reasonably well if you manage
to align the batch sizes to the workload, AFAIK.  If this is not
sufficient, then why add a full zone rather than another level that
caches across a unit larger than a CPU?  Maybe a core?

This might be the wrong way around to go about this, but there is not
much performance analysis about the source of the lock contention, so I
am mostly guessing.

> > I am also missing some information why pcp caches tunning is not
> > sufficient.
> 
> PCP does improve the page allocation scalability greatly!  But it
> doesn't help much for workloads that allocating pages on one CPU and
> free them in different CPUs.  PCP tuning can improve the page allocation
> scalability for a workload greatly.  But it's not trivial to find the
> best tuning parameters for various workloads and workload run time
> statuses (workloads may have different loads and memory requirements at
> different time).  And we may run different workloads on different
> logical CPUs of the system.  This also makes it hard to find the best
> PCP tuning globally.

Yes, this makes sense.  Does that mean that global PCP tuning is not
keeping up and we need to be able to do more auto-tuning on a local
basis rather than globally?

> It would be better to find a solution to improve
> the page allocation scalability out of box or automatically.  Do you
> agree?

Yes.
  
Huang, Ying May 16, 2023, 9:38 a.m. UTC | #8
Michal Hocko <mhocko@suse.com> writes:

> On Fri 12-05-23 10:55:21, Huang, Ying wrote:
>> Hi, Michal,
>> 
>> Thanks for comments!
>> 
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 11-05-23 14:56:01, Huang Ying wrote:
>> >> The patchset is based on upstream v6.3.
>> >> 
>> >> More and more cores are put in one physical CPU (usually one NUMA node
>> >> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
>> >> Even more cores per physical CPU are planned for future CPUs.  While
>> >> all cores in one physical CPU will contend for the page allocation on
>> >> one zone in most cases.  This causes heavy zone lock contention in
>> >> some workloads.  And the situation will become worse and worse in the
>> >> future.
>> >> 
>> >> For example, on an 2-socket Intel server machine with 224 logical
>> >> CPUs, if the kernel is built with `make -j224`, the zone lock
>> >> contention cycles% can reach up to about 12.7%.
>> >> 
>> >> To improve the scalability of the page allocation, in this series, we
>> >> will create one zone instance for each about 256 GB memory of a zone
>> >> type generally.  That is, one large zone type will be split into
>> >> multiple zone instances.  Then, different logical CPUs will prefer
>> >> different zone instances based on the logical CPU No.  So the total
>> >> number of logical CPUs contend on one zone will be reduced.  Thus the
>> >> scalability is improved.
>> >
>> > It is not really clear to me why you need a new zone for all this rather
>> > than partition free lists internally within the zone? Essentially to
>> > increase the current two level system to 3: per cpu caches, per cpu
>> > arenas and global fallback.
>> 
>> Sorry, I didn't get your idea here.  What is per cpu arenas?  What's the
>> difference between it and per cpu caches (PCP)?
>
> Sorry, I didn't give this much thought than the above. Essentially, we
> have 2 level system right now. Pcp caches should reduce the contention
> on the per cpu level and that should work reasonably well, if you manage
> to align batch sizes to the workload AFAIK. If this is not sufficient
> then why to add the full zone rather than to add another level that
> caches across a larger than a cpu unit. Maybe a core?
>
> This might be a wrong way around going for this but there is not much
> performance analysis about the source of the lock contention so I am
> mostly guessing.

I guess that page allocation scalability will improve if we put more
pages in the per-CPU caches, or add another level of cache for multiple
logical CPUs, because more page allocation requests can then be satisfied
without acquiring the zone lock.

As with any other caching system, there are always cases where the caches
are drained and too many requests go to the underlying slow layer (the
zone here).  For example, if a workload needs to allocate a huge number
of pages (more than the cache size) in parallel, it will eventually run
into zone lock contention.  The situation becomes worse and worse as we
share one zone among more and more logical CPUs, which is the trend in
industry now.  Per my understanding, that is why we observe the high zone
lock contention cycles in the kbuild test.

So, per my understanding, to improve page allocation scalability in the
bad situations (that is, when caching doesn't work well enough), we need
to restrict the number of logical CPUs that share one zone.  This series
is an attempt at that.  Better caching can increase the good situations
and reduce the bad situations, but it seems hard to eliminate all bad
situations.

From another perspective, we don't install more and more memory per
logical CPU.  This makes it hard to enlarge the default per-CPU cache
size.

>> > I am also missing some information why pcp caches tunning is not
>> > sufficient.
>> 
>> PCP does improve the page allocation scalability greatly!  But it
>> doesn't help much for workloads that allocating pages on one CPU and
>> free them in different CPUs.  PCP tuning can improve the page allocation
>> scalability for a workload greatly.  But it's not trivial to find the
>> best tuning parameters for various workloads and workload run time
>> statuses (workloads may have different loads and memory requirements at
>> different time).  And we may run different workloads on different
>> logical CPUs of the system.  This also makes it hard to find the best
>> PCP tuning globally.
>
> Yes this makes sense. Does that mean that the global pcp tuning is not
> keeping up and we need to be able to do more auto-tuning on local bases
> rather than global?

Similar to the above, I think that PCP greatly helps performance in the
good situations, and splitting zones helps scalability in the bad
situations.  They work at different levels.

As for PCP auto-tuning, I think it's hard to implement it in a way that
resolves all problems (that is, ensures the PCP lists are never drained).

And auto-tuning doesn't sound easy.  Do you have some idea of how to do
that?

>> It would be better to find a solution to improve
>> the page allocation scalability out of box or automatically.  Do you
>> agree?
>
> Yes. 

Best Regards,
Huang, Ying
  
David Hildenbrand May 16, 2023, 10:30 a.m. UTC | #9
On 16.05.23 11:38, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
>> On Fri 12-05-23 10:55:21, Huang, Ying wrote:
>>> Hi, Michal,
>>>
>>> Thanks for comments!
>>>
>>> Michal Hocko <mhocko@suse.com> writes:
>>>
>>>> On Thu 11-05-23 14:56:01, Huang Ying wrote:
>>>>> The patchset is based on upstream v6.3.
>>>>>
>>>>> More and more cores are put in one physical CPU (usually one NUMA node
>>>>> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
>>>>> Even more cores per physical CPU are planned for future CPUs.  While
>>>>> all cores in one physical CPU will contend for the page allocation on
>>>>> one zone in most cases.  This causes heavy zone lock contention in
>>>>> some workloads.  And the situation will become worse and worse in the
>>>>> future.
>>>>>
>>>>> For example, on an 2-socket Intel server machine with 224 logical
>>>>> CPUs, if the kernel is built with `make -j224`, the zone lock
>>>>> contention cycles% can reach up to about 12.7%.
>>>>>
>>>>> To improve the scalability of the page allocation, in this series, we
>>>>> will create one zone instance for each about 256 GB memory of a zone
>>>>> type generally.  That is, one large zone type will be split into
>>>>> multiple zone instances.  Then, different logical CPUs will prefer
>>>>> different zone instances based on the logical CPU No.  So the total
>>>>> number of logical CPUs contend on one zone will be reduced.  Thus the
>>>>> scalability is improved.
>>>>
>>>> It is not really clear to me why you need a new zone for all this rather
>>>> than partition free lists internally within the zone? Essentially to
>>>> increase the current two level system to 3: per cpu caches, per cpu
>>>> arenas and global fallback.
>>>
>>> Sorry, I didn't get your idea here.  What is per cpu arenas?  What's the
>>> difference between it and per cpu caches (PCP)?
>>
>> Sorry, I didn't give this much thought than the above. Essentially, we
>> have 2 level system right now. Pcp caches should reduce the contention
>> on the per cpu level and that should work reasonably well, if you manage
>> to align batch sizes to the workload AFAIK. If this is not sufficient
>> then why to add the full zone rather than to add another level that
>> caches across a larger than a cpu unit. Maybe a core?
>>
>> This might be a wrong way around going for this but there is not much
>> performance analysis about the source of the lock contention so I am
>> mostly guessing.
> 
> I guess that the page allocation scalability will be improved if we put
> more pages in the per CPU caches, or add another level of cache for
> multiple logical CPUs.  Because more page allocation requirements can be
> satisfied without acquiring zone lock.
> 
> As other caching system, there are always cases that the caches are
> drained and too many requirements goes to underlying slow layer (zone
> here).  For example, if a workload needs to allocate a huge number of
> pages (larger than cache size) in parallel, it will run into zone lock
> contention finally.  The situation will became worse and worse if we
> share one zone with more and more logical CPUs.  Which is the trend in
> industry now.  Per my understanding, we can observe the high zone lock
> contention cycles in kbuild test because of that.
> 
> So, per my understanding, to improve the page allocation scalability in
> bad situations (that is, caching doesn't work well enough), we need to
> restrict the number of logical CPUs that share one zone.  This series is
> an attempt for that.  Better caching can increase the good situations
> and reduce the bad situations.  But it seems hard to eliminate all bad
> situations.
> 
>  From another perspective, we don't install more and more memory for each
> logical CPU.  This makes it hard to enlarge the default per-CPU cache
> size.
> 
>>>> I am also missing some information why pcp caches tunning is not
>>>> sufficient.
>>>
>>> PCP does improve the page allocation scalability greatly!  But it
>>> doesn't help much for workloads that allocating pages on one CPU and
>>> free them in different CPUs.  PCP tuning can improve the page allocation
>>> scalability for a workload greatly.  But it's not trivial to find the
>>> best tuning parameters for various workloads and workload run time
>>> statuses (workloads may have different loads and memory requirements at
>>> different time).  And we may run different workloads on different
>>> logical CPUs of the system.  This also makes it hard to find the best
>>> PCP tuning globally.
>>
>> Yes this makes sense. Does that mean that the global pcp tuning is not
>> keeping up and we need to be able to do more auto-tuning on local bases
>> rather than global?
> 
> Similar as above, I think that PCP helps the good situations performance
> greatly, and splitting zone can help the bad situations scalability.
> They are working at the different levels.
> 
> As for PCP auto-tuning, I think that it's hard to implement it to
> resolve all problems (that is, makes PCP never be drained).
> 
> And auto-tuning doesn't sound easy.  Do you have some idea of how to do
> that?

If we could avoid instantiating more zones and rather improve existing 
mechanisms (PCP), that would be much more preferred IMHO. I'm sure it's 
not easy, but that shouldn't stop us from trying ;)

I did not look into the details of this proposal, but seeing the change
in include/linux/page-flags-layout.h scares me.  Further, I'm not so sure
how that change really interacts with hot(un)plug of memory ... at a
quick glance I feel like this series hacks the code such that the split
works based on the boot memory size ...

I agree with Michal that looking into auto-tuning PCP would be 
preferred. If that can't be done, adding another layer might end up 
cleaner and eventually cover more use cases.

[I recall there was once a proposal to add a 3rd layer to limit
fragmentation to individual memory blocks; but the granularity was rather
small and there were also some concerns that I don't recall anymore]
  
Huang, Ying May 17, 2023, 1:34 a.m. UTC | #10
David Hildenbrand <david@redhat.com> writes:

> On 16.05.23 11:38, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>>> On Fri 12-05-23 10:55:21, Huang, Ying wrote:
>>>> Hi, Michal,
>>>>
>>>> Thanks for comments!
>>>>
>>>> Michal Hocko <mhocko@suse.com> writes:
>>>>
>>>>> On Thu 11-05-23 14:56:01, Huang Ying wrote:
>>>>>> The patchset is based on upstream v6.3.
>>>>>>
>>>>>> More and more cores are put in one physical CPU (usually one NUMA node
>>>>>> too).  In 2023, one high-end server CPU has 56, 64, or more cores.
>>>>>> Even more cores per physical CPU are planned for future CPUs.  While
>>>>>> all cores in one physical CPU will contend for the page allocation on
>>>>>> one zone in most cases.  This causes heavy zone lock contention in
>>>>>> some workloads.  And the situation will become worse and worse in the
>>>>>> future.
>>>>>>
>>>>>> For example, on an 2-socket Intel server machine with 224 logical
>>>>>> CPUs, if the kernel is built with `make -j224`, the zone lock
>>>>>> contention cycles% can reach up to about 12.7%.
>>>>>>
>>>>>> To improve the scalability of the page allocation, in this series, we
>>>>>> will create one zone instance for each about 256 GB memory of a zone
>>>>>> type generally.  That is, one large zone type will be split into
>>>>>> multiple zone instances.  Then, different logical CPUs will prefer
>>>>>> different zone instances based on the logical CPU No.  So the total
>>>>>> number of logical CPUs contend on one zone will be reduced.  Thus the
>>>>>> scalability is improved.
>>>>>
>>>>> It is not really clear to me why you need a new zone for all this rather
>>>>> than partition free lists internally within the zone? Essentially to
>>>>> increase the current two level system to 3: per cpu caches, per cpu
>>>>> arenas and global fallback.
>>>>
>>>> Sorry, I didn't get your idea here.  What is per cpu arenas?  What's the
>>>> difference between it and per cpu caches (PCP)?
>>>
>>> Sorry, I didn't give this much thought than the above. Essentially, we
>>> have 2 level system right now. Pcp caches should reduce the contention
>>> on the per cpu level and that should work reasonably well, if you manage
>>> to align batch sizes to the workload AFAIK. If this is not sufficient
>>> then why to add the full zone rather than to add another level that
>>> caches across a larger than a cpu unit. Maybe a core?
>>>
>>> This might be a wrong way around going for this but there is not much
>>> performance analysis about the source of the lock contention so I am
>>> mostly guessing.
>> I guess that the page allocation scalability will be improved if we
>> put
>> more pages in the per CPU caches, or add another level of cache for
>> multiple logical CPUs.  Because more page allocation requirements can be
>> satisfied without acquiring zone lock.
>> As other caching system, there are always cases that the caches are
>> drained and too many requirements goes to underlying slow layer (zone
>> here).  For example, if a workload needs to allocate a huge number of
>> pages (larger than cache size) in parallel, it will run into zone lock
>> contention finally.  The situation will became worse and worse if we
>> share one zone with more and more logical CPUs.  Which is the trend in
>> industry now.  Per my understanding, we can observe the high zone lock
>> contention cycles in kbuild test because of that.
>> So, per my understanding, to improve the page allocation scalability
>> in
>> bad situations (that is, caching doesn't work well enough), we need to
>> restrict the number of logical CPUs that share one zone.  This series is
>> an attempt for that.  Better caching can increase the good situations
>> and reduce the bad situations.  But it seems hard to eliminate all bad
>> situations.
>>  From another perspective, we don't install more and more memory for
>> each
>> logical CPU.  This makes it hard to enlarge the default per-CPU cache
>> size.
>> 
>>>>> I am also missing some information why pcp caches tunning is not
>>>>> sufficient.
>>>>
>>>> PCP does improve the page allocation scalability greatly!  But it
>>>> doesn't help much for workloads that allocating pages on one CPU and
>>>> free them in different CPUs.  PCP tuning can improve the page allocation
>>>> scalability for a workload greatly.  But it's not trivial to find the
>>>> best tuning parameters for various workloads and workload run time
>>>> statuses (workloads may have different loads and memory requirements at
>>>> different time).  And we may run different workloads on different
>>>> logical CPUs of the system.  This also makes it hard to find the best
>>>> PCP tuning globally.
>>>
>>> Yes this makes sense. Does that mean that the global pcp tuning is not
>>> keeping up and we need to be able to do more auto-tuning on local bases
>>> rather than global?
>> Similar as above, I think that PCP helps the good situations
>> performance
>> greatly, and splitting zone can help the bad situations scalability.
>> They are working at the different levels.
>> As for PCP auto-tuning, I think that it's hard to implement it to
>> resolve all problems (that is, makes PCP never be drained).
>> And auto-tuning doesn't sound easy.  Do you have some idea of how to
>> do
>> that?
>
> If we could avoid instantiating more zones and rather improve existing
> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
> it's not easy, but that shouldn't stop us from trying ;)

I do think improving PCP or adding another level of cache will help
performance and scalability.

And I think there is also value in improving the performance of the zone
itself, because there will always be some cases where the zone lock
itself is contended.

That is, PCP and the zone work at different levels, and both deserve to
be improved.  Do you agree?

> I did not look into the details of this proposal, but seeing the
> change in include/linux/page-flags-layout.h scares me.

It's possible for us to use 1 more bit in page->flags.  Do you think
that will cause a severe issue?  Or do you think something else isn't
acceptable?
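
For context, the cost being discussed is roughly the following: the zone
index of every page is packed into page->flags, and the width of that
field is derived from MAX_NR_ZONES (paraphrased and simplified from
include/linux/page-flags-layout.h and include/linux/mm.h):

/* paraphrased: width of the zone field in page->flags */
#if MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#elif MAX_NR_ZONES <= 8
#define ZONES_SHIFT 3
#endif

/* paraphrased: how a page is mapped back to its zone */
static inline enum zone_type page_zonenum(const struct page *page)
{
        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

So multiplying the zone instances per type can push MAX_NR_ZONES past
the next power of two, costing one more bit that the node, section and
other fields packed into page->flags would otherwise be able to use.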

> Further, I'm not so sure how that change really interacts with
> hot(un)plug of memory ... on a quick glimpse I feel like this series
> hacks the code such that such that the split works based on the boot
> memory size ...

Em..., the zone stuff is kind of static now.  It's hard to add a zone at
run time.  So, in this series, we determine the number of zones per zone
type based on the boot memory size.  This may be improved in the future
by pre-allocating some empty zone instances during boot and hot-adding
memory to them.

> I agree with Michal that looking into auto-tuning PCP would be
> preferred. If that can't be done, adding another layer might end up 
> cleaner and eventually cover more use cases.

I do agree that it's valuable to make PCP etc. cover more use cases.  I
just think that this should not prevent us from optimizing zone itself
to cover remaining use cases.

> [I recall there was once a proposal to add a 3rd layer to limit
> fragmenation to individual memory blocks; but the granularity was
> rather small and there were also some concerns that I don't recall
> anymore]

Best Regards,
Huang, Ying
  
David Hildenbrand May 17, 2023, 8:09 a.m. UTC | #11
>> If we could avoid instantiating more zones and rather improve existing
>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
>> it's not easy, but that shouldn't stop us from trying ;)
> 
> I do think improving PCP or adding another level of cache will help
> performance and scalability.
> 
> And, I think that it has value too to improve the performance of zone
> itself.  Because there will be always some cases that the zone lock
> itself is contended.
> 
> That is, PCP and zone works at different level, and both deserve to be
> improved.  Do you agree?

Spoiler: my humble opinion


Well, the zone is kind-of your "global" memory provider, and PCPs cache
a fraction of that precisely to avoid having to mess with that global
data structure and its lock contention.

One benefit I can see of such a "global" memory provider with caches on
top is that it is nicely integrated: for example, the concept of memory
pressure exists for the zone as a whole.  All memory is of the same kind
and managed in a single entity, but free memory is cached for
performance.

As soon as you manage the memory in multiple zones of the same kind, you
lose that "global" view of your memory that is of the same kind but
managed in different buckets.  You might end up with a lot of memory
pressure in a single such zone, but still have plenty in another zone.

As one example, hot(un)plug of memory is easy: there is only a single 
zone. No need to make smart decisions or deal with having memory we're 
hotunplugging be stranded in multiple zones.

> 
>> I did not look into the details of this proposal, but seeing the
>> change in include/linux/page-flags-layout.h scares me.
> 
> It's possible for us to use 1 more bit in page->flags.  Do you think
> that will cause severe issue?  Or you think some other stuff isn't
> acceptable?

The issue is, everybody wants to consume more bits in page->flags, so if 
we can get away without it that would be much better :)

The more bits you want to consume, the more people will ask for making 
this a compile-time option and eventually compile it out on distro 
kernels (e.g., with many NUMA nodes). So we end up with more code and 
complexity and eventually not get the benefits where we really want them.

> 
>> Further, I'm not so sure how that change really interacts with
>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>> hacks the code such that such that the split works based on the boot
>> memory size ...
> 
> Em..., the zone stuff is kind of static now.  It's hard to add a zone at
> run-time.  So, in this series, we determine the number of zones per zone
> type based on boot memory size.  This may be improved in the future via
> pre-allocate some empty zone instances during boot and hot-add some
> memory to these zones.

Just to give you some idea: with virtio-mem, hyper-v, daxctl, and 
upcoming cxl dynamic memory pooling (some day I'm sure ;) ) you might 
see quite a small boot memory (e.g., 4 GiB) but a significant amount of 
memory getting hotplugged incrementally (e.g., up to 1 TiB) -- well, and 
hotunplugged. With multiple zone instances you really have to be careful 
and might have to re-balance between the multiple zones to keep the 
scalability, to not create imbalances between the zones ...

Something like PCP auto-tuning would be able to handle that mostly 
automatically, as there is only a single memory pool.

> 
>> I agree with Michal that looking into auto-tuning PCP would be
>> preferred. If that can't be done, adding another layer might end up
>> cleaner and eventually cover more use cases.
> 
> I do agree that it's valuable to make PCP etc. cover more use cases.  I
> just think that this should not prevent us from optimizing zone itself
> to cover remaining use cases.

I really don't like the concept of replicating zones of the same kind
for the same NUMA node.  But that's just my personal opinion as someone
maintaining some memory hot(un)plug code :)

That having been said, some kind of a sub-zone concept (an additional
layer) as outlined by Michal, IIUC, for example indexed by core
id/hash/whatsoever, could eventually be worth exploring.  Yes, such a
design raises various questions ... :)
  
Huang, Ying May 18, 2023, 8:06 a.m. UTC | #12
David Hildenbrand <david@redhat.com> writes:

>>> If we could avoid instantiating more zones and rather improve existing
>>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
>>> it's not easy, but that shouldn't stop us from trying ;)
>> I do think improving PCP or adding another level of cache will help
>> performance and scalability.
>> And, I think that it has value too to improve the performance of
>> zone
>> itself.  Because there will be always some cases that the zone lock
>> itself is contended.
>> That is, PCP and zone works at different level, and both deserve to
>> be
>> improved.  Do you agree?
>
> Spoiler: my humble opinion
>
>
> Well, the zone is kind-of your "global" memory provider, and PCPs
> cache a fraction of that to avoid exactly having to mess with that
> global datastructure and lock contention.
>
> One benefit I can see of such a "global" memory provider with caches
> on top is is that it is nicely integrated: for example, the concept of 
> memory pressure exists for the zone as a whole. All memory is of the
> same kind and managed in a single entity, but free memory is cached
> for performance.
>
> As soon as you manage the memory in multiple zones of the same kind,
> you lose that "global" view of your memory that is of the same kind,
> but managed in different bucks. You might end up with a lot of memory 
> pressure in a single such zone, but still have plenty in another zone.
>
> As one example, hot(un)plug of memory is easy: there is only a single
> zone. No need to make smart decisions or deal with having memory we're 
> hotunplugging be stranded in multiple zones.

I understand that there are some unresolved issues with splitting zones.
I will think more about them and the possible solutions.

>> 
>>> I did not look into the details of this proposal, but seeing the
>>> change in include/linux/page-flags-layout.h scares me.
>> It's possible for us to use 1 more bit in page->flags.  Do you think
>> that will cause severe issue?  Or you think some other stuff isn't
>> acceptable?
>
> The issue is, everybody wants to consume more bits in page->flags, so
> if we can get away without it that would be much better :)

Yes.

> The more bits you want to consume, the more people will ask for making
> this a compile-time option and eventually compile it out on distro 
> kernels (e.g., with many NUMA nodes). So we end up with more code and
> complexity and eventually not get the benefits where we really want
> them.

That's possible, although I think we will still use more page flags when
necessary.

>> 
>>> Further, I'm not so sure how that change really interacts with
>>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>>> hacks the code such that such that the split works based on the boot
>>> memory size ...
>> Em..., the zone stuff is kind of static now.  It's hard to add a
>> zone at
>> run-time.  So, in this series, we determine the number of zones per zone
>> type based on boot memory size.  This may be improved in the future via
>> pre-allocate some empty zone instances during boot and hot-add some
>> memory to these zones.
>
> Just to give you some idea: with virtio-mem, hyper-v, daxctl, and
> upcoming cxl dynamic memory pooling (some day I'm sure ;) ) you might 
> see quite a small boot memory (e.g., 4 GiB) but a significant amount
> of memory getting hotplugged incrementally (e.g., up to 1 TiB) --
> well, and hotunplugged. With multiple zone instances you really have
> to be careful and might have to re-balance between the multiple zones
> to keep the scalability, to not create imbalances between the zones
> ...

Thanks for your information!

> Something like PCP auto-tuning would be able to handle that mostly
> automatically, as there is only a single memory pool.

I agree that optimizing PCP will help performance regardless of whether
we split zones or not.

>> 
>>> I agree with Michal that looking into auto-tuning PCP would be
>>> preferred. If that can't be done, adding another layer might end up
>>> cleaner and eventually cover more use cases.
>> I do agree that it's valuable to make PCP etc. cover more use cases.
>> I
>> just think that this should not prevent us from optimizing zone itself
>> to cover remaining use cases.
>
> I really don't like the concept of replicating zones of the same kind
> for the same NUMA node. But that's just my personal opinion
> maintaining some memory hot(un)plug code :)
>
> Having that said, some kind of a sub-zone concept (additional layer)
> as outlined by Michal IIUC, for example, indexed by core
> id/has/whatsoever could eventually be worth exploring. Yes, such a
> design raises various questions ... :)

Yes.  That's another possible solution for the page allocation
scalability problem.

Best Regards,
Huang, Ying
  
Michal Hocko May 24, 2023, 12:30 p.m. UTC | #13
[Sorry for late reply, conferencing last 2 weeks and now catching up]

On Tue 16-05-23 12:30:21, David Hildenbrand wrote:
[...]
> > And auto-tuning doesn't sound easy.  Do you have some idea of how to do
> > that?
> 
> If we could avoid instantiating more zones and rather improve existing
> mechanisms (PCP), that would be much more preferred IMHO. I'm sure it's not
> easy, but that shouldn't stop us from trying ;)

Absolutely agreed.  Increasing the number of zones sounds like a hack to
me, TBH.  It seems like an easier way, but it invites more subtle
problems later on, e.g. hard-to-predict per-zone memory consumption and
memory reclaim imbalances.
  
Huang, Ying May 29, 2023, 1:13 a.m. UTC | #14
Michal Hocko <mhocko@suse.com> writes:

> [Sorry for late reply, conferencing last 2 weeks and now catching up]

Never mind.  And my reply is late too, due to sickness.

> On Tue 16-05-23 12:30:21, David Hildenbrand wrote:
> [...]
>> > And auto-tuning doesn't sound easy.  Do you have some idea of how to do
>> > that?
>> 
>> If we could avoid instantiating more zones and rather improve existing
>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure it's not
>> easy, but that shouldn't stop us from trying ;)
>
> Absolutely agreed. Increasing the zone number sounds like a hack to me
> TBH. It seems like an easier way but it allows more subtle problems
> later on. E.g. hard to predict per-zone memory consumption and memory
> reclaim disbalances.

At least we all think that improving PCP is something that deserves to
be done.  I will do some experiments on that and get back to you (after
some time).

Best Regards,
Huang, Ying