[v3] mm: Add nodes= arg to memory.reclaim

Message ID 20221202223533.1785418-1-almasrymina@google.com
State New
Headers
Series [v3] mm: Add nodes= arg to memory.reclaim |

Commit Message

Mina Almasry Dec. 2, 2022, 10:35 p.m. UTC
  The nodes= arg instructs the kernel to only scan the given nodes for
proactive reclaim. For example use cases, consider a 2 tier memory system:

nodes 0,1 -> top tier
nodes 2,3 -> second tier

$ echo "1m nodes=0" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m memory from node 0.
Since node 0 is a top tier node, demotion will be attempted first. This
is useful to direct proactive reclaim to specific nodes that are under
pressure.

$ echo "1m nodes=2,3" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m memory in the second tier,
since this tier of memory has no demotion targets the memory will be
reclaimed.

$ echo "1m nodes=0,1" > memory.reclaim

Instructs the kernel to reclaim memory from the top tier nodes, which can
be desirable according to the userspace policy if there is pressure on
the top tiers. Since these nodes have demotion targets, the kernel will
attempt demotion first.

Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim""), the proactive reclaim interface memory.reclaim does both
reclaim and demotion. Reclaim and demotion incur different latency costs
to the jobs in the cgroup. Demoted memory would still be addressable
by the userspace at a higher latency, but reclaimed memory would need to
incur a pagefault.

The 'nodes' arg is useful to allow the userspace to control demotion
and reclaim independently according to its policy: if the memory.reclaim
is called on a node with demotion targets, it will attempt demotion first;
if it is called on a node without demotion targets, it will only attempt
reclaim.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v3:
- Dropped RFC tag from subject.
- Added Michal's Ack
- Applied some of Bagas's comment suggestions.
- Converted try_to_fre_mem_cgorup_pages() to take nodemask_t* instead of
  nodemask_t as Shakeel and Muchun suggested.

Cc: bagasdotme@gmail.com

Thanks for the comments and reviews.
---
 Documentation/admin-guide/cgroup-v2.rst | 15 +++---
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 67 ++++++++++++++++++++-----
 mm/vmscan.c                             |  4 +-
 4 files changed, 68 insertions(+), 21 deletions(-)

--
2.39.0.rc0.267.gcb52ba06e7-goog
  

Comments

Shakeel Butt Dec. 2, 2022, 11:51 p.m. UTC | #1
On Fri, Dec 2, 2022 at 2:37 PM Mina Almasry <almasrymina@google.com> wrote:
>
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
>
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
>
> $ echo "1m nodes=0" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
>
> $ echo "1m nodes=2,3" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
>
> $ echo "1m nodes=0,1" > memory.reclaim
>
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
>
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
>
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>

Acked-by: Shakeel Butt <shakeelb@google.com>
  
Muchun Song Dec. 3, 2022, 3:17 a.m. UTC | #2
> On Dec 3, 2022, at 06:35, Mina Almasry <almasrymina@google.com> wrote:
> 
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
> 
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
> 
> $ echo "1m nodes=0" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
> 
> $ echo "1m nodes=2,3" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
> 
> $ echo "1m nodes=0,1" > memory.reclaim
> 
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
> 
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
> 
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.
  
Michal Hocko Dec. 12, 2022, 8:55 a.m. UTC | #3
On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
> 
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
> 
> $ echo "1m nodes=0" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
> 
> $ echo "1m nodes=2,3" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
> 
> $ echo "1m nodes=0,1" > memory.reclaim
> 
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
> 
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
> 
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

After discussion in [1] I have realized that I haven't really thought
through all the consequences of this patch and therefore I am retracting
my ack here. I am not nacking the patch at this statge but I also think
this shouldn't be merged now and we should really consider all the
consequences.

Let me summarize my main concerns here as well. The proposed
implementation doesn't apply the provided nodemask to the whole reclaim
process. This means that demotion can happen outside of the mask so the
the user request cannot really control demotion targets and that limits
the interface should there be any need for a finer grained control in
the future (see an example in [2]).
Another problem is that this can limit future reclaim extensions because
of existing assumptions of the interface [3] - specify only top-tier
node to force the aging without actually reclaiming any charges and
(ab)use the interface only for aging on multi-tier system. A change to
the reclaim to not demote in some cases could break this usecase.

My counter proposal would be to define the nodemask for memory.reclaim
as a domain to constrain the charge reclaim. That means both aging and
reclaim including demotion which is a part of aging. This will allow
to control where to demote for balancing purposes (e.g. demote to node 2
rather than 3) which is impossible with the proposed scheme.

[1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
[2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD@dhcp22.suse.cz
[3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg@mail.gmail.com
  
Mina Almasry Dec. 13, 2022, 12:54 a.m. UTC | #4
On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> > The nodes= arg instructs the kernel to only scan the given nodes for
> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >
> > nodes 0,1 -> top tier
> > nodes 2,3 -> second tier
> >
> > $ echo "1m nodes=0" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> > Since node 0 is a top tier node, demotion will be attempted first. This
> > is useful to direct proactive reclaim to specific nodes that are under
> > pressure.
> >
> > $ echo "1m nodes=2,3" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> > since this tier of memory has no demotion targets the memory will be
> > reclaimed.
> >
> > $ echo "1m nodes=0,1" > memory.reclaim
> >
> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> > be desirable according to the userspace policy if there is pressure on
> > the top tiers. Since these nodes have demotion targets, the kernel will
> > attempt demotion first.
> >
> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim""), the proactive reclaim interface memory.reclaim does both
> > reclaim and demotion. Reclaim and demotion incur different latency costs
> > to the jobs in the cgroup. Demoted memory would still be addressable
> > by the userspace at a higher latency, but reclaimed memory would need to
> > incur a pagefault.
> >
> > The 'nodes' arg is useful to allow the userspace to control demotion
> > and reclaim independently according to its policy: if the memory.reclaim
> > is called on a node with demotion targets, it will attempt demotion first;
> > if it is called on a node without demotion targets, it will only attempt
> > reclaim.
> >
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> After discussion in [1] I have realized that I haven't really thought
> through all the consequences of this patch and therefore I am retracting
> my ack here. I am not nacking the patch at this statge but I also think
> this shouldn't be merged now and we should really consider all the
> consequences.
>
> Let me summarize my main concerns here as well. The proposed
> implementation doesn't apply the provided nodemask to the whole reclaim
> process. This means that demotion can happen outside of the mask so the
> the user request cannot really control demotion targets and that limits
> the interface should there be any need for a finer grained control in
> the future (see an example in [2]).
> Another problem is that this can limit future reclaim extensions because
> of existing assumptions of the interface [3] - specify only top-tier
> node to force the aging without actually reclaiming any charges and
> (ab)use the interface only for aging on multi-tier system. A change to
> the reclaim to not demote in some cases could break this usecase.
>

I think this is correct. My use case is to request from the kernel to
do demotion without reclaim in the cgroup, and the reason for that is
stated in the commit message:

"Reclaim and demotion incur different latency costs to the jobs in the
cgroup. Demoted memory would still be addressable by the userspace at
a higher latency, but reclaimed memory would need to incur a
pagefault."

For jobs of some latency tiers, we would like to trigger proactive
demotion (which incurs relatively low latency on the job), but not
trigger proactive reclaim (which incurs a pagefault). I initially had
proposed a separate interface for this, but Johannes directed me to
this interface instead in [1]. In the same email Johannes also tells
me that meta's reclaim stack relies on memory.reclaim triggering
demotion, so it seems that I'm not the first to take a dependency on
this. Additionally in [2] Johannes also says it would be great if in
the long term reclaim policy and demotion policy do not diverge.

[1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
[2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/

> My counter proposal would be to define the nodemask for memory.reclaim
> as a domain to constrain the charge reclaim. That means both aging and
> reclaim including demotion which is a part of aging. This will allow
> to control where to demote for balancing purposes (e.g. demote to node 2
> rather than 3) which is impossible with the proposed scheme.
>

My understanding is that with this interface in order to trigger
demotion I would want to list both the top tier nodes and the bottom
tier nodes on the nodemask, and since the bottom tier nodes are in the
nodemask the kernel will not just trigger demotion, but will also
trigger reclaim. This is very specifically not our use case and not
the goal of this patch.

I had also suggested adding a demotion= arg to memory.reclaim so the
userspace may customize this behavior, but Johannes rejected this in
[3] to adhere to the aging pipeline.

All in all I like Johannes's model in [3] describing the aging
pipeline and the relationship between demotion and reclaim. The nodes=
arg is just a hint to the kernel that the userspace is looking for
reclaim from a top tier node (which would be done by demotion
according to the aging pipeline) or a bottom tier node (which would be
done by reclaim according to the aging pipeline). I think this
interface is aligned with this model.

[3] https://lore.kernel.org/linux-mm/Y36XchdgTCsMP4jT@cmpxchg.org/

> [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
> [2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD@dhcp22.suse.cz
> [3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg@mail.gmail.com
> --
> Michal Hocko
> SUSE Labs
  
Huang, Ying Dec. 13, 2022, 6:30 a.m. UTC | #5
Mina Almasry <almasrymina@google.com> writes:

> On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
>> > The nodes= arg instructs the kernel to only scan the given nodes for
>> > proactive reclaim. For example use cases, consider a 2 tier memory system:
>> >
>> > nodes 0,1 -> top tier
>> > nodes 2,3 -> second tier
>> >
>> > $ echo "1m nodes=0" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
>> > Since node 0 is a top tier node, demotion will be attempted first. This
>> > is useful to direct proactive reclaim to specific nodes that are under
>> > pressure.
>> >
>> > $ echo "1m nodes=2,3" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
>> > since this tier of memory has no demotion targets the memory will be
>> > reclaimed.
>> >
>> > $ echo "1m nodes=0,1" > memory.reclaim
>> >
>> > Instructs the kernel to reclaim memory from the top tier nodes, which can
>> > be desirable according to the userspace policy if there is pressure on
>> > the top tiers. Since these nodes have demotion targets, the kernel will
>> > attempt demotion first.
>> >
>> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > reclaim""), the proactive reclaim interface memory.reclaim does both
>> > reclaim and demotion. Reclaim and demotion incur different latency costs
>> > to the jobs in the cgroup. Demoted memory would still be addressable
>> > by the userspace at a higher latency, but reclaimed memory would need to
>> > incur a pagefault.
>> >
>> > The 'nodes' arg is useful to allow the userspace to control demotion
>> > and reclaim independently according to its policy: if the memory.reclaim
>> > is called on a node with demotion targets, it will attempt demotion first;
>> > if it is called on a node without demotion targets, it will only attempt
>> > reclaim.
>> >
>> > Acked-by: Michal Hocko <mhocko@suse.com>
>> > Signed-off-by: Mina Almasry <almasrymina@google.com>
>>
>> After discussion in [1] I have realized that I haven't really thought
>> through all the consequences of this patch and therefore I am retracting
>> my ack here. I am not nacking the patch at this statge but I also think
>> this shouldn't be merged now and we should really consider all the
>> consequences.
>>
>> Let me summarize my main concerns here as well. The proposed
>> implementation doesn't apply the provided nodemask to the whole reclaim
>> process. This means that demotion can happen outside of the mask so the
>> the user request cannot really control demotion targets and that limits
>> the interface should there be any need for a finer grained control in
>> the future (see an example in [2]).
>> Another problem is that this can limit future reclaim extensions because
>> of existing assumptions of the interface [3] - specify only top-tier
>> node to force the aging without actually reclaiming any charges and
>> (ab)use the interface only for aging on multi-tier system. A change to
>> the reclaim to not demote in some cases could break this usecase.
>>
>
> I think this is correct. My use case is to request from the kernel to
> do demotion without reclaim in the cgroup, and the reason for that is
> stated in the commit message:
>
> "Reclaim and demotion incur different latency costs to the jobs in the
> cgroup. Demoted memory would still be addressable by the userspace at
> a higher latency, but reclaimed memory would need to incur a
> pagefault."
>
> For jobs of some latency tiers, we would like to trigger proactive
> demotion (which incurs relatively low latency on the job), but not
> trigger proactive reclaim (which incurs a pagefault). I initially had
> proposed a separate interface for this, but Johannes directed me to
> this interface instead in [1]. In the same email Johannes also tells
> me that meta's reclaim stack relies on memory.reclaim triggering
> demotion, so it seems that I'm not the first to take a dependency on
> this. Additionally in [2] Johannes also says it would be great if in
> the long term reclaim policy and demotion policy do not diverge.
>
> [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/

After these discussion, I think the solution maybe use different
interfaces for "proactive demote" and "proactive reclaim".  That is,
reconsider "memory.demote".  In this way, we will always uncharge the
cgroup for "memory.reclaim".  This avoid the possible confusion there.
And, because demotion is considered aging, we don't need to disable
demotion for "memory.reclaim", just don't count it.

Best Regards,
Huang, Ying
  
Wei Xu Dec. 13, 2022, 7:48 a.m. UTC | #6
On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko <mhocko@suse.com>
> >> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this statge but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so the
> >> the user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
>
> After these discussion, I think the solution maybe use different
> interfaces for "proactive demote" and "proactive reclaim".  That is,
> reconsider "memory.demote".  In this way, we will always uncharge the
> cgroup for "memory.reclaim".  This avoid the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

+1 on memory.demote.

> Best Regards,
> Huang, Ying
>
  
Michal Hocko Dec. 13, 2022, 8:33 a.m. UTC | #7
On Mon 12-12-22 16:54:27, Mina Almasry wrote:
> On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Let me summarize my main concerns here as well. The proposed
> > implementation doesn't apply the provided nodemask to the whole reclaim
> > process. This means that demotion can happen outside of the mask so the
> > the user request cannot really control demotion targets and that limits
> > the interface should there be any need for a finer grained control in
> > the future (see an example in [2]).
> > Another problem is that this can limit future reclaim extensions because
> > of existing assumptions of the interface [3] - specify only top-tier
> > node to force the aging without actually reclaiming any charges and
> > (ab)use the interface only for aging on multi-tier system. A change to
> > the reclaim to not demote in some cases could break this usecase.
> >
> 
> I think this is correct. My use case is to request from the kernel to
> do demotion without reclaim in the cgroup, and the reason for that is
> stated in the commit message:
> 
> "Reclaim and demotion incur different latency costs to the jobs in the
> cgroup. Demoted memory would still be addressable by the userspace at
> a higher latency, but reclaimed memory would need to incur a
> pagefault."
> 
> For jobs of some latency tiers, we would like to trigger proactive
> demotion (which incurs relatively low latency on the job), but not
> trigger proactive reclaim (which incurs a pagefault). I initially had
> proposed a separate interface for this, but Johannes directed me to
> this interface instead in [1]. In the same email Johannes also tells
> me that meta's reclaim stack relies on memory.reclaim triggering
> demotion, so it seems that I'm not the first to take a dependency on
> this. Additionally in [2] Johannes also says it would be great if in
> the long term reclaim policy and demotion policy do not diverge.

I do recognize your need to control the demotion but I argue that it is
a bad idea to rely on an implicit behavior of the memory reclaim and an
interface which is _documented_ to primarily _reclaim_ memory.

Really, consider that the current demotion implementation will change
in the future and based on a newly added heuristic memory reclaim or
compression would be preferred over migration to a different tier.  This
might completely break your current assumptions and break your usecase
which relies on an implicit demotion behavior.  Do you see that as a
potential problem at all? What shall we do in that case? Special case
memory.reclaim behavior?

Now to your specific usecase. If there is a need to do a memory
distribution balancing then fine but this should be a well defined
interface. E.g. is there a need to not only control demotion but
promotions as well? I haven't heard anybody requesting that so far
but I can easily imagine that like outsourcing the memory reclaim to
the userspace someone might want to do the same thing with the numa
balancing because $REASONS. Should that ever happen, I am pretty sure
hooking into memory.reclaim is not really a great idea.

See where I am coming from?

> [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
  
Michal Hocko Dec. 13, 2022, 8:51 a.m. UTC | #8
On Tue 13-12-22 14:30:57, Huang, Ying wrote:
> Mina Almasry <almasrymina@google.com> writes:
[...]
> After these discussion, I think the solution maybe use different
> interfaces for "proactive demote" and "proactive reclaim".  That is,
> reconsider "memory.demote".  In this way, we will always uncharge the
> cgroup for "memory.reclaim".  This avoid the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

As already pointed out in my previous email, we should really think more
about future requirements. Do we add memory.promote interface when there
is a request to implement numa balancing into the userspace? Maybe yes
but maybe the node balancing should be more generic than bound to memory
tiering and apply to a more fine grained nodemask control.

Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE),
reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize
(swap in, or read ahead) which are per mm/file. Their primary usability
issue is that they are process centric and that requires a very deep
understanding of the process mm layout so it is not really usable for a
larger scale orchestration.
The important part of those interfaces is that they do not talk about
demotion because that is an implementation detail. I think we want to
follow that model at least. From a higher level POV I believe we really
need an interface to age&reclaim and balance memory among nodes. Are
there more higher level usecases?
  
Johannes Weiner Dec. 13, 2022, 1:30 p.m. UTC | #9
On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> Mina Almasry <almasrymina@google.com> writes:
> 
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko <mhocko@suse.com>
> >> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this statge but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so the
> >> the user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
> 
> After these discussion, I think the solution maybe use different
> interfaces for "proactive demote" and "proactive reclaim".  That is,
> reconsider "memory.demote".  In this way, we will always uncharge the
> cgroup for "memory.reclaim".  This avoid the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

Hm, so in summary:

1) memory.reclaim would demote and reclaim like today, but it would
   change to only count reclaimed pages against the goal.

2) memory.demote would only demote.

   a) What if the demotion targets are full? Would it reclaim or fail?

3) Would memory.reclaim and memory.demote still need nodemasks? Would
   they return -EINVAL if a) memory.reclaim gets passed only toptier
   nodes or b) memory.demote gets passed any lasttier nodes?
  
Huang, Ying Dec. 13, 2022, 1:42 p.m. UTC | #10
Michal Hocko <mhocko@suse.com> writes:

> On Tue 13-12-22 14:30:57, Huang, Ying wrote:
>> Mina Almasry <almasrymina@google.com> writes:
> [...]
>> After these discussion, I think the solution maybe use different
>> interfaces for "proactive demote" and "proactive reclaim".  That is,
>> reconsider "memory.demote".  In this way, we will always uncharge the
>> cgroup for "memory.reclaim".  This avoid the possible confusion there.
>> And, because demotion is considered aging, we don't need to disable
>> demotion for "memory.reclaim", just don't count it.
>
> As already pointed out in my previous email, we should really think more
> about future requirements. Do we add memory.promote interface when there
> is a request to implement numa balancing into the userspace? Maybe yes
> but maybe the node balancing should be more generic than bound to memory
> tiering and apply to a more fine grained nodemask control.
>
> Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE),
> reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize
> (swap in, or read ahead) which are per mm/file. Their primary usability
> issue is that they are process centric and that requires a very deep
> understanding of the process mm layout so it is not really usable for a
> larger scale orchestration.
> The important part of those interfaces is that they do not talk about
> demotion because that is an implementation detail. I think we want to
> follow that model at least. From a higher level POV I believe we really
> need an interface to age&reclaim and balance memory among nodes. Are
> there more higher level usecases?

Yes.  If the high level interface can satisfy the requirements, we
should use them or define them.  But I guess Mina and Xu has some
requirements at the level of memory tiers (demotion/promotion)?

Best Regards,
Huang, Ying
  
Michal Hocko Dec. 13, 2022, 2:03 p.m. UTC | #11
On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
[...]
> > After these discussion, I think the solution maybe use different
> > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > reconsider "memory.demote".  In this way, we will always uncharge the
> > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > And, because demotion is considered aging, we don't need to disable
> > demotion for "memory.reclaim", just don't count it.
> 
> Hm, so in summary:
> 
> 1) memory.reclaim would demote and reclaim like today, but it would
>    change to only count reclaimed pages against the goal.
> 
> 2) memory.demote would only demote.
> 
>    a) What if the demotion targets are full? Would it reclaim or fail?
> 
> 3) Would memory.reclaim and memory.demote still need nodemasks? Would
>    they return -EINVAL if a) memory.reclaim gets passed only toptier
>    nodes or b) memory.demote gets passed any lasttier nodes?

I would also add
4) Do we want to allow to control the demotion path (e.g. which node to
   demote from and to) and how to achieve that?
5) Is the demotion api restricted to multi-tier systems or any numa
   configuration allowed as well?
  
Johannes Weiner Dec. 13, 2022, 3:58 p.m. UTC | #12
On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> I do recognize your need to control the demotion but I argue that it is
> a bad idea to rely on an implicit behavior of the memory reclaim and an
> interface which is _documented_ to primarily _reclaim_ memory.

I think memory.reclaim should demote as part of page aging. What I'd
like to avoid is *having* to manually control the aging component in
the interface (e.g. making memory.reclaim *only* reclaim, and
*requiring* a coordinated use of memory.demote to ensure progress.)

> Really, consider that the current demotion implementation will change
> in the future and based on a newly added heuristic memory reclaim or
> compression would be preferred over migration to a different tier.  This
> might completely break your current assumptions and break your usecase
> which relies on an implicit demotion behavior.  Do you see that as a
> potential problem at all? What shall we do in that case? Special case
> memory.reclaim behavior?

Shouldn't that be derived from the distance propertiers in the tier
configuration?

I.e. if local compression is faster than demoting to a slower node, we
should maybe have a separate tier for that. Ignoring proactive reclaim
or demotion commands for a second: on that node, global memory
pressure should always compress first, while the oldest pages from the
compression cache should demote to the other node(s) - until they
eventually get swapped out.

However fine-grained we make proactive reclaim control over these
stages, it should at least be possible for the user to request the
default behavior that global pressure follows, without jumping through
hoops or requiring the coordinated use of multiple knobs. So IMO there
is an argument for having a singular knob that requests comprehensive
aging and reclaiming across the configured hierarchy.

As far as explicit control over the individual stages goes - no idea
if you would call the compression stage demotion or reclaim. The
distinction still does not make much of sense to me, since reclaim is
just another form of demotion. Sure, page faults have a different
access latency than dax to slower memory. But you could also have 3
tiers of memory where the difference between tier 1 and 2 is much
smaller than the difference between 2 and 3, and you might want to
apply different demotion rates between them as well.

The other argument is that demotion does not free cgroup memory,
whereas reclaim does. But with multiple memory tiers of vastly
different performance, isn't there also an argument for granting
cgroups different shares of each memory? So that a higher priority
group has access to a bigger share of the fastest memory, and lower
prio cgroups are relegated to lower tiers. If we split those pools,
then "demotion" will actually free memory in a cgroup.

This is why I liked adding a nodes= argument to memory.reclaim the
best. It doesn't encode a distinction that may not last for long.

The problem comes from how to interpret the input argument and the
return value, right? Could we solve this by requiring the passed
nodes= to all be of the same memory tier? Then there is no confusion
around what is requested and what the return value means.

And if no nodes are passed, it means reclaim (from the lowest memory
tier) X pages and demote as needed, then return the reclaimed pages.

> Now to your specific usecase. If there is a need to do a memory
> distribution balancing then fine but this should be a well defined
> interface. E.g. is there a need to not only control demotion but
> promotions as well? I haven't heard anybody requesting that so far
> but I can easily imagine that like outsourcing the memory reclaim to
> the userspace someone might want to do the same thing with the numa
> balancing because $REASONS. Should that ever happen, I am pretty sure
> hooking into memory.reclaim is not really a great idea.

Should this ever happen, it would seem fair that that be a separate
knob anyway, no? One knob to move the pipeline in one direction
(aging), one knob to move it the other way.
  
Mina Almasry Dec. 13, 2022, 7:29 p.m. UTC | #13
On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> [...]
> > > After these discussion, I think the solution maybe use different
> > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > And, because demotion is considered aging, we don't need to disable
> > > demotion for "memory.reclaim", just don't count it.
> >
> > Hm, so in summary:
> >
> > 1) memory.reclaim would demote and reclaim like today, but it would
> >    change to only count reclaimed pages against the goal.
> >
> > 2) memory.demote would only demote.
> >

If the above 2 points are agreeable then yes, this sounds good to me
and does address our use case.

> >    a) What if the demotion targets are full? Would it reclaim or fail?
> >

Wei will chime in if he disagrees, but I think we _require_ that it
fails, not falls back to reclaim. The interface is asking for
demotion, and is called memory.demote. For such an interface to fall
back to reclaim would be very confusing to userspace and may trigger
reclaim on a high priority job that we want to shield from proactive
reclaim.

> > 3) Would memory.reclaim and memory.demote still need nodemasks?

memory.demote will need a nodemask, for sure. Today the nodemask would
be useful if there is a specific node in the top tier that is
overloaded and we want to reduce the pressure by demoting. In the
future there will be N tiers and the nodemask says which tier to
demote from.

I don't think memory.reclaim would need a nodemask anymore? At least I
no longer see the use for it for us.

> >    Would
> >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> >    nodes or b) memory.demote gets passed any lasttier nodes?
>

Honestly it would be great if memory.reclaim can force reclaim from a
top tier nodes. It breaks the aginig pipeline, yes, but if the user is
specifically asking for that because they decided in their usecase
it's a good idea then the kernel should comply IMO. Not a strict
requirement for us. Wei will chime in if he disagrees.

memory.demote returning -EINVAL for lasttier nodes makes sense to me.

> I would also add
> 4) Do we want to allow to control the demotion path (e.g. which node to
>    demote from and to) and how to achieve that?

We care deeply about specifying which node to demote _from_. That
would be some node that is approaching pressure and we're looking for
proactive saving from. So far I haven't seen any reason to control
which nodes to demote _to_. The kernel deciding that based on the
aging pipeline and the node distances sounds good to me. Obviously
someone else may find that useful.

> 5) Is the demotion api restricted to multi-tier systems or any numa
>    configuration allowed as well?
>

demotion will of course not work on single tiered systems. The
interface may return some failure on such systems or not be available
at all.

> --
> Michal Hocko
> SUSE Labs
  
Mina Almasry Dec. 13, 2022, 7:53 p.m. UTC | #14
On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> > I do recognize your need to control the demotion but I argue that it is
> > a bad idea to rely on an implicit behavior of the memory reclaim and an
> > interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
> > Really, consider that the current demotion implementation will change
> > in the future and based on a newly added heuristic memory reclaim or
> > compression would be preferred over migration to a different tier.  This
> > might completely break your current assumptions and break your usecase
> > which relies on an implicit demotion behavior.  Do you see that as a
> > potential problem at all? What shall we do in that case? Special case
> > memory.reclaim behavior?
>
> Shouldn't that be derived from the distance propertiers in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much of sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>

I would also like to say I implemented something in line with that in [1].

In this patch, pages demoted from inside the nodemask to outside the
nodemask count as 'reclaimed'. This, in my mind, is a very generic
solution to the 'should demoted pages count as reclaim?' problem, and
will work in all scenarios as long as the nodemask passed to
shrink_folio_list() is set correctly by the call stack.

> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.
>

I feel like I arrived at a better solution in [1], where pages demoted
from inside of the nodemask to outside count as reclaimed and the rest
don't. But I think we could solve this by explicit checks that nodes=
arg are from the same tier, yes.

> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.
>
> > Now to your specific usecase. If there is a need to do a memory
> > distribution balancing then fine but this should be a well defined
> > interface. E.g. is there a need to not only control demotion but
> > promotions as well? I haven't heard anybody requesting that so far
> > but I can easily imagine that like outsourcing the memory reclaim to
> > the userspace someone might want to do the same thing with the numa
> > balancing because $REASONS. Should that ever happen, I am pretty sure
> > hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

[1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/
  
Huang, Ying Dec. 14, 2022, 7:15 a.m. UTC | #15
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
>> I do recognize your need to control the demotion but I argue that it is
>> a bad idea to rely on an implicit behavior of the memory reclaim and an
>> interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
>> Really, consider that the current demotion implementation will change
>> in the future and based on a newly added heuristic memory reclaim or
>> compression would be preferred over migration to a different tier.  This
>> might completely break your current assumptions and break your usecase
>> which relies on an implicit demotion behavior.  Do you see that as a
>> potential problem at all? What shall we do in that case? Special case
>> memory.reclaim behavior?
>
> Shouldn't that be derived from the distance propertiers in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much of sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.

Yes.  The definition is clear if nodes= from the same memory tier.

> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.

It appears that the definition isn't very clear here.  How many pages
should be demoted?  The target number is the value echoed to
memory.reclaim?  Or requested_number - pages_in_lowest_tier?  Should we
demote in as many tiers as possible or in as few tiers as possible?  One
possibility is to take advantage of top tier memory as much as
possible.  That is, try to reclaim pages in lower tiers only.

>> Now to your specific usecase. If there is a need to do a memory
>> distribution balancing then fine but this should be a well defined
>> interface. E.g. is there a need to not only control demotion but
>> promotions as well? I haven't heard anybody requesting that so far
>> but I can easily imagine that like outsourcing the memory reclaim to
>> the userspace someone might want to do the same thing with the numa
>> balancing because $REASONS. Should that ever happen, I am pretty sure
>> hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

Agree.

Best Regards,
Huang, Ying
  
Huang, Ying Dec. 14, 2022, 7:20 a.m. UTC | #16
Mina Almasry <almasrymina@google.com> writes:

> On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
>> > I do recognize your need to control the demotion but I argue that it is
>> > a bad idea to rely on an implicit behavior of the memory reclaim and an
>> > interface which is _documented_ to primarily _reclaim_ memory.
>>
>> I think memory.reclaim should demote as part of page aging. What I'd
>> like to avoid is *having* to manually control the aging component in
>> the interface (e.g. making memory.reclaim *only* reclaim, and
>> *requiring* a coordinated use of memory.demote to ensure progress.)
>>
>> > Really, consider that the current demotion implementation will change
>> > in the future and based on a newly added heuristic memory reclaim or
>> > compression would be preferred over migration to a different tier.  This
>> > might completely break your current assumptions and break your usecase
>> > which relies on an implicit demotion behavior.  Do you see that as a
>> > potential problem at all? What shall we do in that case? Special case
>> > memory.reclaim behavior?
>>
>> Shouldn't that be derived from the distance propertiers in the tier
>> configuration?
>>
>> I.e. if local compression is faster than demoting to a slower node, we
>> should maybe have a separate tier for that. Ignoring proactive reclaim
>> or demotion commands for a second: on that node, global memory
>> pressure should always compress first, while the oldest pages from the
>> compression cache should demote to the other node(s) - until they
>> eventually get swapped out.
>>
>> However fine-grained we make proactive reclaim control over these
>> stages, it should at least be possible for the user to request the
>> default behavior that global pressure follows, without jumping through
>> hoops or requiring the coordinated use of multiple knobs. So IMO there
>> is an argument for having a singular knob that requests comprehensive
>> aging and reclaiming across the configured hierarchy.
>>
>> As far as explicit control over the individual stages goes - no idea
>> if you would call the compression stage demotion or reclaim. The
>> distinction still does not make much of sense to me, since reclaim is
>> just another form of demotion. Sure, page faults have a different
>> access latency than dax to slower memory. But you could also have 3
>> tiers of memory where the difference between tier 1 and 2 is much
>> smaller than the difference between 2 and 3, and you might want to
>> apply different demotion rates between them as well.
>>
>> The other argument is that demotion does not free cgroup memory,
>> whereas reclaim does. But with multiple memory tiers of vastly
>> different performance, isn't there also an argument for granting
>> cgroups different shares of each memory? So that a higher priority
>> group has access to a bigger share of the fastest memory, and lower
>> prio cgroups are relegated to lower tiers. If we split those pools,
>> then "demotion" will actually free memory in a cgroup.
>>
>
> I would also like to say I implemented something in line with that in [1].
>
> In this patch, pages demoted from inside the nodemask to outside the
> nodemask count as 'reclaimed'. This, in my mind, is a very generic
> solution to the 'should demoted pages count as reclaim?' problem, and
> will work in all scenarios as long as the nodemask passed to
> shrink_folio_list() is set correctly by the call stack.

It's still not clear that how many pages should be demoted among the
nodes inside the nodemask.  One possibility is to keep as many higher
tier pages as possible.

Best Regards,
Huang, Ying
  
Michal Hocko Dec. 14, 2022, 10:23 a.m. UTC | #17
On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> > [...]
> > > > After these discussion, I think the solution maybe use different
> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > > And, because demotion is considered aging, we don't need to disable
> > > > demotion for "memory.reclaim", just don't count it.
> > >
> > > Hm, so in summary:
> > >
> > > 1) memory.reclaim would demote and reclaim like today, but it would
> > >    change to only count reclaimed pages against the goal.
> > >
> > > 2) memory.demote would only demote.
> > >
> 
> If the above 2 points are agreeable then yes, this sounds good to me
> and does address our use case.
> 
> > >    a) What if the demotion targets are full? Would it reclaim or fail?
> > >
> 
> Wei will chime in if he disagrees, but I think we _require_ that it
> fails, not falls back to reclaim. The interface is asking for
> demotion, and is called memory.demote. For such an interface to fall
> back to reclaim would be very confusing to userspace and may trigger
> reclaim on a high priority job that we want to shield from proactive
> reclaim.

But what should happen if the immediate demotion target is full but
lower tiers are still usable. Should the first one demote before
allowing to demote from the top tier?
 
> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
> 
> memory.demote will need a nodemask, for sure. Today the nodemask would
> be useful if there is a specific node in the top tier that is
> overloaded and we want to reduce the pressure by demoting. In the
> future there will be N tiers and the nodemask says which tier to
> demote from.

OK, so what is the exact semantic of the node mask. Does it control
where to demote from or to or both?

> I don't think memory.reclaim would need a nodemask anymore? At least I
> no longer see the use for it for us.
> 
> > >    Would
> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> > >    nodes or b) memory.demote gets passed any lasttier nodes?
> >
> 
> Honestly it would be great if memory.reclaim can force reclaim from a
> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
> specifically asking for that because they decided in their usecase
> it's a good idea then the kernel should comply IMO. Not a strict
> requirement for us. Wei will chime in if he disagrees.

That would require a nodemask to say which nodes to reclaim, no? The
default behavior should be in line with what standard memory reclaim
does. If the demotion is a part of that process so should be
memory.reclaim part of it. If we want to have a finer control then a
nodemask is really a must and then the nodemaks should constrain both
agining and reclaim.

> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
> 
> > I would also add
> > 4) Do we want to allow to control the demotion path (e.g. which node to
> >    demote from and to) and how to achieve that?
> 
> We care deeply about specifying which node to demote _from_. That
> would be some node that is approaching pressure and we're looking for
> proactive saving from. So far I haven't seen any reason to control
> which nodes to demote _to_. The kernel deciding that based on the
> aging pipeline and the node distances sounds good to me. Obviously
> someone else may find that useful.

Please keep in mind that the interface should be really prepared for
future extensions so try to abstract from your immediate usecases.

> > 5) Is the demotion api restricted to multi-tier systems or any numa
> >    configuration allowed as well?
> >
> 
> demotion will of course not work on single tiered systems. The
> interface may return some failure on such systems or not be available
> at all.

Is there any strong reason for that? We do not have any interface to
control NUMA balancing from userspace. Why cannot we use the interface
for that purpose?
  
Michal Hocko Dec. 14, 2022, 10:43 a.m. UTC | #18
On Tue 13-12-22 16:58:50, Johannes Weiner wrote:
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> > I do recognize your need to control the demotion but I argue that it is
> > a bad idea to rely on an implicit behavior of the memory reclaim and an
> > interface which is _documented_ to primarily _reclaim_ memory.
> 
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)

Yes, I do agree with that. Demotion is a part of the aging. I meant to say
that the result of the operation should be reclaimed charges but that
doesn't mean that demotion is not a part of that process.

I am mostly concerned about demote only behavior that Mina is targetting
and want to use memory.reclaim interface.

> > Really, consider that the current demotion implementation will change
> > in the future and based on a newly added heuristic memory reclaim or
> > compression would be preferred over migration to a different tier.  This
> > might completely break your current assumptions and break your usecase
> > which relies on an implicit demotion behavior.  Do you see that as a
> > potential problem at all? What shall we do in that case? Special case
> > memory.reclaim behavior?
> 
> Shouldn't that be derived from the distance propertiers in the tier
> configuration?
> 
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
> 
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
> 
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much of sense to me, since reclaim is
> just another form of demotion.

From the external visibility POV the major difference between the two is
that the reclaim decreases the overall charged memory. And there are
pro-active reclaim usecases which rely on that. Demotion is mostly
memory placement rebalancing. Sure still visible in per-node stats and
with implications to performance but that is a different story.

> Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
> 
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory?

Yes. We have already had requests for per node limits in the past. And I
do expect this will show up as a problem here as well but with a
reasonable memory.reclaim and potentially memory.demote interfaces the
balancing and policy making can be outsourced to the userspace .

> So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
> 
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
> 
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.

Just to make sure I am on the same page. This means that if a node mask
is specified then it always implies demotion without any control over
how the demotion is done, right?

> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.

IMO this is rather constrained semantic which will completely rule out
some potentially interesting usecases. E.g. fine grained control over
the demotion path or enforced reclaim for node balancing. Also if we
ever want a promote interface then it would better fit with demote
counterpart.

> > Now to your specific usecase. If there is a need to do a memory
> > distribution balancing then fine but this should be a well defined
> > interface. E.g. is there a need to not only control demotion but
> > promotions as well? I haven't heard anybody requesting that so far
> > but I can easily imagine that like outsourcing the memory reclaim to
> > the userspace someone might want to do the same thing with the numa
> > balancing because $REASONS. Should that ever happen, I am pretty sure
> > hooking into memory.reclaim is not really a great idea.
> 
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

Yes, this is what I am inclining to as well.
  
Huang, Ying Dec. 15, 2022, 5:50 a.m. UTC | #19
Michal Hocko <mhocko@suse.com> writes:

> On Tue 13-12-22 11:29:45, Mina Almasry wrote:
>> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
>> >
>> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
>> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
>> > [...]
>> > > > After these discussion, I think the solution maybe use different
>> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
>> > > > reconsider "memory.demote".  In this way, we will always uncharge the
>> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
>> > > > And, because demotion is considered aging, we don't need to disable
>> > > > demotion for "memory.reclaim", just don't count it.
>> > >
>> > > Hm, so in summary:
>> > >
>> > > 1) memory.reclaim would demote and reclaim like today, but it would
>> > >    change to only count reclaimed pages against the goal.
>> > >
>> > > 2) memory.demote would only demote.
>> > >
>> 
>> If the above 2 points are agreeable then yes, this sounds good to me
>> and does address our use case.
>> 
>> > >    a) What if the demotion targets are full? Would it reclaim or fail?
>> > >
>> 
>> Wei will chime in if he disagrees, but I think we _require_ that it
>> fails, not falls back to reclaim. The interface is asking for
>> demotion, and is called memory.demote. For such an interface to fall
>> back to reclaim would be very confusing to userspace and may trigger
>> reclaim on a high priority job that we want to shield from proactive
>> reclaim.
>
> But what should happen if the immediate demotion target is full but
> lower tiers are still usable. Should the first one demote before
> allowing to demote from the top tier?
>  
>> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
>> 
>> memory.demote will need a nodemask, for sure. Today the nodemask would
>> be useful if there is a specific node in the top tier that is
>> overloaded and we want to reduce the pressure by demoting. In the
>> future there will be N tiers and the nodemask says which tier to
>> demote from.
>
> OK, so what is the exact semantic of the node mask. Does it control
> where to demote from or to or both?
>
>> I don't think memory.reclaim would need a nodemask anymore? At least I
>> no longer see the use for it for us.
>> 
>> > >    Would
>> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
>> > >    nodes or b) memory.demote gets passed any lasttier nodes?
>> >
>> 
>> Honestly it would be great if memory.reclaim can force reclaim from a
>> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
>> specifically asking for that because they decided in their usecase
>> it's a good idea then the kernel should comply IMO. Not a strict
>> requirement for us. Wei will chime in if he disagrees.
>
> That would require a nodemask to say which nodes to reclaim, no? The
> default behavior should be in line with what standard memory reclaim
> does. If the demotion is a part of that process so should be
> memory.reclaim part of it. If we want to have a finer control then a
> nodemask is really a must and then the nodemaks should constrain both
> agining and reclaim.
>
>> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
>> 
>> > I would also add
>> > 4) Do we want to allow to control the demotion path (e.g. which node to
>> >    demote from and to) and how to achieve that?
>> 
>> We care deeply about specifying which node to demote _from_. That
>> would be some node that is approaching pressure and we're looking for
>> proactive saving from. So far I haven't seen any reason to control
>> which nodes to demote _to_. The kernel deciding that based on the
>> aging pipeline and the node distances sounds good to me. Obviously
>> someone else may find that useful.
>
> Please keep in mind that the interface should be really prepared for
> future extensions so try to abstract from your immediate usecases.

I see two requirements here, one is to control the demotion source, that
is, which nodes to free memory.  The other is to control the demotion
path.  I think that we can use two different parameters for them, for
example, "from=<demotion source nodes>" and "to=<demotion target
nodes>".  In most cases we don't need to control the demotion path.
Because in current implementation, the nodes in the lower tiers in the
same socket (local nodes) will be preferred.  I think that this is
the desired behavior in most cases.

>> > 5) Is the demotion api restricted to multi-tier systems or any numa
>> >    configuration allowed as well?
>> >
>> 
>> demotion will of course not work on single tiered systems. The
>> interface may return some failure on such systems or not be available
>> at all.
>
> Is there any strong reason for that? We do not have any interface to
> control NUMA balancing from userspace. Why cannot we use the interface
> for that purpose? 

Do you mean to demote the cold pages from the specified source nodes to
the specified target nodes in different sockets?  We don't do that to
avoid loop in the demotion path.  If we prevent the target nodes from
demoting cold pages to the source nodes at the same time, it seems
doable.

Best Regards,
Huang, Ying
  
Michal Hocko Dec. 15, 2022, 9:21 a.m. UTC | #20
On Thu 15-12-22 13:50:14, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >
> >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> >> > [...]
> >> > > > After these discussion, I think the solution maybe use different
> >> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> >> > > > reconsider "memory.demote".  In this way, we will always uncharge the
> >> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> >> > > > And, because demotion is considered aging, we don't need to disable
> >> > > > demotion for "memory.reclaim", just don't count it.
> >> > >
> >> > > Hm, so in summary:
> >> > >
> >> > > 1) memory.reclaim would demote and reclaim like today, but it would
> >> > >    change to only count reclaimed pages against the goal.
> >> > >
> >> > > 2) memory.demote would only demote.
> >> > >
> >> 
> >> If the above 2 points are agreeable then yes, this sounds good to me
> >> and does address our use case.
> >> 
> >> > >    a) What if the demotion targets are full? Would it reclaim or fail?
> >> > >
> >> 
> >> Wei will chime in if he disagrees, but I think we _require_ that it
> >> fails, not falls back to reclaim. The interface is asking for
> >> demotion, and is called memory.demote. For such an interface to fall
> >> back to reclaim would be very confusing to userspace and may trigger
> >> reclaim on a high priority job that we want to shield from proactive
> >> reclaim.
> >
> > But what should happen if the immediate demotion target is full but
> > lower tiers are still usable. Should the first one demote before
> > allowing to demote from the top tier?
> >  
> >> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
> >> 
> >> memory.demote will need a nodemask, for sure. Today the nodemask would
> >> be useful if there is a specific node in the top tier that is
> >> overloaded and we want to reduce the pressure by demoting. In the
> >> future there will be N tiers and the nodemask says which tier to
> >> demote from.
> >
> > OK, so what is the exact semantic of the node mask. Does it control
> > where to demote from or to or both?
> >
> >> I don't think memory.reclaim would need a nodemask anymore? At least I
> >> no longer see the use for it for us.
> >> 
> >> > >    Would
> >> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> >> > >    nodes or b) memory.demote gets passed any lasttier nodes?
> >> >
> >> 
> >> Honestly it would be great if memory.reclaim can force reclaim from a
> >> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
> >> specifically asking for that because they decided in their usecase
> >> it's a good idea then the kernel should comply IMO. Not a strict
> >> requirement for us. Wei will chime in if he disagrees.
> >
> > That would require a nodemask to say which nodes to reclaim, no? The
> > default behavior should be in line with what standard memory reclaim
> > does. If the demotion is a part of that process so should be
> > memory.reclaim part of it. If we want to have a finer control then a
> > nodemask is really a must and then the nodemaks should constrain both
> > agining and reclaim.
> >
> >> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
> >> 
> >> > I would also add
> >> > 4) Do we want to allow to control the demotion path (e.g. which node to
> >> >    demote from and to) and how to achieve that?
> >> 
> >> We care deeply about specifying which node to demote _from_. That
> >> would be some node that is approaching pressure and we're looking for
> >> proactive saving from. So far I haven't seen any reason to control
> >> which nodes to demote _to_. The kernel deciding that based on the
> >> aging pipeline and the node distances sounds good to me. Obviously
> >> someone else may find that useful.
> >
> > Please keep in mind that the interface should be really prepared for
> > future extensions so try to abstract from your immediate usecases.
> 
> I see two requirements here, one is to control the demotion source, that
> is, which nodes to free memory.  The other is to control the demotion
> path.  I think that we can use two different parameters for them, for
> example, "from=<demotion source nodes>" and "to=<demotion target
> nodes>".  In most cases we don't need to control the demotion path.
> Because in current implementation, the nodes in the lower tiers in the
> same socket (local nodes) will be preferred.  I think that this is
> the desired behavior in most cases.

Even if the demotion path is not really required at the moment we should
keep in mind future potential extensions. E.g. when a userspace based
balancing is to be implemented because the default behavior cannot
capture userspace policies (one example would be enforcing a
prioritization of containers when some container's demoted pages would
need to be demoted further to free up a space for a different
workload). 
 
> >> > 5) Is the demotion api restricted to multi-tier systems or any numa
> >> >    configuration allowed as well?
> >> >
> >> 
> >> demotion will of course not work on single tiered systems. The
> >> interface may return some failure on such systems or not be available
> >> at all.
> >
> > Is there any strong reason for that? We do not have any interface to
> > control NUMA balancing from userspace. Why cannot we use the interface
> > for that purpose? 
> 
> Do you mean to demote the cold pages from the specified source nodes to
> the specified target nodes in different sockets?  We don't do that to
> avoid loop in the demotion path.  If we prevent the target nodes from
> demoting cold pages to the source nodes at the same time, it seems
> doable.

Loops could be avoid by properly specifying from and to nodes if this is
going to be a fine grained interface to control demotion.
  
Wei Xu Dec. 15, 2022, 5:58 p.m. UTC | #21
On Wed, Dec 14, 2022 at 2:23 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> > On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> > > [...]
> > > > > After these discussion, I think the solution maybe use different
> > > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > > > And, because demotion is considered aging, we don't need to disable
> > > > > demotion for "memory.reclaim", just don't count it.
> > > >
> > > > Hm, so in summary:
> > > >
> > > > 1) memory.reclaim would demote and reclaim like today, but it would
> > > >    change to only count reclaimed pages against the goal.
> > > >
> > > > 2) memory.demote would only demote.
> > > >
> >
> > If the above 2 points are agreeable then yes, this sounds good to me
> > and does address our use case.
> >
> > > >    a) What if the demotion targets are full? Would it reclaim or fail?
> > > >
> >
> > Wei will chime in if he disagrees, but I think we _require_ that it
> > fails, not falls back to reclaim. The interface is asking for
> > demotion, and is called memory.demote. For such an interface to fall
> > back to reclaim would be very confusing to userspace and may trigger
> > reclaim on a high priority job that we want to shield from proactive
> > reclaim.
>
> But what should happen if the immediate demotion target is full but
> lower tiers are still usable. Should the first one demote before
> allowing to demote from the top tier?

In that case, the demotion will fall back to the lower tiers.  See
node_get_allowed_targets() and establish_demotion_targets()..

> > > > 3) Would memory.reclaim and memory.demote still need nodemasks?
> >
> > memory.demote will need a nodemask, for sure. Today the nodemask would
> > be useful if there is a specific node in the top tier that is
> > overloaded and we want to reduce the pressure by demoting. In the
> > future there will be N tiers and the nodemask says which tier to
> > demote from.
>
> OK, so what is the exact semantic of the node mask. Does it control
> where to demote from or to or both?

The nodemask argument proposed here is to control where to demote
from.   We can follow the existing kernel demotion order to select
where to demote to.  If the need to control the demotion destination
arises, another argument can be added.

> > I don't think memory.reclaim would need a nodemask anymore? At least I
> > no longer see the use for it for us.
> >
> > > >    Would
> > > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> > > >    nodes or b) memory.demote gets passed any lasttier nodes?
> > >
> >
> > Honestly it would be great if memory.reclaim can force reclaim from a
> > top tier nodes. It breaks the aginig pipeline, yes, but if the user is
> > specifically asking for that because they decided in their usecase
> > it's a good idea then the kernel should comply IMO. Not a strict
> > requirement for us. Wei will chime in if he disagrees.
>
> That would require a nodemask to say which nodes to reclaim, no? The
> default behavior should be in line with what standard memory reclaim
> does. If the demotion is a part of that process so should be
> memory.reclaim part of it. If we want to have a finer control then a
> nodemask is really a must and then the nodemaks should constrain both
> agining and reclaim.

Given that the original meaning of memory.reclaim is to free up
memory, I agree that when a nodemask is provided, the kernel should be
allowed to do both aging/demotion and reclaim.  Whether to allow
reclaim from top-tier nodes is a kernel implementation choice.  The
userspace should not depend on that.

Also, because the expectation of memory.reclaim is to free up the
specified amount of bytes, I think if a page is demoted, but both its
source and target nodes are still in the given nodemask, such a
demoted page should not be counted towards the requested bytes of
memory.reclaim. In the case that no nodemask is given (i.e. to free up
memory from all nodes), the demoted pages should never be counted in
the return value of try_to_free_mem_cgroup_pages().

Meanwhile, I'd argue that even though we want to unify demotion and
reclaim, there are still significant differences between them.
Demotion moves pages between two memory tiers, while reclaim can move
pages to a much slower tier, e.g. disk-based files or swap.  Both the
page movement latencies and the reaccess latencies can be
significantly different for demotion/reclaim.  So it is useful for the
userspace to be able to request demotion without reclaim.  A separate
interface, e.g. memory.demote, seems like a good choice for that.

> > memory.demote returning -EINVAL for lasttier nodes makes sense to me.
> >
> > > I would also add
> > > 4) Do we want to allow to control the demotion path (e.g. which node to
> > >    demote from and to) and how to achieve that?
> >
> > We care deeply about specifying which node to demote _from_. That
> > would be some node that is approaching pressure and we're looking for
> > proactive saving from. So far I haven't seen any reason to control
> > which nodes to demote _to_. The kernel deciding that based on the
> > aging pipeline and the node distances sounds good to me. Obviously
> > someone else may find that useful.
>
> Please keep in mind that the interface should be really prepared for
> future extensions so try to abstract from your immediate usecases.
>
> > > 5) Is the demotion api restricted to multi-tier systems or any numa
> > >    configuration allowed as well?
> > >
> >
> > demotion will of course not work on single tiered systems. The
> > interface may return some failure on such systems or not be available
> > at all.
>
> Is there any strong reason for that? We do not have any interface to
> control NUMA balancing from userspace. Why cannot we use the interface
> for that purpose?

A demotion interface such as memory.demote will trigger the demotion
code path in the kernel, which depends on multiple memory tiers.

I think what you are getting is a more general page migration
interface for memcg, which will need both source and target nodes as
arguments. I think this can be a great idea.  It should be able to
support our demotion use cases as well.

> --
> Michal Hocko
> SUSE Labs
  
Huang, Ying Dec. 16, 2022, 3:02 a.m. UTC | #22
Michal Hocko <mhocko@suse.com> writes:

> On Thu 15-12-22 13:50:14, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Tue 13-12-22 11:29:45, Mina Almasry wrote:
>> >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> >
>> >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
>> >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
>> >> > [...]
>> >> > > > After these discussion, I think the solution maybe use different
>> >> > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
>> >> > > > reconsider "memory.demote".  In this way, we will always uncharge the
>> >> > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
>> >> > > > And, because demotion is considered aging, we don't need to disable
>> >> > > > demotion for "memory.reclaim", just don't count it.
>> >> > >
>> >> > > Hm, so in summary:
>> >> > >
>> >> > > 1) memory.reclaim would demote and reclaim like today, but it would
>> >> > >    change to only count reclaimed pages against the goal.
>> >> > >
>> >> > > 2) memory.demote would only demote.
>> >> > >
>> >> 
>> >> If the above 2 points are agreeable then yes, this sounds good to me
>> >> and does address our use case.
>> >> 
>> >> > >    a) What if the demotion targets are full? Would it reclaim or fail?
>> >> > >
>> >> 
>> >> Wei will chime in if he disagrees, but I think we _require_ that it
>> >> fails, not falls back to reclaim. The interface is asking for
>> >> demotion, and is called memory.demote. For such an interface to fall
>> >> back to reclaim would be very confusing to userspace and may trigger
>> >> reclaim on a high priority job that we want to shield from proactive
>> >> reclaim.
>> >
>> > But what should happen if the immediate demotion target is full but
>> > lower tiers are still usable. Should the first one demote before
>> > allowing to demote from the top tier?
>> >  
>> >> > > 3) Would memory.reclaim and memory.demote still need nodemasks?
>> >> 
>> >> memory.demote will need a nodemask, for sure. Today the nodemask would
>> >> be useful if there is a specific node in the top tier that is
>> >> overloaded and we want to reduce the pressure by demoting. In the
>> >> future there will be N tiers and the nodemask says which tier to
>> >> demote from.
>> >
>> > OK, so what is the exact semantic of the node mask. Does it control
>> > where to demote from or to or both?
>> >
>> >> I don't think memory.reclaim would need a nodemask anymore? At least I
>> >> no longer see the use for it for us.
>> >> 
>> >> > >    Would
>> >> > >    they return -EINVAL if a) memory.reclaim gets passed only toptier
>> >> > >    nodes or b) memory.demote gets passed any lasttier nodes?
>> >> >
>> >> 
>> >> Honestly it would be great if memory.reclaim can force reclaim from a
>> >> top tier nodes. It breaks the aginig pipeline, yes, but if the user is
>> >> specifically asking for that because they decided in their usecase
>> >> it's a good idea then the kernel should comply IMO. Not a strict
>> >> requirement for us. Wei will chime in if he disagrees.
>> >
>> > That would require a nodemask to say which nodes to reclaim, no? The
>> > default behavior should be in line with what standard memory reclaim
>> > does. If the demotion is a part of that process so should be
>> > memory.reclaim part of it. If we want to have a finer control then a
>> > nodemask is really a must and then the nodemaks should constrain both
>> > agining and reclaim.
>> >
>> >> memory.demote returning -EINVAL for lasttier nodes makes sense to me.
>> >> 
>> >> > I would also add
>> >> > 4) Do we want to allow to control the demotion path (e.g. which node to
>> >> >    demote from and to) and how to achieve that?
>> >> 
>> >> We care deeply about specifying which node to demote _from_. That
>> >> would be some node that is approaching pressure and we're looking for
>> >> proactive saving from. So far I haven't seen any reason to control
>> >> which nodes to demote _to_. The kernel deciding that based on the
>> >> aging pipeline and the node distances sounds good to me. Obviously
>> >> someone else may find that useful.
>> >
>> > Please keep in mind that the interface should be really prepared for
>> > future extensions so try to abstract from your immediate usecases.
>> 
>> I see two requirements here, one is to control the demotion source, that
>> is, which nodes to free memory.  The other is to control the demotion
>> path.  I think that we can use two different parameters for them, for
>> example, "from=<demotion source nodes>" and "to=<demotion target
>> nodes>".  In most cases we don't need to control the demotion path.
>> Because in current implementation, the nodes in the lower tiers in the
>> same socket (local nodes) will be preferred.  I think that this is
>> the desired behavior in most cases.
>
> Even if the demotion path is not really required at the moment we should
> keep in mind future potential extensions. E.g. when a userspace based
> balancing is to be implemented because the default behavior cannot
> capture userspace policies (one example would be enforcing a
> prioritization of containers when some container's demoted pages would
> need to be demoted further to free up a space for a different
> workload). 

Yes.  We should consider the potential requirements.

>> >> > 5) Is the demotion api restricted to multi-tier systems or any numa
>> >> >    configuration allowed as well?
>> >> >
>> >> 
>> >> demotion will of course not work on single tiered systems. The
>> >> interface may return some failure on such systems or not be available
>> >> at all.
>> >
>> > Is there any strong reason for that? We do not have any interface to
>> > control NUMA balancing from userspace. Why cannot we use the interface
>> > for that purpose? 
>> 
>> Do you mean to demote the cold pages from the specified source nodes to
>> the specified target nodes in different sockets?  We don't do that to
>> avoid loop in the demotion path.  If we prevent the target nodes from
>> demoting cold pages to the source nodes at the same time, it seems
>> doable.
>
> Loops could be avoid by properly specifying from and to nodes if this is
> going to be a fine grained interface to control demotion.

Yes.

Best Regards,
Huang, Ying
  
Michal Hocko Dec. 16, 2022, 8:40 a.m. UTC | #23
On Thu 15-12-22 09:58:12, Wei Xu wrote:
> On Wed, Dec 14, 2022 at 2:23 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 13-12-22 11:29:45, Mina Almasry wrote:
> > > On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > > > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> > > > [...]
> > > > > > After these discussion, I think the solution maybe use different
> > > > > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > > > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > > > > cgroup for "memory.reclaim".  This avoid the possible confusion there.
> > > > > > And, because demotion is considered aging, we don't need to disable
> > > > > > demotion for "memory.reclaim", just don't count it.
> > > > >
> > > > > Hm, so in summary:
> > > > >
> > > > > 1) memory.reclaim would demote and reclaim like today, but it would
> > > > >    change to only count reclaimed pages against the goal.
> > > > >
> > > > > 2) memory.demote would only demote.
> > > > >
> > >
> > > If the above 2 points are agreeable then yes, this sounds good to me
> > > and does address our use case.
> > >
> > > > >    a) What if the demotion targets are full? Would it reclaim or fail?
> > > > >
> > >
> > > Wei will chime in if he disagrees, but I think we _require_ that it
> > > fails, not falls back to reclaim. The interface is asking for
> > > demotion, and is called memory.demote. For such an interface to fall
> > > back to reclaim would be very confusing to userspace and may trigger
> > > reclaim on a high priority job that we want to shield from proactive
> > > reclaim.
> >
> > But what should happen if the immediate demotion target is full but
> > lower tiers are still usable. Should the first one demote before
> > allowing to demote from the top tier?
> 
> In that case, the demotion will fall back to the lower tiers.  See
> node_get_allowed_targets() and establish_demotion_targets()..

I am not talking about an implicit behavior that we do not want to cast
into interface. If we want to allow a fine grained control over demotion
then the implementation shouldn't rely on the current behavior.

[...]
> > Is there any strong reason for that? We do not have any interface to
> > control NUMA balancing from userspace. Why cannot we use the interface
> > for that purpose?
> 
> A demotion interface such as memory.demote will trigger the demotion
> code path in the kernel, which depends on multiple memory tiers.

Demotion is just a fancy name of a directed migration. There is no realy
dependency on the HW nor the technology.

> I think what you are getting is a more general page migration
> interface for memcg, which will need both source and target nodes as
> arguments. I think this can be a great idea.  It should be able to
> support our demotion use cases as well.

yes.
  

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 74cec76be9f2..c8ae7c897f14 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,17 +1245,13 @@  PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.

-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which contains the number of bytes to
+	reclaim.

 	Example::

 	  echo "1G" > memory.reclaim

-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1267,6 +1263,13 @@  PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.

+	This file also allows the user to specify the nodes to reclaim from,
+	via the 'nodes=' key, for example::
+
+	  echo "1G nodes=0,1" > memory.reclaim
+
+	The above instructs the kernel to reclaim memory from nodes 0,1.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ceed49516ad..2787b84eaf12 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,8 @@  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  nodemask_t *nodemask);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..0f02f47a87e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@ 
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2392,7 +2393,8 @@  static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2683,7 +2685,8 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,

 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options,
+						    NULL);
 	psi_memstall_leave(&pflags);

 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3503,7 +3506,8 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3614,7 +3618,8 @@  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  NULL))
 			nr_retries--;
 	}

@@ -6407,7 +6412,8 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}

 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL);

 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6456,7 +6462,8 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,

 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6579,21 +6586,54 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }

+enum {
+	MEMORY_RECLAIM_NODES = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	unsigned int reclaim_options;
-	int err;
+	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
+				       MEMCG_RECLAIM_PROACTIVE;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int token;
+	char value[256];
+	nodemask_t nodemask = NODE_MASK_ALL;

 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;

-	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		token = match_token(start, if_tokens, args);
+		match_strlcpy(value, args, sizeof(value));
+		switch (token) {
+		case MEMORY_RECLAIM_NODES:
+			if (nodelist_parse(value, nodemask) < 0)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;

@@ -6610,7 +6650,8 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,

 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, reclaim_options);
+						GFP_KERNEL, reclaim_options,
+						&nodemask);

 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b8e8e43806b..62b0c9b46bd2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6735,7 +6735,8 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6750,6 +6751,7 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.nodemask = nodemask,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put