[RFC,V1] mm: Disable demotion from proactive reclaim

Message ID 20221122203850.2765015-1-almasrymina@google.com
State New
Headers
Series [RFC,V1] mm: Disable demotion from proactive reclaim |

Commit Message

Mina Almasry Nov. 22, 2022, 8:38 p.m. UTC
  Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim""), the proactive reclaim interface memory.reclaim does both
reclaim and demotion. This is likely fine for us for latency critical
jobs where we would want to disable proactive reclaim entirely, and is
also fine for latency tolerant jobs where we would like to both
proactively reclaim and demote.

However, for some latency tiers in the middle we would like to demote but
not reclaim. This is because reclaim and demotion incur different latency
costs to the jobs in the cgroup. Demoted memory would still be addressable
by the userspace at a higher latency, but reclaimed memory would need to
incur a pagefault.

To address this, I propose having reclaim-only and demotion-only
mechanisms in the kernel. There are a couple possible
interfaces to carry this out I considered:

1. Disable demotion in the memory.reclaim interface and add a new
   demotion interface (memory.demote).
2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
   behavior in the kernel like so:
   	- demote=0 would disable demotion from this call.
	- demote=1 would allow the kernel to demote if it desires.
	- demote=2 would only demote if possible but not attempt any
	  other form of reclaim.

I've implemented option #1 in this RFC to demonstrate a sample and would
love feedback on the usecase and approach.

Additionally, when triggering proactive demotion it is preferrable to have
a nodelist argument that allows the userspace to proactively demote from
specific memory tiers according to its policy. The userspace may want
to demote from specific nodes that are under pressure, and may want to
demote from specific tiers that are under pressure. An example of this
use case would be in a 3 memory tier system, the userspace may want to
demote from the second tier without disturbing the hot memory in the top
tier.

The current RFC series is missing updates to docs and selftests, but if
the general approach and usecase is acceptable I plan to add these with
a PATCH V1 review.

First patch in this series disables demotion from the proactive reclaim
interface memory.reclaim. Follow up patches in the series implement the
memory.demote interface and its nodeslist arg.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/linux/swap.h |  6 ++++++
 mm/memcontrol.c      | 16 +++++++++-------
 mm/vmscan.c          | 21 +++++++++++++++++----
 3 files changed, 32 insertions(+), 11 deletions(-)

--
2.38.1.584.g0f3c55d4c2-goog
  

Comments

Johannes Weiner Nov. 23, 2022, 6 p.m. UTC | #1
Hello Mina,

On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote:
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. This is likely fine for us for latency critical
> jobs where we would want to disable proactive reclaim entirely, and is
> also fine for latency tolerant jobs where we would like to both
> proactively reclaim and demote.
> 
> However, for some latency tiers in the middle we would like to demote but
> not reclaim. This is because reclaim and demotion incur different latency
> costs to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
> 
> To address this, I propose having reclaim-only and demotion-only
> mechanisms in the kernel. There are a couple possible
> interfaces to carry this out I considered:
> 
> 1. Disable demotion in the memory.reclaim interface and add a new
>    demotion interface (memory.demote).
> 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
>    behavior in the kernel like so:
>    	- demote=0 would disable demotion from this call.
> 	- demote=1 would allow the kernel to demote if it desires.
> 	- demote=2 would only demote if possible but not attempt any
> 	  other form of reclaim.

Unfortunately, our proactive reclaim stack currently relies on
memory.reclaim doing both. It may not stay like that, but I'm a bit
wary of changing user-visible semantics post-facto.

In patch 2, you're adding a node interface to memory.demote. Can you
add this to memory.reclaim instead? This would allow you to control
demotion and reclaim independently as you please: if you call it on a
node with demotion targets, it will demote; if you call it on a node
without one, it'll reclaim. And current users will remain unaffected.
  
Mina Almasry Nov. 23, 2022, 9:20 p.m. UTC | #2
On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Mina,
>
> On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote:
> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim""), the proactive reclaim interface memory.reclaim does both
> > reclaim and demotion. This is likely fine for us for latency critical
> > jobs where we would want to disable proactive reclaim entirely, and is
> > also fine for latency tolerant jobs where we would like to both
> > proactively reclaim and demote.
> >
> > However, for some latency tiers in the middle we would like to demote but
> > not reclaim. This is because reclaim and demotion incur different latency
> > costs to the jobs in the cgroup. Demoted memory would still be addressable
> > by the userspace at a higher latency, but reclaimed memory would need to
> > incur a pagefault.
> >
> > To address this, I propose having reclaim-only and demotion-only
> > mechanisms in the kernel. There are a couple possible
> > interfaces to carry this out I considered:
> >
> > 1. Disable demotion in the memory.reclaim interface and add a new
> >    demotion interface (memory.demote).
> > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
> >    behavior in the kernel like so:
> >       - demote=0 would disable demotion from this call.
> >       - demote=1 would allow the kernel to demote if it desires.
> >       - demote=2 would only demote if possible but not attempt any
> >         other form of reclaim.
>
> Unfortunately, our proactive reclaim stack currently relies on
> memory.reclaim doing both. It may not stay like that, but I'm a bit
> wary of changing user-visible semantics post-facto.
>
> In patch 2, you're adding a node interface to memory.demote. Can you
> add this to memory.reclaim instead? This would allow you to control
> demotion and reclaim independently as you please: if you call it on a
> node with demotion targets, it will demote; if you call it on a node
> without one, it'll reclaim. And current users will remain unaffected.

Hello Johannes, thanks for taking a look!

I can certainly add the "nodes=" arg to memory.reclaim and you're
right, that would help in bridging the gap. However, if I understand
the underlying code correctly, with only the nodes= arg the kernel
will indeed attempt demotion first, but the kernel will also merrily
fall back to reclaiming if it can't demote the full amount. I had
hoped to have the flexibility to protect latency sensitive jobs from
reclaim entirely while attempting to do demotion.

There are probably ways to get around that in the userspace. I presume
the userspace can check if there is available memory on the node's
demotion targets, and if so, the kernel should demote-only. But I feel
that wouldn't be reliable as the demotion logic may change across
kernel versions. The userspace may think the kernel would demote but
instead demotion failed due to whatever heuristic introduced into the
new kernel version.

The above is just one angle of the issue. Another angle (which Yosry
would care most about I think) is that at Google we call
memory.reclaim mainly when memory.current is too close to memory.max
and we expect the memory usage of the cgroup to drop as a result of a
success memory.reclaim call. I suspect once we take in commit
3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""),
we would run into that regression, but I defer to Yosry here, he may
have a solution for that in mind already.

For these reasons, what do you think about adding both the "nodes="
and "demote=" args from my patch series to memory.reclaim? With these
args the default memory.reclaim behavior would suit meta as-is (no
change) but we would be able to configure it to address our use cases
as well.

As always, thanks for taking the time to review!
  
Yosry Ahmed Nov. 23, 2022, 9:35 p.m. UTC | #3
On Wed, Nov 23, 2022 at 1:21 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hello Mina,
> >
> > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote:
> > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > reclaim""), the proactive reclaim interface memory.reclaim does both
> > > reclaim and demotion. This is likely fine for us for latency critical
> > > jobs where we would want to disable proactive reclaim entirely, and is
> > > also fine for latency tolerant jobs where we would like to both
> > > proactively reclaim and demote.
> > >
> > > However, for some latency tiers in the middle we would like to demote but
> > > not reclaim. This is because reclaim and demotion incur different latency
> > > costs to the jobs in the cgroup. Demoted memory would still be addressable
> > > by the userspace at a higher latency, but reclaimed memory would need to
> > > incur a pagefault.
> > >
> > > To address this, I propose having reclaim-only and demotion-only
> > > mechanisms in the kernel. There are a couple possible
> > > interfaces to carry this out I considered:
> > >
> > > 1. Disable demotion in the memory.reclaim interface and add a new
> > >    demotion interface (memory.demote).
> > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
> > >    behavior in the kernel like so:
> > >       - demote=0 would disable demotion from this call.
> > >       - demote=1 would allow the kernel to demote if it desires.
> > >       - demote=2 would only demote if possible but not attempt any
> > >         other form of reclaim.
> >
> > Unfortunately, our proactive reclaim stack currently relies on
> > memory.reclaim doing both. It may not stay like that, but I'm a bit
> > wary of changing user-visible semantics post-facto.
> >
> > In patch 2, you're adding a node interface to memory.demote. Can you
> > add this to memory.reclaim instead? This would allow you to control
> > demotion and reclaim independently as you please: if you call it on a
> > node with demotion targets, it will demote; if you call it on a node
> > without one, it'll reclaim. And current users will remain unaffected.
>
> Hello Johannes, thanks for taking a look!
>
> I can certainly add the "nodes=" arg to memory.reclaim and you're
> right, that would help in bridging the gap. However, if I understand
> the underlying code correctly, with only the nodes= arg the kernel
> will indeed attempt demotion first, but the kernel will also merrily
> fall back to reclaiming if it can't demote the full amount. I had
> hoped to have the flexibility to protect latency sensitive jobs from
> reclaim entirely while attempting to do demotion.
>
> There are probably ways to get around that in the userspace. I presume
> the userspace can check if there is available memory on the node's
> demotion targets, and if so, the kernel should demote-only. But I feel
> that wouldn't be reliable as the demotion logic may change across
> kernel versions. The userspace may think the kernel would demote but
> instead demotion failed due to whatever heuristic introduced into the
> new kernel version.
>
> The above is just one angle of the issue. Another angle (which Yosry
> would care most about I think) is that at Google we call
> memory.reclaim mainly when memory.current is too close to memory.max
> and we expect the memory usage of the cgroup to drop as a result of a
> success memory.reclaim call. I suspect once we take in commit
> 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""),
> we would run into that regression, but I defer to Yosry here, he may
> have a solution for that in mind already.

We don't exactly rely on memory.current, but we do have a separate
proactive reclaim policy today from demotion, and we do expect
memory.reclaim to reclaim memory and not demote it. So it is important
that we can control reclaim vs. demotion separately. Having
memory.reclaim do demotions by default is not ideal for our current
setup, so at least having a demote= argument to control it (no
demotions, may demote, only demote) is needed.

>
> For these reasons, what do you think about adding both the "nodes="
> and "demote=" args from my patch series to memory.reclaim? With these
> args the default memory.reclaim behavior would suit meta as-is (no
> change) but we would be able to configure it to address our use cases
> as well.
>
> As always, thanks for taking the time to review!
  
Johannes Weiner Nov. 23, 2022, 9:58 p.m. UTC | #4
On Wed, Nov 23, 2022 at 01:20:57PM -0800, Mina Almasry wrote:
> On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hello Mina,
> >
> > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote:
> > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > reclaim""), the proactive reclaim interface memory.reclaim does both
> > > reclaim and demotion. This is likely fine for us for latency critical
> > > jobs where we would want to disable proactive reclaim entirely, and is
> > > also fine for latency tolerant jobs where we would like to both
> > > proactively reclaim and demote.
> > >
> > > However, for some latency tiers in the middle we would like to demote but
> > > not reclaim. This is because reclaim and demotion incur different latency
> > > costs to the jobs in the cgroup. Demoted memory would still be addressable
> > > by the userspace at a higher latency, but reclaimed memory would need to
> > > incur a pagefault.
> > >
> > > To address this, I propose having reclaim-only and demotion-only
> > > mechanisms in the kernel. There are a couple possible
> > > interfaces to carry this out I considered:
> > >
> > > 1. Disable demotion in the memory.reclaim interface and add a new
> > >    demotion interface (memory.demote).
> > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
> > >    behavior in the kernel like so:
> > >       - demote=0 would disable demotion from this call.
> > >       - demote=1 would allow the kernel to demote if it desires.
> > >       - demote=2 would only demote if possible but not attempt any
> > >         other form of reclaim.
> >
> > Unfortunately, our proactive reclaim stack currently relies on
> > memory.reclaim doing both. It may not stay like that, but I'm a bit
> > wary of changing user-visible semantics post-facto.
> >
> > In patch 2, you're adding a node interface to memory.demote. Can you
> > add this to memory.reclaim instead? This would allow you to control
> > demotion and reclaim independently as you please: if you call it on a
> > node with demotion targets, it will demote; if you call it on a node
> > without one, it'll reclaim. And current users will remain unaffected.
> 
> Hello Johannes, thanks for taking a look!
> 
> I can certainly add the "nodes=" arg to memory.reclaim and you're
> right, that would help in bridging the gap. However, if I understand
> the underlying code correctly, with only the nodes= arg the kernel
> will indeed attempt demotion first, but the kernel will also merrily
> fall back to reclaiming if it can't demote the full amount. I had
> hoped to have the flexibility to protect latency sensitive jobs from
> reclaim entirely while attempting to do demotion.

The fallback to reclaim actually strikes me as wrong.

Think of reclaim as 'demoting' the pages to the storage tier. If we
have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
CXL and from CXL to storage. If we reclaim a page from RAM, it means
we 'demote' it directly from RAM to storage, bypassing potentially a
huge amount of pages colder than it in CXL. That doesn't seem right.

If demotion fails, IMO it shouldn't satisfy the reclaim request by
breaking the layering. Rather it should deflect that pressure to the
lower layers to make room. This makes sure we maintain an aging
pipeline that honors the memory tier hierarchy.

So I'm hesitant to design cgroup controls around the current behavior.

> The above is just one angle of the issue. Another angle (which Yosry
> would care most about I think) is that at Google we call
> memory.reclaim mainly when memory.current is too close to memory.max
> and we expect the memory usage of the cgroup to drop as a result of a
> success memory.reclaim call. I suspect once we take in commit
> 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""),
> we would run into that regression, but I defer to Yosry here, he may
> have a solution for that in mind already.

IMO it should both demote and reclaim. Simliar to how memory.reclaim
on a non-tiered memory system would both deactivate active pages and
reclaim inactive pages.
  
Johannes Weiner Nov. 23, 2022, 10:30 p.m. UTC | #5
On Wed, Nov 23, 2022 at 01:35:13PM -0800, Yosry Ahmed wrote:
> On Wed, Nov 23, 2022 at 1:21 PM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Hello Mina,
> > >
> > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote:
> > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > > reclaim""), the proactive reclaim interface memory.reclaim does both
> > > > reclaim and demotion. This is likely fine for us for latency critical
> > > > jobs where we would want to disable proactive reclaim entirely, and is
> > > > also fine for latency tolerant jobs where we would like to both
> > > > proactively reclaim and demote.
> > > >
> > > > However, for some latency tiers in the middle we would like to demote but
> > > > not reclaim. This is because reclaim and demotion incur different latency
> > > > costs to the jobs in the cgroup. Demoted memory would still be addressable
> > > > by the userspace at a higher latency, but reclaimed memory would need to
> > > > incur a pagefault.
> > > >
> > > > To address this, I propose having reclaim-only and demotion-only
> > > > mechanisms in the kernel. There are a couple possible
> > > > interfaces to carry this out I considered:
> > > >
> > > > 1. Disable demotion in the memory.reclaim interface and add a new
> > > >    demotion interface (memory.demote).
> > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
> > > >    behavior in the kernel like so:
> > > >       - demote=0 would disable demotion from this call.
> > > >       - demote=1 would allow the kernel to demote if it desires.
> > > >       - demote=2 would only demote if possible but not attempt any
> > > >         other form of reclaim.
> > >
> > > Unfortunately, our proactive reclaim stack currently relies on
> > > memory.reclaim doing both. It may not stay like that, but I'm a bit
> > > wary of changing user-visible semantics post-facto.
> > >
> > > In patch 2, you're adding a node interface to memory.demote. Can you
> > > add this to memory.reclaim instead? This would allow you to control
> > > demotion and reclaim independently as you please: if you call it on a
> > > node with demotion targets, it will demote; if you call it on a node
> > > without one, it'll reclaim. And current users will remain unaffected.
> >
> > Hello Johannes, thanks for taking a look!
> >
> > I can certainly add the "nodes=" arg to memory.reclaim and you're
> > right, that would help in bridging the gap. However, if I understand
> > the underlying code correctly, with only the nodes= arg the kernel
> > will indeed attempt demotion first, but the kernel will also merrily
> > fall back to reclaiming if it can't demote the full amount. I had
> > hoped to have the flexibility to protect latency sensitive jobs from
> > reclaim entirely while attempting to do demotion.
> >
> > There are probably ways to get around that in the userspace. I presume
> > the userspace can check if there is available memory on the node's
> > demotion targets, and if so, the kernel should demote-only. But I feel
> > that wouldn't be reliable as the demotion logic may change across
> > kernel versions. The userspace may think the kernel would demote but
> > instead demotion failed due to whatever heuristic introduced into the
> > new kernel version.
> >
> > The above is just one angle of the issue. Another angle (which Yosry
> > would care most about I think) is that at Google we call
> > memory.reclaim mainly when memory.current is too close to memory.max
> > and we expect the memory usage of the cgroup to drop as a result of a
> > success memory.reclaim call. I suspect once we take in commit
> > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""),
> > we would run into that regression, but I defer to Yosry here, he may
> > have a solution for that in mind already.
> 
> We don't exactly rely on memory.current, but we do have a separate
> proactive reclaim policy today from demotion, and we do expect
> memory.reclaim to reclaim memory and not demote it. So it is important
> that we can control reclaim vs. demotion separately. Having
> memory.reclaim do demotions by default is not ideal for our current
> setup, so at least having a demote= argument to control it (no
> demotions, may demote, only demote) is needed.

With a nodemask you should be able to only reclaim by specifying
terminal memory tiers that do that, and leave out higher tiers that
demote.

That said, it would actually be nice if reclaim policy wouldn't have
to differ from demotion policy longer term. Ultimately it comes down
to mapping age to memory tier, right? Such that hot pages are in RAM,
warm pages are in CXL, cold pages are in storage. If you apply equal
presure on all tiers, it's access frequency that should determine
which RAM pages to demote, and which CXL pages to reclaim. If RAM
pages are hot and refuse demotion, and CXL pages are cold in
comparison, CXL should clear out. If RAM pages are warm, they should
get demoted to CXL but not reclaimed further from there (and rotate
instead).

Do we know what's preventing this from happening today? What's the
reason you want to control them independently?
  
Mina Almasry Nov. 23, 2022, 10:37 p.m. UTC | #6
On Wed, Nov 23, 2022 at 1:57 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Nov 23, 2022 at 01:20:57PM -0800, Mina Almasry wrote:
> > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Hello Mina,
> > >
> > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote:
> > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > > reclaim""), the proactive reclaim interface memory.reclaim does both
> > > > reclaim and demotion. This is likely fine for us for latency critical
> > > > jobs where we would want to disable proactive reclaim entirely, and is
> > > > also fine for latency tolerant jobs where we would like to both
> > > > proactively reclaim and demote.
> > > >
> > > > However, for some latency tiers in the middle we would like to demote but
> > > > not reclaim. This is because reclaim and demotion incur different latency
> > > > costs to the jobs in the cgroup. Demoted memory would still be addressable
> > > > by the userspace at a higher latency, but reclaimed memory would need to
> > > > incur a pagefault.
> > > >
> > > > To address this, I propose having reclaim-only and demotion-only
> > > > mechanisms in the kernel. There are a couple possible
> > > > interfaces to carry this out I considered:
> > > >
> > > > 1. Disable demotion in the memory.reclaim interface and add a new
> > > >    demotion interface (memory.demote).
> > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
> > > >    behavior in the kernel like so:
> > > >       - demote=0 would disable demotion from this call.
> > > >       - demote=1 would allow the kernel to demote if it desires.
> > > >       - demote=2 would only demote if possible but not attempt any
> > > >         other form of reclaim.
> > >
> > > Unfortunately, our proactive reclaim stack currently relies on
> > > memory.reclaim doing both. It may not stay like that, but I'm a bit
> > > wary of changing user-visible semantics post-facto.
> > >
> > > In patch 2, you're adding a node interface to memory.demote. Can you
> > > add this to memory.reclaim instead? This would allow you to control
> > > demotion and reclaim independently as you please: if you call it on a
> > > node with demotion targets, it will demote; if you call it on a node
> > > without one, it'll reclaim. And current users will remain unaffected.
> >
> > Hello Johannes, thanks for taking a look!
> >
> > I can certainly add the "nodes=" arg to memory.reclaim and you're
> > right, that would help in bridging the gap. However, if I understand
> > the underlying code correctly, with only the nodes= arg the kernel
> > will indeed attempt demotion first, but the kernel will also merrily
> > fall back to reclaiming if it can't demote the full amount. I had
> > hoped to have the flexibility to protect latency sensitive jobs from
> > reclaim entirely while attempting to do demotion.
>
> The fallback to reclaim actually strikes me as wrong.
>
> Think of reclaim as 'demoting' the pages to the storage tier. If we
> have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> CXL and from CXL to storage. If we reclaim a page from RAM, it means
> we 'demote' it directly from RAM to storage, bypassing potentially a
> huge amount of pages colder than it in CXL. That doesn't seem right.
>

Ah, I see. When you put it like that it makes a lot of sense. Reclaim
would be just another type of demotion as you put it, i.e. demoting
from the lowest memory tier to storage. I assume in your model
demoting from the lowest memory tier to storage includes all of
swapping, writing back of dirty pages, and discarding clean file
pages. All these pages (anon, clean file pages, or dirty file pages)
should first be demoted down the memory tiers until finally they get
'demoted' to storage. i.e. reclaimed.

> If demotion fails, IMO it shouldn't satisfy the reclaim request by
> breaking the layering. Rather it should deflect that pressure to the
> lower layers to make room. This makes sure we maintain an aging
> pipeline that honors the memory tier hierarchy.
>

Also got it. I believe the pseudo code would be roughly like a bubble
sort algorithm of sorts, where the coldest pages are demoted to the
next memory tier, and finally reclaimed from the final memory tier:

demoted_pages = 0;
retry:
    memory_tier = lowest_tier:
    for (; memory_tier < 0; memory_tier--) {
        if (memory_tier == lowest_tier)
            demoted_pages += demote_to_storage();
        else
            demote_pages += demote_to_the_next_memory_tier(memory_tier);
    }
    if (demoted_pages < pages_to_demote)
         goto retry;

> So I'm hesitant to design cgroup controls around the current behavior.
>

Thanks for taking the time to explain this. I think if it sounds good
to folks I'll add the "nodes=" arg as you described now. Reworking the
reclaim algorithm for memory tiering would be a bigger change in need
of its own patchset.

I think the nodes= arg by itself would help bridge the gap quite a
bit. I surmise for Google we can :
1. Force reclaim by: echo "<size> nodes=<lowest memory tier nodes>" >
memory.reclaim
2. Almost force demotion by: echo "<size> nodes=<highest memory tier
nodes>" > memory.reclaim

In case #2 the kernel may still reclaim it if it can't demote the full
amount, but that is as you put it more of a bug that should be fixed.

However, even in a world where the reclaim code works as you
described, I wonder if we still need some kind of demote= arg. The
issue is that demoting from the lowest memory tier to storage incurs
more of a latency impact to the application than demoting between the
other memory tiers, and that's because the other memory tiers are
directly addressable, while pages demote to storage incur a pagefault.
Not sure if that's a big concern at the moment.

> > The above is just one angle of the issue. Another angle (which Yosry
> > would care most about I think) is that at Google we call
> > memory.reclaim mainly when memory.current is too close to memory.max
> > and we expect the memory usage of the cgroup to drop as a result of a
> > success memory.reclaim call. I suspect once we take in commit
> > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""),
> > we would run into that regression, but I defer to Yosry here, he may
> > have a solution for that in mind already.
>
> IMO it should both demote and reclaim. Simliar to how memory.reclaim
> on a non-tiered memory system would both deactivate active pages and
> reclaim inactive pages.
  
Yosry Ahmed Nov. 23, 2022, 11:47 p.m. UTC | #7
On Wed, Nov 23, 2022 at 2:30 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Nov 23, 2022 at 01:35:13PM -0800, Yosry Ahmed wrote:
> > On Wed, Nov 23, 2022 at 1:21 PM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > Hello Mina,
> > > >
> > > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote:
> > > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > > > > reclaim""), the proactive reclaim interface memory.reclaim does both
> > > > > reclaim and demotion. This is likely fine for us for latency critical
> > > > > jobs where we would want to disable proactive reclaim entirely, and is
> > > > > also fine for latency tolerant jobs where we would like to both
> > > > > proactively reclaim and demote.
> > > > >
> > > > > However, for some latency tiers in the middle we would like to demote but
> > > > > not reclaim. This is because reclaim and demotion incur different latency
> > > > > costs to the jobs in the cgroup. Demoted memory would still be addressable
> > > > > by the userspace at a higher latency, but reclaimed memory would need to
> > > > > incur a pagefault.
> > > > >
> > > > > To address this, I propose having reclaim-only and demotion-only
> > > > > mechanisms in the kernel. There are a couple possible
> > > > > interfaces to carry this out I considered:
> > > > >
> > > > > 1. Disable demotion in the memory.reclaim interface and add a new
> > > > >    demotion interface (memory.demote).
> > > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion
> > > > >    behavior in the kernel like so:
> > > > >       - demote=0 would disable demotion from this call.
> > > > >       - demote=1 would allow the kernel to demote if it desires.
> > > > >       - demote=2 would only demote if possible but not attempt any
> > > > >         other form of reclaim.
> > > >
> > > > Unfortunately, our proactive reclaim stack currently relies on
> > > > memory.reclaim doing both. It may not stay like that, but I'm a bit
> > > > wary of changing user-visible semantics post-facto.
> > > >
> > > > In patch 2, you're adding a node interface to memory.demote. Can you
> > > > add this to memory.reclaim instead? This would allow you to control
> > > > demotion and reclaim independently as you please: if you call it on a
> > > > node with demotion targets, it will demote; if you call it on a node
> > > > without one, it'll reclaim. And current users will remain unaffected.
> > >
> > > Hello Johannes, thanks for taking a look!
> > >
> > > I can certainly add the "nodes=" arg to memory.reclaim and you're
> > > right, that would help in bridging the gap. However, if I understand
> > > the underlying code correctly, with only the nodes= arg the kernel
> > > will indeed attempt demotion first, but the kernel will also merrily
> > > fall back to reclaiming if it can't demote the full amount. I had
> > > hoped to have the flexibility to protect latency sensitive jobs from
> > > reclaim entirely while attempting to do demotion.
> > >
> > > There are probably ways to get around that in the userspace. I presume
> > > the userspace can check if there is available memory on the node's
> > > demotion targets, and if so, the kernel should demote-only. But I feel
> > > that wouldn't be reliable as the demotion logic may change across
> > > kernel versions. The userspace may think the kernel would demote but
> > > instead demotion failed due to whatever heuristic introduced into the
> > > new kernel version.
> > >
> > > The above is just one angle of the issue. Another angle (which Yosry
> > > would care most about I think) is that at Google we call
> > > memory.reclaim mainly when memory.current is too close to memory.max
> > > and we expect the memory usage of the cgroup to drop as a result of a
> > > success memory.reclaim call. I suspect once we take in commit
> > > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""),
> > > we would run into that regression, but I defer to Yosry here, he may
> > > have a solution for that in mind already.
> >
> > We don't exactly rely on memory.current, but we do have a separate
> > proactive reclaim policy today from demotion, and we do expect
> > memory.reclaim to reclaim memory and not demote it. So it is important
> > that we can control reclaim vs. demotion separately. Having
> > memory.reclaim do demotions by default is not ideal for our current
> > setup, so at least having a demote= argument to control it (no
> > demotions, may demote, only demote) is needed.
>
> With a nodemask you should be able to only reclaim by specifying
> terminal memory tiers that do that, and leave out higher tiers that
> demote.
>
> That said, it would actually be nice if reclaim policy wouldn't have
> to differ from demotion policy longer term. Ultimately it comes down
> to mapping age to memory tier, right? Such that hot pages are in RAM,
> warm pages are in CXL, cold pages are in storage. If you apply equal
> presure on all tiers, it's access frequency that should determine
> which RAM pages to demote, and which CXL pages to reclaim. If RAM
> pages are hot and refuse demotion, and CXL pages are cold in
> comparison, CXL should clear out. If RAM pages are warm, they should
> get demoted to CXL but not reclaimed further from there (and rotate
> instead).
>
> Do we know what's preventing this from happening today? What's the
> reason you want to control them independently?

The motivation was giving user space more flexibility to design their
policies. However, as you point out, the current behavior of falling
back to reclaiming when we cannot demote is not ideal, and maybe we
should not design policies around it. We can always revisit this if a
use case arises where a clear distinction needs to be drawn between
reclaiming and demotion policies.
  
Huang, Ying Nov. 24, 2022, 5:51 a.m. UTC | #8
Hi, Johannes,

Johannes Weiner <hannes@cmpxchg.org> writes:
[...]
>
> The fallback to reclaim actually strikes me as wrong.
>
> Think of reclaim as 'demoting' the pages to the storage tier. If we
> have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> CXL and from CXL to storage. If we reclaim a page from RAM, it means
> we 'demote' it directly from RAM to storage, bypassing potentially a
> huge amount of pages colder than it in CXL. That doesn't seem right.
>
> If demotion fails, IMO it shouldn't satisfy the reclaim request by
> breaking the layering. Rather it should deflect that pressure to the
> lower layers to make room. This makes sure we maintain an aging
> pipeline that honors the memory tier hierarchy.

Yes.  I think that we should avoid to fall back to reclaim as much as
possible too.  Now, when we allocate memory for demotion
(alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
kswapd reclaim on lower tier node to free some memory to avoid fall back
to reclaim on current (higher tier) node.  This may be not good enough,
for example, the following patch from Hasan may help via waking up
kswapd earlier.

https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/

Do you know what is the next step plan for this patch?

Should we do even more?

From another point of view, I still think that we can use falling back
to reclaim as the last resort to avoid OOM in some special situations,
for example, most pages in the lowest tier node are mlock() or too hot
to be reclaimed.

> So I'm hesitant to design cgroup controls around the current behavior.
>

Best Regards,
Huang, Ying
  
Yang Shi Nov. 28, 2022, 10:24 p.m. UTC | #9
On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Johannes,
>
> Johannes Weiner <hannes@cmpxchg.org> writes:
> [...]
> >
> > The fallback to reclaim actually strikes me as wrong.
> >
> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> > we 'demote' it directly from RAM to storage, bypassing potentially a
> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >
> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> > breaking the layering. Rather it should deflect that pressure to the
> > lower layers to make room. This makes sure we maintain an aging
> > pipeline that honors the memory tier hierarchy.
>
> Yes.  I think that we should avoid to fall back to reclaim as much as
> possible too.  Now, when we allocate memory for demotion
> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> kswapd reclaim on lower tier node to free some memory to avoid fall back
> to reclaim on current (higher tier) node.  This may be not good enough,
> for example, the following patch from Hasan may help via waking up
> kswapd earlier.

For the ideal case, I do agree with Johannes to demote the page tier
by tier rather than reclaiming them from the higher tiers. But I also
agree with your premature OOM concern.

>
> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>
> Do you know what is the next step plan for this patch?
>
> Should we do even more?

In my initial implementation I implemented a simple throttle logic
when the demotion is not going to succeed if the demotion target has
not enough free memory (just check the watermark) to make migration
succeed without doing any reclamation. Shall we resurrect that?

Waking kswapd sooner is fine to me, but it may be not enough, for
example, the kswapd may not keep up so remature OOM may happen on
higher tiers or reclaim may still happen. I think throttling the
reclaimer/demoter until kswapd makes progress could avoid both. And
since the lower tiers memory typically is quite larger than the higher
tiers, so the throttle should happen very rarely IMHO.

>
> From another point of view, I still think that we can use falling back
> to reclaim as the last resort to avoid OOM in some special situations,
> for example, most pages in the lowest tier node are mlock() or too hot
> to be reclaimed.
>
> > So I'm hesitant to design cgroup controls around the current behavior.
> >
>
> Best Regards,
> Huang, Ying
>
  
Huang, Ying Nov. 29, 2022, 12:53 a.m. UTC | #10
Yang Shi <shy828301@gmail.com> writes:

> On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Johannes,
>>
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>> [...]
>> >
>> > The fallback to reclaim actually strikes me as wrong.
>> >
>> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >
>> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> > breaking the layering. Rather it should deflect that pressure to the
>> > lower layers to make room. This makes sure we maintain an aging
>> > pipeline that honors the memory tier hierarchy.
>>
>> Yes.  I think that we should avoid to fall back to reclaim as much as
>> possible too.  Now, when we allocate memory for demotion
>> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> to reclaim on current (higher tier) node.  This may be not good enough,
>> for example, the following patch from Hasan may help via waking up
>> kswapd earlier.
>
> For the ideal case, I do agree with Johannes to demote the page tier
> by tier rather than reclaiming them from the higher tiers. But I also
> agree with your premature OOM concern.
>
>>
>> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>>
>> Do you know what is the next step plan for this patch?
>>
>> Should we do even more?
>
> In my initial implementation I implemented a simple throttle logic
> when the demotion is not going to succeed if the demotion target has
> not enough free memory (just check the watermark) to make migration
> succeed without doing any reclamation. Shall we resurrect that?

Can you share the link to your throttle patch?  Or paste it here?

> Waking kswapd sooner is fine to me, but it may be not enough, for
> example, the kswapd may not keep up so remature OOM may happen on
> higher tiers or reclaim may still happen. I think throttling the
> reclaimer/demoter until kswapd makes progress could avoid both. And
> since the lower tiers memory typically is quite larger than the higher
> tiers, so the throttle should happen very rarely IMHO.
>
>>
>> From another point of view, I still think that we can use falling back
>> to reclaim as the last resort to avoid OOM in some special situations,
>> for example, most pages in the lowest tier node are mlock() or too hot
>> to be reclaimed.
>>
>> > So I'm hesitant to design cgroup controls around the current behavior.

Best Regards,
Huang, Ying
  
Yang Shi Nov. 29, 2022, 5:27 p.m. UTC | #11
On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Johannes,
> >>
> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >> [...]
> >> >
> >> > The fallback to reclaim actually strikes me as wrong.
> >> >
> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >> >
> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> >> > breaking the layering. Rather it should deflect that pressure to the
> >> > lower layers to make room. This makes sure we maintain an aging
> >> > pipeline that honors the memory tier hierarchy.
> >>
> >> Yes.  I think that we should avoid to fall back to reclaim as much as
> >> possible too.  Now, when we allocate memory for demotion
> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
> >> to reclaim on current (higher tier) node.  This may be not good enough,
> >> for example, the following patch from Hasan may help via waking up
> >> kswapd earlier.
> >
> > For the ideal case, I do agree with Johannes to demote the page tier
> > by tier rather than reclaiming them from the higher tiers. But I also
> > agree with your premature OOM concern.
> >
> >>
> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> >>
> >> Do you know what is the next step plan for this patch?
> >>
> >> Should we do even more?
> >
> > In my initial implementation I implemented a simple throttle logic
> > when the demotion is not going to succeed if the demotion target has
> > not enough free memory (just check the watermark) to make migration
> > succeed without doing any reclamation. Shall we resurrect that?
>
> Can you share the link to your throttle patch?  Or paste it here?

I just found this on the mailing list.
https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/

But it didn't have the throttling logic, I may not submit that version
to the mailing list since we decided to drop this and merge mine and
Dave's.

Anyway it is not hard to add the throttling logic, we already have a
few throttling cases in vmscan, for example, "mm/vmscan: throttle
reclaim until some writeback completes if congested".
>
> > Waking kswapd sooner is fine to me, but it may be not enough, for
> > example, the kswapd may not keep up so remature OOM may happen on
> > higher tiers or reclaim may still happen. I think throttling the
> > reclaimer/demoter until kswapd makes progress could avoid both. And
> > since the lower tiers memory typically is quite larger than the higher
> > tiers, so the throttle should happen very rarely IMHO.
> >
> >>
> >> From another point of view, I still think that we can use falling back
> >> to reclaim as the last resort to avoid OOM in some special situations,
> >> for example, most pages in the lowest tier node are mlock() or too hot
> >> to be reclaimed.
> >>
> >> > So I'm hesitant to design cgroup controls around the current behavior.
>
> Best Regards,
> Huang, Ying
  
Johannes Weiner Nov. 29, 2022, 6:08 p.m. UTC | #12
Hello Ying,

On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
> > The fallback to reclaim actually strikes me as wrong.
> >
> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> > we 'demote' it directly from RAM to storage, bypassing potentially a
> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >
> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> > breaking the layering. Rather it should deflect that pressure to the
> > lower layers to make room. This makes sure we maintain an aging
> > pipeline that honors the memory tier hierarchy.
> 
> Yes.  I think that we should avoid to fall back to reclaim as much as
> possible too.  Now, when we allocate memory for demotion
> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> kswapd reclaim on lower tier node to free some memory to avoid fall back
> to reclaim on current (higher tier) node.  This may be not good enough,
> for example, the following patch from Hasan may help via waking up
> kswapd earlier.
> 
> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> 
> Do you know what is the next step plan for this patch?
> 
> Should we do even more?
> 
> From another point of view, I still think that we can use falling back
> to reclaim as the last resort to avoid OOM in some special situations,
> for example, most pages in the lowest tier node are mlock() or too hot
> to be reclaimed.

If they're hotter than reclaim candidates on the toptier, shouldn't
they get promoted instead and make room that way? We may have to tweak
the watermark logic a bit to facilitate that (allow promotions where
regular allocations already fail?). But this sort of resorting would
be preferable to age inversions.

The mlock scenario sounds possible. In that case, it wouldn't be an
aging inversion, since there is nothing colder on the CXL node.

Maybe a bypass check should explicitly consult the demotion target
watermarks against its evictable pages (similar to the file_is_tiny
check in prepare_scan_count)?

Because in any other scenario, if there is a bug in the promo/demo
coordination, I think we'd rather have the OOM than deal with age
inversions causing intermittent performance issues that are incredibly
hard to track down.
  
Mina Almasry Nov. 30, 2022, 2:14 a.m. UTC | #13
On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Johannes,
>
> Johannes Weiner <hannes@cmpxchg.org> writes:
> [...]
> >
> > The fallback to reclaim actually strikes me as wrong.
> >
> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> > we 'demote' it directly from RAM to storage, bypassing potentially a
> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >
> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> > breaking the layering. Rather it should deflect that pressure to the
> > lower layers to make room. This makes sure we maintain an aging
> > pipeline that honors the memory tier hierarchy.
>
> Yes.  I think that we should avoid to fall back to reclaim as much as
> possible too.  Now, when we allocate memory for demotion
> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger

I may be missing something but as far I can tell reclaim is disabled
for allocations from lower tier memory:
https://elixir.bootlin.com/linux/v6.1-rc7/source/mm/vmscan.c#L1583

I think this is maybe a good thing when doing proactive demotion. In
this case we probably don't want to try to reclaim from lower tier
nodes and instead fail the proactive demotion. However I can see this
being desirable when the top tier nodes are under real memory pressure
to deflect that pressure to the lower tier nodes.

> kswapd reclaim on lower tier node to free some memory to avoid fall back
> to reclaim on current (higher tier) node.  This may be not good enough,
> for example, the following patch from Hasan may help via waking up
> kswapd earlier.
>
> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>
> Do you know what is the next step plan for this patch?
>
> Should we do even more?
>
> From another point of view, I still think that we can use falling back
> to reclaim as the last resort to avoid OOM in some special situations,
> for example, most pages in the lowest tier node are mlock() or too hot
> to be reclaimed.
>
> > So I'm hesitant to design cgroup controls around the current behavior.

I sent RFC v2 patch:
https://lore.kernel.org/linux-mm/20221130020328.1009347-1-almasrymina@google.com/T/#u

Please take a look when convenient. Thanks!

> >
>
> Best Regards,
> Huang, Ying
>
  
Huang, Ying Nov. 30, 2022, 3:55 a.m. UTC | #14
Johannes Weiner <hannes@cmpxchg.org> writes:

> Hello Ying,
>
> On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>> > The fallback to reclaim actually strikes me as wrong.
>> >
>> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >
>> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> > breaking the layering. Rather it should deflect that pressure to the
>> > lower layers to make room. This makes sure we maintain an aging
>> > pipeline that honors the memory tier hierarchy.
>> 
>> Yes.  I think that we should avoid to fall back to reclaim as much as
>> possible too.  Now, when we allocate memory for demotion
>> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> to reclaim on current (higher tier) node.  This may be not good enough,
>> for example, the following patch from Hasan may help via waking up
>> kswapd earlier.
>> 
>> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>> 
>> Do you know what is the next step plan for this patch?
>> 
>> Should we do even more?
>> 
>> From another point of view, I still think that we can use falling back
>> to reclaim as the last resort to avoid OOM in some special situations,
>> for example, most pages in the lowest tier node are mlock() or too hot
>> to be reclaimed.
>
> If they're hotter than reclaim candidates on the toptier, shouldn't
> they get promoted instead and make room that way? We may have to tweak
> the watermark logic a bit to facilitate that (allow promotions where
> regular allocations already fail?). But this sort of resorting would
> be preferable to age inversions.

Now it's legal to enable demotion and disable promotion.  Yes, this is
wrong configuration in general.  But should we trigger OOM for these
users?

And now promotion only works for default NUMA policy (and MPOL_BIND to
both promotion source and target nodes with MPOL_F_NUMA_BALANCING).  If
we use some other NUMA policy, the pages cannot be promoted too.

> The mlock scenario sounds possible. In that case, it wouldn't be an
> aging inversion, since there is nothing colder on the CXL node.
>
> Maybe a bypass check should explicitly consult the demotion target
> watermarks against its evictable pages (similar to the file_is_tiny
> check in prepare_scan_count)?

Yes.  This sounds doable.

> Because in any other scenario, if there is a bug in the promo/demo
> coordination, I think we'd rather have the OOM than deal with age
> inversions causing intermittent performance issues that are incredibly
> hard to track down.

Previously, I thought that people will always prefer performance
regression than OOM.  Apparently, I am wrong.

Anyway, I think that we need to reduce the possibility of OOM or falling
back to reclaim as much as possible firstly.  Do you agree?

One possibility, can we fall back to reclaim only if the sc->priority is
small enough (even 0)?

Best Regards,
Huang, Ying
  
Huang, Ying Nov. 30, 2022, 5:31 a.m. UTC | #15
Yang Shi <shy828301@gmail.com> writes:

> On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yang Shi <shy828301@gmail.com> writes:
>>
>> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Johannes,
>> >>
>> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >> [...]
>> >> >
>> >> > The fallback to reclaim actually strikes me as wrong.
>> >> >
>> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >> >
>> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> >> > breaking the layering. Rather it should deflect that pressure to the
>> >> > lower layers to make room. This makes sure we maintain an aging
>> >> > pipeline that honors the memory tier hierarchy.
>> >>
>> >> Yes.  I think that we should avoid to fall back to reclaim as much as
>> >> possible too.  Now, when we allocate memory for demotion
>> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> >> to reclaim on current (higher tier) node.  This may be not good enough,
>> >> for example, the following patch from Hasan may help via waking up
>> >> kswapd earlier.
>> >
>> > For the ideal case, I do agree with Johannes to demote the page tier
>> > by tier rather than reclaiming them from the higher tiers. But I also
>> > agree with your premature OOM concern.
>> >
>> >>
>> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>> >>
>> >> Do you know what is the next step plan for this patch?
>> >>
>> >> Should we do even more?
>> >
>> > In my initial implementation I implemented a simple throttle logic
>> > when the demotion is not going to succeed if the demotion target has
>> > not enough free memory (just check the watermark) to make migration
>> > succeed without doing any reclamation. Shall we resurrect that?
>>
>> Can you share the link to your throttle patch?  Or paste it here?
>
> I just found this on the mailing list.
> https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/

Per my understanding, this patch will avoid demoting if there's no free
space on demotion target?  If so, I think that we should trigger kswapd
reclaiming on demotion target before that.  And we can simply avoid to
fall back to reclaim firstly, then avoid to scan as an improvement as
that in your patch above.

Best Regards,
Huang, Ying

> But it didn't have the throttling logic, I may not submit that version
> to the mailing list since we decided to drop this and merge mine and
> Dave's.
>
> Anyway it is not hard to add the throttling logic, we already have a
> few throttling cases in vmscan, for example, "mm/vmscan: throttle
> reclaim until some writeback completes if congested".
>>
>> > Waking kswapd sooner is fine to me, but it may be not enough, for
>> > example, the kswapd may not keep up so remature OOM may happen on
>> > higher tiers or reclaim may still happen. I think throttling the
>> > reclaimer/demoter until kswapd makes progress could avoid both. And
>> > since the lower tiers memory typically is quite larger than the higher
>> > tiers, so the throttle should happen very rarely IMHO.
>> >
>> >>
>> >> From another point of view, I still think that we can use falling back
>> >> to reclaim as the last resort to avoid OOM in some special situations,
>> >> for example, most pages in the lowest tier node are mlock() or too hot
>> >> to be reclaimed.
>> >>
>> >> > So I'm hesitant to design cgroup controls around the current behavior.
>>
>> Best Regards,
>> Huang, Ying
  
Huang, Ying Nov. 30, 2022, 5:39 a.m. UTC | #16
Mina Almasry <almasrymina@google.com> writes:

> On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Johannes,
>>
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>> [...]
>> >
>> > The fallback to reclaim actually strikes me as wrong.
>> >
>> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >
>> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> > breaking the layering. Rather it should deflect that pressure to the
>> > lower layers to make room. This makes sure we maintain an aging
>> > pipeline that honors the memory tier hierarchy.
>>
>> Yes.  I think that we should avoid to fall back to reclaim as much as
>> possible too.  Now, when we allocate memory for demotion
>> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>
> I may be missing something but as far I can tell reclaim is disabled
> for allocations from lower tier memory:
> https://elixir.bootlin.com/linux/v6.1-rc7/source/mm/vmscan.c#L1583

#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)

We have GFP_NOWAIT set in gfp.

> I think this is maybe a good thing when doing proactive demotion. In
> this case we probably don't want to try to reclaim from lower tier
> nodes and instead fail the proactive demotion.

Do you have some real use cases for this?  If so, we can tweak the
logic.

> However I can see this being desirable when the top tier nodes are
> under real memory pressure to deflect that pressure to the lower tier
> nodes.

Yes.

Best Regards,
Huang, Ying

>> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> to reclaim on current (higher tier) node.  This may be not good enough,
>> for example, the following patch from Hasan may help via waking up
>> kswapd earlier.
>>
>> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>>
>> Do you know what is the next step plan for this patch?
>>
>> Should we do even more?
>>
>> From another point of view, I still think that we can use falling back
>> to reclaim as the last resort to avoid OOM in some special situations,
>> for example, most pages in the lowest tier node are mlock() or too hot
>> to be reclaimed.
>>
>> > So I'm hesitant to design cgroup controls around the current behavior.
>
> I sent RFC v2 patch:
> https://lore.kernel.org/linux-mm/20221130020328.1009347-1-almasrymina@google.com/T/#u
>
> Please take a look when convenient. Thanks!
>
>> >
>>
>> Best Regards,
>> Huang, Ying
>>
  
Mina Almasry Nov. 30, 2022, 6:06 a.m. UTC | #17
On Tue, Nov 29, 2022 at 9:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Johannes,
> >>
> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >> [...]
> >> >
> >> > The fallback to reclaim actually strikes me as wrong.
> >> >
> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >> >
> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> >> > breaking the layering. Rather it should deflect that pressure to the
> >> > lower layers to make room. This makes sure we maintain an aging
> >> > pipeline that honors the memory tier hierarchy.
> >>
> >> Yes.  I think that we should avoid to fall back to reclaim as much as
> >> possible too.  Now, when we allocate memory for demotion
> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> >
> > I may be missing something but as far I can tell reclaim is disabled
> > for allocations from lower tier memory:
> > https://elixir.bootlin.com/linux/v6.1-rc7/source/mm/vmscan.c#L1583
>
> #define GFP_NOWAIT      (__GFP_KSWAPD_RECLAIM)
>
> We have GFP_NOWAIT set in gfp.
>

Ah, thanks. I missed that.

> > I think this is maybe a good thing when doing proactive demotion. In
> > this case we probably don't want to try to reclaim from lower tier
> > nodes and instead fail the proactive demotion.
>
> Do you have some real use cases for this?  If so, we can tweak the
> logic.
>

Nothing real at the moment. I was thinking this may be something
desirable to tune at some point.

> > However I can see this being desirable when the top tier nodes are
> > under real memory pressure to deflect that pressure to the lower tier
> > nodes.
>
> Yes.
>
> Best Regards,
> Huang, Ying
>
> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
> >> to reclaim on current (higher tier) node.  This may be not good enough,
> >> for example, the following patch from Hasan may help via waking up
> >> kswapd earlier.
> >>
> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> >>
> >> Do you know what is the next step plan for this patch?
> >>
> >> Should we do even more?
> >>
> >> From another point of view, I still think that we can use falling back
> >> to reclaim as the last resort to avoid OOM in some special situations,
> >> for example, most pages in the lowest tier node are mlock() or too hot
> >> to be reclaimed.
> >>
> >> > So I'm hesitant to design cgroup controls around the current behavior.
> >
> > I sent RFC v2 patch:
> > https://lore.kernel.org/linux-mm/20221130020328.1009347-1-almasrymina@google.com/T/#u
> >
> > Please take a look when convenient. Thanks!
> >
> >> >
> >>
> >> Best Regards,
> >> Huang, Ying
> >>
>
  
Yang Shi Nov. 30, 2022, 6:49 p.m. UTC | #18
On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yang Shi <shy828301@gmail.com> writes:
> >>
> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Johannes,
> >> >>
> >> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >> >> [...]
> >> >> >
> >> >> > The fallback to reclaim actually strikes me as wrong.
> >> >> >
> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >> >> >
> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> >> >> > breaking the layering. Rather it should deflect that pressure to the
> >> >> > lower layers to make room. This makes sure we maintain an aging
> >> >> > pipeline that honors the memory tier hierarchy.
> >> >>
> >> >> Yes.  I think that we should avoid to fall back to reclaim as much as
> >> >> possible too.  Now, when we allocate memory for demotion
> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
> >> >> to reclaim on current (higher tier) node.  This may be not good enough,
> >> >> for example, the following patch from Hasan may help via waking up
> >> >> kswapd earlier.
> >> >
> >> > For the ideal case, I do agree with Johannes to demote the page tier
> >> > by tier rather than reclaiming them from the higher tiers. But I also
> >> > agree with your premature OOM concern.
> >> >
> >> >>
> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> >> >>
> >> >> Do you know what is the next step plan for this patch?
> >> >>
> >> >> Should we do even more?
> >> >
> >> > In my initial implementation I implemented a simple throttle logic
> >> > when the demotion is not going to succeed if the demotion target has
> >> > not enough free memory (just check the watermark) to make migration
> >> > succeed without doing any reclamation. Shall we resurrect that?
> >>
> >> Can you share the link to your throttle patch?  Or paste it here?
> >
> > I just found this on the mailing list.
> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/
>
> Per my understanding, this patch will avoid demoting if there's no free
> space on demotion target?  If so, I think that we should trigger kswapd
> reclaiming on demotion target before that.  And we can simply avoid to
> fall back to reclaim firstly, then avoid to scan as an improvement as
> that in your patch above.

Yes, it should. The rough idea looks like:

if (the demote target is contended)
    wake up kswapd
    reclaim_throttle(VMSCAN_THROTTLE_DEMOTION)
    retry demotion

The kswapd is responsible for clearing the contention flag.

>
> Best Regards,
> Huang, Ying
>
> > But it didn't have the throttling logic, I may not submit that version
> > to the mailing list since we decided to drop this and merge mine and
> > Dave's.
> >
> > Anyway it is not hard to add the throttling logic, we already have a
> > few throttling cases in vmscan, for example, "mm/vmscan: throttle
> > reclaim until some writeback completes if congested".
> >>
> >> > Waking kswapd sooner is fine to me, but it may be not enough, for
> >> > example, the kswapd may not keep up so remature OOM may happen on
> >> > higher tiers or reclaim may still happen. I think throttling the
> >> > reclaimer/demoter until kswapd makes progress could avoid both. And
> >> > since the lower tiers memory typically is quite larger than the higher
> >> > tiers, so the throttle should happen very rarely IMHO.
> >> >
> >> >>
> >> >> From another point of view, I still think that we can use falling back
> >> >> to reclaim as the last resort to avoid OOM in some special situations,
> >> >> for example, most pages in the lowest tier node are mlock() or too hot
> >> >> to be reclaimed.
> >> >>
> >> >> > So I'm hesitant to design cgroup controls around the current behavior.
> >>
> >> Best Regards,
> >> Huang, Ying
  
Huang, Ying Dec. 1, 2022, 1:51 a.m. UTC | #19
Yang Shi <shy828301@gmail.com> writes:

> On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yang Shi <shy828301@gmail.com> writes:
>>
>> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yang Shi <shy828301@gmail.com> writes:
>> >>
>> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Hi, Johannes,
>> >> >>
>> >> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >> >> [...]
>> >> >> >
>> >> >> > The fallback to reclaim actually strikes me as wrong.
>> >> >> >
>> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >> >> >
>> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> >> >> > breaking the layering. Rather it should deflect that pressure to the
>> >> >> > lower layers to make room. This makes sure we maintain an aging
>> >> >> > pipeline that honors the memory tier hierarchy.
>> >> >>
>> >> >> Yes.  I think that we should avoid to fall back to reclaim as much as
>> >> >> possible too.  Now, when we allocate memory for demotion
>> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> >> >> to reclaim on current (higher tier) node.  This may be not good enough,
>> >> >> for example, the following patch from Hasan may help via waking up
>> >> >> kswapd earlier.
>> >> >
>> >> > For the ideal case, I do agree with Johannes to demote the page tier
>> >> > by tier rather than reclaiming them from the higher tiers. But I also
>> >> > agree with your premature OOM concern.
>> >> >
>> >> >>
>> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>> >> >>
>> >> >> Do you know what is the next step plan for this patch?
>> >> >>
>> >> >> Should we do even more?
>> >> >
>> >> > In my initial implementation I implemented a simple throttle logic
>> >> > when the demotion is not going to succeed if the demotion target has
>> >> > not enough free memory (just check the watermark) to make migration
>> >> > succeed without doing any reclamation. Shall we resurrect that?
>> >>
>> >> Can you share the link to your throttle patch?  Or paste it here?
>> >
>> > I just found this on the mailing list.
>> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/
>>
>> Per my understanding, this patch will avoid demoting if there's no free
>> space on demotion target?  If so, I think that we should trigger kswapd
>> reclaiming on demotion target before that.  And we can simply avoid to
>> fall back to reclaim firstly, then avoid to scan as an improvement as
>> that in your patch above.
>
> Yes, it should. The rough idea looks like:
>
> if (the demote target is contended)
>     wake up kswapd
>     reclaim_throttle(VMSCAN_THROTTLE_DEMOTION)
>     retry demotion
>
> The kswapd is responsible for clearing the contention flag.

We may do this, at least for demotion in kswapd.  But I think that this
could be the second step optimization after we make correct choice
between demotion/reclaim.  What if the pages in demotion target is too
hot to be reclaimed first?  Should we reclaim in fast memory node to
avoid OOM?

Best Regards,
Huang, Ying

>>
>> > But it didn't have the throttling logic, I may not submit that version
>> > to the mailing list since we decided to drop this and merge mine and
>> > Dave's.
>> >
>> > Anyway it is not hard to add the throttling logic, we already have a
>> > few throttling cases in vmscan, for example, "mm/vmscan: throttle
>> > reclaim until some writeback completes if congested".
>> >>
>> >> > Waking kswapd sooner is fine to me, but it may be not enough, for
>> >> > example, the kswapd may not keep up so remature OOM may happen on
>> >> > higher tiers or reclaim may still happen. I think throttling the
>> >> > reclaimer/demoter until kswapd makes progress could avoid both. And
>> >> > since the lower tiers memory typically is quite larger than the higher
>> >> > tiers, so the throttle should happen very rarely IMHO.
>> >> >
>> >> >>
>> >> >> From another point of view, I still think that we can use falling back
>> >> >> to reclaim as the last resort to avoid OOM in some special situations,
>> >> >> for example, most pages in the lowest tier node are mlock() or too hot
>> >> >> to be reclaimed.
>> >> >>
>> >> >> > So I'm hesitant to design cgroup controls around the current behavior.
>> >>
>> >> Best Regards,
>> >> Huang, Ying
  
Mina Almasry Dec. 1, 2022, 8:40 p.m. UTC | #20
On Tue, Nov 29, 2022 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Johannes Weiner <hannes@cmpxchg.org> writes:
>
> > Hello Ying,
> >
> > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote:
> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >> > The fallback to reclaim actually strikes me as wrong.
> >> >
> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >> >
> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> >> > breaking the layering. Rather it should deflect that pressure to the
> >> > lower layers to make room. This makes sure we maintain an aging
> >> > pipeline that honors the memory tier hierarchy.
> >>
> >> Yes.  I think that we should avoid to fall back to reclaim as much as
> >> possible too.  Now, when we allocate memory for demotion
> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
> >> to reclaim on current (higher tier) node.  This may be not good enough,
> >> for example, the following patch from Hasan may help via waking up
> >> kswapd earlier.
> >>
> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> >>
> >> Do you know what is the next step plan for this patch?
> >>
> >> Should we do even more?
> >>
> >> From another point of view, I still think that we can use falling back
> >> to reclaim as the last resort to avoid OOM in some special situations,
> >> for example, most pages in the lowest tier node are mlock() or too hot
> >> to be reclaimed.
> >
> > If they're hotter than reclaim candidates on the toptier, shouldn't
> > they get promoted instead and make room that way? We may have to tweak
> > the watermark logic a bit to facilitate that (allow promotions where
> > regular allocations already fail?). But this sort of resorting would
> > be preferable to age inversions.
>
> Now it's legal to enable demotion and disable promotion.  Yes, this is
> wrong configuration in general.  But should we trigger OOM for these
> users?
>
> And now promotion only works for default NUMA policy (and MPOL_BIND to
> both promotion source and target nodes with MPOL_F_NUMA_BALANCING).  If
> we use some other NUMA policy, the pages cannot be promoted too.
>
> > The mlock scenario sounds possible. In that case, it wouldn't be an
> > aging inversion, since there is nothing colder on the CXL node.
> >
> > Maybe a bypass check should explicitly consult the demotion target
> > watermarks against its evictable pages (similar to the file_is_tiny
> > check in prepare_scan_count)?
>
> Yes.  This sounds doable.
>
> > Because in any other scenario, if there is a bug in the promo/demo
> > coordination, I think we'd rather have the OOM than deal with age
> > inversions causing intermittent performance issues that are incredibly
> > hard to track down.
>
> Previously, I thought that people will always prefer performance
> regression than OOM.  Apparently, I am wrong.
>
> Anyway, I think that we need to reduce the possibility of OOM or falling
> back to reclaim as much as possible firstly.  Do you agree?
>

I've been discussing this with a few folks here. I think FWIW general
feeling here is that demoting from top tier nodes is preferred, except
in extreme circumstances we would indeed like to run with a
performance issue rather than OOM a customer VM. I wonder if there is
another way to debug mis-tiered pages rather than trigger an oom to
debug.

One thing I think/hope we can trivially agree on is that proactive
reclaim/demotion is _not_ an extreme circumstance. I would like me or
someone from the team to follow up with a patch that disables fallback
to reclaim on proactive reclaim/demotion (sc->proactive).

> One possibility, can we fall back to reclaim only if the sc->priority is
> small enough (even 0)?
>

This makes sense to me.

> Best Regards,
> Huang, Ying
>
  
Yang Shi Dec. 1, 2022, 10:45 p.m. UTC | #21
On Wed, Nov 30, 2022 at 5:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yang Shi <shy828301@gmail.com> writes:
> >>
> >> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yang Shi <shy828301@gmail.com> writes:
> >> >>
> >> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Hi, Johannes,
> >> >> >>
> >> >> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >> >> >> [...]
> >> >> >> >
> >> >> >> > The fallback to reclaim actually strikes me as wrong.
> >> >> >> >
> >> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> >> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> >> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> >> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
> >> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >> >> >> >
> >> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> >> >> >> > breaking the layering. Rather it should deflect that pressure to the
> >> >> >> > lower layers to make room. This makes sure we maintain an aging
> >> >> >> > pipeline that honors the memory tier hierarchy.
> >> >> >>
> >> >> >> Yes.  I think that we should avoid to fall back to reclaim as much as
> >> >> >> possible too.  Now, when we allocate memory for demotion
> >> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> >> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
> >> >> >> to reclaim on current (higher tier) node.  This may be not good enough,
> >> >> >> for example, the following patch from Hasan may help via waking up
> >> >> >> kswapd earlier.
> >> >> >
> >> >> > For the ideal case, I do agree with Johannes to demote the page tier
> >> >> > by tier rather than reclaiming them from the higher tiers. But I also
> >> >> > agree with your premature OOM concern.
> >> >> >
> >> >> >>
> >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> >> >> >>
> >> >> >> Do you know what is the next step plan for this patch?
> >> >> >>
> >> >> >> Should we do even more?
> >> >> >
> >> >> > In my initial implementation I implemented a simple throttle logic
> >> >> > when the demotion is not going to succeed if the demotion target has
> >> >> > not enough free memory (just check the watermark) to make migration
> >> >> > succeed without doing any reclamation. Shall we resurrect that?
> >> >>
> >> >> Can you share the link to your throttle patch?  Or paste it here?
> >> >
> >> > I just found this on the mailing list.
> >> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/
> >>
> >> Per my understanding, this patch will avoid demoting if there's no free
> >> space on demotion target?  If so, I think that we should trigger kswapd
> >> reclaiming on demotion target before that.  And we can simply avoid to
> >> fall back to reclaim firstly, then avoid to scan as an improvement as
> >> that in your patch above.
> >
> > Yes, it should. The rough idea looks like:
> >
> > if (the demote target is contended)
> >     wake up kswapd
> >     reclaim_throttle(VMSCAN_THROTTLE_DEMOTION)
> >     retry demotion
> >
> > The kswapd is responsible for clearing the contention flag.
>
> We may do this, at least for demotion in kswapd.  But I think that this
> could be the second step optimization after we make correct choice
> between demotion/reclaim.  What if the pages in demotion target is too
> hot to be reclaimed first?  Should we reclaim in fast memory node to
> avoid OOM?

IMHO we can't avoid reclaiming from the fast nodes entirely if we
prioritize avoiding OOMs. But it should happen very very rarely with
the throttling logic or other methods. BTW did you run any test to see
how many times vmscan reclaims from fast nodes instead of demotion
with the current implementation for some typical workloads?

>
> Best Regards,
> Huang, Ying
>
> >>
> >> > But it didn't have the throttling logic, I may not submit that version
> >> > to the mailing list since we decided to drop this and merge mine and
> >> > Dave's.
> >> >
> >> > Anyway it is not hard to add the throttling logic, we already have a
> >> > few throttling cases in vmscan, for example, "mm/vmscan: throttle
> >> > reclaim until some writeback completes if congested".
> >> >>
> >> >> > Waking kswapd sooner is fine to me, but it may be not enough, for
> >> >> > example, the kswapd may not keep up so remature OOM may happen on
> >> >> > higher tiers or reclaim may still happen. I think throttling the
> >> >> > reclaimer/demoter until kswapd makes progress could avoid both. And
> >> >> > since the lower tiers memory typically is quite larger than the higher
> >> >> > tiers, so the throttle should happen very rarely IMHO.
> >> >> >
> >> >> >>
> >> >> >> From another point of view, I still think that we can use falling back
> >> >> >> to reclaim as the last resort to avoid OOM in some special situations,
> >> >> >> for example, most pages in the lowest tier node are mlock() or too hot
> >> >> >> to be reclaimed.
> >> >> >>
> >> >> >> > So I'm hesitant to design cgroup controls around the current behavior.
> >> >>
> >> >> Best Regards,
> >> >> Huang, Ying
  
Huang, Ying Dec. 2, 2022, 1:57 a.m. UTC | #22
Yang Shi <shy828301@gmail.com> writes:

> On Wed, Nov 30, 2022 at 5:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yang Shi <shy828301@gmail.com> writes:
>>
>> > On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yang Shi <shy828301@gmail.com> writes:
>> >>
>> >> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yang Shi <shy828301@gmail.com> writes:
>> >> >>
>> >> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Hi, Johannes,
>> >> >> >>
>> >> >> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >> >> >> [...]
>> >> >> >> >
>> >> >> >> > The fallback to reclaim actually strikes me as wrong.
>> >> >> >> >
>> >> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> >> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> >> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> >> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> >> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >> >> >> >
>> >> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> >> >> >> > breaking the layering. Rather it should deflect that pressure to the
>> >> >> >> > lower layers to make room. This makes sure we maintain an aging
>> >> >> >> > pipeline that honors the memory tier hierarchy.
>> >> >> >>
>> >> >> >> Yes.  I think that we should avoid to fall back to reclaim as much as
>> >> >> >> possible too.  Now, when we allocate memory for demotion
>> >> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>> >> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> >> >> >> to reclaim on current (higher tier) node.  This may be not good enough,
>> >> >> >> for example, the following patch from Hasan may help via waking up
>> >> >> >> kswapd earlier.
>> >> >> >
>> >> >> > For the ideal case, I do agree with Johannes to demote the page tier
>> >> >> > by tier rather than reclaiming them from the higher tiers. But I also
>> >> >> > agree with your premature OOM concern.
>> >> >> >
>> >> >> >>
>> >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>> >> >> >>
>> >> >> >> Do you know what is the next step plan for this patch?
>> >> >> >>
>> >> >> >> Should we do even more?
>> >> >> >
>> >> >> > In my initial implementation I implemented a simple throttle logic
>> >> >> > when the demotion is not going to succeed if the demotion target has
>> >> >> > not enough free memory (just check the watermark) to make migration
>> >> >> > succeed without doing any reclamation. Shall we resurrect that?
>> >> >>
>> >> >> Can you share the link to your throttle patch?  Or paste it here?
>> >> >
>> >> > I just found this on the mailing list.
>> >> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/
>> >>
>> >> Per my understanding, this patch will avoid demoting if there's no free
>> >> space on demotion target?  If so, I think that we should trigger kswapd
>> >> reclaiming on demotion target before that.  And we can simply avoid to
>> >> fall back to reclaim firstly, then avoid to scan as an improvement as
>> >> that in your patch above.
>> >
>> > Yes, it should. The rough idea looks like:
>> >
>> > if (the demote target is contended)
>> >     wake up kswapd
>> >     reclaim_throttle(VMSCAN_THROTTLE_DEMOTION)
>> >     retry demotion
>> >
>> > The kswapd is responsible for clearing the contention flag.
>>
>> We may do this, at least for demotion in kswapd.  But I think that this
>> could be the second step optimization after we make correct choice
>> between demotion/reclaim.  What if the pages in demotion target is too
>> hot to be reclaimed first?  Should we reclaim in fast memory node to
>> avoid OOM?
>
> IMHO we can't avoid reclaiming from the fast nodes entirely if we
> prioritize avoiding OOMs.

Yes.  I think so too.

> But it should happen very very rarely with the throttling logic or
> other methods.

Yes.  I think that this is possible.

> BTW did you run any test to see how many times vmscan reclaims from
> fast nodes instead of demotion with the current implementation for
> some typical workloads?

No.  I haven't done that.

Best Regards,
Huang, Ying

>>
>> >>
>> >> > But it didn't have the throttling logic, I may not submit that version
>> >> > to the mailing list since we decided to drop this and merge mine and
>> >> > Dave's.
>> >> >
>> >> > Anyway it is not hard to add the throttling logic, we already have a
>> >> > few throttling cases in vmscan, for example, "mm/vmscan: throttle
>> >> > reclaim until some writeback completes if congested".
>> >> >>
>> >> >> > Waking kswapd sooner is fine to me, but it may be not enough, for
>> >> >> > example, the kswapd may not keep up so remature OOM may happen on
>> >> >> > higher tiers or reclaim may still happen. I think throttling the
>> >> >> > reclaimer/demoter until kswapd makes progress could avoid both. And
>> >> >> > since the lower tiers memory typically is quite larger than the higher
>> >> >> > tiers, so the throttle should happen very rarely IMHO.
>> >> >> >
>> >> >> >>
>> >> >> >> From another point of view, I still think that we can use falling back
>> >> >> >> to reclaim as the last resort to avoid OOM in some special situations,
>> >> >> >> for example, most pages in the lowest tier node are mlock() or too hot
>> >> >> >> to be reclaimed.
>> >> >> >>
>> >> >> >> > So I'm hesitant to design cgroup controls around the current behavior.
>> >> >>
>> >> >> Best Regards,
>> >> >> Huang, Ying
  
Huang, Ying Dec. 2, 2022, 2:01 a.m. UTC | #23
Mina Almasry <almasrymina@google.com> writes:

> On Tue, Nov 29, 2022 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>>
>> > Hello Ying,
>> >
>> > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote:
>> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >> > The fallback to reclaim actually strikes me as wrong.
>> >> >
>> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
>> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
>> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
>> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
>> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
>> >> >
>> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
>> >> > breaking the layering. Rather it should deflect that pressure to the
>> >> > lower layers to make room. This makes sure we maintain an aging
>> >> > pipeline that honors the memory tier hierarchy.
>> >>
>> >> Yes.  I think that we should avoid to fall back to reclaim as much as
>> >> possible too.  Now, when we allocate memory for demotion
>> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
>> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
>> >> to reclaim on current (higher tier) node.  This may be not good enough,
>> >> for example, the following patch from Hasan may help via waking up
>> >> kswapd earlier.
>> >>
>> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
>> >>
>> >> Do you know what is the next step plan for this patch?
>> >>
>> >> Should we do even more?
>> >>
>> >> From another point of view, I still think that we can use falling back
>> >> to reclaim as the last resort to avoid OOM in some special situations,
>> >> for example, most pages in the lowest tier node are mlock() or too hot
>> >> to be reclaimed.
>> >
>> > If they're hotter than reclaim candidates on the toptier, shouldn't
>> > they get promoted instead and make room that way? We may have to tweak
>> > the watermark logic a bit to facilitate that (allow promotions where
>> > regular allocations already fail?). But this sort of resorting would
>> > be preferable to age inversions.
>>
>> Now it's legal to enable demotion and disable promotion.  Yes, this is
>> wrong configuration in general.  But should we trigger OOM for these
>> users?
>>
>> And now promotion only works for default NUMA policy (and MPOL_BIND to
>> both promotion source and target nodes with MPOL_F_NUMA_BALANCING).  If
>> we use some other NUMA policy, the pages cannot be promoted too.
>>
>> > The mlock scenario sounds possible. In that case, it wouldn't be an
>> > aging inversion, since there is nothing colder on the CXL node.
>> >
>> > Maybe a bypass check should explicitly consult the demotion target
>> > watermarks against its evictable pages (similar to the file_is_tiny
>> > check in prepare_scan_count)?
>>
>> Yes.  This sounds doable.
>>
>> > Because in any other scenario, if there is a bug in the promo/demo
>> > coordination, I think we'd rather have the OOM than deal with age
>> > inversions causing intermittent performance issues that are incredibly
>> > hard to track down.
>>
>> Previously, I thought that people will always prefer performance
>> regression than OOM.  Apparently, I am wrong.
>>
>> Anyway, I think that we need to reduce the possibility of OOM or falling
>> back to reclaim as much as possible firstly.  Do you agree?
>>
>
> I've been discussing this with a few folks here. I think FWIW general
> feeling here is that demoting from top tier nodes is preferred, except
> in extreme circumstances we would indeed like to run with a
> performance issue rather than OOM a customer VM. I wonder if there is
> another way to debug mis-tiered pages rather than trigger an oom to
> debug.
>
> One thing I think/hope we can trivially agree on is that proactive
> reclaim/demotion is _not_ an extreme circumstance. I would like me or
> someone from the team to follow up with a patch that disables fallback
> to reclaim on proactive reclaim/demotion (sc->proactive).

Yes.  This makes sense to me.

Best Regards,
Huang, Ying

>> One possibility, can we fall back to reclaim only if the sc->priority is
>> small enough (even 0)?
>>
>
> This makes sense to me.
>
>> Best Regards,
>> Huang, Ying
>>
  
Mina Almasry Dec. 2, 2022, 2:06 a.m. UTC | #24
On Thu, Dec 1, 2022 at 6:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > On Tue, Nov 29, 2022 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >>
> >> > Hello Ying,
> >> >
> >> > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote:
> >> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >> >> > The fallback to reclaim actually strikes me as wrong.
> >> >> >
> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we
> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to
> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means
> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a
> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right.
> >> >> >
> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by
> >> >> > breaking the layering. Rather it should deflect that pressure to the
> >> >> > lower layers to make room. This makes sure we maintain an aging
> >> >> > pipeline that honors the memory tier hierarchy.
> >> >>
> >> >> Yes.  I think that we should avoid to fall back to reclaim as much as
> >> >> possible too.  Now, when we allocate memory for demotion
> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used.  So, we will trigger
> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back
> >> >> to reclaim on current (higher tier) node.  This may be not good enough,
> >> >> for example, the following patch from Hasan may help via waking up
> >> >> kswapd earlier.
> >> >>
> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/
> >> >>
> >> >> Do you know what is the next step plan for this patch?
> >> >>
> >> >> Should we do even more?
> >> >>
> >> >> From another point of view, I still think that we can use falling back
> >> >> to reclaim as the last resort to avoid OOM in some special situations,
> >> >> for example, most pages in the lowest tier node are mlock() or too hot
> >> >> to be reclaimed.
> >> >
> >> > If they're hotter than reclaim candidates on the toptier, shouldn't
> >> > they get promoted instead and make room that way? We may have to tweak
> >> > the watermark logic a bit to facilitate that (allow promotions where
> >> > regular allocations already fail?). But this sort of resorting would
> >> > be preferable to age inversions.
> >>
> >> Now it's legal to enable demotion and disable promotion.  Yes, this is
> >> wrong configuration in general.  But should we trigger OOM for these
> >> users?
> >>
> >> And now promotion only works for default NUMA policy (and MPOL_BIND to
> >> both promotion source and target nodes with MPOL_F_NUMA_BALANCING).  If
> >> we use some other NUMA policy, the pages cannot be promoted too.
> >>
> >> > The mlock scenario sounds possible. In that case, it wouldn't be an
> >> > aging inversion, since there is nothing colder on the CXL node.
> >> >
> >> > Maybe a bypass check should explicitly consult the demotion target
> >> > watermarks against its evictable pages (similar to the file_is_tiny
> >> > check in prepare_scan_count)?
> >>
> >> Yes.  This sounds doable.
> >>
> >> > Because in any other scenario, if there is a bug in the promo/demo
> >> > coordination, I think we'd rather have the OOM than deal with age
> >> > inversions causing intermittent performance issues that are incredibly
> >> > hard to track down.
> >>
> >> Previously, I thought that people will always prefer performance
> >> regression than OOM.  Apparently, I am wrong.
> >>
> >> Anyway, I think that we need to reduce the possibility of OOM or falling
> >> back to reclaim as much as possible firstly.  Do you agree?
> >>
> >
> > I've been discussing this with a few folks here. I think FWIW general
> > feeling here is that demoting from top tier nodes is preferred, except
> > in extreme circumstances we would indeed like to run with a
> > performance issue rather than OOM a customer VM. I wonder if there is
> > another way to debug mis-tiered pages rather than trigger an oom to
> > debug.
> >
> > One thing I think/hope we can trivially agree on is that proactive
> > reclaim/demotion is _not_ an extreme circumstance. I would like me or
> > someone from the team to follow up with a patch that disables fallback
> > to reclaim on proactive reclaim/demotion (sc->proactive).
>
> Yes.  This makes sense to me.
>

Glad to hear it. Patch is already sent for review btw:
https://lore.kernel.org/linux-mm/20221201233317.1394958-1-almasrymina@google.com/T/

> Best Regards,
> Huang, Ying
>
> >> One possibility, can we fall back to reclaim only if the sc->priority is
> >> small enough (even 0)?
> >>
> >
> > This makes sense to me.
> >
> >> Best Regards,
> >> Huang, Ying
> >>
>
  

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index fec6647a289a..f768171c2dc2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -416,6 +416,12 @@  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,

 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
+#define MEMCG_RECLAIM_MAY_DEMOTE (1 << 3)
+#define MEMCG_RECLAIM_ONLY_DEMOTE (1 << 4)
+#define MEMCG_RECLAIM_DEFAULT \
+	(MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_MAY_DEMOTE)
+#define MEMCG_RECLAIM_NO_SWAP (MEMCG_RECLAIM_DEFAULT & ~MEMCG_RECLAIM_MAY_SWAP)
+
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f412c903ee4f..fd4ff1c865a2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2392,7 +2392,7 @@  static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_DEFAULT);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2637,7 +2637,7 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	unsigned long nr_reclaimed;
 	bool passed_oom = false;
-	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
+	unsigned int reclaim_options = MEMCG_RECLAIM_DEFAULT;
 	bool drained = false;
 	bool raised_max_event = false;
 	unsigned long pflags;
@@ -3503,7 +3503,8 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? MEMCG_RECLAIM_NO_SWAP :
+						MEMCG_RECLAIM_DEFAULT)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3614,7 +3615,7 @@  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_DEFAULT))
 			nr_retries--;
 	}

@@ -6407,7 +6408,7 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}

 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_DEFAULT);

 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6456,7 +6457,7 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,

 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_DEFAULT))
 				nr_reclaims--;
 			continue;
 		}
@@ -6593,7 +6594,8 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	if (err)
 		return err;

-	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	reclaim_options = MEMCG_RECLAIM_DEFAULT | MEMCG_RECLAIM_PROACTIVE;
+	reclaim_options &= ~MEMCG_RECLAIM_MAY_DEMOTE;
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d8751e403599..dea05ad8ece5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -132,8 +132,14 @@  struct scan_control {
 	/* The file folios on the current node are dangerously low */
 	unsigned int file_is_tiny:1;

-	/* Always discard instead of demoting to lower tier memory */
-	unsigned int no_demotion:1;
+	/*
+	 * Configure discard instead of demoting to lower tier memory:
+	 *
+	 * demotion = 0 -> no demotion
+	 * demotion = 1 -> may demote
+	 * demotion = 2 -> only demote
+	 */
+	unsigned int demotion;

 #ifdef CONFIG_LRU_GEN
 	/* help kswapd make better choices among multiple memcgs */
@@ -542,7 +548,7 @@  static bool can_demote(int nid, struct scan_control *sc)
 {
 	if (!numa_demotion_enabled)
 		return false;
-	if (sc && sc->no_demotion)
+	if (sc && !sc->demotion)
 		return false;
 	if (next_demotion_node(nid) == NUMA_NO_NODE)
 		return false;
@@ -2674,7 +2680,7 @@  static unsigned int reclaim_folio_list(struct list_head *folio_list,
 		.may_writepage = 1,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.no_demotion = 1,
+		.demotion = 0,
 	};

 	nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
@@ -6726,6 +6732,13 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	 */
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);

+	if (reclaim_options & MEMCG_RECLAIM_ONLY_DEMOTE)
+		sc.demotion = 2;
+	else if (reclaim_options & MEMCG_RECLAIM_MAY_DEMOTE)
+		sc.demotion = 1;
+	else
+		sc.demotion = 0;
+
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();