[RFC,0/2] mm: Working Set Reporting

Message ID 20230509185419.1088297-1-yuanchu@google.com

Yuanchu Xie May 9, 2023, 6:54 p.m. UTC
Background
==========
For both clients and servers, workloads can be containerized with virtual machines, Kubernetes containers, or memcgs, but the workloads differ between the two.
Server jobs have more predictable memory footprints and are concerned with stability and performance. One technique is proactive reclaim, which reclaims memory ahead of memory pressure and makes apparent the amount of actually free memory on a machine.
Client applications are burstier and less predictable since they react to user interactions. The system needs to respond quickly to interesting events and be aware of energy usage.
An overcommitted machine can scale the containers' footprint through memory.max/high, virtio-balloon, etc.
The balloon device is a typical mechanism for sharing memory between a guest VM and host. It is particularly useful in multi-VM scenarios where memory is overcommitted and dynamic changes to VM memory size are required as workloads change on the system. The balloon device now has a number of features to assist in judiciously sharing memory resources amongst the guests and host (e.g. free page hinting, stats, free page reporting). A host controller program tasked with optimizing memory resources in a multi-VM environment must use these tools to answer two concrete questions:
    1. When is the right time to modify the balloon?
    2. How much should the balloon be changed by?
An early project to develop such an "auto-balloon" capability was done in 2013 [1]. More recently, additional VIRTIO devices have been created (virtio-mem, virtio-pmem) that offer more tools for a number of use cases, each with advantages and disadvantages (see [2] for a recent overview of this space by Red Hat). A previous proposal to extend MGLRU with working set interfaces [3] focuses on the server use cases but does not work for clients.

Proposal
==========
We propose a unified Working Set reporting structure that works for both servers and clients. It consists of per-node histograms on the host, per-memcg histograms, and a virtio-balloon driver extension.
There are two ways of consuming Working Set reports: event-driven and querying. The host controller can receive notifications from reclaim, each of which produces a report, or it can query the histogram directly.
    Patch 1 introduces the Working Set reporting mechanism and the host interfaces; details follow in the Host section.
    Patch 2 extends the virtio-balloon driver with Working Set reporting.
The initial RFC builds on MGLRU and is intended as a Proof of Concept for discussion and refinement. T.J. and I aim to support the active/inactive LRU and working set estimation from userspace. We are working on demo scripts and on getting some numbers as well. The RFC is a bit hacky and should be built with these configs:
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_WSS=y

Host
==========
On the host side, a few sysfs files are added to monitor the working set of the host.
On a CONFIG_NUMA system, they live under "/sys/devices/system/node/nodeX/wss/"; otherwise they are under "/sys/kernel/mm/wss/". They are mostly read-write tunables, except for the histogram. The files work as follows:

report_ms:
    Read-write, specifies the report threshold in milliseconds; min value 0, max value LONG_MAX. 0 disables working set reporting.
    A rate-limiting factor that prevents frequent aging from generating reports too fast. For example, with a report threshold of 500ms, if aging happens 3 times within 500ms, the first one generates a wss report and the rest are ignored.
    Example:
    $ echo 1000 > report_ms

refresh_ms:
    Read-write, specifies the refresh threshold in milliseconds; min value 0, max value LONG_MAX. 0 ensures that every histogram read produces a new report.
    A rate-limiting factor that prevents working set histogram reads from triggering aging too frequently. For example, with a refresh threshold of 10,000ms, if a wss report was generated within the past 10,000ms, reading wss/histogram does not perform aging; otherwise aging occurs and a new wss report is generated and read. Generating a report can block for as long as it takes aging to complete.
    Example:
    $ echo 10000 > refresh_ms

intervals_ms:
    Read-write, specifies the bin intervals in milliseconds; min value 1, max value LONG_MAX.
    Example:
    $ echo 1000,2000,3000,4000 > intervals_ms

histogram:
    Read-only. A per-node histogram that captures the number of pages of user memory in each working set bin, reported separately for anon and file pages. It does not track other types of memory, e.g. hugetlb or kernel memory. The report is printed in the format of:
        <interval in ms> anon=<nr_pages> file=<nr_pages>
        <...>
    Reading it may trigger aging if the refresh threshold has passed.
    On poll, it waits until kswapd performs aging on this node and notifies, subject to the rate-limiting threshold set by report_ms (see the sketch after the example below).
    Example (note that the last bin is a catch-all that comes after all the intervals_ms bins):
    $ cat histogram
    1000 anon=618 file=10
    2000 anon=0 file=0
    3000 anon=72 file=0
    4000 anon=83 file=0
    9223372036854775807 anon=1004 file=182
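
A host controller can consume these reports event-driven rather than by polling on a timer. Below is a minimal sketch of such a consumer, assuming the histogram file notifies pollers with POLLPRI (the usual sysfs_notify() behavior) when a new report is ready; the node0 path is illustrative:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            ssize_t len;
            int fd = open("/sys/devices/system/node/node0/wss/histogram",
                          O_RDONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* The initial read arms the poll; it may itself trigger
             * aging if refresh_ms has elapsed since the last report.
             */
            if (read(fd, buf, sizeof(buf) - 1) < 0) {
                    perror("read");
                    return 1;
            }
            for (;;) {
                    struct pollfd pfd = { .fd = fd, .events = POLLPRI };

                    if (poll(&pfd, 1, -1) < 0) {
                            perror("poll");
                            return 1;
                    }
                    /* New report: seek back and re-read the file. */
                    lseek(fd, 0, SEEK_SET);
                    len = read(fd, buf, sizeof(buf) - 1);
                    if (len < 0) {
                            perror("read");
                            return 1;
                    }
                    buf[len] = '\0';
                    fputs(buf, stdout);
            }
    }

Each wakeup corresponds to at most one report per report_ms window, so a tight loop like this does not hammer the kernel.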

A per-memcg interface is also included, to enable use cases where memcgs are used to manage applications on the host alongside VMs.
The files are:
    memory.wss.report_ms
    memory.wss.refresh_ms
    memory.wss.intervals_ms
    memory.wss.histogram
They support per-node configuration by requiring the node to be specified (one node at a time), e.g.
    $ echo N0=1000 > memory.wss.report_ms
    $ echo N1=3000 > memory.wss.report_ms
    $ echo N0=1000,2000,3000,4000 > memory.wss.intervals_ms
    $ cat memory.wss.intervals_ms
    N0=1000,2000,4000,9223372036854775807 N1=9223372036854775807
    $ cat memory.wss.histogram
    N0
    1000 anon=6330 file=0
    2000 anon=72 file=0
    4000 anon=0 file=0
    9223372036854775807 anon=0 file=0
    N1
    9223372036854775807 anon=0 file=0
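
For illustration, here is one way a controller might turn a histogram into a single working set size estimate, by summing the bins at or below a recency cutoff. This is a sketch that assumes the exact format shown above; wss_pages_within() is a hypothetical helper, not part of the interface:

    #include <stdio.h>
    #include <string.h>

    /* Sum anon+file pages in bins of `node` (e.g. "N0") whose
     * interval is <= cutoff_ms, i.e. pages accessed at least that
     * recently.
     */
    static long wss_pages_within(FILE *f, const char *node, long cutoff_ms)
    {
            char line[256];
            long total = 0;
            int in_node = 0;

            while (fgets(line, sizeof(line), f)) {
                    long ms, anon, file;

                    if (line[0] == 'N') {   /* node header, e.g. "N0" */
                            in_node = !strncmp(line, node, strlen(node));
                            continue;
                    }
                    if (in_node &&
                        sscanf(line, "%ld anon=%ld file=%ld",
                               &ms, &anon, &file) == 3 &&
                        ms <= cutoff_ms)
                            total += anon + file;
            }
            return total;
    }

    int main(void)
    {
            FILE *f = fopen("memory.wss.histogram", "r");

            if (!f)
                    return 1;
            printf("N0 pages accessed within 2000ms: %ld\n",
                   wss_pages_within(f, "N0", 2000));
            fclose(f);
            return 0;
    }

With the example histogram above, this reports 6402 pages for N0 (6330 + 72).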

A reaccess histogram is also implemented for memcgs.
The files are:
    memory.reaccess.intervals_ms
    memory.reaccess.histogram
The interface formats are identical to those of memory.wss.*. Writing to memory.reaccess.intervals_ms clears the histogram for the corresponding node.
The reaccess histogram is a per-node histogram of page counters. When a page is discovered to have been reaccessed during scanning, the counter for the bin the page was previously in is incremented. For server use cases, the workload memory access pattern is fairly predictable, so a proactive reclaimer can use the reaccess information to determine the right bin to reclaim (see the sketch after the example below).
    Example, where 72 reaccesses were discovered during scanning, for pages that had been idle for 1000ms-2000ms:
    $ cat memory.reaccess.histogram
    N0
    1000 anon=6330 file=0
    2000 anon=72 file=0
    4000 anon=0 file=0
    9223372036854775807 anon=0 file=0
    N1
    9223372036854775807 anon=0 file=0
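
A minimal sketch of how a proactive reclaimer might act on this, assuming cgroup v2's standard memory.reclaim interface: pick a bin whose wss histogram shows resident pages but whose reaccess histogram stays near zero, and reclaim roughly that many bytes. The cgroup path, page size, and page count below are illustrative assumptions:

    #include <stdio.h>

    /* Ask the kernel to reclaim `bytes` from a cgroup via
     * memory.reclaim (cgroup v2).
     */
    static int proactive_reclaim(const char *cgroup, long bytes)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%ld", bytes);
            return fclose(f);
    }

    int main(void)
    {
            /* Suppose memory.wss.histogram shows 1000 anon pages in
             * bins older than 2000ms while memory.reaccess.histogram
             * shows ~0 reaccesses there: those pages are cold, so
             * reclaiming them should be cheap (4KiB pages assumed).
             */
            return proactive_reclaim("/sys/fs/cgroup/workload",
                                     1000L * 4096);
    }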

virtio-balloon
==========
The Working Set reporting mechanism presented in the first patch of this series assists a controller in making such balloon adjustments. There are two components in this patch:
- The virtio-balloon driver has a new feature (VIRTIO_F_WS_REPORTING) to standardize the configuration and communication of Working Set reports to the device.
- A stand-in interface for connecting MM activities (here, only background reclaim) to a client (here, just the balloon driver) so that the driver can be notified at appropriate times when a new Working Set report is available (and would be useful to share).
By providing a "hook" into reclaim activities, we can provide a mechanism for timely updates (i.e. when the guest is under memory pressure). By providing a uniform reporting structure in both the host and all guests, a global picture of memory utilization can be reconstructed in the controller, thus helping to answer the question of how much to adjust the balloon.
The reporting mechanism can be combined with a domain-specific balloon policy in an overcommitted multi-VM scenario, providing balloon adjustments that drive the separate reclaim activities in a coordinated fashion, as sketched below.
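
As a sketch of the "how much" half of the policy, a controller could size each guest's balloon from its reported working set plus headroom. The function below is a toy policy; the 10% headroom and page counts are assumptions, and the real report layout and transport are defined by the virtio-balloon patch:

    #include <stdio.h>

    /* Inflate the balloon over whatever the guest does not need for
     * its working set plus 10% headroom.
     */
    static long balloon_target_pages(long guest_total_pages,
                                     long wss_pages)
    {
            long needed = wss_pages + wss_pages / 10;
            long target = guest_total_pages - needed;

            return target > 0 ? target : 0;
    }

    int main(void)
    {
            /* e.g. a 1 GiB guest (262144 4KiB pages) reporting a
             * 600 MiB working set (153600 pages)
             */
            printf("inflate balloon to %ld pages\n",
                   balloon_target_pages(262144, 153600));
            return 0;
    }
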
TODO:
 - Specify a proper interface for clients to register for Working Set reports, using the shrinker interface as a guide.

References:
[1] https://www.linux-kvm.org/page/Projects/auto-ballooning
[2] https://kvmforum2020.sched.com/event/eE4U/virtio-balloonpmemmem-managing-guest-memory-david-hildenbrand-michael-s-tsirkin-red-hat
[3] https://lore.kernel.org/linux-mm/20221214225123.2770216-1-yuanchu@google.com/

talumbau (2):
  mm: multigen-LRU: working set reporting
  virtio-balloon: Add Working Set reporting

 drivers/base/node.c                 |   2 +
 drivers/virtio/virtio_balloon.c     | 243 +++++++++++-
 include/linux/balloon_compaction.h  |   6 +
 include/linux/memcontrol.h          |   6 +
 include/linux/mmzone.h              |  14 +-
 include/linux/wss.h                 |  57 +++
 include/uapi/linux/virtio_balloon.h |  21 +
 mm/Kconfig                          |   7 +
 mm/Makefile                         |   1 +
 mm/memcontrol.c                     | 349 ++++++++++++++++-
 mm/mmzone.c                         |   2 +
 mm/vmscan.c                         | 581 +++++++++++++++++++++++++++-
 mm/wss.c                            |  56 +++
 13 files changed, 1341 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/wss.h
 create mode 100644 mm/wss.c