Date: Wed, 10 May 2023 02:54:17 +0800
Message-ID: <20230509185419.1088297-1-yuanchu@google.com>
Subject: [RFC PATCH 0/2] mm: Working Set Reporting
From: Yuanchu Xie
To: David Hildenbrand, "Sudarshan Rajagopalan (QUIC)",
	kai.huang@intel.com, hch@lst.de, jon@nutanix.com
Cc: SeongJae Park, Shakeel Butt, Aneesh Kumar K V, Greg Kroah-Hartman,
	"Rafael J. Wysocki", "Michael S. Tsirkin", Jason Wang,
	Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Yu Zhao, "Matthew Wilcox (Oracle)", Yosry Ahmed,
	Vasily Averin, talumbau, Yuanchu Xie,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org

Background
==========
For both clients and servers, workloads can be containerized with
virtual machines, Kubernetes containers, or memcgs. The workloads differ
between servers and clients.

Server jobs have more predictable memory footprints, and are concerned
about stability and performance. One technique is proactive reclaim,
which reclaims memory ahead of memory pressure, and makes apparent the
amount of actually free memory on a machine.

Client applications are more bursty and unpredictable since they react
to user interactions. The system needs to respond quickly to interesting
events, and be aware of energy usage.

An overcommitted machine can scale the containers' footprint through
memory.max/high, virtio-balloon, etc.

The balloon device is a typical mechanism for sharing memory between a
guest VM and host. It is particularly useful in multi-VM scenarios where
memory is overcommitted and dynamic changes to VM memory size are
required as workloads change on the system.
The balloon device now has a number of features to assist in judiciously
sharing memory resources amongst the guests and host (e.g. free page
hinting, stats, free page reporting). A host controller program tasked
with optimizing memory resources in a multi-VM environment must use
these tools to answer two concrete questions:

	1. When is the right time to modify the balloon?
	2. How much should the balloon be changed by?

An early project to develop such an "auto-balloon" capability was done
in 2013 [1]. More recently, additional VIRTIO devices have been created
(virtio-mem, virtio-pmem) that offer more tools for a number of use
cases, each with advantages and disadvantages (see [2] for a recent
overview by Red Hat of this space).

A previous proposal to extend MGLRU with working set interfaces [3]
focuses on the server use cases but does not work for clients.

Proposal
==========
We propose a unified Working Set reporting structure that works for both
servers and clients. It involves per-node histograms on the host,
per-memcg histograms, and a virtio-balloon driver extension.

There are two ways of working with Working Set reporting: event-driven
and querying. The host controller can receive notifications from
reclaim, which produces a report, or the controller can query for the
histogram directly.

Patch 1 introduces the Working Set reporting mechanism and the host
interfaces. See the Host section below for details.
Patch 2 extends the virtio-balloon driver with Working Set reporting.

The initial RFC builds on MGLRU and is intended to be a Proof of Concept
for discussion and refinements. T.J. and I aim to support the
active/inactive LRU and working set estimation from userspace. We are
working on demo scripts and getting some numbers as well.
The RFC is a bit hacky and should be built with these configs:
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_WSS=y

Host
==========
On the host side, a few sysfs files are added to monitor the working set
of the host. On a CONFIG_NUMA system, they live under
"/sys/devices/system/node/nodeX/wss/", otherwise they are under
"/sys/kernel/mm/wss/". They are mostly read/write tunables except for
the histogram. The files work as follows:

report_ms:
	Read-write, specifies report threshold in milliseconds, min value
	0, max value LONG_MAX. 0 disables working set reporting.
	A rate-limiting factor that prevents frequent aging from
	generating reports too fast. For example, with a report threshold
	of 500ms, if aging happens 3 times within 500ms, the first one
	generates a wss report and the rest are ignored.
	Example:
	$ echo 1000 > report_ms

refresh_ms:
	Read-write, specifies refresh threshold in milliseconds, min value
	0, max value LONG_MAX. 0 ensures that every histogram read
	produces a new report.
	A rate-limiting factor that prevents working set histogram reads
	from triggering aging too frequently. For example, with a refresh
	threshold of 10,000ms, if a wss report was generated within the
	past 10,000ms, reading wss/histogram does not perform aging;
	otherwise, aging occurs and a new wss report is generated and
	read. Generating a report can block for the period of time that it
	takes to complete aging.
	Example:
	$ echo 10000 > refresh_ms

intervals_ms:
	Read-write, specifies bin intervals in milliseconds, min value 1,
	max value LONG_MAX.
	Example:
	$ echo 1000,2000,3000,4000 > intervals_ms

histogram:
	Read-only, prints wss report for this node in the format of:
	<interval in ms> anon=<bytes> file=<bytes>
	<...>
	Reading it may trigger aging if the refresh threshold has passed.
	On poll, it waits until kswapd performs aging on this node, and
	notifies subject to the rate-limiting threshold set by report_ms.
	A per-node histogram that captures the number of bytes of user
	memory in each working set bin. It reports the anon and file pages
	separately for each bin. It does not track other types of memory,
	e.g. hugetlb or kernel memory.
	Example, note that the last bin is a catch-all bin that comes
	after all the intervals_ms bins:
	$ cat histogram
	1000 anon=618 file=10
	2000 anon=0 file=0
	3000 anon=72 file=0
	4000 anon=83 file=0
	9223372036854775807 anon=1004 file=182

A per-memcg interface is also included, to enable the use cases where
one may use memcgs to manage applications on the host, along with VMs.
The files are:
memory.wss.report_ms
memory.wss.refresh_ms
memory.wss.intervals_ms
memory.wss.histogram

They support per-node configurations by requiring the node to be
specified (one node at a time), e.g.
$ echo N0=1000 > memory.wss.report_ms
$ echo N1=3000 > memory.wss.report_ms
$ echo N0=1000,2000,4000 > memory.wss.intervals_ms
$ cat memory.wss.intervals_ms
N0=1000,2000,4000,9223372036854775807 N1=9223372036854775807
$ cat memory.wss.histogram
N0
1000 anon=6330 file=0
2000 anon=72 file=0
4000 anon=0 file=0
9223372036854775807 anon=0 file=0
N1
9223372036854775807 anon=0 file=0

A reaccess histogram is also implemented for memcgs. The files are:
memory.reaccess.intervals_ms
memory.reaccess.histogram

The interface formats are identical to those of the memory.wss.* files.
Writing to memory.reaccess.intervals_ms clears the histogram for the
corresponding node. The reaccess histogram is a per-node histogram of
page counters. When a page is discovered to be reaccessed during
scanning, the counter for the bin the page was previously in is
incremented. For server use cases, the workload memory access pattern is
fairly predictable. A proactive reclaimer can use the reaccess
information to determine the right bin to reclaim.
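
As a rough illustration of how a userspace reclaimer might act on this,
the sketch below parses text in the per-node memcg histogram format
shown above and picks the youngest bin such that it and every older bin
saw at most a given number of reaccesses. The function names, the
threshold, and the selection policy itself are assumptions made for
illustration; none of them are part of this series.

```python
import re

# One token is either a node header ("N0") or a bin
# ("<interval_ms> anon=<n> file=<n>"). Tokens may be separated by
# spaces or newlines; anything that does not match is skipped.
TOKEN_RE = re.compile(
    r"N(?P<node>\d+)\b"
    r"|(?P<ms>\d+)\s+anon=(?P<anon>\d+)\s+file=(?P<file>\d+)"
)


def parse_histogram(text):
    """Return {node_id: [(interval_ms, anon, file), ...]} in file order."""
    nodes = {}
    current = None
    for m in TOKEN_RE.finditer(text):
        if m.group("node") is not None:
            current = nodes.setdefault(int(m.group("node")), [])
        elif current is not None:  # ignore bins seen before any node header
            current.append(
                (int(m.group("ms")), int(m.group("anon")), int(m.group("file")))
            )
    return nodes


def youngest_quiet_bin(bins, max_reaccess):
    """Hypothetical policy: walk from the oldest bin towards the youngest
    and stop at the first bin whose reaccess count exceeds max_reaccess.
    Returns the interval bound of the youngest bin that passed, or None."""
    chosen = None
    for interval_ms, anon, file in reversed(bins):
        if anon + file > max_reaccess:
            break
        chosen = interval_ms
    return chosen


# Sample text copied from the memory.wss.histogram example above.
SAMPLE = """\
N0
1000 anon=6330 file=0
2000 anon=72 file=0
4000 anon=0 file=0
9223372036854775807 anon=0 file=0
N1
9223372036854775807 anon=0 file=0
"""

if __name__ == "__main__":
    for node, bins in parse_histogram(SAMPLE).items():
        print(node, youngest_quiet_bin(bins, max_reaccess=100))
```

With the sample text and a threshold of 100, the sketch selects the 2000
bin on N0: the 1000 bin is the first one, walking from old to young,
whose count exceeds the threshold. A real reclaimer would likely weigh
reaccess counts against bin sizes rather than use a fixed cutoff.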
Example, where 72 instances of reaccess were discovered during scanning
for pages idle for 1000ms-2000ms:
$ cat memory.reaccess.histogram
N0
1000 anon=6330 file=0
2000 anon=72 file=0
4000 anon=0 file=0
9223372036854775807 anon=0 file=0
N1
9223372036854775807 anon=0 file=0

virtio-balloon
==========
The Working Set reporting mechanism presented in the first patch of this
series assists a controller in making the balloon adjustments described
in the Background section. There are two components in this patch:

- The virtio-balloon driver has a new feature (VIRTIO_F_WS_REPORTING)
  to standardize the configuration and communication of Working Set
  reports to the device.
- A stand-in interface for connecting MM activities (here, only
  background reclaim) to a client (here, just the balloon driver) so
  that the driver can be notified at appropriate times when a new
  Working Set report is available (and would be useful to share).

By providing a "hook" into reclaim activities, we can provide a
mechanism for timely updates (i.e. when the guest is under memory
pressure). By providing a uniform reporting structure in both the host
and all guests, a global picture of memory utilization can be
reconstructed in the controller, thus helping to answer the question of
how much to adjust the balloon.

The reporting mechanism can be combined with a domain-specific balloon
policy in an overcommitted multi-VM scenario, providing balloon
adjustments to drive the separate reclaim activities in a coordinated
fashion.

TODO:
- Specify a proper interface for clients to register for Working Set
  reports, using the shrinker interface as a guide.
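
To make the "how much" question more concrete, here is a minimal sketch
of a controller-side calculation. It assumes each report (host or guest)
has already been parsed into (interval_ms, anon_bytes, file_bytes) bins,
and that a bin labelled B covers memory idle for longer than the
previous bound and at most B, as the examples in this letter suggest.
The names, the idle threshold, and the idea of treating memory idle
beyond that threshold as a balloon candidate are hypothetical; this is
not the policy of any existing controller.

```python
from typing import Dict, List, Tuple

# (interval upper bound in ms, anon bytes, file bytes)
Bin = Tuple[int, int, int]


def idle_bytes(bins: List[Bin], idle_ms: int) -> int:
    """Sum anon+file bytes over bins that lie entirely beyond idle_ms.

    A bin counts only when its lower bound (the previous bin's upper
    bound, or 0 for the first bin) is at least idle_ms, so the estimate
    is conservative for bins that straddle the threshold.
    """
    total = 0
    lower = 0
    for upper, anon, file in bins:
        if lower >= idle_ms:
            total += anon + file
        lower = upper
    return total


def balloon_candidates(reports: Dict[str, List[Bin]], idle_ms: int) -> Dict[str, int]:
    """Per-reporter estimate of memory idle for longer than idle_ms."""
    return {name: idle_bytes(bins, idle_ms) for name, bins in reports.items()}


# Bins copied from the host histogram example in the Host section.
HOST_BINS: List[Bin] = [
    (1000, 618, 10),
    (2000, 0, 0),
    (3000, 72, 0),
    (4000, 83, 0),
    (9223372036854775807, 1004, 182),
]

if __name__ == "__main__":
    print(balloon_candidates({"host": HOST_BINS}, idle_ms=3000))
```

Because every reporter uses the same bin layout, the same function can
be applied to the host and to each guest, and the per-reporter numbers
can then be compared or summed by whatever policy the controller
implements.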
References:
[1] https://www.linux-kvm.org/page/Projects/auto-ballooning
[2] https://kvmforum2020.sched.com/event/eE4U/virtio-balloonpmemmem-managing-guest-memory-david-hildenbrand-michael-s-tsirkin-red-hat
[3] https://lore.kernel.org/linux-mm/20221214225123.2770216-1-yuanchu@google.com/

talumbau (2):
  mm: multigen-LRU: working set reporting
  virtio-balloon: Add Working Set reporting

 drivers/base/node.c                 |   2 +
 drivers/virtio/virtio_balloon.c     | 243 +++++++++++-
 include/linux/balloon_compaction.h  |   6 +
 include/linux/memcontrol.h          |   6 +
 include/linux/mmzone.h              |  14 +-
 include/linux/wss.h                 |  57 +++
 include/uapi/linux/virtio_balloon.h |  21 +
 mm/Kconfig                          |   7 +
 mm/Makefile                         |   1 +
 mm/memcontrol.c                     | 349 ++++++++++++++++-
 mm/mmzone.c                         |   2 +
 mm/vmscan.c                         | 581 +++++++++++++++++++++++++++-
 mm/wss.c                            |  56 +++
 13 files changed, 1341 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/wss.h
 create mode 100644 mm/wss.c