Message ID | cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp1983632wrn; Mon, 23 Jan 2023 21:45:43 -0800 (PST) X-Google-Smtp-Source: AMrXdXukpsw1nbap7GYfBS69fU4UaBW+C1LEPz9nbokJJuwVZ3WXZmDFmzZF7t3AfdV1zj9HqCq5 X-Received: by 2002:a62:3007:0:b0:587:8d47:acdd with SMTP id w7-20020a623007000000b005878d47acddmr28021317pfw.34.1674539142985; Mon, 23 Jan 2023 21:45:42 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1674539142; cv=pass; d=google.com; s=arc-20160816; b=alPaudGZ00RgiSIx0Ct4KXxGYjBHjIUtjDDFJAxnYSIQ082kP6AWCKiB09owfaIEFM T5qx5gqSI6GmV4qpZel92/M6at0PFEEZABcOqB+l/Yd8Y0O2mfCzytfM5BsTyH2oux4h hFGvnqZ07DuB/Pd/x8KHZCrq84wdI8FRXxmjSijwaKyrtLs6U/o4/V0ZMyMRriYmugu7 pBQZgulyOedojmijHKtIUUD2R1U1ZcCNynMV2fDtN86Fo5zGNETvu5jGrmlEGnip9Z1/ egWgCuM9QS4ICthhbIVXp1IG5GCI4L+bKDywmjkg3xQkFsLYYbTmp18A2q/X1tF+9XFB KiLQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:content-transfer-encoding :message-id:date:subject:cc:to:from:dkim-signature; bh=SjiFpve7UU1PFy62Cv9loVcC9ZrCHwMjfsJQ2a58EO0=; b=QsAhLSH6Ri84t1dTb9Z9XlQbaMj8JYpSQK0G1E5K4Ls7nhD9pPI/mMXNz+8eM03luD L7rXgo6KbMuajGRx9O5gTCilDKl5D8GOcXH3tq5z3yjLGK5WXn7K1SKk0rgNk/BtJPl2 TvTJ/BhSxdeeqctDT1Ej9LzEvGsCPCGNFN801+zIImkhlggywZSO2eXG7lxD96QICP0v 18BDbJU97C0xCyvukcx3Pi7K6OLeew13yZM8yI5Mp4MrU7zNee8l5F6Yk6Qj9naaaBsJ 8xa9rPT8FjbXDAXbTSuzMR0yQa6/g5MnsO6W1oep3sbeIsFGAN3MObEkQIxdIsyPaSB4 NdrA== ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@Nvidia.com header.s=selector2 header.b="TQxk4/Oh"; arc=pass (i=1 spf=pass spfdomain=nvidia.com dkim=pass dkdomain=nvidia.com dmarc=pass fromdomain=nvidia.com); spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=nvidia.com Received: from out1.vger.email 
(out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 84-20020a621857000000b0057a942a7102si1256888pfy.120.2023.01.23.21.45.31; Mon, 23 Jan 2023 21:45:42 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@Nvidia.com header.s=selector2 header.b="TQxk4/Oh"; arc=pass (i=1 spf=pass spfdomain=nvidia.com dkim=pass dkdomain=nvidia.com dmarc=pass fromdomain=nvidia.com); spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=nvidia.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232138AbjAXFn2 (ORCPT <rfc822;rust.linux@gmail.com> + 99 others); Tue, 24 Jan 2023 00:43:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38152 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229666AbjAXFn0 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 24 Jan 2023 00:43:26 -0500 Received: from NAM10-MW2-obe.outbound.protection.outlook.com (mail-mw2nam10on2070.outbound.protection.outlook.com [40.107.94.70]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 98E163346C; Mon, 23 Jan 2023 21:43:25 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=jYPC4PFWGum7hZxHsFml+MGgzA1i7RUxDVzTfLcAuSwMsI6iSU3zBCJx1w3I+t5hLLz9N71AiXSIeqsGwl0+wxkM1agF5n784GHfiZeTeDyaK3qhCy/hR/OKZvunb3m7rJ+SW/eTgrixIPLlSec1yDikB2ZMbce3NxtQUG+vQNnIUMTHlKWswMZ1LgwZAWEts3pk690q+FFC7QYDMVlMrCc3sntoKq9zQ1Lpok/rfyykKvMxPrQp/rXonaPeO24oH9gItYDMvkRRKMVsJUter0t47LN2Jtj763zIe44n81LnFRC7mrcpMwJ0TJD7Ld4qFpXsOPGokU33xQfrZh5hIA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; 
h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=SjiFpve7UU1PFy62Cv9loVcC9ZrCHwMjfsJQ2a58EO0=; b=CS6sLVQdgJi/r/VkyNgwSI4xygj6n29RG4BjkRxJeiOQ97l4JBytPWev+Cs5evlPiLpBvWWFoWMysu/8eJzWod1H9c0baa1UP4sNrOaWnTIJa8I1ktKcgGhOQJz47a03uRs46qBbFr8v614dP601XYznbsm6nEx95z9eU8NLTNiPPGnqQbB9WiCo3N/BZgYmUptG/kZM+tyegB+UOjaB56VO0yNyoubMpSiq2a4py0wZ0+eMN5d2eEJWUYu6BRJ90daVMjWPRoYFS4fMupx2ZvDDZs0NY787BkO2iZSm3nlkntP07AFRRxilfDtxwxavEWJI36+zRKuznyYzl2nlMg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=nvidia.com; dmarc=pass action=none header.from=nvidia.com; dkim=pass header.d=nvidia.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=SjiFpve7UU1PFy62Cv9loVcC9ZrCHwMjfsJQ2a58EO0=; b=TQxk4/Oh3nrA3NDEChbqEG1UgNQUTho1ie3iC0sF89WFW4bZo6Y87y4jhfEkuDGOKZQflQpe9qonRWNFq5662fMUWyvHZ+e1I3KB35PdYOsJzRdAdj6psTaAQQ2MjcsMzt3xcZRVwHVImTAM56nk0FbxFk8C8t6yy6snRJQYU73ANFRU9Cug6z3HmTSjYaxuaFG6VhvipE+6PiaPdmt1o1q6kNf8nhYQTV/FwFJJEvg5lb0thySMx0W5WO6YjxIOOmdnVG3zND7uY8+Q7S+LP89E1krIGVpIrNBlWJNN9MYmDl+MxTKmHvxAb4F8pG19WGOth696uLjqCve10U3tAg== Authentication-Results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=nvidia.com; Received: from BYAPR12MB3176.namprd12.prod.outlook.com (2603:10b6:a03:134::26) by MW3PR12MB4540.namprd12.prod.outlook.com (2603:10b6:303:52::12) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6002.33; Tue, 24 Jan 2023 05:43:24 +0000 Received: from BYAPR12MB3176.namprd12.prod.outlook.com ([fe80::465a:6564:6198:2f4e]) by BYAPR12MB3176.namprd12.prod.outlook.com ([fe80::465a:6564:6198:2f4e%4]) with mapi id 15.20.6002.033; Tue, 24 Jan 2023 05:43:24 +0000 From: Alistair Popple <apopple@nvidia.com> To: 
linux-mm@kvack.org, cgroups@vger.kernel.org Cc: linux-kernel@vger.kernel.org, jgg@nvidia.com, jhubbard@nvidia.com, tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com, mkoutny@suse.com, daniel@ffwll.ch, Alistair Popple <apopple@nvidia.com> Subject: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory Date: Tue, 24 Jan 2023 16:42:29 +1100 Message-Id: <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com> X-Mailer: git-send-email 2.39.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain X-ClientProxiedBy: SY5PR01CA0033.ausprd01.prod.outlook.com (2603:10c6:10:1f8::10) To BYAPR12MB3176.namprd12.prod.outlook.com (2603:10b6:a03:134::26) MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: BYAPR12MB3176:EE_|MW3PR12MB4540:EE_ X-MS-Office365-Filtering-Correlation-Id: 8a3af0fe-921b-4dc4-8f7c-08dafdcde86c X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: Y5BvCtPuzZtexFmS+0oLG/en4LIQMCbj5QVGEcmpf4zJy4zPf5jfuNKl5AMt8FePYlLSahzEwIOyWf2H3jQlesFl2jxcUlyOHjrl5c1+yeEcS/tlZy2ZSOwkKXjddwaBU4wqsTuX722X/eAQtpYIgL4A8KG2hbxReZ9krPZpfxyp2z1sO0pkkn7p3d9kNZD+JisDmdne/E2Du7m2tAcGSFI0zzM1bZpFbd6qTlqNM+5zhltgE1uGfFOdrnf6h6iMSNRwM5/3LOtUOR7Q+XP3QEqUChAtLZfKRqAQeeIJupKqa3S6XANkazodb/KUqUyHT3PI41W+vM2v9iHfL+PW2qpTDLeis0gF/NeUG0uUnk9CEK/e7JoC35BIBUBH5aESlHPvhyIhsecoTcB0IP6qIngUQeUQf6zyuylh+KQhfZVnVK+ev/jY/ZmDp6V57/OdR0Nlqp3VgNZiXh4yF9mHKVcjxxL9qpuxdag/0yl++J390yuoa2ePqu1/fPBIbb/aWnZTNuZsGY1+Bx5svynPnFUDBnAfoDSrUbBrn9x1NnFAxvQItU+ygz8l4n6vB2wCvaK1tysOlPnrXHTWFCuExWwarxFUakFdv4TnX/CjD39TK/cIPyAadxmxepTrNESMrtv+uSExNkQ506e1kRj+GA== X-Forefront-Antispam-Report: 
CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:BYAPR12MB3176.namprd12.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230022)(4636009)(346002)(366004)(376002)(136003)(396003)(39860400002)(451199015)(66899015)(36756003)(86362001)(5660300002)(38100700002)(2906002)(4326008)(8936002)(83380400001)(41300700001)(66476007)(478600001)(6486002)(66556008)(6506007)(6512007)(8676002)(26005)(186003)(316002)(66946007)(2616005)(107886003);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: WRS8/D0oQ3hDQU/9jO4Q2RmFCNU3B87fHyVloLkmwYJkXRTT0awgtVucNM95GIVZO3TYrML0QWTZSxtMk1E+MDZeoSsnQ1FcktafwXxtDhncXXw726rbiCsxZI6Nr8kprgqDHBPTF/I3TB+yXFWE89zGSgkEMqY9NmduDFzQ0Gd0f8kTNs8+pvLgdQ/AJNUlXTDoRKGQh9u2++ubhN9WAgo1+oNlZExLJIV5aew8TxS0fDJXdzH4Jf+8ARqorK71t9cIUUkMPD2O5eJeZ4AgMiZOwLcivp+wrM/JmOEQte20jkPBKdLJSVzvHTdxX34ZJG61jn9ZiPAtVPhZPWzHv/arr98/gkjSsauBjsCl7xxfTJn2c3fBK2NjyFt888k1ujIdAh367UlrFbY82KKp95BnrZMppLTBWH62rvx9m5Mhq7al8W/LplIQrArAguxoZY3UK8EHv2NgC5iAfRLueKj8gC/VT/CVMi0oxSI1ErVXKgCf4/odyF3fm1OBFfHgmv7/PbAxBvK6Kig/nu052myrTvIyAMbKqkh2jyaQSix4MQimzI7bypWL7rIvgh9+k94kSOz7PjCu3eAuDpvYjv8HifywoRaqLQdrCRVSafoLiqMDd7zPVqfxh6TWPWSACfexCREhrT56xDpwb6SRiFYvNWiD71D9lt3kuuC83ZLh8cu1zIBMvwo5ySUzsCXER0KSM5q3JlJaSZE5UoiY6QrD5RWluxRURzgz1C00N2VRSlAjTrNZw540NxoCCLDHo0PJFjK5tPpAuvG+f7olLNs5KuogUnv9br5Tt752gfgABz3ndiy6dWzNrvPtEQor0AswnwQ+JkYdoWS2C7e7NsrNso2qqeU9mJN3QlH8wXi74J71FQyUiKtwkMaxIgGu39rel5ppCNuWTVJRzHigGazhH8qf89P9WG2v9Dlysa8hRLqeYBhBuSXJoVOA5O4S6MsFknBJaWxaOFBUaAs1VNdLbfHSLGvne10appPqUdS4HVMLn6bikNu1CutIhtDmvgG3i2xcqROJc6zGxfVQt90smpN1XL9BPVUZZoYjpMJncaz3fyrc9hkfu8wYoIeyXBnuAPbjmd6Amj+rMzSZ9z6s3QxLYaLHoqjjMg04Ux3xu82yuTCFPTs6Pk0tsJ7JBUqOJ9akO2uwp8rUdxWGQ5Sp4vscp90HVKos3LMTyMBXcC3+ND9OojrMvx3ZJsJAAlCcTbK4iKbTxoDcularAaA62BUEWh7G5ErsrgsgXldBW1eaKjWNLdU10PnsTbetroBZkP68h85jNht1W9cyWoeqinju1h0LEsT7GXsjVPPhZL+DDlCn8AB66ceBCYlwOfgTTsHMZLyTNt2ui/dsb5hD/OvLyIvPN95F0eunqsE5+/Odi6l/ZY4SEiFfIGjpgnr+5B9Ol8CusinXosAnDES5EioJuaKXjT22oXSXyQz/LxV2OuCd0
G7YtiWdr8AuboHQVw1J2Tn2NZtwXVLNH/HsB8c42hQCBCT/RDMMuRFYIx34Aq/hX5Wp+UwVb63EVChyCLbsuO5SCYBIhdkG4b6ucRMUzMG8d9p3YnJow7sjEFMw3iQGR1V9i4ETz3w0 X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-Network-Message-Id: 8a3af0fe-921b-4dc4-8f7c-08dafdcde86c X-MS-Exchange-CrossTenant-AuthSource: BYAPR12MB3176.namprd12.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Jan 2023 05:43:23.7773 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: eb8/86cneqdDQ3j0Y8YGoLaaDH+4zDj/KeBxs2GlDKF4pRtb6dNSfME+VgD8XA7KAQVayOofJ6vMDIup4ghypA== X-MS-Exchange-Transport-CrossTenantHeadersStamped: MW3PR12MB4540 X-Spam-Status: No, score=-1.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FORGED_SPF_HELO, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_PASS,SPF_NONE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1755881556358974224?= X-GMAIL-MSGID: =?utf-8?q?1755881556358974224?= |
Series |
mm: Introduce a cgroup to limit the amount of locked and pinned memory
Message
Alistair Popple
Jan. 24, 2023, 5:42 a.m. UTC
Having large amounts of unmovable or unreclaimable memory in a system can lead to system instability due to increasing the likelihood of encountering out-of-memory conditions. Therefore it is desirable to limit the amount of memory users can lock or pin.

From userspace such limits can be enforced by setting RLIMIT_MEMLOCK. However there is no standard method that drivers and other in-kernel users can use to check and enforce this limit.

This has led to a large number of inconsistencies in how limits are enforced. For example some drivers will use mm->locked_vm while others will use mm->pinned_vm or user->locked_vm. It is therefore possible to have up to three times RLIMIT_MEMLOCK pinned.

Having pinned memory limited per-task also makes it easy for users to exceed the limit. For example, memory pinned by drivers with pin_user_pages() tends to remain pinned after fork. To deal with this and other issues this series introduces a cgroup for tracking and limiting the number of pages pinned or locked by tasks in the group.

However the existing behaviour with regard to the rlimit needs to be maintained, so the lesser of the two limits is enforced. Furthermore, CAP_IPC_LOCK usually bypasses the rlimit, but this bypass is not allowed for the cgroup.

The first part of this series converts existing drivers which open-code the use of locked_vm/pinned_vm over to a common interface which manages the refcounts of the associated task/mm/user structs. This ensures accounting of pages is consistent and makes it easier to add charging of the cgroup.

The second part of the series adds the cgroup and converts core mm code such as mlock over to charging the cgroup before finally introducing some selftests.

As I don't have access to systems with all the various devices I haven't been able to test all driver changes. Any help there would be appreciated.
Alistair Popple (19):
  mm: Introduce vm_account
  drivers/vhost: Convert to use vm_account
  drivers/vdpa: Convert vdpa to use the new vm_structure
  infiniband/umem: Convert to use vm_account
  RMDA/siw: Convert to use vm_account
  RDMA/usnic: convert to use vm_account
  vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
  vfio/spapr_tce: Convert accounting to pinned_vm
  io_uring: convert to use vm_account
  net: skb: Switch to using vm_account
  xdp: convert to use vm_account
  kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
  fpga: dfl: afu: convert to use vm_account
  mm: Introduce a cgroup for pinned memory
  mm/util: Extend vm_account to charge pages against the pin cgroup
  mm/util: Refactor account_locked_vm
  mm: Convert mmap and mlock to use account_locked_vm
  mm/mmap: Charge locked memory to pins cgroup
  selftests/vm: Add pins-cgroup selftest for mlock/mmap

 MAINTAINERS                              |   8 +-
 arch/powerpc/kvm/book3s_64_vio.c         |  10 +-
 arch/powerpc/mm/book3s64/iommu_api.c     |  29 +--
 drivers/fpga/dfl-afu-dma-region.c        |  11 +-
 drivers/fpga/dfl-afu.h                   |   1 +-
 drivers/infiniband/core/umem.c           |  16 +-
 drivers/infiniband/core/umem_odp.c       |   6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c |  13 +-
 drivers/infiniband/hw/usnic/usnic_uiom.h |   1 +-
 drivers/infiniband/sw/siw/siw.h          |   2 +-
 drivers/infiniband/sw/siw/siw_mem.c      |  20 +--
 drivers/infiniband/sw/siw/siw_verbs.c    |  15 +-
 drivers/vdpa/vdpa_user/vduse_dev.c       |  20 +--
 drivers/vfio/vfio_iommu_spapr_tce.c      |  15 +-
 drivers/vfio/vfio_iommu_type1.c          |  59 +----
 drivers/vhost/vdpa.c                     |   9 +-
 drivers/vhost/vhost.c                    |   2 +-
 drivers/vhost/vhost.h                    |   1 +-
 include/linux/cgroup.h                   |  20 ++-
 include/linux/cgroup_subsys.h            |   4 +-
 include/linux/io_uring_types.h           |   3 +-
 include/linux/kvm_host.h                 |   1 +-
 include/linux/mm.h                       |   5 +-
 include/linux/mm_types.h                 |  88 ++++++++-
 include/linux/skbuff.h                   |   6 +-
 include/net/sock.h                       |   2 +-
 include/net/xdp_sock.h                   |   2 +-
 include/rdma/ib_umem.h                   |   1 +-
 io_uring/io_uring.c                      |  20 +--
 io_uring/notif.c                         |   4 +-
 io_uring/notif.h                         |  10 +-
 io_uring/rsrc.c                          |  38 +--
 io_uring/rsrc.h                          |   9 +-
 mm/Kconfig                               |  11 +-
 mm/Makefile                              |   1 +-
 mm/internal.h                            |   2 +-
 mm/mlock.c                               |  76 +------
 mm/mmap.c                                |  76 +++----
 mm/mremap.c                              |  54 +++--
 mm/pins_cgroup.c                         | 273 ++++++++++++++++++++++++-
 mm/secretmem.c                           |   6 +-
 mm/util.c                                | 196 +++++++++++++++--
 net/core/skbuff.c                        |  47 +---
 net/rds/message.c                        |   9 +-
 net/xdp/xdp_umem.c                       |  38 +--
 tools/testing/selftests/vm/Makefile      |   1 +-
 tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
 virt/kvm/kvm_main.c                      |   3 +-
 48 files changed, 1114 insertions(+), 401 deletions(-)
 create mode 100644 mm/pins_cgroup.c
 create mode 100644 tools/testing/selftests/vm/pins-cgroup.c

base-commit: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
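[Editor's note: for readers unfamiliar with cgroup v2, usage of such a controller would presumably follow the standard pattern below. This is a hypothetical sketch only: the controller name "pins" and the interface files pins.max / pins.current are inferred from the patch titles ("pins cgroup"), not verified against the series.]

```shell
# Hypothetical usage sketch -- the "pins" controller and pins.* file
# names are assumptions based on the patch titles.
CGROUP_ROOT=/sys/fs/cgroup

# Enable the controller for child cgroups and create one.
echo "+pins" > "$CGROUP_ROOT/cgroup.subtree_control"
mkdir "$CGROUP_ROOT/pinned-workload"

# Limit tasks in the group to 16384 pinned/locked pages
# (64 MiB with 4 KiB pages).
echo 16384 > "$CGROUP_ROOT/pinned-workload/pins.max"

# Move the current shell in; its future pins and mlocks are charged here.
echo $$ > "$CGROUP_ROOT/pinned-workload/cgroup.procs"
cat "$CGROUP_ROOT/pinned-workload/pins.current"
```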
Comments
On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> Having large amounts of unmovable or unreclaimable memory in a system
> can lead to system instability due to increasing the likelihood of
> encountering out-of-memory conditions. Therefore it is desirable to
> limit the amount of memory users can lock or pin.
>
> [...]
>
> The second part of the series adds the cgroup and converts core mm
> code such as mlock over to charging the cgroup before finally
> introducing some selftests.

I didn't go through the entire series, so apologies if this was mentioned somewhere, but do you mind elaborating on why this is added as a separate cgroup controller rather than an extension of the memory cgroup controller?

> [...]
On Tue, Jan 24, 2023 at 04:42:29PM +1100, Alistair Popple wrote:
> Having large amounts of unmovable or unreclaimable memory in a system
> can lead to system instability due to increasing the likelihood of
> encountering out-of-memory conditions. Therefore it is desirable to
> limit the amount of memory users can lock or pin.
>
> [...]
>
> As I don't have access to systems with all the various devices I
> haven't been able to test all driver changes. Any help there would be
> appreciated.

I'm excited by this series, thanks for making it. The pin accounting has been a long standing problem and cgroups will really help!

Jason
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> [...]
>>
>> The second part of the series adds the cgroup and converts core mm
>> code such as mlock over to charging the cgroup before finally
>> introducing some selftests.
>
> I didn't go through the entire series, so apologies if this was
> mentioned somewhere, but do you mind elaborating on why this is added
> as a separate cgroup controller rather than an extension of the memory
> cgroup controller?

One of my early prototypes actually did add this to the memcg controller. However pinned pages fall under their own limit, and we wanted to always account pages to the cgroup of the task using the driver rather than say folio_memcg(). So adding it to memcg didn't seem to have much benefit as we didn't end up using any of the infrastructure provided by memcg. Hence I thought it was clearer to just add it as its own controller.

- Alistair

>> [...]
On Mon, Jan 30, 2023 at 5:07 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > [...] do you mind elaborating on why this is added
> > as a separate cgroup controller rather than an extension of the memory
> > cgroup controller?
>
> One of my early prototypes actually did add this to the memcg
> controller. However pinned pages fall under their own limit, and we
> wanted to always account pages to the cgroup of the task using the
> driver rather than say folio_memcg(). So adding it to memcg didn't seem
> to have much benefit as we didn't end up using any of the infrastructure
> provided by memcg. Hence I thought it was clearer to just add it as its
> own controller.

To clarify, you account and limit pinned memory based on the cgroup of the process pinning the pages, not based on the cgroup that the pages are actually charged to? Is my understanding correct? IOW, you limit the amount of memory that processes in a cgroup can pin, not the amount of memory charged to a cgroup that can be pinned?

> [...]
Yosry Ahmed <yosryahmed@google.com> writes: > On Mon, Jan 30, 2023 at 5:07 PM Alistair Popple <apopple@nvidia.com> wrote: >> >> >> Yosry Ahmed <yosryahmed@google.com> writes: >> >> > On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote: >> >> >> >> Having large amounts of unmovable or unreclaimable memory in a system >> >> can lead to system instability due to increasing the likelihood of >> >> encountering out-of-memory conditions. Therefore it is desirable to >> >> limit the amount of memory users can lock or pin. >> >> >> >> From userspace such limits can be enforced by setting >> >> RLIMIT_MEMLOCK. However there is no standard method that drivers and >> >> other in-kernel users can use to check and enforce this limit. >> >> >> >> This has lead to a large number of inconsistencies in how limits are >> >> enforced. For example some drivers will use mm->locked_mm while others >> >> will use mm->pinned_mm or user->locked_mm. It is therefore possible to >> >> have up to three times RLIMIT_MEMLOCKED pinned. >> >> >> >> Having pinned memory limited per-task also makes it easy for users to >> >> exceed the limit. For example drivers that pin memory with >> >> pin_user_pages() it tends to remain pinned after fork. To deal with >> >> this and other issues this series introduces a cgroup for tracking and >> >> limiting the number of pages pinned or locked by tasks in the group. >> >> >> >> However the existing behaviour with regards to the rlimit needs to be >> >> maintained. Therefore the lesser of the two limits is >> >> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit, >> >> but this bypass is not allowed for the cgroup. >> >> >> >> The first part of this series converts existing drivers which >> >> open-code the use of locked_mm/pinned_mm over to a common interface >> >> which manages the refcounts of the associated task/mm/user >> >> structs. 
This ensures accounting of pages is consistent and makes it >> >> easier to add charging of the cgroup. >> >> >> >> The second part of the series adds the cgroup and converts core mm >> >> code such as mlock over to charging the cgroup before finally >> >> introducing some selftests. >> > >> > >> > I didn't go through the entire series, so apologies if this was >> > mentioned somewhere, but do you mind elaborating on why this is added >> > as a separate cgroup controller rather than an extension of the memory >> > cgroup controller? >> >> One of my early prototypes actually did add this to the memcg >> controller. However pinned pages fall under their own limit, and we >> wanted to always account pages to the cgroup of the task using the >> driver rather than say folio_memcg(). So adding it to memcg didn't seem >> to have much benefit as we didn't end up using any of the infrastructure >> provided by memcg. Hence I thought it was clearer to just add it as it's >> own controller. > > To clarify, you account and limit pinned memory based on the cgroup of > the process pinning the pages, not based on the cgroup that the pages > are actually charged to? Is my understanding correct? That's correct. > IOW, you limit the amount of memory that processes in a cgroup can > pin, not the amount of memory charged to a cgroup that can be pinned? Right, that's a good clarification which I might steal and add to the cover letter. >> >> - Alistair >> >> >> >> >> >> >> As I don't have access to systems with all the various devices I >> >> haven't been able to test all driver changes. Any help there would be >> >> appreciated. 
On 24.01.23 21:12, Jason Gunthorpe wrote: > On Tue, Jan 24, 2023 at 04:42:29PM +1100, Alistair Popple wrote: >> Having large amounts of unmovable or unreclaimable memory in a system >> can lead to system instability due to increasing the likelihood of >> encountering out-of-memory conditions. Therefore it is desirable to >> limit the amount of memory users can lock or pin. >> >> From userspace such limits can be enforced by setting >> RLIMIT_MEMLOCK. However there is no standard method that drivers and >> other in-kernel users can use to check and enforce this limit. >> >> This has lead to a large number of inconsistencies in how limits are >> enforced. For example some drivers will use mm->locked_mm while others >> will use mm->pinned_mm or user->locked_mm. It is therefore possible to >> have up to three times RLIMIT_MEMLOCKED pinned. >> >> Having pinned memory limited per-task also makes it easy for users to >> exceed the limit. For example drivers that pin memory with >> pin_user_pages() it tends to remain pinned after fork. To deal with >> this and other issues this series introduces a cgroup for tracking and >> limiting the number of pages pinned or locked by tasks in the group. >> >> However the existing behaviour with regards to the rlimit needs to be >> maintained. Therefore the lesser of the two limits is >> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit, >> but this bypass is not allowed for the cgroup. >> >> The first part of this series converts existing drivers which >> open-code the use of locked_mm/pinned_mm over to a common interface >> which manages the refcounts of the associated task/mm/user >> structs. This ensures accounting of pages is consistent and makes it >> easier to add charging of the cgroup. >> >> The second part of the series adds the cgroup and converts core mm >> code such as mlock over to charging the cgroup before finally >> introducing some selftests. 
>> >> As I don't have access to systems with all the various devices I >> haven't been able to test all driver changes. Any help there would be >> appreciated. > > I'm excited by this series, thanks for making it. > > The pin accounting has been a long standing problem and cgroups will > really help! Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and pinning subpages of larger folios are handled :)
On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote: > > I'm excited by this series, thanks for making it. > > > > The pin accounting has been a long standing problem and cgroups will > > really help! > > Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and > pinning subpages of larger folios are handled :) The same as today. The pinning is done based on the result from GUP, and we charge every returned struct page. So duplicates are counted multiple times, folios are ignored. Removing duplicate charges would be costly: it would require storage to keep track of how many times individual pages have been charged to each cgroup (e.g. a per-cgroup xarray of integers indexed by PFN). It doesn't seem worth the cost, IMHO. We've made a lot of investment now with iommufd to remove the most annoying sources of duplicated pins, so it is much less of a problem in the QEMU context at least. Jason
On 31.01.23 15:03, Jason Gunthorpe wrote: > On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote: > >>> I'm excited by this series, thanks for making it. >>> >>> The pin accounting has been a long standing problem and cgroups will >>> really help! >> >> Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and >> pinning subpages of larger folios are handled :) > > The same as today. The pinning is done based on the result from GUP, > and we charge every returned struct page. > > So duplicates are counted multiple times, folios are ignored. > > Removing duplicate charges would be costly, it would require storage > to keep track of how many times individual pages have been charged to > each cgroup (eg an xarray indexed by PFN of integers in each cgroup). > > It doesn't seem worth the cost, IMHO. > > We've made alot of investment now with iommufd to remove the most > annoying sources of duplicated pins so it is much less of a problem in > the qemu context at least. Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever on a VM and requiring multiple times the VM size as memlock limit? Would it be the same now, just that we need multiple times the pin limit?
On Tue, Jan 31, 2023 at 03:06:10PM +0100, David Hildenbrand wrote: > On 31.01.23 15:03, Jason Gunthorpe wrote: > > On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote: > > > > > > I'm excited by this series, thanks for making it. > > > > > > > > The pin accounting has been a long standing problem and cgroups will > > > > really help! > > > > > > Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and > > > pinning subpages of larger folios are handled :) > > > > The same as today. The pinning is done based on the result from GUP, > > and we charge every returned struct page. > > > > So duplicates are counted multiple times, folios are ignored. > > > > Removing duplicate charges would be costly, it would require storage > > to keep track of how many times individual pages have been charged to > > each cgroup (eg an xarray indexed by PFN of integers in each cgroup). > > > > It doesn't seem worth the cost, IMHO. > > > > We've made alot of investment now with iommufd to remove the most > > annoying sources of duplicated pins so it is much less of a problem in > > the qemu context at least. > > Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever on > a VM and requiring multiple times the VM size as memlock limit? Yes, but iommufd gives us some more options to mitigate this. E.g. it makes some logical sense to point RDMA at the iommufd page table that is already pinned when trying to DMA from guest memory; in this case it could ride on the existing pin. > Would it be the same now, just that we need multiple times the pin > limit? Yes Jason
On 31.01.23 15:10, Jason Gunthorpe wrote: > On Tue, Jan 31, 2023 at 03:06:10PM +0100, David Hildenbrand wrote: >> On 31.01.23 15:03, Jason Gunthorpe wrote: >>> On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote: >>> >>>>> I'm excited by this series, thanks for making it. >>>>> >>>>> The pin accounting has been a long standing problem and cgroups will >>>>> really help! >>>> >>>> Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and >>>> pinning subpages of larger folios are handled :) >>> >>> The same as today. The pinning is done based on the result from GUP, >>> and we charge every returned struct page. >>> >>> So duplicates are counted multiple times, folios are ignored. >>> >>> Removing duplicate charges would be costly, it would require storage >>> to keep track of how many times individual pages have been charged to >>> each cgroup (eg an xarray indexed by PFN of integers in each cgroup). >>> >>> It doesn't seem worth the cost, IMHO. >>> >>> We've made alot of investment now with iommufd to remove the most >>> annoying sources of duplicated pins so it is much less of a problem in >>> the qemu context at least. >> >> Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever on >> a VM and requiring multiple times the VM size as memlock limit? > > Yes, but iommufd gives us some more options to mitigate this. > > eg it makes some of logical sense to point RDMA at the iommufd page > table that is already pinned when trying to DMA from guest memory, in > this case it could ride on the existing pin. Right, I suspect some issue is that the address space layout for the RDMA device might be completely different. But I'm no expert on IOMMUs at all :) I do understand that at least multiple VFIO containers could benefit by only pinning once (IIUC that might have been an issue?). > >> Would it be the same now, just that we need multiple times the pin >> limit? > > Yes Okay, thanks. 
It's all still a big improvement, because I also asked for TDX restrictedmem to be accounted somehow as unmovable.
On Tue, Jan 31, 2023 at 03:15:49PM +0100, David Hildenbrand wrote: > On 31.01.23 15:10, Jason Gunthorpe wrote: > > On Tue, Jan 31, 2023 at 03:06:10PM +0100, David Hildenbrand wrote: > > > On 31.01.23 15:03, Jason Gunthorpe wrote: > > > > On Tue, Jan 31, 2023 at 02:57:20PM +0100, David Hildenbrand wrote: > > > > > > > > > > I'm excited by this series, thanks for making it. > > > > > > > > > > > > The pin accounting has been a long standing problem and cgroups will > > > > > > really help! > > > > > > > > > > Indeed. I'm curious how GUP-fast, pinning the same page multiple times, and > > > > > pinning subpages of larger folios are handled :) > > > > > > > > The same as today. The pinning is done based on the result from GUP, > > > > and we charge every returned struct page. > > > > > > > > So duplicates are counted multiple times, folios are ignored. > > > > > > > > Removing duplicate charges would be costly, it would require storage > > > > to keep track of how many times individual pages have been charged to > > > > each cgroup (eg an xarray indexed by PFN of integers in each cgroup). > > > > > > > > It doesn't seem worth the cost, IMHO. > > > > > > > > We've made alot of investment now with iommufd to remove the most > > > > annoying sources of duplicated pins so it is much less of a problem in > > > > the qemu context at least. > > > > > > Wasn't there the discussion regarding using vfio+io_uring+rdma+$whatever on > > > a VM and requiring multiple times the VM size as memlock limit? > > > > Yes, but iommufd gives us some more options to mitigate this. > > > > eg it makes some of logical sense to point RDMA at the iommufd page > > table that is already pinned when trying to DMA from guest memory, in > > this case it could ride on the existing pin. > > Right, I suspect some issue is that the address space layout for the RDMA > device might be completely different. But I'm no expert on IOMMUs at all :) Oh it doesn't matter, it is all virtualized so many times.. 
> I do understand that at least multiple VFIO containers could benefit by only > pinning once (IIUC that mgiht have been an issue?). iommufd has fixed this completely. > It's all still a big improvement, because I also asked for TDX restrictedmem > to be accounted somehow as unmovable. Yeah, it is sort of reasonable to think of the CC "secret memory" as memory that is no different from memory being DMA'd to. The DMA is just some other vCPU. I still don't have a clear idea how all this CC memory is going to actually work. Eventually it has to get into iommufd as well, somehow. Jason
On Tue, Jan 31, 2023 at 3:24 AM Alistair Popple <apopple@nvidia.com> wrote: > > > Yosry Ahmed <yosryahmed@google.com> writes: > > > On Mon, Jan 30, 2023 at 5:07 PM Alistair Popple <apopple@nvidia.com> wrote: > >> > >> > >> Yosry Ahmed <yosryahmed@google.com> writes: > >> > >> > On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote: > >> >> > >> >> Having large amounts of unmovable or unreclaimable memory in a system > >> >> can lead to system instability due to increasing the likelihood of > >> >> encountering out-of-memory conditions. Therefore it is desirable to > >> >> limit the amount of memory users can lock or pin. > >> >> > >> >> From userspace such limits can be enforced by setting > >> >> RLIMIT_MEMLOCK. However there is no standard method that drivers and > >> >> other in-kernel users can use to check and enforce this limit. > >> >> > >> >> This has lead to a large number of inconsistencies in how limits are > >> >> enforced. For example some drivers will use mm->locked_mm while others > >> >> will use mm->pinned_mm or user->locked_mm. It is therefore possible to > >> >> have up to three times RLIMIT_MEMLOCKED pinned. > >> >> > >> >> Having pinned memory limited per-task also makes it easy for users to > >> >> exceed the limit. For example drivers that pin memory with > >> >> pin_user_pages() it tends to remain pinned after fork. To deal with > >> >> this and other issues this series introduces a cgroup for tracking and > >> >> limiting the number of pages pinned or locked by tasks in the group. > >> >> > >> >> However the existing behaviour with regards to the rlimit needs to be > >> >> maintained. Therefore the lesser of the two limits is > >> >> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit, > >> >> but this bypass is not allowed for the cgroup. 
> >> >> > >> >> The first part of this series converts existing drivers which > >> >> open-code the use of locked_mm/pinned_mm over to a common interface > >> >> which manages the refcounts of the associated task/mm/user > >> >> structs. This ensures accounting of pages is consistent and makes it > >> >> easier to add charging of the cgroup. > >> >> > >> >> The second part of the series adds the cgroup and converts core mm > >> >> code such as mlock over to charging the cgroup before finally > >> >> introducing some selftests. > >> > > >> > > >> > I didn't go through the entire series, so apologies if this was > >> > mentioned somewhere, but do you mind elaborating on why this is added > >> > as a separate cgroup controller rather than an extension of the memory > >> > cgroup controller? > >> > >> One of my early prototypes actually did add this to the memcg > >> controller. However pinned pages fall under their own limit, and we > >> wanted to always account pages to the cgroup of the task using the > >> driver rather than say folio_memcg(). So adding it to memcg didn't seem > >> to have much benefit as we didn't end up using any of the infrastructure > >> provided by memcg. Hence I thought it was clearer to just add it as it's > >> own controller. > > > > To clarify, you account and limit pinned memory based on the cgroup of > > the process pinning the pages, not based on the cgroup that the pages > > are actually charged to? Is my understanding correct? > > That's correct. Interesting. > > > IOW, you limit the amount of memory that processes in a cgroup can > > pin, not the amount of memory charged to a cgroup that can be pinned? > > Right, that's a good clarification which I might steal and add to the > cover letter. Feel free to :) Please also clarify this in the code/docs. Glancing through the patches I was asking myself multiple times why this is not "memory.pinned.[current/max]" or similar. 