| Message ID | 20230612203953.2093911-15-Liam.Howlett@oracle.com |
|---|---|
| State | New |
| Headers | From: "Liam R. Howlett" <Liam.Howlett@oracle.com>; To: Andrew Morton <akpm@linux-foundation.org>; Cc: maple-tree@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Peng Zhang <zhangpeng.00@bytedance.com>, "Liam R. Howlett" <Liam.Howlett@oracle.com>; Subject: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations; Date: Mon, 12 Jun 2023 16:39:51 -0400; Message-Id: <20230612203953.2093911-15-Liam.Howlett@oracle.com>; In-Reply-To: <20230612203953.2093911-1-Liam.Howlett@oracle.com> |
| Series | Reduce preallocations for maple tree |
Commit Message
Liam R. Howlett
June 12, 2023, 8:39 p.m. UTC
Calculate the number of nodes based on the pending write action instead
of assuming the worst case.
This addresses a performance regression introduced in platforms that
have longer allocation timing.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 47 insertions(+), 1 deletion(-)
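For reference, below is a minimal sketch of the pattern this patch sizes more precisely: preallocate nodes for a pending write, then commit it with mas_store_prealloc() while the same lock is held so the tree cannot change in between. The tree, lock and helper names are illustrative, and the three-argument mas_preallocate() signature is the one proposed by this series, not the current mainline one.

```c
#include <linux/maple_tree.h>
#include <linux/mutex.h>

/* Illustrative tree protected by an external, sleepable lock. */
static DEFINE_MUTEX(example_lock);
static struct maple_tree example_mt =
	MTREE_INIT_EXT(example_mt, MT_FLAGS_LOCK_EXTERN, example_lock);

static int example_store(unsigned long first, unsigned long last, void *entry)
{
	MA_STATE(mas, &example_mt, first, last);
	int ret;

	mutex_lock(&example_lock);
	/* Sizes the node request for this specific write (v2 signature). */
	ret = mas_preallocate(&mas, entry, GFP_KERNEL);
	if (!ret) {
		/* The tree must not change between these two calls. */
		mas_store_prealloc(&mas, entry);
	}
	mutex_unlock(&example_lock);

	return ret;
}
```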
Comments
On 6/12/23 22:39, Liam R. Howlett wrote: > Calculate the number of nodes based on the pending write action instead > of assuming the worst case. Liam already gave me a heads-up on this patch, which I already replied to [1]. However, I think it might make sense to also reply to this patch directly. For a mas_preallocate() calculating the actual required nodes to be allocated instead of assuming the worst to work, it is required to ensure that the tree does not change between calling mas_preallocate() and mas_store_prealloc() if my understanding is correct. In DRM however, more specifically the DRM GPUVA Manager [2], we do have the case that we are not able to ensure this: Jobs to create GPU mappings can be submitted by userspace, are queued up by the kernel and are processed asynchronously in dma-fence signalling critical paths, e.g. by using the drm_gpu_scheduler. Hence, we must be able to allocate the worst case amount of node, since at the time a job is submitted we can't predict the state the maple tree keeping track of mappings has once a mapping is inserted in the (asynchronous) dma-fence signalling critical path. A more detailed explanation can be found in [1]. Could we keep a separate function for allocating the worst case amount of nodes additionally to this optimization? E.g. something like mas_preallocate_worst_case() or mas_preallocate_unlocked() (since I guess the new one requires the maple tree to be kept locked in order not to change)? [1] https://lore.kernel.org/nouveau/68cd25de-e767-725e-2e7b-703217230bb0@redhat.com/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5 [2] https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653 - Danilo > > This addresses a performance regression introduced in platforms that > have longer allocation timing. > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> > --- > lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 47 insertions(+), 1 deletion(-) > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 048d6413a114..7ac5b5457603 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc); > */ > int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) > { > + MA_WR_STATE(wr_mas, mas, entry); > + unsigned char node_size; > + int request = 1; > int ret; > > - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp); > + > + if (unlikely(!mas->index && mas->last == ULONG_MAX)) > + goto ask_now; > + > + mas_wr_store_setup(&wr_mas); > + wr_mas.content = mas_start(mas); > + /* Root expand */ > + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) > + goto ask_now; > + > + if (unlikely(!mas_wr_walk(&wr_mas))) { > + /* Spanning store, use worst case for now */ > + request = 1 + mas_mt_height(mas) * 3; > + goto ask_now; > + } > + > + /* At this point, we are at the leaf node that needs to be altered. */ > + /* Exact fit, no nodes needed. */ > + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last) > + return 0; > + > + mas_wr_end_piv(&wr_mas); > + node_size = mas_wr_new_end(&wr_mas); > + /* Slot store can avoid using any nodes */ > + if (node_size == wr_mas.node_end && wr_mas.offset_end - mas->offset == 1) > + return 0; > + > + if (node_size >= mt_slots[wr_mas.type]) { > + /* Split, worst case for now. 
*/ > + request = 1 + mas_mt_height(mas) * 2; > + goto ask_now; > + } > + > + /* Appending does not need any nodes */ > + if (node_size == wr_mas.node_end + 1 && mas->offset == wr_mas.node_end) > + return 0; > + > + /* Potential spanning rebalance collapsing a node, use worst-case */ > + if (node_size - 1 <= mt_min_slots[wr_mas.type]) > + request = mas_mt_height(mas) * 2 - 1; > + > + /* node store needs one node */ > +ask_now: > + mas_node_count_gfp(mas, request, gfp); > mas->mas_flags |= MA_STATE_PREALLOC; > if (likely(!mas_is_err(mas))) > return 0;
在 2023/6/23 00:41, Danilo Krummrich 写道: > On 6/12/23 22:39, Liam R. Howlett wrote: >> Calculate the number of nodes based on the pending write action instead >> of assuming the worst case. > > Liam already gave me a heads-up on this patch, which I already replied > to [1]. > > However, I think it might make sense to also reply to this patch directly. > > For a mas_preallocate() calculating the actual required nodes to be > allocated instead of assuming the worst to work, it is required to > ensure that the tree does not change between calling mas_preallocate() > and mas_store_prealloc() if my understanding is correct. > > In DRM however, more specifically the DRM GPUVA Manager [2], we do have > the case that we are not able to ensure this: > > Jobs to create GPU mappings can be submitted by userspace, are queued up > by the kernel and are processed asynchronously in dma-fence signalling > critical paths, e.g. by using the drm_gpu_scheduler. Hence, we must be > able to allocate the worst case amount of node, since at the time a job > is submitted we can't predict the state the maple tree keeping track of > mappings has once a mapping is inserted in the (asynchronous) dma-fence > signalling critical path. > > A more detailed explanation can be found in [1]. > > Could we keep a separate function for allocating the worst case amount > of nodes additionally to this optimization? E.g. something like > mas_preallocate_worst_case() or mas_preallocate_unlocked() (since I > guess the new one requires the maple tree to be kept locked in order not > to change)? Hi Danilo, Your understanding seems incorrect. Even with previously unoptimized mas_preallocate(), the maple tree cannot be modified between calls to mas_preallocate() and mas_store_prealloc(). The calculation of the number of pre-allocated nodes depends on the structure of the maple tree. In the unoptimized mas_preallocate(), it depends on the height of the tree. If the maple tree is modified before mas_store_prealloc() and the height of the tree changes, the number of pre-allocated nodes is inaccurate. Regards, Peng > > [1] > https://lore.kernel.org/nouveau/68cd25de-e767-725e-2e7b-703217230bb0@redhat.com/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5 > > [2] > https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653 > > - Danilo > >> >> This addresses a performance regression introduced in platforms that >> have longer allocation timing. >> >> Signed-off-by: Liam R. 
Howlett <Liam.Howlett@oracle.com> >> --- >> lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 47 insertions(+), 1 deletion(-) >> >> diff --git a/lib/maple_tree.c b/lib/maple_tree.c >> index 048d6413a114..7ac5b5457603 100644 >> --- a/lib/maple_tree.c >> +++ b/lib/maple_tree.c >> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc); >> */ >> int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >> { >> + MA_WR_STATE(wr_mas, mas, entry); >> + unsigned char node_size; >> + int request = 1; >> int ret; >> - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp); >> + >> + if (unlikely(!mas->index && mas->last == ULONG_MAX)) >> + goto ask_now; >> + >> + mas_wr_store_setup(&wr_mas); >> + wr_mas.content = mas_start(mas); >> + /* Root expand */ >> + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) >> + goto ask_now; >> + >> + if (unlikely(!mas_wr_walk(&wr_mas))) { >> + /* Spanning store, use worst case for now */ >> + request = 1 + mas_mt_height(mas) * 3; >> + goto ask_now; >> + } >> + >> + /* At this point, we are at the leaf node that needs to be >> altered. */ >> + /* Exact fit, no nodes needed. */ >> + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last) >> + return 0; >> + >> + mas_wr_end_piv(&wr_mas); >> + node_size = mas_wr_new_end(&wr_mas); >> + /* Slot store can avoid using any nodes */ >> + if (node_size == wr_mas.node_end && wr_mas.offset_end - >> mas->offset == 1) >> + return 0; >> + >> + if (node_size >= mt_slots[wr_mas.type]) { >> + /* Split, worst case for now. */ >> + request = 1 + mas_mt_height(mas) * 2; >> + goto ask_now; >> + } >> + >> + /* Appending does not need any nodes */ >> + if (node_size == wr_mas.node_end + 1 && mas->offset == >> wr_mas.node_end) >> + return 0; >> + >> + /* Potential spanning rebalance collapsing a node, use worst-case */ >> + if (node_size - 1 <= mt_min_slots[wr_mas.type]) >> + request = mas_mt_height(mas) * 2 - 1; >> + >> + /* node store needs one node */ >> +ask_now: >> + mas_node_count_gfp(mas, request, gfp); >> mas->mas_flags |= MA_STATE_PREALLOC; >> if (likely(!mas_is_err(mas))) >> return 0; > >
Hi Peng, On 6/25/23 05:28, Peng Zhang wrote: > > > 在 2023/6/23 00:41, Danilo Krummrich 写道: >> On 6/12/23 22:39, Liam R. Howlett wrote: >>> Calculate the number of nodes based on the pending write action instead >>> of assuming the worst case. >> >> Liam already gave me a heads-up on this patch, which I already replied >> to [1]. >> >> However, I think it might make sense to also reply to this patch >> directly. >> >> For a mas_preallocate() calculating the actual required nodes to be >> allocated instead of assuming the worst to work, it is required to >> ensure that the tree does not change between calling mas_preallocate() >> and mas_store_prealloc() if my understanding is correct. >> >> In DRM however, more specifically the DRM GPUVA Manager [2], we do >> have the case that we are not able to ensure this: >> >> Jobs to create GPU mappings can be submitted by userspace, are queued >> up by the kernel and are processed asynchronously in dma-fence >> signalling critical paths, e.g. by using the drm_gpu_scheduler. Hence, >> we must be able to allocate the worst case amount of node, since at >> the time a job is submitted we can't predict the state the maple tree >> keeping track of mappings has once a mapping is inserted in the >> (asynchronous) dma-fence signalling critical path. >> >> A more detailed explanation can be found in [1]. >> >> Could we keep a separate function for allocating the worst case amount >> of nodes additionally to this optimization? E.g. something like >> mas_preallocate_worst_case() or mas_preallocate_unlocked() (since I >> guess the new one requires the maple tree to be kept locked in order >> not to change)? > Hi Danilo, > > Your understanding seems incorrect. Even with previously unoptimized > mas_preallocate(), the maple tree cannot be modified between calls to > mas_preallocate() and mas_store_prealloc(). The calculation of the > number of pre-allocated nodes depends on the structure of the maple > tree. In the unoptimized mas_preallocate(), it depends on the height of > the tree. If the maple tree is modified before mas_store_prealloc() and > the height of the tree changes, the number of pre-allocated nodes is > inaccurate. Thanks for pointing this out! First of all, it's probably fair to say "naive me", it totally makes sense the tree height is needed - it's a b-tree. On the other hand, unless I miss something (and if so, please let me know), something is bogus with the API then. While the documentation of the Advanced API of the maple tree explicitly claims that the user of the API is responsible for locking, this should be limited to the bounds set by the maple tree implementation. Which means, the user must decide for either the internal (spin-) lock or an external lock (which possibly goes away in the future) and acquire and release it according to the rules maple tree enforces through lockdep checks. Let's say one picks the internal lock. How is one supposed to ensure the tree isn't modified using the internal lock with mas_preallocate()? Besides that, I think the documentation should definitely mention this limitation and give some guidance for the locking. Currently, from an API perspective, I can't see how anyone not familiar with the implementation details would be able to recognize this limitation. In terms of the GPUVA manager, unfortunately, it seems like I need to drop the maple tree and go back to using a rb-tree, since it seems there is no sane way doing a worst-case pre-allocation that does not suffer from this limitation. 
- Danilo > > Regards, > Peng > >> >> [1] >> https://lore.kernel.org/nouveau/68cd25de-e767-725e-2e7b-703217230bb0@redhat.com/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5 >> >> [2] >> https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653 >> >> - Danilo >> >>> >>> This addresses a performance regression introduced in platforms that >>> have longer allocation timing. >>> >>> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> >>> --- >>> lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++- >>> 1 file changed, 47 insertions(+), 1 deletion(-) >>> >>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c >>> index 048d6413a114..7ac5b5457603 100644 >>> --- a/lib/maple_tree.c >>> +++ b/lib/maple_tree.c >>> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc); >>> */ >>> int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >>> { >>> + MA_WR_STATE(wr_mas, mas, entry); >>> + unsigned char node_size; >>> + int request = 1; >>> int ret; >>> - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp); >>> + >>> + if (unlikely(!mas->index && mas->last == ULONG_MAX)) >>> + goto ask_now; >>> + >>> + mas_wr_store_setup(&wr_mas); >>> + wr_mas.content = mas_start(mas); >>> + /* Root expand */ >>> + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) >>> + goto ask_now; >>> + >>> + if (unlikely(!mas_wr_walk(&wr_mas))) { >>> + /* Spanning store, use worst case for now */ >>> + request = 1 + mas_mt_height(mas) * 3; >>> + goto ask_now; >>> + } >>> + >>> + /* At this point, we are at the leaf node that needs to be >>> altered. */ >>> + /* Exact fit, no nodes needed. */ >>> + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last) >>> + return 0; >>> + >>> + mas_wr_end_piv(&wr_mas); >>> + node_size = mas_wr_new_end(&wr_mas); >>> + /* Slot store can avoid using any nodes */ >>> + if (node_size == wr_mas.node_end && wr_mas.offset_end - >>> mas->offset == 1) >>> + return 0; >>> + >>> + if (node_size >= mt_slots[wr_mas.type]) { >>> + /* Split, worst case for now. */ >>> + request = 1 + mas_mt_height(mas) * 2; >>> + goto ask_now; >>> + } >>> + >>> + /* Appending does not need any nodes */ >>> + if (node_size == wr_mas.node_end + 1 && mas->offset == >>> wr_mas.node_end) >>> + return 0; >>> + >>> + /* Potential spanning rebalance collapsing a node, use >>> worst-case */ >>> + if (node_size - 1 <= mt_min_slots[wr_mas.type]) >>> + request = mas_mt_height(mas) * 2 - 1; >>> + >>> + /* node store needs one node */ >>> +ask_now: >>> + mas_node_count_gfp(mas, request, gfp); >>> mas->mas_flags |= MA_STATE_PREALLOC; >>> if (likely(!mas_is_err(mas))) >>> return 0; >> >> >
On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: > On the other hand, unless I miss something (and if so, please let me know), > something is bogus with the API then. > > While the documentation of the Advanced API of the maple tree explicitly > claims that the user of the API is responsible for locking, this should be > limited to the bounds set by the maple tree implementation. Which means, the > user must decide for either the internal (spin-) lock or an external lock > (which possibly goes away in the future) and acquire and release it > according to the rules maple tree enforces through lockdep checks. > > Let's say one picks the internal lock. How is one supposed to ensure the > tree isn't modified using the internal lock with mas_preallocate()? > > Besides that, I think the documentation should definitely mention this > limitation and give some guidance for the locking. > > Currently, from an API perspective, I can't see how anyone not familiar with > the implementation details would be able to recognize this limitation. > > In terms of the GPUVA manager, unfortunately, it seems like I need to drop > the maple tree and go back to using a rb-tree, since it seems there is no > sane way doing a worst-case pre-allocation that does not suffer from this > limitation. I haven't been paying much attention here (too many other things going on), but something's wrong. First, you shouldn't need to preallocate. Preallocation is only there for really gnarly cases. The way this is *supposed* to work is that the store walks down to the leaf, attempts to insert into that leaf and tries to allocate new nodes with __GFP_NOWAIT. If that fails, it drops the spinlock, allocates with the gfp flags you've specified, then rewalks the tree to retry the store, this time with allocated nodes in its back pocket so that the store will succeed.
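A minimal sketch of the non-preallocating path described above, assuming a tree that uses the default internal spinlock; the surrounding helper and names are illustrative:

```c
static int example_store_gfp(struct maple_tree *mt, unsigned long first,
			     unsigned long last, void *entry)
{
	MA_STATE(mas, mt, first, last);
	int ret;

	mas_lock(&mas);
	/*
	 * mas_store_gfp() allocates nodes itself: it first tries a
	 * non-blocking allocation under the spinlock and, if that fails,
	 * drops the lock, allocates with the caller's gfp flags and
	 * rewalks the tree to retry the store.
	 */
	ret = mas_store_gfp(&mas, entry, GFP_KERNEL);
	mas_unlock(&mas);

	return ret;
}
```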
在 2023/6/26 08:38, Danilo Krummrich 写道: > Hi Peng, > > On 6/25/23 05:28, Peng Zhang wrote: >> >> >> 在 2023/6/23 00:41, Danilo Krummrich 写道: >>> On 6/12/23 22:39, Liam R. Howlett wrote: >>>> Calculate the number of nodes based on the pending write action instead >>>> of assuming the worst case. >>> >>> Liam already gave me a heads-up on this patch, which I already >>> replied to [1]. >>> >>> However, I think it might make sense to also reply to this patch >>> directly. >>> >>> For a mas_preallocate() calculating the actual required nodes to be >>> allocated instead of assuming the worst to work, it is required to >>> ensure that the tree does not change between calling >>> mas_preallocate() and mas_store_prealloc() if my understanding is >>> correct. >>> >>> In DRM however, more specifically the DRM GPUVA Manager [2], we do >>> have the case that we are not able to ensure this: >>> >>> Jobs to create GPU mappings can be submitted by userspace, are queued >>> up by the kernel and are processed asynchronously in dma-fence >>> signalling critical paths, e.g. by using the drm_gpu_scheduler. >>> Hence, we must be able to allocate the worst case amount of node, >>> since at the time a job is submitted we can't predict the state the >>> maple tree keeping track of mappings has once a mapping is inserted >>> in the (asynchronous) dma-fence signalling critical path. >>> >>> A more detailed explanation can be found in [1]. >>> >>> Could we keep a separate function for allocating the worst case >>> amount of nodes additionally to this optimization? E.g. something >>> like mas_preallocate_worst_case() or mas_preallocate_unlocked() >>> (since I guess the new one requires the maple tree to be kept locked >>> in order not to change)? >> Hi Danilo, >> >> Your understanding seems incorrect. Even with previously unoptimized >> mas_preallocate(), the maple tree cannot be modified between calls to >> mas_preallocate() and mas_store_prealloc(). The calculation of the >> number of pre-allocated nodes depends on the structure of the maple >> tree. In the unoptimized mas_preallocate(), it depends on the height of >> the tree. If the maple tree is modified before mas_store_prealloc() and >> the height of the tree changes, the number of pre-allocated nodes is >> inaccurate. > > Thanks for pointing this out! > > First of all, it's probably fair to say "naive me", it totally makes > sense the tree height is needed - it's a b-tree. > > On the other hand, unless I miss something (and if so, please let me > know), something is bogus with the API then. > > While the documentation of the Advanced API of the maple tree explicitly > claims that the user of the API is responsible for locking, this should > be limited to the bounds set by the maple tree implementation. Which > means, the user must decide for either the internal (spin-) lock or an > external lock (which possibly goes away in the future) and acquire and > release it according to the rules maple tree enforces through lockdep > checks. > > Let's say one picks the internal lock. How is one supposed to ensure the > tree isn't modified using the internal lock with mas_preallocate()? > > Besides that, I think the documentation should definitely mention this > limitation and give some guidance for the locking. Yes, the documentation of maple tree is not detailed and complete. > > Currently, from an API perspective, I can't see how anyone not familiar > with the implementation details would be able to recognize this limitation. 
> > In terms of the GPUVA manager, unfortunately, it seems like I need to > drop the maple tree and go back to using a rb-tree, since it seems there > is no sane way doing a worst-case pre-allocation that does not suffer > from this limitation. I also think preallocation may not be necessary, and I agree with what Matthew said. Preallocation should be used in some cases where preallocation has to be used. If preallocation is used, but the number of preallocated nodes is insufficient because the tree is modified midway, GFP_NOWAIT will be used for memory allocation during the tree modification process, and the user may not notice that more nodes are not from preallocation. > > - Danilo > >> >> Regards, >> Peng >> >>> >>> [1] >>> https://lore.kernel.org/nouveau/68cd25de-e767-725e-2e7b-703217230bb0@redhat.com/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5 >>> >>> [2] >>> https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653 >>> >>> - Danilo >>> >>>> >>>> This addresses a performance regression introduced in platforms that >>>> have longer allocation timing. >>>> >>>> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> >>>> --- >>>> lib/maple_tree.c | 48 >>>> +++++++++++++++++++++++++++++++++++++++++++++++- >>>> 1 file changed, 47 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c >>>> index 048d6413a114..7ac5b5457603 100644 >>>> --- a/lib/maple_tree.c >>>> +++ b/lib/maple_tree.c >>>> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc); >>>> */ >>>> int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >>>> { >>>> + MA_WR_STATE(wr_mas, mas, entry); >>>> + unsigned char node_size; >>>> + int request = 1; >>>> int ret; >>>> - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp); >>>> + >>>> + if (unlikely(!mas->index && mas->last == ULONG_MAX)) >>>> + goto ask_now; >>>> + >>>> + mas_wr_store_setup(&wr_mas); >>>> + wr_mas.content = mas_start(mas); >>>> + /* Root expand */ >>>> + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) >>>> + goto ask_now; >>>> + >>>> + if (unlikely(!mas_wr_walk(&wr_mas))) { >>>> + /* Spanning store, use worst case for now */ >>>> + request = 1 + mas_mt_height(mas) * 3; >>>> + goto ask_now; >>>> + } >>>> + >>>> + /* At this point, we are at the leaf node that needs to be >>>> altered. */ >>>> + /* Exact fit, no nodes needed. */ >>>> + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last) >>>> + return 0; >>>> + >>>> + mas_wr_end_piv(&wr_mas); >>>> + node_size = mas_wr_new_end(&wr_mas); >>>> + /* Slot store can avoid using any nodes */ >>>> + if (node_size == wr_mas.node_end && wr_mas.offset_end - >>>> mas->offset == 1) >>>> + return 0; >>>> + >>>> + if (node_size >= mt_slots[wr_mas.type]) { >>>> + /* Split, worst case for now. */ >>>> + request = 1 + mas_mt_height(mas) * 2; >>>> + goto ask_now; >>>> + } >>>> + >>>> + /* Appending does not need any nodes */ >>>> + if (node_size == wr_mas.node_end + 1 && mas->offset == >>>> wr_mas.node_end) >>>> + return 0; >>>> + >>>> + /* Potential spanning rebalance collapsing a node, use >>>> worst-case */ >>>> + if (node_size - 1 <= mt_min_slots[wr_mas.type]) >>>> + request = mas_mt_height(mas) * 2 - 1; >>>> + >>>> + /* node store needs one node */ >>>> +ask_now: >>>> + mas_node_count_gfp(mas, request, gfp); >>>> mas->mas_flags |= MA_STATE_PREALLOC; >>>> if (likely(!mas_is_err(mas))) >>>> return 0; >>> >>> >> >
On 6/26/23 15:19, Matthew Wilcox wrote: > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: >> On the other hand, unless I miss something (and if so, please let me know), >> something is bogus with the API then. >> >> While the documentation of the Advanced API of the maple tree explicitly >> claims that the user of the API is responsible for locking, this should be >> limited to the bounds set by the maple tree implementation. Which means, the >> user must decide for either the internal (spin-) lock or an external lock >> (which possibly goes away in the future) and acquire and release it >> according to the rules maple tree enforces through lockdep checks. >> >> Let's say one picks the internal lock. How is one supposed to ensure the >> tree isn't modified using the internal lock with mas_preallocate()? >> >> Besides that, I think the documentation should definitely mention this >> limitation and give some guidance for the locking. >> >> Currently, from an API perspective, I can't see how anyone not familiar with >> the implementation details would be able to recognize this limitation. >> >> In terms of the GPUVA manager, unfortunately, it seems like I need to drop >> the maple tree and go back to using a rb-tree, since it seems there is no >> sane way doing a worst-case pre-allocation that does not suffer from this >> limitation. > > I haven't been paying much attention here (too many other things going > on), but something's wrong. > > First, you shouldn't need to preallocate. Preallocation is only there Unfortunately, I think we really have a case where we have to. Typically GPU mappings are created in a dma-fence signalling critical path and that is where such mappings need to be added to the maple tree. Hence, we can't do any sleeping allocations there. > for really gnarly cases. The way this is *supposed* to work is that > the store walks down to the leaf, attempts to insert into that leaf > and tries to allocate new nodes with __GFP_NOWAIT. If that fails, > it drops the spinlock, allocates with the gfp flags you've specified, > then rewalks the tree to retry the store, this time with allocated > nodes in its back pocket so that the store will succeed. You are talking about mas_store_gfp() here, right? And I guess, if the tree has changed while the spinlock was dropped and even more nodes are needed it just retries until it succeeds? But what about mas_preallocate()? What happens if the tree changed in between mas_preallocate() and mas_store_prealloc()? Does the latter one fall back to __GFP_NOWAIT in such a case? I guess not, since mas_store_prealloc() has a void return type, and __GFP_NOWAIT could fail as well. So, how to use the internal spinlock for mas_preallocate() and mas_store_prealloc() to ensure the tree can't change?
On 6/26/23 16:08, Peng Zhang wrote: > > > 在 2023/6/26 08:38, Danilo Krummrich 写道: >> Hi Peng, >> >> On 6/25/23 05:28, Peng Zhang wrote: >>> >>> >>> 在 2023/6/23 00:41, Danilo Krummrich 写道: >>>> On 6/12/23 22:39, Liam R. Howlett wrote: >>>>> Calculate the number of nodes based on the pending write action >>>>> instead >>>>> of assuming the worst case. >>>> >>>> Liam already gave me a heads-up on this patch, which I already >>>> replied to [1]. >>>> >>>> However, I think it might make sense to also reply to this patch >>>> directly. >>>> >>>> For a mas_preallocate() calculating the actual required nodes to be >>>> allocated instead of assuming the worst to work, it is required to >>>> ensure that the tree does not change between calling >>>> mas_preallocate() and mas_store_prealloc() if my understanding is >>>> correct. >>>> >>>> In DRM however, more specifically the DRM GPUVA Manager [2], we do >>>> have the case that we are not able to ensure this: >>>> >>>> Jobs to create GPU mappings can be submitted by userspace, are >>>> queued up by the kernel and are processed asynchronously in >>>> dma-fence signalling critical paths, e.g. by using the >>>> drm_gpu_scheduler. Hence, we must be able to allocate the worst case >>>> amount of node, since at the time a job is submitted we can't >>>> predict the state the maple tree keeping track of mappings has once >>>> a mapping is inserted in the (asynchronous) dma-fence signalling >>>> critical path. >>>> >>>> A more detailed explanation can be found in [1]. >>>> >>>> Could we keep a separate function for allocating the worst case >>>> amount of nodes additionally to this optimization? E.g. something >>>> like mas_preallocate_worst_case() or mas_preallocate_unlocked() >>>> (since I guess the new one requires the maple tree to be kept locked >>>> in order not to change)? >>> Hi Danilo, >>> >>> Your understanding seems incorrect. Even with previously unoptimized >>> mas_preallocate(), the maple tree cannot be modified between calls to >>> mas_preallocate() and mas_store_prealloc(). The calculation of the >>> number of pre-allocated nodes depends on the structure of the maple >>> tree. In the unoptimized mas_preallocate(), it depends on the height of >>> the tree. If the maple tree is modified before mas_store_prealloc() and >>> the height of the tree changes, the number of pre-allocated nodes is >>> inaccurate. >> >> Thanks for pointing this out! >> >> First of all, it's probably fair to say "naive me", it totally makes >> sense the tree height is needed - it's a b-tree. >> >> On the other hand, unless I miss something (and if so, please let me >> know), something is bogus with the API then. >> >> While the documentation of the Advanced API of the maple tree >> explicitly claims that the user of the API is responsible for locking, >> this should be limited to the bounds set by the maple tree >> implementation. Which means, the user must decide for either the >> internal (spin-) lock or an external lock (which possibly goes away in >> the future) and acquire and release it according to the rules maple >> tree enforces through lockdep checks. >> >> Let's say one picks the internal lock. How is one supposed to ensure >> the tree isn't modified using the internal lock with mas_preallocate()? >> >> Besides that, I think the documentation should definitely mention this >> limitation and give some guidance for the locking. > Yes, the documentation of maple tree is not detailed and complete. 
>> >> Currently, from an API perspective, I can't see how anyone not >> familiar with the implementation details would be able to recognize >> this limitation. >> >> In terms of the GPUVA manager, unfortunately, it seems like I need to >> drop the maple tree and go back to using a rb-tree, since it seems >> there is no sane way doing a worst-case pre-allocation that does not >> suffer from this limitation. > I also think preallocation may not be necessary, and I agree with what > Matthew said. Preallocation should be used in some cases where > preallocation has to be used. If preallocation is used, but the number > of preallocated nodes is insufficient because the tree is modified > midway, GFP_NOWAIT will be used for memory allocation during the tree > modification process, and the user may not notice that more nodes are > not from preallocation. Please see my reply to Matthew. :) - Danilo > >> >> - Danilo >> >>> >>> Regards, >>> Peng >>> >>>> >>>> [1] >>>> https://lore.kernel.org/nouveau/68cd25de-e767-725e-2e7b-703217230bb0@redhat.com/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5 >>>> >>>> [2] >>>> https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653 >>>> >>>> - Danilo >>>> >>>>> >>>>> This addresses a performance regression introduced in platforms that >>>>> have longer allocation timing. >>>>> >>>>> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> >>>>> --- >>>>> lib/maple_tree.c | 48 >>>>> +++++++++++++++++++++++++++++++++++++++++++++++- >>>>> 1 file changed, 47 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c >>>>> index 048d6413a114..7ac5b5457603 100644 >>>>> --- a/lib/maple_tree.c >>>>> +++ b/lib/maple_tree.c >>>>> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc); >>>>> */ >>>>> int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >>>>> { >>>>> + MA_WR_STATE(wr_mas, mas, entry); >>>>> + unsigned char node_size; >>>>> + int request = 1; >>>>> int ret; >>>>> - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp); >>>>> + >>>>> + if (unlikely(!mas->index && mas->last == ULONG_MAX)) >>>>> + goto ask_now; >>>>> + >>>>> + mas_wr_store_setup(&wr_mas); >>>>> + wr_mas.content = mas_start(mas); >>>>> + /* Root expand */ >>>>> + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) >>>>> + goto ask_now; >>>>> + >>>>> + if (unlikely(!mas_wr_walk(&wr_mas))) { >>>>> + /* Spanning store, use worst case for now */ >>>>> + request = 1 + mas_mt_height(mas) * 3; >>>>> + goto ask_now; >>>>> + } >>>>> + >>>>> + /* At this point, we are at the leaf node that needs to be >>>>> altered. */ >>>>> + /* Exact fit, no nodes needed. */ >>>>> + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last) >>>>> + return 0; >>>>> + >>>>> + mas_wr_end_piv(&wr_mas); >>>>> + node_size = mas_wr_new_end(&wr_mas); >>>>> + /* Slot store can avoid using any nodes */ >>>>> + if (node_size == wr_mas.node_end && wr_mas.offset_end - >>>>> mas->offset == 1) >>>>> + return 0; >>>>> + >>>>> + if (node_size >= mt_slots[wr_mas.type]) { >>>>> + /* Split, worst case for now. 
*/ >>>>> + request = 1 + mas_mt_height(mas) * 2; >>>>> + goto ask_now; >>>>> + } >>>>> + >>>>> + /* Appending does not need any nodes */ >>>>> + if (node_size == wr_mas.node_end + 1 && mas->offset == >>>>> wr_mas.node_end) >>>>> + return 0; >>>>> + >>>>> + /* Potential spanning rebalance collapsing a node, use >>>>> worst-case */ >>>>> + if (node_size - 1 <= mt_min_slots[wr_mas.type]) >>>>> + request = mas_mt_height(mas) * 2 - 1; >>>>> + >>>>> + /* node store needs one node */ >>>>> +ask_now: >>>>> + mas_node_count_gfp(mas, request, gfp); >>>>> mas->mas_flags |= MA_STATE_PREALLOC; >>>>> if (likely(!mas_is_err(mas))) >>>>> return 0; >>>> >>>> >>> >> >
在 2023/6/26 22:27, Danilo Krummrich 写道: > On 6/26/23 15:19, Matthew Wilcox wrote: >> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: >>> On the other hand, unless I miss something (and if so, please let me >>> know), >>> something is bogus with the API then. >>> >>> While the documentation of the Advanced API of the maple tree explicitly >>> claims that the user of the API is responsible for locking, this >>> should be >>> limited to the bounds set by the maple tree implementation. Which >>> means, the >>> user must decide for either the internal (spin-) lock or an external >>> lock >>> (which possibly goes away in the future) and acquire and release it >>> according to the rules maple tree enforces through lockdep checks. >>> >>> Let's say one picks the internal lock. How is one supposed to ensure the >>> tree isn't modified using the internal lock with mas_preallocate()? >>> >>> Besides that, I think the documentation should definitely mention this >>> limitation and give some guidance for the locking. >>> >>> Currently, from an API perspective, I can't see how anyone not >>> familiar with >>> the implementation details would be able to recognize this limitation. >>> >>> In terms of the GPUVA manager, unfortunately, it seems like I need to >>> drop >>> the maple tree and go back to using a rb-tree, since it seems there >>> is no >>> sane way doing a worst-case pre-allocation that does not suffer from >>> this >>> limitation. >> >> I haven't been paying much attention here (too many other things going >> on), but something's wrong. >> >> First, you shouldn't need to preallocate. Preallocation is only there > > Unfortunately, I think we really have a case where we have to. Typically > GPU mappings are created in a dma-fence signalling critical path and > that is where such mappings need to be added to the maple tree. Hence, > we can't do any sleeping allocations there. > >> for really gnarly cases. The way this is *supposed* to work is that >> the store walks down to the leaf, attempts to insert into that leaf >> and tries to allocate new nodes with __GFP_NOWAIT. If that fails, >> it drops the spinlock, allocates with the gfp flags you've specified, >> then rewalks the tree to retry the store, this time with allocated >> nodes in its back pocket so that the store will succeed. > > You are talking about mas_store_gfp() here, right? And I guess, if the > tree has changed while the spinlock was dropped and even more nodes are > needed it just retries until it succeeds? > > But what about mas_preallocate()? What happens if the tree changed in > between mas_preallocate() and mas_store_prealloc()? Does the latter one > fall back to __GFP_NOWAIT in such a case? I guess not, since > mas_store_prealloc() has a void return type, and __GFP_NOWAIT could fail > as well. mas_store_prealloc() will fallback to __GFP_NOWAIT and issue a warning. If __GFP_NOWAIT allocation fails, BUG_ON() in mas_store_prealloc() will be triggered. > > So, how to use the internal spinlock for mas_preallocate() and > mas_store_prealloc() to ensure the tree can't change? >
On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote: > On 6/26/23 15:19, Matthew Wilcox wrote: > > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: > > > On the other hand, unless I miss something (and if so, please let me know), > > > something is bogus with the API then. > > > > > > While the documentation of the Advanced API of the maple tree explicitly > > > claims that the user of the API is responsible for locking, this should be > > > limited to the bounds set by the maple tree implementation. Which means, the > > > user must decide for either the internal (spin-) lock or an external lock > > > (which possibly goes away in the future) and acquire and release it > > > according to the rules maple tree enforces through lockdep checks. > > > > > > Let's say one picks the internal lock. How is one supposed to ensure the > > > tree isn't modified using the internal lock with mas_preallocate()? > > > > > > Besides that, I think the documentation should definitely mention this > > > limitation and give some guidance for the locking. > > > > > > Currently, from an API perspective, I can't see how anyone not familiar with > > > the implementation details would be able to recognize this limitation. > > > > > > In terms of the GPUVA manager, unfortunately, it seems like I need to drop > > > the maple tree and go back to using a rb-tree, since it seems there is no > > > sane way doing a worst-case pre-allocation that does not suffer from this > > > limitation. > > > > I haven't been paying much attention here (too many other things going > > on), but something's wrong. > > > > First, you shouldn't need to preallocate. Preallocation is only there > > Unfortunately, I think we really have a case where we have to. Typically GPU > mappings are created in a dma-fence signalling critical path and that is > where such mappings need to be added to the maple tree. Hence, we can't do > any sleeping allocations there. OK, so there are various ways to hadle this, depending on what's appropriate for your case. The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM layer "This is too hard, let me tap into the emergency reserves". It's mildly frowned upon, so let's see if we can do better. If you know where the allocation needs to be stored, but want it to act as NULL until the time is right, you can store a ZERO entry. That will read as NULL until you store to it. A pure overwriting store will not cause any memory allocation since all the implementation has to do is change a pointer. The XArray wraps this up nicely behind an xa_reserve() API. As you're discovering, the Maple Tree API isn't fully baked yet.
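A minimal sketch of the XArray reservation pattern referred to above: the slot is allocated up front with a sleeping allocation and reads as NULL until it is overwritten later, which needs no further memory. The xarray and helper names are illustrative.

```c
#include <linux/xarray.h>

static DEFINE_XARRAY(example_xa);	/* illustrative xarray */

/* May sleep; allocates so the later store needs no memory. */
static int example_reserve(unsigned long index)
{
	return xa_reserve(&example_xa, index, GFP_KERNEL);
}

/* Overwrites the reserved slot without allocating; safe in atomic context. */
static void example_fill(unsigned long index, void *entry)
{
	xa_store(&example_xa, index, entry, GFP_ATOMIC);
}
```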
On 6/26/23 16:49, Peng Zhang wrote: > > > 在 2023/6/26 22:27, Danilo Krummrich 写道: >> On 6/26/23 15:19, Matthew Wilcox wrote: >>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: >>>> On the other hand, unless I miss something (and if so, please let me >>>> know), >>>> something is bogus with the API then. >>>> >>>> While the documentation of the Advanced API of the maple tree >>>> explicitly >>>> claims that the user of the API is responsible for locking, this >>>> should be >>>> limited to the bounds set by the maple tree implementation. Which >>>> means, the >>>> user must decide for either the internal (spin-) lock or an external >>>> lock >>>> (which possibly goes away in the future) and acquire and release it >>>> according to the rules maple tree enforces through lockdep checks. >>>> >>>> Let's say one picks the internal lock. How is one supposed to ensure >>>> the >>>> tree isn't modified using the internal lock with mas_preallocate()? >>>> >>>> Besides that, I think the documentation should definitely mention this >>>> limitation and give some guidance for the locking. >>>> >>>> Currently, from an API perspective, I can't see how anyone not >>>> familiar with >>>> the implementation details would be able to recognize this limitation. >>>> >>>> In terms of the GPUVA manager, unfortunately, it seems like I need >>>> to drop >>>> the maple tree and go back to using a rb-tree, since it seems there >>>> is no >>>> sane way doing a worst-case pre-allocation that does not suffer from >>>> this >>>> limitation. >>> >>> I haven't been paying much attention here (too many other things going >>> on), but something's wrong. >>> >>> First, you shouldn't need to preallocate. Preallocation is only there >> >> Unfortunately, I think we really have a case where we have to. >> Typically GPU mappings are created in a dma-fence signalling critical >> path and that is where such mappings need to be added to the maple >> tree. Hence, we can't do any sleeping allocations there. >> >>> for really gnarly cases. The way this is *supposed* to work is that >>> the store walks down to the leaf, attempts to insert into that leaf >>> and tries to allocate new nodes with __GFP_NOWAIT. If that fails, >>> it drops the spinlock, allocates with the gfp flags you've specified, >>> then rewalks the tree to retry the store, this time with allocated >>> nodes in its back pocket so that the store will succeed. >> >> You are talking about mas_store_gfp() here, right? And I guess, if the >> tree has changed while the spinlock was dropped and even more nodes >> are needed it just retries until it succeeds? >> >> But what about mas_preallocate()? What happens if the tree changed in >> between mas_preallocate() and mas_store_prealloc()? Does the latter >> one fall back to __GFP_NOWAIT in such a case? I guess not, since >> mas_store_prealloc() has a void return type, and __GFP_NOWAIT could >> fail as well. > mas_store_prealloc() will fallback to __GFP_NOWAIT and issue a warning. > If __GFP_NOWAIT allocation fails, BUG_ON() in mas_store_prealloc() will > be triggered. Ok, so this is an absolute last resort and surely should not be relied on. I think the maple tree should either strictly enforce (through locking policy) that this can never happen or if API wise it is OK not to lock these two is legit, return an error code rather then issue a warning and even worse call BUG_ON() in case it can't fix things up. 
- Danilo > >> >> So, how to use the internal spinlock for mas_preallocate() and >> mas_store_prealloc() to ensure the tree can't change? >> >
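For reference, a minimal sketch of the two-step pattern being questioned here, using the internal spinlock and the mas_preallocate() signature from the patch at the end of this thread (tree and entry names are illustrative); the gap between the two calls is exactly the window under discussion:

#include <linux/maple_tree.h>

static int insert_range(struct maple_tree *mt, unsigned long first,
			unsigned long last, void *entry)
{
	MA_STATE(mas, mt, first, last);
	int ret;

	/* May sleep, so the internal spinlock cannot be held here. */
	ret = mas_preallocate(&mas, entry, GFP_KERNEL);
	if (ret)
		return ret;

	/*
	 * Window: a concurrent writer taking the internal lock here can
	 * reshape the tree, so the worst-case node count computed above
	 * may no longer be sufficient.
	 */

	mas_lock(&mas);
	/* Per the discussion above: falls back to GFP_NOWAIT, and ultimately
	 * BUG_ON(), if the preallocated nodes run out.
	 */
	mas_store_prealloc(&mas, entry);
	mas_unlock(&mas);
	return 0;
}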
On 6/26/23 16:52, Matthew Wilcox wrote: > On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote: >> On 6/26/23 15:19, Matthew Wilcox wrote: >>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: >>>> On the other hand, unless I miss something (and if so, please let me know), >>>> something is bogus with the API then. >>>> >>>> While the documentation of the Advanced API of the maple tree explicitly >>>> claims that the user of the API is responsible for locking, this should be >>>> limited to the bounds set by the maple tree implementation. Which means, the >>>> user must decide for either the internal (spin-) lock or an external lock >>>> (which possibly goes away in the future) and acquire and release it >>>> according to the rules maple tree enforces through lockdep checks. >>>> >>>> Let's say one picks the internal lock. How is one supposed to ensure the >>>> tree isn't modified using the internal lock with mas_preallocate()? >>>> >>>> Besides that, I think the documentation should definitely mention this >>>> limitation and give some guidance for the locking. >>>> >>>> Currently, from an API perspective, I can't see how anyone not familiar with >>>> the implementation details would be able to recognize this limitation. >>>> >>>> In terms of the GPUVA manager, unfortunately, it seems like I need to drop >>>> the maple tree and go back to using a rb-tree, since it seems there is no >>>> sane way doing a worst-case pre-allocation that does not suffer from this >>>> limitation. >>> >>> I haven't been paying much attention here (too many other things going >>> on), but something's wrong. >>> >>> First, you shouldn't need to preallocate. Preallocation is only there >> >> Unfortunately, I think we really have a case where we have to. Typically GPU >> mappings are created in a dma-fence signalling critical path and that is >> where such mappings need to be added to the maple tree. Hence, we can't do >> any sleeping allocations there. > > OK, so there are various ways to hadle this, depending on what's > appropriate for your case. > > The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM > layer "This is too hard, let me tap into the emergency reserves". It's > mildly frowned upon, so let's see if we can do better. > > If you know where the allocation needs to be stored, but want it to act as > NULL until the time is right, you can store a ZERO entry. That will read > as NULL until you store to it. A pure overwriting store will not cause > any memory allocation since all the implementation has to do is change > a pointer. The XArray wraps this up nicely behind an xa_reserve() API. > As you're discovering, the Maple Tree API isn't fully baked yet. > Unfortunately, GFP_ATOMIC seems the be the only option. I think storing entries in advance would not work. Typically userspace submits a job to the kernel issuing one or multiple requests to map and unmap memory in an ioctl. Such a job is then put into a queue and processed asynchronously in a dma-fence signalling critical section. Hence, at the we'd store entries in advance we could have an arbitrary amount of pending jobs potentially still messing with the same address space region. So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC directly in the fence signalling critical path. I guess mas_store_gfp() does not BUG_ON() if it can't get atomic pages? Also, I just saw that the tree is limited in it's height (MAPLE_HEIGHT_MAX). 
Do you think it could be a sane alternative to pre-allocate with MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise of pre-allocating just a couple of nodes and then relying on atomic pages for the rest? FYI, we're talking about on the order of hundreds of thousands of entries to be stored in the tree. - Danilo
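For illustration, what the GFP_ATOMIC route would roughly look like in the fence signalling critical path, assuming mas_store_gfp() reports allocation failure through its return value rather than a BUG_ON(); the driver-side names are hypothetical:

#include <linux/maple_tree.h>

static int va_insert_atomic(struct maple_tree *mt, unsigned long first,
			    unsigned long last, void *va)
{
	MA_STATE(mas, mt, first, last);
	int ret;

	/* dma-fence signalling critical path: no sleeping allocations. */
	mas_lock(&mas);
	ret = mas_store_gfp(&mas, va, GFP_ATOMIC);
	mas_unlock(&mas);

	return ret;	/* e.g. -ENOMEM once the atomic reserves run dry */
}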
* Danilo Krummrich <dakr@redhat.com> [230626 14:37]: > On 6/26/23 16:52, Matthew Wilcox wrote: > > On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote: > > > On 6/26/23 15:19, Matthew Wilcox wrote: > > > > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: > > > > > On the other hand, unless I miss something (and if so, please let me know), > > > > > something is bogus with the API then. > > > > > > > > > > While the documentation of the Advanced API of the maple tree explicitly > > > > > claims that the user of the API is responsible for locking, this should be > > > > > limited to the bounds set by the maple tree implementation. Which means, the > > > > > user must decide for either the internal (spin-) lock or an external lock > > > > > (which possibly goes away in the future) and acquire and release it > > > > > according to the rules maple tree enforces through lockdep checks. > > > > > > > > > > Let's say one picks the internal lock. How is one supposed to ensure the > > > > > tree isn't modified using the internal lock with mas_preallocate()? > > > > > > > > > > Besides that, I think the documentation should definitely mention this > > > > > limitation and give some guidance for the locking. > > > > > > > > > > Currently, from an API perspective, I can't see how anyone not familiar with > > > > > the implementation details would be able to recognize this limitation. > > > > > > > > > > In terms of the GPUVA manager, unfortunately, it seems like I need to drop > > > > > the maple tree and go back to using a rb-tree, since it seems there is no > > > > > sane way doing a worst-case pre-allocation that does not suffer from this > > > > > limitation. > > > > > > > > I haven't been paying much attention here (too many other things going > > > > on), but something's wrong. > > > > > > > > First, you shouldn't need to preallocate. Preallocation is only there > > > > > > Unfortunately, I think we really have a case where we have to. Typically GPU > > > mappings are created in a dma-fence signalling critical path and that is > > > where such mappings need to be added to the maple tree. Hence, we can't do > > > any sleeping allocations there. > > > > OK, so there are various ways to hadle this, depending on what's > > appropriate for your case. > > > > The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM > > layer "This is too hard, let me tap into the emergency reserves". It's > > mildly frowned upon, so let's see if we can do better. > > > > If you know where the allocation needs to be stored, but want it to act as > > NULL until the time is right, you can store a ZERO entry. That will read > > as NULL until you store to it. A pure overwriting store will not cause > > any memory allocation since all the implementation has to do is change > > a pointer. The XArray wraps this up nicely behind an xa_reserve() API. > > As you're discovering, the Maple Tree API isn't fully baked yet. > > > > Unfortunately, GFP_ATOMIC seems the be the only option. I think storing > entries in advance would not work. Typically userspace submits a job to the > kernel issuing one or multiple requests to map and unmap memory in an ioctl. > Such a job is then put into a queue and processed asynchronously in a > dma-fence signalling critical section. Hence, at the we'd store entries in > advance we could have an arbitrary amount of pending jobs potentially still > messing with the same address space region. 
What I think you are saying is that you have a number of requests flooding in, which may overwrite the same areas, but are queued up to be written after they are queued. These operations look to be kept in order according to the code in nouveau_job_submit[1]. Is this correct? So then, your issue isn't that you don't know where they will land, but don't know if the area that you reserved is already split into other areas? For instance, before the range 5-10 is backed by whatever happens in the fence, it may have already become 5-6 & 8-10 by something that came after (from userspace) but hasn't been processed by the kernel that will live at 7? So you can't write 5-10 right away because you can't be sure 5-10 is going to exist once you reach the kernel fence code that stores the entry? Is my understanding of your issue correct? Oh, and I guess the queued requests would have to remain ordered between threads or whatever is on the other side? I mean, you can't have two threads firing different things into the kernel at the same region because I would think the results would be unpredictable? Can these overlapping entries partially overlap one region and another? That is, can you have three in-flight writes that does something like: store 1-10, store 10-20, store 5-15? How stable of an output is needed? Does each kernel write need to be 100% correct or is there a point where the userspace updates stop and only then it is needed to be stable? > > So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC > directly in the fence signalling critical path. I guess mas_store_gfp() does > not BUG_ON() if it can't get atomic pages? > > Also, I just saw that the tree is limited in it's height (MAPLE_HEIGHT_MAX). > Do you think it could be a sane alternative to pre-allocate with > MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise > of pre-allocating just a couple of nodes and then rely on atomic pages for > the rest? > > FYI, we're talking about a magnitude of hundreds of thousands of entries to > be stored in the tree. > Since you are not tracking gaps, you will get 16 entries per node. The maximum height is 31, so that would be 16^31, assuming a gap between each entry (the worst case), you can cut that in 1/2. To assure you can successfully allocate storage for a new entries, you'd need to allocate 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful as almost all of these would be freed, and sometimes all of them. You estimate less than 1M entries, that would never go over 6 levels (8.3M entries with the worst-case). 5 levels would get you 500K in the worst case, but realistically you'll be in the 5 levels almost always. So, 5*3+1 = 17 nodes, or 2 pages (1 node over 1 page).. assuming 4k pages. [1] https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c
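As a quick sanity check of the page numbers above (a maple node is 256 bytes, so 16 nodes fit into a 4K page):

/*
 * Worst-case request at the maximum height:
 *   30 * 3 + 1 = 91 nodes,  91 * 256 B = 23,296 B  ->  6 x 4K pages
 *
 * At the realistic height of 5 (~500K entries worst case):
 *   17 nodes,  17 * 256 B = 4,352 B  ->  2 pages (one node over a page)
 */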
Hi Liam, On 6/27/23 03:58, Liam R. Howlett wrote: > * Danilo Krummrich <dakr@redhat.com> [230626 14:37]: >> On 6/26/23 16:52, Matthew Wilcox wrote: >>> On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote: >>>> On 6/26/23 15:19, Matthew Wilcox wrote: >>>>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: >>>>>> On the other hand, unless I miss something (and if so, please let me know), >>>>>> something is bogus with the API then. >>>>>> >>>>>> While the documentation of the Advanced API of the maple tree explicitly >>>>>> claims that the user of the API is responsible for locking, this should be >>>>>> limited to the bounds set by the maple tree implementation. Which means, the >>>>>> user must decide for either the internal (spin-) lock or an external lock >>>>>> (which possibly goes away in the future) and acquire and release it >>>>>> according to the rules maple tree enforces through lockdep checks. >>>>>> >>>>>> Let's say one picks the internal lock. How is one supposed to ensure the >>>>>> tree isn't modified using the internal lock with mas_preallocate()? >>>>>> >>>>>> Besides that, I think the documentation should definitely mention this >>>>>> limitation and give some guidance for the locking. >>>>>> >>>>>> Currently, from an API perspective, I can't see how anyone not familiar with >>>>>> the implementation details would be able to recognize this limitation. >>>>>> >>>>>> In terms of the GPUVA manager, unfortunately, it seems like I need to drop >>>>>> the maple tree and go back to using a rb-tree, since it seems there is no >>>>>> sane way doing a worst-case pre-allocation that does not suffer from this >>>>>> limitation. >>>>> >>>>> I haven't been paying much attention here (too many other things going >>>>> on), but something's wrong. >>>>> >>>>> First, you shouldn't need to preallocate. Preallocation is only there >>>> >>>> Unfortunately, I think we really have a case where we have to. Typically GPU >>>> mappings are created in a dma-fence signalling critical path and that is >>>> where such mappings need to be added to the maple tree. Hence, we can't do >>>> any sleeping allocations there. >>> >>> OK, so there are various ways to hadle this, depending on what's >>> appropriate for your case. >>> >>> The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM >>> layer "This is too hard, let me tap into the emergency reserves". It's >>> mildly frowned upon, so let's see if we can do better. >>> >>> If you know where the allocation needs to be stored, but want it to act as >>> NULL until the time is right, you can store a ZERO entry. That will read >>> as NULL until you store to it. A pure overwriting store will not cause >>> any memory allocation since all the implementation has to do is change >>> a pointer. The XArray wraps this up nicely behind an xa_reserve() API. >>> As you're discovering, the Maple Tree API isn't fully baked yet. >>> >> >> Unfortunately, GFP_ATOMIC seems the be the only option. I think storing >> entries in advance would not work. Typically userspace submits a job to the >> kernel issuing one or multiple requests to map and unmap memory in an ioctl. >> Such a job is then put into a queue and processed asynchronously in a >> dma-fence signalling critical section. Hence, at the we'd store entries in >> advance we could have an arbitrary amount of pending jobs potentially still >> messing with the same address space region. 
> > What I think you are saying is that you have a number of requests > flooding in, which may overwrite the same areas, but are queued up to be > written after they are queued. These operations look to be kept in > order according to the code in nouveau_job_submit[1]. Is this correct? That's all correct. (Although Nouveau isn't a good example in this case. Some aspects of it do and some aspects of it do not apply to the problem we're discussing here.) > > So then, your issue isn't that you don't know where they will land, but > don't know if the area that you reserved is already split into other > areas? For instance, before the range 5-10 is backed by whatever > happens in the fence, it may have already become 5-6 & 8-10 by something > that came after (from userspace) but hasn't been processed by the > kernel that will live at 7? So you can't write 5-10 right away because > you can't be sure 5-10 is going to exist once you reach the kernel fence > code that stores the entry? > > Is my understanding of your issue correct? Yes, it is. However, the problem already starts while trying to reserve an area. In order to satisfy a user request, such a request is broken down into operations such as unmap mappings which are in the way entirely, remap mappings which intersect with the requested mapping and finally map the requested mapping. The execution of such a sequence must appear atomic and hence be locked accordingly. When trying to reserve an area we'd need to take that lock. But since this lock would be used in the dma-fence signalling critical path as well we'd not be allowed to do sleeping allocations while holding this lock. Potentially, this could be solved with a retry loop though. Drop the lock while allocating, take it again and check whether we still got enough nodes allocated. Analogous to what the maple tree does in mas_store_gfp(), I guess. > > Oh, and I guess the queued requests would have to remain ordered between > threads or whatever is on the other side? I mean, you can't have two > threads firing different things into the kernel at the same region > because I would think the results would be unpredictable? Once a job is queued up in the kernel they remain ordered. However, user threads could concurrently push jobs to the kernel altering the same region of the address space - it just would not make any sense for userspace to do that. In general userspace is responsible for the semantics of the address space. The kernel basically just takes any (valid) request and make it happen. It also assures waiting and signalling of fences which might be bound to certain jobs and obviously keeps track of the VA space to be able to clean things up once a client disappears. > > Can these overlapping entries partially overlap one region and another? > That is, can you have three in-flight writes that does something like: > store 1-10, store 10-20, store 5-15? Absolutely, yes. > > How stable of an output is needed? Does each kernel write need to be > 100% correct or is there a point where the userspace updates stop and > only then it is needed to be stable? It needs to be 100% correct all the time. The reason is that, as mentioned above, every job can carry in- and out-fences, such that userspace can order these jobs against the execution of shaders. This is also why there could be jobs queued up, where all of them apply changes to the same region within the VA space, since there might be shader executions (or just memory copies) ordered right between them. 
- Danilo > >> >> So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC >> directly in the fence signalling critical path. I guess mas_store_gfp() does >> not BUG_ON() if it can't get atomic pages? >> >> Also, I just saw that the tree is limited in it's height (MAPLE_HEIGHT_MAX). >> Do you think it could be a sane alternative to pre-allocate with >> MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise >> of pre-allocating just a couple of nodes and then rely on atomic pages for >> the rest? >> >> FYI, we're talking about a magnitude of hundreds of thousands of entries to >> be stored in the tree. >> > > Since you are not tracking gaps, you will get 16 entries per node. The > maximum height is 31, so that would be 16^31, assuming a gap between > each entry (the worst case), you can cut that in 1/2. To assure you can > successfully allocate storage for a new entries, you'd need to allocate > 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful > as almost all of these would be freed, and sometimes all of them. > > You estimate less than 1M entries, that would never go over 6 levels (8.3M > entries with the worst-case). 5 levels would get you 500K in the worst > case, but realistically you'll be in the 5 levels almost always. So, > 5*3+1 = 17 nodes, or 2 pages (1 node over 1 page).. assuming 4k pages. > > [1] https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c >
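A rough sketch of the retry idea Danilo mentions above (drop the lock for the sleeping allocation, re-take it and re-check), modelled on what mas_store_gfp() does internally; the helper name, the use of the tree's internal lock and the exact gfp flags are all illustrative rather than a worked-out design:

#include <linux/maple_tree.h>

static int va_insert_retry(struct maple_tree *mt, unsigned long first,
			   unsigned long last, void *entry)
{
	MA_STATE(mas, mt, first, last);
	int ret;

retry:
	mas_lock(&mas);
	/* Cheap attempt against the current tree shape, no sleeping. */
	ret = mas_preallocate(&mas, entry, GFP_NOWAIT);
	if (ret) {
		mas_unlock(&mas);
		/* Sleeping allocation outside the lock... */
		ret = mas_preallocate(&mas, entry, GFP_KERNEL);
		if (ret)
			return ret;
		/* ...then re-check: the tree may have changed meanwhile. */
		goto retry;
	}

	mas_store_prealloc(&mas, entry);
	mas_unlock(&mas);
	return 0;
}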
* Danilo Krummrich <dakr@redhat.com> [230627 10:58]: > Hi Liam, > > On 6/27/23 03:58, Liam R. Howlett wrote: > > * Danilo Krummrich <dakr@redhat.com> [230626 14:37]: > > > On 6/26/23 16:52, Matthew Wilcox wrote: > > > > On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote: > > > > > On 6/26/23 15:19, Matthew Wilcox wrote: > > > > > > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: > > > > > > > On the other hand, unless I miss something (and if so, please let me know), > > > > > > > something is bogus with the API then. > > > > > > > > > > > > > > While the documentation of the Advanced API of the maple tree explicitly > > > > > > > claims that the user of the API is responsible for locking, this should be > > > > > > > limited to the bounds set by the maple tree implementation. Which means, the > > > > > > > user must decide for either the internal (spin-) lock or an external lock > > > > > > > (which possibly goes away in the future) and acquire and release it > > > > > > > according to the rules maple tree enforces through lockdep checks. > > > > > > > > > > > > > > Let's say one picks the internal lock. How is one supposed to ensure the > > > > > > > tree isn't modified using the internal lock with mas_preallocate()? > > > > > > > > > > > > > > Besides that, I think the documentation should definitely mention this > > > > > > > limitation and give some guidance for the locking. > > > > > > > > > > > > > > Currently, from an API perspective, I can't see how anyone not familiar with > > > > > > > the implementation details would be able to recognize this limitation. > > > > > > > > > > > > > > In terms of the GPUVA manager, unfortunately, it seems like I need to drop > > > > > > > the maple tree and go back to using a rb-tree, since it seems there is no > > > > > > > sane way doing a worst-case pre-allocation that does not suffer from this > > > > > > > limitation. > > > > > > > > > > > > I haven't been paying much attention here (too many other things going > > > > > > on), but something's wrong. > > > > > > > > > > > > First, you shouldn't need to preallocate. Preallocation is only there > > > > > > > > > > Unfortunately, I think we really have a case where we have to. Typically GPU > > > > > mappings are created in a dma-fence signalling critical path and that is > > > > > where such mappings need to be added to the maple tree. Hence, we can't do > > > > > any sleeping allocations there. > > > > > > > > OK, so there are various ways to hadle this, depending on what's > > > > appropriate for your case. > > > > > > > > The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM > > > > layer "This is too hard, let me tap into the emergency reserves". It's > > > > mildly frowned upon, so let's see if we can do better. > > > > > > > > If you know where the allocation needs to be stored, but want it to act as > > > > NULL until the time is right, you can store a ZERO entry. That will read > > > > as NULL until you store to it. A pure overwriting store will not cause > > > > any memory allocation since all the implementation has to do is change > > > > a pointer. The XArray wraps this up nicely behind an xa_reserve() API. > > > > As you're discovering, the Maple Tree API isn't fully baked yet. > > > > > > > > > > Unfortunately, GFP_ATOMIC seems the be the only option. I think storing > > > entries in advance would not work. Typically userspace submits a job to the > > > kernel issuing one or multiple requests to map and unmap memory in an ioctl. 
> > > Such a job is then put into a queue and processed asynchronously in a > > > dma-fence signalling critical section. Hence, at the we'd store entries in > > > advance we could have an arbitrary amount of pending jobs potentially still > > > messing with the same address space region. > > > > What I think you are saying is that you have a number of requests > > flooding in, which may overwrite the same areas, but are queued up to be > > written after they are queued. These operations look to be kept in > > order according to the code in nouveau_job_submit[1]. Is this correct? > > That's all correct. > > (Although Nouveau isn't a good example in this case. Some aspects of it do > and some aspects of it do not apply to the problem we're discussing here.) > > > > > So then, your issue isn't that you don't know where they will land, but > > don't know if the area that you reserved is already split into other > > areas? For instance, before the range 5-10 is backed by whatever > > happens in the fence, it may have already become 5-6 & 8-10 by something > > that came after (from userspace) but hasn't been processed by the > > kernel that will live at 7? So you can't write 5-10 right away because > > you can't be sure 5-10 is going to exist once you reach the kernel fence > > code that stores the entry? > > > > Is my understanding of your issue correct? > > Yes, it is. > > However, the problem already starts while trying to reserve an area. In > order to satisfy a user request, such a request is broken down into > operations such as unmap mappings which are in the way entirely, remap > mappings which intersect with the requested mapping and finally map the > requested mapping. The execution of such a sequence must appear atomic and > hence be locked accordingly. When trying to reserve an area we'd need to > take that lock. But since this lock would be used in the dma-fence > signalling critical path as well we'd not be allowed to do sleeping > allocations while holding this lock. > > Potentially, this could be solved with a retry loop though. Drop the lock > while allocating, take it again and check whether we still got enough nodes > allocated. Analogous to what the maple tree does in mas_store_gfp(), I > guess. > > > > > Oh, and I guess the queued requests would have to remain ordered between > > threads or whatever is on the other side? I mean, you can't have two > > threads firing different things into the kernel at the same region > > because I would think the results would be unpredictable? > > Once a job is queued up in the kernel they remain ordered. > > However, user threads could concurrently push jobs to the kernel altering > the same region of the address space - it just would not make any sense for > userspace to do that. > > In general userspace is responsible for the semantics of the address space. > The kernel basically just takes any (valid) request and make it happen. It > also assures waiting and signalling of fences which might be bound to > certain jobs and obviously keeps track of the VA space to be able to clean > things up once a client disappears. > > > > > Can these overlapping entries partially overlap one region and another? > > That is, can you have three in-flight writes that does something like: > > store 1-10, store 10-20, store 5-15? > > Absolutely, yes. > > > > > How stable of an output is needed? Does each kernel write need to be > > 100% correct or is there a point where the userspace updates stop and > > only then it is needed to be stable? 
> > It needs to be 100% correct all the time. The reason is that, as mentioned > above, every job can carry in- and out-fences, such that userspace can order > these jobs against the execution of shaders. But each job is split into parts, so the fences surround these groups of operations? Since ordering is kept, you must reach a point before entering the fences which could call the mas_preallocate() to ensure enough nodes exist to install the new mapping, and then no other operations will be happening. I guess what you are saying is each fence has more than one tree operation? As long as you are not mapping more than a range, then this should be possible in a single write and thus a single preallocation. You can do this by not actually writing unmaps/remaps to the tree within the fence. Once the out-fence is reached, then the operation looks atomic. Reading your patch, it is not clear this is accurate for VM_BIND of asynchronous syncobjs. Is the fence spanning multiple syncobjs with various ranges to map? Or are these things the split-up tasks of unmap/remap, etc that will eventually boil down to what appears to be a single write? > > This is also why there could be jobs queued up, where all of them apply > changes to the same region within the VA space, since there might be shader > executions (or just memory copies) ordered right between them. > > - Danilo > > > > > > > > > So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC > > > directly in the fence signalling critical path. I guess mas_store_gfp() does > > > not BUG_ON() if it can't get atomic pages? > > > > > > Also, I just saw that the tree is limited in it's height (MAPLE_HEIGHT_MAX). > > > Do you think it could be a sane alternative to pre-allocate with > > > MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise > > > of pre-allocating just a couple of nodes and then rely on atomic pages for > > > the rest? > > > > > > FYI, we're talking about a magnitude of hundreds of thousands of entries to > > > be stored in the tree. > > > > > > > Since you are not tracking gaps, you will get 16 entries per node. The > > maximum height is 31, so that would be 16^31, assuming a gap between > > each entry (the worst case), you can cut that in 1/2. To assure you can > > successfully allocate storage for a new entries, you'd need to allocate > > 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful > > as almost all of these would be freed, and sometimes all of them. > > > > You estimate less than 1M entries, that would never go over 6 levels (8.3M > > entries with the worst-case). 5 levels would get you 500K in the worst > > case, but realistically you'll be in the 5 levels almost always. So, > > 5*3+1 = 17 nodes, or 2 pages (1 node over 1 page).. assuming 4k pages. > > > > [1] https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c > > >
On 6/27/23 18:11, Liam R. Howlett wrote: > * Danilo Krummrich <dakr@redhat.com> [230627 10:58]: >> Hi Liam, >> >> On 6/27/23 03:58, Liam R. Howlett wrote: >>> * Danilo Krummrich <dakr@redhat.com> [230626 14:37]: >>>> On 6/26/23 16:52, Matthew Wilcox wrote: >>>>> On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote: >>>>>> On 6/26/23 15:19, Matthew Wilcox wrote: >>>>>>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote: >>>>>>>> On the other hand, unless I miss something (and if so, please let me know), >>>>>>>> something is bogus with the API then. >>>>>>>> >>>>>>>> While the documentation of the Advanced API of the maple tree explicitly >>>>>>>> claims that the user of the API is responsible for locking, this should be >>>>>>>> limited to the bounds set by the maple tree implementation. Which means, the >>>>>>>> user must decide for either the internal (spin-) lock or an external lock >>>>>>>> (which possibly goes away in the future) and acquire and release it >>>>>>>> according to the rules maple tree enforces through lockdep checks. >>>>>>>> >>>>>>>> Let's say one picks the internal lock. How is one supposed to ensure the >>>>>>>> tree isn't modified using the internal lock with mas_preallocate()? >>>>>>>> >>>>>>>> Besides that, I think the documentation should definitely mention this >>>>>>>> limitation and give some guidance for the locking. >>>>>>>> >>>>>>>> Currently, from an API perspective, I can't see how anyone not familiar with >>>>>>>> the implementation details would be able to recognize this limitation. >>>>>>>> >>>>>>>> In terms of the GPUVA manager, unfortunately, it seems like I need to drop >>>>>>>> the maple tree and go back to using a rb-tree, since it seems there is no >>>>>>>> sane way doing a worst-case pre-allocation that does not suffer from this >>>>>>>> limitation. >>>>>>> >>>>>>> I haven't been paying much attention here (too many other things going >>>>>>> on), but something's wrong. >>>>>>> >>>>>>> First, you shouldn't need to preallocate. Preallocation is only there >>>>>> >>>>>> Unfortunately, I think we really have a case where we have to. Typically GPU >>>>>> mappings are created in a dma-fence signalling critical path and that is >>>>>> where such mappings need to be added to the maple tree. Hence, we can't do >>>>>> any sleeping allocations there. >>>>> >>>>> OK, so there are various ways to hadle this, depending on what's >>>>> appropriate for your case. >>>>> >>>>> The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM >>>>> layer "This is too hard, let me tap into the emergency reserves". It's >>>>> mildly frowned upon, so let's see if we can do better. >>>>> >>>>> If you know where the allocation needs to be stored, but want it to act as >>>>> NULL until the time is right, you can store a ZERO entry. That will read >>>>> as NULL until you store to it. A pure overwriting store will not cause >>>>> any memory allocation since all the implementation has to do is change >>>>> a pointer. The XArray wraps this up nicely behind an xa_reserve() API. >>>>> As you're discovering, the Maple Tree API isn't fully baked yet. >>>>> >>>> >>>> Unfortunately, GFP_ATOMIC seems the be the only option. I think storing >>>> entries in advance would not work. Typically userspace submits a job to the >>>> kernel issuing one or multiple requests to map and unmap memory in an ioctl. >>>> Such a job is then put into a queue and processed asynchronously in a >>>> dma-fence signalling critical section. 
Hence, at the we'd store entries in >>>> advance we could have an arbitrary amount of pending jobs potentially still >>>> messing with the same address space region. >>> >>> What I think you are saying is that you have a number of requests >>> flooding in, which may overwrite the same areas, but are queued up to be >>> written after they are queued. These operations look to be kept in >>> order according to the code in nouveau_job_submit[1]. Is this correct? >> >> That's all correct. >> >> (Although Nouveau isn't a good example in this case. Some aspects of it do >> and some aspects of it do not apply to the problem we're discussing here.) >> >>> >>> So then, your issue isn't that you don't know where they will land, but >>> don't know if the area that you reserved is already split into other >>> areas? For instance, before the range 5-10 is backed by whatever >>> happens in the fence, it may have already become 5-6 & 8-10 by something >>> that came after (from userspace) but hasn't been processed by the >>> kernel that will live at 7? So you can't write 5-10 right away because >>> you can't be sure 5-10 is going to exist once you reach the kernel fence >>> code that stores the entry? >>> >>> Is my understanding of your issue correct? >> >> Yes, it is. >> >> However, the problem already starts while trying to reserve an area. In >> order to satisfy a user request, such a request is broken down into >> operations such as unmap mappings which are in the way entirely, remap >> mappings which intersect with the requested mapping and finally map the >> requested mapping. The execution of such a sequence must appear atomic and >> hence be locked accordingly. When trying to reserve an area we'd need to >> take that lock. But since this lock would be used in the dma-fence >> signalling critical path as well we'd not be allowed to do sleeping >> allocations while holding this lock. >> >> Potentially, this could be solved with a retry loop though. Drop the lock >> while allocating, take it again and check whether we still got enough nodes >> allocated. Analogous to what the maple tree does in mas_store_gfp(), I >> guess. >> >>> >>> Oh, and I guess the queued requests would have to remain ordered between >>> threads or whatever is on the other side? I mean, you can't have two >>> threads firing different things into the kernel at the same region >>> because I would think the results would be unpredictable? >> >> Once a job is queued up in the kernel they remain ordered. >> >> However, user threads could concurrently push jobs to the kernel altering >> the same region of the address space - it just would not make any sense for >> userspace to do that. >> >> In general userspace is responsible for the semantics of the address space. >> The kernel basically just takes any (valid) request and make it happen. It >> also assures waiting and signalling of fences which might be bound to >> certain jobs and obviously keeps track of the VA space to be able to clean >> things up once a client disappears. >> >>> >>> Can these overlapping entries partially overlap one region and another? >>> That is, can you have three in-flight writes that does something like: >>> store 1-10, store 10-20, store 5-15? >> >> Absolutely, yes. >> >>> >>> How stable of an output is needed? Does each kernel write need to be >>> 100% correct or is there a point where the userspace updates stop and >>> only then it is needed to be stable? >> >> It needs to be 100% correct all the time. 
The reason is that, as mentioned >> above, every job can carry in- and out-fences, such that userspace can order >> these jobs against the execution of shaders. > > But each job is split into parts, so the fences surround these groups of > operations? Yes, each job can have multiple requests to map or unmap something and each of them gets broken down into operations to make them happen. The fences are per job. > > Since ordering is kept, you must reach a point before entering the > fences which could call the mas_preallocate() to ensure enough nodes > exist to install the new mapping, and then no other operations will be > happening. I guess what you are saying is each fence has more than one > tree operation? > I guess you assume that in the asynchronous path, where jobs are fetched from the queue for execution, there must be a point of time before we enter the fence signalling critical path. This is not the case. The fence signalling critical path is entered once the corresponding out-fence is published to userspace and hence becomes visible to userspace. This happens when the job submit ioctl() (which is where the job is queued up for execution) returns. Hence, all jobs in the queue potentially entered their fence signalling critical path already before they could have been fetched from the queue. The job submit ioctl() is where we would need to call mas_preallocate(), but this is also where we could have concurrent modifications to the tree from previously submitted jobs that have been fetched from the queue for asynchronous execution. > As long as you are not mapping more than a range, then this should be possible > in a single write and thus a single preallocation. You can do this by > not actually writing unmaps/remaps to the tree within the fence. Once > the out-fence is reached, then the operation looks atomic. As mentioned above there can be an arbitrary amount of map and unmap requests per job. Also, we can have cases where we, for instance, have a mapping A[0,10] either already in the tree (or pending in the job queue) and a user requests B[4,7]. The expected result would be: A'[0,4], B[4,7] and A''[7,10]. Hence, one job can cause multiple writes per single map request even. At the time the job asking to map B is submitted, A might not even be in the tree yet. > > Reading your patch, it is not clear this is accurate for VM_BIND of > asynchronous syncobjs. Is the fence spanning multiple syncobjs with > various ranges to map? Or are these things the split-up tasks of > unmap/remap, etc that will eventually boil down to what appears to be a > single write? > The VM_BIND ioctl() is the ioctl() mentioned above to submit (bind) jobs. Userspace can pass syncobjs together with a job, which can be both syncobj which contain fences to wait for before executing the job and syncobjs to install the job's fence to for userspace to wait for them or pass them into another kernel interface to synchronize them against something else. Otherwise, as mentioned above, each job can have multiple requests to map or unmap something and each of them gets broken down into operations to make such a request happen, which themselves can be re-maps, unmaps and maps. For instance, an unmap request for a given range, gets broken down into re-maps of mappings it intersects and unmaps of mappings it entirely spans over. 
Thanks, Danilo >> >> This is also why there could be jobs queued up, where all of them apply >> changes to the same region within the VA space, since there might be shader >> executions (or just memory copies) ordered right between them. >> >> - Danilo >> >>> >>>> >>>> So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC >>>> directly in the fence signalling critical path. I guess mas_store_gfp() does >>>> not BUG_ON() if it can't get atomic pages? >>>> >>>> Also, I just saw that the tree is limited in it's height (MAPLE_HEIGHT_MAX). >>>> Do you think it could be a sane alternative to pre-allocate with >>>> MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise >>>> of pre-allocating just a couple of nodes and then rely on atomic pages for >>>> the rest? >>>> >>>> FYI, we're talking about a magnitude of hundreds of thousands of entries to >>>> be stored in the tree. >>>> >>> >>> Since you are not tracking gaps, you will get 16 entries per node. The >>> maximum height is 31, so that would be 16^31, assuming a gap between >>> each entry (the worst case), you can cut that in 1/2. To assure you can >>> successfully allocate storage for a new entries, you'd need to allocate >>> 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful >>> as almost all of these would be freed, and sometimes all of them. >>> >>> You estimate less than 1M entries, that would never go over 6 levels (8.3M >>> entries with the worst-case). 5 levels would get you 500K in the worst >>> case, but realistically you'll be in the 5 levels almost always. So, >>> 5*3+1 = 17 nodes, or 2 pages (1 node over 1 page).. assuming 4k pages. >>> >>> [1] https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@redhat.com/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c >>> >> >
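To make the A[0,10] / B[4,7] example concrete, that single map request boils down to tree writes roughly like the following (end-inclusive ranges; a_left, b and a_right stand in for the hypothetical A', B and A'' objects, and all GPUVA-manager bookkeeping is left out):

#include <linux/maple_tree.h>

static int remap_example(struct maple_tree *mt, void *a_left, void *b,
			 void *a_right)
{
	MA_STATE(mas, mt, 0, 0);
	int ret;

	/* A currently spans [0, 10]; userspace asked to map B over [4, 7]. */
	mas_lock(&mas);

	mas_set_range(&mas, 0, 3);			/* A'  */
	ret = mas_store_gfp(&mas, a_left, GFP_ATOMIC);
	if (ret)
		goto out;

	mas_set_range(&mas, 4, 7);			/* B   */
	ret = mas_store_gfp(&mas, b, GFP_ATOMIC);
	if (ret)
		goto out;

	mas_set_range(&mas, 8, 10);			/* A'' */
	ret = mas_store_gfp(&mas, a_right, GFP_ATOMIC);
out:
	mas_unlock(&mas);
	return ret;
}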
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 048d6413a114..7ac5b5457603 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
  */
 int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 {
+	MA_WR_STATE(wr_mas, mas, entry);
+	unsigned char node_size;
+	int request = 1;
 	int ret;
 
-	mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
+
+	if (unlikely(!mas->index && mas->last == ULONG_MAX))
+		goto ask_now;
+
+	mas_wr_store_setup(&wr_mas);
+	wr_mas.content = mas_start(mas);
+	/* Root expand */
+	if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
+		goto ask_now;
+
+	if (unlikely(!mas_wr_walk(&wr_mas))) {
+		/* Spanning store, use worst case for now */
+		request = 1 + mas_mt_height(mas) * 3;
+		goto ask_now;
+	}
+
+	/* At this point, we are at the leaf node that needs to be altered. */
+	/* Exact fit, no nodes needed. */
+	if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
+		return 0;
+
+	mas_wr_end_piv(&wr_mas);
+	node_size = mas_wr_new_end(&wr_mas);
+	/* Slot store can avoid using any nodes */
+	if (node_size == wr_mas.node_end && wr_mas.offset_end - mas->offset == 1)
+		return 0;
+
+	if (node_size >= mt_slots[wr_mas.type]) {
+		/* Split, worst case for now. */
+		request = 1 + mas_mt_height(mas) * 2;
+		goto ask_now;
+	}
+
+	/* Appending does not need any nodes */
+	if (node_size == wr_mas.node_end + 1 && mas->offset == wr_mas.node_end)
+		return 0;
+
+	/* Potential spanning rebalance collapsing a node, use worst-case */
+	if (node_size - 1 <= mt_min_slots[wr_mas.type])
+		request = mas_mt_height(mas) * 2 - 1;
+
+	/* node store needs one node */
+ask_now:
+	mas_node_count_gfp(mas, request, gfp);
 	mas->mas_flags |= MA_STATE_PREALLOC;
 	if (likely(!mas_is_err(mas)))
 		return 0;