Message ID | 20231018221123.136403-1-dongli.zhang@oracle.com |
---|---|
State | New |
Headers |
From: Dongli Zhang <dongli.zhang@oracle.com>
To: x86@kernel.org, virtualization@lists.linux-foundation.org, kvm@vger.kernel.org, pv-drivers@vmware.com, xen-devel@lists.xenproject.org, linux-hyperv@vger.kernel.org
Cc: jgross@suse.com, akaher@vmware.com, amakhalov@vmware.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, pbonzini@redhat.com, wanpengli@tencent.com, vkuznets@redhat.com, peterz@infradead.org, seanjc@google.com, dwmw2@infradead.org, joe.jin@oracle.com, boris.ostrovsky@oracle.com, linux-kernel@vger.kernel.org
Subject: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock
Date: Wed, 18 Oct 2023 15:11:23 -0700
Message-Id: <20231018221123.136403-1-dongli.zhang@oracle.com> |
Series |
[RFC,1/1] x86/paravirt: introduce param to disable pv sched_clock
Commit Message
Dongli Zhang
Oct. 18, 2023, 10:11 p.m. UTC
As mentioned in the Linux kernel development documentation, "sched_clock() is
used for scheduling and timestamping". While there is a default native
implementation, many paravirtualized platforms provide their own.

KVM uses kvm_sched_clock_read(), and there is no way to disable only KVM's
sched_clock: the "no-kvmclock" parameter disables all paravirtualized
kvmclock features. The PV sched_clock is installed by kvm_sched_clock_init()
in arch/x86/kernel/kvmclock.c:

```c
static inline void kvm_sched_clock_init(bool stable)
{
	if (!stable)
		clear_sched_clock_stable();
	kvm_sched_clock_offset = kvm_clock_read();
	paravirt_set_sched_clock(kvm_sched_clock_read);

	pr_info("kvm-clock: using sched offset of %llu cycles",
		kvm_sched_clock_offset);

	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}
```

There is a known issue that kvmclock may drift during vCPU hotplug [1].
Although a temporary fix is available [2], we may still need a way to disable
the PV sched_clock. Nowadays, the TSC is more stable and has lower
performance overhead than kvmclock.

This RFC proposes a global parameter to disable the PV sched_clock for all
paravirtualized platforms.

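To make the proposed behavior concrete, here is a minimal userspace model in C of how a boot-time flag could gate the installation of a PV sched_clock, following the ordering used in the RFC's kvmclock.c hunk (sched_clock stability is only cleared when the KVM clock is actually installed). It is a sketch under stated assumptions, not kernel code: a plain function pointer stands in for the pv_sched_clock static call, a simple command-line scan stands in for early_param(), and every model_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Userspace model only: none of these are the kernel's real definitions. */
typedef uint64_t (*sched_clock_fn)(void);

/* Hypothetical stand-ins that return a tag (1 = native, 2 = KVM), not a time. */
static uint64_t model_native_sched_clock(void) { return 1; }
static uint64_t model_kvm_sched_clock_read(void) { return 2; }

/* Models the pv_sched_clock static call, which defaults to the native clock. */
static sched_clock_fn model_pv_sched_clock = model_native_sched_clock;
static bool model_no_pv_sched_clock;
static bool model_sched_clock_stable = true;

/* Models early_param("no_pv_sched_clock", ...): the bare token sets the flag. */
static void model_parse_cmdline(const char *cmdline)
{
	char buf[256];

	snprintf(buf, sizeof(buf), "%s", cmdline);
	for (char *tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " "))
		if (strcmp(tok, "no_pv_sched_clock") == 0)
			model_no_pv_sched_clock = true;
}

/* Same contract as the RFC: refuse with -EPERM when the flag is set. */
static int model_paravirt_set_sched_clock(sched_clock_fn func)
{
	if (model_no_pv_sched_clock) {
		printf("sched_clock: not configurable\n");
		return -EPERM;
	}
	model_pv_sched_clock = func;
	return 0;
}

/* Patched ordering: stability is only cleared if the KVM clock was installed. */
static void model_kvm_sched_clock_init(bool stable)
{
	if (!model_paravirt_set_sched_clock(model_kvm_sched_clock_read)) {
		if (!stable)
			model_sched_clock_stable = false;
	}
}

/* Resets the model, then "boots" it with the given command line. */
static void model_boot(const char *cmdline, bool stable)
{
	model_pv_sched_clock = model_native_sched_clock;
	model_no_pv_sched_clock = false;
	model_sched_clock_stable = true;
	model_parse_cmdline(cmdline);
	model_kvm_sched_clock_init(stable);
}
```

Booting the model with "no_pv_sched_clock" on its command line leaves the function pointer on the native stand-in and leaves the stability flag untouched; without the token, the KVM stand-in is installed and an unstable clock clears the flag.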
Comments are welcome on whether any of the following options would be better:

1. A global parameter (this RFC patch).

2. A kvmclock-specific parameter (similar to "no-vmw-sched-clock" on VMware).
   Personally, I prefer this option.

3. Enforce the native sched_clock only when the TSC is invariant (the Hyper-V
   approach).

4. Remove and clean up the PV sched_clock, and always use pv_sched_clock() for
   all (suggested by Peter Zijlstra in [3]). However, some paravirtualized
   platforms may want to keep the PV sched_clock.

Introducing a parameter may also make backporting to older kernel versions
easier.

References:
[1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com/
[2] https://lore.kernel.org/all/20231018195638.1898375-1-seanjc@google.com/
[3] https://lore.kernel.org/all/20231002211651.GA3774@noisy.programming.kicks-ass.net/

Thank you very much for your suggestions!

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
arch/x86/include/asm/paravirt.h | 2 +-
arch/x86/kernel/kvmclock.c | 12 +++++++-----
arch/x86/kernel/paravirt.c | 18 +++++++++++++++++-
3 files changed, 25 insertions(+), 7 deletions(-)
Comments
Dongli Zhang <dongli.zhang@oracle.com> writes:

> As mentioned in the linux kernel development document, "sched_clock() is
> used for scheduling and timestamping". While there is a default native
> implementation, many paravirtualizations have their own implementations.
>
> About KVM, it uses kvm_sched_clock_read() and there is no way to only
> disable KVM's sched_clock. The "no-kvmclock" may disable all
> paravirtualized kvmclock features.
>
> 94 static inline void kvm_sched_clock_init(bool stable)
> 95 {
> 96 	if (!stable)
> 97 		clear_sched_clock_stable();
> 98 	kvm_sched_clock_offset = kvm_clock_read();
> 99 	paravirt_set_sched_clock(kvm_sched_clock_read);
> 100
> 101 	pr_info("kvm-clock: using sched offset of %llu cycles",
> 102 		kvm_sched_clock_offset);
> 103
> 104 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> 105 		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> 106 }
>
> There is known issue that kvmclock may drift during vCPU hotplug [1].
> Although a temporary fix is available [2], we may need a way to disable pv
> sched_clock. Nowadays, the TSC is more stable and has less performance
> overhead than kvmclock.
>
> This is to propose to introduce a global param to disable pv sched_clock
> for all paravirtualizations.
>
> Please suggest and comment if other options are better:
>
> 1. Global param (this RFC patch).
>
> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>
> Indeed I like the 2nd method.
>
> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>
> 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> want to keep the pv sched_clock.

Normally, it should be up to the hypervisor to tell the guest which
clock to use, i.e. if TSC is reliable or not. Let me put my question
this way: if TSC on the particular host is good for everything, why
does the hypervisor advertise 'kvmclock' to its guests?

If for some 'historical reasons' we can't revoke features we can always
introduce a new PV feature bit saying that TSC is preferred.

1) Global param doesn't sound like a good idea to me: chances are that
people will be setting it on their guest images to work around problems
on one hypervisor (or, rather, on one public cloud which is too lazy to
fix their hypervisor) while simultaneously creating problems on another.

2) KVM specific parameter can work, but as KVM's sched_clock is the same
as kvmclock, I'm not convinced it actually makes sense to separate the
two. Like if sched_clock is known to be bad but TSC is good, why do we
need to use PV clock at all? Having a parameter for debugging purposes
may be OK though...

3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
(HV_ACCESS_TSC_INVARIANT) and not the architectural
CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the latter on
all hypervisors.

4) Personally, I'm not sure that relying on 'TSC is crap' detection is
100% reliable. I can imagine cases when we can't detect the fact that
while synchronized across CPUs and not going backwards, it is, for
example, ticking with an unstable frequency and PV sched clock is
supposed to give the right correction (all of them are rdtsc() based
anyways, aren't they?).

>
> To introduce a param may be easier to backport to old kernel version.
>
> References:
> [1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com/
> [2] https://lore.kernel.org/all/20231018195638.1898375-1-seanjc@google.com/
> [3] https://lore.kernel.org/all/20231002211651.GA3774@noisy.programming.kicks-ass.net/
>
> Thank you very much for the suggestion!
>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  arch/x86/include/asm/paravirt.h |  2 +-
>  arch/x86/kernel/kvmclock.c      | 12 +++++++-----
>  arch/x86/kernel/paravirt.c      | 18 +++++++++++++++++-
>  3 files changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 6c8ff12140ae..f36edf608b6b 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
>  DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
>  DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
>
> -void paravirt_set_sched_clock(u64 (*func)(void));
> +int paravirt_set_sched_clock(u64 (*func)(void));
>
>  static __always_inline u64 paravirt_sched_clock(void)
>  {
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..0b8bf5677d44 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
>
>  static inline void kvm_sched_clock_init(bool stable)
>  {
> -	if (!stable)
> -		clear_sched_clock_stable();
>  	kvm_sched_clock_offset = kvm_clock_read();
> -	paravirt_set_sched_clock(kvm_sched_clock_read);
>
> -	pr_info("kvm-clock: using sched offset of %llu cycles",
> -		kvm_sched_clock_offset);
> +	if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
> +		if (!stable)
> +			clear_sched_clock_stable();
> +
> +		pr_info("kvm-clock: using sched offset of %llu cycles",
> +			kvm_sched_clock_offset);
> +	}
>
>  	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
>  		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 97f1436c1a20..2cfef94317b0 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -118,9 +118,25 @@ static u64 native_steal_clock(int cpu)
>  DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
>  DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>
> -void paravirt_set_sched_clock(u64 (*func)(void))
> +static bool no_pv_sched_clock;
> +
> +static int __init parse_no_pv_sched_clock(char *arg)
> +{
> +	no_pv_sched_clock = true;
> +	return 0;
> +}
> +early_param("no_pv_sched_clock", parse_no_pv_sched_clock);
> +
> +int paravirt_set_sched_clock(u64 (*func)(void))
>  {
> +	if (no_pv_sched_clock) {
> +		pr_info("sched_clock: not configurable\n");
> +		return -EPERM;
> +	}
> +
>  	static_call_update(pv_sched_clock, func);
> +
> +	return 0;
>  }
>
>  /* These are in entry.S */
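Vitaly's third point contrasts Hyper-V's dedicated HV_ACCESS_TSC_INVARIANT bit with the architectural invariant-TSC bit, CPUID.80000007H:EDX[8]. The sketch below, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid()`, shows how a userspace program can read that architectural bit. Inside a guest the result is whatever the hypervisor chooses to expose, which is exactly why he doubts it can be trusted blindly. The function name is hypothetical, and this is not the kernel's own detection code.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper: returns 1 if CPUID.80000007H:EDX[8] (invariant TSC)
 * is set, 0 if it is clear, and -1 if the leaf is unavailable or the build
 * is not for x86.
 */
static int model_cpu_reports_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 if the leaf exceeds the highest supported one. */
	if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
		return -1;
	return (int)((edx >> 8) & 1u);
#else
	return -1;
#endif
}
```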
On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
> Dongli Zhang <dongli.zhang@oracle.com> writes:
>
> > As mentioned in the linux kernel development document, "sched_clock() is
> > used for scheduling and timestamping". While there is a default native
> > implementation, many paravirtualizations have their own implementations.
> >
> > About KVM, it uses kvm_sched_clock_read() and there is no way to only
> > disable KVM's sched_clock. The "no-kvmclock" may disable all
> > paravirtualized kvmclock features.

...

> > Please suggest and comment if other options are better:
> >
> > 1. Global param (this RFC patch).
> >
> > 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
> >
> > Indeed I like the 2nd method.
> >
> > 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
> >
> > 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
> > all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> > want to keep the pv sched_clock.
>
> Normally, it should be up to the hypervisor to tell the guest which
> clock to use, i.e. if TSC is reliable or not. Let me put my question
> this way: if TSC on the particular host is good for everything, why
> does the hypervisor advertises 'kvmclock' to its guests?

I suspect there are two reasons.

 1. As is likely the case in our fleet, no one revisited the set of advertised
    PV features when defining the VM shapes for a new generation of hardware, or
    whoever did the reviews wasn't aware that advertising kvmclock is actually
    suboptimal. All the PV clock stuff in KVM is quite labyrinthian, so it's
    not hard to imagine it getting overlooked.

 2. Legacy VMs. If VMs have been running with a PV clock for years, forcing
    them to switch to a new clocksource is high-risk, low-reward.

> If for some 'historical reasons' we can't revoke features we can always
> introduce a new PV feature bit saying that TSC is preferred.
>
> 1) Global param doesn't sound like a good idea to me: chances are that
> people will be setting it on their guest images to workaround problems
> on one hypervisor (or, rather, on one public cloud which is too lazy to
> fix their hypervisor) while simultaneously creating problems on another.
>
> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
> as kvmclock, I'm not convinced it actually makes sense to separate the
> two. Like if sched_clock is known to be bad but TSC is good, why do we
> need to use PV clock at all? Having a parameter for debugging purposes
> may be OK though...
>
> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
> (HV_ACCESS_TSC_INVARIANT) and not the architectural
> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the later on
> all hypervisors.
>
> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
> 100% reliable. I can imagine cases when we can't detect that fact that
> while synchronized across CPUs and not going backwards, it is, for
> example, ticking with an unstable frequency and PV sched clock is
> supposed to give the right correction (all of them are rdtsc() based
> anyways, aren't they?).

Yeah, practically speaking, the only thing adding a knob to turn off using PV
clocks for sched_clock will accomplish is creating an even bigger matrix of
combinations that can cause problems, e.g. where guests end up using kvmclock
timekeeping but not scheduling.

The explanation above and the links below fail to capture _the_ key point:
Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the
clocksource when the TSC is constant and nonstop (first spliced blob below).

What I suggested is that if the TSC is chosen over a PV clock as the
clocksource, then we have the kernel also override the sched_clock selection
(second spliced blob below).

That doesn't require the guest admin to opt in, and doesn't create even more
combinations to support. It also provides for a smoother transition for when
customers inevitably end up creating VMs on hosts that don't advertise kvmclock
(or any PV clock).

> > To introduce a param may be easier to backport to old kernel version.
> >
> > References:
> > [1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com/
> > [2] https://lore.kernel.org/all/20231018195638.1898375-1-seanjc@google.com/
> > [3] https://lore.kernel.org/all/20231002211651.GA3774@noisy.programming.kicks-ass.net/

On Mon, Oct 2, 2023 at 11:18 AM Sean Christopherson <seanjc@google.com> wrote:
> > Do we need to update the documentation to always suggest TSC when it is
> > constant, as I believe many users still prefer pv clock than tsc?
> >
> > Thanks to tsc ratio scaling, the live migration will not impact tsc.
> >
> > From the source code, the rating of kvm-clock is still higher than tsc.
> >
> > BTW., how about to decrease the rating if guest detects constant tsc?
> >
> > 166 struct clocksource kvm_clock = {
> > 167 	.name	= "kvm-clock",
> > 168 	.read	= kvm_clock_get_cycles,
> > 169 	.rating	= 400,
> > 170 	.mask	= CLOCKSOURCE_MASK(64),
> > 171 	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
> > 172 	.enable	= kvm_cs_enable,
> > 173 };
> >
> > 1196 static struct clocksource clocksource_tsc = {
> > 1197 	.name	= "tsc",
> > 1198 	.rating	= 300,
> > 1199 	.read	= read_tsc,
>
> That's already done in kvmclock_init().
>
> 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> 	    !check_tsc_unstable())
> 		kvm_clock.rating = 299;
>
> See also: https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@google.com
>
> > 2. The sched_clock.
> >
> > The scheduling is impacted if there is big drift.
>
> ...
>
> > Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
> > operations (not only sched_clock), e.g., after line 300.
>
> ...
>
> > Should I introduce a new param to disable no-kvm-sched-clock only, or to
> > introduce a new param to allow the selection of sched_clock?
>
> I don't think we want a KVM-specific knob, because every flavor of paravirt guest
> would need to do the same thing. And unless there's a good reason to use a
> paravirt clock, this really shouldn't be something the guest admin needs to opt
> into using.

On Mon, Oct 2, 2023 at 2:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
> > Assuming the desirable thing to do is to use native_sched_clock() in this
> > scenario, do we need a separate rating system, or can we simply tie the
> > sched clock selection to the clocksource selection, e.g. override the
> > paravirt stuff if the TSC clock has higher priority and is chosen?
>
> Yeah, I see no point of another rating system. Just force the thing back
> to native (or don't set it to that other thing).
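The selection logic Sean and Peter describe can be modeled in a few lines of userspace C. This sketch assumes the ratings quoted above (kvm-clock 400, tsc 300, and kvm-clock demoted to 299 by kvmclock_init() when the TSC is constant, nonstop, and not marked unstable), simplifies clocksource selection to "highest rating wins", and adds the suggested rule that sched_clock follows the clocksource choice. The struct and function names are hypothetical; the kernel has no such helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace model of the discussion above; not kernel code. */
struct model_clocksource {
	const char *name;
	int rating;
};

struct model_cpu {
	bool constant_tsc; /* stands in for X86_FEATURE_CONSTANT_TSC */
	bool nonstop_tsc;  /* stands in for X86_FEATURE_NONSTOP_TSC */
	bool tsc_unstable; /* stands in for check_tsc_unstable() */
};

/* Mirrors the quoted kvmclock_init() check: demote kvm-clock below tsc (300). */
static int model_kvm_clock_rating(const struct model_cpu *cpu)
{
	if (cpu->constant_tsc && cpu->nonstop_tsc && !cpu->tsc_unstable)
		return 299;
	return 400;
}

/* Simplified selection: the highest-rated clocksource wins. */
static const char *model_select_clocksource(const struct model_cpu *cpu)
{
	const struct model_clocksource cs[] = {
		{ "kvm-clock", model_kvm_clock_rating(cpu) },
		{ "tsc", 300 },
	};
	size_t best = 0;

	for (size_t i = 1; i < sizeof(cs) / sizeof(cs[0]); i++)
		if (cs[i].rating > cs[best].rating)
			best = i;
	return cs[best].name; /* points to a string literal, so safe to return */
}

/* The suggested rule: if tsc is the clocksource, use the native sched_clock. */
static const char *model_select_sched_clock(const struct model_cpu *cpu)
{
	return strcmp(model_select_clocksource(cpu), "tsc") == 0 ? "native" : "kvm-clock";
}
```

With a constant, nonstop TSC the model picks tsc and therefore the native sched_clock without any guest-side opt-in, which is the point of the suggestion; on hardware lacking those features both stay on kvm-clock, so the two choices never diverge.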
On Thu, 2023-10-19 at 08:40 -0700, Sean Christopherson wrote:
>
> > Normally, it should be up to the hypervisor to tell the guest which
> > clock to use, i.e. if TSC is reliable or not. Let me put my question
> > this way: if TSC on the particular host is good for everything, why
> > does the hypervisor advertises 'kvmclock' to its guests?
>
> I suspect there are two reasons.
>
>  1. As is likely the case in our fleet, no one revisited the set of advertised
>     PV features when defining the VM shapes for a new generation of hardware, or
>     whoever did the reviews wasn't aware that advertising kvmclock is actually
>     suboptimal. All the PV clock stuff in KVM is quite labyrinthian, so it's
>     not hard to imagine it getting overlooked.
>
>  2. Legacy VMs. If VMs have been running with a PV clock for years, forcing
>     them to switch to a new clocksource is high-risk, low-reward.

Doubly true for Xen guests (given that the Xen clocksource is identical
to the KVM clocksource).

> > If for some 'historical reasons' we can't revoke features we can always
> > introduce a new PV feature bit saying that TSC is preferred.

Don't we already have one? It's the PVCLOCK_TSC_STABLE_BIT. Why would a
guest ever use kvmclock if the PVCLOCK_TSC_STABLE_BIT is set?

The *point* in the kvmclock is that the hypervisor can mess with the
epoch/scaling to try to compensate for TSC brokenness as the host
scales/sleeps/etc.

And the *problem* with the kvmclock is that it does just that, even
when the host TSC hasn't done anything wrong and the kvmclock shouldn't
have changed at all.

If the PVCLOCK_TSC_STABLE_BIT is set, a guest should just use the guest
TSC directly without looking to the kvmclock for adjusting it.

No?
On Thu, Oct 19, 2023, David Woodhouse wrote:
> On Thu, 2023-10-19 at 08:40 -0700, Sean Christopherson wrote:
> > > If for some 'historical reasons' we can't revoke features we can always
> > > introduce a new PV feature bit saying that TSC is preferred.
>
> Don't we already have one? It's the PVCLOCK_TSC_STABLE_BIT. Why would a
> guest ever use kvmclock if the PVCLOCK_TSC_STABLE_BIT is set?
>
> The *point* in the kvmclock is that the hypervisor can mess with the
> epoch/scaling to try to compensate for TSC brokenness as the host
> scales/sleeps/etc.
>
> And the *problem* with the kvmclock is that it does just that, even
> when the host TSC hasn't done anything wrong and the kvmclock shouldn't
> have changed at all.
>
> If the PVCLOCK_TSC_STABLE_BIT is set, a guest should just use the guest
> TSC directly without looking to the kvmclock for adjusting it.
>
> No?

No :-)

PVCLOCK_TSC_STABLE_BIT doesn't provide the guarantees that are needed to use the
raw TSC directly. It's close, but there is at least one situation where using
the TSC directly, even when the TSC is stable, is a bad idea: when hardware
doesn't support TSC scaling and the guest virtual TSC is running at a higher
frequency than the hardware TSC. The guest doesn't have to worry about the TSC
going backwards, but using the TSC directly would cause the guest's time
calculations to be inaccurate.

And PVCLOCK_TSC_STABLE_BIT is also much more dynamic, as it's tied to a given
generation/sequence. E.g. if KVM stops using its masterclock for whatever reason,
then kvm_guest_time_update() will effectively clear PVCLOCK_TSC_STABLE_BIT and
the guest-side __pvclock_clocksource_read() will be forced to do a bit of extra
work to ensure the clock is monotonically increasing.
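The "bit of extra work" Sean refers to is a global clamp that keeps kvmclock reads monotonic across vCPUs when PVCLOCK_TSC_STABLE_BIT is not set. Below is a C11 sketch of that pattern, loosely following __pvclock_clocksource_read(), which uses the kernel's atomic64 helpers rather than <stdatomic.h>; pvclock_monotonic(), the selftest and the stand-in last_value variable are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the global last_value used by the guest pvclock code. */
static _Atomic uint64_t last_value;

/*
 * Return a time value that never goes backwards across readers.
 * 'stable' models PVCLOCK_TSC_STABLE_BIT being set both in the flags the
 * hypervisor published and in the flags the guest considers valid.
 */
static uint64_t pvclock_monotonic(uint64_t ret, bool stable)
{
	if (stable)
		return ret; /* fast path: no cross-vCPU fixup needed */

	uint64_t last = atomic_load(&last_value);
	do {
		if (ret <= last)
			return last; /* another reader already returned a later time */
		/* On failure, 'last' is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(&last_value, &last, ret));

	return ret;
}

/* Single-threaded demonstration of the clamping behaviour. */
static void pvclock_monotonic_selftest(void)
{
	atomic_store(&last_value, 0);
	assert(pvclock_monotonic(100, false) == 100);
	assert(pvclock_monotonic(90, false) == 100); /* clamped: no backwards step */
	assert(pvclock_monotonic(150, false) == 150);
	assert(pvclock_monotonic(90, true) == 90);   /* stable: returned as-is */
}
```

In the unstable case every read touches the same shared variable, which can become a point of contention with many vCPUs; that is why the stable fast path matters.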
Hi Vitaly, Sean and David,

On 10/19/23 08:40, Sean Christopherson wrote:
> On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
>> Dongli Zhang <dongli.zhang@oracle.com> writes:
>>
>>> As mentioned in the linux kernel development document, "sched_clock() is
>>> used for scheduling and timestamping". While there is a default native
>>> implementation, many paravirtualizations have their own implementations.
>>>
>>> About KVM, it uses kvm_sched_clock_read() and there is no way to only
>>> disable KVM's sched_clock. The "no-kvmclock" may disable all
>>> paravirtualized kvmclock features.
>
> ...
>
>>> Please suggest and comment if other options are better:
>>>
>>> 1. Global param (this RFC patch).
>>>
>>> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>>>
>>> Indeed I like the 2nd method.
>>>
>>> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>>>
>>> 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
>>> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
>>> want to keep the pv sched_clock.
>>
>> Normally, it should be up to the hypervisor to tell the guest which
>> clock to use, i.e. if TSC is reliable or not. Let me put my question
>> this way: if TSC on the particular host is good for everything, why
>> does the hypervisor advertises 'kvmclock' to its guests?
>
> I suspect there are two reasons.
>
>   1. As is likely the case in our fleet, no one revisited the set of advertised
>      PV features when defining the VM shapes for a new generation of hardware, or
>      whoever did the reviews wasn't aware that advertising kvmclock is actually
>      suboptimal. All the PV clock stuff in KVM is quite labyrinthian, so it's
>      not hard to imagine it getting overlooked.
>
>   2. Legacy VMs. If VMs have been running with a PV clock for years, forcing
>      them to switch to a new clocksource is high-risk, low-reward.
>
>> If for some 'historical reasons' we can't revoke features we can always
>> introduce a new PV feature bit saying that TSC is preferred.
>>
>> 1) Global param doesn't sound like a good idea to me: chances are that
>> people will be setting it on their guest images to workaround problems
>> on one hypervisor (or, rather, on one public cloud which is too lazy to
>> fix their hypervisor) while simultaneously creating problems on another.
>>
>> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
>> as kvmclock, I'm not convinced it actually makes sense to separate the
>> two. Like if sched_clock is known to be bad but TSC is good, why do we
>> need to use PV clock at all? Having a parameter for debugging purposes
>> may be OK though...
>>
>> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
>> (HV_ACCESS_TSC_INVARIANT) and not the architectural
>> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the later on
>> all hypervisors.
>>
>> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
>> 100% reliable. I can imagine cases when we can't detect that fact that
>> while synchronized across CPUs and not going backwards, it is, for
>> example, ticking with an unstable frequency and PV sched clock is
>> supposed to give the right correction (all of them are rdtsc() based
>> anyways, aren't they?).
>
> Yeah, practically speaking, the only thing adding a knob to turn off using PV
> clocks for sched_clock will accomplish is creating an even bigger matrix of
> combinations that can cause problems, e.g. where guests end up using kvmclock
> timekeeping but not scheduling.
>
> The explanation above and the links below fail to capture _the_ key point:
> Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the clocksource
> when the TSC is constant and nonstop (first spliced blob below).
>
> What I suggested is that if the TSC is chosen over a PV clock as the clocksource,
> then we have the kernel also override the sched_clock selection (second spliced
> blob below).
>
> That doesn't require the guest admin to opt-in, and doesn't create even more
> combinations to support. It also provides for a smoother transition for when
> customers inevitably end up creating VMs on hosts that don't advertise kvmclock
> (or any PV clock).

I would prefer to always leave the guest admin an option to change the
decision, especially for diagnostic/workaround reasons (although the kvmclock
is always buggy when the TSC is buggy).

To summarize the discussion:

1. Vitaly Kuznetsov prefers a global param, e.g., for easy deployment of the
same guest image on different hypervisors.

2. Sean Christopherson prefers automatically switching sched_clock depending on
whether or not the clocksource is TSC.

However, the clocksource and sched_clock are different concepts.

1. The clocksource is an arch-global concept. That is, all archs (e.g., x86,
arm, mips) share the same implementation to register/select a clocksource. In
addition, something like HPET does not have a sched_clock.

2. Some architectures have their own sched_clock implementation. E.g., x86 has
its own sched_clock implementation in arch/x86/kernel/tsc.c:

notrace u64 sched_clock(void)
{
	u64 now;
	preempt_disable_notrace();
	now = sched_clock_noinstr();
	preempt_enable_notrace();
	return now;
}

3. When !CONFIG_PARAVIRT, it is native_sched_clock().

4. When CONFIG_PARAVIRT, it is sched_clock_noinstr() -> paravirt_sched_clock(),
which dispatches to the paravirt-specific implementation
(native/kvm/xen/vmware/hyperv).

That is, the pv sched_clock is an x86 concept that only exists when
CONFIG_PARAVIRT=y.

Although that implementation is possible, I do not like the idea of changing
arch-global code to accommodate a requirement from a leaf of the tree.

How about keeping the change within x86, as below?
It won't work unless I change 'tsc_clocksource_reliable' to an early_param.

---
 arch/x86/include/asm/paravirt.h |  2 +-
 arch/x86/kernel/kvmclock.c      | 12 +++++++-----
 arch/x86/kernel/paravirt.c      | 16 +++++++++++++++-
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12..118b793 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -24,7 +24,7 @@ DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
 DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void));
+int paravirt_set_sched_clock(u64 (*func)(void));
 
 static __always_inline u64 paravirt_sched_clock(void)
 {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f5214..0b8bf56 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
 
 static inline void kvm_sched_clock_init(bool stable)
 {
-	if (!stable)
-		clear_sched_clock_stable();
 	kvm_sched_clock_offset = kvm_clock_read();
-	paravirt_set_sched_clock(kvm_sched_clock_read);
 
-	pr_info("kvm-clock: using sched offset of %llu cycles",
-		kvm_sched_clock_offset);
+	if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
+		if (!stable)
+			clear_sched_clock_stable();
+
+		pr_info("kvm-clock: using sched offset of %llu cycles",
+			kvm_sched_clock_offset);
+	}
 
 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
 		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 97f1436..f8ad521 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -118,9 +118,23 @@ static u64 native_steal_clock(int cpu)
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void))
+int paravirt_set_sched_clock(u64 (*func)(void))
 {
+	if (tsc_clocksource_reliable)
+		goto refuse;
+
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+	    !check_tsc_unstable())
+		goto refuse;
+
 	static_call_update(pv_sched_clock, func);
+
+	return 0;
+
+refuse:
+	pr_info("sched_clock: use native when TSC is reliable\n");
+	return -EPERM;
 }
 
 /* These are in entry.S */

My actual preference is to keep the change within kvmclock (this won't work
until I turn 'tsc_clocksource_reliable' into an early_param):

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f5214..f16655d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -286,6 +286,7 @@ static int kvmclock_setup_percpu(unsigned int cpu)
 
 void __init kvmclock_init(void)
 {
+	bool prefer_tsc = false;
 	u8 flags;
 
 	if (!kvm_para_available() || !kvmclock)
@@ -313,19 +314,8 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-
-	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
-	x86_platform.get_wallclock = kvm_get_wallclock;
-	x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_X86_LOCAL_APIC
-	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
-#endif
-	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
-	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
-	kvm_get_preset_lpj();
+	if (tsc_clocksource_reliable)
+		prefer_tsc = true;
 
 	/*
 	 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
@@ -334,10 +324,31 @@ void __init kvmclock_init(void)
 	 * Invariant TSC exposed by host means kvmclock is not necessary:
 	 * can use TSC as clocksource.
 	 *
+	 * The TSC is used also when tsc_clocksource_reliable is configured
+	 * in kernel command line on purpose.
 	 */
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    !check_tsc_unstable())
+		prefer_tsc = true;
+
+	if (!prefer_tsc) {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
+
+	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+	x86_platform.get_wallclock = kvm_get_wallclock;
+	x86_platform.set_wallclock = kvm_set_wallclock;
+#ifdef CONFIG_X86_LOCAL_APIC
+	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
+#endif
+	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
+	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
+	kvm_get_preset_lpj();
+
+	if (prefer_tsc)
 		kvm_clock.rating = 299;
 
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);

Thank you very much!

Dongli Zhang

>
>>> To introduce a param may be easier to backport to old kernel version.
>>>
>>> References:
>>> [1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com/
>>> [2] https://lore.kernel.org/all/20231018195638.1898375-1-seanjc@google.com/
>>> [3] https://lore.kernel.org/all/20231002211651.GA3774@noisy.programming.kicks-ass.net/
>
> On Mon, Oct 2, 2023 at 11:18 AM Sean Christopherson <seanjc@google.com> wrote:
>>> Do we need to update the documentation to always suggest TSC when it is
>>> constant, as I believe many users still prefer pv clock than tsc?
>>>
>>> Thanks to tsc ratio scaling, the live migration will not impact tsc.
>>>
>>> From the source code, the rating of kvm-clock is still higher than tsc.
>>>
>>> BTW., how about to decrease the rating if guest detects constant tsc?
>>>
>>> struct clocksource kvm_clock = {
>>> 	.name	= "kvm-clock",
>>> 	.read	= kvm_clock_get_cycles,
>>> 	.rating	= 400,
>>> 	.mask	= CLOCKSOURCE_MASK(64),
>>> 	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
>>> 	.enable	= kvm_cs_enable,
>>> };
>>>
>>> static struct clocksource clocksource_tsc = {
>>> 	.name	= "tsc",
>>> 	.rating	= 300,
>>> 	.read	= read_tsc,
>>
>> That's already done in kvmclock_init().
>>
>>	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>>	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>>	    !check_tsc_unstable())
>>		kvm_clock.rating = 299;
>>
>> See also: https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@google.com
>>
>>> 2. The sched_clock.
>>>
>>> The scheduling is impacted if there is big drift.
>>
>> ...
>>
>>> Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
>>> operations (not only sched_clock), e.g., after line 300.
>>
>> ...
>>
>>> Should I introduce a new param to disable no-kvm-sched-clock only, or to
>>> introduce a new param to allow the selection of sched_clock?
>>
>> I don't think we want a KVM-specific knob, because every flavor of paravirt guest
>> would need to do the same thing. And unless there's a good reason to use a
>> paravirt clock, this really shouldn't be something the guest admin needs to opt
>> into using.
>
>
> On Mon, Oct 2, 2023 at 2:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
>>> Assuming the desirable thing to do is to use native_sched_clock() in this
>>> scenario, do we need a separate rating system, or can we simply tie the
>>> sched clock selection to the clocksource selection, e.g. override the
>>> paravirt stuff if the TSC clock has higher priority and is chosen?
>>
>> Yeah, I see no point of another rating system. Just force the thing back
>> to native (or don't set it to that other thing).
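The interaction between the quoted ratings (kvm-clock at 400 or 299, tsc at 300) can be modeled in a few lines. This sketch assumes a simplified clocksource core that just picks the highest rating; the real clocksource_select() also honours user overrides and other constraints. struct cs_sketch, struct tsc_props and the helper functions are hypothetical, and prefer_tsc() follows the condition in the kvmclock_init() draft above, including its tsc_clocksource_reliable check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal stand-in for struct clocksource. */
struct cs_sketch {
	const char *name;
	int rating;
};

/* TSC properties the guest checks; field names mirror the kernel's checks. */
struct tsc_props {
	bool constant_tsc; /* boot_cpu_has(X86_FEATURE_CONSTANT_TSC) */
	bool nonstop_tsc;  /* boot_cpu_has(X86_FEATURE_NONSTOP_TSC) */
	bool tsc_unstable; /* check_tsc_unstable() */
	bool tsc_reliable; /* tsc_clocksource_reliable, as in the draft */
};

/* Mirrors the prefer_tsc decision of the kvmclock_init() draft. */
static bool prefer_tsc(const struct tsc_props *t)
{
	if (t->tsc_reliable)
		return true;
	return t->constant_tsc && t->nonstop_tsc && !t->tsc_unstable;
}

/* kvm-clock normally rates 400; 299 (just below tsc's 300) if TSC is preferred. */
static int kvm_clock_rating(const struct tsc_props *t)
{
	return prefer_tsc(t) ? 299 : 400;
}

/* Simplified selection: the highest-rated registered clocksource wins. */
static const struct cs_sketch *select_best(const struct cs_sketch *cs, size_t n)
{
	assert(n > 0);
	const struct cs_sketch *best = &cs[0];
	for (size_t i = 1; i < n; i++)
		if (cs[i].rating > best->rating)
			best = &cs[i];
	return best;
}
```

Because 299 sits just below the TSC's 300, the demotion only decides which clocksource wins; as the thread points out, it does not by itself change which sched_clock implementation is installed.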
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12140ae..f36edf608b6b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
 DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
 DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void));
+int paravirt_set_sched_clock(u64 (*func)(void));
 
 static __always_inline u64 paravirt_sched_clock(void)
 {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..0b8bf5677d44 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
 
 static inline void kvm_sched_clock_init(bool stable)
 {
-	if (!stable)
-		clear_sched_clock_stable();
 	kvm_sched_clock_offset = kvm_clock_read();
-	paravirt_set_sched_clock(kvm_sched_clock_read);
 
-	pr_info("kvm-clock: using sched offset of %llu cycles",
-		kvm_sched_clock_offset);
+	if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
+		if (!stable)
+			clear_sched_clock_stable();
+
+		pr_info("kvm-clock: using sched offset of %llu cycles",
+			kvm_sched_clock_offset);
+	}
 
 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
 		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 97f1436c1a20..2cfef94317b0 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -118,9 +118,25 @@ static u64 native_steal_clock(int cpu)
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
 
-void paravirt_set_sched_clock(u64 (*func)(void))
+static bool no_pv_sched_clock;
+
+static int __init parse_no_pv_sched_clock(char *arg)
+{
+	no_pv_sched_clock = true;
+	return 0;
+}
+early_param("no_pv_sched_clock", parse_no_pv_sched_clock);
+
+int paravirt_set_sched_clock(u64 (*func)(void))
 {
+	if (no_pv_sched_clock) {
+		pr_info("sched_clock: not configurable\n");
+		return -EPERM;
+	}
+
 	static_call_update(pv_sched_clock, func);
+
+	return 0;
 }
 
 /* These are in entry.S */
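To follow the patch's control flow without building a kernel, here is a userspace model of it. A plain function pointer stands in for the pv_sched_clock static call, printf() stands in for pr_info(), and a bool stands in for clear_sched_clock_stable(); every *_sketch name and the sched_clock_stable variable are hypothetical, and the early_param parsing is reduced to setting no_pv_sched_clock directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

/* Stand-ins for native_sched_clock() and kvm_sched_clock_read(). */
static u64 native_sched_clock_sketch(void) { return 1; }
static u64 kvm_sched_clock_sketch(void) { return 2; }

/* Function pointer playing the role of the pv_sched_clock static call. */
static u64 (*pv_sched_clock)(void) = native_sched_clock_sketch;

/* In the patch this is set by the "no_pv_sched_clock" early_param. */
static bool no_pv_sched_clock;

/* Models sched_clock's "stable" state that clear_sched_clock_stable() drops. */
static bool sched_clock_stable = true;

static int paravirt_set_sched_clock_sketch(u64 (*func)(void))
{
	assert(func != NULL);
	if (no_pv_sched_clock) {
		printf("sched_clock: not configurable\n");
		return -EPERM;
	}
	pv_sched_clock = func; /* static_call_update() in the kernel */
	return 0;
}

/* Models kvm_sched_clock_init(): side effects only if the switch succeeded. */
static void kvm_sched_clock_init_sketch(bool stable)
{
	if (!paravirt_set_sched_clock_sketch(kvm_sched_clock_sketch)) {
		if (!stable)
			sched_clock_stable = false;
	}
}
```

With no_pv_sched_clock set, the model leaves the native sched_clock installed and also skips clearing the stable flag, which is the effect of moving clear_sched_clock_stable() under the success check in kvm_sched_clock_init().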