Message ID | 20230926230649.67852-1-dongli.zhang@oracle.com |
---|---|
State | New |
Series | [RFC,1/1] KVM: x86: add param to update master clock periodically |
Commit Message
Dongli Zhang
Sept. 26, 2023, 11:06 p.m. UTC
This is to minimize the kvmclock drift during vCPU hotplug (or whenever the
master clock and pvclock_vcpu_time_info are updated). The drift arises
because kvmclock and the raw monotonic clock (tsc) use different
equations (mult/shift pairs) to calculate how many nanoseconds (given the
tsc as input) have passed.
The calculation of kvmclock is based on the pvclock_vcpu_time_info
provided by the host side.
struct pvclock_vcpu_time_info {
        u32   version;
        u32   pad0;
        u64   tsc_timestamp;     --> by host raw monotonic
        u64   system_time;       --> by host raw monotonic
        u32   tsc_to_system_mul; --> by host KVM
        s8    tsc_shift;         --> by host KVM
        u8    flags;
        u8    pad[2];
} __attribute__((__packed__));
To calculate the current guest kvmclock:

1. Obtain the tsc = rdtsc() of the guest.

2. If tsc_shift < 0:
       tmp = tsc >> -tsc_shift
   else:
       tmp = tsc << tsc_shift

3. The kvmclock value will be: (tmp * tsc_to_system_mul) >> 32
Therefore, the current kvmclock will be either:
(rdtsc() >> -tsc_shift) * tsc_to_system_mul >> 32
... or ...
(rdtsc() << tsc_shift) * tsc_to_system_mul >> 32
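For illustration, the read sequence above can be written as a small,
self-contained C sketch. It is a simplified mirror of the kernel's guest-side
pvclock helpers (cf. pvclock_scale_delta() and __pvclock_read_cycles()), not
the kernel source itself, and it folds in the tsc_timestamp/system_time base
that the steps above leave implicit:

#include <stdint.h>

/* Sketch: convert a guest rdtsc() value to kvmclock nanoseconds using the
 * fields of pvclock_vcpu_time_info. Simplified for illustration only. */
static uint64_t kvmclock_ns(uint64_t tsc, uint64_t tsc_timestamp,
                            uint64_t system_time,
                            uint32_t tsc_to_system_mul, int8_t tsc_shift)
{
        uint64_t delta = tsc - tsc_timestamp;   /* cycles since the snapshot */

        if (tsc_shift < 0)
                delta >>= -tsc_shift;
        else
                delta <<= tsc_shift;

        /* (delta * tsc_to_system_mul) >> 32, via a 128-bit intermediate */
        return system_time +
               (uint64_t)(((unsigned __int128)delta * tsc_to_system_mul) >> 32);
}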
The 'tsc_to_system_mul' and 'tsc_shift' are calculated by the host KVM.
When the master clock is actively used, the 'tsc_timestamp' and
'system_time' are derived from the host raw monotonic time, which is
calculated based on the 'mult' and 'shift' of clocksource_tsc:
elapsed_time = (tsc * mult) >> shift
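The host-side conversion is essentially what the kernel's clocksource_cyc2ns()
helper does; a sketch (omitting the handling needed for very large cycle
deltas, which would overflow the 64-bit product):

/* Sketch: clocksource-style cycles-to-nanoseconds conversion. Note this
 * (mult, shift) pair belongs to clocksource_tsc and is distinct from
 * kvmclock's (tsc_to_system_mul, tsc_shift). */
static inline uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
        return ((uint64_t)cycles * mult) >> shift;
}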
Since kvmclock and the raw monotonic clock (clocksource_tsc) use different
equations (mult/shift pairs) to convert the tsc to nanoseconds, there may be a
clock drift issue during vCPU hotplug (i.e., when the master clock is updated).
1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
(suppose the master clock is used).
2. Since the master clock is never updated, the periodic kvmclock_sync_work
does not update the values in 'pvclock_vcpu_time_info'.
3. Suppose a very long period has passed (e.g., 30 days).
4. The user adds another vcpu. Both the master clock and
'pvclock_vcpu_time_info' are updated based on the raw monotonic clock.
(Ideally, we would expect the update to be based on the 'tsc_to_system_mul'
and 'tsc_shift' of kvmclock.)
Because kvmclock and the raw monotonic clock (clocksource_tsc) use different
equations (mult/shift pairs) to convert the tsc to nanoseconds, there will be
drift between:
(1) kvmclock based on current rdtsc and old 'pvclock_vcpu_time_info'
(2) kvmclock based on current rdtsc and new 'pvclock_vcpu_time_info'
In testing, there is a drift of 4502145 ns between (1) and (2) after about
40 hours. The longer the elapsed time, the larger the drift.
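To make the mechanism concrete, here is a toy C program (with made-up scale
factors, not the kernel's values or the measured drift above) showing how two
nearly-equal mult/shift approximations of the same frequency diverge over a
40-hour TSC delta:

#include <stdint.h>
#include <stdio.h>

static uint64_t cyc2ns(uint64_t cycles, uint64_t mult, unsigned int shift)
{
        /* 128-bit intermediate so a multi-day delta does not overflow */
        return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}

int main(void)
{
        /* roughly 40 hours of a 2.4 GHz TSC */
        uint64_t delta = 40ULL * 3600 * 2400000000ULL;

        /* two hypothetical approximations of the same 1/2.4 ns-per-cycle
         * ratio, rounded at different precisions */
        uint64_t a = cyc2ns(delta, 1789569707ULL, 32);
        uint64_t b = cyc2ns(delta, 894784853ULL, 31);

        /* ~80 us of divergence for these illustrative values */
        printf("drift over 40h: %llu ns\n", (unsigned long long)(a - b));
        return 0;
}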
This patch adds a module parameter to let the user configure how often the
master clock is refreshed, in order to reduce the kvmclock drift according to
the user's requirements (e.g., anywhere from every 5 minutes to every day).
The more often the master clock is refreshed, the smaller the kvmclock drift
during vCPU hotplug.
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
Other options are to update the masterclock in:
- kvmclock_sync_work, or
- pvclock_gtod_notify()
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 34 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
Comments
On 9/26/23 4:06 PM, Dongli Zhang wrote:
> This is to minimize the kvmclock drift during CPU hotplug (or when the
> master clock and pvclock_vcpu_time_info are updated).
[... commit message trimmed ...]
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 17715cb8731d..57409dce5d73 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1331,6 +1331,7 @@ struct kvm_arch {
>  	u64 master_cycle_now;
>  	struct delayed_work kvmclock_update_work;
>  	struct delayed_work kvmclock_sync_work;
> +	struct delayed_work masterclock_sync_work;
>  
>  	struct kvm_xen_hvm_config xen_hvm_config;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9f18b06bbda6..0b71dc3785eb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  static bool __read_mostly kvmclock_periodic_sync = true;
>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>  
> +unsigned int __read_mostly masterclock_sync_period;
> +module_param(masterclock_sync_period, uint, 0444);

Can the mode be 0644 and allow it be changed at runtime?

Thanks,
Joe

> +
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> @@ -3298,6 +3301,31 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  					KVMCLOCK_SYNC_PERIOD);
>  }
>  
> +static void masterclock_sync_fn(struct work_struct *work)
> +{
> +	unsigned long i;
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> +					   masterclock_sync_work);
> +	struct kvm *kvm = container_of(ka, struct kvm, arch);
> +	struct kvm_vcpu *vcpu;
> +
> +	if (!masterclock_sync_period)
> +		return;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		/*
> +		 * It is not required to kick the vcpu because it is not
> +		 * expected to update the master clock immediately.
> +		 */
> +		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> +	}
> +
> +	schedule_delayed_work(&ka->masterclock_sync_work,
> +			      masterclock_sync_period * HZ);
> +}
> +
> +
>  /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
>  static bool is_mci_control_msr(u32 msr)
>  {
> @@ -11970,6 +11998,10 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
>  	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
>  		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
>  				      KVMCLOCK_SYNC_PERIOD);
> +
> +	if (masterclock_sync_period && vcpu->vcpu_idx == 0)
> +		schedule_delayed_work(&kvm->arch.masterclock_sync_work,
> +				      masterclock_sync_period * HZ);
>  }
>  
>  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> @@ -12344,6 +12376,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
> +	INIT_DELAYED_WORK(&kvm->arch.masterclock_sync_work, masterclock_sync_fn);
>  
>  	kvm_apicv_init(kvm);
>  	kvm_hv_init_vm(kvm);
> @@ -12383,6 +12416,7 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
>  
>  void kvm_arch_sync_events(struct kvm *kvm)
>  {
> +	cancel_delayed_work_sync(&kvm->arch.masterclock_sync_work);
>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
>  	kvm_free_pit(kvm);
Hi Joe,

On 9/26/23 17:29, Joe Jin wrote:
> On 9/26/23 4:06 PM, Dongli Zhang wrote:
[... quoted commit message and patch trimmed ...]
>> +unsigned int __read_mostly masterclock_sync_period;
>> +module_param(masterclock_sync_period, uint, 0444);
>
> Can the mode be 0644 and allow it be changed at runtime?

It can be RW.

So far I just copied from kvmclock_periodic_sync, as most of the code comes
from the mechanism of kvmclock_periodic_sync:

static bool __read_mostly kvmclock_periodic_sync = true;
module_param(kvmclock_periodic_sync, bool, S_IRUGO);

Thank you very much!

Dongli Zhang
On Tue, Sep 26, 2023, Dongli Zhang wrote:
> On 9/26/23 17:29, Joe Jin wrote:
> > On 9/26/23 4:06 PM, Dongli Zhang wrote:
> >> This is to minimize the kvmclock drift during CPU hotplug (or when the
> >> master clock and pvclock_vcpu_time_info are updated).

Updated by who?

> >> Since kvmclock and raw monotonic (clocksource_tsc) use different
> >> equation/mult/shift to convert the tsc to nanosecond, there may be clock
> >> drift issue during CPU hotplug (when the master clock is updated).

Based on #4, I assume you mean "vCPU hotplug from the host", but from this and
the above it's not clear if this means "vCPU hotplug from the host", "pCPU
hotplug in the host", or "CPU hotplug in the guest".

> >> 1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
> >> (suppose the master clock is used).
> >>
> >> 2. Since the master clock is never updated, the periodic
> >> kvmclock_sync_work does not update the values in 'pvclock_vcpu_time_info'.
> >>
> >> 3. Suppose a very long period has passed (e.g., 30-day).
> >>
> >> 4. The user adds another vcpu. Both master clock and
> >> 'pvclock_vcpu_time_info' are updated, based on the raw monotonic.

So why can't KVM simply force a KVM_REQ_MASTERCLOCK_UPDATE request when a vCPU
is added? I'm missing why this needs a persistent, periodic refresh.

> >> +unsigned int __read_mostly masterclock_sync_period;
> >> +module_param(masterclock_sync_period, uint, 0444);
> >
> > Can the mode be 0644 and allow it be changed at runtime?
>
> It can be RW.
>
> So far I just copy from kvmclock_periodic_sync as most code are from the
> mechanism of kvmclock_periodic_sync.
>
> static bool __read_mostly kvmclock_periodic_sync = true;
> module_param(kvmclock_periodic_sync, bool, S_IRUGO);

Unless there's a very good reason for making it writable, I vote to keep it RO
to simplify the code.

> >> +static void masterclock_sync_fn(struct work_struct *work)
> >> +{
[...]
> >> +	if (!masterclock_sync_period)

This function should never be called if masterclock_sync_period=0. The param
is RO and so kvm_arch_vcpu_postcreate() shouldn't create the work in the first
place.

> >> +		return;
> >> +
> >> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> >> +		/*
> >> +		 * It is not required to kick the vcpu because it is not
> >> +		 * expected to update the master clock immediately.
> >> +		 */

This comment needs to explain *why* it's ok for vCPUs to lazily handle the
masterclock update. Saying "it is not expected" doesn't help understand
who/what expects anything, or why.
Hi Sean,

On 9/28/23 09:18, Sean Christopherson wrote:
> On Tue, Sep 26, 2023, Dongli Zhang wrote:
>> This is to minimize the kvmclock drift during CPU hotplug (or when the
>> master clock and pvclock_vcpu_time_info are updated).
>
> Updated by who?

This is about vCPU hotplug, e.g., when we run "device_add
host-x86_64-cpu,id=core4,socket-id=0,core-id=4,thread-id=0" on the QEMU side.

When we add a new vCPU to KVM, kvm_synchronize_tsc()-->kvm_track_tsc_matching()
triggers KVM_REQ_MASTERCLOCK_UPDATE.

When the new vCPU is onlined for the first time on the VM side, the handler of
KVM_REQ_MASTERCLOCK_UPDATE (that is, kvm_update_masterclock()) updates the
master clock (e.g., master_kernel_ns and master_cycle_now, based on the host
raw monotonic clock).

kvm_update_masterclock() finally triggers KVM_REQ_CLOCK_UPDATE so that
master_kernel_ns and master_cycle_now are propagated to:

- pvclock_vcpu_time_info->system_time
- pvclock_vcpu_time_info->tsc_timestamp

That is:

- vcpu->hv_clock.system_time
- vcpu->hv_clock.tsc_timestamp

> Based on #4, I assume you mean "vCPU hotplug from the host", but from this
> and the above it's not clear if this means "vCPU hotplug from the host",
> "pCPU hotplug in the host", or "CPU hotplug in the guest".

It is about vCPU hotplug from the host (e.g., QEMU), although the
KVM_REQ_MASTERCLOCK_UPDATE handler (kvm_update_masterclock()) is triggered
when the new vCPU is onlined (usually via udev) on the VM side for the first
time.

1. QEMU adds a new vCPU to the KVM VM.
2. The VM side uses udev, or a manual echo to sysfs, to online the vCPU.

> So why can't KVM simply force a KVM_REQ_MASTERCLOCK_UPDATE request when a
> vCPU is added? I'm missing why this needs a persistent, periodic refresh.

Sorry for making the commit message confusing. There is always a
KVM_REQ_MASTERCLOCK_UPDATE request when a vCPU is added.

However, the problem is: only vCPU hotplug triggers
KVM_REQ_MASTERCLOCK_UPDATE (supposing there is no live migration). That is,
generally (e.g., without migration), there will be no master clock update if
we do not do a vCPU hot-add during a long period of time.

We want more frequent KVM_REQ_MASTERCLOCK_UPDATE. This is because:

1. vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.

2. The raw monotonic clock (tsc_clocksource) uses a different
mult/shift/equation.

3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
different results (this is expected, because they have different
mult/shift/equations).

4. However, the base in the kvmclock calculation (tsc_timestamp and
system_time) is derived from the raw monotonic clock (master clock).

When tsc_timestamp and system_time are updated:

  tsc_diff = tsc_timestamp_new - tsc_timestamp_old
  system_time_new = system_time_old + (increment from raw clock source)  <--- (1)

However, kvmclock expects:

  system_time_new = system_time_old + kvmclock_equation(tsc_diff)        <--- (2)

There is a diff between (1) and (2). That is the reason for the kvmclock drift
when we add a new vCPU. Indeed, the drift is between:

(3) the increment from the raw clock source (that is, tsc_equation(tsc_diff)), and
(4) kvmclock_equation(tsc_diff)

The less frequently the master clock is updated, the larger the tsc_diff, and
as a result, the larger the drift. We would like to update the master clock
more frequently to reduce the tsc_diff.

>>>> +unsigned int __read_mostly masterclock_sync_period;
>>>> +module_param(masterclock_sync_period, uint, 0444);
[...]
> Unless there's a very good reason for making it writable, I vote to keep it
> RO to simplify the code.

I will keep it read-only.

Thank you very much!

>>>> +	if (!masterclock_sync_period)
>
> This function should never be called if masterclock_sync_period=0. The param
> is RO and so kvm_arch_vcpu_postcreate() shouldn't create the work in the
> first place.

Thank you very much for pointing that out. I just copied the code from
kvmclock_sync_fn() (although I had noticed that :) ).

I think I may need to send a cleanup patch to remove lines 3296-3297 from the
existing code as well.

> This comment needs to explain *why* it's ok for vCPUs to lazily handle the
> masterclock update. Saying "it is not expected" doesn't help understand
> who/what expects anything, or why.
I will update the comment as:

  The objective of the master clock update is to reduce (but not to avoid)
  the clock drift when there is a long period of time between two master
  clock updates. It does not need to take effect immediately; it is fine
  to wait until the next vCPU entry.

Please let me know if any clarification is needed.

I used the below patch to help diagnose the drift issue.

@@ -3068,6 +3110,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	u64 tsc_timestamp, host_tsc;
 	u8 pvclock_flags;
 	bool use_master_clock;
+	struct pvclock_vcpu_time_info old_hv_clock;
+	u64 tsc_raw, tsc, old_ns, new_ns, diff;
+	bool backward;
+
+	memcpy(&old_hv_clock, &vcpu->hv_clock, sizeof(old_hv_clock));
 
 	kernel_ns = 0;
 	host_tsc = 0;
@@ -3144,6 +3191,25 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	vcpu->hv_clock.flags = pvclock_flags;
 
+	tsc_raw = rdtsc();
+	tsc = kvm_read_l1_tsc(v, tsc_raw);
+	old_ns = __pvclock_read_cycles(&old_hv_clock, tsc);
+	new_ns = __pvclock_read_cycles(&vcpu->hv_clock, tsc);
+	if (old_ns > new_ns) {
+		backward = true;
+		diff = old_ns - new_ns;
+	} else {
+		backward = false;
+		diff = new_ns - old_ns;
+	}
+	pr_alert("backward=%d, diff=%llu, old_ns=%llu, new_ns=%llu\n",
+		 backward, diff, old_ns, new_ns);

Thank you very much!

Dongli Zhang
On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
> We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.
>
> This is because:
>
> 1. vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>
> 2. The raw monotonic clock (tsc_clocksource) uses a different
> mult/shift/equation.
>
> 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
> different results (this is expected, because they have different
> mult/shift/equations).
>
> 4. However, the base in the kvmclock calculation (tsc_timestamp and
> system_time) is derived from the raw monotonic clock (master clock).

That just seems wrong. I don't mean that you're incorrect; it seems *morally*
wrong.

In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use a
*different* mult/shift/equation (your #1) to convert TSC ticks to nanoseconds
than the host CLOCK_MONOTONIC_RAW does (your #2)?

I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's adjusted
by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.

Fix that, and the whole problem goes away, doesn't it?

What am I missing here, that means we can't do that?

Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all? If KVM
wants to decide that the TSC runs at a different frequency to the frequency
that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM just *stick* to
that?
On Mon, Oct 02, 2023, David Woodhouse wrote:
> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
[...]
> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use a
> *different* mult/shift/equation (your #1) to convert TSC ticks to
> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2)?
>
> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>
> Fix that, and the whole problem goes away, doesn't it?
>
> What am I missing here, that means we can't do that?

I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
ABI between KVM and KVM guests.

Like many of the older bits of KVM, my guess is that KVM's behavior is the
product of making things kinda sorta work with old hardware, i.e. was probably
the least awful solution in the days before constant TSCs, but is completely
nonsensical on modern hardware.

> Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all? If
> KVM wants to decide that the TSC runs at a different frequency to the
> frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM just
> *stick* to that?

Yeah, bouncing around guest time when the TSC is constant seems
counterproductive.

However, why does any of this matter if the host has a constant TSC? If that's
the case, a sane setup will expose a constant TSC to the guest and the guest
will use the TSC instead of kvmclock for the guest clocksource.

Dongli, is this for long-lived "legacy" guests that were created on hosts
without a constant TSC? If not, then why is kvmclock being used? Or heaven
forbid, are you running on hardware without a constant TSC? :-)

Not saying we shouldn't sanitize the kvmclock behavior, but knowing the exact
problematic configuration(s) will help us make a better decision on how to fix
the mess.
Hi Sean and David,

On 10/2/23 09:37, Sean Christopherson wrote:
[...]
> However, why does any of this matter if the host has a constant TSC? If
> that's the case, a sane setup will expose a constant TSC to the guest and
> the guest will use the TSC instead of kvmclock for the guest clocksource.
>
> Dongli, is this for long-lived "legacy" guests that were created on hosts
> without a constant TSC? If not, then why is kvmclock being used? Or heaven
> forbid, are you running on hardware without a constant TSC? :-)

This is for test guests, and the host has all of the below:

tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust

A clocksource is used for two things.

1. current_clocksource. Yes, I agree we should always use tsc on modern
hardware.

Do we need to update the documentation to always suggest TSC when it is
constant, as I believe many users still prefer the pv clock over tsc?

Thanks to TSC ratio scaling, live migration will not impact the tsc.

From the source code, the rating of kvm-clock is still higher than tsc.
BTW, how about decreasing the rating if the guest detects a constant tsc?

166 struct clocksource kvm_clock = {
167         .name = "kvm-clock",
168         .read = kvm_clock_get_cycles,
169         .rating = 400,
170         .mask = CLOCKSOURCE_MASK(64),
171         .flags = CLOCK_SOURCE_IS_CONTINUOUS,
172         .enable = kvm_cs_enable,
173 };

1196 static struct clocksource clocksource_tsc = {
1197         .name = "tsc",
1198         .rating = 300,
1199         .read = read_tsc,

2. The sched_clock. Scheduling is impacted if there is a big drift.

Fortunately, according to my general testing (without RT requirements), the
impact is trivial unless the two master clock *updates* are separated by a
super long period.

In the past, scheduling was impacted a lot when the master clock was still
based on monotonic (not raw):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53fafdbb8b21fa99dfd8376ca056bffde8cafc11

Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
operations (not only sched_clock), e.g., after line 300.

296 void __init kvmclock_init(void)
297 {
298         u8 flags;
299 
300         if (!kvm_para_available() || !kvmclock)
301                 return;
302 
303         if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
304                 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
305                 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
306         } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
307                 return;
308         }
309 
310         if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
311                               kvmclock_setup_percpu, NULL) < 0) {
312                 return;
313         }
314 
315         pr_info("kvm-clock: Using msrs %x and %x",
316                 msr_kvm_system_time, msr_kvm_wall_clock);
317 
318         this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
319         kvm_register_clock("primary cpu clock");
320         pvclock_set_pvti_cpu0_va(hv_clock_boot);
321 
322         if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
323                 pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
324 
325         flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
326         kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
327 
328         x86_platform.calibrate_tsc = kvm_get_tsc_khz;
329         x86_platform.calibrate_cpu = kvm_get_tsc_khz;
330         x86_platform.get_wallclock = kvm_get_wallclock;
331         x86_platform.set_wallclock = kvm_set_wallclock;
332 #ifdef CONFIG_X86_LOCAL_APIC
333         x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
334 #endif
335         x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
336         x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
337         kvm_get_preset_lpj();

Should I introduce a new param to disable only the kvm sched_clock, or a new
param to allow the selection of sched_clock?

Those may not resolve the issue in another thread. (I guess there is a chance
that refreshing the masterclock may reduce the drift in that case, although I
never tried.)

https://lore.kernel.org/all/00fba193-238e-49dc-fdc4-0b93f20569ec@oracle.com/

Thank you very much!

Dongli Zhang

> Not saying we shouldn't sanitize the kvmclock behavior, but knowing the
> exact problematic configuration(s) will help us make a better decision on
> how to fix the mess.
On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
> > On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
> > > 1. vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
> > >
> > > 2. The raw monotonic clock (tsc_clocksource) uses a different
> > > mult/shift/equation.
[...]
> I believe the answer is that "struct pvclock_vcpu_time_info" and its math
> are ABI between KVM and KVM guests.
>
> Like many of the older bits of KVM, my guess is that KVM's behavior is the
> product of making things kinda sorta work with old hardware, i.e. was
> probably the least awful solution in the days before constant TSCs, but is
> completely nonsensical on modern hardware.

I still don't understand. The ABI and its math are fine. The ABI is just "at
time X the TSC was Y, and the TSC frequency is Z".

I understand why, on older hardware, those values needed to *change*
occasionally when TSC stupidity happened.

But on newer hardware, surely we can set them precisely *once* when the VM
starts, and never ever have to change them again? Theoretically not even when
we pause the VM, kexec into a new kernel, and resume the VM!

But we *are* having to change it, because apparently CLOCK_MONOTONIC_RAW is
doing something *other* than incrementing at precisely the frequency of the
known and constant TSC.

But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole point of
CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by NTP etc.?
Shouldn't it run at precisely the same rate as the kvmclock, with no skew at
all?

And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really have
to keep resetting the kvmclock to it at all? On modern hardware can't the
kvmclock be defined by the TSC alone?
+PeterZ

Thomas and Peter,

We're trying to address an issue where KVM's paravirt kvmclock drifts from the
host's TSC-based monotonic raw clock because of historical reasons (at least,
AFAICT), even when the TSC is constant. Due to some dubious KVM behavior, KVM
may sometimes re-sync kvmclock against the host's monotonic raw clock, which
causes non-trivial jumps in time from the guest's perspective.

Linux-as-a-guest demotes all paravirt clock sources when the TSC is constant
and nonstop, and so the goofy KVM behavior isn't likely to affect the guest's
clocksource, but the guest's sched_clock() implementation keeps using the
paravirt clock.

Irrespective of if/how we fix the KVM host-side mess, using a paravirt clock
for the scheduler when using a constant, nonstop TSC for the clocksource seems
at best inefficient, and at worst unnecessarily complex and risky.

Is there any reason not to prefer native_sched_clock() over whatever paravirt
clock is present when the TSC is the preferred clocksource?

Assuming the desirable thing to do is to use native_sched_clock() in this
scenario, do we need a separate rating system, or can we simply tie the sched
clock selection to the clocksource selection, e.g. override the paravirt stuff
if the TSC clock has higher priority and is chosen?

Some more details below (and in the rest of the thread). Thanks!

On Mon, Oct 02, 2023, Dongli Zhang wrote:
> On 10/2/23 09:37, Sean Christopherson wrote:
[...]
> From the source code, the rating of kvm-clock is still higher than tsc.
>
> BTW, how about decreasing the rating if the guest detects a constant tsc?
[...]

That's already done in kvmclock_init():

	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
	    !check_tsc_unstable())
		kvm_clock.rating = 299;

See also: https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@google.com

> 2. The sched_clock. Scheduling is impacted if there is a big drift.
[...]
> Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
> operations (not only sched_clock), e.g., after line 300.
[...]
> Should I introduce a new param to disable only the kvm sched_clock, or a new
> param to allow the selection of sched_clock?

I don't think we want a KVM-specific knob, because every flavor of paravirt
guest would need to do the same thing. And unless there's a good reason to use
a paravirt clock, this really shouldn't be something the guest admin needs to
opt into using.
On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote: > +PeterZ > > Thomas and Peter, > > We're trying to address an issue where KVM's paravirt kvmclock drifts from the > host's TSC-based monotonic raw clock because of historical reasons (at least, AFAICT), > even when the TSC is constant. Due to some dubious KVM behavior, KVM may sometimes > re-sync kvmclock against the host's monotonic raw clock, which causes non-trivial > jumps in time from the guest's perspective. > > Linux-as-a-guest demotes all paravirt clock sources when the TSC is constant and > nonstop, and so the goofy KVM behavior isn't likely to affect the guest's clocksource, > but the guest's sched_clock() implementation keeps using the paravirt clock. > > Irrespective of if/how we fix the KVM host-side mess, using a paravirt clock for > the scheduler when using a constant, nonstop TSC for the clocksource seems at best > inefficient, and at worst unnecessarily complex and risky. > > Is there any reason not to prefer native_sched_clock() over whatever paravirt > clock is present when the TSC is the preferred clocksource? I see none, that whole pv_clock thing is horrible crap. > Assuming the desirable > thing to do is to use native_sched_clock() in this scenario, do we need a separate > rating system, or can we simply tie the sched clock selection to the clocksource > selection, e.g. override the paravirt stuff if the TSC clock has higher priority > and is chosen? Yeah, I see no point of another rating system. Just force the thing back to native (or don't set it to that other thing).
On Mon, Oct 02, 2023 at 11:06:07PM +0200, Peter Zijlstra wrote: > On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote: > > +PeterZ > > > > Thomas and Peter, > > > > We're trying to address an issue where KVM's paravirt kvmclock drifts from the > > host's TSC-based monotonic raw clock because of historical reasons (at least, AFAICT), > > even when the TSC is constant. Due to some dubious KVM behavior, KVM may sometimes > > re-sync kvmclock against the host's monotonic raw clock, which causes non-trivial > > jumps in time from the guest's perspective. > > > > Linux-as-a-guest demotes all paravirt clock sources when the TSC is constant and > > nonstop, and so the goofy KVM behavior isn't likely to affect the guest's clocksource, > > but the guest's sched_clock() implementation keeps using the paravirt clock. > > > > Irrespective of if/how we fix the KVM host-side mess, using a paravirt clock for > > the scheduler when using a constant, nonstop TSC for the clocksource seems at best > > inefficient, and at worst unnecessarily complex and risky. > > > > Is there any reason not to prefer native_sched_clock() over whatever paravirt > > clock is present when the TSC is the preferred clocksource? > > I see none, that whole pv_clock thing is horrible crap. In fact, I don't really see a reason to ever use pv_clock, even on non-constant TSC. The sched_clock machinery used on x86 (and ia64 at some point) reverts to tick-based + 'TSC-with-monotonicity-filter refinement' once it detects the TSC is crap. And that should work in a guest too I suppose. Also, I really should clean all that up -- it's all static_key based, but I think I can do a saner version with static_call. But that's stuck somewhere on the eternal todo list.
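[Editor's note: on current x86 kernels the paravirt sched_clock hook is already routed through a static_call (DECLARE_STATIC_CALL(pv_sched_clock, ...) in arch/x86/include/asm/paravirt.h), and kvmclock installs itself via paravirt_set_sched_clock(kvm_sched_clock_read). A minimal sketch of the guest-side policy being discussed, assuming that hook; the conditional below is hypothetical, not actual kernel code:]

	/* Hypothetical sketch: leave the pv_sched_clock static_call pointing
	 * at native_sched_clock() when the TSC is constant, nonstop and not
	 * marked unstable, instead of unconditionally overriding it with the
	 * paravirt clock.
	 */
	static void kvm_sched_clock_init(void)
	{
		if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
		    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
		    !check_tsc_unstable())
			return;		/* keep native_sched_clock() */

		paravirt_set_sched_clock(kvm_sched_clock_read);
	}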
On Mon, Oct 02, 2023, David Woodhouse wrote: > On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote: > > On Mon, Oct 02, 2023, David Woodhouse wrote: > > > On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote: > > > > > > > > 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation. > > > > > > > > 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation. > > > > > > > > > > That just seems wrong. I don't mean that you're incorrect; it seems > > > *morally* wrong. > > > > > > In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use > > > a *different* mult/shift/equation (your #1) to convert TSC ticks to > > > nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2). > > > > > > I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's > > > adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent. > > > > > > Fix that, and the whole problem goes away, doesn't it? > > > > > > What am I missing here, that means we can't do that? > > > > I believe the answer is that "struct pvclock_vcpu_time_info" and its math are > > ABI between KVM and KVM guests. > > > > Like many of the older bits of KVM, my guess is that KVM's behavior is the product > > of making things kinda sorta work with old hardware, i.e. was probably the least > > awful solution in the days before constant TSCs, but is completely nonsensical on > > modern hardware. > > I still don't understand. The ABI and its math are fine. The ABI is just > "at time X the TSC was Y, and the TSC frequency is Z" > > I understand why on older hardware, those values needed to *change* > occasionally when TSC stupidity happened. > > But on newer hardware, surely we can set them precisely *once* when the > VM starts, and never ever have to change them again? Theoretically not > even when we pause the VM, kexec into a new kernel, and resume the VM! > > But we *are* having to change it, because apparently > CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at > precisely the frequency of the known and constant TSC. > > But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole > point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by > NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock, > with no skew at all? IIUC, the issue is that the paravirt clock ends up mixing time domains, i.e. is a weird bastardization of the host's monotonic raw clock and the paravirt clock. Despite a name that suggests otherwise (to me at least), __pvclock_read_cycles() counts "cycles" in nanoseconds, not TSC ticks. u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc) { u64 delta = tsc - src->tsc_timestamp; u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul, src->tsc_shift); return src->system_time + offset; } In the above, "offset" is computed in the kvmclock domain, whereas system_time comes from the host's CLOCK_MONOTONIC_RAW domain by way of master_kernel_ns. The goofy math is much more obvious in __get_kvmclock(), i.e. KVM's host-side retrieval of the guest's view of kvmclock: hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset; The two domains use the same "clock" (constant TSC), but different math to compute nanoseconds from a given TSC value. For decently large TSC values, this results in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds. When KVM decides to refresh the masterclock, e.g. vCPU hotplug in Dongli's case, it resets the base time, a.k.a. 
system_time.  I.e. KVM takes all of the time that was accumulated in the kvmclock domain and recomputes it in the CLOCK_MONOTONIC_RAW domain.  The more time that passes between refreshes, the bigger the time jump from the guest's perspective.

E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add" and the "subtract", but the underlying train wreck of mixing time domains is still there.

Without a constant TSC, deferring the reference time to the host's computation makes sense (or at least, is less silly), because the effective TSC would be per physical CPU, whereas the reference time is per VM.

[*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org

> And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really
> have to keep resetting the kvmclock to it at all? On modern hardware
> can't the kvmclock be defined by the TSC alone?

I think there is still use for synchronizing with the host's view of time, e.g. to deal with lost time across host suspend+resume.

So I don't think we can completely sever KVM's paravirt clocks from host time, at least not without harming use cases that rely on the host's view to keep accurate time.  And honestly at that point, the right answer would be to stop advertising paravirt clocks entirely.

But I do think we can address the issues that Dongli and David are observing where guest time drifts even though the host kernel's base time hasn't changed.  If I've pieced everything together correctly, the drift can be eliminated simply by using the paravirt clock algorithm when converting the delta from the raw TSC to nanoseconds.

This is *very* lightly tested, as in it compiles and doesn't explode, but that's about all I've tested.
---
 arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6573c89c35a9..3ba7edfca47c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2417,6 +2417,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
 static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
 #endif
 
+static u32 host_tsc_to_system_mul;
+static s8 host_tsc_shift;
+
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 static unsigned long max_tsc_khz;
 
@@ -2812,27 +2815,18 @@ static u64 read_tsc(void)
 static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
 			  int *mode)
 {
-	u64 tsc_pg_val;
-	long v;
+	u64 ns, v;
 
 	switch (clock->vclock_mode) {
 	case VDSO_CLOCKMODE_HVCLOCK:
-		if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
-					 tsc_timestamp, &tsc_pg_val)) {
-			/* TSC page valid */
+		if (hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp, &v))
 			*mode = VDSO_CLOCKMODE_HVCLOCK;
-			v = (tsc_pg_val - clock->cycle_last) &
-				clock->mask;
-		} else {
-			/* TSC page invalid */
+		else
 			*mode = VDSO_CLOCKMODE_NONE;
-		}
 		break;
 	case VDSO_CLOCKMODE_TSC:
 		*mode = VDSO_CLOCKMODE_TSC;
-		*tsc_timestamp = read_tsc();
-		v = (*tsc_timestamp - clock->cycle_last) &
-			clock->mask;
+		v = *tsc_timestamp = read_tsc();
 		break;
 	default:
 		*mode = VDSO_CLOCKMODE_NONE;
@@ -2840,8 +2834,36 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
 
 	if (*mode == VDSO_CLOCKMODE_NONE)
 		*tsc_timestamp = v = 0;
+	else
+		v = (v - clock->cycle_last) & clock->mask;
 
-	return v * clock->mult;
+	ns = clock->base_cycles;
+
+	/*
+	 * When the clock source is a raw, constant TSC, do the conversion to
+	 * nanoseconds using the paravirt clock math so that the delta in ns is
+	 * consistent regardless of whether the delta is converted in the host
+	 * or the guest.
+	 *
+	 * The base for paravirt clocks is the kernel's base time in ns, plus
+	 * the delta since the last sync.  E.g. if a masterclock update occurs,
+	 * KVM will shift some amount of delta from the guest to the host.
+	 * Conversions from TSC to ns for the host's CLOCK_MONOTONIC_RAW and
+	 * paravirt clocks aren't equivalent, and so shifting the delta can
+	 * cause time to jump from the guest's view of the paravirt clock.
+	 * This only works for a constant TSC, otherwise the calculation would
+	 * only be valid for the current physical CPU, whereas the base of the
+	 * clock must be valid for all vCPUs in the VM.
+	 */
+	if (static_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    *mode == VDSO_CLOCKMODE_TSC && clock == &pvclock_gtod_data.raw_clock) {
+		ns >>= clock->shift;
+		ns += pvclock_scale_delta(v, host_tsc_to_system_mul, host_tsc_shift);
+	} else {
+		ns += v * clock->mult;
+		ns >>= clock->shift;
+	}
+	return ns;
 }
 
 static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
@@ -2853,9 +2875,7 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		ns = gtod->raw_clock.base_cycles;
-		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
-		ns >>= gtod->raw_clock.shift;
+		ns = vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
 		ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	*t = ns;
@@ -2873,9 +2893,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
 		ts->tv_sec = gtod->wall_time_sec;
-		ns = gtod->clock.base_cycles;
-		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
-		ns >>= gtod->clock.shift;
+		ns = vgettsc(&gtod->clock, tsc_timestamp, &mode);
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
 	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
@@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
 	if (ret != 0)
 		return ret;
 
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
+				   &host_tsc_shift, &host_tsc_to_system_mul);
+
 	local_tsc = rdtsc();
 	stable = !kvm_check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {

base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
--
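[Editor's note: for readers following the patch, pvclock_scale_delta() is the guest-side ABI math from arch/x86/include/asm/pvclock.h. A minimal userspace model of its semantics (an editor's sketch; the kernel version avoids __int128 on 32-bit builds):]

	#include <stdint.h>

	/* Shift the TSC delta by tsc_shift, then multiply by the 32.32
	 * fixed-point fraction tsc_to_system_mul, keeping bits 32..95 of the
	 * product, i.e. ns = ((delta <</>> shift) * mul_frac) >> 32.
	 */
	static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
	{
		if (shift < 0)
			delta >>= -shift;
		else
			delta <<= shift;

		return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
	}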
Hi Sean, A quick question ... On 10/2/23 17:53, Sean Christopherson wrote: > On Mon, Oct 02, 2023, David Woodhouse wrote: >> On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote: >>> On Mon, Oct 02, 2023, David Woodhouse wrote: >>>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote: >>>>> >>>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation. >>>>> >>>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation. >>>>> >>>> >>>> That just seems wrong. I don't mean that you're incorrect; it seems >>>> *morally* wrong. >>>> >>>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use >>>> a *different* mult/shift/equation (your #1) to convert TSC ticks to >>>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2). >>>> >>>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's >>>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent. >>>> >>>> Fix that, and the whole problem goes away, doesn't it? >>>> >>>> What am I missing here, that means we can't do that? >>> >>> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are >>> ABI between KVM and KVM guests. >>> >>> Like many of the older bits of KVM, my guess is that KVM's behavior is the product >>> of making things kinda sorta work with old hardware, i.e. was probably the least >>> awful solution in the days before constant TSCs, but is completely nonsensical on >>> modern hardware. >> >> I still don't understand. The ABI and its math are fine. The ABI is just >> "at time X the TSC was Y, and the TSC frequency is Z" >> >> I understand why on older hardware, those values needed to *change* >> occasionally when TSC stupidity happened. >> >> But on newer hardware, surely we can set them precisely *once* when the >> VM starts, and never ever have to change them again? Theoretically not >> even when we pause the VM, kexec into a new kernel, and resume the VM! >> >> But we *are* having to change it, because apparently >> CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at >> precisely the frequency of the known and constant TSC. >> >> But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole >> point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by >> NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock, >> with no skew at all? > > IIUC, the issue is that the paravirt clock ends up mixing time domains, i.e. is > a weird bastardization of the host's monotonic raw clock and the paravirt clock. > > Despite a name that suggests otherwise (to me at least), __pvclock_read_cycles() > counts "cycles" in nanoseconds, not TSC ticks. > > u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc) > { > u64 delta = tsc - src->tsc_timestamp; > u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul, > src->tsc_shift); > return src->system_time + offset; > } > > In the above, "offset" is computed in the kvmclock domain, whereas system_time > comes from the host's CLOCK_MONOTONIC_RAW domain by way of master_kernel_ns. > The goofy math is much more obvious in __get_kvmclock(), i.e. KVM's host-side > retrieval of the guest's view of kvmclock: > > hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset; > > The two domains use the same "clock" (constant TSC), but different math to compute > nanoseconds from a given TSC value. 
nanoseconds from a given TSC value.  For decently large TSC values, this results > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds. > > When KVM decides to refresh the masterclock, e.g. vCPU hotplug in Dongli's case, > it resets the base time, a.k.a. system_time. I.e. KVM takes all of the time that > was accumulated in the kvmclock domain and recomputes it in the CLOCK_MONOTONIC_RAW > domain. The more time that passes between refreshes, the bigger the time jump > from the guest's perspective. > > E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by > undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add" > and the "subtract", but the underlying train wreck of mixing time domains is > still there. > > Without a constant TSC, deferring the reference time to the host's computation > makes sense (or at least, is less silly), because the effective TSC would be per > physical CPU, whereas the reference time is per VM. > > [*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org > >> And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really >> have to keep resetting the kvmclock to it at all? On modern hardware >> can't the kvmclock be defined by the TSC alone? > > I think there is still use for synchronizing with the host's view of time, e.g. > to deal with lost time across host suspend+resume. > > So I don't think we can completely sever KVM's paravirt clocks from host time, > at least not without harming use cases that rely on the host's view to keep > accurate time. And honestly at that point, the right answer would be to stop > advertising paravirt clocks entirely. > > But I do think we can address the issues that Dongli and David are observing > where guest time drifts even though the host kernel's base time hasn't changed. > If I've pieced everything together correctly, the drift can be eliminated simply > by using the paravirt clock algorithm when converting the delta from the raw TSC > to nanoseconds. > > This is *very* lightly tested, as in it compiles and doesn't explode, but that's > about all I've tested. 
> > --- > arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++--------------- > 1 file changed, 42 insertions(+), 20 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 6573c89c35a9..3ba7edfca47c 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2417,6 +2417,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz, > static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0); > #endif > > +static u32 host_tsc_to_system_mul; > +static s8 host_tsc_shift; > + > static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz); > static unsigned long max_tsc_khz; > > @@ -2812,27 +2815,18 @@ static u64 read_tsc(void) > static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp, > int *mode) > { > - u64 tsc_pg_val; > - long v; > + u64 ns, v; > > switch (clock->vclock_mode) { > case VDSO_CLOCKMODE_HVCLOCK: > - if (hv_read_tsc_page_tsc(hv_get_tsc_page(), > - tsc_timestamp, &tsc_pg_val)) { > - /* TSC page valid */ > + if (hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp, &v)) > *mode = VDSO_CLOCKMODE_HVCLOCK; > - v = (tsc_pg_val - clock->cycle_last) & > - clock->mask; > - } else { > - /* TSC page invalid */ > + else > *mode = VDSO_CLOCKMODE_NONE; > - } > break; > case VDSO_CLOCKMODE_TSC: > *mode = VDSO_CLOCKMODE_TSC; > - *tsc_timestamp = read_tsc(); > - v = (*tsc_timestamp - clock->cycle_last) & > - clock->mask; > + v = *tsc_timestamp = read_tsc(); > break; > default: > *mode = VDSO_CLOCKMODE_NONE; > @@ -2840,8 +2834,36 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp, > > if (*mode == VDSO_CLOCKMODE_NONE) > *tsc_timestamp = v = 0; > + else > + v = (v - clock->cycle_last) & clock->mask; > > - return v * clock->mult; > + ns = clock->base_cycles; > + > + /* > + * When the clock source is a raw, constant TSC, do the conversion to > + * nanoseconds using the paravirt clock math so that the delta in ns is > + * consistent regardless of whether the delta is converted in the host > + * or the guest. > + * > + * The base for paravirt clocks is the kernel's base time in ns, plus > + * the delta since the last sync. E.g. if a masterclock update occurs, > + * KVM will shift some amount of delta from the guest to the host. > + * Conversions from TSC to ns for the hosts's CLOCK_MONOTONIC_RAW and > + * paravirt clocks aren't equivalent, and so shifting the delta can > + * cause time to jump from the guest's view of the paravirt clock. > + * This only works for a constant TSC, otherwise the calculation would > + * only be valid for the current physical CPU, whereas the base of the > + * clock must be valid for all vCPUs in the VM. 
> + */ > + if (static_cpu_has(X86_FEATURE_CONSTANT_TSC) && > + *mode == VDSO_CLOCKMODE_TSC && clock == &pvclock_gtod_data.raw_clock) { > + ns >>= clock->shift; > + ns += pvclock_scale_delta(v, host_tsc_to_system_mul, host_tsc_shift); > + } else { > + ns += v * clock->mult; > + ns >>= clock->shift; > + } > + return ns; > } > > static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp) > @@ -2853,9 +2875,7 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp) > > do { > seq = read_seqcount_begin(>od->seq); > - ns = gtod->raw_clock.base_cycles; > - ns += vgettsc(>od->raw_clock, tsc_timestamp, &mode); > - ns >>= gtod->raw_clock.shift; > + ns = vgettsc(>od->raw_clock, tsc_timestamp, &mode); > ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot)); > } while (unlikely(read_seqcount_retry(>od->seq, seq))); > *t = ns; > @@ -2873,9 +2893,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp) > do { > seq = read_seqcount_begin(>od->seq); > ts->tv_sec = gtod->wall_time_sec; > - ns = gtod->clock.base_cycles; > - ns += vgettsc(>od->clock, tsc_timestamp, &mode); > - ns >>= gtod->clock.shift; > + ns = vgettsc(>od->clock, tsc_timestamp, &mode); > } while (unlikely(read_seqcount_retry(>od->seq, seq))); > > ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); > @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void) > if (ret != 0) > return ret; > > + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) > + kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL, > + &host_tsc_shift, &host_tsc_to_system_mul); I agree that to use the kvmclock to calculate the ns elapsed when updating the master clock. Would you take the tsc scaling into consideration? While the host_tsc_shift and host_tsc_to_system_mul are pre-computed, how about the VM using different TSC frequency? Thank you very much! Dongli Zhang > + > local_tsc = rdtsc(); > stable = !kvm_check_tsc_unstable(); > list_for_each_entry(kvm, &vm_list, vm_list) { > > base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
On Mon, Oct 02, 2023, Dongli Zhang wrote:
> > @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
> >  	if (ret != 0)
> >  		return ret;
> >  
> > +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> > +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
> > +				   &host_tsc_shift, &host_tsc_to_system_mul);
> 
> I agree that to use the kvmclock to calculate the ns elapsed when updating the
> master clock.
> 
> Would you take the tsc scaling into consideration?
> 
> While the host_tsc_shift and host_tsc_to_system_mul are pre-computed, how about
> the VM using different TSC frequency?

Heh, I'm pretty sure that's completely broken today.  I don't see anything in KVM that takes hardware TSC scaling into account.

This code:

	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
				   &vcpu->hv_clock.tsc_shift,
				   &vcpu->hv_clock.tsc_to_system_mul);
		vcpu->hw_tsc_khz = tgt_tsc_khz;
		kvm_xen_update_tsc_info(v);
	}

is recomputing the multiplier+shift for the current *physical* CPU, it's not related to the guest's TSC in any way.

__get_kvmclock() again shows that quite clearly, there's no scaling for the guest TSC anywhere in there.
Hi Sean, On 10/2/23 18:49, Sean Christopherson wrote: > On Mon, Oct 02, 2023, Dongli Zhang wrote: >>> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void) >>> if (ret != 0) >>> return ret; >>> >>> + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) >>> + kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL, >>> + &host_tsc_shift, &host_tsc_to_system_mul); >> >> I agree that to use the kvmclock to calculate the ns elapsed when updating the >> master clock. >> >> Would you take the tsc scaling into consideration? >> >> While the host_tsc_shift and host_tsc_to_system_mul are pre-computed, how about >> the VM using different TSC frequency? > > Heh, I'm pretty sure that's completely broken today. I don't see anything in KVM > that takes hardware TSC scaling into account. > > This code: > > if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { > kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, > &vcpu->hv_clock.tsc_shift, > &vcpu->hv_clock.tsc_to_system_mul); > vcpu->hw_tsc_khz = tgt_tsc_khz; > kvm_xen_update_tsc_info(v); > } > > is recomputing the multipler+shift for the current *physical* CPU, it's not > related to the guest's TSC in any way. The below is the code. line 3175: query freq for current *physical* CPU. line 3211: scale the freq if scaling is involved. line 3215: compute the view for guest based on new 'tgt_tsc_khz' after scaling. 3146 static int kvm_guest_time_update(struct kvm_vcpu *v) 3147 { 3148 unsigned long flags, tgt_tsc_khz; 3149 unsigned seq; ... ... 3173 /* Keep irq disabled to prevent changes to the clock */ 3174 local_irq_save(flags); 3175 tgt_tsc_khz = get_cpu_tsc_khz(); ... ... 3210 if (kvm_caps.has_tsc_control) 3211 tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz, 3212 v->arch.l1_tsc_scaling_ratio); 3213 3214 if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { 3215 kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, 3216 &vcpu->hv_clock.tsc_shift, 3217 &vcpu->hv_clock.tsc_to_system_mul); 3218 vcpu->hw_tsc_khz = tgt_tsc_khz; 3219 kvm_xen_update_tsc_info(v); 3220 } Would you please let me know if the above understanding is incorrect? Thank you very much! Dongli Zhang > > __get_kvmclock() again shows that quite clearly, there's no scaling for the guest > TSC anywhere in there.
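[Editor's note: the kvm_scale_tsc() call Dongli points at boils down to a fixed-point multiply, mul_u64_u64_shr(tsc, ratio, frac_bits) in the kernel. A userspace sketch of the math; the frac_bits values (48 on VMX, 32 on SVM) are quoted from memory, not from this thread:]

	#include <stdint.h>

	/* Guest TSC = host TSC * ratio, where ratio is a fixed-point number
	 * with frac_bits fractional bits; ratio == 1 << frac_bits means the
	 * guest runs at the host frequency.
	 */
	static uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio, unsigned int frac_bits)
	{
		return (uint64_t)(((unsigned __int128)host_tsc * ratio) >> frac_bits);
	}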
On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> 
> The two domains use the same "clock" (constant TSC), but different math to compute
> nanoseconds from a given TSC value.  For decently large TSC values, this results
> in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.

This is the bit I'm still confused about, and it seems to be the root of all the other problems.

Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a number of ticks of the TSC running at a constant known frequency, to a number of nanoseconds.

So how in the name of all that is holy do they manage to get *different* answers?

I get that the mult/shift thing carries some imprecision, but is that all it is? Can't we ensure that the kvmclock uses the *same* algorithm, precisely, as CLOCK_MONOTONIC_RAW?

> E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by
> undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add"
> and the "subtract", but the underlying train wreck of mixing time domains is
> still there.
> [*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org

That one's different; those clock domains are *supposed* to be divergent, and the point of the PV wall clock information is to convey the delta between the two. I just made it use the delta between the two at a *given* moment, instead of calculating the difference between *two* different times.
On 3 October 2023 01:53:11 BST, Sean Christopherson <seanjc@google.com> wrote: >I think there is still use for synchronizing with the host's view of time, e.g. >to deal with lost time across host suspend+resume. > >So I don't think we can completely sever KVM's paravirt clocks from host time, >at least not without harming use cases that rely on the host's view to keep >accurate time. And honestly at that point, the right answer would be to stop >advertising paravirt clocks entirely. > >But I do think we can address the issues that Dongli and David are obversing >where guest time drifts even though the host kernel's base time hasn't changed. >If I've pieced everything together correctly, the drift can be eliminated simply >by using the paravirt clock algorithm when converting the delta from the raw TSC >to nanoseconds. > >This is *very* lightly tested, as in it compiles and doesn't explode, but that's >about all I've tested. Hm, I don't think I like this. You're making get_monotonic_raw() not *actually* return the monotonic_raw clock, but basically return the kvmclock instead? And why? So that when KVM attempts to synchronize the kvmclock to the monotonic_raw clock, it gets tricked into actually synchronizing the kvmclock to *itself*? If you get this right, don't we have a fairly complex piece of code that has precisely *no* effect? Can't we just *refrain* from synchronizing the kvmclock to *anything*, in the CONSTANT_TSC case? Why do we do that anyway? (Suspend/resume, live update and live migration are different. In *those* cases we may need to preserve both the guest TSC and kvmclock based on either the host TSC or CLOCK_TAI. But that's different.)
On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> 
> This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> about all I've tested.

I don't think it's working, if I understand what it's supposed to be doing.

I hacked my wallclock patch *not* to use kvm_get_walltime_and_clockread(), but instead use kvm_get_time_and_clockread() so it should be able to compare monotonic_raw with kvmclock time. It looks like this.

uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
{
	/*
	 * The guest calculates current wall clock time by adding
	 * system time (updated by kvm_guest_time_update below) to the
	 * wall clock specified here.  We do the reverse here.
	 */
#ifdef CONFIG_X86_64
	struct pvclock_vcpu_time_info hv_clock;
	struct kvm_arch *ka = &kvm->arch;
	unsigned long seq, local_tsc_khz = 0;
	struct timespec64 ts;
	uint64_t host_tsc;

	do {
		seq = read_seqcount_begin(&ka->pvclock_sc);

		if (!ka->use_master_clock)
			break;

		/* It all has to happen on the same CPU */
		get_cpu();

		local_tsc_khz = get_cpu_tsc_khz();

		if (local_tsc_khz &&
		    !kvm_get_time_and_clockread(&ts.tv_sec, &host_tsc))
			local_tsc_khz = 0; /* Fall back to old method */

		hv_clock.tsc_timestamp = ka->master_cycle_now;
		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;

		put_cpu();
	} while (read_seqcount_retry(&ka->pvclock_sc, seq));

	/*
	 * If the conditions were right, and obtaining the wallclock+TSC was
	 * successful, calculate the KVM clock at the corresponding time and
	 * subtract one from the other to get the epoch in nanoseconds.
	 */
	if (local_tsc_khz) {
		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * 1000LL,
				   &hv_clock.tsc_shift,
				   &hv_clock.tsc_to_system_mul);
		uint64_t res = __pvclock_read_cycles(&hv_clock, host_tsc);
		uint64_t d2 = ts.tv_sec + ka->kvmclock_offset;
		printk("Calculated %lld (%lld/%lld delta %lld, ns %lld o %lld)\n",
		       res, ts.tv_sec, d2, d2 - res, ka->master_kernel_ns,
		       ka->kvmclock_offset);
		if (0) return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
			       __pvclock_read_cycles(&hv_clock, host_tsc);
	}
#endif
	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
}

I also removed the kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE) from kvm_xen_shared_info_init(). I don't even know why that's there at all, let alone on *all* vCPUs. So now KVM doesn't keep clamping the kvmclock back to monotonic_raw.

When I run xen_shinfo_test, the kvmclock still drifts from the "monotonic_raw" I get from kvm_get_time_and_clockread(), even with your patch.

[98513.851371] Calculated 522601895 (98513572796678/522601913 delta 18, ns 98513057857491 o -98513050194765)
[98513.851388] Calculated 522619091 (98513572813874/522619109 delta 18, ns 98513057857491 o -98513050194765)
[98513.851394] Calculated 522625423 (98513572820206/522625441 delta 18, ns 98513057857491 o -98513050194765)
[98513.851401] Calculated 522631389 (98513572826172/522631407 delta 18, ns 98513057857491 o -98513050194765)
[98513.851406] Calculated 522636947 (98513572831730/522636965 delta 18, ns 98513057857491 o -98513050194765)
[98513.851412] Calculated 522643004 (98513572837787/522643022 delta 18, ns 98513057857491 o -98513050194765)
[98513.851417] Calculated 522648344 (98513572843127/522648362 delta 18, ns 98513057857491 o -98513050194765)
[98513.851423] Calculated 522654367 (98513572849150/522654385 delta 18, ns 98513057857491 o -98513050194765)
...
[98517.386027] Calculated 4057257718 (98517107452615/4057257850 delta 132, ns 98513057857491 o -98513050194765)
[98517.386030] Calculated 4057261038 (98517107455934/4057261169 delta 131, ns 98513057857491 o -98513050194765)
[98517.386034] Calculated 4057265133 (98517107460029/4057265264 delta 131, ns 98513057857491 o -98513050194765)
[98517.386037] Calculated 4057268310 (98517107463206/4057268441 delta 131, ns 98513057857491 o -98513050194765)
[98517.386040] Calculated 4057271463 (98517107466359/4057271594 delta 131, ns 98513057857491 o -98513050194765)
[98517.386044] Calculated 4057274508 (98517107469404/4057274639 delta 131, ns 98513057857491 o -98513050194765)
[98517.386048] Calculated 4057278587 (98517107473483/4057278718 delta 131, ns 98513057857491 o -98513050194765)
[98517.386051] Calculated 4057281674 (98517107476570/4057281805 delta 131, ns 98513057857491 o -98513050194765)
[98517.386057] Calculated 4057288187 (98517107483083/4057288318 delta 131, ns 98513057857491 o -98513050194765)
[98517.386064] Calculated 4057294557 (98517107489453/4057294688 delta 131, ns 98513057857491 o -98513050194765)
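[Editor's note: a back-of-the-envelope reading of the logs above; the arithmetic is mine, not from the thread. Between the first and last samples the kernel clock advances 98517107489453 - 98513572796678 ns, about 3.53 s, while the reported delta grows from 18 ns to 131 ns:

	drift rate ~ (131 - 18) ns / 3.53 s ~ 32 ns/s ~ 3.2e-8 (32 ppb)

or roughly 2.8 ms per day. At that rate, keeping the guest-visible jump below, say, 1 us would require a re-sync about every 31 seconds.]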
On Mon, Oct 02, 2023, Dongli Zhang wrote: > Hi Sean, > > On 10/2/23 18:49, Sean Christopherson wrote: > > On Mon, Oct 02, 2023, Dongli Zhang wrote: > >>> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void) > >>> if (ret != 0) > >>> return ret; > >>> > >>> + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) > >>> + kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL, > >>> + &host_tsc_shift, &host_tsc_to_system_mul); > >> > >> I agree that to use the kvmclock to calculate the ns elapsed when updating the > >> master clock. > >> > >> Would you take the tsc scaling into consideration? > >> > >> While the host_tsc_shift and host_tsc_to_system_mul are pre-computed, how about > >> the VM using different TSC frequency? > > > > Heh, I'm pretty sure that's completely broken today. I don't see anything in KVM > > that takes hardware TSC scaling into account. > > > > This code: > > > > if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { > > kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, > > &vcpu->hv_clock.tsc_shift, > > &vcpu->hv_clock.tsc_to_system_mul); > > vcpu->hw_tsc_khz = tgt_tsc_khz; > > kvm_xen_update_tsc_info(v); > > } > > > > is recomputing the multipler+shift for the current *physical* CPU, it's not > > related to the guest's TSC in any way. > > The below is the code. > > line 3175: query freq for current *physical* CPU. > > line 3211: scale the freq if scaling is involved. > > line 3215: compute the view for guest based on new 'tgt_tsc_khz' after scaling. > > 3146 static int kvm_guest_time_update(struct kvm_vcpu *v) > 3147 { > 3148 unsigned long flags, tgt_tsc_khz; > 3149 unsigned seq; > ... ... > 3173 /* Keep irq disabled to prevent changes to the clock */ > 3174 local_irq_save(flags); > 3175 tgt_tsc_khz = get_cpu_tsc_khz(); > ... ... > 3210 if (kvm_caps.has_tsc_control) > 3211 tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz, > 3212 v->arch.l1_tsc_scaling_ratio); > 3213 > 3214 if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { > 3215 kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, > 3216 &vcpu->hv_clock.tsc_shift, > 3217 &vcpu->hv_clock.tsc_to_system_mul); > 3218 vcpu->hw_tsc_khz = tgt_tsc_khz; > 3219 kvm_xen_update_tsc_info(v); > 3220 } > > > Would you please let me know if the above understanding is incorrect? Ah, yeah, you're correct. I missed the call to kvm_scale_tsc() at 3211.
On Tue, Oct 03, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > 
> > The two domains use the same "clock" (constant TSC), but different math to compute
> > nanoseconds from a given TSC value.  For decently large TSC values, this results
> > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> 
> This is the bit I'm still confused about, and it seems to be the root
> of all the other problems.
> 
> Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
> number of ticks of the TSC running at a constant known frequency, to a
> number of nanoseconds.
> 
> So how in the name of all that is holy do they manage to get
> *different* answers?
> 
> I get that the mult/shift thing carries some imprecision, but is that
> all it is?

Yep, pretty sure that's it.  It's like the plot from Office Space / Superman III.  Those little rounding errors add up over time.

PV clock:

	nanoseconds = ((TSC >> shift) * mult) >> 32

or

	nanoseconds = ((TSC << shift) * mult) >> 32

versus timekeeping (mostly)

	nanoseconds = (TSC * mult) >> shift

The more I look at the PV clock stuff, the more I agree with Peter: it's garbage.  Shifting before multiplying is guaranteed to introduce error.  Shifting right drops data, and shifting left introduces zeros.

> Can't we ensure that the kvmclock uses the *same* algorithm,
> precisely, as CLOCK_MONOTONIC_RAW?

Yes?  At least for sane hardware, after much staring, I think it's possible.

It's tricky because the two algorithms are weirdly different, the PV clock algorithm is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh at us for suggesting that we try to shove the pvclock algorithm into the kernel.

The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.

Compile tested only, but I believe this math is correct.  And I'm guessing we'd want some safeguards against overflow, e.g. due to a multiplier that is too big.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6573c89c35a9..ae9275c3d580 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 					   v->arch.l1_tsc_scaling_ratio);
 
 	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
-		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
-				   &vcpu->hv_clock.tsc_shift,
-				   &vcpu->hv_clock.tsc_to_system_mul);
+		u32 shift, mult;
+
+		clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
+
+		if (shift <= 32) {
+			vcpu->hv_clock.tsc_shift = 0;
+			vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
+		} else {
+			kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
+					   &vcpu->hv_clock.tsc_shift,
+					   &vcpu->hv_clock.tsc_to_system_mul);
+		}
+
 		vcpu->hw_tsc_khz = tgt_tsc_khz;
 		kvm_xen_update_tsc_info(v);
 	}
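[Editor's note: a self-contained check of the folding trick above, as an editor's sketch rather than anything from the thread. When shift <= 32, encoding tsc_shift = 0 and tsc_to_system_mul = mult << (32 - shift) makes the pvclock conversion reproduce the timekeeping conversion exactly, because scaling numerator and denominator by 2^(32-shift) preserves the truncated quotient. The first (mult, shift) pair is the one David reports downthread; the second is made up to exercise shift < 32:]

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const struct { uint32_t mult; unsigned int shift; } cases[] = {
			{ 1655736523, 32 },	/* David's 2593992 kHz TSC */
			{ 1234567, 24 },	/* hypothetical pair, shift < 32 */
		};
		uint64_t delta;
		int i;

		for (i = 0; i < 2; i++) {
			uint32_t pv_mul = cases[i].mult << (32 - cases[i].shift);

			for (delta = 1; delta < (1ULL << 40); delta = delta * 3 + 1) {
				/* timekeeping: (delta * mult) >> shift */
				uint64_t tk = (uint64_t)(((unsigned __int128)delta *
							  cases[i].mult) >> cases[i].shift);
				/* pvclock with tsc_shift == 0: (delta * mul) >> 32 */
				uint64_t pv = (uint64_t)(((unsigned __int128)delta *
							  pv_mul) >> 32);

				if (tk != pv) {
					printf("mismatch at delta=%llu\n",
					       (unsigned long long)delta);
					return 1;
				}
			}
		}
		printf("conversions agree bit-for-bit\n");
		return 0;
	}

[If the per-delta conversions agree, any residual drift (as David indeed observes below) has to come from the accumulated base, not the scaling, which is what Sean concludes two messages down.]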
On Tue, Oct 03, 2023, David Woodhouse wrote:
> 
> 
> On 3 October 2023 01:53:11 BST, Sean Christopherson <seanjc@google.com> wrote:
> >I think there is still use for synchronizing with the host's view of time, e.g.
> >to deal with lost time across host suspend+resume.
> >
> >So I don't think we can completely sever KVM's paravirt clocks from host time,
> >at least not without harming use cases that rely on the host's view to keep
> >accurate time.  And honestly at that point, the right answer would be to stop
> >advertising paravirt clocks entirely.
> >
> >But I do think we can address the issues that Dongli and David are observing
> >where guest time drifts even though the host kernel's base time hasn't changed.
> >If I've pieced everything together correctly, the drift can be eliminated simply
> >by using the paravirt clock algorithm when converting the delta from the raw TSC
> >to nanoseconds.
> >
> >This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> >about all I've tested.
> 
> Hm, I don't think I like this.

Yeah, I don't like it either.  I'll respond to your other mail with details, but this is a dead end anyway.

> You're making get_monotonic_raw() not *actually* return the monotonic_raw
> clock, but basically return the kvmclock instead? And why? So that when KVM
> attempts to synchronize the kvmclock to the monotonic_raw clock, it gets
> tricked into actually synchronizing the kvmclock to *itself*?
> 
> If you get this right, don't we have a fairly complex piece of code that has
> precisely *no* effect?
> 
> Can't we just *refrain* from synchronizing the kvmclock to *anything*, in the
> CONSTANT_TSC case? Why do we do that anyway?
> 
> (Suspend/resume, live update and live migration are different. In *those*
> cases we may need to preserve both the guest TSC and kvmclock based on either
> the host TSC or CLOCK_TAI. But that's different.)

The issue is that the timekeeping code doesn't provide a notification mechanism to *just* get updates for things like suspend/resume.  We could maybe do something in KVM like unregister the notifier if the TSC is constant, and manually refresh on suspend/resume.  But that's pretty gross too, and I'd definitely be concerned that we missed something.
On Tue, Oct 03, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > 
> > This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> > about all I've tested.
> 
> I don't think it's working, if I understand what it's supposed to be
> doing.

...

> When I run xen_shinfo_test, the kvmclock still drifts from the
> "monotonic_raw" I get from kvm_get_time_and_clockread(), even with your
> patch.

It "works", in that it does what I intended it to do.  What I missed, and what's obvious in hindsight, is that the timekeeping code doesn't *just* update what KVM uses as the "base" when there are "big events" like suspend/resume, the timekeeping code updates the base clock on every tick.

My hack only smoothed away the delta, it didn't address the time that had gotten moved into the base.  I thought the base would be stable if the host was in a steady state, but see above...
On Tue, 2023-10-03 at 17:07 -0700, Sean Christopherson wrote: > > The issue is that the timekeeping code doesn't provide a notification mechanism > to *just* get updates for things like suspend/reume. We could maybe do something > in KVM like unregister the notifier if the TSC is constant, and manually refresh > on suspend/resume. But that's pretty gross too, and I'd definitely be concerned > that we missed something. Or *in* the existing notifier, just do the maths (correctly!) and don't update KVM's raw_clock data if the only thing that's really changed is the passage of time? On the whole, it's better if your other patch works to avoid the drift in the first place. I'll test it...
On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote: > On Tue, Oct 03, 2023, David Woodhouse wrote: > > On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote: > > > > > > The two domains use the same "clock" (constant TSC), but different math to compute > > > nanoseconds from a given TSC value. For decently large TSC values, this results > > > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds. > > > > This is the bit I'm still confused about, and it seems to be the root > > of all the other problems. > > > > Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a > > number of ticks of the TSC running at a constant known frequency, to a > > number of nanoseconds. > > > > So how in the name of all that is holy do they manage to get > > *different* answers? > > > > I get that the mult/shift thing carries some imprecision, but is that > > all it is? > > Yep, pretty sure that's it. It's like the plot from Office Space / Superman III. > Those little rounding errors add up over time. > > PV clock: > > nanoseconds = ((TSC >> shift) * mult) >> 32 > > or > > nanoseconds = ((TSC << shift) * mult) >> 32 > > versus timekeeping (mostly) > > nanoseconds = (TSC * mult) >> shift > > The more I look at the PV clock stuff, the more I agree with Peter: it's garbage. > Shifting before multiplying is guaranteed to introduce error. Shifting right drops > data, and shifting left introduces zeros. > > > Can't we ensure that the kvmclock uses the *same* algorithm, > > precisely, as CLOCK_MONOTONIC_RAW? > > Yes? At least for sane hardware, after much staring, I think it's possible. > > It's tricky because the two algorithms are wierdly different, the PV clock algorithm > is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh > at us for suggesting that we try to shove the pvclock algorithm into the kernel. > > The hardcoded shift right 32 in PV clock is annoying, but not the end of the world. > > Compile tested only, but I believe this math is correct. And I'm guessing we'd > want some safeguards against overflow, e.g. due to a multiplier that is too big. > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 6573c89c35a9..ae9275c3d580 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > v->arch.l1_tsc_scaling_ratio); > > if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { > - kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, > - &vcpu->hv_clock.tsc_shift, > - &vcpu->hv_clock.tsc_to_system_mul); > + u32 shift, mult; > + > + clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600); > + > + if (shift <= 32) { > + vcpu->hv_clock.tsc_shift = 0; > + vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift); > + } else { > + kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, > + &vcpu->hv_clock.tsc_shift, > + &vcpu->hv_clock.tsc_to_system_mul); > + } > + > vcpu->hw_tsc_khz = tgt_tsc_khz; > kvm_xen_update_tsc_info(v); > } > I gave that a go on my test box, and for a TSC frequency of 2593992 kHz it got mult=1655736523, shift=32 and took the 'happy' path instead of falling back. It still drifts about the same though, using the same test as before: https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock I was going to facetiously suggest that perhaps the kvmclock should have leap nanoseconds... but then realised that that's basically what Dongli's patch is *doing*. 
Maybe we just need to *recognise* that, so rather than having a user-configured period for the update, KVM could calculate the frequency for the updates based on the rate at which the clocks would otherwise drift, and a maximum delta? Not my favourite option, but perhaps better than nothing?
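[Editor's note: a sketch of what David is proposing, with hypothetical names and a made-up bound; nothing like this exists in KVM today. The idea is to derive the sync period from the divergence rate between the pvclock and timekeeping conversions, rather than using a fixed period:]

	#include <stdint.h>

	/* Given the relative drift between the two scaling functions in parts
	 * per billion (equivalently, ns of divergence per second of runtime)
	 * and the largest guest-visible time jump we are willing to tolerate,
	 * compute how often kvmclock needs to be re-synced.  Assumes
	 * max_delta_ns * 1e9 fits in 64 bits (max_delta_ns < ~18 s).
	 */
	static uint64_t kvmclock_sync_period_ns(uint64_t drift_ppb, uint64_t max_delta_ns)
	{
		if (!drift_ppb)
			return UINT64_MAX;	/* no measurable drift, never re-sync */

		return max_delta_ns * 1000000000ULL / drift_ppb;
	}

[For example, the ~32 ppb drift in David's logs with a 1 us budget gives a period of roughly 31 seconds.]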
On Wed, Oct 04, 2023, David Woodhouse wrote:
> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> > > Can't we ensure that the kvmclock uses the *same* algorithm,
> > > precisely, as CLOCK_MONOTONIC_RAW?
> > 
> > Yes?  At least for sane hardware, after much staring, I think it's possible.
> > 
> > It's tricky because the two algorithms are weirdly different, the PV clock algorithm
> > is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> > at us for suggesting that we try to shove the pvclock algorithm into the kernel.
> > 
> > The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
> > 
> > Compile tested only, but I believe this math is correct.  And I'm guessing we'd
> > want some safeguards against overflow, e.g. due to a multiplier that is too big.
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 6573c89c35a9..ae9275c3d580 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >  					   v->arch.l1_tsc_scaling_ratio);
> >  
> >  	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> > -		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > -				   &vcpu->hv_clock.tsc_shift,
> > -				   &vcpu->hv_clock.tsc_to_system_mul);
> > +		u32 shift, mult;
> > +
> > +		clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> > +
> > +		if (shift <= 32) {
> > +			vcpu->hv_clock.tsc_shift = 0;
> > +			vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> > +		} else {
> > +			kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > +					   &vcpu->hv_clock.tsc_shift,
> > +					   &vcpu->hv_clock.tsc_to_system_mul);
> > +		}
> > +
> >  		vcpu->hw_tsc_khz = tgt_tsc_khz;
> >  		kvm_xen_update_tsc_info(v);
> >  	}
> 
> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
> it got mult=1655736523, shift=32 and took the 'happy' path instead of
> falling back.
> 
> It still drifts about the same though, using the same test as before:
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock
> 
> I was going to facetiously suggest that perhaps the kvmclock should
> have leap nanoseconds... but then realised that that's basically what
> Dongli's patch is *doing*.  Maybe we just need to *recognise* that,

Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's monotonic raw clock is a fool's errand.

> so rather than having a user-configured period for the update, KVM could
> calculate the frequency for the updates based on the rate at which the clocks
> would otherwise drift, and a maximum delta?  Not my favourite option, but
> perhaps better than nothing?

Holy moly, the existing code for the periodic syncs/updates is a mess.  If I'm reading the code correctly, commits

  0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
  7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
  332967a3eac0 ("x86: kvm: introduce periodic global clock updates")

splattered together an impressively inefficient update mechanism.

On the first vCPU creation, KVM schedules kvmclock_sync_fn() with a hardcoded period of 300 seconds (KVMCLOCK_SYNC_PERIOD is 300 * HZ jiffies).

	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
				      KVMCLOCK_SYNC_PERIOD);

That handler does two things: schedule "delayed" work kvmclock_update_fn() to be executed immediately, and reschedule kvmclock_sync_fn() another 300 seconds out.  kvmclock_sync_fn() then kicks *every* vCPU in the VM, i.e.
KVM kicks every vCPU to sync kvmclock once every 300 seconds.

If we're going to kick every vCPU, then we might as well do a masterclock update, because the extra cost of synchronizing the masterclock is likely in the noise compared to kicking every vCPU.  There's also zero reason to do the work in vCPU context.

And because that's not enough, on pCPU migration or if the TSC is unstable, kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with unstable TSCs.  But if KVM is periodically doing clock updates on all vCPUs, scheduling another update with a *longer* delay is silly.

The really, really stupid part of it all is that the periodic syncs happen even if kvmclock isn't exposed to the guest.  *sigh*

So rather than add yet another periodic work function, I think we should clean up the mess we have, fix the whole "leapseconds" mess with the masterclock, and then tune the frequency (if necessary).

Something like the below is what I'm thinking.  Once the dust settles, I'd like to dynamically enable/disable kvmclock_sync_work based on whether or not the VM actually has vCPUs with a pvclock, but that's definitely an enhancement that can go on top.

Does this look sane, or am I missing something?

---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/x86.c              | 53 +++++++++++----------------------
 2 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 34a64527654c..d108452fc301 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -98,7 +98,7 @@
 	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_SCAN_IOAPIC \
 	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
+/* AVAILABLE BIT!!!!			KVM_ARCH_REQ(16) */
 #define KVM_REQ_APIC_PAGE_RELOAD \
 	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
@@ -1336,7 +1336,6 @@ struct kvm_arch {
 	bool use_master_clock;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
-	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6573c89c35a9..5d35724f1963 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 	}
 
 	vcpu->arch.time = system_time;
-	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
+	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 	/* we verify if the enable bit is set... */
 	if (system_time & 1)
@@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
 
-static void kvmclock_update_fn(struct work_struct *work)
-{
-	unsigned long i;
-	struct delayed_work *dwork = to_delayed_work(work);
-	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
-					   kvmclock_update_work);
-	struct kvm *kvm = container_of(ka, struct kvm, arch);
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-		kvm_vcpu_kick(vcpu);
-	}
-}
-
-static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
-{
-	struct kvm *kvm = v->kvm;
-
-	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
-			      KVMCLOCK_UPDATE_DELAY);
-}
-
 #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
 
 static void kvmclock_sync_fn(struct work_struct *work)
@@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
 					   kvmclock_sync_work);
 	struct kvm *kvm = container_of(ka, struct kvm, arch);
 
-	if (!kvmclock_periodic_sync)
-		return;
+	if (ka->use_master_clock)
+		kvm_update_masterclock(kvm);
+	else
+		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
 
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
-	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
-			      KVMCLOCK_SYNC_PERIOD);
+	if (kvmclock_periodic_sync)
+		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
+				      KVMCLOCK_SYNC_PERIOD);
 }
 
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
@@ -4845,7 +4823,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	 * kvmclock on vcpu->cpu migration
 	 */
 	if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
-		kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	if (vcpu->cpu != cpu)
 		kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
 	vcpu->cpu = cpu;
@@ -10520,12 +10498,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			__kvm_migrate_timers(vcpu);
 		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
 			kvm_update_masterclock(vcpu->kvm);
-		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
-			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
 				goto out;
+
+			/*
+			 * Ensure all other vCPUs synchronize "soon", e.g. so
+			 * that all vCPUs recognize NTP corrections and drift
+			 * corrections (relative to the kernel's raw clock).
+			 */
+			if (!kvmclock_periodic_sync)
+				schedule_delayed_work(&vcpu->kvm->arch.kvmclock_sync_work,
+						      KVMCLOCK_UPDATE_DELAY);
 		}
 		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
 			kvm_mmu_sync_roots(vcpu);
@@ -12345,7 +12330,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
 
-	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
 
 	kvm_apicv_init(kvm);
@@ -12387,7 +12371,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 void kvm_arch_sync_events(struct kvm *kvm)
 {
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
-	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);
 }

base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
--
Hi Sean, On 10/4/23 11:06 AM, Sean Christopherson wrote: > On Wed, Oct 04, 2023, David Woodhouse wrote: >> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote: >>>> Can't we ensure that the kvmclock uses the *same* algorithm, >>>> precisely, as CLOCK_MONOTONIC_RAW? >>> >>> Yes? At least for sane hardware, after much staring, I think it's possible. >>> >>> It's tricky because the two algorithms are wierdly different, the PV clock algorithm >>> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh >>> at us for suggesting that we try to shove the pvclock algorithm into the kernel. >>> >>> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world. >>> >>> Compile tested only, but I believe this math is correct. And I'm guessing we'd >>> want some safeguards against overflow, e.g. due to a multiplier that is too big. >>> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>> index 6573c89c35a9..ae9275c3d580 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) >>> v->arch.l1_tsc_scaling_ratio); >>> >>> if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) { >>> - kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, >>> - &vcpu->hv_clock.tsc_shift, >>> - &vcpu->hv_clock.tsc_to_system_mul); >>> + u32 shift, mult; >>> + >>> + clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600); >>> + >>> + if (shift <= 32) { >>> + vcpu->hv_clock.tsc_shift = 0; >>> + vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift); >>> + } else { >>> + kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL, >>> + &vcpu->hv_clock.tsc_shift, >>> + &vcpu->hv_clock.tsc_to_system_mul); >>> + } >>> + >>> vcpu->hw_tsc_khz = tgt_tsc_khz; >>> kvm_xen_update_tsc_info(v); >>> } >>> >> >> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz >> it got mult=1655736523, shift=32 and took the 'happy' path instead of >> falling back. >> >> It still drifts about the same though, using the same test as before: >> https://urldefense.com/v3/__https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock__;!!ACWV5N9M2RV99hQ!KEdDuRZIThXoz2zaZd3O9rk77ywSaHCQ92fTnc7PFP81bdOhTMvudMBReIfZcrm9AITeKw4kyMTmbPbJuA$ >> >> >> I was going to facetiously suggest that perhaps the kvmclock should >> have leap nanoseconds... but then realised that that's basically what >> Dongli's patch is *doing*. Maybe we just need to *recognise* that, > > Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's > monotonic raw clock is a fool's errand. > >> so rather than having a user-configured period for the update, KVM could >> calculate the frequency for the updates based on the rate at which the clocks >> would otherwise drift, and a maximum delta? Not my favourite option, but >> perhaps better than nothing? > > Holy moly, the existing code for the periodic syncs/updates is a mess. If I'm > reading the code correctly, commits > > 0061d53daf26 ("KVM: x86: limit difference between kvmclock updates") > 7e44e4495a39 ("x86: kvm: rate-limit global clock updates") > 332967a3eac0 ("x86: kvm: introduce periodic global clock updates") > > splattered together an immpressively inefficient update mechanism. > > On the first vCPU creation, KVM schedules kvmclock_sync_fn() at a hardcoded rate > of 300hz. 
> 
> 	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
> 		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> 				      KVMCLOCK_SYNC_PERIOD);
> 
> That handler does two things: schedule "delayed" work kvmclock_update_fn()
> to be executed immediately, and reschedule kvmclock_sync_fn() for another
> 300 seconds out.  kvmclock_sync_fn() then kicks *every* vCPU in the VM,
> i.e. KVM kicks every vCPU to sync kvmclock once every 300 seconds.
> 
> If we're going to kick every vCPU, then we might as well do a masterclock
> update, because the extra cost of synchronizing the masterclock is likely
> in the noise compared to kicking every vCPU.  There's also zero reason to
> do the work in vCPU context.
> 
> And because that's not enough, on pCPU migration or if the TSC is
> unstable, kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which
> schedules kvmclock_update_fn() with a delay of 100ms.  The large delay is
> to play nice with unstable TSCs.  But if KVM is periodically doing clock
> updates on all vCPUs, scheduling another update with a *longer* delay is
> silly.

We may need to add the above explanation at the places where
KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the
patch?  That would help readers understand why KVM_REQ_CLOCK_UPDATE alone
is sometimes enough.

> 
> The really, really stupid part of all this is that the periodic syncs
> happen even if kvmclock isn't exposed to the guest.  *sigh*
> 
> So rather than add yet another periodic work function, I think we should
> clean up the mess we have, fix the whole "leapseconds" mess with the
> masterclock, and then tune the frequency (if necessary).
> 
> Something like the below is what I'm thinking.  Once the dust settles, I'd
> like to dynamically enable/disable kvmclock_sync_work based on whether or
> not the VM actually has vCPUs with a pvclock, but that's definitely an
> enhancement that can go on top.
> 
> Does this look sane, or am I missing something?
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +-
>  arch/x86/kvm/x86.c              | 53 +++++++++++----------------------
>  2 files changed, 19 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 34a64527654c..d108452fc301 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -98,7 +98,7 @@
>  	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_SCAN_IOAPIC \
>  	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> -#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
> +/* AVAILABLE BIT!!!!  KVM_ARCH_REQ(16) */
>  #define KVM_REQ_APIC_PAGE_RELOAD \
>  	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
> @@ -1336,7 +1336,6 @@ struct kvm_arch {
>  	bool use_master_clock;
>  	u64 master_kernel_ns;
>  	u64 master_cycle_now;
> -	struct delayed_work kvmclock_update_work;
>  	struct delayed_work kvmclock_sync_work;
> 
>  	struct kvm_xen_hvm_config xen_hvm_config;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..5d35724f1963 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
>  	}
> 
>  	vcpu->arch.time = system_time;
> -	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> +	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

As mentioned above, we may need a comment here to explain why
KVM_REQ_CLOCK_UPDATE on only this vCPU is enough.

> 
>  	/* we verify if the enable bit is set... 
>  	 */
>  	if (system_time & 1)
> @@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> 
>  #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
> 
> -static void kvmclock_update_fn(struct work_struct *work)
> -{
> -	unsigned long i;
> -	struct delayed_work *dwork = to_delayed_work(work);
> -	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> -					   kvmclock_update_work);
> -	struct kvm *kvm = container_of(ka, struct kvm, arch);
> -	struct kvm_vcpu *vcpu;
> -
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> -		kvm_vcpu_kick(vcpu);
> -	}
> -}
> -
> -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> -{
> -	struct kvm *kvm = v->kvm;
> -
> -	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> -	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> -			      KVMCLOCK_UPDATE_DELAY);
> -}
> -
>  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)

While David mentioned a "maximum delta", how about turning the above into a
module param with a default of 300 * HZ?

BTW, 300 * HZ should be enough for the vCPU hotplug case, unless people
prefer 1 hour or 1 day.

> 
>  static void kvmclock_sync_fn(struct work_struct *work)
> @@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  					   kvmclock_sync_work);
>  	struct kvm *kvm = container_of(ka, struct kvm, arch);
> 
> -	if (!kvmclock_periodic_sync)
> -		return;
> +	if (ka->use_master_clock)
> +		kvm_update_masterclock(kvm);

Based on the source code, I think it is safe to call kvm_update_masterclock()
here.  We want the masterclock to be updated only once; making a
KVM_REQ_MASTERCLOCK_UPDATE request on every vCPU here would be pointless.

I just want to point out that this runs on a shared workqueue, so the
workqueue stall detector may report a false positive (e.g., due to
tsc_write_lock contention, although that lock should not be heavily
contended).

Thank you very much!

Dongli Zhang
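For reference, the guest-side pvclock conversion that constrains Sean's
trick above boils down to the sketch below, mirroring pvclock_scale_delta()
in arch/x86/include/asm/pvclock.h; only the helper name here is invented.
The trailing shift right by 32 is fixed in the ABI, which is why the
mult/shift pair from clocks_calc_mult_shift() only fits when shift <= 32:

	#include <linux/math64.h>
	#include <linux/types.h>

	/*
	 * Sketch of the guest-side pvclock scaling (mirrors
	 * pvclock_scale_delta()).  tsc_shift is applied first, then the
	 * multiply and the ABI-mandated shift right by 32.
	 */
	static inline u64 pvclock_tsc_delta_to_ns(u64 delta,
						  u32 tsc_to_system_mul,
						  s8 tsc_shift)
	{
		if (tsc_shift < 0)
			delta >>= -tsc_shift;
		else
			delta <<= tsc_shift;

		/* 64x32 multiply, keeping bits [95:32] of the product. */
		return mul_u64_u32_shr(delta, tsc_to_system_mul, 32);
	}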
On Wed, Oct 04, 2023, Dongli Zhang wrote:
> > And because that's not enough, on pCPU migration or if the TSC is
> > unstable, kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE,
> > which schedules kvmclock_update_fn() with a delay of 100ms.  The large
> > delay is to play nice with unstable TSCs.  But if KVM is periodically
> > doing clock updates on all vCPUs, scheduling another update with a
> > *longer* delay is silly.
> 
> We may need to add the above explanation at the places where
> KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the
> patch?

Yeah, comments are most definitely needed, this was just intended to be a
quick sketch to get the ball rolling.

> > -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> > -{
> > -	struct kvm *kvm = v->kvm;
> > -
> > -	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> > -	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> > -			      KVMCLOCK_UPDATE_DELAY);
> > -}
> > -
> >  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
> 
> While David mentioned a "maximum delta", how about turning the above into
> a module param with a default of 300 * HZ?
> 
> BTW, 300 * HZ should be enough for the vCPU hotplug case, unless people
> prefer 1 hour or 1 day.

Hmm, I think I agree with David that it would be better if KVM can take care
of the gory details and promise a certain level of accuracy.  I'm usually a
fan of punting complexity to userspace, but requiring every userspace to
figure out the ideal sync frequency on every platform is more than a bit
unfriendly.  And it might not even be realistic unless userspace makes
assumptions about how the kernel computes CLOCK_MONOTONIC_RAW from TSC
cycles.

 : so rather than having a user-configured period for the update, KVM could
 : calculate the frequency for the updates based on the rate at which the
 : clocks would otherwise drift, and a maximum delta? Not my favourite
 : option, but perhaps better than nothing?
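To make the quoted suggestion concrete, a minimal sketch of deriving the
sync period from a drift rate and a maximum tolerated delta might look like
the following; kvmclock_sync_period(), max_delta_ns and drift_ppb are all
hypothetical names, not existing KVM code:

	#include <linux/jiffies.h>
	#include <linux/math64.h>
	#include <linux/time64.h>

	/*
	 * Hypothetical: pick the resync period so that, at a worst-case
	 * drift of @drift_ppb nanoseconds per second between kvmclock and
	 * CLOCK_MONOTONIC_RAW, the accumulated error never exceeds
	 * @max_delta_ns.
	 */
	static unsigned long kvmclock_sync_period(u64 max_delta_ns,
						  u64 drift_ppb)
	{
		u64 period_ns;

		if (!drift_ppb)
			return MAX_JIFFY_OFFSET;	/* no drift, never resync */

		/* Time (in ns) for the accumulated error to reach the cap. */
		period_ns = div64_u64(max_delta_ns * NSEC_PER_SEC, drift_ppb);

		return nsecs_to_jiffies(period_ns);
	}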
On Tue, 2023-10-10 at 17:20 -0700, Sean Christopherson wrote:
> Hmm, I think I agree with David that it would be better if KVM can take
> care of the gory details and promise a certain level of accuracy.  I'm
> usually a fan of punting complexity to userspace, but requiring every
> userspace to figure out the ideal sync frequency on every platform is
> more than a bit unfriendly.  And it might not even be realistic unless
> userspace makes assumptions about how the kernel computes
> CLOCK_MONOTONIC_RAW from TSC cycles.

I think perhaps I would rather save up my persuasiveness on the topic of
"let's not make things too awful for userspace to cope with" for the live
update/migration mess. I think I need to dust off that attempt at fixing
our 'how to migrate with clocks intact' documentation from
https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/

The changes we're discussing here obviously have an effect on migration too.

Where the host TSC is actually reliable, I would really prefer for the
kvmclock to just be a fixed function of the guest TSC and *not* to be
arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.

That makes the process of migrating to another machine (potentially with a
different host TSC frequency) a whole lot simpler. Userspace just needs to
know two things from the kernel:

 • The relationship between the guest's TSC and its kvmclock (which we
   helpfully advertise to the *guest* in a pvclock structure, and can give
   to userspace too).

 • The value of *either* the guest's TSC or its kvmclock, at a given value
   of CLOCK_TAI (not CLOCK_REALTIME). Theoretically, this can be either the
   TSC or the kvmclock, but I think for best precision it would need to be
   the TSC?

I am aware that I glibly said "guest TSC" as if there's only one, or they're
all in sync. I think we can substitute "the TSC of guest vCPU0" in the case
of migration. We don't want a guest to be able to change its kvmclock by
writing to vCPU0's TSC though, so we may need to keep (and migrate) an
additional offset value for tracking that?

[1] Yes, I believe "back" does happen. I have test failures in my queue to
look at, where guests see the "Xen" clock going backwards.
On Wed, Oct 11, 2023, David Woodhouse wrote:
> I think perhaps I would rather save up my persuasiveness on the topic of
> "let's not make things too awful for userspace to cope with" for the live
> update/migration mess. I think I need to dust off that attempt at fixing
> our 'how to migrate with clocks intact' documentation from
> https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/
> The changes we're discussing here obviously have an effect on migration
> too.
> 
> Where the host TSC is actually reliable, I would really prefer for the
> kvmclock to just be a fixed function of the guest TSC and *not* to be
> arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.

CLOCK_MONOTONIC_RAW!  Just wanted to clarify because if kvmclock were tied
to the non-raw clock, then we'd have to somehow reconcile host NTP updates.

I generally support the idea, but I think it needs to be an opt-in from
userspace.  Essentially a "I pinky swear to give all vCPUs the same TSC
frequency, to not suspend the host, and to not run software/firmware that
writes IA32_TSC_ADJUST".  AFAICT, there are too many edge cases and
assumptions about userspace for KVM to safely couple kvmclock to guest TSC
by default.

> [1] Yes, I believe "back" does happen. I have test failures in my queue
> to look at, where guests see the "Xen" clock going backwards.

Yeah, I assume "back" can happen based purely on the weirdness of the
pvclock math.

What if we add a module param to disable KVM's TSC synchronization craziness
entirely?  If we first clean up the periodic sync mess, then it seems like
it'd be relatively straightforward to kill off all of the synchronization,
including the synchronization of kvmclock to the host's TSC-based
CLOCK_MONOTONIC_RAW.

Not intended to be a functional patch... 
---
 arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5b2104bdd99f..75fc6cbaef0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 static bool __read_mostly kvmclock_periodic_sync = true;
 module_param(kvmclock_periodic_sync, bool, S_IRUGO);
 
+static bool __read_mostly enable_tsc_sync = true;
+module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
+
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
@@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool matched = false;
 	bool synchronizing = false;
 
+	if (!enable_tsc_sync) {
+		offset = kvm_compute_l1_tsc_offset(vcpu, data);
+		kvm_vcpu_write_tsc_offset(vcpu, offset);
+		return;
+	}
+
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
@@ -2967,9 +2976,12 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
 
-	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-			     && !ka->backwards_tsc_observed
-			     && !ka->boot_vcpu_runs_old_kvmclock;
+	WARN_ON_ONCE(!host_tsc_clocksource && !enable_tsc_sync);
+
+	ka->use_master_clock = host_tsc_clocksource &&
+			       (vcpus_matched || !enable_tsc_sync) &&
+			       !ka->backwards_tsc_observed &&
+			       !ka->boot_vcpu_runs_old_kvmclock;
 
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
@@ -3278,6 +3290,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
 
 void kvm_adjust_pv_clock_users(struct kvm *kvm, bool add_user)
 {
+	if (!enable_tsc_sync)
+		return;
+
 	/*
 	 * Doesn't need to be a spinlock, but can't be kvm->lock as this is
 	 * call while holding a vCPU's mutext.
@@ -5528,6 +5543,11 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		if (get_user(offset, uaddr))
 			break;
 
+		if (!enable_tsc_sync) {
+			kvm_vcpu_write_tsc_offset(vcpu, offset);
+			break;
+		}
+
 		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 
 		matched = (vcpu->arch.virtual_tsc_khz &&
@@ -12188,6 +12208,9 @@ int kvm_arch_hardware_enable(void)
 	if (ret != 0)
 		return ret;
 
+	if (!enable_tsc_sync)
+		return 0;
+
 	local_tsc = rdtsc();
 	stable = !kvm_check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {
@@ -13670,6 +13693,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
 
 static int __init kvm_x86_init(void)
 {
+	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+		enable_tsc_sync = true;
+
+	if (!enable_tsc_sync)
+		kvmclock_periodic_sync = false;
+
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
 	return 0;

base-commit: 7d2edad0beb2a6f07f6e6c2d477d5874f5417d6c
--
On Fri, 2023-10-13 at 11:07 -0700, Sean Christopherson wrote:
> On Wed, Oct 11, 2023, David Woodhouse wrote:
> > Where the host TSC is actually reliable, I would really prefer for the
> > kvmclock to just be a fixed function of the guest TSC and *not* to be
> > arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.
> 
> CLOCK_MONOTONIC_RAW!  Just wanted to clarify because if kvmclock were tied
> to the non-raw clock, then we'd have to somehow reconcile host NTP updates.

Sorry, yes. CLOCK_MONOTONIC_RAW. That was just a typo in email. Of course
we'd never try to use CLOCK_MONOTONIC; that would be daft. We'd never do
that. Just like we'd never do something daft like using CLOCK_REALTIME
instead of CLOCK_TAI for the KVM_CLOCK_REALTIME pairing in
__get_kvmclock()... oh. Shit.

> I generally support the idea, but I think it needs to be an opt-in from
> userspace.  Essentially a "I pinky swear to give all vCPUs the same TSC
> frequency, to not suspend the host, and to not run software/firmware that
> writes IA32_TSC_ADJUST".  AFAICT, there are too many edge cases and
> assumptions about userspace for KVM to safely couple kvmclock to guest
> TSC by default.

I think IA32_TSC_ADJUST is OK, isn't it? There is a "real" TSC value and if
vCPUs adjust themselves forward and backwards from that, it's just handled
as a delta.

And we solved 'give all vCPUs the same TSC frequency' by making that
KVM-wide.

Maybe suspending and resuming the host can be treated like live migration,
where you know the host TSC is different so you have to make do with a
delta based on CLOCK_TAI.

But while I'm picking on the edge cases and suggesting that we *can* cope
with some of them, I do agree with your suggestion that "let kvmclock run
by itself without being clamped back to CLOCK_MONOTONIC_RAW" should be an
opt *in* feature. 
> > [1] Yes, I believe "back" does happen. I have test failures in my queue
> > to look at, where guests see the "Xen" clock going backwards.
> 
> Yeah, I assume "back" can happen based purely on the weirdness of the
> pvclock math.
> 
> What if we add a module param to disable KVM's TSC synchronization
> craziness entirely?  If we first clean up the periodic sync mess, then it
> seems like it'd be relatively straightforward to kill off all of the
> synchronization, including the synchronization of kvmclock to the host's
> TSC-based CLOCK_MONOTONIC_RAW.
> 
> Not intended to be a functional patch...

Will stare harder at the actual patch when it isn't Friday night.

In the meantime, I do think a KVM cap that the VMM opts into is better than
a module param?
On Fri, Oct 13, 2023, David Woodhouse wrote:
> On Fri, 2023-10-13 at 11:07 -0700, Sean Christopherson wrote:
> > I generally support the idea, but I think it needs to be an opt-in from
> > userspace.  Essentially a "I pinky swear to give all vCPUs the same TSC
> > frequency, to not suspend the host, and to not run software/firmware
> > that writes IA32_TSC_ADJUST".  AFAICT, there are too many edge cases
> > and assumptions about userspace for KVM to safely couple kvmclock to
> > guest TSC by default.
> 
> I think IA32_TSC_ADJUST is OK, isn't it? There is a "real" TSC value
> and if vCPUs adjust themselves forward and backwards from that, it's
> just handled as a delta.

I meant the host writing IA32_TSC_ADJUST.  E.g. if a host SMM handler mucks
with TSC offsets to try and hide the time spent in the SMM handler, then the
platform owner gets to keep the pieces.

> And we solved 'give all vCPUs the same TSC frequency' by making that
> KVM-wide.
> 
> Maybe suspending and resuming the host can be treated like live
> migration, where you know the host TSC is different so you have to make
> do with a delta based on CLOCK_TAI.
> 
> But while I'm picking on the edge cases and suggesting that we *can*
> cope with some of them, I do agree with your suggestion that "let
> kvmclock run by itself without being clamped back to
> CLOCK_MONOTONIC_RAW" should be an opt *in* feature.

Yeah, I'm of the mind that just because we can cope with some edge cases,
doesn't mean we should.  At this point, kvmclock really should be considered
deprecated on modern hardware.  I.e. needs to be supported for older VMs,
but shouldn't be advertised/used when creating entirely new VMs.

Hence my desire to go with a low effort solution for getting kvmclock to
play nice with modern hardware.

> Will stare harder at the actual patch when it isn't Friday night.
> 
> In the meantime, I do think a KVM cap that the VMM opts into is better
> than a module param?

Hmm, yeah, I think a capability would be cleaner overall.  Then KVM could
return -EINVAL instead of silently forcing synchronization if the platform
conditions aren't met, e.g. if the TSC isn't constant or if the host
timekeeping isn't using TSC.

The interaction with kvmclock_periodic_sync might be a bit awkward, but
that's easy enough to solve with a wrapper.
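Such a wrapper could be as simple as the sketch below, where
kvm->arch.tsc_coupled_clock stands in for a hypothetical per-VM flag set
when userspace enables the (equally hypothetical) capability; none of this
is existing code:

	/*
	 * Hypothetical: periodic sync stays governed by the module param,
	 * but is skipped for VMs that opted in to a TSC-coupled kvmclock.
	 */
	static bool kvm_use_kvmclock_periodic_sync(struct kvm *kvm)
	{
		return kvmclock_periodic_sync && !kvm->arch.tsc_coupled_clock;
	}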
On Fri, 2023-10-13 at 12:02 -0700, Sean Christopherson wrote:
> I meant the host writing IA32_TSC_ADJUST.  E.g. if a host SMM handler
> mucks with TSC offsets to try and hide the time spent in the SMM handler,
> then the platform owner gets to keep the pieces.

Oh $DEITY yes, absolutely.

> Yeah, I'm of the mind that just because we can cope with some edge cases,
> doesn't mean we should.  At this point, kvmclock really should be
> considered deprecated on modern hardware.  I.e. needs to be supported for
> older VMs, but shouldn't be advertised/used when creating entirely new
> VMs.
> 
> Hence my desire to go with a low effort solution for getting kvmclock to
> play nice with modern hardware.

Yeah... although the kvmclock is also the *Xen* clock (and the clock on
which Xen timers are based). So while I'm perfectly prepared to call those
Xen guests "older VMs", I do still have to launch quite a lot of new ones
the same... :)

> Hmm, yeah, I think a capability would be cleaner overall.  Then KVM could
> return -EINVAL instead of silently forcing synchronization if the
> platform conditions aren't met, e.g. if the TSC isn't constant or if the
> host timekeeping isn't using TSC.

Right.

> The interaction with kvmclock_periodic_sync might be a bit awkward, but
> that's easy enough to solve with a wrapper.

At least that's all per-KVM already. 
We do also still need to deal with the mess of having a single system-wide kvm_guest_has_master_clock and different KVMs explicitly setting that to 1 or 0, don't we?
On Fri, Oct 13, 2023, David Woodhouse wrote:
> Yeah... although the kvmclock is also the *Xen* clock (and the clock on
> which Xen timers are based). So while I'm perfectly prepared to call
> those Xen guests "older VMs", I do still have to launch quite a lot of
> new ones the same... :)

Heh, yeah, by "new" I meant "new shapes/classes/types of VMs", not simply
"new instances of existing VM types".

> At least that's all per-KVM already. We do also still need to deal with
> the mess of having a single system-wide kvm_guest_has_master_clock and
> different KVMs explicitly setting that to 1 or 0, don't we?

Hmm, I think the simplest way to handle kvm_guest_has_master_clock would be
to have KVM check that the host clocksource is TSC-based when enabling the
capability, but define the behavior to be that once kvmclock is tied to the
TSC, it's *always* tied to the TSC, even if the host switches to a different
clocksource.  Then VMs for which kvmclock is tied to the TSC can simply not
set kvm_guest_has_master_clock and be skipped by pvclock_gtod_update_fn().

Side topic, I have no idea why that thing is an atomic.  It's just a flag
that tracks whether at least one VM is using the masterclock, and its only
usage is to disable all masterclocks if the host stops using the TSC as its
clocksource for whatever reason.  It really should just be a simple bool
that's accessed with {READ,WRITE}_ONCE().
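A sketch of that side note, under the assumption that the flag never needs
atomic read-modify-write semantics (it is only ever read or overwritten
whole); the helper names are invented for illustration:

	/* Plain flag instead of the existing atomic_t. */
	static bool kvm_guest_has_master_clock __read_mostly;

	/* Set when a VM starts using the masterclock, e.g. from
	 * pvclock_update_vm_gtod_copy(). */
	static void kvm_mark_masterclock_in_use(void)
	{
		WRITE_ONCE(kvm_guest_has_master_clock, true);
	}

	/* Read from the pvclock gtod notifier to decide whether any
	 * masterclock must be disabled after a host clocksource change. */
	static bool kvm_any_masterclock_in_use(void)
	{
		return READ_ONCE(kvm_guest_has_master_clock);
	}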
Hi Sean,

On 10/13/23 11:07, Sean Christopherson wrote:
> On Wed, Oct 11, 2023, David Woodhouse wrote:
>> [1] Yes, I believe "back" does happen. I have test failures in my queue
>> to look at, where guests see the "Xen" clock going backwards.
>
> Yeah, I assume "back" can happen based purely on the weirdness of the
> pvclock math.
>
> What if we add a module param to disable KVM's TSC synchronization
> craziness entirely?  If we first clean up the periodic sync mess, then it
> seems like it'd be relatively straightforward to kill off all of the
> synchronization, including the synchronization of kvmclock to the host's
> TSC-based CLOCK_MONOTONIC_RAW.
>
> Not intended to be a functional patch... 
> 
> ---
>  arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5b2104bdd99f..75fc6cbaef0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  static bool __read_mostly kvmclock_periodic_sync = true;
>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
> 
> +static bool __read_mostly enable_tsc_sync = true;
> +module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
> +
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> @@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	bool matched = false;
>  	bool synchronizing = false;
> 
> +	if (!enable_tsc_sync) {
> +		offset = kvm_compute_l1_tsc_offset(vcpu, data);
> +		kvm_vcpu_write_tsc_offset(vcpu, offset);
> +		return;
> +	}

TBH, I do not like this idea, for two reasons.

1. A primary part of my work is to resolve kernel issues.  When debugging a
clock drift issue, it is very helpful to see that all vCPUs have the same
vcpu->arch.tsc_offset in the coredump or in the vCPU debugfs.

This patch may lead to vCPUs added at different times having different
vcpu->arch.tsc_offset values.

2. Suppose the KVM host has been running for a long time; couldn't the
drift between two domains accumulate to a very large value?  (Even if it
may not cause anything bad immediately.)

If the objective is to avoid masterclock updates, how about the below,
which I copied from my prior diagnostic kernel used to debug this issue?

The idea is to never update the masterclock if the TSC is stable (and the
masterclock is already in use).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b443b9bf562..630f18524000 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2035,6 +2035,9 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
 	bool host_tsc_clocksource, vcpus_matched;
+	bool was_master_clock = ka->use_master_clock;
+	u64 master_kernel_ns;
+	u64 master_cycle_now;
 
 	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
 			 atomic_read(&kvm->online_vcpus));
@@ -2044,13 +2047,18 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	 * to the guest.
 	 */
 	host_tsc_clocksource = kvm_get_time_and_clockread(
-					&ka->master_kernel_ns,
-					&ka->master_cycle_now);
+					&master_kernel_ns,
+					&master_cycle_now);
 
 	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
 			     && !ka->backwards_tsc_observed
 			     && !ka->boot_vcpu_runs_old_kvmclock;
 
+	if (!was_master_clock && ka->use_master_clock) {
+		ka->master_kernel_ns = master_kernel_ns;
+		ka->master_cycle_now = master_cycle_now;
+	}
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);

That is, always re-use the same values in ka->master_kernel_ns and
ka->master_cycle_now from VM creation onward.

Thank you very much! 
Dongli Zhang
On Fri, Oct 13, 2023, Dongli Zhang wrote:
> On 10/13/23 11:07, Sean Christopherson wrote:
> > What if we add a module param to disable KVM's TSC synchronization
> > craziness entirely?  If we first clean up the periodic sync mess, then
> > it seems like it'd be relatively straightforward to kill off all of the
> > synchronization, including the synchronization of kvmclock to the
> > host's TSC-based CLOCK_MONOTONIC_RAW.
> >
> > @@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> >  	bool matched = false;
> >  	bool synchronizing = false;
> > 
> > +	if (!enable_tsc_sync) {
> > +		offset = kvm_compute_l1_tsc_offset(vcpu, data);
> > +		kvm_vcpu_write_tsc_offset(vcpu, offset);
> > +		return;
> > +	}
> 
> TBH, I do not like this idea, for two reasons.
> 
> 1. A primary part of my work is to resolve kernel issues.  When debugging
> a clock drift issue, it is very helpful to see that all vCPUs have the
> same vcpu->arch.tsc_offset in the coredump or in the vCPU debugfs.
> 
> This patch may lead to vCPUs added at different times having different
> vcpu->arch.tsc_offset values.

The expectation is that *if* userspace disables TSC synchronization, then
userspace would be responsible for setting the guest TSC offset directly.
And disabling TSC synchronization would be fully opt-in, i.e. we'd fix the
existing masterclock sync issues first.

> 2. Suppose the KVM host has been running for a long time; couldn't the
> drift between two domains accumulate to a very large value?  (Even if it
> may not cause anything bad immediately.)

That already happens today, e.g. unless the host does vCPU hotplug or is
using XEN's shared info page, masterclock updates effectively never happen.
And I'm not aware of a single bug report of someone complaining that
kvmclock has drifted from the host clock.  The only bug reports we have are
when KVM triggers an update and causes time to jump from the guest's
perspective.

If the guest needs accurate timekeeping, it's all but guaranteed to be
using NTP or an equivalent.  I.e. the imprecision that's inherent in the
pvclock ABI shouldn't be problematic for any guest that actually cares.

> If the objective is to avoid masterclock updates, how about the below,
> which I copied from my prior diagnostic kernel used to debug this issue?
> 
> The idea is to never update the masterclock if the TSC is stable (and the
> masterclock is already in use).

That's another option, but if there are no masterclock updates, then it
suffers the exact same (theoretical) problem as #2.  And there are real
downsides, e.g. 
defining when KVM would synchronize kvmclock with the host clock would be significantly harder, as we'd have to capture all the possible ways that KVM could (temporarily) disable the masterclock and start synchronizing.
On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@google.com> wrote:
>> 2. Suppose the KVM host has been running for a long time; couldn't the
>> drift between two domains accumulate to a very large value?  (Even if it
>> may not cause anything bad immediately.)
>
> That already happens today, e.g. unless the host does vCPU hotplug or is
> using XEN's shared info page, masterclock updates effectively never
> happen.  And I'm not aware of a single bug report of someone complaining
> that kvmclock has drifted from the host clock.  The only bug reports we
> have are when KVM triggers an update and causes time to jump from the
> guest's perspective.

I've got reports about the Xen clock going backwards, and also about it
drifting over time w.r.t. the guest's TSC clocksource, so the watchdog in
the guest declares its TSC clocksource unstable.

I don't understand *why* we update the masterclock when we populate the Xen
shared info. Or add a vCPU, for that matter.

>> The idea is to never update the masterclock if the TSC is stable (and
>> the masterclock is already in use).
>
> That's another option, but if there are no masterclock updates, then it
> suffers the exact same (theoretical) problem as #2.  And there are real
> downsides, e.g. defining when KVM would synchronize kvmclock with the
> host clock would be significantly harder...

I thought the definition of such an approach would be that we *never*
resync the kvmclock to anything. It's based purely on the TSC value when
the guest started, and the TSC frequency. The pvclock we advertise to all
vCPUs would be the same, and would *never* change except on migration.

(I guess that for consistency we would scale first to the *guest* TSC and
from that to nanoseconds.)

If userspace does anything which makes that become invalid, userspace gets
to keep both pieces. That includes userspace having to deal with host
suspend like migration, etc.
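A minimal sketch of that "fixed function of the guest TSC" model, assuming
a hypothetical kvm->arch.fixed_hv_clock field and reusing the existing
kvm_get_time_scale() helper; the function and field names are illustrative
only:

	/*
	 * Compute the pvclock parameters once, from the guest TSC frequency
	 * and the guest TSC value observed at VM creation, and never touch
	 * them again.
	 */
	static void kvm_init_fixed_kvmclock(struct kvm *kvm, u32 guest_tsc_khz,
					    u64 guest_base_tsc)
	{
		struct pvclock_vcpu_time_info *hvc = &kvm->arch.fixed_hv_clock;

		/* Scale guest TSC cycles to nanoseconds, once. */
		kvm_get_time_scale(NSEC_PER_SEC, guest_tsc_khz * 1000LL,
				   &hvc->tsc_shift, &hvc->tsc_to_system_mul);

		/* kvmclock reads zero at the guest TSC value seen at creation. */
		hvc->tsc_timestamp = guest_base_tsc;
		hvc->system_time = 0;
	}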
Hi David and Sean,

On 10/14/23 02:49, David Woodhouse wrote:
> On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@google.com> wrote:
>> That already happens today, e.g. unless the host does vCPU hotplug or is
>> using XEN's shared info page, masterclock updates effectively never
>> happen.  And I'm not aware of a single bug report of someone complaining
>> that kvmclock has drifted from the host clock.  The only bug reports we
>> have are when KVM triggers an update and causes time to jump from the
>> guest's perspective.
>
> I've got reports about the Xen clock going backwards, and also about it
> drifting over time w.r.t. the guest's TSC clocksource, so the watchdog in
> the guest declares its TSC clocksource unstable.

I assume you meant Xen on KVM (not a Xen guest on the Xen hypervisor).
According to my brief review of the Xen hypervisor code, it looks like it
uses the same algorithm to calculate the clock on the hypervisor side as in
the Xen guest.

Fortunately, "tsc=reliable" may disable the watchdog, but I have no idea
whether that has any impact on Xen on KVM.

> I don't understand *why* we update the masterclock when we populate the
> Xen shared info. Or add a vCPU, for that matter.
>
>> The idea is to never update the masterclock if the TSC is stable (and
>> the masterclock is already in use).
>
> I thought the definition of such an approach would be that we *never*
> resync the kvmclock to anything. It's based purely on the TSC value when
> the guest started, and the TSC frequency. The pvclock we advertise to all
> vCPUs would be the same, and would *never* change except on migration.
>
> (I guess that for consistency we would scale first to the *guest* TSC and
> from that to nanoseconds.)
>
> If userspace does anything which makes that become invalid, userspace
> gets to keep both pieces. That includes userspace having to deal with
> host suspend like migration, etc.

Suppose we are discussing a non-permanent solution. I would suggest:

1. Document something to accept that kvm-clock (or pvclock on KVM,
including Xen on KVM) is not good enough in some cases, e.g., vCPU hotplug.

2. Do not rely on any userspace change, so that the solution is easier to
apply to existing environments running old KVM versions.

That is, limit the change to KVM.

3. The options would be to (1) stop updating the masterclock in the ideal
scenario (e.g., stable TSC), or to (2) refresh periodically to minimize the
drift. Or there may be a better option...

Thank you very much!

Dongli Zhang
On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
> I assume you meant Xen on KVM (not a Xen guest on the Xen hypervisor).
> According to my brief review of the Xen hypervisor code, it looks like it
> uses the same algorithm to calculate the clock on the hypervisor side as
> in the Xen guest.

Right. It's *exactly* the same thing. Even the same pvclock ABI in the way
it's exposed to the guest (in the KVM case via the MSR, in the Xen case
it's in the vcpu_info or a separate vcpu_time_info set up by Xen
hypercalls).

> Fortunately, "tsc=reliable" may disable the watchdog, but I have no idea
> whether that has any impact on Xen on KVM.

Right. I think Linux as a KVM guest automatically disables the watchdog, or
at least refuses to use the KVM clock as the watchdog for the TSC
clocksource?

Xen guests, on the other hand, aren't used to the Xen clock being as
unreliable as the KVM clock is, so they *do* use it as a watchdog for the
TSC clocksource.

>> I don't understand *why* we update the masterclock when we populate the
>> Xen shared info. Or add a vCPU, for that matter.

Still don't...

> Suppose we are discussing a non-permanent solution. I would suggest:
>
> 1. Document something to accept that kvm-clock (or pvclock on KVM,
> including Xen on KVM) is not good enough in some cases, e.g., vCPU
> hotplug.

I still don't understand the vCPU hotplug case.

In the case where the TSC is actually sane, why would we need to reset the
masterclock on vCPU hotplug? 
The new vCPU gets its TSC synchronised to the others, and its kvmclock
parameters (mul/shift/offset based on the guest TSC) can be *precisely* the
same as the other vCPUs' too, can't they? Why reset anything?

> 2. Do not rely on any userspace change, so that the solution is easier to
> apply to existing environments running old KVM versions.
>
> That is, limit the change to KVM.
>
> 3. The options would be to (1) stop updating the masterclock in the ideal
> scenario (e.g., stable TSC), or to (2) refresh periodically to minimize
> the drift.

If the host TSC is sane, just *never* update the KVM masterclock. It
"drifts" w.r.t. the host CLOCK_MONOTONIC_RAW and nobody will ever care.

The only opt-in we need from userspace for that is to promise that the host
TSC will never get mangled, isn't it?

(We probably want to be able to export the pvclock information to userspace
(in terms of the mul/shift/offset from the host TSC to the guest TSC and
then the mul/shift/offset to kvmclock). Userspace may want to make things
like the PIT/HPET/PM timer run on that clock.)
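If that export ever materializes, the uAPI might carry something like the
sketch below; the structure and its name are entirely hypothetical, and
only the pvclock fields correspond to the existing guest-visible ABI:

	/* Hypothetical payload for a KVM_GET_PVCLOCK_INFO ioctl; nothing
	 * like this exists today. */
	struct kvm_pvclock_info {
		/* Host TSC -> guest TSC: scaling ratio plus offset. */
		__u64 tsc_scaling_ratio;	/* fixed-point ratio */
		__s64 tsc_offset;
		/* Guest TSC -> kvmclock: the pvclock mul/shift/base triple. */
		__u32 tsc_to_system_mul;
		__s8  tsc_shift;
		__u8  pad[3];
		__u64 tsc_timestamp;
		__u64 system_time;
	};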
Hi David,

On 10/16/23 09:25, David Woodhouse wrote:
> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
>> Hi David and Sean,
>>
>> On 10/14/23 02:49, David Woodhouse wrote:
>>> On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@google.com> wrote:
>>>>> 2. Suppose the KVM host has been running for a long time; wouldn't the drift
>>>>> between the two domains accumulate to something super large? (Even if it may
>>>>> not introduce anything bad immediately.)
>>>>
>>>> That already happens today, e.g. unless the host does vCPU hotplug or is using
>>>> XEN's shared info page, masterclock updates effectively never happen. And I'm
>>>> not aware of a single bug report of someone complaining that kvmclock has drifted
>>>> from the host clock. The only bug reports we have are when KVM triggers an update
>>>> and causes time to jump from the guest's perspective.
>>>
>>> I've got reports about the Xen clock going backwards, and also
>>> about it drifting over time w.r.t. the guest's TSC clocksource, so
>>> the watchdog in the guest declares its TSC clocksource unstable.
>>
>> I assume you meant Xen on KVM (not a Xen guest on the Xen hypervisor). According
>> to my brief review of the Xen hypervisor code, it looks like it uses the same
>> algorithm to calculate the clock on the hypervisor side as in the Xen guest.
>
> Right. It's *exactly* the same thing. Even the same pvclock ABI in the
> way it's exposed to the guest (in the KVM case via the MSR, in the Xen
> case it's in the vcpu_info or a separate vcpu_time_info set up by Xen
> hypercalls).
>
>> Fortunately, "tsc=reliable" may disable the watchdog, but I have no idea if
>> it impacts Xen on KVM.
>
> Right. I think Linux as a KVM guest automatically disables the
> watchdog, or at least refuses to use the KVM clock as the watchdog for
> the TSC clocksource?

You may refer to the below commit, which disables the watchdog for the TSC when it is reliable.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b50db7095fe002fa3e16605546cba66bf1b68a3e

>
> Xen guests, on the other hand, aren't used to the Xen clock being as
> unreliable as the KVM clock is, so they *do* use it as a watchdog for
> the TSC clocksource.
>
>>> I don't understand *why* we update the masterclock when we populate
>>> the Xen shared info. Or add a vCPU, for that matter.
>
> Still don't...

I do not have much knowledge of Xen-on-KVM. I assume it and kvmclock are similar things.

The question is: why update the master clock when adding a new vCPU (e.g., via QEMU)? The behavior is already in the source code, and TBH, I do not know why it is there.

Just to explain the source code, taking QEMU + KVM as an example:

1. QEMU adds a new vCPU to the running guest.

2. QEMU userspace triggers KVM's kvm_synchronize_tsc() via ioctl:

kvm_synchronize_tsc() --> __kvm_synchronize_tsc() --> kvm_track_tsc_matching()

The above tries to sync the TSC, and finally leaves KVM_REQ_MASTERCLOCK_UPDATE pending for the new vCPU.

3. The guest side onlines the new vCPU, either via a udev rule (automatically) or via sysfs (a manual echo).

4. Once the vCPU is onlined, it starts running on the KVM side. KVM processes KVM_REQ_MASTERCLOCK_UPDATE before entering guest mode.

5. The handler of KVM_REQ_MASTERCLOCK_UPDATE updates the master clock.

>>>>> The idea is to never update the master clock if the TSC is stable (and the
>>>>> masterclock is already used).
>>>>
>>>> That's another option, but if there are no masterclock updates, then it suffers
>>>> the exact same (theoretical) problem as #2. And there are real downsides, e.g.
>>>> defining when KVM would synchronize kvmclock with the host clock would be
>>>> significantly harder...
>>>
>>> I thought the definition of such an approach would be that we
>>> *never* resync the kvmclock to anything. It's based purely on the
>>> TSC value when the guest started, and the TSC frequency. The
>>> pvclock we advertise to all vCPUs would be the same, and would
>>> *never* change except on migration.
>>>
>>> (I guess that for consistency we would scale first to the *guest*
>>> TSC and from that to nanoseconds.)
>>>
>>> If userspace does anything which makes that become invalid,
>>> userspace gets to keep both pieces. That includes userspace having
>>> to deal with host suspend like migration, etc.
>>
>> Supposing we are discussing a non-permanent solution, I would suggest:
>>
>> 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
>> on KVM) is not good enough in some cases, e.g., vCPU hotplug.
>
> I still don't understand the vCPU hotplug case.
>
> In the case where the TSC is actually sane, why would we need to reset
> the masterclock on vCPU hotplug?
>
> The new vCPU gets its TSC synchronised to the others, and its kvmclock
> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
> the same as the other vCPUs' too, can't they? Why reset anything?

While I understand how the source code works, I do not know why it is that way.

I shared the patch below from my prior diagnostic kernel; it avoids updating the master clock if the masterclock is already in use and the TSC is stable.

https://lore.kernel.org/kvm/cf2b22fc-78f5-dfb9-f0e6-5c4059a970a2@oracle.com/

>
>> 2. Do not rely on any userspace change, so that the solution can be easier to
>> apply to existing environments running old KVM versions.
>>
>> That is, limit the change to within KVM.
>>
>> 3. The options would be to (1) stop updating the masterclock in the ideal scenario
>> (e.g., stable TSC), or to (2) refresh it periodically to minimize the drift.
>
> If the host TSC is sane, just *never* update the KVM masterclock. It
> "drifts" w.r.t. the host CLOCK_MONOTONIC_RAW and nobody will ever care.

I agree it is one of those two options, although I prefer option 2 over option 1:

1. Do not update the master clock at all.
2. Refresh the master clock periodically.

>
> The only opt-in we need from userspace for that is a promise that the
> host TSC will never get mangled, isn't it?

Regarding QEMU, I assume you meant either: (1) -cpu host,+invtsc (on the QEMU command line), or (2) tsc=reliable (on the guest kernel command line).

>
> (We probably want to be able to export the pvclock information to
> userspace (in terms of the mul/shift/offset from host TSC to guest TSC
> and then the mul/shift/offset to kvmclock). Userspace may want to make
> things like the PIT/HPET/PMtimer run on that clock.)
>

Thank you very much!

Dongli Zhang
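The five-step hotplug flow described in the message above hinges on KVM's deferred request mechanism: the TSC sync path only marks a per-vCPU request bit, and the actual masterclock refresh happens when the vCPU next enters the guest. A condensed sketch of the two sides, loosely based on arch/x86/kvm/x86.c (the handler name and exact placement vary across kernel versions, so treat this as an outline rather than current source):

/* Step 2 (producer): __kvm_synchronize_tsc() ends in
 * kvm_track_tsc_matching(), which only marks the request -- nothing
 * is recomputed at this point. */
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

/* Steps 4-5 (consumer): vcpu_enter_guest() drains pending requests
 * before entering guest mode, so the refresh happens when the newly
 * onlined vCPU first runs. */
if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
        kvm_update_masterclock(vcpu->kvm);      /* recomputes the VM's
                                                 * master_kernel_ns /
                                                 * master_cycle_now pair */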
On Mon, Oct 16, 2023, David Woodhouse wrote:
> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
> > Supposing we are discussing a non-permanent solution, I would suggest:
> >
> > 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
> > on KVM) is not good enough in some cases, e.g., vCPU hotplug.
>
> I still don't understand the vCPU hotplug case.
>
> In the case where the TSC is actually sane, why would we need to reset
> the masterclock on vCPU hotplug?
>
> The new vCPU gets its TSC synchronised to the others, and its kvmclock
> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
> the same as the other vCPUs' too, can't they? Why reset anything?

Aha! I think I finally figured out why KVM behaves the way it does.

The unnecessary masterclock updates effectively came from:

commit 7f187922ddf6b67f2999a76dcb71663097b75497
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date:   Tue Nov 4 21:30:44 2014 -0200

    KVM: x86: update masterclock values on TSC writes

    When the guest writes to the TSC, the masterclock TSC copy must be
    updated as well along with the TSC_OFFSET update, otherwise a negative
    tsc_timestamp is calculated at kvm_guest_time_update.

    Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
    "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
    becomes redundant, so remove the do_request boolean and collapse
    everything into a single condition.

Before that, KVM only re-synced the masterclock if the masterclock was enabled or disabled, i.e. KVM behaved as we want it to behave. Note, at the time of the above commit, VMX synchronized the TSC on *guest* writes to MSR_IA32_TSC:

        case MSR_IA32_TSC:
                kvm_write_tsc(vcpu, msr_info);
                break;

That got changed by commit 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization on guest writes"), but I don't think the guest angle is actually relevant to the fix. AFAICT, a write from the host would suffer the same problem. But knowing that KVM synced on guest writes is crucial to understanding the changelog.

In kvm_write_tsc(), except for KVM's wonderful "less than 1 second" hack, KVM snapshotted the vCPU's current TSC *and* the current time in nanoseconds, where kvm->arch.cur_tsc_nsec is the current host kernel time in nanoseconds:

        ns = get_kernel_ns();

        ...

        if (usdiff < USEC_PER_SEC &&
            vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
                ...
        } else {
                /*
                 * We split periods of matched TSC writes into generations.
                 * For each generation, we track the original measured
                 * nanosecond time, offset, and write, so if TSCs are in
                 * sync, we can match exact offset, and if not, we can match
                 * exact software computation in compute_guest_tsc()
                 *
                 * These values are tracked in kvm->arch.cur_xxx variables.
                 */
                kvm->arch.cur_tsc_generation++;
                kvm->arch.cur_tsc_nsec = ns;
                kvm->arch.cur_tsc_write = data;
                kvm->arch.cur_tsc_offset = offset;
                matched = false;
                pr_debug("kvm: new tsc generation %llu, clock %llu\n",
                         kvm->arch.cur_tsc_generation, data);
        }

        ...

        /* Keep track of which generation this VCPU has synchronized to */
        vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
        vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
        vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

Note that the above sets matched to false! But because kvm_track_tsc_matching() looks for matched+1, i.e. doesn't require the first vCPU to match itself, KVM would immediately compute vcpus_matched as true for VMs with a single vCPU. As a result, KVM would skip the masterclock update, even though a new TSC generation was created:

        vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
                         atomic_read(&vcpu->kvm->online_vcpus));

        if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
                if (!ka->use_master_clock)
                        do_request = 1;

        if (!vcpus_matched && ka->use_master_clock)
                do_request = 1;

        if (do_request)
                kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

On hardware without TSC scaling support, vcpu->tsc_catchup is set to true if the guest TSC frequency is faster than the host TSC frequency, even if the TSC is otherwise stable. And for that mode, kvm_guest_time_update(), by way of compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the kernel time at the last TSC write, to compute the guest TSC relative to kernel time:

static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
        u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
                                      vcpu->arch.virtual_tsc_mult,
                                      vcpu->arch.virtual_tsc_shift);
        tsc += vcpu->arch.this_tsc_write;
        return tsc;
}

Except the @kernel_ns passed to compute_guest_tsc() isn't the current kernel time, it's the masterclock snapshot!

        spin_lock(&ka->pvclock_gtod_sync_lock);
        use_master_clock = ka->use_master_clock;
        if (use_master_clock) {
                host_tsc = ka->master_cycle_now;
                kernel_ns = ka->master_kernel_ns;
        }
        spin_unlock(&ka->pvclock_gtod_sync_lock);

        if (vcpu->tsc_catchup) {
                u64 tsc = compute_guest_tsc(v, kernel_ns);
                if (tsc > tsc_timestamp) {
                        adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
                        tsc_timestamp = tsc;
                }
        }

And so "kernel_ns-vcpu->arch.this_tsc_nsec" is *guaranteed* to generate a negative value, because this_tsc_nsec was captured after ka->master_kernel_ns.

Forcing a masterclock update essentially fudged around that problem, but in a heavy-handed way that introduced undesirable side effects, i.e. it unnecessarily forces a masterclock update when a new vCPU joins the party via hotplug.

Compile tested only, but the below should fix the vCPU hotplug case. Then someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces a masterclock update.

I still think we should clean up the periodic sync code, but I don't think we need to periodically sync the masterclock.

---
 arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c54e1133e0d3..f0a607b6fc31 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
 }
 #endif
 
-static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 {
 #ifdef CONFIG_X86_64
-	bool vcpus_matched;
 	struct kvm_arch *ka = &vcpu->kvm->arch;
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 
-	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
-			 atomic_read(&vcpu->kvm->online_vcpus));
+	/*
+	 * To use the masterclock, the host clocksource must be based on TSC
+	 * and all vCPUs must have matching TSCs.  Note, the count for matching
+	 * vCPUs doesn't include the reference vCPU, hence "+1".
+	 */
+	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
+				 atomic_read(&vcpu->kvm->online_vcpus)) &&
+				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
 	/*
-	 * Once the masterclock is enabled, always perform request in
-	 * order to update it.
-	 *
-	 * In order to enable masterclock, the host clocksource must be TSC
-	 * and the vcpus need to have matched TSCs.  When that happens,
-	 * perform request to enable masterclock.
+	 * Request a masterclock update if the masterclock needs to be toggled
+	 * on/off, or when starting a new generation and the masterclock is
+	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
+	 * taken _after_ the new generation is created).
 	 */
-	if (ka->use_master_clock ||
-	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
+	if ((ka->use_master_clock && new_generation) ||
+	    (ka->use_master_clock != use_master_clock))
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
 
-	kvm_track_tsc_matching(vcpu);
+	kvm_track_tsc_matching(vcpu, !matched);
 }
 
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)

base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12
--
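To see the sign bug numerically, here is a small user-space sketch (not kernel code: pvclock_scale_delta() is re-implemented locally from its kernel definition, and all the numbers are made up). When compute_guest_tsc() is handed a kernel_ns that predates this_tsc_nsec, the unsigned subtraction wraps, and the "negative" delta scales into an enormous forward jump:

#include <stdint.h>
#include <stdio.h>

/* Local stand-in for the kernel's pvclock_scale_delta(): pre-shift the
 * delta, then apply a 32.32 fixed-point multiplier. */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

int main(void)
{
        /* Hypothetical values: the masterclock snapshot was taken 1 ms
         * *before* the new TSC generation recorded this_tsc_nsec. */
        uint64_t master_kernel_ns = 1000000000ULL;  /* ka->master_kernel_ns */
        uint64_t this_tsc_nsec    = 1001000000ULL;  /* vcpu->arch.this_tsc_nsec */

        /* The subtraction in compute_guest_tsc() is done in u64, so the
         * "-1 ms" wraps around. */
        uint64_t delta = master_kernel_ns - this_tsc_nsec;

        /* 500 MHz guest: 0.5 ticks per ns, i.e. mul_frac = 0.5 * 2^32. */
        uint64_t ticks = pvclock_scale_delta(delta, 0x80000000U, 0);

        printf("delta as s64: %lld ns\n", (long long)delta);              /* -1000000 */
        printf("delta as u64: %llu ticks\n", (unsigned long long)ticks);  /* ~9.2e18 */
        return 0;
}

Since compute_guest_tsc() then adds this to this_tsc_write, the catchup path sees an absurdly large "expected" guest TSC and adjusts the TSC offset accordingly; forcing a masterclock update on every TSC write merely re-captured master_kernel_ns after this_tsc_nsec so that the delta stayed non-negative.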
Hi Sean, On 10/16/23 11:49, Sean Christopherson wrote: > On Mon, Oct 16, 2023, David Woodhouse wrote: >> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote: >>> Suppose we are discussing a non-permanenet solution, I would suggest: >>> >>> 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen >>> on KVM) is not good enough in some cases, e.g., vCPU hotplug. >> >> I still don't understand the vCPU hotplug case. >> >> In the case where the TSC is actually sane, why would we need to reset >> the masterclock on vCPU hotplug? >> >> The new vCPU gets its TSC synchronised to the others, and its kvmclock >> parameters (mul/shift/offset based on the guest TSC) can be *precisely* >> the same as the other vCPUs too, can't they? Why reset anything? > > Aha! I think I finally figured out why KVM behaves the way it does. > > The unnecessary masterclock updates effectively came from: > > commit 7f187922ddf6b67f2999a76dcb71663097b75497 > Author: Marcelo Tosatti <mtosatti@redhat.com> > Date: Tue Nov 4 21:30:44 2014 -0200 > > KVM: x86: update masterclock values on TSC writes > > When the guest writes to the TSC, the masterclock TSC copy must be > updated as well along with the TSC_OFFSET update, otherwise a negative > tsc_timestamp is calculated at kvm_guest_time_update. > > Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to > "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)" > becomes redundant, so remove the do_request boolean and collapse > everything into a single condition. > > Before that, KVM only re-synced the masterclock if it was enabled or disabled, > i.e. KVM behaved as we want it to behave. Note, at the time of the above commit, > VMX synchronized TSC on *guest* writes to MSR_IA32_TSC: > > case MSR_IA32_TSC: > kvm_write_tsc(vcpu, msr_info); > break; > > That got changed by commit 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization > on guest writes"), but I don't think the guest angle is actually relevant to the > fix. AFAICT, a write from the host would suffer the same problem. But knowing > that KVM synced on guest writes is crucial to understanding the changelog. > > In kvm_write_tsc(), except for KVM's wonderful "less than 1 second" hack, KVM > snapshotted the vCPU's current TSC *and* the current time in nanoseconds, where > kvm->arch.cur_tsc_nsec is the current host kernel time in nanoseconds. > > ns = get_kernel_ns(); > > ... > > if (usdiff < USEC_PER_SEC && > vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) { > ... > } else { > /* > * We split periods of matched TSC writes into generations. > * For each generation, we track the original measured > * nanosecond time, offset, and write, so if TSCs are in > * sync, we can match exact offset, and if not, we can match > * exact software computation in compute_guest_tsc() > * > * These values are tracked in kvm->arch.cur_xxx variables. > */ > kvm->arch.cur_tsc_generation++; > kvm->arch.cur_tsc_nsec = ns; > kvm->arch.cur_tsc_write = data; > kvm->arch.cur_tsc_offset = offset; > matched = false; > pr_debug("kvm: new tsc generation %llu, clock %llu\n", > kvm->arch.cur_tsc_generation, data); > } > > ... > > /* Keep track of which generation this VCPU has synchronized to */ > vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation; > vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec; > vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write; > > Note that the above sets matched to false! But because kvm_track_tsc_matching() > looks for matched+1, i.e. 
doesn't require the first vCPU to match itself, KVM > would immediately compute vcpus_matched as true for VMs with a single vCPU. As > a result, KVM would skip the masterlock update, even though a new TSC generation > was created. > > vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 == > atomic_read(&vcpu->kvm->online_vcpus)); > > if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC) > if (!ka->use_master_clock) > do_request = 1; > > if (!vcpus_matched && ka->use_master_clock) > do_request = 1; > > if (do_request) > kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu); > > On hardware without TSC scaling support, vcpu->tsc_catchup is set to true if the > guest TSC frequency is faster than the host TSC frequency, even if the TSC is > otherwise stable. And for that mode, kvm_guest_time_update(), by way of > compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the kernel time at the > last TSC write, to compute the guest TSC relative to kernel time: > > static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns) > { > u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec, > vcpu->arch.virtual_tsc_mult, > vcpu->arch.virtual_tsc_shift); > tsc += vcpu->arch.this_tsc_write; > return tsc; > } > > Except the @kernel_ns passed to compute_guest_tsc() isn't the current kernel time, > it's the masterclock snapshot! > > spin_lock(&ka->pvclock_gtod_sync_lock); > use_master_clock = ka->use_master_clock; > if (use_master_clock) { > host_tsc = ka->master_cycle_now; > kernel_ns = ka->master_kernel_ns; > } > spin_unlock(&ka->pvclock_gtod_sync_lock); > > if (vcpu->tsc_catchup) { > u64 tsc = compute_guest_tsc(v, kernel_ns); > if (tsc > tsc_timestamp) { > adjust_tsc_offset_guest(v, tsc - tsc_timestamp); > tsc_timestamp = tsc; > } > } > > And so the "kernel_ns-vcpu->arch.this_tsc_nsec" is *guaranteed* to generate a > negative value, because this_tsc_nsec was captured after ka->master_kernel_ns. > > Forcing a masterclock update essentially fudged around that problem, but in a > heavy handed way that introduced undesirable side effects, i.e. unnecessarily > forces a masterclock update when a new vCPU joins the party via hotplug. > > Compile tested only, but the below should fix the vCPU hotplug case. Then > someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces > a masterclock update. > > I still think we should clean up the periodic sync code, but I don't think we > need to periodically sync the masterclock. This looks good to me. The core idea is to not update master clock for the synchronized cases. How about the negative value case? I see in the linux code it is still there? (It is out of the scope of my expectation as I do not need to run vCPUs in different tsc freq as host) Thank you very much! 
Dongli Zhang > > --- > arch/x86/kvm/x86.c | 29 ++++++++++++++++------------- > 1 file changed, 16 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index c54e1133e0d3..f0a607b6fc31 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode) > } > #endif > > -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu) > +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation) > { > #ifdef CONFIG_X86_64 > - bool vcpus_matched; > struct kvm_arch *ka = &vcpu->kvm->arch; > struct pvclock_gtod_data *gtod = &pvclock_gtod_data; > > - vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 == > - atomic_read(&vcpu->kvm->online_vcpus)); > + /* > + * To use the masterclock, the host clocksource must be based on TSC > + * and all vCPUs must have matching TSCs. Note, the count for matching > + * vCPUs doesn't include the reference vCPU, hence "+1". > + */ > + bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 == > + atomic_read(&vcpu->kvm->online_vcpus)) && > + gtod_is_based_on_tsc(gtod->clock.vclock_mode); > > /* > - * Once the masterclock is enabled, always perform request in > - * order to update it. > - * > - * In order to enable masterclock, the host clocksource must be TSC > - * and the vcpus need to have matched TSCs. When that happens, > - * perform request to enable masterclock. > + * Request a masterclock update if the masterclock needs to be toggled > + * on/off, or when starting a new generation and the masterclock is > + * enabled (compute_guest_tsc() requires the masterclock snaphot to be > + * taken _after_ the new generation is created). > */ > - if (ka->use_master_clock || > - (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched)) > + if ((ka->use_master_clock && new_generation) || > + (ka->use_master_clock != use_master_clock)) > kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu); > > trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc, > @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc, > vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec; > vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write; > > - kvm_track_tsc_matching(vcpu); > + kvm_track_tsc_matching(vcpu, !matched); > } > > static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value) > > base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12
On Mon, Oct 16, 2023, Dongli Zhang wrote: > Hi Sean, > > On 10/16/23 11:49, Sean Christopherson wrote: > > Compile tested only, but the below should fix the vCPU hotplug case. Then > > someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces > > a masterclock update. > > > > I still think we should clean up the periodic sync code, but I don't think we > > need to periodically sync the masterclock. > > This looks good to me. The core idea is to not update master clock for the > synchronized cases. > > > How about the negative value case? I see in the linux code it is still there? See below. > (It is out of the scope of my expectation as I do not need to run vCPUs in > different tsc freq as host) > > Thank you very much! > > Dongli Zhang > > > > > --- > > arch/x86/kvm/x86.c | 29 ++++++++++++++++------------- > > 1 file changed, 16 insertions(+), 13 deletions(-) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index c54e1133e0d3..f0a607b6fc31 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode) > > } > > #endif > > > > -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu) > > +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation) > > { > > #ifdef CONFIG_X86_64 > > - bool vcpus_matched; > > struct kvm_arch *ka = &vcpu->kvm->arch; > > struct pvclock_gtod_data *gtod = &pvclock_gtod_data; > > > > - vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 == > > - atomic_read(&vcpu->kvm->online_vcpus)); > > + /* > > + * To use the masterclock, the host clocksource must be based on TSC > > + * and all vCPUs must have matching TSCs. Note, the count for matching > > + * vCPUs doesn't include the reference vCPU, hence "+1". > > + */ > > + bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 == > > + atomic_read(&vcpu->kvm->online_vcpus)) && > > + gtod_is_based_on_tsc(gtod->clock.vclock_mode); > > > > /* > > - * Once the masterclock is enabled, always perform request in > > - * order to update it. > > - * > > - * In order to enable masterclock, the host clocksource must be TSC > > - * and the vcpus need to have matched TSCs. When that happens, > > - * perform request to enable masterclock. > > + * Request a masterclock update if the masterclock needs to be toggled > > + * on/off, or when starting a new generation and the masterclock is > > + * enabled (compute_guest_tsc() requires the masterclock snaphot to be > > + * taken _after_ the new generation is created). > > */ > > - if (ka->use_master_clock || > > - (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched)) > > + if ((ka->use_master_clock && new_generation) || > > + (ka->use_master_clock != use_master_clock)) > > kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu); > > > > trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc, > > @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc, > > vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec; > > vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write; > > > > - kvm_track_tsc_matching(vcpu); > > + kvm_track_tsc_matching(vcpu, !matched); If my analysis of how the negative timestamp occurred is correct, the problematic scenario was if cur_tsc_nsec/cur_tsc_write were updated without a masterclock update. Passing !matched for @new_generation means that KVM will force a masterclock update if cur_tsc_nsec/cur_tsc_write are changed, i.e. prevent the negative timestamp bug.
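Spelled out, the rewritten condition in the patch above reduces to a small decision table (here "enabled" is ka->use_master_clock on entry, and "eligible" is the local use_master_clock, i.e. all TSCs matched and a TSC-based host clocksource):

  enabled  eligible  new_generation | request masterclock update?
  -------  --------  -------------- | ----------------------------
  yes      yes       yes            | yes (re-snapshot after cur_tsc_* changed)
  yes      yes       no             | no  (the vCPU hotplug case)
  yes      no        any            | yes (masterclock must be turned off)
  no       yes       any            | yes (masterclock can be turned on)
  no       no        any            | no

Only the second row changes relative to the old code, which requested an update whenever the masterclock was already enabled.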
Hi Sean, On 10/16/23 15:48, Sean Christopherson wrote: > On Mon, Oct 16, 2023, Dongli Zhang wrote: >> Hi Sean, >> >> On 10/16/23 11:49, Sean Christopherson wrote: >>> Compile tested only, but the below should fix the vCPU hotplug case. Then >>> someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces >>> a masterclock update. >>> >>> I still think we should clean up the periodic sync code, but I don't think we >>> need to periodically sync the masterclock. >> >> This looks good to me. The core idea is to not update master clock for the >> synchronized cases. >> >> >> How about the negative value case? I see in the linux code it is still there? > > See below. > >> (It is out of the scope of my expectation as I do not need to run vCPUs in >> different tsc freq as host) >> >> Thank you very much! >> >> Dongli Zhang >> >>> >>> --- >>> arch/x86/kvm/x86.c | 29 ++++++++++++++++------------- >>> 1 file changed, 16 insertions(+), 13 deletions(-) >>> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>> index c54e1133e0d3..f0a607b6fc31 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode) >>> } >>> #endif >>> >>> -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu) >>> +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation) >>> { >>> #ifdef CONFIG_X86_64 >>> - bool vcpus_matched; >>> struct kvm_arch *ka = &vcpu->kvm->arch; >>> struct pvclock_gtod_data *gtod = &pvclock_gtod_data; >>> >>> - vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 == >>> - atomic_read(&vcpu->kvm->online_vcpus)); >>> + /* >>> + * To use the masterclock, the host clocksource must be based on TSC >>> + * and all vCPUs must have matching TSCs. Note, the count for matching >>> + * vCPUs doesn't include the reference vCPU, hence "+1". >>> + */ >>> + bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 == >>> + atomic_read(&vcpu->kvm->online_vcpus)) && >>> + gtod_is_based_on_tsc(gtod->clock.vclock_mode); >>> >>> /* >>> - * Once the masterclock is enabled, always perform request in >>> - * order to update it. >>> - * >>> - * In order to enable masterclock, the host clocksource must be TSC >>> - * and the vcpus need to have matched TSCs. When that happens, >>> - * perform request to enable masterclock. >>> + * Request a masterclock update if the masterclock needs to be toggled >>> + * on/off, or when starting a new generation and the masterclock is >>> + * enabled (compute_guest_tsc() requires the masterclock snaphot to be >>> + * taken _after_ the new generation is created). >>> */ >>> - if (ka->use_master_clock || >>> - (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched)) >>> + if ((ka->use_master_clock && new_generation) || >>> + (ka->use_master_clock != use_master_clock)) >>> kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu); >>> >>> trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc, >>> @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc, >>> vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec; >>> vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write; >>> >>> - kvm_track_tsc_matching(vcpu); >>> + kvm_track_tsc_matching(vcpu, !matched); > > If my analysis of how the negative timestamp occurred is correct, the problematic > scenario was if cur_tsc_nsec/cur_tsc_write were updated without a masterclock update. 
> Passing !matched for @new_generation means that KVM will force a masterclock update
> if cur_tsc_nsec/cur_tsc_write are changed, i.e. prevent the negative timestamp bug.

Thank you very much for the explanation. Now I understand it.

Thanks to the immediate call to kvm_synchronize_tsc() during each vCPU creation ...

kvm_vm_ioctl(KVM_CREATE_VCPU)
-> kvm_vm_ioctl_create_vcpu()
-> kvm_arch_vcpu_postcreate()
-> kvm_synchronize_tsc()

... the local variable "bool use_master_clock" in your patch may always be true at that point, so "(ka->use_master_clock != use_master_clock)" evaluates to true. As a result, KVM_REQ_MASTERCLOCK_UPDATE is still triggered for each vCPU during VM creation.

That is, there is still a KVM_REQ_MASTERCLOCK_UPDATE per vCPU during VM creation, but there will no longer be one on vCPU hot-add.

Thank you very much!

Dongli Zhang
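For reference, a trimmed sketch of where that per-creation sync happens. This is loosely based on kvm_arch_vcpu_postcreate() in arch/x86/kvm/x86.c, with locking and unrelated work elided; the exact contents vary by kernel version:

/* Called once per vCPU, after KVM_CREATE_VCPU has built the vCPU. */
void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
{
        struct kvm *kvm = vcpu->kvm;

        /* Synchronize the new vCPU's TSC to the VM's current generation;
         * this ends in kvm_track_tsc_matching(), which decides whether a
         * masterclock update request is needed. */
        kvm_synchronize_tsc(vcpu, NULL);

        /* vCPU 0 also kicks off the periodic kvmclock sync work. */
        if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
                schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
                                      KVMCLOCK_SYNC_PERIOD);
}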
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 17715cb8731d..57409dce5d73 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1331,6 +1331,7 @@ struct kvm_arch {
 	u64 master_cycle_now;
 	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
+	struct delayed_work masterclock_sync_work;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..0b71dc3785eb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 static bool __read_mostly kvmclock_periodic_sync = true;
 module_param(kvmclock_periodic_sync, bool, S_IRUGO);
 
+unsigned int __read_mostly masterclock_sync_period;
+module_param(masterclock_sync_period, uint, 0444);
+
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
@@ -3298,6 +3301,31 @@ static void kvmclock_sync_fn(struct work_struct *work)
 					KVMCLOCK_SYNC_PERIOD);
 }
 
+static void masterclock_sync_fn(struct work_struct *work)
+{
+	unsigned long i;
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
+					   masterclock_sync_work);
+	struct kvm *kvm = container_of(ka, struct kvm, arch);
+	struct kvm_vcpu *vcpu;
+
+	if (!masterclock_sync_period)
+		return;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/*
+		 * It is not required to kick the vcpu because it is not
+		 * expected to update the master clock immediately.
+		 */
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	}
+
+	schedule_delayed_work(&ka->masterclock_sync_work,
+			      masterclock_sync_period * HZ);
+}
+
+
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
 static bool is_mci_control_msr(u32 msr)
 {
@@ -11970,6 +11998,10 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
 		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
 				      KVMCLOCK_SYNC_PERIOD);
+
+	if (masterclock_sync_period && vcpu->vcpu_idx == 0)
+		schedule_delayed_work(&kvm->arch.masterclock_sync_work,
+				      masterclock_sync_period * HZ);
 }
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -12344,6 +12376,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
+	INIT_DELAYED_WORK(&kvm->arch.masterclock_sync_work, masterclock_sync_fn);
 
 	kvm_apicv_init(kvm);
 	kvm_hv_init_vm(kvm);
@@ -12383,6 +12416,7 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 
 void kvm_arch_sync_events(struct kvm *kvm)
 {
+	cancel_delayed_work_sync(&kvm->arch.masterclock_sync_work);
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);
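A usage note on this diagnostic patch: since the parameter is declared with module_param(..., uint, 0444), it is read-only at runtime (visible under /sys/module/kvm/parameters/) and would presumably be set when the kvm module is loaded, e.g. kvm.masterclock_sync_period=300 on the host kernel command line or the equivalent modprobe option. The default of zero leaves behavior unchanged: both the scheduling site in kvm_arch_vcpu_postcreate() and masterclock_sync_fn() itself bail out when masterclock_sync_period is 0.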