Message ID | 20230127035616.508966-1-aik@amd.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp638884wrn; Thu, 26 Jan 2023 20:17:41 -0800 (PST) X-Google-Smtp-Source: AK7set+SHeP7sN/cTh3saN81bodXAJILyv568AvA737uxPV7nmWq9MeGT4emIwOYBS6AEzKF+c0s X-Received: by 2002:a17:906:4b13:b0:878:5e84:e1da with SMTP id y19-20020a1709064b1300b008785e84e1damr5401024eju.27.1674793061643; Thu, 26 Jan 2023 20:17:41 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1674793061; cv=pass; d=google.com; s=arc-20160816; b=mXoUkTz17rTQ5BB4h0TqYaL1A2SG4posTwIAoB873YDheEgY3Z0VITs5rjv/zXfHVY GG8cTQigq6kqIl61zcgmodcPfpYs1UpI4WKOtz775/7RyFXN5CrNvapJ/4cF1SjOBodT lcYKiOOPT/Ol2BL0nR8gFoSHt5L/lKld0i+W0zZsCHSMip/k1ipPPXakN5ttYE7YLm+z a09ZOpEiYiQmClhAvgG+kWonnnyFqK5UdoDwy7zaPGvwoWK6e+Qj0LpGDNrYrnOV8MS7 4GipBC72maRHohvCDp58jnm50DvZBbIxO2bcQrLqaHePyNHCp6BHKEHUi17tNSobpHHm L96g== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=ITqkDJd1BmhTSMM+q2gKfRuRTUX6Y0NcbTampoYR3b0=; b=CGxubLEIXYa5kVRiaXlBKSicDzizeATT4oaB4uRZ91kqTpIwrIMlrQf8whnaYpIbcr nKaNyVyjd16iFc7YxwyGoiB2oOIUKG2E2eOoCEX+LnXOMIbVJpx9Sq6oC1RxM01P83a2 XHjgWvcXqYO7vBXg4vZEz0bLeBZi+keQ0Opcvq/nnJUjcYKAlbi2/BIsFtxZ0FT91R99 RRFWz9qUw+691Z4vPUpq31P0AtFrgNx71iK7/gb7ts7J4UrA3jkJ1OEPonZ7rMhGIMiB fVYBdzgnjks6PSmo8Ac4v+fta99vuuZvUstabNzHPadrxUzyc3FxK6ipnjyMMXo5mTNV F1QA== ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@amd.com header.s=selector1 header.b=YNPYdL0y; arc=pass (i=1 spf=pass spfdomain=amd.com dmarc=pass fromdomain=amd.com); spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amd.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id az24-20020a170907905800b0087194db3565si2963186ejc.551.2023.01.26.20.17.18; Thu, 26 Jan 2023 20:17:41 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@amd.com header.s=selector1 header.b=YNPYdL0y; arc=pass (i=1 spf=pass spfdomain=amd.com dmarc=pass fromdomain=amd.com); spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amd.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231721AbjA0D4q (ORCPT <rfc822;lekhanya01809@gmail.com> + 99 others); Thu, 26 Jan 2023 22:56:46 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55694 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229481AbjA0D4p (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 26 Jan 2023 22:56:45 -0500 Received: from NAM12-BN8-obe.outbound.protection.outlook.com (mail-bn8nam12on2054.outbound.protection.outlook.com [40.107.237.54]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0C0E961D6A; Thu, 26 Jan 2023 19:56:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=ofwWvW1ugMSRaY6PaozT/FSHtqQBKf2pDLpNZ/LPAuZKsEMd1bhKap7j8XSAwGNFkPxcZpuPXUWppgtVJjUk3+Wv7Ps0CkxPDY0HNZw9rQ5h9Nc5ASE+hzZ2PooFRVVDlzDTRAv846fWMA2EDNYZ0u6cu8L2GvVgvDzp4YMAu5HL5ipZbjdsuIbAbL2uEjsTs97VsZEFtx/5fNqoKt+2Z+qcs+QjzuwYtCZeYWeCERTXoMHLWrOATtUlX1x9S5B7LqTlMZ2kRuADK9myBrmzF15hhI+Pp3BtTomlOOGk1EMOTxNVSXwTSMYIpZyOVr3RQ1IC6Qg9j+xxXgONJvRv5g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=ITqkDJd1BmhTSMM+q2gKfRuRTUX6Y0NcbTampoYR3b0=; b=ViD6fZEOG6ilXldVjVPGvIPPfGPwligcbwch5P6Y4bv0UxHeNUrn5ITA1EuRJgZLHd7KWkEVucjP5Cjao6OxHc5vNFSJ0eRGn3d1FMQkMqlH7bTeKeUXTWb7kjZag9FWo8LfYdD+3rB29PMK8UF4VBja6etPpPHUeeUc36Ww9wDPj66C2H3c9mnJWVPCZ7dht40Ke8NX4LG6enKsGmZ8G8WYqyCGsa1h3K25u3wQ8uVwePLzWRpFeQ6CCyK53NhrE5Tz7vCb4GhOx62aUveWwT+lUTTYW5pxJIShBpt9l/n6eD9mzQCewP1IPB8s2hBlgZbqRs9v5f1wFWCdzO4Z0A== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=ITqkDJd1BmhTSMM+q2gKfRuRTUX6Y0NcbTampoYR3b0=; b=YNPYdL0yE6VTfzf5mBReVSWIwZ4V8wddCAmm/51/Eizqk2KB5f9x0URlIsxHIoWXRHRf6vV7N5N5D+H9LeRS52yTI/QunXpkTV7X1xuuzjDmS1hulF34Nl61ftuqiKsXs4FMfebWBVCuPq2a7wO/nMjtVu17yTGmzcGEs8q5cGg= Received: from DS7PR03CA0257.namprd03.prod.outlook.com (2603:10b6:5:3b3::22) by MN0PR12MB5740.namprd12.prod.outlook.com (2603:10b6:208:373::10) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6043.22; Fri, 27 Jan 2023 03:56:39 +0000 Received: from DM6NAM11FT028.eop-nam11.prod.protection.outlook.com (2603:10b6:5:3b3:cafe::fd) by DS7PR03CA0257.outlook.office365.com (2603:10b6:5:3b3::22) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6043.22 via Frontend Transport; Fri, 27 Jan 2023 03:56:39 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C Received: from SATLEXMB04.amd.com (165.204.84.17) by DM6NAM11FT028.mail.protection.outlook.com (10.13.173.140) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.20.6043.21 via Frontend Transport; Fri, 27 Jan 2023 03:56:39 +0000 Received: from aiemdeew.1.ozlabs.ru (10.180.168.240) by SATLEXMB04.amd.com (10.181.40.145) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.34; Thu, 26 Jan 2023 21:56:29 -0600 From: Alexey Kardashevskiy <aik@amd.com> To: Alexey Kardashevskiy <aik@amd.com> CC: <kvm@vger.kernel.org>, <x86@kernel.org>, <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Sean Christopherson <seanjc@google.com>, Jiri Kosina <jkosina@suse.cz>, Ingo Molnar <mingo@redhat.com>, Dave Hansen <dave.hansen@linux.intel.com>, "Borislav Petkov" <bp@alien8.de>, "Peter Zijlstra (Intel)" <peterz@infradead.org>, "H. Peter Anvin" <hpa@zytor.com>, Tom Lendacky <thomas.lendacky@amd.com> Subject: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?) Date: Fri, 27 Jan 2023 14:56:16 +1100 Message-ID: <20230127035616.508966-1-aik@amd.com> X-Mailer: git-send-email 2.39.1 MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.180.168.240] X-ClientProxiedBy: SATLEXMB03.amd.com (10.181.40.144) To SATLEXMB04.amd.com (10.181.40.145) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: DM6NAM11FT028:EE_|MN0PR12MB5740:EE_ X-MS-Office365-Filtering-Correlation-Id: 7512e605-bbab-4ead-1db1-08db001a7e58 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: 7vJEF7viiOYTgLAQHRniUyl5KTGAW7snsb79j7Ns7LEDYV18I57aTEKpd6o4K+lQH2vHGxDZEYiviGHwDktHIIdgRg3gytbtsfzPPIJhDyaadGTUi645uu6D1PWdGm9G4chtyd9oizG1inTnFzS/tX1Is7ard+PkpJlnd/UJilL5nifxixQV5t/JUyPo5hMbMYfu4KJ9l6lHnCl08BHU2Sr0H5LXdT2oKtZOXpmIGqjAC1vuYNK0VHBQJWmcd5sThJQ/1VOlVfFaL404v6rUyluP/+pafWjf9foPlX+tM6SYVHr2JKqAeEREe/hbFLdNtieKd4BVX6S15Mx1OrBpE/g3bBSJalJ2valLs+51FBcX14+hOyIzMDcjYenI5ThkurKhgp2gx1F9ExOwgshmTG2yq7wi/iRcwvTyXlUQm77AZcN/8yKtOL2+0HRrmo8jKGc4szV577Y101Fj+DW48da+feSSgPwN4ekjbZHUXBrbdAPWcWWhKhnR+AaTzntVWAX73fW4jAu90OyfEnJAE5xjdEW78QaKnuZw+creAwRZg13vUrxqcYUf9BZfS1G5a/pESm0ELc7vFCskK1F6yD/DVCPVixMTnVnBA2SVUS+GEcsKoSggANulhVrtCCtmayMR1OtxgVRqO2m39fFfZfu8rn2eYWmCeUbvLKOEuhyYCmnTKy9QeaGjp99Tlz8IEdBf1hEwoxRBEUEquT/FLYDFg7SNKK8JsEIMZM2R7hiuCEbsR0d8UO6Kj+/XaDqGlDpmshtHrd6YfF233pUsTdc6tle3+QwEb5xd488btf6V9IsglbuTtfIyTJMalaHT X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:SATLEXMB04.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230025)(4636009)(39860400002)(396003)(136003)(376002)(346002)(451199018)(46966006)(36840700001)(40470700004)(6200100001)(40460700003)(356005)(2906002)(81166007)(1076003)(36860700001)(2616005)(7049001)(26005)(16526019)(6666004)(7416002)(426003)(186003)(83380400001)(40480700001)(4326008)(30864003)(70206006)(70586007)(8676002)(47076005)(336012)(5660300002)(36756003)(6862004)(316002)(82740400003)(8936002)(966005)(54906003)(37006003)(19627235002)(45080400002)(41300700001)(478600001)(82310400005)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 27 Jan 2023 03:56:39.1129 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 7512e605-bbab-4ead-1db1-08db001a7e58 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT028.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: MN0PR12MB5740 X-Spam-Status: No, score=-1.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FORGED_SPF_HELO, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_PASS,SPF_NONE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1756147809179205283?= X-GMAIL-MSGID: =?utf-8?q?1756147809179205283?= |
Series |
[Question,kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
|
|
Commit Message
Alexey Kardashevskiy
Jan. 27, 2023, 3:56 a.m. UTC
AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
"perf" executes. "./perf record sleep 20" is an example.
Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
and actual reading of DB7 (which in turn causes #VC) every function is
inlined and no stack frame is created (?). Replacing __always_inline with
noinline in local_db_save() or native_get_debugreg() fixes the problem.
The crash does not happen with CONFIG_PARAVIRT_XXL as in this case
paravirt_get_debugreg() is used and there is an indirect call via
PVOP_CALL1(). It has not been noticed as the most configs have CONFIG_XEN
enabled which enables CONFIG_PARAVIRT_XXL.
This happens with the recent tip/master, here is my test kernel
and the config:
https://github.com/aik/linux/commits/debug_dr7
Found this while testing DebugSwap (which also fixes the crash as
it eliminates DB7 interception == #VC):
https://lore.kernel.org/all/20230120031047.628097-1-aik@amd.com
Define local_db_save_exc_nmi() to demostrate that the problem better.
Why is this crash happening and how to fix that? I am still reading
the assembly but was hoping for a shortcut here :) Thanks,
aik-Standard-PC-i440FX-PIIX-1996 login: [A[ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
[ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
[ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[ 15.775323] RIP: 0010:error_entry+0x17/0x140
[ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
[ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
[ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
[ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
[ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
[ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
[ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
[ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
[ 15.775342] Call Trace:
[ 15.775352] <NMI>
[ 15.775355] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775357] ? exc_page_fault+0x11/0x120
[ 15.775360] ? asm_exc_page_fault+0x22/0x30
[ 15.775364] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775365] ? exc_page_fault+0x11/0x120
[ 15.775367] ? asm_exc_page_fault+0x22/0x30
[ 15.775368] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775369] ? exc_page_fault+0x11/0x120
[ 15.775371] ? asm_exc_page_fault+0x22/0x30
[ 15.775372] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775373] ? exc_page_fault+0x11/0x120
[ 15.775374] ? asm_exc_page_fault+0x22/0x30
[ 15.775375] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775376] ? exc_page_fault+0x11/0x120
[ 15.775378] ? asm_exc_page_fault+0x22/0x30
[ 15.775379] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775380] ? exc_page_fault+0x11/0x120
[ 15.775381] ? asm_exc_page_fault+0x22/0x30
[ 15.775382] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775383] ? exc_page_fault+0x11/0x120
[ 15.775384] ? asm_exc_page_fault+0x22/0x30
[ 15.775385] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775386] ? exc_page_fault+0x11/0x120
[ 15.775388] ? asm_exc_page_fault+0x22/0x30
[ 15.775389] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775390] ? exc_page_fault+0x11/0x120
[ 15.775391] ? asm_exc_page_fault+0x22/0x30
[ 15.775392] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775393] ? exc_page_fault+0x11/0x120
[ 15.775395] ? asm_exc_page_fault+0x22/0x30
[ 15.775396] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775397] ? exc_page_fault+0x11/0x120
[ 15.775398] ? asm_exc_page_fault+0x22/0x30
[ 15.775399] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775400] ? exc_page_fault+0x11/0x120
[ 15.775401] ? asm_exc_page_fault+0x22/0x30
[ 15.775403] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775404] ? exc_page_fault+0x11/0x120
[ 15.775405] ? asm_exc_page_fault+0x22/0x30
[ 15.775406] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775407] ? exc_page_fault+0x11/0x120
[ 15.775408] ? asm_exc_page_fault+0x22/0x30
[ 15.775409] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775410] ? exc_page_fault+0x11/0x120
[ 15.775412] ? asm_exc_page_fault+0x22/0x30
[ 15.775413] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775414] ? exc_page_fault+0x11/0x120
[ 15.775415] ? asm_exc_page_fault+0x22/0x30
[ 15.775416] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775420] ? exc_page_fault+0x11/0x120
[ 15.775421] ? asm_exc_page_fault+0x22/0x30
[ 15.775422] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775423] ? exc_page_fault+0x11/0x120
[ 15.775425] ? asm_exc_page_fault+0x22/0x30
[ 15.775426] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775427] ? exc_page_fault+0x11/0x120
[ 15.775431] ? asm_exc_page_fault+0x22/0x30
[ 15.775432] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775433] ? exc_page_fault+0x11/0x120
[ 15.775435] ? asm_exc_page_fault+0x22/0x30
[ 15.775436] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775437] ? exc_page_fault+0x11/0x120
[ 15.775438] ? asm_exc_page_fault+0x22/0x30
[ 15.775439] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775440] ? exc_page_fault+0x11/0x120
[ 15.775441] ? asm_exc_page_fault+0x22/0x30
[ 15.775442] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775443] ? exc_page_fault+0x11/0x120
[ 15.775445] ? asm_exc_page_fault+0x22/0x30
[ 15.775446] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775447] ? exc_page_fault+0x11/0x120
[ 15.775448] ? asm_exc_page_fault+0x22/0x30
[ 15.775449] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775450] ? exc_page_fault+0x11/0x120
[ 15.775454] ? asm_exc_page_fault+0x22/0x30
[ 15.775455] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775456] ? exc_page_fault+0x11/0x120
[ 15.775458] ? asm_exc_page_fault+0x22/0x30
[ 15.775459] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775460] ? exc_page_fault+0x11/0x120
[ 15.775461] ? asm_exc_page_fault+0x22/0x30
[ 15.775462] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775463] ? exc_page_fault+0x11/0x120
[ 15.775465] ? asm_exc_page_fault+0x22/0x30
[ 15.775466] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775467] ? exc_page_fault+0x11/0x120
[ 15.775468] ? asm_exc_page_fault+0x22/0x30
[ 15.775469] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775470] ? exc_page_fault+0x11/0x120
[ 15.775471] ? asm_exc_page_fault+0x22/0x30
[ 15.775472] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775473] ? exc_page_fault+0x11/0x120
[ 15.775475] ? asm_exc_page_fault+0x22/0x30
[ 15.775476] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775477] ? exc_page_fault+0x11/0x120
[ 15.775478] ? asm_exc_page_fault+0x22/0x30
[ 15.775482] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775483] ? exc_page_fault+0x11/0x120
[ 15.775485] ? asm_exc_page_fault+0x22/0x30
[ 15.775486] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775487] ? exc_page_fault+0x11/0x120
[ 15.775488] ? asm_exc_page_fault+0x22/0x30
[ 15.775490] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775491] ? exc_page_fault+0x11/0x120
[ 15.775492] ? asm_exc_page_fault+0x22/0x30
[ 15.775493] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775494] ? exc_page_fault+0x11/0x120
[ 15.775495] ? asm_exc_page_fault+0x22/0x30
[ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775497] ? exc_page_fault+0x11/0x120
[ 15.775499] ? asm_exc_page_fault+0x22/0x30
[ 15.775500] ? check_preemption_disabled+0x8/0xe0
[ 15.775502] ? __sev_es_ist_enter+0x13/0x100
[ 15.775503] ? exc_nmi+0x10e/0x150
[ 15.775505] ? end_repeat_nmi+0x16/0x67
[ 15.775506] ? asm_exc_double_fault+0x30/0x30
[ 15.775507] ? asm_exc_double_fault+0x30/0x30
[ 15.775508] ? asm_exc_double_fault+0x30/0x30
[ 15.775509] </NMI>
[ 15.775509] <#VC>
[ 15.775510] ? __show_regs.cold+0x18e/0x23d
[ 15.775511] </#VC>
[ 15.775511] <#DF>
[ 15.775512] ? __die_body.cold+0x1a/0x1f
[ 15.775513] ? die+0x26/0x40
[ 15.775517] ? handle_stack_overflow+0x44/0x60
[ 15.775518] ? exc_double_fault+0x14b/0x180
[ 15.775519] ? asm_exc_double_fault+0x1f/0x30
[ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22
[ 15.775521] ? asm_exc_page_fault+0x9/0x30
[ 15.775522] ? error_entry+0x17/0x140
[ 15.775523] </#DF>
[ 15.775523] WARNING: stack recursion on stack type 6
[ 15.775524] Modules linked in: msr efivarfs
[ 15.837935] ---[ end trace 0000000000000000 ]---
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
arch/x86/kernel/nmi.c | 2 +-
2 files changed, 30 insertions(+), 1 deletion(-)
Comments
On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote: > AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when > "perf" executes. "./perf record sleep 20" is an example. > > Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not > defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi) > and actual reading of DB7 (which in turn causes #VC) every function is > inlined Very much on purpose. > and no stack frame is created (?). Silly compilers ;-) I think you can force it to by using inline asm with a rsp dependency or somesuch. > Replacing __always_inline with > noinline in local_db_save() or native_get_debugreg() fixes the problem. It will create others. > Why is this crash happening and how to fix that? I am still reading > the assembly but was hoping for a shortcut here :) Thanks, Welcome to the wonderful shit show that is x86 exceptions :/ I thought sev_es_*() is supposed to fix this. Joerg, any clue? > aik-Standard-PC-i440FX-PIIX-1996 login: [A[ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2) > [ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI > [ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73 > [ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown > [ 15.775323] RIP: 0010:error_entry+0x17/0x140 > [ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9 > [ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097 > [ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed > [ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8 > [ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000 > [ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > [ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000 > [ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000 > [ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0 > [ 15.775342] Call Trace: > [ 15.775352] <NMI> <snip> > [ 15.775495] ? asm_exc_page_fault+0x22/0x30 > [ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22 > [ 15.775497] ? exc_page_fault+0x11/0x120 > [ 15.775499] ? asm_exc_page_fault+0x22/0x30 > [ 15.775500] ? check_preemption_disabled+0x8/0xe0 > [ 15.775502] ? __sev_es_ist_enter+0x13/0x100 > [ 15.775503] ? exc_nmi+0x10e/0x150 > [ 15.775505] ? end_repeat_nmi+0x16/0x67 > [ 15.775506] ? asm_exc_double_fault+0x30/0x30 > [ 15.775507] ? asm_exc_double_fault+0x30/0x30 > [ 15.775508] ? asm_exc_double_fault+0x30/0x30 > [ 15.775509] </NMI> > [ 15.775509] <#VC> > [ 15.775510] ? __show_regs.cold+0x18e/0x23d > [ 15.775511] </#VC> > [ 15.775511] <#DF> > [ 15.775512] ? __die_body.cold+0x1a/0x1f > [ 15.775513] ? die+0x26/0x40 > [ 15.775517] ? handle_stack_overflow+0x44/0x60 > [ 15.775518] ? exc_double_fault+0x14b/0x180 > [ 15.775519] ? asm_exc_double_fault+0x1f/0x30 > [ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22 > [ 15.775521] ? asm_exc_page_fault+0x9/0x30 > [ 15.775522] ? error_entry+0x17/0x140 > [ 15.775523] </#DF> > [ 15.775523] WARNING: stack recursion on stack type 6 > [ 15.775524] Modules linked in: msr efivarfs > [ 15.837935] ---[ end trace 0000000000000000 ]--- > > Signed-off-by: Alexey Kardashevskiy <aik@amd.com> > --- > arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++ > arch/x86/kernel/nmi.c | 2 +- > 2 files changed, 30 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h > index b049d950612f..396e2ea114dc 100644 > --- a/arch/x86/include/asm/debugreg.h > +++ b/arch/x86/include/asm/debugreg.h > @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7) > set_debugreg(dr7, 7); > } > > +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */ > +static noinline unsigned long native_get_debugreg7(void) > +{ > + unsigned long dr7; > + asm("mov %%db7, %0" :"=r" (dr7)); > + return dr7; > +} > + > +static __always_inline unsigned long local_db_save_exc_nmi(void) > +{ > + unsigned long dr7; > + > + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active()) > + return 0; > + > + dr7 = native_get_debugreg7(); > + dr7 &= ~0x400; /* architecturally set bit */ > + if (dr7) > + set_debugreg(0, 7); > + /* > + * Ensure the compiler doesn't lower the above statements into > + * the critical section; disabling breakpoints late would not > + * be good. > + */ > + barrier(); > + > + return dr7; > +} This is broken, and building with DEBUG_ENTRY=y would've told you. > + > #ifdef CONFIG_CPU_SUP_AMD > extern void set_dr_addr_mask(unsigned long mask, int dr); > #else > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c > index cec0bfa3bc04..400b5b6b74f6 100644 > --- a/arch/x86/kernel/nmi.c > +++ b/arch/x86/kernel/nmi.c > @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) > */ > sev_es_ist_enter(regs); > > - this_cpu_write(nmi_dr7, local_db_save()); > + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); > > irq_state = irqentry_nmi_enter(regs); So what I don't get is why sev_es_ist_enter() doesn't cause us to make a stack frame, that has an actual call in it (although admittedly conditional).
On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote: > Welcome to the wonderful shit show that is x86 exceptions :/ > > I thought sev_es_*() is supposed to fix this. Joerg, any clue? Hmm, no, not yet, the stack-trace doesn't make much sense to me. The sev_es_* function calls in the NMI path are for re-enabling NMI and adjusting the #VC IST stack to allow nested VCs. Alexey, can you try to get a more stable backtrace? For example by building the kernel with frame pointers? Regards, Joerg
On 27/1/23 21:37, Joerg Roedel wrote: > On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote: >> Welcome to the wonderful shit show that is x86 exceptions :/ >> >> I thought sev_es_*() is supposed to fix this. Joerg, any clue? > > Hmm, no, not yet, the stack-trace doesn't make much sense to me. The > sev_es_* function calls in the NMI path are for re-enabling NMI and > adjusting the #VC IST stack to allow nested VCs. > > Alexey, can you try to get a more stable backtrace? For example by > building the kernel with frame pointers? Do you mean these guys (below)? aik@aik-Standard-PC-i440FX-PIIX-1996:~$ grep FRAME_POINTER /boot/config-$(uname -r) CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_FRAME_POINTER=y CONFIG_UNWINDER_FRAME_POINTER=y Here is the complete output of that VM (200k so not in the email): https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc Note that the original long backtrace appears more than once, I just did not copy all of that in the first email.
On 27/1/23 20:08, Peter Zijlstra wrote: > On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote: >> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when >> "perf" executes. "./perf record sleep 20" is an example. >> >> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not >> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi) >> and actual reading of DB7 (which in turn causes #VC) every function is >> inlined > > Very much on purpose. > >> and no stack frame is created (?). > > Silly compilers ;-) I think you can force it to by using inline asm with > a rsp dependency or somesuch. > >> Replacing __always_inline with >> noinline in local_db_save() or native_get_debugreg() fixes the problem. > > It will create others. > >> Why is this crash happening and how to fix that? I am still reading >> the assembly but was hoping for a shortcut here :) Thanks, > > Welcome to the wonderful shit show that is x86 exceptions :/ > > I thought sev_es_*() is supposed to fix this. Joerg, any clue? > >> aik-Standard-PC-i440FX-PIIX-1996 login: [A[ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2) >> [ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI >> [ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73 >> [ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown >> [ 15.775323] RIP: 0010:error_entry+0x17/0x140 >> [ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9 >> [ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097 >> [ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed >> [ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8 >> [ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000 >> [ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >> [ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000 >> [ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000 >> [ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0 >> [ 15.775342] Call Trace: >> [ 15.775352] <NMI> > > <snip> > >> [ 15.775495] ? asm_exc_page_fault+0x22/0x30 >> [ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22 >> [ 15.775497] ? exc_page_fault+0x11/0x120 >> [ 15.775499] ? asm_exc_page_fault+0x22/0x30 >> [ 15.775500] ? check_preemption_disabled+0x8/0xe0 >> [ 15.775502] ? __sev_es_ist_enter+0x13/0x100 >> [ 15.775503] ? exc_nmi+0x10e/0x150 >> [ 15.775505] ? end_repeat_nmi+0x16/0x67 >> [ 15.775506] ? asm_exc_double_fault+0x30/0x30 >> [ 15.775507] ? asm_exc_double_fault+0x30/0x30 >> [ 15.775508] ? asm_exc_double_fault+0x30/0x30 >> [ 15.775509] </NMI> >> [ 15.775509] <#VC> >> [ 15.775510] ? __show_regs.cold+0x18e/0x23d >> [ 15.775511] </#VC> >> [ 15.775511] <#DF> >> [ 15.775512] ? __die_body.cold+0x1a/0x1f >> [ 15.775513] ? die+0x26/0x40 >> [ 15.775517] ? handle_stack_overflow+0x44/0x60 >> [ 15.775518] ? exc_double_fault+0x14b/0x180 >> [ 15.775519] ? asm_exc_double_fault+0x1f/0x30 >> [ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22 >> [ 15.775521] ? asm_exc_page_fault+0x9/0x30 >> [ 15.775522] ? error_entry+0x17/0x140 >> [ 15.775523] </#DF> >> [ 15.775523] WARNING: stack recursion on stack type 6 >> [ 15.775524] Modules linked in: msr efivarfs >> [ 15.837935] ---[ end trace 0000000000000000 ]--- >> >> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> >> --- >> arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++ >> arch/x86/kernel/nmi.c | 2 +- >> 2 files changed, 30 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h >> index b049d950612f..396e2ea114dc 100644 >> --- a/arch/x86/include/asm/debugreg.h >> +++ b/arch/x86/include/asm/debugreg.h >> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7) >> set_debugreg(dr7, 7); >> } >> >> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */ >> +static noinline unsigned long native_get_debugreg7(void) >> +{ >> + unsigned long dr7; >> + asm("mov %%db7, %0" :"=r" (dr7)); >> + return dr7; >> +} >> + >> +static __always_inline unsigned long local_db_save_exc_nmi(void) >> +{ >> + unsigned long dr7; >> + >> + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active()) >> + return 0; >> + >> + dr7 = native_get_debugreg7(); >> + dr7 &= ~0x400; /* architecturally set bit */ >> + if (dr7) >> + set_debugreg(0, 7); >> + /* >> + * Ensure the compiler doesn't lower the above statements into >> + * the critical section; disabling breakpoints late would not >> + * be good. >> + */ >> + barrier(); >> + >> + return dr7; >> +} > > This is broken, and building with DEBUG_ENTRY=y would've told you. Huh, good to know. Is this it telling me so? vmlinux.o: warning: objtool: exc_nmi+0x73: call to native_get_debugreg7() leaves .noinstr.text section >> + >> #ifdef CONFIG_CPU_SUP_AMD >> extern void set_dr_addr_mask(unsigned long mask, int dr); >> #else >> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c >> index cec0bfa3bc04..400b5b6b74f6 100644 >> --- a/arch/x86/kernel/nmi.c >> +++ b/arch/x86/kernel/nmi.c >> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) >> */ >> sev_es_ist_enter(regs); >> >> - this_cpu_write(nmi_dr7, local_db_save()); >> + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); >> >> irq_state = irqentry_nmi_enter(regs); > > So what I don't get is why sev_es_ist_enter() doesn't cause us to make a > stack frame, that has an actual call in it (although admittedly > conditional). Is not the frame gone when sev_es_ist_enter() (which does not get inlined as per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>: ") returned?
On Fri, Jan 27, 2023 at 11:13:38PM +1100, Alexey Kardashevskiy wrote: > > This is broken, and building with DEBUG_ENTRY=y would've told you. > > > Huh, good to know. Is this it telling me so? > > vmlinux.o: warning: objtool: exc_nmi+0x73: call to native_get_debugreg7() > leaves .noinstr.text section > Yep. The ramification of all that is that by calling non-noinstr code (double negative, iow, regular instrumented code) is that you can end up in the tracers/*SAN/breakpoints etc.. code -- something we're very much not ready for at this point. > > > + > > > #ifdef CONFIG_CPU_SUP_AMD > > > extern void set_dr_addr_mask(unsigned long mask, int dr); > > > #else > > > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c > > > index cec0bfa3bc04..400b5b6b74f6 100644 > > > --- a/arch/x86/kernel/nmi.c > > > +++ b/arch/x86/kernel/nmi.c > > > @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) > > > */ > > > sev_es_ist_enter(regs); > > > - this_cpu_write(nmi_dr7, local_db_save()); > > > + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); > > > irq_state = irqentry_nmi_enter(regs); > > > > So what I don't get is why sev_es_ist_enter() doesn't cause us to make a > > stack frame, that has an actual call in it (although admittedly > > conditional). > > Is not the frame gone when sev_es_ist_enter() (which does not get inlined as > per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>: > ") returned? Well, returning would consume the callframe, but the stack setup of the caller should remain. Let me go stare at some asm.
On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote: > Here is the complete output of that VM (200k so not in the email): > > https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc Thanks. So looking at the code in the traces: Code starting with the faulting instruction =========================================== 0: 65 48 8b 04 25 c0 db mov %gs:0x2dbc0,%rax 7: 02 00 9: 48 8b 80 a8 08 00 00 mov 0x8a8(%rax),%rax 10: 0f 0d 48 70 prefetchw 0x70(%rax) 14: e8 .byte 0xe8 15: 82 .byte 0x82 I think the fault in the page-fault handler happens here: DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) { unsigned long address = read_cr2(); irqentry_state_t state; prefetchw(¤t->mm->mmap_lock); <--- Here To be precise, it faults while dereferencing current. That means that GS_BASE is likely broken, need to find out why... This at least explains why it page-faults in a loop until the stack overflows and the guard page is hit. Regards, Joerg
On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
Okay, I reproduced the problem here and the root cause turned out to be
that the compiler moved the DR7 read instruction before the 5-byte NOP
which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
guaranteed to cause #VC exception stack recursion if the NMI was
triggered on the #VC stack, and that leads to all kinds of undefined
behavior.
Regards,
Joerg
On 28/1/23 04:25, Joerg Roedel wrote: > On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote: >> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc > > Okay, I reproduced the problem here and the root cause turned out to be > that the compiler moved the DR7 read instruction before the 5-byte NOP > which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is > guaranteed to cause #VC exception stack recursion if the NMI was > triggered on the #VC stack, and that leads to all kinds of undefined > behavior. Cool! (out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does not show any, is this after lifepatching? Meanwhile, this seems to be doing the right thing: diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h index b049d950612f..687b15297057 100644 --- a/arch/x86/include/asm/debugreg.h +++ b/arch/x86/include/asm/debugreg.h @@ -39,7 +39,7 @@ static __always_inline unsigned long native_get_debugreg(int regno) asm("mov %%db6, %0" :"=r" (val)); break; case 7: - asm("mov %%db7, %0" :"=r" (val)); + asm volatile ("mov %%db7, %0" :"=r" (val));
On Sat, Jan 28, 2023 at 10:24:56PM +1100, Alexey Kardashevskiy wrote: > (out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does > not show any, is this after lifepatching? Here is the disassembly of exc_nmi of a kernel built from tip/master with CONFIG_PARAVIRT=n: <exc_nmi>: 41 54 push %r12 55 push %rbp 48 89 fd mov %rdi,%rbp 53 push %rbx 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 65 8b 05 69 66 41 7e mov %gs:0x7e416669(%rip),%eax # 3254c <pcpu_hot+0xc> 48 98 cltq 48 0f a3 05 33 00 2b bt %rax,0x12b0033(%rip) # ffffffff82ecbf20 <__cpu_online_mask> 01 0f 83 c9 00 00 00 jae ffffffff81c1bfbc <exc_nmi+0xec> 65 8b 05 f6 41 40 7e mov %gs:0x7e4041f6(%rip),%eax # 200f0 <nmi_state> 85 c0 test %eax,%eax 0f 85 f8 00 00 00 jne ffffffff81c1bffa <exc_nmi+0x12a> 65 c7 05 e3 41 40 7e movl $0x1,%gs:0x7e4041e3(%rip) # 200f0 <nmi_state> 01 00 00 00 0f 20 d0 mov %cr2,%rax 65 48 89 05 d0 41 40 mov %rax,%gs:0x7e4041d0(%rip) # 200e8 <nmi_cr2> 7e 41 0f 21 fc mov %db7,%r12 <-- here is the DR7 read 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) <-- here are the NOPS that become a call to sev_es_ist_enter() in SEV-ES guests The DR7 read will cause a #VC exception, switching to the #VC IST stack. If the NMI was raised while already on the #VC IST stack, this DR7 read will overwrite the previous stack frame and cause stack recursion, with all funny side effects. > diff --git a/arch/x86/include/asm/debugreg.h > b/arch/x86/include/asm/debugreg.h > index b049d950612f..687b15297057 100644 > --- a/arch/x86/include/asm/debugreg.h > +++ b/arch/x86/include/asm/debugreg.h > @@ -39,7 +39,7 @@ static __always_inline unsigned long > native_get_debugreg(int regno) > asm("mov %%db6, %0" :"=r" (val)); > break; > case 7: > - asm("mov %%db7, %0" :"=r" (val)); > + asm volatile ("mov %%db7, %0" :"=r" (val)); Yeah, something like this will be the fix. I am still thinking about the right place to put the volatile to make it explicit to the situation we are encountering here (which is SEV-ES specific). Best would be an explicit barrier in C code between sev_es_ist_enter() and the DR7 read, but all barriers I tried to far only seem to affect memory instructions and had no influence on the DR7 read (which is obviously not considered as a memory read by the compiler). The best place to put the barrier is in the sev_es_ist_enter() inline function, right after the static_call to __sev_es_ist_enter(). Regards, Joerg
On Sat, Jan 28, 2023 at 02:52:23PM +0100, Joerg Roedel wrote: > Yeah, something like this will be the fix. I am still thinking about > the right place to put the volatile to make it explicit to the situation > we are encountering here (which is SEV-ES specific). > > Best would be an explicit barrier in C code between sev_es_ist_enter() > and the DR7 read, but all barriers I tried to far only seem to affect > memory instructions and had no influence on the DR7 read (which is > obviously not considered as a memory read by the compiler). > > The best place to put the barrier is in the sev_es_ist_enter() inline > function, right after the static_call to __sev_es_ist_enter(). Okay, after some investigation I was not able to find a compiler barrier which affects DR7 read ordering. This leaves us with the only solution of directly forbidding DR7 register access re-ordering by adding a volatile to the asm, like you did before. I will send a fix later today. Regards, Joerg
On January 28, 2023 3:24:56 AM PST, Alexey Kardashevskiy <aik@amd.com> wrote: > > >On 28/1/23 04:25, Joerg Roedel wrote: >> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote: >>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc >> >> Okay, I reproduced the problem here and the root cause turned out to be >> that the compiler moved the DR7 read instruction before the 5-byte NOP >> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is >> guaranteed to cause #VC exception stack recursion if the NMI was >> triggered on the #VC stack, and that leads to all kinds of undefined >> behavior. > >Cool! > >(out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does not show any, is this after lifepatching? > >Meanwhile, this seems to be doing the right thing: > > >diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h >index b049d950612f..687b15297057 100644 >--- a/arch/x86/include/asm/debugreg.h >+++ b/arch/x86/include/asm/debugreg.h >@@ -39,7 +39,7 @@ static __always_inline unsigned long native_get_debugreg(int regno) > asm("mov %%db6, %0" :"=r" (val)); > break; > case 7: >- asm("mov %%db7, %0" :"=r" (val)); >+ asm volatile ("mov %%db7, %0" :"=r" (val)); > > > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is not... %dr6 is the status register! I believe they should all be volatile (the compiler semantics is that volatile operations are always executed exactly once, in strict program order with respect to any other volatile operations); the real question is if there should also be memory clobbers on %dr6 reads and any %dr write.
On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote: > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is > not... %dr6 is the status register! Yeah, as a precaution I think we should make all those volatile. Just in case. > I believe they should all be volatile (the compiler semantics is that > volatile operations are always executed exactly once, in strict > program order with respect to any other volatile operations); the real > question is if there should also be memory clobbers on %dr6 reads and > any %dr write. Yes, I think so too. From gcc docs: "6.47.2.1 Volatile ................. ... Note that the compiler can move even 'volatile asm' instructions relative to other code, including across jump instructions." We already have __FORCE_ORDER for exactly things like that.
On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote: > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is > not... %dr6 is the status register! The reason is that on SEV-ES only accesses to DR7 will cause #VC exceptions, DR0-DR6 are not intercepted. Regards, Joerg
On Tue, Jan 31, 2023, Joerg Roedel wrote: > On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote: > > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is > > not... %dr6 is the status register! > > The reason is that on SEV-ES only accesses to DR7 will cause #VC > exceptions, DR0-DR6 are not intercepted. I don't think that is technically true. A _well-behaved_ hypervisor will not intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES architecture enforces that behavior.
On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote: > I don't think that is technically true. A _well-behaved_ hypervisor will not > intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES > architecture enforces that behavior. Not from the hardware architecture side, but the GHCB spec does not list NAE events for DR0-DR6 accesses, so a guest is not required to handle them in the VC handler. Linux under SEV-ES will crash if the HV intercepts debug registers, except DR7. Regards, Joerg
On Tue, Jan 31, 2023, Joerg Roedel wrote: > On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote: > > I don't think that is technically true. A _well-behaved_ hypervisor will not > > intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES > > architecture enforces that behavior. > > Not from the hardware architecture side, but the GHCB spec does not > list NAE events for DR0-DR6 accesses, so a guest is not required to > handle them in the VC handler. > > Linux under SEV-ES will crash if the HV intercepts debug registers, > except DR7. Right, I'm just objecting to the wording of "DR0-DR6 are not intercepted". E.g. from a security perspective, the kernel shouldn't rely on DR0-DR6 to execute cleanly.
diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h index b049d950612f..396e2ea114dc 100644 --- a/arch/x86/include/asm/debugreg.h +++ b/arch/x86/include/asm/debugreg.h @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7) set_debugreg(dr7, 7); } +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */ +static noinline unsigned long native_get_debugreg7(void) +{ + unsigned long dr7; + asm("mov %%db7, %0" :"=r" (dr7)); + return dr7; +} + +static __always_inline unsigned long local_db_save_exc_nmi(void) +{ + unsigned long dr7; + + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active()) + return 0; + + dr7 = native_get_debugreg7(); + dr7 &= ~0x400; /* architecturally set bit */ + if (dr7) + set_debugreg(0, 7); + /* + * Ensure the compiler doesn't lower the above statements into + * the critical section; disabling breakpoints late would not + * be good. + */ + barrier(); + + return dr7; +} + #ifdef CONFIG_CPU_SUP_AMD extern void set_dr_addr_mask(unsigned long mask, int dr); #else diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index cec0bfa3bc04..400b5b6b74f6 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) */ sev_es_ist_enter(regs); - this_cpu_write(nmi_dr7, local_db_save()); + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); irq_state = irqentry_nmi_enter(regs);