Message ID | 20230424165726.2245548-1-ardb@kernel.org |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp2889858vqo; Mon, 24 Apr 2023 10:09:53 -0700 (PDT) X-Google-Smtp-Source: AKy350ZEz+UjvMVprd5mxaf5MMOr8zPq3f7CHEICJwvXYXj3MdHrvn6VPQ/+NNz2RsaRmQQpfiB3 X-Received: by 2002:a17:902:c408:b0:1a8:ac3:47b4 with SMTP id k8-20020a170902c40800b001a80ac347b4mr20406423plk.20.1682356193165; Mon, 24 Apr 2023 10:09:53 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682356193; cv=none; d=google.com; s=arc-20160816; b=Ab3pxYfyyRoOR/qe6reNvp24jFwWrSTlEOZwHaFbemCkYRwth5+MyxStkANQ7ciFaJ mYJ1zgT3yfWhoRgMnHHMlynsLvhrzP/zk0zfxwf4xWZbnXkWO9hEPteyioiVjLLTmbiT 9/i8ph7nOObPp+TehfprACSB4cnOOKaPo0wHIIa7qNmdeRY5jsVfdznFku8Wa629jp/P oyfiYxiHBGG9YCG6toI0GVEZk8Bo5T+6lOn4O3D4p4lUF6WfDm35k0Hho3U2iqK2beYN FyaKabOuDLYVFaGLttkAzEVHPOshpJ1/pBrTzqw1+oPkNGUGAso4OwB9nam3wO2Z0tOh XFvA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=d6a04OxxZjYB+W6ahweShNM3fEev1Wp1ZgrYEmsamMc=; b=HOVqJ1FUbtrVdevEABwTqcKCYxhfACaBmAmhXEc4af/S+weZLycvQrakYLf687fm6c RzXf/d/XnyrU1GXtc7VMFBj1qWp+W3cdtrE+JMWTUp3ZsD/Acrqcz3qy7lh5+aV9nbSt DRO3iZCCOCbJNFNaOfj+kmmeHY1P6cPw52NhXc+BaSFr0c2zlXPJGaZLC7wJSGLkKXNZ 3UeiV7c8yMoeME46LXMatBL5rdyd5QkkyXpij27g+x+iUnzCSX8RxsDZMy6NShSeqZSj NFb+N8rcVVSMakqg1q5MoQWEkO7eTp5ewBs60qhyPBxbODw0qia3maE+JcZtDI7Z99r9 FJXg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=JqdGR02v; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x18-20020a17090300d200b001a94b91defbsi7747607plc.264.2023.04.24.10.09.38; Mon, 24 Apr 2023 10:09:53 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=JqdGR02v; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232033AbjDXQ5r (ORCPT <rfc822;fengqi706@gmail.com> + 99 others); Mon, 24 Apr 2023 12:57:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37474 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231910AbjDXQ5p (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 24 Apr 2023 12:57:45 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 386F75592; Mon, 24 Apr 2023 09:57:38 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id D279861938; Mon, 24 Apr 2023 16:57:37 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C8227C433D2; Mon, 24 Apr 2023 16:57:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682355457; bh=IwUzErAvEsIxyUwV6dzjukx93sULgQiAOKamWXglYUQ=; h=From:To:Cc:Subject:Date:From; b=JqdGR02vPCKOkvV74LnqaVa1PMRDXMfulxcS05Yfsv4t2NHFyIFM7qG9I3dtguH/z cCQWPW5Y++gWS2vzjHflT7GX5gRVFy4eDmhv7wT+zbjNApmVcuRkbz9y1kImbnbW2T fkSP3QjDCKlFJv6cCsT2o1FeIQmBikYHE9pxctjbHKah5fMtKdO8+mPsHojcu0Ut5Y IAzTqduteM/oWLbmf7EANUc6R98VlVp8QvelDxKCLT6bXkyQtu+UkbCgkmic1cFDiP 980RdBqRhOJJuyVZZ/vKYJEEXItPhtmo1KwHAZ8iQP3QlJrBQ9JHNKtQjCslUN/rgY WeDAvY69eBPDg== From: Ard Biesheuvel <ardb@kernel.org> To: linux-efi@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel <ardb@kernel.org>, Evgeniy Baskov <baskov@ispras.ru>, Borislav Petkov <bp@alien8.de>, Andy Lutomirski <luto@kernel.org>, Dave Hansen <dave.hansen@linux.intel.com>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Thomas Gleixner <tglx@linutronix.de>, Alexey Khoroshilov <khoroshilov@ispras.ru>, Peter Jones <pjones@redhat.com>, Gerd Hoffmann <kraxel@redhat.com>, Dave Young <dyoung@redhat.com>, Mario Limonciello <mario.limonciello@amd.com>, Kees Cook <keescook@chromium.org>, Tom Lendacky <thomas.lendacky@amd.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, Linus Torvalds <torvalds@linux-foundation.org> Subject: [PATCH 0/6] efi/x86: Avoid legacy decompressor during EFI boot Date: Mon, 24 Apr 2023 18:57:20 +0200 Message-Id: <20230424165726.2245548-1-ardb@kernel.org> X-Mailer: git-send-email 2.39.2 MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=3630; i=ardb@kernel.org; h=from:subject; bh=IwUzErAvEsIxyUwV6dzjukx93sULgQiAOKamWXglYUQ=; b=owGbwMvMwCFmkMcZplerG8N4Wi2JIcVty9uKxIWFLApzWNRVuC/52UhvvBN/OCndsEawqmVa/ KHmaZ87SlkYxDgYZMUUWQRm/3238/REqVrnWbIwc1iZQIYwcHEKwETMVBn+l/85+2em4Y0vJdfn HXyRu4W1cJ3oS+Fy7lXcpcV99yUk8hgZrk2a+NNgj/JHh47p0f9/m0aE9nC9KlcsNT3TcTDuqbE jEwA= X-Developer-Key: i=ardb@kernel.org; a=openpgp; fpr=F43D03328115A198C90016883D200E9CA6329909 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-7.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764078327325553894?= X-GMAIL-MSGID: =?utf-8?q?1764078327325553894?= |
Series |
efi/x86: Avoid legacy decompressor during EFI boot
|
|
Message
Ard Biesheuvel
April 24, 2023, 4:57 p.m. UTC
This series is conceptually a combination of Evgeny's series [0] and mine [1], both of which attempt to make the early decompressor code more amenable to executing in the EFI environment with stricter handling of memory permissions. My series [1] implemented zboot for x86, by getting rid of the entire x86 decompressor, and replacing it with existing EFI code that does the same but in a generic way. The downside of this is that only EFI boot is supported, making it unviable for distros, which need to support BIOS boot and hybrid EFI boot modes that omit the EFI stub. Evgeny's series [0] adapted the entire decompressor code flow to allow it to execute in the EFI context as well as the bare metal context, and this involves changes to the 1:1 mapping code and the page fault handlers etc, none of which are really needed when doing EFI boot in the first place. So this series attempts to occupy the middle ground here: it makes minimal changes to the existing decompressor so some of it can be called from the EFI stub. Then, it reimplements the EFI boot flow to decompress the kernel and boot it directly, without relying on the trampoline code, page table code or page fault handling code. This allows us to get rid of quite a bit of unsavory EFI stub code, and replace it with two clear invocations of the EFI firmware APIs to clear NX restrictions from allocations that have been populated with executable code. The only code that is being reused is the decompression library itself, along with the minimal ELF parsing that is required to copy the ELF segments in place, and the relocation processing that fixes up absolute symbol references to refer to the correct virtual addresses. Note that some of Evgeny's changes to clean up the PE/COFF header generation will still be needed, but I've omitted those here for brevity. Cc: Evgeniy Baskov <baskov@ispras.ru> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Peter Jones <pjones@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> [0] https://lore.kernel.org/all/cover.1678785672.git.baskov@ispras.ru/ [1] https://lore.kernel.org/all/20230416120729.2470762-1-ardb@kernel.org/ Ard Biesheuvel (6): x86: decompressor: Move global symbol references to C code x86: decompressor: Factor out kernel decompression and relocation x86: efistub: Obtain ACPI RSDP address while running in the stub x86: efistub: Perform 4/5 level paging switch from the stub x86: efistub: Prefer EFI memory attributes protocol over DXE services x86: efistub: Avoid legacy decompressor when doing EFI boot arch/x86/boot/compressed/efi_mixed.S | 55 --- arch/x86/boot/compressed/head_32.S | 24 -- arch/x86/boot/compressed/head_64.S | 39 +-- arch/x86/boot/compressed/misc.c | 44 ++- arch/x86/include/asm/efi.h | 2 + drivers/firmware/efi/libstub/efi-stub-helper.c | 4 + drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++------- 7 files changed, 279 insertions(+), 249 deletions(-)
Comments
On Mon, Apr 24, 2023 at 06:57:20PM +0200, Ard Biesheuvel wrote: > arch/x86/boot/compressed/efi_mixed.S | 55 --- > arch/x86/boot/compressed/head_32.S | 24 -- > arch/x86/boot/compressed/head_64.S | 39 +-- > arch/x86/boot/compressed/misc.c | 44 ++- > arch/x86/include/asm/efi.h | 2 + > drivers/firmware/efi/libstub/efi-stub-helper.c | 4 + > drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++------- > 7 files changed, 279 insertions(+), 249 deletions(-) Upon a quick scan, I can't argue with that diffstat and would prefer a lot more if we did this instead of Evgeny's pile which touches a lot of nasty and hard to debug code which gets executed on *everything*. So if people agree with that approach, I'd gladly give it a more detailed look. Thx.
On Wed, 26 Apr 2023 at 11:18, Borislav Petkov <bp@alien8.de> wrote: > > On Mon, Apr 24, 2023 at 06:57:20PM +0200, Ard Biesheuvel wrote: > > arch/x86/boot/compressed/efi_mixed.S | 55 --- > > arch/x86/boot/compressed/head_32.S | 24 -- > > arch/x86/boot/compressed/head_64.S | 39 +-- > > arch/x86/boot/compressed/misc.c | 44 ++- > > arch/x86/include/asm/efi.h | 2 + > > drivers/firmware/efi/libstub/efi-stub-helper.c | 4 + > > drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++------- > > 7 files changed, 279 insertions(+), 249 deletions(-) > > Upon a quick scan, I can't argue with that diffstat and would prefer > a lot more if we did this instead of Evgeny's pile which touches a lot > of nasty and hard to debug code which gets executed on *everything*. > > So if people agree with that approach, I'd gladly give it a more > detailed look. > I think the general approach is better, but to be honest, I may have missed a thing or two, so it would be good if people more familiar with the history could chime in.
On 2023-04-24 19:57, Ard Biesheuvel wrote: > This series is conceptually a combination of Evgeny's series [0] and > mine [1], both of which attempt to make the early decompressor code > more > amenable to executing in the EFI environment with stricter handling of > memory permissions. > > My series [1] implemented zboot for x86, by getting rid of the entire > x86 decompressor, and replacing it with existing EFI code that does the > same but in a generic way. The downside of this is that only EFI boot > is > supported, making it unviable for distros, which need to support BIOS > boot and hybrid EFI boot modes that omit the EFI stub. > > Evgeny's series [0] adapted the entire decompressor code flow to allow > it to execute in the EFI context as well as the bare metal context, and > this involves changes to the 1:1 mapping code and the page fault > handlers etc, none of which are really needed when doing EFI boot in > the > first place. > > So this series attempts to occupy the middle ground here: it makes > minimal changes to the existing decompressor so some of it can be > called > from the EFI stub. Then, it reimplements the EFI boot flow to > decompress > the kernel and boot it directly, without relying on the trampoline > code, > page table code or page fault handling code. This allows us to get rid > of quite a bit of unsavory EFI stub code, and replace it with two clear > invocations of the EFI firmware APIs to clear NX restrictions from > allocations that have been populated with executable code. > > The only code that is being reused is the decompression library itself, > along with the minimal ELF parsing that is required to copy the ELF > segments in place, and the relocation processing that fixes up absolute > symbol references to refer to the correct virtual addresses. > > Note that some of Evgeny's changes to clean up the PE/COFF header > generation will still be needed, but I've omitted those here for > brevity. My series also implements W^X for both UEFI and non-UEFI boot paths, but I agree that we can just consider non-UEFI code legacy and it would be better to avoid touching it and encourage everyone to use UEFI code path on x86 instead. If PE format will also get fixed with either my patches or some others, I do like your approach more than mine, as it removes a lot of old cruft but does not break things (as far as I see). Seems like a perfect compromise between [1] and my approach. I've briefly tested the patches and looked through them and they look good to me. Two things I've noticed: * there's one TSC-related TODO; * probably we want to clear .bss in efi32_stub_entry and efi64_stub_entry for UEFI handover protocol, since it's unfortunately still present and .bss will contain garbage. I'll probably do some more testing on the weekend and let you know if I find something. Please tell me if/when you are going to merge these or similar, and I will clean up and rebase PE-related patches on top of these. I'd also like to send W^X patches for EFISTUB (omitting the non-UEFI boot path) as a follow up after the PE file header will get fixed. They will be considerably smaller with this approach and will not touch legacy code. Thanks, Evgeniy Baskov > > Cc: Evgeniy Baskov <baskov@ispras.ru> > Cc: Borislav Petkov <bp@alien8.de> > Cc: Andy Lutomirski <luto@kernel.org> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> > Cc: Peter Jones <pjones@redhat.com> > Cc: Gerd Hoffmann <kraxel@redhat.com> > Cc: Dave Young <dyoung@redhat.com> > Cc: Mario Limonciello <mario.limonciello@amd.com> > Cc: Kees Cook <keescook@chromium.org> > Cc: Tom Lendacky <thomas.lendacky@amd.com> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > > [0] https://lore.kernel.org/all/cover.1678785672.git.baskov@ispras.ru/ > [1] > https://lore.kernel.org/all/20230416120729.2470762-1-ardb@kernel.org/ > > Ard Biesheuvel (6): > x86: decompressor: Move global symbol references to C code > x86: decompressor: Factor out kernel decompression and relocation > x86: efistub: Obtain ACPI RSDP address while running in the stub > x86: efistub: Perform 4/5 level paging switch from the stub > x86: efistub: Prefer EFI memory attributes protocol over DXE services > x86: efistub: Avoid legacy decompressor when doing EFI boot > > arch/x86/boot/compressed/efi_mixed.S | 55 --- > arch/x86/boot/compressed/head_32.S | 24 -- > arch/x86/boot/compressed/head_64.S | 39 +-- > arch/x86/boot/compressed/misc.c | 44 ++- > arch/x86/include/asm/efi.h | 2 + > drivers/firmware/efi/libstub/efi-stub-helper.c | 4 + > drivers/firmware/efi/libstub/x86-stub.c | 360 > +++++++++++++------- > 7 files changed, 279 insertions(+), 249 deletions(-)
On Fri, 28 Apr 2023 at 14:23, Evgeniy Baskov <baskov@ispras.ru> wrote: > > On 2023-04-24 19:57, Ard Biesheuvel wrote: > > This series is conceptually a combination of Evgeny's series [0] and > > mine [1], both of which attempt to make the early decompressor code > > more > > amenable to executing in the EFI environment with stricter handling of > > memory permissions. > > > > My series [1] implemented zboot for x86, by getting rid of the entire > > x86 decompressor, and replacing it with existing EFI code that does the > > same but in a generic way. The downside of this is that only EFI boot > > is > > supported, making it unviable for distros, which need to support BIOS > > boot and hybrid EFI boot modes that omit the EFI stub. > > > > Evgeny's series [0] adapted the entire decompressor code flow to allow > > it to execute in the EFI context as well as the bare metal context, and > > this involves changes to the 1:1 mapping code and the page fault > > handlers etc, none of which are really needed when doing EFI boot in > > the > > first place. > > > > So this series attempts to occupy the middle ground here: it makes > > minimal changes to the existing decompressor so some of it can be > > called > > from the EFI stub. Then, it reimplements the EFI boot flow to > > decompress > > the kernel and boot it directly, without relying on the trampoline > > code, > > page table code or page fault handling code. This allows us to get rid > > of quite a bit of unsavory EFI stub code, and replace it with two clear > > invocations of the EFI firmware APIs to clear NX restrictions from > > allocations that have been populated with executable code. > > > > The only code that is being reused is the decompression library itself, > > along with the minimal ELF parsing that is required to copy the ELF > > segments in place, and the relocation processing that fixes up absolute > > symbol references to refer to the correct virtual addresses. > > > > Note that some of Evgeny's changes to clean up the PE/COFF header > > generation will still be needed, but I've omitted those here for > > brevity. > > My series also implements W^X for both UEFI and non-UEFI boot paths, but > I > agree that we can just consider non-UEFI code legacy and it would be > better > to avoid touching it and encourage everyone to use UEFI code path on x86 > instead. If PE format will also get fixed with either my patches or some > others, I do like your approach more than mine, as it removes a lot of > old > cruft but does not break things (as far as I see). Seems like a perfect > compromise between [1] and my approach. > Thanks, I'm glad we agree. > I've briefly tested the patches and looked through them and they look > good > to me. Two things I've noticed: > * there's one TSC-related TODO; > * probably we want to clear .bss in efi32_stub_entry and > efi64_stub_entry > for UEFI handover protocol, since it's unfortunately still present > and > .bss will contain garbage. I'll look into that. > I'll probably do some more testing on the weekend and let you know if I > find something. > Yes, please. > Please tell me if/when you are going to merge these or similar, and I > will > clean up and rebase PE-related patches on top of these. > > I'd also like to send W^X patches for EFISTUB (omitting the non-UEFI > boot > path) as a follow up after the PE file header will get fixed. They will > be > considerably smaller with this approach and will not touch legacy code. >
On 4/24/23 11:57, Ard Biesheuvel wrote: > This series is conceptually a combination of Evgeny's series [0] and > mine [1], both of which attempt to make the early decompressor code more > amenable to executing in the EFI environment with stricter handling of > memory permissions. > > My series [1] implemented zboot for x86, by getting rid of the entire > x86 decompressor, and replacing it with existing EFI code that does the > same but in a generic way. The downside of this is that only EFI boot is > supported, making it unviable for distros, which need to support BIOS > boot and hybrid EFI boot modes that omit the EFI stub. > > Evgeny's series [0] adapted the entire decompressor code flow to allow > it to execute in the EFI context as well as the bare metal context, and > this involves changes to the 1:1 mapping code and the page fault > handlers etc, none of which are really needed when doing EFI boot in the > first place. > > So this series attempts to occupy the middle ground here: it makes > minimal changes to the existing decompressor so some of it can be called > from the EFI stub. Then, it reimplements the EFI boot flow to decompress > the kernel and boot it directly, without relying on the trampoline code, > page table code or page fault handling code. This allows us to get rid > of quite a bit of unsavory EFI stub code, and replace it with two clear > invocations of the EFI firmware APIs to clear NX restrictions from > allocations that have been populated with executable code. > > The only code that is being reused is the decompression library itself, > along with the minimal ELF parsing that is required to copy the ELF > segments in place, and the relocation processing that fixes up absolute > symbol references to refer to the correct virtual addresses. > > Note that some of Evgeny's changes to clean up the PE/COFF header > generation will still be needed, but I've omitted those here for > brevity. I tried booting an SEV and an SEV-ES guest using this and both failed to boot: EFI stub: WARNING: Decompression failed: Out of memory while allocating z_stream I'll have to take a closer look as to why, but it might be a couple of days before I can get to it. Thanks, Tom > > Cc: Evgeniy Baskov <baskov@ispras.ru> > Cc: Borislav Petkov <bp@alien8.de> > Cc: Andy Lutomirski <luto@kernel.org> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> > Cc: Peter Jones <pjones@redhat.com> > Cc: Gerd Hoffmann <kraxel@redhat.com> > Cc: Dave Young <dyoung@redhat.com> > Cc: Mario Limonciello <mario.limonciello@amd.com> > Cc: Kees Cook <keescook@chromium.org> > Cc: Tom Lendacky <thomas.lendacky@amd.com> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > > [0] https://lore.kernel.org/all/cover.1678785672.git.baskov@ispras.ru/ > [1] https://lore.kernel.org/all/20230416120729.2470762-1-ardb@kernel.org/ > > Ard Biesheuvel (6): > x86: decompressor: Move global symbol references to C code > x86: decompressor: Factor out kernel decompression and relocation > x86: efistub: Obtain ACPI RSDP address while running in the stub > x86: efistub: Perform 4/5 level paging switch from the stub > x86: efistub: Prefer EFI memory attributes protocol over DXE services > x86: efistub: Avoid legacy decompressor when doing EFI boot > > arch/x86/boot/compressed/efi_mixed.S | 55 --- > arch/x86/boot/compressed/head_32.S | 24 -- > arch/x86/boot/compressed/head_64.S | 39 +-- > arch/x86/boot/compressed/misc.c | 44 ++- > arch/x86/include/asm/efi.h | 2 + > drivers/firmware/efi/libstub/efi-stub-helper.c | 4 + > drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++------- > 7 files changed, 279 insertions(+), 249 deletions(-) >
On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: > > On 4/24/23 11:57, Ard Biesheuvel wrote: > > This series is conceptually a combination of Evgeny's series [0] and > > mine [1], both of which attempt to make the early decompressor code more > > amenable to executing in the EFI environment with stricter handling of > > memory permissions. > > > > My series [1] implemented zboot for x86, by getting rid of the entire > > x86 decompressor, and replacing it with existing EFI code that does the > > same but in a generic way. The downside of this is that only EFI boot is > > supported, making it unviable for distros, which need to support BIOS > > boot and hybrid EFI boot modes that omit the EFI stub. > > > > Evgeny's series [0] adapted the entire decompressor code flow to allow > > it to execute in the EFI context as well as the bare metal context, and > > this involves changes to the 1:1 mapping code and the page fault > > handlers etc, none of which are really needed when doing EFI boot in the > > first place. > > > > So this series attempts to occupy the middle ground here: it makes > > minimal changes to the existing decompressor so some of it can be called > > from the EFI stub. Then, it reimplements the EFI boot flow to decompress > > the kernel and boot it directly, without relying on the trampoline code, > > page table code or page fault handling code. This allows us to get rid > > of quite a bit of unsavory EFI stub code, and replace it with two clear > > invocations of the EFI firmware APIs to clear NX restrictions from > > allocations that have been populated with executable code. > > > > The only code that is being reused is the decompression library itself, > > along with the minimal ELF parsing that is required to copy the ELF > > segments in place, and the relocation processing that fixes up absolute > > symbol references to refer to the correct virtual addresses. > > > > Note that some of Evgeny's changes to clean up the PE/COFF header > > generation will still be needed, but I've omitted those here for > > brevity. > > I tried booting an SEV and an SEV-ES guest using this and both failed to boot: > > EFI stub: WARNING: Decompression failed: Out of memory while allocating > z_stream > > I'll have to take a closer look as to why, but it might be a couple of > days before I can get to it. > Thanks Tom. The internal malloc() seems to be failing, which is often caused by BSS clearing problems. Could you elaborate a little bit on the boot environment you are using here?
On 5/2/23 08:39, Ard Biesheuvel wrote: > On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: >> >> On 4/24/23 11:57, Ard Biesheuvel wrote: >>> This series is conceptually a combination of Evgeny's series [0] and >>> mine [1], both of which attempt to make the early decompressor code more >>> amenable to executing in the EFI environment with stricter handling of >>> memory permissions. >>> >>> My series [1] implemented zboot for x86, by getting rid of the entire >>> x86 decompressor, and replacing it with existing EFI code that does the >>> same but in a generic way. The downside of this is that only EFI boot is >>> supported, making it unviable for distros, which need to support BIOS >>> boot and hybrid EFI boot modes that omit the EFI stub. >>> >>> Evgeny's series [0] adapted the entire decompressor code flow to allow >>> it to execute in the EFI context as well as the bare metal context, and >>> this involves changes to the 1:1 mapping code and the page fault >>> handlers etc, none of which are really needed when doing EFI boot in the >>> first place. >>> >>> So this series attempts to occupy the middle ground here: it makes >>> minimal changes to the existing decompressor so some of it can be called >>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress >>> the kernel and boot it directly, without relying on the trampoline code, >>> page table code or page fault handling code. This allows us to get rid >>> of quite a bit of unsavory EFI stub code, and replace it with two clear >>> invocations of the EFI firmware APIs to clear NX restrictions from >>> allocations that have been populated with executable code. >>> >>> The only code that is being reused is the decompression library itself, >>> along with the minimal ELF parsing that is required to copy the ELF >>> segments in place, and the relocation processing that fixes up absolute >>> symbol references to refer to the correct virtual addresses. >>> >>> Note that some of Evgeny's changes to clean up the PE/COFF header >>> generation will still be needed, but I've omitted those here for >>> brevity. >> >> I tried booting an SEV and an SEV-ES guest using this and both failed to boot: >> >> EFI stub: WARNING: Decompression failed: Out of memory while allocating >> z_stream >> >> I'll have to take a closer look as to why, but it might be a couple of >> days before I can get to it. >> > > Thanks Tom. > > The internal malloc() seems to be failing, which is often caused by > BSS clearing problems. Could you elaborate a little bit on the boot > environment you are using here? I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my host/hypervisor and guest kernel and the current OVMF tree built using OvmfPkgX64.dsc. I was originally using the current merge window Linux, but moved to the release version just to . With the release version SEV and SEV-ES still fail to boot, but SEV actually #GPs now. And some of the register contents look like encrypted data: ConvertPages: range 1000000 - 4FA1FFF covers multiple entries !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - 00000000 !!!! ExceptionData - 0000000000000000 RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 RSI - 0000000001000000, RDI - 1FBA02A45943DE68 R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 R14 - 0000000001000000, R15 - 000000007E0A5198 DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 GS - 0000000000000030, SS - 0000000000000030 CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 CR4 - 0000000000000668, CR8 - 0000000000000000 DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 FXSAVE_STATE - 000000007FD96F80 !!!! Find image based on IP(0x597E71C1) /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! So, yes, probably an area of memory that was zeroes when mapped unencrypted, but wasn't cleared after changing the mapping to encrypted. Thanks, Tom
On Tue, 2 May 2023 at 18:08, Tom Lendacky <thomas.lendacky@amd.com> wrote: > > On 5/2/23 08:39, Ard Biesheuvel wrote: > > On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: > >> > >> On 4/24/23 11:57, Ard Biesheuvel wrote: > >>> This series is conceptually a combination of Evgeny's series [0] and > >>> mine [1], both of which attempt to make the early decompressor code more > >>> amenable to executing in the EFI environment with stricter handling of > >>> memory permissions. > >>> > >>> My series [1] implemented zboot for x86, by getting rid of the entire > >>> x86 decompressor, and replacing it with existing EFI code that does the > >>> same but in a generic way. The downside of this is that only EFI boot is > >>> supported, making it unviable for distros, which need to support BIOS > >>> boot and hybrid EFI boot modes that omit the EFI stub. > >>> > >>> Evgeny's series [0] adapted the entire decompressor code flow to allow > >>> it to execute in the EFI context as well as the bare metal context, and > >>> this involves changes to the 1:1 mapping code and the page fault > >>> handlers etc, none of which are really needed when doing EFI boot in the > >>> first place. > >>> > >>> So this series attempts to occupy the middle ground here: it makes > >>> minimal changes to the existing decompressor so some of it can be called > >>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress > >>> the kernel and boot it directly, without relying on the trampoline code, > >>> page table code or page fault handling code. This allows us to get rid > >>> of quite a bit of unsavory EFI stub code, and replace it with two clear > >>> invocations of the EFI firmware APIs to clear NX restrictions from > >>> allocations that have been populated with executable code. > >>> > >>> The only code that is being reused is the decompression library itself, > >>> along with the minimal ELF parsing that is required to copy the ELF > >>> segments in place, and the relocation processing that fixes up absolute > >>> symbol references to refer to the correct virtual addresses. > >>> > >>> Note that some of Evgeny's changes to clean up the PE/COFF header > >>> generation will still be needed, but I've omitted those here for > >>> brevity. > >> > >> I tried booting an SEV and an SEV-ES guest using this and both failed to boot: > >> > >> EFI stub: WARNING: Decompression failed: Out of memory while allocating > >> z_stream > >> > >> I'll have to take a closer look as to why, but it might be a couple of > >> days before I can get to it. > >> > > > > Thanks Tom. > > > > The internal malloc() seems to be failing, which is often caused by > > BSS clearing problems. Could you elaborate a little bit on the boot > > environment you are using here? > > I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my > host/hypervisor and guest kernel and the current OVMF tree built using > OvmfPkgX64.dsc. > > I was originally using the current merge window Linux, but moved to the > release version just to . With the release version SEV and SEV-ES still fail to > boot, but SEV actually #GPs now. And some of the register contents look > like encrypted data: > > ConvertPages: range 1000000 - 4FA1FFF covers multiple entries > !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - 00000000 !!!! > ExceptionData - 0000000000000000 > RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 > RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B > RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 > RSI - 0000000001000000, RDI - 1FBA02A45943DE68 > R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 > R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 > R14 - 0000000001000000, R15 - 000000007E0A5198 > DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 > GS - 0000000000000030, SS - 0000000000000030 > CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 > CR4 - 0000000000000668, CR8 - 0000000000000000 > DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 > DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 > GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 > IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 > FXSAVE_STATE - 000000007FD96F80 > !!!! Find image based on IP(0x597E71C1) /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable > RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! > > So, yes, probably an area of memory that was zeroes when mapped > unencrypted, but wasn't cleared after changing the mapping to > encrypted. > Thanks. It seems I was a bit naive and underestimated the amount of SEV related processing that goes on in the decompressor after the EFI stub has handed over. I will have to take some time and go through this, and decide whether there is a way we can share this code with the EFI stub without introducing yet another permutation that requires testing and maintenance. Any suggestions on how to test this stuff is appreciated - does QEMU emulate any of this? Does consumer-level AMD hardware implement the pieces I'd need to run a SEV host with SNP support etc?
On 5/2/23 11:08, Tom Lendacky wrote: > On 5/2/23 08:39, Ard Biesheuvel wrote: >> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: >>> >>> On 4/24/23 11:57, Ard Biesheuvel wrote: >>>> This series is conceptually a combination of Evgeny's series [0] and >>>> mine [1], both of which attempt to make the early decompressor code more >>>> amenable to executing in the EFI environment with stricter handling of >>>> memory permissions. >>>> >>>> My series [1] implemented zboot for x86, by getting rid of the entire >>>> x86 decompressor, and replacing it with existing EFI code that does the >>>> same but in a generic way. The downside of this is that only EFI boot is >>>> supported, making it unviable for distros, which need to support BIOS >>>> boot and hybrid EFI boot modes that omit the EFI stub. >>>> >>>> Evgeny's series [0] adapted the entire decompressor code flow to allow >>>> it to execute in the EFI context as well as the bare metal context, and >>>> this involves changes to the 1:1 mapping code and the page fault >>>> handlers etc, none of which are really needed when doing EFI boot in the >>>> first place. >>>> >>>> So this series attempts to occupy the middle ground here: it makes >>>> minimal changes to the existing decompressor so some of it can be called >>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress >>>> the kernel and boot it directly, without relying on the trampoline code, >>>> page table code or page fault handling code. This allows us to get rid >>>> of quite a bit of unsavory EFI stub code, and replace it with two clear >>>> invocations of the EFI firmware APIs to clear NX restrictions from >>>> allocations that have been populated with executable code. >>>> >>>> The only code that is being reused is the decompression library itself, >>>> along with the minimal ELF parsing that is required to copy the ELF >>>> segments in place, and the relocation processing that fixes up absolute >>>> symbol references to refer to the correct virtual addresses. >>>> >>>> Note that some of Evgeny's changes to clean up the PE/COFF header >>>> generation will still be needed, but I've omitted those here for >>>> brevity. >>> >>> I tried booting an SEV and an SEV-ES guest using this and both failed >>> to boot: >>> >>> EFI stub: WARNING: Decompression failed: Out of memory while allocating >>> z_stream >>> >>> I'll have to take a closer look as to why, but it might be a couple of >>> days before I can get to it. >>> >> >> Thanks Tom. >> >> The internal malloc() seems to be failing, which is often caused by >> BSS clearing problems. Could you elaborate a little bit on the boot >> environment you are using here? > > I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my > host/hypervisor and guest kernel and the current OVMF tree built using > OvmfPkgX64.dsc. > > I was originally using the current merge window Linux, but moved to the > release version just to . With the release version SEV and SEV-ES still > fail to > boot, but SEV actually #GPs now. And some of the register contents look > like encrypted data: > > ConvertPages: range 1000000 - 4FA1FFF covers multiple entries > !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - > 00000000 !!!! > ExceptionData - 0000000000000000 > RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 > RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B > RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 > RSI - 0000000001000000, RDI - 1FBA02A45943DE68 > R8  - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 > R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 > R14 - 0000000001000000, R15 - 000000007E0A5198 > DS  - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 > GS  - 0000000000000030, SS - 0000000000000030 > CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 > CR4 - 0000000000000668, CR8 - 0000000000000000 > DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 > DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 > GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 > IDTR - 000000007F34C018 0000000000000FFF,  TR - 0000000000000000 > FXSAVE_STATE - 000000007FD96F80 > !!!! Find image based on IP(0x597E71C1) > /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable > RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! > > So, yes, probably an area of memory that was zeroes when mapped > unencrypted, but wasn't cleared after changing the mapping to > encrypted. Yes, looks like a bss clearing issue. If I add __section(".data") to free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can boot an SEV guest. However, an SEV-ES guest is triple faulting. This looks to be because we're still on the EFI CS of 0x38 after switching GDTs in arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable) and things blow up on the iretq. Moving the block headed by the comment "Now switch to __KERNEL_CS so IRET works reliably" to just after calling startup_64_setup_env() fixes it and an SEV-ES guest can boot. This worked before because I believe we switched off the EFI CS as part of the kernel decompressor support and so this bug wasn't exposed. But this needs to be fixed regardless of this series. Thanks, Tom > > Thanks, > Tom > >
On Wed, 3 May 2023 at 19:58, Tom Lendacky <thomas.lendacky@amd.com> wrote: > > On 5/2/23 11:08, Tom Lendacky wrote: > > On 5/2/23 08:39, Ard Biesheuvel wrote: > >> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: > >>> > >>> On 4/24/23 11:57, Ard Biesheuvel wrote: > >>>> This series is conceptually a combination of Evgeny's series [0] and > >>>> mine [1], both of which attempt to make the early decompressor code more > >>>> amenable to executing in the EFI environment with stricter handling of > >>>> memory permissions. > >>>> > >>>> My series [1] implemented zboot for x86, by getting rid of the entire > >>>> x86 decompressor, and replacing it with existing EFI code that does the > >>>> same but in a generic way. The downside of this is that only EFI boot is > >>>> supported, making it unviable for distros, which need to support BIOS > >>>> boot and hybrid EFI boot modes that omit the EFI stub. > >>>> > >>>> Evgeny's series [0] adapted the entire decompressor code flow to allow > >>>> it to execute in the EFI context as well as the bare metal context, and > >>>> this involves changes to the 1:1 mapping code and the page fault > >>>> handlers etc, none of which are really needed when doing EFI boot in the > >>>> first place. > >>>> > >>>> So this series attempts to occupy the middle ground here: it makes > >>>> minimal changes to the existing decompressor so some of it can be called > >>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress > >>>> the kernel and boot it directly, without relying on the trampoline code, > >>>> page table code or page fault handling code. This allows us to get rid > >>>> of quite a bit of unsavory EFI stub code, and replace it with two clear > >>>> invocations of the EFI firmware APIs to clear NX restrictions from > >>>> allocations that have been populated with executable code. > >>>> > >>>> The only code that is being reused is the decompression library itself, > >>>> along with the minimal ELF parsing that is required to copy the ELF > >>>> segments in place, and the relocation processing that fixes up absolute > >>>> symbol references to refer to the correct virtual addresses. > >>>> > >>>> Note that some of Evgeny's changes to clean up the PE/COFF header > >>>> generation will still be needed, but I've omitted those here for > >>>> brevity. > >>> > >>> I tried booting an SEV and an SEV-ES guest using this and both failed > >>> to boot: > >>> > >>> EFI stub: WARNING: Decompression failed: Out of memory while allocating > >>> z_stream > >>> > >>> I'll have to take a closer look as to why, but it might be a couple of > >>> days before I can get to it. > >>> > >> > >> Thanks Tom. > >> > >> The internal malloc() seems to be failing, which is often caused by > >> BSS clearing problems. Could you elaborate a little bit on the boot > >> environment you are using here? > > > > I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my > > host/hypervisor and guest kernel and the current OVMF tree built using > > OvmfPkgX64.dsc. > > > > I was originally using the current merge window Linux, but moved to the > > release version just to . With the release version SEV and SEV-ES still > > fail to > > boot, but SEV actually #GPs now. And some of the register contents look > > like encrypted data: > > > > ConvertPages: range 1000000 - 4FA1FFF covers multiple entries > > !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - > > 00000000 !!!! > > ExceptionData - 0000000000000000 > > RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 > > RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B > > RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 > > RSI - 0000000001000000, RDI - 1FBA02A45943DE68 > > R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 > > R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 > > R14 - 0000000001000000, R15 - 000000007E0A5198 > > DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 > > GS - 0000000000000030, SS - 0000000000000030 > > CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 > > CR4 - 0000000000000668, CR8 - 0000000000000000 > > DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 > > DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 > > GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 > > IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 > > FXSAVE_STATE - 000000007FD96F80 > > !!!! Find image based on IP(0x597E71C1) > > /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable > > RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! > > > > So, yes, probably an area of memory that was zeroes when mapped > > unencrypted, but wasn't cleared after changing the mapping to > > encrypted. > > Yes, looks like a bss clearing issue. If I add __section(".data") to > free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and > to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can > boot an SEV guest. > > However, an SEV-ES guest is triple faulting. This looks to be because > we're still on the EFI CS of 0x38 after switching GDTs in > arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before > switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable) > and things blow up on the iretq. Moving the block headed by the comment > "Now switch to __KERNEL_CS so IRET works reliably" to just after calling > startup_64_setup_env() fixes it and an SEV-ES guest can boot. > > This worked before because I believe we switched off the EFI CS as part of > the kernel decompressor support and so this bug wasn't exposed. But this > needs to be fixed regardless of this series. > Very interesting. I was under the assumption that everything that goes on in sev_enable() in the decompressor would be rather indispensable to boot in SEV mode (which I only spotted today) so I am quite surprised that things just appear to work. (There is some 32-bit SEV code in the decompressor as well that obviously never gets called when booting in long mode, but I hadn't quite grasped how much other SEV code there actually is)
On Wed, May 03, 2023 at 08:17:28PM +0200, Ard Biesheuvel wrote: > (There is some 32-bit SEV code in the decompressor as well that > obviously never gets called when booting in long mode, but I hadn't > quite grasped how much other SEV code there actually is) If you mean the thing in startup_32 which does get_sev_encryption_bit() etc, that is going away.
On Wed, 3 May 2023 at 20:24, Borislav Petkov <bp@alien8.de> wrote: > > On Wed, May 03, 2023 at 08:17:28PM +0200, Ard Biesheuvel wrote: > > (There is some 32-bit SEV code in the decompressor as well that > > obviously never gets called when booting in long mode, but I hadn't > > quite grasped how much other SEV code there actually is) > > If you mean the thing in startup_32 which does get_sev_encryption_bit() > etc, that is going away. > OK, good to know.
On 5/3/23 13:17, Ard Biesheuvel wrote: > On Wed, 3 May 2023 at 19:58, Tom Lendacky <thomas.lendacky@amd.com> wrote: >> >> On 5/2/23 11:08, Tom Lendacky wrote: >>> On 5/2/23 08:39, Ard Biesheuvel wrote: >>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: >>>>> >>>>> On 4/24/23 11:57, Ard Biesheuvel wrote: >>>>>> This series is conceptually a combination of Evgeny's series [0] and >>>>>> mine [1], both of which attempt to make the early decompressor code more >>>>>> amenable to executing in the EFI environment with stricter handling of >>>>>> memory permissions. >>>>>> >>>>>> My series [1] implemented zboot for x86, by getting rid of the entire >>>>>> x86 decompressor, and replacing it with existing EFI code that does the >>>>>> same but in a generic way. The downside of this is that only EFI boot is >>>>>> supported, making it unviable for distros, which need to support BIOS >>>>>> boot and hybrid EFI boot modes that omit the EFI stub. >>>>>> >>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow >>>>>> it to execute in the EFI context as well as the bare metal context, and >>>>>> this involves changes to the 1:1 mapping code and the page fault >>>>>> handlers etc, none of which are really needed when doing EFI boot in the >>>>>> first place. >>>>>> >>>>>> So this series attempts to occupy the middle ground here: it makes >>>>>> minimal changes to the existing decompressor so some of it can be called >>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress >>>>>> the kernel and boot it directly, without relying on the trampoline code, >>>>>> page table code or page fault handling code. This allows us to get rid >>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear >>>>>> invocations of the EFI firmware APIs to clear NX restrictions from >>>>>> allocations that have been populated with executable code. >>>>>> >>>>>> The only code that is being reused is the decompression library itself, >>>>>> along with the minimal ELF parsing that is required to copy the ELF >>>>>> segments in place, and the relocation processing that fixes up absolute >>>>>> symbol references to refer to the correct virtual addresses. >>>>>> >>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header >>>>>> generation will still be needed, but I've omitted those here for >>>>>> brevity. >>>>> >>>>> I tried booting an SEV and an SEV-ES guest using this and both failed >>>>> to boot: >>>>> >>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating >>>>> z_stream >>>>> >>>>> I'll have to take a closer look as to why, but it might be a couple of >>>>> days before I can get to it. >>>>> >>>> >>>> Thanks Tom. >>>> >>>> The internal malloc() seems to be failing, which is often caused by >>>> BSS clearing problems. Could you elaborate a little bit on the boot >>>> environment you are using here? >>> >>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my >>> host/hypervisor and guest kernel and the current OVMF tree built using >>> OvmfPkgX64.dsc. >>> >>> I was originally using the current merge window Linux, but moved to the >>> release version just to . With the release version SEV and SEV-ES still >>> fail to >>> boot, but SEV actually #GPs now. And some of the register contents look >>> like encrypted data: >>> >>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries >>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - >>> 00000000 !!!! >>> ExceptionData - 0000000000000000 >>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 >>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B >>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 >>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68 >>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 >>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 >>> R14 - 0000000001000000, R15 - 000000007E0A5198 >>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 >>> GS - 0000000000000030, SS - 0000000000000030 >>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 >>> CR4 - 0000000000000668, CR8 - 0000000000000000 >>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 >>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 >>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 >>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 >>> FXSAVE_STATE - 000000007FD96F80 >>> !!!! Find image based on IP(0x597E71C1) >>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable >>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! >>> >>> So, yes, probably an area of memory that was zeroes when mapped >>> unencrypted, but wasn't cleared after changing the mapping to >>> encrypted. >> >> Yes, looks like a bss clearing issue. If I add __section(".data") to >> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and >> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can >> boot an SEV guest. >> >> However, an SEV-ES guest is triple faulting. This looks to be because >> we're still on the EFI CS of 0x38 after switching GDTs in >> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before >> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable) >> and things blow up on the iretq. Moving the block headed by the comment >> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling >> startup_64_setup_env() fixes it and an SEV-ES guest can boot. >> >> This worked before because I believe we switched off the EFI CS as part of >> the kernel decompressor support and so this bug wasn't exposed. But this >> needs to be fixed regardless of this series. >> > > Very interesting. I was under the assumption that everything that goes > on in sev_enable() in the decompressor would be rather indispensable > to boot in SEV mode (which I only spotted today) so I am quite > surprised that things just appear to work. (There is some 32-bit SEV > code in the decompressor as well that obviously never gets called when > booting in long mode, but I hadn't quite grasped how much other SEV > code there actually is) The sev_enable() function for SEV and SEV-ES is more for ensuring a proper environment (ensuring the proper CPUID bits are present for the guest, checking the GHCB protocol level, properly preparing pagetables, etc.). Since we're still in EFI and using the EFI #VC handler and pagetables, we don't require some of that stuff, but some of the it would need to be performed at some point (I wasn't trying to be a malicious hypervisor). I haven't tested SEV-SNP yet because the hypervisor support is not upstream, yet, and I haven't applied your series to an SNP hypervisor tree, yet. There is a lot of support in sev_enable() for SNP that may likely cause a problem compared to SEV and SEV-ES. SEV-SNP has a lot more checking and validation going on and it ensures it gets the confidential computing blob from EFI into the boot parameters. I'm not sure when I'll be able to test SNP, since I'm off next week and trying to wrap up a bunch of stuff before I leave. Thanks, Tom
On 5/3/23 12:44, Ard Biesheuvel wrote: > On Tue, 2 May 2023 at 18:08, Tom Lendacky <thomas.lendacky@amd.com> wrote: >> >> On 5/2/23 08:39, Ard Biesheuvel wrote: >>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: >>>> >>>> On 4/24/23 11:57, Ard Biesheuvel wrote: >>>>> This series is conceptually a combination of Evgeny's series [0] and >>>>> mine [1], both of which attempt to make the early decompressor code more >>>>> amenable to executing in the EFI environment with stricter handling of >>>>> memory permissions. >>>>> >>>>> My series [1] implemented zboot for x86, by getting rid of the entire >>>>> x86 decompressor, and replacing it with existing EFI code that does the >>>>> same but in a generic way. The downside of this is that only EFI boot is >>>>> supported, making it unviable for distros, which need to support BIOS >>>>> boot and hybrid EFI boot modes that omit the EFI stub. >>>>> >>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow >>>>> it to execute in the EFI context as well as the bare metal context, and >>>>> this involves changes to the 1:1 mapping code and the page fault >>>>> handlers etc, none of which are really needed when doing EFI boot in the >>>>> first place. >>>>> >>>>> So this series attempts to occupy the middle ground here: it makes >>>>> minimal changes to the existing decompressor so some of it can be called >>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress >>>>> the kernel and boot it directly, without relying on the trampoline code, >>>>> page table code or page fault handling code. This allows us to get rid >>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear >>>>> invocations of the EFI firmware APIs to clear NX restrictions from >>>>> allocations that have been populated with executable code. >>>>> >>>>> The only code that is being reused is the decompression library itself, >>>>> along with the minimal ELF parsing that is required to copy the ELF >>>>> segments in place, and the relocation processing that fixes up absolute >>>>> symbol references to refer to the correct virtual addresses. >>>>> >>>>> Note that some of Evgeny's changes to clean up the PE/COFF header >>>>> generation will still be needed, but I've omitted those here for >>>>> brevity. >>>> >>>> I tried booting an SEV and an SEV-ES guest using this and both failed to boot: >>>> >>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating >>>> z_stream >>>> >>>> I'll have to take a closer look as to why, but it might be a couple of >>>> days before I can get to it. >>>> >>> >>> Thanks Tom. >>> >>> The internal malloc() seems to be failing, which is often caused by >>> BSS clearing problems. Could you elaborate a little bit on the boot >>> environment you are using here? >> >> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my >> host/hypervisor and guest kernel and the current OVMF tree built using >> OvmfPkgX64.dsc. >> >> I was originally using the current merge window Linux, but moved to the >> release version just to . With the release version SEV and SEV-ES still fail to >> boot, but SEV actually #GPs now. And some of the register contents look >> like encrypted data: >> >> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries >> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - 00000000 !!!! >> ExceptionData - 0000000000000000 >> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 >> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B >> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 >> RSI - 0000000001000000, RDI - 1FBA02A45943DE68 >> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 >> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 >> R14 - 0000000001000000, R15 - 000000007E0A5198 >> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 >> GS - 0000000000000030, SS - 0000000000000030 >> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 >> CR4 - 0000000000000668, CR8 - 0000000000000000 >> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 >> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 >> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 >> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 >> FXSAVE_STATE - 000000007FD96F80 >> !!!! Find image based on IP(0x597E71C1) /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable >> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! >> >> So, yes, probably an area of memory that was zeroes when mapped >> unencrypted, but wasn't cleared after changing the mapping to >> encrypted. >> > > Thanks. > > It seems I was a bit naive and underestimated the amount of SEV > related processing that goes on in the decompressor after the EFI stub > has handed over. I will have to take some time and go through this, > and decide whether there is a way we can share this code with the EFI > stub without introducing yet another permutation that requires testing > and maintenance. > > Any suggestions on how to test this stuff is appreciated - does QEMU > emulate any of this? Does consumer-level AMD hardware implement the > pieces I'd need to run a SEV host with SNP support etc? Unfortunately Qemu doesn't emulate any of this and SEV support, specifically the SEV firmware, isn't available on consumer-level parts, so you would need an EPYC part to do SEV. Thanks, Tom
On Wed, 3 May 2023 at 20:49, Tom Lendacky <thomas.lendacky@amd.com> wrote: > > On 5/3/23 13:17, Ard Biesheuvel wrote: > > On Wed, 3 May 2023 at 19:58, Tom Lendacky <thomas.lendacky@amd.com> wrote: > >> > >> On 5/2/23 11:08, Tom Lendacky wrote: > >>> On 5/2/23 08:39, Ard Biesheuvel wrote: > >>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: > >>>>> > >>>>> On 4/24/23 11:57, Ard Biesheuvel wrote: > >>>>>> This series is conceptually a combination of Evgeny's series [0] and > >>>>>> mine [1], both of which attempt to make the early decompressor code more > >>>>>> amenable to executing in the EFI environment with stricter handling of > >>>>>> memory permissions. > >>>>>> > >>>>>> My series [1] implemented zboot for x86, by getting rid of the entire > >>>>>> x86 decompressor, and replacing it with existing EFI code that does the > >>>>>> same but in a generic way. The downside of this is that only EFI boot is > >>>>>> supported, making it unviable for distros, which need to support BIOS > >>>>>> boot and hybrid EFI boot modes that omit the EFI stub. > >>>>>> > >>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow > >>>>>> it to execute in the EFI context as well as the bare metal context, and > >>>>>> this involves changes to the 1:1 mapping code and the page fault > >>>>>> handlers etc, none of which are really needed when doing EFI boot in the > >>>>>> first place. > >>>>>> > >>>>>> So this series attempts to occupy the middle ground here: it makes > >>>>>> minimal changes to the existing decompressor so some of it can be called > >>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress > >>>>>> the kernel and boot it directly, without relying on the trampoline code, > >>>>>> page table code or page fault handling code. This allows us to get rid > >>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear > >>>>>> invocations of the EFI firmware APIs to clear NX restrictions from > >>>>>> allocations that have been populated with executable code. > >>>>>> > >>>>>> The only code that is being reused is the decompression library itself, > >>>>>> along with the minimal ELF parsing that is required to copy the ELF > >>>>>> segments in place, and the relocation processing that fixes up absolute > >>>>>> symbol references to refer to the correct virtual addresses. > >>>>>> > >>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header > >>>>>> generation will still be needed, but I've omitted those here for > >>>>>> brevity. > >>>>> > >>>>> I tried booting an SEV and an SEV-ES guest using this and both failed > >>>>> to boot: > >>>>> > >>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating > >>>>> z_stream > >>>>> > >>>>> I'll have to take a closer look as to why, but it might be a couple of > >>>>> days before I can get to it. > >>>>> > >>>> > >>>> Thanks Tom. > >>>> > >>>> The internal malloc() seems to be failing, which is often caused by > >>>> BSS clearing problems. Could you elaborate a little bit on the boot > >>>> environment you are using here? > >>> > >>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my > >>> host/hypervisor and guest kernel and the current OVMF tree built using > >>> OvmfPkgX64.dsc. > >>> > >>> I was originally using the current merge window Linux, but moved to the > >>> release version just to . With the release version SEV and SEV-ES still > >>> fail to > >>> boot, but SEV actually #GPs now. And some of the register contents look > >>> like encrypted data: > >>> > >>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries > >>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - > >>> 00000000 !!!! > >>> ExceptionData - 0000000000000000 > >>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 > >>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B > >>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 > >>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68 > >>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 > >>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 > >>> R14 - 0000000001000000, R15 - 000000007E0A5198 > >>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 > >>> GS - 0000000000000030, SS - 0000000000000030 > >>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 > >>> CR4 - 0000000000000668, CR8 - 0000000000000000 > >>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 > >>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 > >>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 > >>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 > >>> FXSAVE_STATE - 000000007FD96F80 > >>> !!!! Find image based on IP(0x597E71C1) > >>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable > >>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! > >>> > >>> So, yes, probably an area of memory that was zeroes when mapped > >>> unencrypted, but wasn't cleared after changing the mapping to > >>> encrypted. > >> > >> Yes, looks like a bss clearing issue. If I add __section(".data") to > >> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and > >> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can > >> boot an SEV guest. > >> > >> However, an SEV-ES guest is triple faulting. This looks to be because > >> we're still on the EFI CS of 0x38 after switching GDTs in > >> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before > >> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable) > >> and things blow up on the iretq. Moving the block headed by the comment > >> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling > >> startup_64_setup_env() fixes it and an SEV-ES guest can boot. > >> > >> This worked before because I believe we switched off the EFI CS as part of > >> the kernel decompressor support and so this bug wasn't exposed. But this > >> needs to be fixed regardless of this series. > >> > > > > Very interesting. I was under the assumption that everything that goes > > on in sev_enable() in the decompressor would be rather indispensable > > to boot in SEV mode (which I only spotted today) so I am quite > > surprised that things just appear to work. (There is some 32-bit SEV > > code in the decompressor as well that obviously never gets called when > > booting in long mode, but I hadn't quite grasped how much other SEV > > code there actually is) > > The sev_enable() function for SEV and SEV-ES is more for ensuring a proper > environment (ensuring the proper CPUID bits are present for the guest, > checking the GHCB protocol level, properly preparing pagetables, etc.). > Since we're still in EFI and using the EFI #VC handler and pagetables, we > don't require some of that stuff, but some of the it would need to be > performed at some point (I wasn't trying to be a malicious hypervisor). > OK, good to know. > I haven't tested SEV-SNP yet because the hypervisor support is not > upstream, yet, and I haven't applied your series to an SNP hypervisor > tree, yet. There is a lot of support in sev_enable() for SNP that may > likely cause a problem compared to SEV and SEV-ES. > > SEV-SNP has a lot more checking and validation going on and it ensures it > gets the confidential computing blob from EFI into the boot parameters. > I'm not sure when I'll be able to test SNP, since I'm off next week and > trying to wrap up a bunch of stuff before I leave. > Would it make sense, given the apparent direct dependency on EFI, to move that stuff into the EFI stub instead of doing it from the decompressor? Thanks for giving this a spin - I can wait for more testing until you get back, and hopefully, we'll have converged a bit more by that time and there will be a new revision of the series available for testing. BTW any clue whether your boot path is relying on the EFI handover protocol? I.e., QemuStartLegacyImage() in OVMF? Evgeny mentioned that BSS needs clearing in that case, and I was wondering if that is the case you are hitting. (If you have a DEBUG OVMF boot log you could share, I can go through it myself)
On 5/3/23 13:59, Ard Biesheuvel wrote: > On Wed, 3 May 2023 at 20:49, Tom Lendacky <thomas.lendacky@amd.com> wrote: >> >> On 5/3/23 13:17, Ard Biesheuvel wrote: >>> On Wed, 3 May 2023 at 19:58, Tom Lendacky <thomas.lendacky@amd.com> wrote: >>>> >>>> On 5/2/23 11:08, Tom Lendacky wrote: >>>>> On 5/2/23 08:39, Ard Biesheuvel wrote: >>>>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: >>>>>>> >>>>>>> On 4/24/23 11:57, Ard Biesheuvel wrote: >>>>>>>> This series is conceptually a combination of Evgeny's series [0] and >>>>>>>> mine [1], both of which attempt to make the early decompressor code more >>>>>>>> amenable to executing in the EFI environment with stricter handling of >>>>>>>> memory permissions. >>>>>>>> >>>>>>>> My series [1] implemented zboot for x86, by getting rid of the entire >>>>>>>> x86 decompressor, and replacing it with existing EFI code that does the >>>>>>>> same but in a generic way. The downside of this is that only EFI boot is >>>>>>>> supported, making it unviable for distros, which need to support BIOS >>>>>>>> boot and hybrid EFI boot modes that omit the EFI stub. >>>>>>>> >>>>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow >>>>>>>> it to execute in the EFI context as well as the bare metal context, and >>>>>>>> this involves changes to the 1:1 mapping code and the page fault >>>>>>>> handlers etc, none of which are really needed when doing EFI boot in the >>>>>>>> first place. >>>>>>>> >>>>>>>> So this series attempts to occupy the middle ground here: it makes >>>>>>>> minimal changes to the existing decompressor so some of it can be called >>>>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress >>>>>>>> the kernel and boot it directly, without relying on the trampoline code, >>>>>>>> page table code or page fault handling code. This allows us to get rid >>>>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear >>>>>>>> invocations of the EFI firmware APIs to clear NX restrictions from >>>>>>>> allocations that have been populated with executable code. >>>>>>>> >>>>>>>> The only code that is being reused is the decompression library itself, >>>>>>>> along with the minimal ELF parsing that is required to copy the ELF >>>>>>>> segments in place, and the relocation processing that fixes up absolute >>>>>>>> symbol references to refer to the correct virtual addresses. >>>>>>>> >>>>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header >>>>>>>> generation will still be needed, but I've omitted those here for >>>>>>>> brevity. >>>>>>> >>>>>>> I tried booting an SEV and an SEV-ES guest using this and both failed >>>>>>> to boot: >>>>>>> >>>>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating >>>>>>> z_stream >>>>>>> >>>>>>> I'll have to take a closer look as to why, but it might be a couple of >>>>>>> days before I can get to it. >>>>>>> >>>>>> >>>>>> Thanks Tom. >>>>>> >>>>>> The internal malloc() seems to be failing, which is often caused by >>>>>> BSS clearing problems. Could you elaborate a little bit on the boot >>>>>> environment you are using here? >>>>> >>>>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my >>>>> host/hypervisor and guest kernel and the current OVMF tree built using >>>>> OvmfPkgX64.dsc. >>>>> >>>>> I was originally using the current merge window Linux, but moved to the >>>>> release version just to . With the release version SEV and SEV-ES still >>>>> fail to >>>>> boot, but SEV actually #GPs now. And some of the register contents look >>>>> like encrypted data: >>>>> >>>>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries >>>>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - >>>>> 00000000 !!!! >>>>> ExceptionData - 0000000000000000 >>>>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 >>>>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B >>>>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 >>>>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68 >>>>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 >>>>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 >>>>> R14 - 0000000001000000, R15 - 000000007E0A5198 >>>>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 >>>>> GS - 0000000000000030, SS - 0000000000000030 >>>>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 >>>>> CR4 - 0000000000000668, CR8 - 0000000000000000 >>>>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 >>>>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 >>>>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 >>>>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 >>>>> FXSAVE_STATE - 000000007FD96F80 >>>>> !!!! Find image based on IP(0x597E71C1) >>>>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable >>>>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! >>>>> >>>>> So, yes, probably an area of memory that was zeroes when mapped >>>>> unencrypted, but wasn't cleared after changing the mapping to >>>>> encrypted. >>>> >>>> Yes, looks like a bss clearing issue. If I add __section(".data") to >>>> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and >>>> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can >>>> boot an SEV guest. >>>> >>>> However, an SEV-ES guest is triple faulting. This looks to be because >>>> we're still on the EFI CS of 0x38 after switching GDTs in >>>> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before >>>> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable) >>>> and things blow up on the iretq. Moving the block headed by the comment >>>> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling >>>> startup_64_setup_env() fixes it and an SEV-ES guest can boot. >>>> >>>> This worked before because I believe we switched off the EFI CS as part of >>>> the kernel decompressor support and so this bug wasn't exposed. But this >>>> needs to be fixed regardless of this series. >>>> >>> >>> Very interesting. I was under the assumption that everything that goes >>> on in sev_enable() in the decompressor would be rather indispensable >>> to boot in SEV mode (which I only spotted today) so I am quite >>> surprised that things just appear to work. (There is some 32-bit SEV >>> code in the decompressor as well that obviously never gets called when >>> booting in long mode, but I hadn't quite grasped how much other SEV >>> code there actually is) >> >> The sev_enable() function for SEV and SEV-ES is more for ensuring a proper >> environment (ensuring the proper CPUID bits are present for the guest, >> checking the GHCB protocol level, properly preparing pagetables, etc.). >> Since we're still in EFI and using the EFI #VC handler and pagetables, we >> don't require some of that stuff, but some of the it would need to be >> performed at some point (I wasn't trying to be a malicious hypervisor). >> > > OK, good to know. > >> I haven't tested SEV-SNP yet because the hypervisor support is not >> upstream, yet, and I haven't applied your series to an SNP hypervisor >> tree, yet. There is a lot of support in sev_enable() for SNP that may >> likely cause a problem compared to SEV and SEV-ES. >> >> SEV-SNP has a lot more checking and validation going on and it ensures it >> gets the confidential computing blob from EFI into the boot parameters. >> I'm not sure when I'll be able to test SNP, since I'm off next week and >> trying to wrap up a bunch of stuff before I leave. >> > > Would it make sense, given the apparent direct dependency on EFI, to > move that stuff into the EFI stub instead of doing it from the > decompressor? We tried to make the support for SNP not be reliant on EFI. If EFI is present, then get the confidential computing blob from it. Otherwise, the cc blob can be supplied as part of the setup data in the Linux boot protocol (see find_cc_block() in arch/x86/boot/compressed/sev.c). So I guess some of it could be moved, but I'm not sure how much. > > Thanks for giving this a spin - I can wait for more testing until you > get back, and hopefully, we'll have converged a bit more by that time > and there will be a new revision of the series available for testing. > > BTW any clue whether your boot path is relying on the EFI handover > protocol? I.e., QemuStartLegacyImage() in OVMF? Evgeny mentioned that I don't believe it is. It's using the Grub EFI bootloader to load and boot Linux. > BSS needs clearing in that case, and I was wondering if that is the > case you are hitting. > (If you have a DEBUG OVMF boot log you could share, I can go through it myself) Sure, I'll send you a copy in a separate email. Thanks, Tom
On Wed, 3 May 2023 at 23:23, Tom Lendacky <thomas.lendacky@amd.com> wrote: > > On 5/3/23 13:59, Ard Biesheuvel wrote: > > On Wed, 3 May 2023 at 20:49, Tom Lendacky <thomas.lendacky@amd.com> wrote: > >> > >> On 5/3/23 13:17, Ard Biesheuvel wrote: > >>> On Wed, 3 May 2023 at 19:58, Tom Lendacky <thomas.lendacky@amd.com> wrote: > >>>> > >>>> On 5/2/23 11:08, Tom Lendacky wrote: > >>>>> On 5/2/23 08:39, Ard Biesheuvel wrote: > >>>>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@amd.com> wrote: > >>>>>>> > >>>>>>> On 4/24/23 11:57, Ard Biesheuvel wrote: > >>>>>>>> This series is conceptually a combination of Evgeny's series [0] and > >>>>>>>> mine [1], both of which attempt to make the early decompressor code more > >>>>>>>> amenable to executing in the EFI environment with stricter handling of > >>>>>>>> memory permissions. > >>>>>>>> > >>>>>>>> My series [1] implemented zboot for x86, by getting rid of the entire > >>>>>>>> x86 decompressor, and replacing it with existing EFI code that does the > >>>>>>>> same but in a generic way. The downside of this is that only EFI boot is > >>>>>>>> supported, making it unviable for distros, which need to support BIOS > >>>>>>>> boot and hybrid EFI boot modes that omit the EFI stub. > >>>>>>>> > >>>>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow > >>>>>>>> it to execute in the EFI context as well as the bare metal context, and > >>>>>>>> this involves changes to the 1:1 mapping code and the page fault > >>>>>>>> handlers etc, none of which are really needed when doing EFI boot in the > >>>>>>>> first place. > >>>>>>>> > >>>>>>>> So this series attempts to occupy the middle ground here: it makes > >>>>>>>> minimal changes to the existing decompressor so some of it can be called > >>>>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress > >>>>>>>> the kernel and boot it directly, without relying on the trampoline code, > >>>>>>>> page table code or page fault handling code. This allows us to get rid > >>>>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear > >>>>>>>> invocations of the EFI firmware APIs to clear NX restrictions from > >>>>>>>> allocations that have been populated with executable code. > >>>>>>>> > >>>>>>>> The only code that is being reused is the decompression library itself, > >>>>>>>> along with the minimal ELF parsing that is required to copy the ELF > >>>>>>>> segments in place, and the relocation processing that fixes up absolute > >>>>>>>> symbol references to refer to the correct virtual addresses. > >>>>>>>> > >>>>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header > >>>>>>>> generation will still be needed, but I've omitted those here for > >>>>>>>> brevity. > >>>>>>> > >>>>>>> I tried booting an SEV and an SEV-ES guest using this and both failed > >>>>>>> to boot: > >>>>>>> > >>>>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating > >>>>>>> z_stream > >>>>>>> > >>>>>>> I'll have to take a closer look as to why, but it might be a couple of > >>>>>>> days before I can get to it. > >>>>>>> > >>>>>> > >>>>>> Thanks Tom. > >>>>>> > >>>>>> The internal malloc() seems to be failing, which is often caused by > >>>>>> BSS clearing problems. Could you elaborate a little bit on the boot > >>>>>> environment you are using here? > >>>>> > >>>>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my > >>>>> host/hypervisor and guest kernel and the current OVMF tree built using > >>>>> OvmfPkgX64.dsc. > >>>>> > >>>>> I was originally using the current merge window Linux, but moved to the > >>>>> release version just to . With the release version SEV and SEV-ES still > >>>>> fail to > >>>>> boot, but SEV actually #GPs now. And some of the register contents look > >>>>> like encrypted data: > >>>>> > >>>>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries > >>>>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - > >>>>> 00000000 !!!! > >>>>> ExceptionData - 0000000000000000 > >>>>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206 > >>>>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B > >>>>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000 > >>>>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68 > >>>>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018 > >>>>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98 > >>>>> R14 - 0000000001000000, R15 - 000000007E0A5198 > >>>>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030 > >>>>> GS - 0000000000000030, SS - 0000000000000030 > >>>>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000 > >>>>> CR4 - 0000000000000668, CR8 - 0000000000000000 > >>>>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000 > >>>>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400 > >>>>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000 > >>>>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000 > >>>>> FXSAVE_STATE - 000000007FD96F80 > >>>>> !!!! Find image based on IP(0x597E71C1) > >>>>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable > >>>>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!! > >>>>> > >>>>> So, yes, probably an area of memory that was zeroes when mapped > >>>>> unencrypted, but wasn't cleared after changing the mapping to > >>>>> encrypted. > >>>> > >>>> Yes, looks like a bss clearing issue. If I add __section(".data") to > >>>> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and > >>>> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can > >>>> boot an SEV guest. > >>>> > >>>> However, an SEV-ES guest is triple faulting. This looks to be because > >>>> we're still on the EFI CS of 0x38 after switching GDTs in > >>>> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before > >>>> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable) > >>>> and things blow up on the iretq. Moving the block headed by the comment > >>>> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling > >>>> startup_64_setup_env() fixes it and an SEV-ES guest can boot. > >>>> > >>>> This worked before because I believe we switched off the EFI CS as part of > >>>> the kernel decompressor support and so this bug wasn't exposed. But this > >>>> needs to be fixed regardless of this series. > >>>> > >>> > >>> Very interesting. I was under the assumption that everything that goes > >>> on in sev_enable() in the decompressor would be rather indispensable > >>> to boot in SEV mode (which I only spotted today) so I am quite > >>> surprised that things just appear to work. (There is some 32-bit SEV > >>> code in the decompressor as well that obviously never gets called when > >>> booting in long mode, but I hadn't quite grasped how much other SEV > >>> code there actually is) > >> > >> The sev_enable() function for SEV and SEV-ES is more for ensuring a proper > >> environment (ensuring the proper CPUID bits are present for the guest, > >> checking the GHCB protocol level, properly preparing pagetables, etc.). > >> Since we're still in EFI and using the EFI #VC handler and pagetables, we > >> don't require some of that stuff, but some of the it would need to be > >> performed at some point (I wasn't trying to be a malicious hypervisor). > >> > > > > OK, good to know. > > > >> I haven't tested SEV-SNP yet because the hypervisor support is not > >> upstream, yet, and I haven't applied your series to an SNP hypervisor > >> tree, yet. There is a lot of support in sev_enable() for SNP that may > >> likely cause a problem compared to SEV and SEV-ES. > >> > >> SEV-SNP has a lot more checking and validation going on and it ensures it > >> gets the confidential computing blob from EFI into the boot parameters. > >> I'm not sure when I'll be able to test SNP, since I'm off next week and > >> trying to wrap up a bunch of stuff before I leave. > >> > > > > Would it make sense, given the apparent direct dependency on EFI, to > > move that stuff into the EFI stub instead of doing it from the > > decompressor? > > We tried to make the support for SNP not be reliant on EFI. If EFI is > present, then get the confidential computing blob from it. Otherwise, the > cc blob can be supplied as part of the setup data in the Linux boot > protocol (see find_cc_block() in arch/x86/boot/compressed/sev.c). > > So I guess some of it could be moved, but I'm not sure how much. > Fair enough. > > > > Thanks for giving this a spin - I can wait for more testing until you > > get back, and hopefully, we'll have converged a bit more by that time > > and there will be a new revision of the series available for testing. > > > > BTW any clue whether your boot path is relying on the EFI handover > > protocol? I.e., QemuStartLegacyImage() in OVMF? Evgeny mentioned that > > I don't believe it is. It's using the Grub EFI bootloader to load and boot > Linux. > Looking at the log you just sent (thanks), it appears you are using distro shim+grub, which implement their own rough approximation of PE file loading, and typically use the EFI handover protocol. This is good news, because we already know how broken shim and GRUB are, and I was concerned that there might be another issue to track down here. I'll implement Evgeny's suggestion to clear BSS explicitly in the EFI handover protocol entrypoints for the next revision of the series. Thanks,