Message ID | 20230914123001.27659-1-kirill.shutemov@linux.intel.com |
---|---|
State | New |
Headers | From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> To: Thomas Gleixner <tglx@linutronix.de>, Dave Hansen <dave.hansen@intel.com>, Borislav Petkov <bp@alien8.de> Cc: Ard Biesheuvel <ardb@google.com>, Kees Cook <keescook@chromium.org>, Aaron Lu <aaron.lu@intel.com>, Bagas Sanjaya <bagasdotme@gmail.com>, Tom Lendacky <thomas.lendacky@amd.com>, x86@kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org, regressions@lists.linux.de, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Subject: [PATCH] x86/boot/compressed: Reserve more memory for page tables Date: Thu, 14 Sep 2023 15:30:01 +0300 Message-ID: <20230914123001.27659-1-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.41.0 List-ID: <linux-kernel.vger.kernel.org> |
Series |
x86/boot/compressed: Reserve more memory for page tables
|
|
Commit Message
Kirill A. Shutemov
Sept. 14, 2023, 12:30 p.m. UTC
The decompressor has a hard limit on the number of page tables it can
allocate. This limit is defined at compile-time and will cause boot
failure if it is reached.
The kernel is very strict and calculates the limit precisely for the
worst-case scenario based on the current configuration. However, it is
easy to forget to adjust the limit when a new use-case arises. The
worst-case scenario is rarely encountered during sanity checks.
In the case of enabling 5-level paging, a use-case was overlooked. The
limit needs to be increased by one to accommodate the additional level.
This oversight went unnoticed until Aaron attempted to run the kernel
via kexec with 5-level paging and unaccepted memory enabled.
To address this issue, let's allocate some extra space for page tables.
128K should be sufficient for any use-case. The logic can be simplified
by using a single value for all kernel configurations.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Aaron Lu <aaron.lu@intel.com>
Fixes: 34bbb0009f3b ("x86/boot/compressed: Enable 5-level paging during decompression stage")
---
arch/x86/include/asm/boot.h | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)
Comments
On 9/14/23 05:30, Kirill A. Shutemov wrote:
> +/*
> + * Total number of page table kernel_add_identity_map() can allocate,
> + * including page tables consumed by startup_32().
> + */
> +# define BOOT_PGT_SIZE (32*4096)

I agree that needing to know this in advance *exactly* is troublesome.

But I do think that we should preserve the comment about the worst-case
scenario.

Also, I thought this was triggered by unaccepted memory. Am I remembering
it wrong? How was it in play?

Either way, I think your general approach here is sound. But let's add
one little tweak to at least warn when we're getting close to the limit.

Now that nobody has to worry about the limit for the immediate future
it's a guarantee that in the long term someone will plow through it
accidentally. Let's add a soft warning when we're nearing the limit so
that there's a chance to catch these things in the future.
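The soft warning suggested above could look roughly like the following. This is only a user-space sketch of the idea, not the decompressor's allocator: `struct pgt_pool`, `pgt_pool_alloc()` and the `PGT_WARN_PAGES` threshold are hypothetical names chosen for illustration, and only `BOOT_PGT_SIZE` (32*4096) is taken from the patch.

```c
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE      4096UL
#define BOOT_PGT_SIZE  (32 * PAGE_SIZE)  /* value proposed in the patch */
#define PGT_WARN_PAGES 4UL               /* hypothetical soft threshold */

/* Hypothetical bump allocator state for the page-table pool. */
struct pgt_pool {
	size_t size;    /* total bytes in the pool */
	size_t offset;  /* bytes already handed out */
	int warned;     /* non-zero once the soft warning has fired */
};

/*
 * Hand out one page-table page. Returns the byte offset of the page
 * inside the pool, or -1 once the hard limit is reached. Warns once
 * when fewer than PGT_WARN_PAGES pages remain after the allocation.
 */
static long pgt_pool_alloc(struct pgt_pool *pool)
{
	size_t left;
	long ret;

	if (pool->offset + PAGE_SIZE > pool->size)
		return -1;  /* hard limit: nothing left to allocate */

	ret = (long)pool->offset;
	pool->offset += PAGE_SIZE;

	left = (pool->size - pool->offset) / PAGE_SIZE;
	if (!pool->warned && left < PGT_WARN_PAGES) {
		pool->warned = 1;
		fprintf(stderr, "page table pool nearly exhausted: %zu pages left\n", left);
	}

	return ret;
}
```

With a pool initialised as `{ BOOT_PGT_SIZE, 0, 0 }`, the first 28 allocations stay silent, the 29th triggers the one-time warning, and the 33rd hits the hard limit. Where the threshold should sit is a judgment call that the thread leaves open.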
On Thu, Sep 14, 2023 at 08:51:50AM -0700, Dave Hansen wrote:
> On 9/14/23 05:30, Kirill A. Shutemov wrote:
> > +/*
> > + * Total number of page table kernel_add_identity_map() can allocate,
> > + * including page tables consumed by startup_32().
> > + */
> > +# define BOOT_PGT_SIZE (32*4096)
>
> I agree that needing to know this in advance *exactly* is troublesome.
>
> But I do think that we should preserve the comment about the worst-case
> scenario.

Want me to send v2 for that?

> Also, I thought this was triggered by unaccepted memory. Am
> I remembering it wrong? How was it in play?

Unaccepted memory touched EFI system table. I was able to reproduce
without unaccepted memory enabled: if get_rsdp_addr() takes
efi_get_rsdp_addr() path. So it is not the root cause, just a trigger.

So we need several things to run into the problem:

 - System supports 5-level paging and it is enabled;

 - Decompressor takes control in 64-bit mode, so it uses page tables
   inherited from bootloader until initialize_identity_maps().

   In initialize_identity_maps() kernel resets page tables, rebuilding them
   from scratch. Here we only map what is definitely required: kernel,
   cmdline, boot_params, setup_data.

   Entering in 32-bit mode would make startup_32() map the first 4G
   unconditionally, but in this setup we rely more on #PF to fill page
   table. It masks problem as we rarely need all four PMD tables.

 - Make kernel touch at least one page per-gigabyte in the first 4G.

In our case, unaccepted memory path was the last straw: it triggered
allocation of the fourth PMD table which failed.

We can increase the constant by one and it will work as long as nobody
needs anything beyond the first 4G (or any 1G-aligned 4G region where we've
got loaded, I guess). I am not sure we can guarantee this with
(potentially buggy) ACPI and EFI in the picture.

> Either way, I think your general approach here is sound. But let's add
> one little tweak to at least warn when we're getting close to the limit.

Yeah, makes sense.
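The "one page per gigabyte" condition can be illustrated with a simplified count of the table pages an identity mapping needs. The sketch below is a model written for this note, not decompressor code: it assumes 2M leaf mappings (so no PTE tables), one PMD table per touched 1G region, one PUD table per touched 512G region, one P4D table per touched 256T region in 5-level mode, plus a single top-level table. `count_distinct()` and `pgt_pages_needed()` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Count how many distinct values (addr >> shift) takes over the array. */
static size_t count_distinct(const uint64_t *addrs, size_t n, unsigned int shift)
{
	size_t i, j, distinct = 0;

	for (i = 0; i < n; i++) {
		for (j = 0; j < i; j++)
			if ((addrs[i] >> shift) == (addrs[j] >> shift))
				break;
		if (j == i)  /* no earlier address shares this region */
			distinct++;
	}
	return distinct;
}

/*
 * Simplified model: table pages needed to identity-map the touched
 * addresses with 2M pages. levels is 4 or 5.
 */
static size_t pgt_pages_needed(const uint64_t *addrs, size_t n, int levels)
{
	size_t pages = 1;  /* the top-level table always exists */

	if (n == 0)
		return pages;

	pages += count_distinct(addrs, n, 30);  /* PMD tables: one per 1G */
	pages += count_distinct(addrs, n, 39);  /* PUD tables: one per 512G */
	if (levels == 5)
		pages += count_distinct(addrs, n, 48);  /* P4D tables: one per 256T */

	return pages;
}
```

Under this model, touching one address in each of the first four gigabytes (`0`, `1 * GIB`, `2 * GIB`, `3 * GIB`) needs 6 pages with 4-level paging and 7 with 5-level paging, while touching only three of the four gigabytes needs 6 pages in 5-level mode. That is consistent with the description above: a budget sized for the 4-level case is exceeded only when the fourth PMD table is requested with 5-level paging enabled.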
diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
index 9191280d9ea3..aaf1b2fc6ede 100644
--- a/arch/x86/include/asm/boot.h
+++ b/arch/x86/include/asm/boot.h
@@ -40,23 +40,20 @@
 #ifdef CONFIG_X86_64
 # define BOOT_STACK_SIZE 0x4000

-# define BOOT_INIT_PGT_SIZE (6*4096)
-# ifdef CONFIG_RANDOMIZE_BASE
 /*
- * Assuming all cross the 512GB boundary:
- * 1 page for level4
- * (2+2)*4 pages for kernel, param, cmd_line, and randomized kernel
- * 2 pages for first 2M (video RAM: CONFIG_X86_VERBOSE_BOOTUP).
- * Total is 19 pages.
+ * Used by decompressor's startup_32() to allocate page tables for identity
+ * mapping of the 4G of RAM in 4-level paging mode.
+ *
+ * The additional page table needed for 5-level paging is allocated from
+ * trampoline_32bit memory.
  */
-# ifdef CONFIG_X86_VERBOSE_BOOTUP
-# define BOOT_PGT_SIZE (19*4096)
-# else /* !CONFIG_X86_VERBOSE_BOOTUP */
-# define BOOT_PGT_SIZE (17*4096)
-# endif
-# else /* !CONFIG_RANDOMIZE_BASE */
-# define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
-# endif
+# define BOOT_INIT_PGT_SIZE (6*4096)
+
+/*
+ * Total number of page table kernel_add_identity_map() can allocate,
+ * including page tables consumed by startup_32().
+ */
+# define BOOT_PGT_SIZE (32*4096)

 #else /* !CONFIG_X86_64 */
 # define BOOT_STACK_SIZE 0x1000
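The constants in the diff can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers that appear in the patch (6, 17, 19 and 32 pages of 4096 bytes); `OLD_WORST_CASE_PAGES` and `pgt_pages()` are hypothetical names introduced for the check.

```c
#include <stddef.h>

#define PAGE_SIZE            4096UL
#define BOOT_INIT_PGT_SIZE   (6 * PAGE_SIZE)   /* unchanged by the patch */
#define BOOT_PGT_SIZE        (32 * PAGE_SIZE)  /* new single value */
#define OLD_WORST_CASE_PAGES 19UL              /* old KASLR + verbose bootup limit */

/* 32 pages of 4K is the 128K mentioned in the commit message. */
_Static_assert(BOOT_PGT_SIZE == 128UL * 1024, "32 * 4096 should equal 128K");

/* The removed comment: 1 (level4) + (2+2)*4 + 2 (first 2M) = 19 pages. */
_Static_assert(1 + (2 + 2) * 4 + 2 == OLD_WORST_CASE_PAGES,
	       "old worst-case estimate should add up to 19 pages");

/* Convert a pool size in bytes to a number of page-table pages. */
static size_t pgt_pages(size_t bytes)
{
	return bytes / PAGE_SIZE;
}
```

By this count the new value leaves 13 pages of headroom over the old 19-page worst case, which is what the commit message relies on when it calls 128K sufficient for any use-case.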