Message ID | 20230606142637.5171-6-kirill.shutemov@linux.intel.com |
---|---|
State | New |
Headers |
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: Borislav Petkov <bp@alien8.de>, Andy Lutomirski <luto@kernel.org>, Dave Hansen <dave.hansen@intel.com>, Sean Christopherson <seanjc@google.com>, Andrew Morton <akpm@linux-foundation.org>, Joerg Roedel <jroedel@suse.de>, Ard Biesheuvel <ardb@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>, David Rientjes <rientjes@google.com>, Vlastimil Babka <vbabka@suse.cz>, Tom Lendacky <thomas.lendacky@amd.com>, Thomas Gleixner <tglx@linutronix.de>, Peter Zijlstra <peterz@infradead.org>, Paolo Bonzini <pbonzini@redhat.com>, Ingo Molnar <mingo@redhat.com>, Dario Faggioli <dfaggioli@suse.com>, Mike Rapoport <rppt@kernel.org>, David Hildenbrand <david@redhat.com>, Mel Gorman <mgorman@techsingularity.net>, marcelo.cerri@canonical.com, tim.gardner@canonical.com, khalid.elmously@canonical.com, philip.cox@canonical.com, aarcange@redhat.com, peterx@redhat.com, x86@kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev, linux-efi@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCHv14 5/9] efi: Add unaccepted memory support
Date: Tue, 6 Jun 2023 17:26:33 +0300
Message-Id: <20230606142637.5171-6-kirill.shutemov@linux.intel.com>
In-Reply-To: <20230606142637.5171-1-kirill.shutemov@linux.intel.com>
References: <20230606142637.5171-1-kirill.shutemov@linux.intel.com>
Series | mm, x86/cc, efi: Implement support for unaccepted memory |
Commit Message
Kirill A. Shutemov
June 6, 2023, 2:26 p.m. UTC
efi_config_parse_tables() reserves the memory that holds the unaccepted memory configuration table so that it won't be reused by the page allocator.

Core-mm requires a few helpers to support unaccepted memory:

- accept_memory() checks the range of addresses against the bitmap and accepts the memory if needed.

- range_contains_unaccepted_memory() checks if anything within the range requires acceptance.

Architectural code has to provide efi_get_unaccepted_table(), which returns a pointer to the unaccepted memory configuration table, and arch_accept_memory(), which handles the arch-specific part of memory acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/platform/efi/efi.c              |   3 +
 drivers/firmware/efi/Makefile            |   1 +
 drivers/firmware/efi/efi.c               |  25 +++++
 drivers/firmware/efi/unaccepted_memory.c | 112 +++++++++++++++++++++++
 include/linux/efi.h                      |   1 +
 5 files changed, 142 insertions(+)
 create mode 100644 drivers/firmware/efi/unaccepted_memory.c
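For orientation, a minimal sketch of how a core-mm caller could combine the two helpers. The wrapper ensure_range_accepted() and its placement are hypothetical; only accept_memory() and range_contains_unaccepted_memory() come from this series, and their declaring header is outside this patch:

        /* Hypothetical caller sketch -- not part of this patch. */
        #include <linux/types.h>

        /* Declared by this series (header location outside this patch). */
        bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
        void accept_memory(phys_addr_t start, phys_addr_t end);

        static void ensure_range_accepted(phys_addr_t start, phys_addr_t end)
        {
                /* Fast path: nothing in [start, end) is marked unaccepted. */
                if (!range_contains_unaccepted_memory(start, end))
                        return;

                /* Slow path: accept via the arch hook and clear the bitmap bits. */
                accept_memory(start, end);
        }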
Comments
On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> efi_config_parse_tables() reserves memory that holds unaccepted memory
> configuration table so it won't be reused by page allocator.
[...]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

By and large, this looks ok from the page allocator perspective as the
checks for unaccepted are mostly after watermark checks. However, if you
look in the initial fast path, you'll see this

        /*
         * Forbid the first pass from falling back to types that fragment
         * memory until all local zones are considered.
         */
        alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);

While checking watermarks should be fine from a functional perspective and
the fast paths are unaffected, there is a risk of premature fragmentation
until all memory has been accepted. Meeting watermarks does not necessarily
mean that fragmentation is avoided as pageblocks can get mixed while still
meeting watermarks. This will be tricky to detect in most cases so it may
be worth considering augmenting the watermark checks with a check in this
block for unaccepted memory

        /*
         * It's possible on a UMA machine to get through all zones that are
         * fragmented. If avoiding fragmentation, reset and try again.
         */
        if (no_fallback) {
                alloc_flags &= ~ALLOC_NOFRAGMENT;
                goto retry;
        }

I think the watermark checks will still be needed because hypothetically
if 100% of allocations were MIGRATE_UNMOVABLE then there never would be a
request that fragments memory and memory would not be accepted.

It's not a blocker to the series, simply a suggestion for follow-on work.
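A hedged sketch of that follow-on idea, not a tested patch: it assumes the series' try_to_accept_memory() helper (referenced later in this thread via try_to_accept_memory_one()) is callable at this point in the allocator, where no_fallback, order, ac and alloc_flags are already in scope:

        /*
         * Sketch only: before giving up on ALLOC_NOFRAGMENT and mixing
         * pageblocks, try to accept a fresh chunk of unaccepted memory
         * and retry, so the request can be served from an unmixed block.
         */
        if (no_fallback) {
                if (try_to_accept_memory(ac.preferred_zoneref->zone, order))
                        goto retry;

                alloc_flags &= ~ALLOC_NOFRAGMENT;
                goto retry;
        }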
On Mon, Jul 03, 2023 at 02:25:18PM +0100, Mel Gorman wrote:
> By and large, this looks ok from the page allocator perspective as the
> checks for unaccepted are mostly after watermark checks. However, if you
> look in the initial fast path, you'll see this
>
>         alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);
>
> While checking watermarks should be fine from a functional perspective and
> the fast paths are unaffected, there is a risk of premature fragmentation
> until all memory has been accepted. Meeting watermarks does not necessarily
> mean that fragmentation is avoided as pageblocks can get mixed while still
> meeting watermarks.

Could you elaborate on this scenario?

The current code checks the watermark and, if it is met, tries rmqueue().

If rmqueue() fails anyway, it tries to accept more pages and retries the
zone if that is successful.

I'm not sure how we can get to the 'if (no_fallback) {' case with any
unaccepted memory in the allowed zones.

I see that there's preferred_zoneref and spread_dirty_pages cases, but
unaccepted memory seems to change nothing for them.

Hm?
On Tue, Jul 04, 2023 at 05:37:40PM +0300, Kirill A. Shutemov wrote:
> Could you elaborate on this scenario?
>
> The current code checks the watermark and, if it is met, tries rmqueue().
>
> If rmqueue() fails anyway, it tries to accept more pages and retries the
> zone if that is successful.
>
> I'm not sure how we can get to the 'if (no_fallback) {' case with any
> unaccepted memory in the allowed zones.

Let's take an extreme example and assume that the low watermark is lower
than 2MB (one pageblock). Just before the watermark is reached (free count
between 1MB and 2MB), it is unlikely that all free pages are within
pageblocks of the same migratetype (e.g. MIGRATE_MOVABLE). If there is an
allocation near the watermark of a different type (e.g. MIGRATE_UNMOVABLE)
then the page allocation could fall back to a different pageblock and now
it is mixed. It's a condition that is only obvious if you are explicitly
checking for it via tracepoints. This can happen in the normal case, but
unaccepted memory makes it worse because the "pageblock mixing" could have
been avoided if the "no_fallback" case accepted at least one new pageblock
instead of mixing pageblocks.

That is an extreme example but the same logic applies when the free count
is at or near MIGRATE_TYPES*pageblock_nr_pages, as it is not guaranteed
that the pageblocks with free pages are a migratetype that matches the
allocation request. Hence, it may be more robust from a fragmentation
perspective if ALLOC_NOFRAGMENT requests accept memory if it is available
and retry before clearing ALLOC_NOFRAGMENT and mixing pageblocks before
the watermarks are reached.

> I see that there's preferred_zoneref and spread_dirty_pages cases, but
> unaccepted memory seems to change nothing for them.

preferred_zoneref is about premature zone exhaustion and spread_dirty_pages
is about avoiding premature stalls on a node/zone due to an imbalance in
the number of pages waiting for writeback to complete. There is an argument
to be made that they also should accept memory but it's less clear how much
of a problem this is. Both are very obvious when they "fail" and likely are
covered by the existing watermark checks. Premature pageblock mixing is
more subtle as the final impact (root cause of a premature THP allocation
failure) is harder to detect.
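To put rough numbers on the regime described above, a small standalone calculation. All values are assumed for illustration (4KB base pages, 2MB pageblocks, MIGRATE_TYPES of 6); the real values depend on architecture and kernel config:

        #include <stdio.h>

        int main(void)
        {
                /* Assumed values; depend on kernel config and architecture. */
                unsigned long page_size = 4096;         /* 4KB base pages */
                unsigned long pageblock_nr_pages = 512; /* 2MB pageblocks */
                unsigned long migrate_types = 6;

                /*
                 * Below roughly this free count, the free pageblocks may not
                 * cover every migratetype, so an allocation can be forced to
                 * fall back and mix pageblocks even though watermarks are
                 * still met.
                 */
                unsigned long risky = migrate_types * pageblock_nr_pages;

                printf("risky below ~%lu free pages (~%lu MB)\n",
                       risky, risky * page_size >> 20); /* 3072 pages, 12 MB */
                return 0;
        }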
On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> efi_config_parse_tables() reserves memory that holds unaccepted memory
> configuration table so it won't be reused by page allocator.
[...]
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> new file mode 100644
> index 000000000000..08a9a843550a
> --- /dev/null
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -0,0 +1,112 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <linux/memblock.h>
> +#include <linux/spinlock.h>
> +#include <asm/unaccepted_memory.h>
> +
> +/* Protects unaccepted memory bitmap */
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> +
[...]
> +	range_start = start / unit_size;
> +
> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
> +				   DIV_ROUND_UP(end, unit_size)) {
> +		unsigned long phys_start, phys_end;
> +		unsigned long len = range_end - range_start;
> +
> +		phys_start = range_start * unit_size + unaccepted->phys_base;
> +		phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> +		arch_accept_memory(phys_start, phys_end);
> +		bitmap_clear(unaccepted->bitmap, range_start, len);
> +	}
> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}

While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
into what seems to be fairly significant lock contention due to the
unaccepted_memory_lock spinlock above, which results in a constant stream
of soft-lockups until the workload gets all its memory accepted/faulted
in if the guest has around 16+ vCPUs.

I've included the guest dmesg traces I was seeing below.

In this case I was running a 32 vCPU guest with 200GB of memory running on
a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
reliably by running the following workload in a freshly-booted guest:

  stress --vm 32 --vm-bytes 5G --vm-keep

Scaling up the number of stress threads and vCPUs should make it easier
to reproduce.

Other than unresponsiveness/lockup messages until the memory is accepted,
the guest seems to continue running fine, but for large guests where
unaccepted memory is more likely to be useful, it seems like it could be
an issue, especially when considering 100+ vCPU guests.

-Mike

[  105.753073] watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [stress:1590]
[  105.756109] Modules linked in: btrfs(E) blake2b_generic(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) multipath(E) linear(E) crc32_pclmul(E) virtio_net(E) i2c_i801(E) net_failover(E) i2c_smbus(E) psmouse(E) failover(E) virtio_scsi(E) lpc_ich(E)
[  105.771644] irq event stamp: 392274
[  105.773852] hardirqs last enabled at (392273): [<ffffffff9fbddf9a>] _raw_spin_unlock_irqrestore+0x5a/0x70
[  105.780058] hardirqs last disabled at (392274): [<ffffffff9fbddca9>] _raw_spin_lock_irqsave+0x69/0x70
[  105.785486] softirqs last enabled at (392158): [<ffffffff9fbdf663>] __do_softirq+0x2a3/0x357
[  105.790627] softirqs last disabled at (392149): [<ffffffff9ecd0f5c>] irq_exit_rcu+0x8c/0xb0
[  105.795972] CPU: 3 PID: 1590 Comm: stress Tainted: G EL 6.6.0-rc5-snp-guest-v10n+ #1
[  105.802416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[  105.807040] RIP: 0010:_raw_spin_unlock_irqrestore+0x45/0x70
[  105.810327] Code: b1 b1 18 ff 4c 89 e7 e8 79 e7 18 ff 81 e3 00 02 00 00 75 26 9c 58 0f 1f 40 00 f6 c4 02 75 22 48 85 db 74 06 fb 0f 1f 44 00 00 <65> ff 0d fc 5d 46 60 5b 41 5c 5d c3 cc cc cc cc e8 c6 9f 28 ff eb
[  105.818749] RSP: 0000:ffffc9000585faa8 EFLAGS: 00000206
[  105.821155] RAX: 0000000000000046 RBX: 0000000000000200 RCX: 00000000000028ce
[  105.825258] RDX: ffffffff9f8c7830 RSI: ffffffff9fbddf9a RDI: ffffffff9fbddf9a
[  105.828287] RBP: ffffc9000585fab8 R08: 0000000000000000 R09: 0000000000000000
[  105.831322] R10: 000000ffffffffff R11: 0000000000000000 R12: ffffffffa091a160
[  105.834713] R13: 00000000000028cd R14: ffff88807c459030 R15: ffff88807c459018
[  105.838230] FS:  00007fa4a8ee8740(0000) GS:ffff88b1aff80000(0000) knlGS:0000000000000000
[  105.841677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  105.844115] CR2: 00007fa3719ec010 CR3: 000800010f7c2003 CR4: 0000000000770ef0
[  105.847160] PKRU: 55555554
[  105.848490] Call Trace:
[  105.849634]  <IRQ>
[  105.850517]  ? show_regs+0x68/0x70
[  105.851986]  ? watchdog_timer_fn+0x1dc/0x240
[  105.853868]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  105.855827]  ? __hrtimer_run_queues+0x1c3/0x340
[  105.857821]  ? hrtimer_interrupt+0x109/0x240
[  105.859673]  ? __sysvec_apic_timer_interrupt+0x67/0x170
[  105.861941]  ? sysvec_apic_timer_interrupt+0x7b/0x90
[  105.864048]  </IRQ>
[  105.864981]  <TASK>
[  105.865901]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[  105.868110]  ? accept_memory+0x150/0x170
[  105.869795]  ? _raw_spin_unlock_irqrestore+0x5a/0x70
[  105.871885]  ? _raw_spin_unlock_irqrestore+0x5a/0x70
[  105.873966]  ? _raw_spin_unlock_irqrestore+0x45/0x70
[  105.876047]  accept_memory+0x150/0x170
[  105.877665]  try_to_accept_memory+0x134/0x1b0
[  105.879532]  get_page_from_freelist+0xa3e/0x1370
[  105.881479]  ? lock_acquire+0xd8/0x2b0
[  105.883045]  __alloc_pages+0x1b7/0x390
[  105.884619]  __folio_alloc+0x1b/0x50
[  105.886151]  ? policy_node+0x57/0x70
[  105.887673]  vma_alloc_folio+0xa6/0x360
[  105.889325]  do_pte_missing+0x1a5/0x8b0
[  105.890948]  __handle_mm_fault+0x75e/0xda0
[  105.892693]  handle_mm_fault+0xe1/0x2d0
[  105.894304]  do_user_addr_fault+0x1ce/0xa40
[  105.896050]  ? exit_to_user_mode_prepare+0xa4/0x230
[  105.898113]  exc_page_fault+0x84/0x200
[  105.899693]  asm_exc_page_fault+0x27/0x30
[  105.901405] RIP: 0033:0x55ebd4602cf0
[  105.902926] Code: 8b 54 24 0c 31 c0 85 d2 0f 94 c0 89 04 24 41 83 fd 02 0f 8f 3e 02 00 00 48 85 ed 48 89 d8 7e 1b 66 2e 0f 1f 84 00 00 00 00 00 <c6> 00 5a 4c 01 f8 48 89 c2 48 29 da 48 39 d5 7f ef 49 83 fc 00 0f
[  105.910671] RSP: 002b:00007ffebfede470 EFLAGS: 00010206
[  105.912897] RAX: 00007fa3719ec010 RBX: 00007fa3683ff010 RCX: 00007fa3683ff010
[  105.915859] RDX: 00000000095ed000 RSI: 0000000140001000 RDI: 0000000000000000
[  105.918853] RBP: 0000000140000000 R08: 00000000ffffffff R09: 0000000000000000
[  105.921836] R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
[  105.924810] R13: 0000000000000002 R14: fffffffffffff000 R15: 0000000000001000
[  105.927771]  </TASK>
[  110.367774] watchdog: BUG: soft lockup - CPU#15 stuck for 32s! [stress:1594]
[  110.369415] watchdog: BUG: soft lockup - CPU#13 stuck for 36s! [stress:1602]
On Tue, Oct 10, 2023 at 04:05:18PM -0500, Michael Roth wrote:
> While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
> into what seems to be fairly significant lock contention due to the
> unaccepted_memory_lock spinlock above, which results in a constant stream
> of soft-lockups until the workload gets all its memory accepted/faulted
> in if the guest has around 16+ vCPUs.
[...]
> Other than unresponsiveness/lockup messages until the memory is accepted,
> the guest seems to continue running fine, but for large guests where
> unaccepted memory is more likely to be useful, it seems like it could be
> an issue, especially when considering 100+ vCPU guests.

Okay, sorry for the delay. It took time to reproduce it with TDX.

I will look at what can be done.
On Fri, Oct 13, 2023 at 03:33:58PM +0300, Kirill A. Shutemov wrote:
> > While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
> > into what seems to be fairly significant lock contention due to the
> > unaccepted_memory_lock spinlock above, which results in a constant stream
> > of soft-lockups until the workload gets all its memory accepted/faulted
> > in if the guest has around 16+ vCPUs.
[...]
> Okay, sorry for the delay. It took time to reproduce it with TDX.
>
> I will look at what can be done.

Could you check if the patch below helps?

diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 853f7dc3c21d..591da3f368fa 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -8,6 +8,14 @@
 /* Protects unaccepted memory bitmap */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
 
+struct accept_range {
+        struct list_head list;
+        unsigned long start;
+        unsigned long end;
+};
+
+static LIST_HEAD(accepting_list);
+
 /*
  * accept_memory() -- Consult bitmap and accept the memory if needed.
  *
@@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 {
         struct efi_unaccepted_memory *unaccepted;
         unsigned long range_start, range_end;
+        struct accept_range range, *entry;
         unsigned long flags;
         u64 unit_size;
 
@@ -80,7 +89,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 
         range_start = start / unit_size;
 
+        range.start = start;
+        range.end = end;
+retry:
         spin_lock_irqsave(&unaccepted_memory_lock, flags);
+
+        list_for_each_entry(entry, &accepting_list, list) {
+                if (entry->end < start)
+                        continue;
+                if (entry->start > end)
+                        continue;
+                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+                /* Somebody else accepting the range */
+                cpu_relax();
+                goto retry;
+        }
+
+        list_add(&range.list, &accepting_list);
+
         for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
                                    DIV_ROUND_UP(end, unit_size)) {
                 unsigned long phys_start, phys_end;
@@ -89,9 +116,15 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
                 phys_start = range_start * unit_size + unaccepted->phys_base;
                 phys_end = range_end * unit_size + unaccepted->phys_base;
 
+                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
                 arch_accept_memory(phys_start, phys_end);
+
+                spin_lock_irqsave(&unaccepted_memory_lock, flags);
                 bitmap_clear(unaccepted->bitmap, range_start, len);
         }
+
+        list_del(&range.list);
         spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
 }
On 10/13/23 18:22, Kirill A. Shutemov wrote:
> Could you check if the patch below helps?
>
[...]
> +retry:
>          spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +
> +        list_for_each_entry(entry, &accepting_list, list) {
> +                if (entry->end < start)
> +                        continue;
> +                if (entry->start > end)
> +                        continue;
> +                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> +                /* Somebody else accepting the range */
> +                cpu_relax();

Should this be rather cond_resched()? I think cpu_relax() isn't enough to
prevent soft lockups.

Although IIUC hitting this should be rare, as the contending tasks will
pick different ranges via try_to_accept_memory_one(), right?
On Fri, Oct 13, 2023 at 06:44:45PM +0200, Vlastimil Babka wrote:
> > +                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +
> > +                /* Somebody else accepting the range */
> > +                cpu_relax();
>
> Should this be rather cond_resched()? I think cpu_relax() isn't enough to
> prevent soft lockups.

Right. For some reason, I thought we cannot call cond_resched() from
atomic context (we sometimes get there from atomic context), but we can.

> Although IIUC hitting this should be rare, as the contending tasks will
> pick different ranges via try_to_accept_memory_one(), right?

Yes, it should be rare.

Generally, with the exception of memblock, we accept all memory in
MAX_ORDER chunks. As long as unit_size <= MAX_ORDER, the page allocator
should never trigger the conflict, as the caller owns the full range to
accept.

I will test the idea with larger unit_size to see how it behaves.
On 10/13/23 11:22, Kirill A. Shutemov wrote:
> Could you check if the patch below helps?
>
[...]
> +        range.start = start;
> +        range.end = end;
> +retry:
>          spin_lock_irqsave(&unaccepted_memory_lock, flags);
> +
> +        list_for_each_entry(entry, &accepting_list, list) {
> +                if (entry->end < start)
> +                        continue;
> +                if (entry->start > end)

Should this be a >= check since start and end are page aligned values?

> +                        continue;
> +                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> +                /* Somebody else accepting the range */
> +                cpu_relax();
> +                goto retry;

Could you set some kind of flag here so that ...

> +        }
> +

... once you get here, that means that area was accepted and removed from
the list, so I think you could just drop the lock and exit now, right?
Because at that point the bitmap will have been updated and you wouldn't
be accepting any memory anyway?

Thanks,
Tom
On Fri, Oct 13, 2023 at 12:45:20PM -0500, Tom Lendacky wrote:
> > +        list_for_each_entry(entry, &accepting_list, list) {
> > +                if (entry->end < start)
> > +                        continue;
> > +                if (entry->start > end)
>
> Should this be a >= check since start and end are page aligned values?

Right. Good catch.

> > +                        continue;
> > +                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +
> > +                /* Somebody else accepting the range */
> > +                cpu_relax();
> > +                goto retry;
>
> Could you set some kind of flag here so that ...
>
> > +        }
> > +
>
> ... once you get here, that means that area was accepted and removed from
> the list, so I think you could just drop the lock and exit now, right?
> Because at that point the bitmap will have been updated and you wouldn't
> be accepting any memory anyway?

No. Consider the case where someone else accepts only part of the range
you are accepting.

I guess we could check whether the range on the list fully covers what we
are accepting, but that is a complication. Checking the bitmap at this
point is cheap enough: we already hold the lock.
On Fri, Oct 13, 2023 at 08:27:28PM +0300, Kirill A. Shutemov wrote:
> I will test the idea with larger unit_size to see how it behaves.
It indeed uncovered an issue. We need to record ranges on accepting_list
at unit_size granularity. Otherwise, we fail to stop parallel accept
requests to the same unit_size block if they don't overlap in physical
addresses.
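To make that granularity point concrete, a small standalone illustration (the 2MB unit_size is an assumed example value, not something this thread specifies):

        #include <stdio.h>

        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        int main(void)
        {
                unsigned long long unit_size = 2ULL << 20; /* assumed: 2MB units */

                /* Two physically disjoint requests inside the same 2MB unit. */
                unsigned long long a_start = 0x000000, a_end = 0x100000; /* [0, 1MB)   */
                unsigned long long b_start = 0x100000, b_end = 0x200000; /* [1MB, 2MB) */

                /* Both collapse to bitmap bit range [0, 1): same unit_size block. */
                printf("A: bits [%llu, %llu)\n",
                       a_start / unit_size, DIV_ROUND_UP(a_end, unit_size));
                printf("B: bits [%llu, %llu)\n",
                       b_start / unit_size, DIV_ROUND_UP(b_end, unit_size));

                /*
                 * Tracking raw physical addresses would treat A and B as
                 * disjoint and let two CPUs call arch_accept_memory() on the
                 * same unit; recording bit ranges on accepting_list makes the
                 * second request wait instead.
                 */
                return 0;
        }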
Updated patch:
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 853f7dc3c21d..8af0306c8e5c 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -5,9 +5,17 @@
 #include <linux/spinlock.h>
 #include <asm/unaccepted_memory.h>
 
-/* Protects unaccepted memory bitmap */
+/* Protects unaccepted memory bitmap and accepting_list */
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
 
+struct accept_range {
+        struct list_head list;
+        unsigned long start;
+        unsigned long end;
+};
+
+static LIST_HEAD(accepting_list);
+
 /*
  * accept_memory() -- Consult bitmap and accept the memory if needed.
  *
@@ -24,6 +32,7 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 {
         struct efi_unaccepted_memory *unaccepted;
         unsigned long range_start, range_end;
+        struct accept_range range, *entry;
         unsigned long flags;
         u64 unit_size;
 
@@ -78,20 +87,58 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
         if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
                 end = unaccepted->size * unit_size * BITS_PER_BYTE;
 
-        range_start = start / unit_size;
-
+        range.start = start / unit_size;
+        range.end = DIV_ROUND_UP(end, unit_size);
+retry:
         spin_lock_irqsave(&unaccepted_memory_lock, flags);
+
+        /*
+         * Check if anybody works on accepting the same range of the memory.
+         *
+         * The check with unit_size granularity. It is crucial to catch all
+         * accept requests to the same unit_size block, even if they don't
+         * overlap on physical address level.
+         */
+        list_for_each_entry(entry, &accepting_list, list) {
+                if (entry->end < range.start)
+                        continue;
+                if (entry->start >= range.end)
+                        continue;
+
+                /*
+                 * Somebody else accepting the range. Or at least part of it.
+                 *
+                 * Drop the lock and retry until it is complete.
+                 */
+                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+                cond_resched();
+                goto retry;
+        }
+
+        /*
+         * Register that the range is about to be accepted.
+         * Make sure nobody else will accept it.
+         */
+        list_add(&range.list, &accepting_list);
+
+        range_start = range.start;
         for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
-                                   DIV_ROUND_UP(end, unit_size)) {
+                                   range.end) {
                 unsigned long phys_start, phys_end;
                 unsigned long len = range_end - range_start;
 
                 phys_start = range_start * unit_size + unaccepted->phys_base;
                 phys_end = range_end * unit_size + unaccepted->phys_base;
 
+                spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
                 arch_accept_memory(phys_start, phys_end);
+
+                spin_lock_irqsave(&unaccepted_memory_lock, flags);
                 bitmap_clear(unaccepted->bitmap, range_start, len);
         }
+
+        list_del(&range.list);
         spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
 }
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index f3f2d87cce1b..e9f99c56f3ce 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -96,6 +96,9 @@ static const unsigned long * const efi_tables[] = {
 #ifdef CONFIG_EFI_COCO_SECRET
         &efi.coco_secret,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+        &efi.unaccepted,
+#endif
 };
 
 u64 efi_setup;          /* efi setup_data physical address */
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index b51f2a4c821e..e489fefd23da 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)        += capsule-loader.o
 obj-$(CONFIG_EFI_EARLYCON)              += earlycon.o
 obj-$(CONFIG_UEFI_CPER_ARM)             += cper-arm.o
 obj-$(CONFIG_UEFI_CPER_X86)             += cper-x86.o
+obj-$(CONFIG_UNACCEPTED_MEMORY)         += unaccepted_memory.o
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 7dce06e419c5..d817e7afd266 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
 #ifdef CONFIG_EFI_COCO_SECRET
         .coco_secret            = EFI_INVALID_TABLE_ADDR,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+        .unaccepted             = EFI_INVALID_TABLE_ADDR,
+#endif
 };
 
 EXPORT_SYMBOL(efi);
@@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
 #ifdef CONFIG_EFI_COCO_SECRET
         {LINUX_EFI_COCO_SECRET_AREA_GUID,       &efi.coco_secret,       "CocoSecret"    },
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+        {LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID,   &efi.unaccepted,        "Unaccepted"    },
+#endif
 #ifdef CONFIG_EFI_GENERIC_STUB
         {LINUX_EFI_SCREEN_INFO_TABLE_GUID,      &screen_info_table      },
 #endif
@@ -759,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
                 }
         }
 
+        if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+            efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
+                struct efi_unaccepted_memory *unaccepted;
+
+                unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
+                if (unaccepted) {
+                        unsigned long size;
+
+                        if (unaccepted->version == 1) {
+                                size = sizeof(*unaccepted) + unaccepted->size;
+                                memblock_reserve(efi.unaccepted, size);
+                        } else {
+                                efi.unaccepted = EFI_INVALID_TABLE_ADDR;
+                        }
+
+                        early_memunmap(unaccepted, sizeof(*unaccepted));
+                }
+        }
+
         return 0;
 }
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
new file mode 100644
index 000000000000..08a9a843550a
--- /dev/null
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <linux/memblock.h>
+#include <linux/spinlock.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+/*
+ * accept_memory() -- Consult bitmap and accept the memory if needed.
+ *
+ * Only memory that is explicitly marked as unaccepted in the bitmap requires
+ * an action. All the remaining memory is implicitly accepted and doesn't need
+ * acceptance.
+ *
+ * No need to accept:
+ *  - anything if the system has no unaccepted table;
+ *  - memory that is below phys_base;
+ *  - memory that is above the memory that addressable by the bitmap;
+ */
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+        struct efi_unaccepted_memory *unaccepted;
+        unsigned long range_start, range_end;
+        unsigned long flags;
+        u64 unit_size;
+
+        unaccepted = efi_get_unaccepted_table();
+        if (!unaccepted)
+                return;
+
+        unit_size = unaccepted->unit_size;
+
+        /*
+         * Only care for the part of the range that is represented
+         * in the bitmap.
+         */
+        if (start < unaccepted->phys_base)
+                start = unaccepted->phys_base;
+        if (end < unaccepted->phys_base)
+                return;
+
+        /* Translate to offsets from the beginning of the bitmap */
+        start -= unaccepted->phys_base;
+        end -= unaccepted->phys_base;
+
+        /* Make sure not to overrun the bitmap */
+        if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+                end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+        range_start = start / unit_size;
+
+        spin_lock_irqsave(&unaccepted_memory_lock, flags);
+        for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
+                                   DIV_ROUND_UP(end, unit_size)) {
+                unsigned long phys_start, phys_end;
+                unsigned long len = range_end - range_start;
+
+                phys_start = range_start * unit_size + unaccepted->phys_base;
+                phys_end = range_end * unit_size + unaccepted->phys_base;
+
+                arch_accept_memory(phys_start, phys_end);
+                bitmap_clear(unaccepted->bitmap, range_start, len);
+        }
+        spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+        struct efi_unaccepted_memory *unaccepted;
+        unsigned long flags;
+        bool ret = false;
+        u64 unit_size;
+
+        unaccepted = efi_get_unaccepted_table();
+        if (!unaccepted)
+                return false;
+
+        unit_size = unaccepted->unit_size;
+
+        /*
+         * Only care for the part of the range that is represented
+         * in the bitmap.
+         */
+        if (start < unaccepted->phys_base)
+                start = unaccepted->phys_base;
+        if (end < unaccepted->phys_base)
+                return false;
+
+        /* Translate to offsets from the beginning of the bitmap */
+        start -= unaccepted->phys_base;
+        end -= unaccepted->phys_base;
+
+        /* Make sure not to overrun the bitmap */
+        if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+                end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+        spin_lock_irqsave(&unaccepted_memory_lock, flags);
+        while (start < end) {
+                if (test_bit(start / unit_size, unaccepted->bitmap)) {
+                        ret = true;
+                        break;
+                }
+
+                start += unit_size;
+        }
+        spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+        return ret;
+}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 29cc622910da..9864f9c00da2 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -646,6 +646,7 @@ extern struct efi {
         unsigned long                   tpm_final_log;          /* TPM2 Final Events Log table */
         unsigned long                   mokvar_table;           /* MOK variable config table */
         unsigned long                   coco_secret;            /* Confidential computing secret table */
+        unsigned long                   unaccepted;             /* Unaccepted memory table */
 
         efi_get_time_t                  *get_time;
         efi_set_time_t                  *set_time;