Message ID | 20230613052523.1106821-2-feng.tang@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp324534vqr; Mon, 12 Jun 2023 22:53:01 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6zX38BieWzQcQ59/3etd+LT1t5VPo8hdEFcR7sRwOgqy/1qNRDm84p9fyQSTFGOZJHomsS X-Received: by 2002:a81:5a54:0:b0:565:5478:713a with SMTP id o81-20020a815a54000000b005655478713amr873191ywb.49.1686635580919; Mon, 12 Jun 2023 22:53:00 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1686635580; cv=none; d=google.com; s=arc-20160816; b=M2I/IYa7YwqIC44Ckyyec+17wPnv2Gc1M/gWOmJrHq58wDzeYo6acWx0pkie6qOD0e WNJWKOyUfxTzPXnYDDsdPuULFIS7x+Cc+78MGqIJKE86UcizUUsL3fvtjllEdTmOfcA5 RLrQivO5K52EAJeV6g+KLBTnKICKBDUSYmR+CJWTU9q4whRKDlf5I0nLC2LH128XaBhe BMH4oeRd0HaSFMM6RAsUCXkjXiB+xilQLFOA7g3QLKEJbIPl4AYWMgwfPjtOVPVUWnN+ jvtkYj+wc189bcHAYVypSUpuYWbKcDK47SQbSXz9EsoYP12r0zA0L0BRp9evG0yGadK6 YdAA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=rkhn+pQiu97P3GZq+Dm1EBx6T2HUi4CZzTXdCl9En0U=; b=QCC7u6bSr3vRmyufhHqqQrtW+x2g4n8YZmL/U7KKVKPx6hcbZ3LMJMbaExiqC6mzzi cssT4/zWHKNmtKTPo9idc/xN/VwAjt/gznYzkIe0YYcZt7fQFhWLdlJb1mcy13CcaZFo wuJqcrurzSwr9BXa3jOEs7HAbgLgT/TOtrJFo6Fz16e5AoxGd2YMk4tEvspkm9fzyqwO o+ObGOJmMvaz+vfavhNOsARVNOeqUpPqnetmxA4eKBsQYZUrwS6SLIoi790wtpfwnz2k IheAU8vGe+8sTqGh9xSl1o08p31e8lmgjU7P+tCvGb1m0tjvdLhZYJGAG7XPxesvemxF RsCg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=hP6pd6av; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id d4-20020a170903230400b001add2ba445asi5217032plh.259.2023.06.12.22.52.48; Mon, 12 Jun 2023 22:53:00 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=hP6pd6av; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240301AbjFMFgk (ORCPT <rfc822;liningstudo@gmail.com> + 99 others); Tue, 13 Jun 2023 01:36:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44878 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240367AbjFMFgM (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 13 Jun 2023 01:36:12 -0400 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 98B0030E6 for <linux-kernel@vger.kernel.org>; Mon, 12 Jun 2023 22:32:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1686634375; x=1718170375; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=R9c4pKHRdrSlCBk6dSsMemyO2HEpJapH/lMhjwhIo64=; b=hP6pd6av2anjU2lJLW4BioD0CHhfJCXjKHccw6ylCpJ0Cl9qCYD/gUra R0VH6FHtQajGCN7cRfvEbzSybZQOY5pPYAfUtouewnH44AW1FWqsaz1w1 9Dd1WLYvo+2Ena0FDNLSqtH+ATeQVsUibNulzxmRSYfP34qD1uV4In3t2 nxb3ooCCeulD6kVl3yTBVlJWTVb5/igFVPDm20kEKv6VNUqNI/YGaBc0d LZR0P84C/9V5ZapsiK+kKxCvFqCLdXy9MXTvumvR3M0GQGrT3MuYPcj+T Y9g/LsFCHliWvPHCZglOAJbxUdZv2e86gQisC3zET+QFFVg8zooNGFvp7 Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="357117806" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="357117806" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Jun 2023 22:32:42 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10739"; a="776671625" X-IronPort-AV: E=Sophos;i="6.00,238,1681196400"; d="scan'208";a="776671625" Received: from feng-clx.sh.intel.com ([10.238.200.228]) by fmsmga008.fm.intel.com with ESMTP; 12 Jun 2023 22:32:39 -0700 From: Feng Tang <feng.tang@intel.com> To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, "H . Peter Anvin" <hpa@zytor.com>, Peter Zijlstra <peterz@infradead.org>, David Woodhouse <dwmw@amazon.co.uk>, "Paul E . McKenney" <paulmck@kernel.org>, x86@kernel.org, linux-kernel@vger.kernel.org Cc: rui.zhang@intel.com, tim.c.chen@intel.com, Feng Tang <feng.tang@intel.com> Subject: [Patch v2 2/2] x86/tsc: use logical_packages as a better estimation of socket numbers Date: Tue, 13 Jun 2023 13:25:23 +0800 Message-Id: <20230613052523.1106821-2-feng.tang@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230613052523.1106821-1-feng.tang@intel.com> References: <20230613052523.1106821-1-feng.tang@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1768565590715847653?= X-GMAIL-MSGID: =?utf-8?q?1768565590715847653?= |
Series |
[v2,1/2] smp: Add helper function to mark possible bad package number
|
|
Commit Message
Feng Tang
June 13, 2023, 5:25 a.m. UTC
Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
on qualified platorms") was introduced to solve problem that
sometimes TSC clocksource is wrongly judged as unstable by watchdog
like 'jiffies', HPET, etc.
In it, the hardware socket number is a key factor for judging whether
to disable the watchdog for TSC, and 'nr_online_nodes' was chosen as
an estimation due to it is needed in early boot phase before
registering 'tsc-early' clocksource, where all none-boot CPUs are not
brought up yet.
In recent patch review, Dave and Rui pointed out there are many cases
in which 'nr_online_nodes' is not accurate, like:
* numa emulation (numa=fake=4 etc.)
* numa=off
* platforms with CPU-less HBM nodes, CPU-less Optane memory nodes.
* SNC (sub-numa cluster) mode enabled
* 'nr_cpus=' and 'maxcpus=' kernel cmdline parameter setup
Peter Zijlstra suggested 'logical_packages', and it seems to be the
best option we have as it covers all the cases above except the
last one. But it is only usable after smp_init() and all CPUs are
brought up, while 'tsc-early' is initialized before that.
One solution is to skip the watchdog for 'tsc-early' clocksource,
and move the check after smp_init(), while before 'tsc' clocksource
is registered, where 'logical_packages' could be used.
For 'nr_cpus' and 'maxcpus' cmdline case, the global flag
'package_count_unreliable' will be set to true and the watchdog
will be kept as is.
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
Changelog:
since v1:
* Handle the 'maxcpus=' and 'nr_cpus=' cmdline parameter cases,
by keeping watchdog as the package number is unreliable (Dave Hansen)
since RFC:
* use 'logical_packages' instead of topology_max_packages(), whose
implementaion is not accurate, like for heterogeneous systems
which have combination of Core/Atom CPUs like Alderlake (Dave Hansen)
arch/x86/include/asm/topology.h | 3 +++
arch/x86/kernel/smpboot.c | 2 +-
arch/x86/kernel/tsc.c | 43 +++++++++++++--------------------
3 files changed, 21 insertions(+), 27 deletions(-)
Comments
On Tue, Jun 13, 2023 at 01:25:23PM +0800, Feng Tang wrote: > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC > on qualified platorms") was introduced to solve problem that > sometimes TSC clocksource is wrongly judged as unstable by watchdog > like 'jiffies', HPET, etc. > > In it, the hardware socket number is a key factor for judging whether > to disable the watchdog for TSC, and 'nr_online_nodes' was chosen as > an estimation due to it is needed in early boot phase before > registering 'tsc-early' clocksource, where all none-boot CPUs are not > brought up yet. > > In recent patch review, Dave and Rui pointed out there are many cases > in which 'nr_online_nodes' is not accurate, like: > > * numa emulation (numa=fake=4 etc.) > * numa=off > * platforms with CPU-less HBM nodes, CPU-less Optane memory nodes. > * SNC (sub-numa cluster) mode enabled > * 'nr_cpus=' and 'maxcpus=' kernel cmdline parameter setup > > Peter Zijlstra suggested 'logical_packages', and it seems to be the > best option we have as it covers all the cases above except the > last one. But it is only usable after smp_init() and all CPUs are > brought up, while 'tsc-early' is initialized before that. > > One solution is to skip the watchdog for 'tsc-early' clocksource, > and move the check after smp_init(), while before 'tsc' clocksource > is registered, where 'logical_packages' could be used. > > For 'nr_cpus' and 'maxcpus' cmdline case, the global flag > 'package_count_unreliable' will be set to true and the watchdog > will be kept as is. So I have at least two machines where I boot with 'possible_cpus=#' because the BIOS MADT is reporting a stupid number of CPUs that aren't actually there. So I think I'm lucky and side-stepped this nonsense, but if someone were to use nr_cpus= for this same purpose, they get screwed over and get the watchdog. Sad day for them I suppose. I realize there might not be a perfect solution, but I'm also struggling to see the value of the whole package_count_unreliable thing.
On Thu, 2023-06-15 at 11:20 +0200, Peter Zijlstra wrote: > On Tue, Jun 13, 2023 at 01:25:23PM +0800, Feng Tang wrote: > > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC > > on qualified platorms") was introduced to solve problem that > > sometimes TSC clocksource is wrongly judged as unstable by watchdog > > like 'jiffies', HPET, etc. > > > > In it, the hardware socket number is a key factor for judging > > whether > > to disable the watchdog for TSC, and 'nr_online_nodes' was chosen > > as > > an estimation due to it is needed in early boot phase before > > registering 'tsc-early' clocksource, where all none-boot CPUs are > > not > > brought up yet. > > > > In recent patch review, Dave and Rui pointed out there are many > > cases > > in which 'nr_online_nodes' is not accurate, like: > > > > * numa emulation (numa=fake=4 etc.) > > * numa=off > > * platforms with CPU-less HBM nodes, CPU-less Optane memory nodes. > > * SNC (sub-numa cluster) mode enabled > > * 'nr_cpus=' and 'maxcpus=' kernel cmdline parameter setup > > > > Peter Zijlstra suggested 'logical_packages', and it seems to be the > > best option we have as it covers all the cases above except the > > last one. But it is only usable after smp_init() and all CPUs are > > brought up, while 'tsc-early' is initialized before that. > > > > One solution is to skip the watchdog for 'tsc-early' clocksource, > > and move the check after smp_init(), while before 'tsc' clocksource > > is registered, where 'logical_packages' could be used. > > > > For 'nr_cpus' and 'maxcpus' cmdline case, the global flag > > 'package_count_unreliable' will be set to true and the watchdog > > will be kept as is. > > So I have at least two machines where I boot with 'possible_cpus=#' > because the BIOS MADT is reporting a stupid number of CPUs that > aren't > actually there. Does the MADT report those CPUs as disabled but online capable? can you send me a copy of the acpidmp? I had a patch to parse MADT and count the number of physical packages by decoding all the valid APICIDs in MADT. I'm wondering if the patch still works on this machine. > > So I think I'm lucky and side-stepped this nonsense, but if someone > were > to use nr_cpus= for this same purpose, they get screwed over and get > the > watchdog. Sad day for them I suppose. what if using package_count_from_MADT? thanks, rui
On Thu, Jun 15, 2023 at 11:20:21AM +0200, Peter Zijlstra wrote: > On Tue, Jun 13, 2023 at 01:25:23PM +0800, Feng Tang wrote: > > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC > > on qualified platorms") was introduced to solve problem that > > sometimes TSC clocksource is wrongly judged as unstable by watchdog > > like 'jiffies', HPET, etc. > > > > In it, the hardware socket number is a key factor for judging whether > > to disable the watchdog for TSC, and 'nr_online_nodes' was chosen as > > an estimation due to it is needed in early boot phase before > > registering 'tsc-early' clocksource, where all none-boot CPUs are not > > brought up yet. > > > > In recent patch review, Dave and Rui pointed out there are many cases > > in which 'nr_online_nodes' is not accurate, like: > > > > * numa emulation (numa=fake=4 etc.) > > * numa=off > > * platforms with CPU-less HBM nodes, CPU-less Optane memory nodes. > > * SNC (sub-numa cluster) mode enabled > > * 'nr_cpus=' and 'maxcpus=' kernel cmdline parameter setup > > > > Peter Zijlstra suggested 'logical_packages', and it seems to be the > > best option we have as it covers all the cases above except the > > last one. But it is only usable after smp_init() and all CPUs are > > brought up, while 'tsc-early' is initialized before that. > > > > One solution is to skip the watchdog for 'tsc-early' clocksource, > > and move the check after smp_init(), while before 'tsc' clocksource > > is registered, where 'logical_packages' could be used. > > > > For 'nr_cpus' and 'maxcpus' cmdline case, the global flag > > 'package_count_unreliable' will be set to true and the watchdog > > will be kept as is. > > So I have at least two machines where I boot with 'possible_cpus=#' > because the BIOS MADT is reporting a stupid number of CPUs that aren't > actually there. > > So I think I'm lucky and side-stepped this nonsense, but if someone were > to use nr_cpus= for this same purpose, they get screwed over and get the > watchdog. Sad day for them I suppose. Thanks for sharing the info! Now I know one reason why those cmdline parameters were created. > I realize there might not be a perfect solution, Yes. Rui is working on a MADT based parsing which may take a while before being stable, given all kinds of fancy firmware out there. > but I'm also struggling > to see the value of the whole package_count_unreliable thing. 'possible_cpus' and 'nr_cpus' parameters are a little different from "maxcpus' that they limit the possible cpus during boot, and after boot user has no ways to online other CPUs beyond that limit. But for 'maxcpus' case, user can online a small number of CPU during boot and onlined all CPUs later on, which has a possible under estimation issue for package number, while the above 2 don't have. So how about only keeping the 'package_count_unreliable' check for 'maxcpus' case? Thanks, Feng
On Fri, Jun 16, 2023 at 06:53:21AM +0000, Zhang, Rui wrote: > On Thu, 2023-06-15 at 11:20 +0200, Peter Zijlstra wrote: > > So I have at least two machines where I boot with 'possible_cpus=#' > > because the BIOS MADT is reporting a stupid number of CPUs that > > aren't > > actually there. > > Does the MADT report those CPUs as disabled but online capable? > can you send me a copy of the acpidmp? Sent privately, it's a bit big. > I had a patch to parse MADT and count the number of physical packages > by decoding all the valid APICIDs in MADT. > I'm wondering if the patch still works on this machine. I can certainly give it a spin; it has IPMI serial-over-ethernet that works. Brilliant dev machine. > > So I think I'm lucky and side-stepped this nonsense, but if someone > > were > > to use nr_cpus= for this same purpose, they get screwed over and get > > the > > watchdog. Sad day for them I suppose. > > what if using package_count_from_MADT? So I'm thinking that if you cap possible_mask the actual logical packages is the right number. Suppose you have a machine with 8 sockets, but limit possible_mask to only 1 socket. Then TSC will actually be stable, it doesn't matter you have 7 idle sockets that are not synchronized. Then again, perhaps if you limit it to 2 sockets you're still in trouble, I'm not entirely sure how the TSC sync stuff comes apart on these large systems.
On Fri, Jun 16, 2023 at 10:02:31AM +0200, Peter Zijlstra wrote: > On Fri, Jun 16, 2023 at 06:53:21AM +0000, Zhang, Rui wrote: > > On Thu, 2023-06-15 at 11:20 +0200, Peter Zijlstra wrote: > > > > So I have at least two machines where I boot with 'possible_cpus=#' > > > because the BIOS MADT is reporting a stupid number of CPUs that > > > aren't > > > actually there. > > > > Does the MADT report those CPUs as disabled but online capable? > > can you send me a copy of the acpidmp? > > Sent privately, it's a bit big. So if I remove 'possible_cpus=40' it does crazy shit like this: [ 1.268447] setup_percpu: NR_CPUS:512 nr_cpumask_bits:160 nr_cpu_ids:160 nr_node_ids:2 [ 1.303567] pcpu-alloc: [0] 000 001 002 003 004 005 006 007 [ 1.309871] pcpu-alloc: [0] 008 009 020 021 022 023 024 025 [ 1.316172] pcpu-alloc: [0] 026 027 028 029 040 042 044 046 [ 1.322475] pcpu-alloc: [0] 048 050 052 054 056 058 060 062 [ 1.328777] pcpu-alloc: [0] 064 066 068 070 072 074 076 078 [ 1.335084] pcpu-alloc: [0] 080 082 084 086 088 090 092 094 [ 1.341387] pcpu-alloc: [0] 096 098 100 102 104 106 108 110 [ 1.347688] pcpu-alloc: [0] 112 114 116 118 120 122 124 126 [ 1.353992] pcpu-alloc: [0] 128 130 132 134 136 138 140 142 [ 1.360293] pcpu-alloc: [0] 144 146 148 150 152 154 156 158 [ 1.366596] pcpu-alloc: [1] 010 011 012 013 014 015 016 017 [ 1.372900] pcpu-alloc: [1] 018 019 030 031 032 033 034 035 [ 1.379201] pcpu-alloc: [1] 036 037 038 039 041 043 045 047 [ 1.385504] pcpu-alloc: [1] 049 051 053 055 057 059 061 063 [ 1.391806] pcpu-alloc: [1] 065 067 069 071 073 075 077 079 [ 1.398109] pcpu-alloc: [1] 081 083 085 087 089 091 093 095 [ 1.404411] pcpu-alloc: [1] 097 099 101 103 105 107 109 111 [ 1.410714] pcpu-alloc: [1] 113 115 117 119 121 123 125 127 [ 1.417016] pcpu-alloc: [1] 129 131 133 135 137 139 141 143 [ 1.423319] pcpu-alloc: [1] 145 147 149 151 153 155 157 159 [ 2.110382] smp: Bringing up secondary CPUs ... [ 2.112255] x86: Booting SMP configuration: [ 2.113253] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 [ 2.221253] .... node #1, CPUs: #10 [ 0.163522] smpboot: CPU 10 Converting physical 0 to logical die 1 [ 2.337372] #11 #12 #13 #14 #15 #16 #17 #18 #19 [ 2.504253] .... node #0, CPUs: #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 [ 2.563253] .... node #1, CPUs: #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 [ 2.662321] smp: Brought up 2 nodes, 40 CPUs [ 2.664257] smpboot: Max logical packages: 8 It is an IVB-EP with *2* sockets, 10 cores and SMT, 40 is right, 160 is quite insane.
On Fri, Jun 16, 2023 at 10:02:31AM +0200, Peter Zijlstra wrote: > I can certainly give it a spin; it has IPMI serial-over-ethernet that > works. Brilliant dev machine. FWIW, I feel it should be illegal to sell a computer without working IPMI support. That Intel vPro shit is horrible. MeshCommander sorta works, but its horrific crap. </rant>
On Fri, 2023-06-16 at 10:10 +0200, Peter Zijlstra wrote: > On Fri, Jun 16, 2023 at 10:02:31AM +0200, Peter Zijlstra wrote: > > On Fri, Jun 16, 2023 at 06:53:21AM +0000, Zhang, Rui wrote: > > > On Thu, 2023-06-15 at 11:20 +0200, Peter Zijlstra wrote: > > > > > > So I have at least two machines where I boot with > > > > 'possible_cpus=#' > > > > because the BIOS MADT is reporting a stupid number of CPUs that > > > > aren't > > > > actually there. > > > > > > Does the MADT report those CPUs as disabled but online capable? > > > can you send me a copy of the acpidmp? > > > > Sent privately, it's a bit big. > > So if I remove 'possible_cpus=40' it does crazy shit like this: > > [ 1.268447] setup_percpu: NR_CPUS:512 nr_cpumask_bits:160 > nr_cpu_ids:160 nr_node_ids:2 > > [ 1.303567] pcpu-alloc: [0] 000 001 002 003 004 005 006 007 > [ 1.309871] pcpu-alloc: [0] 008 009 020 021 022 023 024 025 > [ 1.316172] pcpu-alloc: [0] 026 027 028 029 040 042 044 046 > [ 1.322475] pcpu-alloc: [0] 048 050 052 054 056 058 060 062 > [ 1.328777] pcpu-alloc: [0] 064 066 068 070 072 074 076 078 > [ 1.335084] pcpu-alloc: [0] 080 082 084 086 088 090 092 094 > [ 1.341387] pcpu-alloc: [0] 096 098 100 102 104 106 108 110 > [ 1.347688] pcpu-alloc: [0] 112 114 116 118 120 122 124 126 > [ 1.353992] pcpu-alloc: [0] 128 130 132 134 136 138 140 142 > [ 1.360293] pcpu-alloc: [0] 144 146 148 150 152 154 156 158 > [ 1.366596] pcpu-alloc: [1] 010 011 012 013 014 015 016 017 > [ 1.372900] pcpu-alloc: [1] 018 019 030 031 032 033 034 035 > [ 1.379201] pcpu-alloc: [1] 036 037 038 039 041 043 045 047 > [ 1.385504] pcpu-alloc: [1] 049 051 053 055 057 059 061 063 > [ 1.391806] pcpu-alloc: [1] 065 067 069 071 073 075 077 079 > [ 1.398109] pcpu-alloc: [1] 081 083 085 087 089 091 093 095 > [ 1.404411] pcpu-alloc: [1] 097 099 101 103 105 107 109 111 > [ 1.410714] pcpu-alloc: [1] 113 115 117 119 121 123 125 127 > [ 1.417016] pcpu-alloc: [1] 129 131 133 135 137 139 141 143 > [ 1.423319] pcpu-alloc: [1] 145 147 149 151 153 155 157 159 > > [ 2.110382] smp: Bringing up secondary CPUs ... > [ 2.112255] x86: Booting SMP configuration: > [ 2.113253] .... node #0, CPUs: #1 #2 #3 #4 #5 > #6 #7 #8 #9 > [ 2.221253] .... node #1, CPUs: #10 > [ 0.163522] smpboot: CPU 10 Converting physical 0 to logical die 1 > [ 2.337372] #11 #12 #13 #14 #15 #16 #17 #18 #19 > [ 2.504253] .... node #0, CPUs: #20 #21 #22 #23 #24 #25 > #26 #27 #28 #29 > [ 2.563253] .... node #1, CPUs: #30 #31 #32 #33 #34 #35 > #36 #37 #38 #39 > [ 2.662321] smp: Brought up 2 nodes, 40 CPUs > [ 2.664257] smpboot: Max logical packages: 8 > > It is an IVB-EP with *2* sockets, 10 cores and SMT, 40 is right, 160 > is > quite insane. According to the MADT, there are indeed 40 valid CPUs. And then 80 CPUs with APIC ID : FF enabled : 0 Online capable : 0 a dumb question, why are these CPUs added into the possible_mask? I can dig into this later but I just don't have a quick answer at the moment. thanks, rui
On Fri, Jun 16, 2023 at 09:19:18AM +0000, Zhang, Rui wrote: > According to the MADT, there are indeed 40 valid CPUs. And then 80 CPUs > with > > APIC ID : FF > enabled : 0 > Online capable : 0 > > a dumb question, why are these CPUs added into the possible_mask? > I can dig into this later but I just don't have a quick answer at the > moment. I really don't know.. I've not gotten around to reading that part of the x86 code yet.
On Fri, 2023-06-16 at 11:42 +0200, Peter Zijlstra wrote: > On Fri, Jun 16, 2023 at 09:19:18AM +0000, Zhang, Rui wrote: > > > According to the MADT, there are indeed 40 valid CPUs. And then 80 > > CPUs > > with > > > > APIC ID : FF > > enabled : 0 > > Online capable : 0 > > > > a dumb question, why are these CPUs added into the possible_mask? > > I can dig into this later but I just don't have a quick answer at > > the > > moment. > > I really don't know.. I've not gotten around to reading that part of > the > x86 code yet. > > I did a double check. The MADT is composed of 1. 40 valid LAPIC entries. 2. 80 invalid LAPIC entries with APIC ID : FF Enabled : 0 Online capable: 0 I'm mot sure why "Online capable" is decoded because this new bit is introduced in ACPI 6.3. Maybe a problem in the acpica tool? These entries are ignored because of the invalid APIC ID. 3. 120 x2APIC entries with APIC ID : valid value Enabled : 0 As "Online capable bit" is not supported, these 120 x2APIC entries are counted as possible CPUs. That is why we got 160 possible CPUs. thanks, rui
On Fri, Jun 16, 2023 at 07:23:12PM +0800, Zhang, Rui wrote: > On Fri, 2023-06-16 at 11:42 +0200, Peter Zijlstra wrote: > > On Fri, Jun 16, 2023 at 09:19:18AM +0000, Zhang, Rui wrote: > > > > > According to the MADT, there are indeed 40 valid CPUs. And then 80 > > > CPUs > > > with > > > > > > APIC ID : FF > > > enabled : 0 > > > Online capable : 0 > > > > > > a dumb question, why are these CPUs added into the possible_mask? > > > I can dig into this later but I just don't have a quick answer at > > > the > > > moment. > > > > I really don't know.. I've not gotten around to reading that part of > > the > > x86 code yet. > > > > > I did a double check. > > The MADT is composed of > > 1. 40 valid LAPIC entries. > 2. 80 invalid LAPIC entries with > APIC ID : FF > Enabled : 0 > Online capable: 0 > I'm mot sure why "Online capable" is decoded because this new bit is > introduced in ACPI 6.3. Maybe a problem in the acpica tool? > These entries are ignored because of the invalid APIC ID. > 3. 120 x2APIC entries with > APIC ID : valid value > Enabled : 0 > As "Online capable bit" is not supported, these 120 x2APIC entries > are counted as possible CPUs. Nice shot! So IIUC, this is a firmware bug, and deserves a warning or error message? And without 'possible_cpus' or 'nr_cpus' parameter, system will waste quite some memory due to "nr_cpu_ids == 160". From Peter's log: [ 2.664257] smpboot: Max logical packages: 8 it also revealed again the problem in 'calculate_max_logical_packages()': " ncpus = cpu_data(0).booted_cores * topology_max_smt_threads(); __max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus); pr_info("Max logical packages: %u\n", __max_logical_packages); " But 'logical_packages' should still be correct in this case. Thanks, Feng > That is why we got 160 possible CPUs. > > thanks, > rui > > >
On Fri, Jun 16, 2023 at 10:02:31AM +0200, Peter Zijlstra wrote: > On Fri, Jun 16, 2023 at 06:53:21AM +0000, Zhang, Rui wrote: > > On Thu, 2023-06-15 at 11:20 +0200, Peter Zijlstra wrote: > > > > So I have at least two machines where I boot with 'possible_cpus=#' > > > because the BIOS MADT is reporting a stupid number of CPUs that > > > aren't > > > actually there. > > > > Does the MADT report those CPUs as disabled but online capable? > > can you send me a copy of the acpidmp? > > Sent privately, it's a bit big. > > > I had a patch to parse MADT and count the number of physical packages > > by decoding all the valid APICIDs in MADT. > > I'm wondering if the patch still works on this machine. > > I can certainly give it a spin; it has IPMI serial-over-ethernet that > works. Brilliant dev machine. > > > > So I think I'm lucky and side-stepped this nonsense, but if someone > > > were > > > to use nr_cpus= for this same purpose, they get screwed over and get > > > the > > > watchdog. Sad day for them I suppose. > > > > what if using package_count_from_MADT? > > So I'm thinking that if you cap possible_mask the actual logical > packages is the right number. > > Suppose you have a machine with 8 sockets, but limit possible_mask to > only 1 socket. Then TSC will actually be stable, it doesn't matter you > have 7 idle sockets that are not synchronized. > > Then again, perhaps if you limit it to 2 sockets you're still in > trouble, I'm not entirely sure how the TSC sync stuff comes apart on > these large systems. I had the similar thought. For this case, the defensive way is to keep the watchdog for 'nr_cpus=' and 'possible_cpus=' setup, and if the specific setup has no TSC sync issue, people can add one more parameter 'tsc=reliable' to skip the watchdog, while aggressive way is to ignore the 2 cmdline parameters as the above case is really rare. Again, as you mentioned, I can't find a perfect solution to cover all kinds of setup and broken firmware. But at least 'logical_packages' is much better than 'nr_online_nodes' :) Thanks, Feng
On Fri, Jun 16 2023 at 15:18, Feng Tang wrote: > On Thu, Jun 15, 2023 at 11:20:21AM +0200, Peter Zijlstra wrote: > Yes. Rui is working on a MADT based parsing which may take a while > before being stable, given all kinds of fancy firmware out there. Please not yet another mad table parser. The topology can be evaluated during early boot via: 1) The APIC IDs of the possible CPUs. 2) CPUID leaf 0xb or 0x1f where the topmost subleaf gives the number of bits to shift the APIC ID right for the package/socket Trying to accomodate for anything else than the documented enumeration is crazy. If fancy firmware is broken then they can keep the pieces. So something like the below should just work. I fundamentally hate the hackery in topology.c, but cleaning this mess up is a completely different problem and already worked on. Thanks, tglx --- --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -509,9 +509,12 @@ extern int default_check_phys_apicid_pre #ifdef CONFIG_SMP bool apic_id_is_primary_thread(unsigned int id); void apic_smt_update(void); +extern unsigned int apic_to_pkg_shift; +bool logical_packages_update(u32 apicid); #else static inline bool apic_id_is_primary_thread(unsigned int id) { return false; } static inline void apic_smt_update(void) { } +static inline bool logical_packages_update(u32 apicid) { return true; } #endif struct msi_msg; --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -177,6 +177,9 @@ static int acpi_register_lapic(int id, u return -EINVAL; } + if (!logical_packages_update(acpiid)) + return -EINVAL; + if (!enabled) { ++disabled_cpus; return -EINVAL; --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -692,6 +692,8 @@ static void early_init_amd(struct cpuinf } } + detect_extended_topology_early(c); + if (cpu_has(c, X86_FEATURE_TOPOEXT)) smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & 0xff) + 1; } --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -29,6 +29,8 @@ unsigned int __max_die_per_package __rea EXPORT_SYMBOL(__max_die_per_package); #ifdef CONFIG_SMP +unsigned int apic_to_pkg_shift __ro_after_init; + /* * Check if given CPUID extended topology "leaf" is implemented */ @@ -66,7 +68,7 @@ int detect_extended_topology_early(struc { #ifdef CONFIG_SMP unsigned int eax, ebx, ecx, edx; - int leaf; + int leaf, subleaf; leaf = detect_extended_topology_leaf(c); if (leaf < 0) @@ -80,6 +82,14 @@ int detect_extended_topology_early(struc */ c->initial_apicid = edx; smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx)); + + for (subleaf = 1; subleaf < 8; subleaf++) { + cpuid_count(leaf, subleaf, &eax, &ebx, &ecx, &edx); + + if (ebx == 0 || !LEAFB_SUBTYPE(ecx)) + break; + apic_to_pkg_shift = BITS_SHIFT_NEXT_LEVEL(eax); + } #endif return 0; } --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1501,17 +1501,50 @@ void __init native_smp_prepare_boot_cpu( native_pv_lock_init(); } +bool logical_packages_update(u32 apicid) +{ + unsigned int pkg; + + if (!apic_to_pkg_shift) + return true; + + pkg = (apicid >> apic_to_pkg_shift) + 1; + if (pkg <= __max_logical_packages) + return true; + + if (system_state == SYSTEM_BOOTING) { + __max_logical_packages = pkg; + return true; + } + + pr_err("Physical hotplug APICID %x package %u > max logical packages %u\n", + apicid, pkg, __max_logical_packages); + return false; +} + void __init calculate_max_logical_packages(void) { - int ncpus; + unsigned int ncpus, npkg; /* * Today neither Intel nor AMD support heterogeneous systems so * extrapolate the boot cpu's data to all packages. */ ncpus = cpu_data(0).booted_cores * topology_max_smt_threads(); - __max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus); - pr_info("Max logical packages: %u\n", __max_logical_packages); + npkg = DIV_ROUND_UP(total_cpus, ncpus); + + /* Did logical_packages_update() set up __max_logical_packages? */ + if (!__max_logical_packages) { + __max_logical_packages = npkg; + } else { + pr_info("Max logical packages ACPI enumeration: %u\n", + __max_logical_packages); + if (npkg <= __max_logical_packages) + return; + __max_logical_packages = npkg; + } + + pr_info("Max logical packages estimated: %u\n", __max_logical_packages); } void __init native_smp_cpus_done(unsigned int max_cpus)
On Thu, Jun 22 2023 at 16:27, Thomas Gleixner wrote: > On Fri, Jun 16 2023 at 15:18, Feng Tang wrote: > So something like the below should just work. Well it works in principle, but does not take any of the command line parameters which limit nr_possible CPUs or the actual kernel configuration into account. But the principle itself works correctly. Below is an updated version, which takes them into account. The data here is from a two socket system with 32 CPUs per socket. No command line parameters (NR_CPUS=64): smpboot: Allowing 64 CPUs, 32 hotplug CPUs clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e3306b9ada, max_idle_ns: 440795224413 ns smp: Brought up 1 node, 32 CPUs smpboot: Max logical packages ACPI enumeration: 2 "possible_cpus=32" (NR_CPUS=64) or No command line parameter (NR_CPUS=32): smpboot: Allowing 32 CPUs, 0 hotplug CPUs clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e3306b9ada, max_idle_ns: 440795224413 ns smp: Brought up 1 node, 32 CPUs smpboot: Max logical packages ACPI enumeration: 1 maxcpus=32 smpboot: Allowing 64 CPUs, 0 hotplug CPUs clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e3306b9ada, max_idle_ns: 440795224413 ns smp: Brought up 1 node, 32 CPUs smpboot: Max logical packages ACPI enumeration: 2 But that's really all we should do. If the ACPI table enumerates CPUs as hotpluggable which can never arrive, then so be it. We have enough parameters to override the BIOS nonsense. Trying to do more magic MAD table parsing with heuristics is just wrong. We already have way too many heuristics and workarounds for broken firmware, but for the problem at hand, we really don't need more. The only systems I observed so far which have a non-sensical amount of "hotpluggable" CPUs are high-end server machines. It's a resonable expectation that machines with high-end price tags come with correct firmware. Trying to work around that (except with the existing command line options) is just proliferating this mess. This has to stop. Thanks, tglx --- --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -509,9 +509,12 @@ extern int default_check_phys_apicid_pre #ifdef CONFIG_SMP bool apic_id_is_primary_thread(unsigned int id); void apic_smt_update(void); +extern unsigned int apic_to_pkg_shift; +void logical_packages_update(u32 apicid, bool enabled); #else static inline bool apic_id_is_primary_thread(unsigned int id) { return false; } static inline void apic_smt_update(void) { } +static inline void logical_packages_update(u32 apicid, bool enabled) { } #endif struct msi_msg; --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -178,6 +178,7 @@ static int acpi_register_lapic(int id, u } if (!enabled) { + logical_packages_update(acpiid, false); ++disabled_cpus; return -EINVAL; } @@ -189,6 +190,8 @@ static int acpi_register_lapic(int id, u if (cpu >= 0) early_per_cpu(x86_cpu_to_acpiid, cpu) = acpiid; + logical_packages_update(acpiid, cpu >= 0); + return cpu; } --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -692,6 +692,8 @@ static void early_init_amd(struct cpuinf } } + detect_extended_topology_early(c); + if (cpu_has(c, X86_FEATURE_TOPOEXT)) smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & 0xff) + 1; } --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -29,6 +29,8 @@ unsigned int __max_die_per_package __rea EXPORT_SYMBOL(__max_die_per_package); #ifdef CONFIG_SMP +unsigned int apic_to_pkg_shift __ro_after_init; + /* * Check if given CPUID extended topology "leaf" is implemented */ @@ -66,7 +68,7 @@ int detect_extended_topology_early(struc { #ifdef CONFIG_SMP unsigned int eax, ebx, ecx, edx; - int leaf; + int leaf, subleaf; leaf = detect_extended_topology_leaf(c); if (leaf < 0) @@ -80,6 +82,14 @@ int detect_extended_topology_early(struc */ c->initial_apicid = edx; smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx)); + + for (subleaf = 1; subleaf < 8; subleaf++) { + cpuid_count(leaf, subleaf, &eax, &ebx, &ecx, &edx); + + if (ebx == 0 || !LEAFB_SUBTYPE(ecx)) + break; + apic_to_pkg_shift = BITS_SHIFT_NEXT_LEVEL(eax); + } #endif return 0; } --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1501,17 +1501,91 @@ void __init native_smp_prepare_boot_cpu( native_pv_lock_init(); } +struct logical_pkg { + unsigned int enabled_cpus; + unsigned int disabled_cpus; +}; + +/* + * Needs to be size of NR_CPUS because virt allows to create the weirdest + * topologies just because it can. + */ +static struct logical_pkg logical_pkgs[NR_CPUS] __refdata; + +void logical_packages_update(u32 apicid, bool enabled) +{ + struct logical_pkg *lp; + unsigned int pkg; + + if (!apic_to_pkg_shift || system_state != SYSTEM_BOOTING) + return; + + pkg = (apicid >> apic_to_pkg_shift); + + lp = logical_pkgs + pkg; + if (enabled) + lp->enabled_cpus++; + else + lp->disabled_cpus++; + + if (++pkg > __max_logical_packages) + __max_logical_packages = pkg; +} + +static void __init logical_packages_finish_setup(unsigned int possible) +{ + unsigned int pkg, maxpkg = 0, maxcpus = 0; + + if (!apic_to_pkg_shift) + return; + + /* Scan the enabled CPUs first */ + for (pkg = 0; pkg < __max_logical_packages; pkg++) { + if (!logical_pkgs[pkg].enabled_cpus) + continue; + + maxpkg++; + maxcpus += logical_pkgs[pkg].enabled_cpus; + + if (maxcpus >= possible) { + __max_logical_packages = maxpkg; + return; + } + } + + /* There is still room, scan for disabled CPUs */ + for (pkg = 0; pkg < __max_logical_packages; pkg++) { + if (logical_pkgs[pkg].enabled_cpus || !logical_pkgs[pkg].disabled_cpus) + continue; + + maxpkg++; + maxcpus += logical_pkgs[pkg].disabled_cpus; + + if (maxcpus >= possible) + break; + } + + __max_logical_packages = maxpkg; +} + void __init calculate_max_logical_packages(void) { int ncpus; + if (__max_logical_packages) { + pr_info("Max logical packages ACPI enumeration: %u\n", + __max_logical_packages); + return; + } + /* * Today neither Intel nor AMD support heterogeneous systems so * extrapolate the boot cpu's data to all packages. */ ncpus = cpu_data(0).booted_cores * topology_max_smt_threads(); __max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus); - pr_info("Max logical packages: %u\n", __max_logical_packages); + + pr_info("Max logical packages estimated: %u\n", __max_logical_packages); } void __init native_smp_cpus_done(unsigned int max_cpus) @@ -1619,6 +1693,8 @@ early_param("possible_cpus", _setup_poss for (i = 0; i < possible; i++) set_cpu_possible(i, true); + + logical_packages_finish_setup(possible); } #ifdef CONFIG_HOTPLUG_CPU
Hi, Thomas, On Thu, 2023-06-22 at 16:27 +0200, Thomas Gleixner wrote: > On Fri, Jun 16 2023 at 15:18, Feng Tang wrote: > > On Thu, Jun 15, 2023 at 11:20:21AM +0200, Peter Zijlstra wrote: > > Yes. Rui is working on a MADT based parsing which may take a while > > before being stable, given all kinds of fancy firmware out there. > > Please not yet another mad table parser. > > The topology can be evaluated during early boot via: > > 1) The APIC IDs of the possible CPUs. > > 2) CPUID leaf 0xb or 0x1f where the topmost subleaf gives the > number > of bits to shift the APIC ID right for the package/socket > exactly the same in my proposal. The difference is that I also 1. get the bitshift of the core id and count the number of cores in a package. On Intel hybrid platforms, using nr_package_cpus / nr_core_cpus to get the number of cores in a package (x86_max_cores) is broken. e.g. for a 6Pcore + 8Ecore system, package has 20 CPUs and 14 cores. 2. get the maximum number of SMT siblings in each core to set correct smp_num_siblings. On future hybrid platforms, it is possible that Ecore is used as boot CPU. This could result in smp_num_siblings set to 1 during boot cpu startup, and cpu_smt_control set to CPU_SMT_NOT_SUPPORTED. > Trying to accomodate for anything else than the documented > enumeration > is crazy. If fancy firmware is broken then they can keep the pieces. > > So something like the below should just work. > > I fundamentally hate the hackery in topology.c, but cleaning this > mess > up is a completely different problem and already worked on. > > Thanks, > > tglx > --- > --- a/arch/x86/include/asm/apic.h > +++ b/arch/x86/include/asm/apic.h > @@ -509,9 +509,12 @@ extern int default_check_phys_apicid_pre > #ifdef CONFIG_SMP > bool apic_id_is_primary_thread(unsigned int id); > void apic_smt_update(void); > +extern unsigned int apic_to_pkg_shift; > +bool logical_packages_update(u32 apicid); > #else > static inline bool apic_id_is_primary_thread(unsigned int id) { > return false; } > static inline void apic_smt_update(void) { } > +static inline bool logical_packages_update(u32 apicid) { return > true; } > #endif > > struct msi_msg; > --- a/arch/x86/kernel/acpi/boot.c > +++ b/arch/x86/kernel/acpi/boot.c > @@ -177,6 +177,9 @@ static int acpi_register_lapic(int id, u > return -EINVAL; > } > > + if (!logical_packages_update(acpiid)) > + return -EINVAL; > + > if (!enabled) { > ++disabled_cpus; > return -EINVAL; > --- a/arch/x86/kernel/cpu/amd.c > +++ b/arch/x86/kernel/cpu/amd.c > @@ -692,6 +692,8 @@ static void early_init_amd(struct cpuinf > } > } > > + detect_extended_topology_early(c); > + > if (cpu_has(c, X86_FEATURE_TOPOEXT)) > smp_num_siblings = ((cpuid_ebx(0x8000001e) >> 8) & > 0xff) + 1; > } > --- a/arch/x86/kernel/cpu/topology.c > +++ b/arch/x86/kernel/cpu/topology.c > @@ -29,6 +29,8 @@ unsigned int __max_die_per_package __rea > EXPORT_SYMBOL(__max_die_per_package); > > #ifdef CONFIG_SMP > +unsigned int apic_to_pkg_shift __ro_after_init; > + > /* > * Check if given CPUID extended topology "leaf" is implemented > */ > @@ -66,7 +68,7 @@ int detect_extended_topology_early(struc > { > #ifdef CONFIG_SMP > unsigned int eax, ebx, ecx, edx; > - int leaf; > + int leaf, subleaf; > > leaf = detect_extended_topology_leaf(c); > if (leaf < 0) > @@ -80,6 +82,14 @@ int detect_extended_topology_early(struc > */ > c->initial_apicid = edx; > smp_num_siblings = max_t(int, smp_num_siblings, > LEVEL_MAX_SIBLINGS(ebx)); > + > + for (subleaf = 1; subleaf < 8; subleaf++) { > + cpuid_count(leaf, subleaf, &eax, &ebx, &ecx, &edx); > + > + if (ebx == 0 || !LEAFB_SUBTYPE(ecx)) Do we ever see ebx == 0 for a valid subtype? When decoding CPUID.0B/1F, we check for invalid subtype only. thanks, rui
On Fri, 2023-06-23 at 01:07 +0200, Thomas Gleixner wrote: > On Thu, Jun 22 2023 at 16:27, Thomas Gleixner wrote: > > On Fri, Jun 16 2023 at 15:18, Feng Tang wrote: > > So something like the below should just work. > > Well it works in principle, but does not take any of the command line > parameters which limit nr_possible CPUs or the actual kernel > configuration into account. But the principle itself works correctly. > > Below is an updated version, which takes them into account. > > The data here is from a two socket system with 32 CPUs per socket. > > No command line parameters (NR_CPUS=64): > > smpboot: Allowing 64 CPUs, 32 hotplug CPUs > clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: > 0x1e3306b9ada, max_idle_ns: 440795224413 ns > smp: Brought up 1 node, 32 CPUs > smpboot: Max logical packages ACPI enumeration: 2 > > "possible_cpus=32" (NR_CPUS=64) or > No command line parameter (NR_CPUS=32): > > smpboot: Allowing 32 CPUs, 0 hotplug CPUs > clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: > 0x1e3306b9ada, max_idle_ns: 440795224413 ns > smp: Brought up 1 node, 32 CPUs > smpboot: Max logical packages ACPI enumeration: 1 > > maxcpus=32 > smpboot: Allowing 64 CPUs, 0 hotplug CPUs > clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: > 0x1e3306b9ada, max_idle_ns: 440795224413 ns > smp: Brought up 1 node, 32 CPUs > smpboot: Max logical packages ACPI enumeration: 2 > > But that's really all we should do. If the ACPI table enumerates CPUs > as > hotpluggable which can never arrive, then so be it. > > We have enough parameters to override the BIOS nonsense. Trying to do > more magic MAD table parsing with heuristics is just wrong. > > We already have way too many heuristics and workarounds for broken > firmware, but for the problem at hand, we really don't need more. > > The only systems I observed so far which have a non-sensical amount > of > "hotpluggable" CPUs are high-end server machines. We see insane possible CPUs on IVB servers, one from Peter and one in LKP lab, maybe a different problem but still related, because they may cause bogus __max_logical_packages detected. Below is fix I made but I don't have chance to test it yet. From 77152bceb059944ee49ac0dc45e313354590c3ab Mon Sep 17 00:00:00 2001 From: Zhang Rui <rui.zhang@intel.com> Date: Fri, 23 Jun 2023 12:14:28 +0800 Subject: [PATCH RFC] x86/acpi: Ignore invalid x2APIC entries According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure "[Compatibility note] On some legacy OSes, Logical processors with APIC ID values less than 255 (whether in XAPIC or X2APIC mode) must use the Processor Local APIC structure to convey their APIC information to OSPM, and those processors must be declared in the DSDT using the Processor() keyword. Logical processors with APIC ID values 255 and greater must use the Processor Local x2APIC structure and be declared using the Device() keyword." This means that if valid LAPIC entries are already detected, the APIC ID in x2APIC must be equal to or larger than 0xff. On an IVB-EP 2 sockets system with 20 CPUs per socket, Linux detects 40 present CPUs from LAPIC, and 80 possible CPUs from x2APIC, while many of the x2APIC entries share the same APIC ID with LAPCI entries like below. [02Ch 0044 1] Subtable Type : 00 [Processor Local APIC] [02Fh 0047 1] Local Apic ID : 00 ... [164h 0356 1] Subtable Type : 00 [Processor Local APIC] [167h 0359 1] Local Apic ID : 39 [16Ch 0364 1] Subtable Type : 00 [Processor Local APIC] [16Fh 0367 1] Local Apic ID : FF ... [3ECh 1004 1] Subtable Type : 09 [Processor Local x2APIC] [3F0h 1008 4] Processor x2Apic ID : 00000000 ... [B5Ch 2908 1] Subtable Type : 09 [Processor Local x2APIC] [B60h 2912 4] Processor x2Apic ID : 00000077 Follow the ACPI spec to ignore such x2APIC entries. Signed-off-by: Zhang Rui <rui.zhang@intel.com> --- arch/x86/kernel/acpi/boot.c | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 1c38174b5f01..9e06cc82ae95 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -199,6 +199,8 @@ static bool __init acpi_is_processor_usable(u32 lapic_flags) return false; } +static bool has_lapic_cpus; + static int __init acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end) { @@ -227,6 +229,14 @@ acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end) if (!acpi_is_processor_usable(processor->lapic_flags)) return 0; + /* + * According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure + * when MADT provides both valid LAPIC and x2APIC entries, the APIC ID + * in x2APIC must be equal or greater than 0xff. + */ + if (has_lapic_cpus && apic_id < 0xff) + return 0; + /* * We need to register disabled CPU as well to permit * counting disabled CPUs. This allows us to size @@ -252,6 +262,7 @@ static int __init acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end) { struct acpi_madt_local_apic *processor = NULL; + int cpu; processor = (struct acpi_madt_local_apic *)header; @@ -275,10 +286,11 @@ acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end) * to not preallocating memory for all NR_CPUS * when we use CPU hotplug. */ - acpi_register_lapic(processor->id, /* APIC ID */ + cpu = acpi_register_lapic(processor->id, /* APIC ID */ processor->processor_id, /* ACPI ID */ processor->lapic_flags & ACPI_MADT_ENABLED); - + if (cpu >= 0) + has_lapic_cpus = true; return 0; } @@ -1118,21 +1130,10 @@ static int __init acpi_parse_madt_lapic_entries(void) acpi_parse_sapic, MAX_LOCAL_APIC); if (!count) { - memset(madt_proc, 0, sizeof(madt_proc)); - madt_proc[0].id = ACPI_MADT_TYPE_LOCAL_APIC; - madt_proc[0].handler = acpi_parse_lapic; - madt_proc[1].id = ACPI_MADT_TYPE_LOCAL_X2APIC; - madt_proc[1].handler = acpi_parse_x2apic; - ret = acpi_table_parse_entries_array(ACPI_SIG_MADT, - sizeof(struct acpi_table_madt), - madt_proc, ARRAY_SIZE(madt_proc), MAX_LOCAL_APIC); - if (ret < 0) { - pr_err("Error parsing LAPIC/X2APIC entries\n"); - return ret; - } - - count = madt_proc[0].count; - x2count = madt_proc[1].count; + count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_LAPIC, + acpi_parse_lapic, MAX_LOCAL_APIC); + x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC, + acpi_parse_x2apic, MAX_LOCAL_APIC); } if (!count && !x2count) { pr_err("No LAPIC entries present\n"); -- 2.34.1
On Fri, Jun 23, 2023 at 11:04:34PM +0800, Feng Tang wrote: > Hi Thomas, > > On Fri, Jun 23, 2023 at 01:07:24AM +0200, Thomas Gleixner wrote: > > On Thu, Jun 22 2023 at 16:27, Thomas Gleixner wrote: > > > On Fri, Jun 16 2023 at 15:18, Feng Tang wrote: > > > So something like the below should just work. > > > > Well it works in principle, but does not take any of the command line > > parameters which limit nr_possible CPUs or the actual kernel > > configuration into account. But the principle itself works correctly. > > > > Below is an updated version, which takes them into account. > > > > The data here is from a two socket system with 32 CPUs per socket. > > > > No command line parameters (NR_CPUS=64): > > > > smpboot: Allowing 64 CPUs, 32 hotplug CPUs > > clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e3306b9ada, max_idle_ns: 440795224413 ns > > smp: Brought up 1 node, 32 CPUs > > smpboot: Max logical packages ACPI enumeration: 2 > > > > "possible_cpus=32" (NR_CPUS=64) or > > No command line parameter (NR_CPUS=32): > > > > smpboot: Allowing 32 CPUs, 0 hotplug CPUs > > clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e3306b9ada, max_idle_ns: 440795224413 ns > > smp: Brought up 1 node, 32 CPUs > > smpboot: Max logical packages ACPI enumeration: 1 > > > > maxcpus=32 > > smpboot: Allowing 64 CPUs, 0 hotplug CPUs > > clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e3306b9ada, max_idle_ns: 440795224413 ns > > smp: Brought up 1 node, 32 CPUs > > smpboot: Max logical packages ACPI enumeration: 2 > > > > But that's really all we should do. If the ACPI table enumerates CPUs as > > hotpluggable which can never arrive, then so be it. > > > > We have enough parameters to override the BIOS nonsense. Trying to do > > more magic MAD table parsing with heuristics is just wrong. > > > > We already have way too many heuristics and workarounds for broken > > firmware, but for the problem at hand, we really don't need more. > > > > The only systems I observed so far which have a non-sensical amount of > > "hotpluggable" CPUs are high-end server machines. It's a resonable > > expectation that machines with high-end price tags come with correct > > firmware. Trying to work around that (except with the existing command > > line options) is just proliferating this mess. This has to stop. > > > > Thanks, > > > > tglx > > Thanks for helping on this. > > I run some tests with your patch againt latest kernel, and found with > some "maxcpus=" setup, the kernel will soft hung, that it will print > some hung/stall message from time to time. > > My test machine is Cascacade Lake AP, 2 packages (4 NUMA nodes), 96C > and 192T. The cmdline is "maxcpus=24", and 24 is the number of core > per NUMA node. Don't know if you can reproduce it with "maxcpus=16" > on your test box. > > The box is in remote lab and I don't have serial console, but a remote > console, and I took 2 pictures of the error message (attched). > > Also I will check more on how to debug on this remote machine. [ Above mail was auto-rejected by many mail servers due to the big size of the pictures ] From debug, the reason of the hung/stall is detect_extended_topology_early() is called by cpu hotplug after boot, where there is "maxcpus=XXX" setting, (#echo 1 > /sys/devices/system/cpu/cpuX/online). It could be fixed with below patch: ---------------------------------------------------------------- diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index 828c1f7edac1..1ff73c8c4972 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -29,7 +29,7 @@ unsigned int __max_die_per_package __read_mostly = 1; EXPORT_SYMBOL(__max_die_per_package); #ifdef CONFIG_SMP -unsigned int apic_to_pkg_shift __ro_after_init; +unsigned int apic_to_pkg_shift; /* * Check if given CPUID extended topology "leaf" is implemented ---------------------------------------------------------------- I also tested 'numa=off' and 'numa=fake=8' cmdline parameter on one 2 package Cascad Lake SP and one 2 package (4 NUMA nodes) Cascade Lake AP, and the code works fine by giving the _correct_ estimation: "smpboot: Max logical packages ACPI enumeration: 2" Thanks, Feng > Thanks, > Feng
On Sun, Jun 25 2023 at 22:51, Feng Tang wrote: > From debug, the reason of the hung/stall is detect_extended_topology_early() > is called by cpu hotplug after boot, where there is "maxcpus=XXX" setting, > (#echo 1 > /sys/devices/system/cpu/cpuX/online). > > It could be fixed with below patch: > ---------------------------------------------------------------- > diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c > index 828c1f7edac1..1ff73c8c4972 100644 > --- a/arch/x86/kernel/cpu/topology.c > +++ b/arch/x86/kernel/cpu/topology.c > @@ -29,7 +29,7 @@ unsigned int __max_die_per_package __read_mostly = 1; > EXPORT_SYMBOL(__max_die_per_package); > > #ifdef CONFIG_SMP > -unsigned int apic_to_pkg_shift __ro_after_init; > +unsigned int apic_to_pkg_shift; Bah, yes. I missed the early_init() call from init_intel(). I hate that code with a passion.
On Tue, Jun 27, 2023 at 01:14:34PM +0200, Thomas Gleixner wrote: > On Sun, Jun 25 2023 at 22:51, Feng Tang wrote: > > From debug, the reason of the hung/stall is detect_extended_topology_early() > > is called by cpu hotplug after boot, where there is "maxcpus=XXX" setting, > > (#echo 1 > /sys/devices/system/cpu/cpuX/online). > > > > It could be fixed with below patch: > > ---------------------------------------------------------------- > > diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c > > index 828c1f7edac1..1ff73c8c4972 100644 > > --- a/arch/x86/kernel/cpu/topology.c > > +++ b/arch/x86/kernel/cpu/topology.c > > @@ -29,7 +29,7 @@ unsigned int __max_die_per_package __read_mostly = 1; > > EXPORT_SYMBOL(__max_die_per_package); > > > > #ifdef CONFIG_SMP > > -unsigned int apic_to_pkg_shift __ro_after_init; > > +unsigned int apic_to_pkg_shift; > > Bah, yes. I missed the early_init() call from init_intel(). I hate that > code with a passion. I tested the patch on a 2S Icelake with Sub-NUMA-Cluster enabled, which shows 4 NUMA nodes, and this patch gave the right package number: 2 On a hybrid system Alderlake with 8 P-core and 8 E-core, '__max_logical_packages' is 1. I also tried 'acpi=off' parameter. On a 2S Cascade Lake box, it only brought up one CPU due to no MP table, while on another single socket 18C/36T Cascade Lake box which has MP table, it brought up 18 CPUs. The '__max_logical_packages' is 1 for both of them. The patch covers most of cases we have listed, so feel free to add: Tested-by: Feng Tang <feng.tang@intel.com> Thanks, Feng
On Thu, Jun 29, 2023 at 09:27:10PM +0800, Feng Tang wrote: > On Tue, Jun 27, 2023 at 01:14:34PM +0200, Thomas Gleixner wrote: > > On Sun, Jun 25 2023 at 22:51, Feng Tang wrote: > > > From debug, the reason of the hung/stall is detect_extended_topology_early() > > > is called by cpu hotplug after boot, where there is "maxcpus=XXX" setting, > > > (#echo 1 > /sys/devices/system/cpu/cpuX/online). > > > > > > It could be fixed with below patch: > > > ---------------------------------------------------------------- > > > diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c > > > index 828c1f7edac1..1ff73c8c4972 100644 > > > --- a/arch/x86/kernel/cpu/topology.c > > > +++ b/arch/x86/kernel/cpu/topology.c > > > @@ -29,7 +29,7 @@ unsigned int __max_die_per_package __read_mostly = 1; > > > EXPORT_SYMBOL(__max_die_per_package); > > > > > > #ifdef CONFIG_SMP > > > -unsigned int apic_to_pkg_shift __ro_after_init; > > > +unsigned int apic_to_pkg_shift; > > > > Bah, yes. I missed the early_init() call from init_intel(). I hate that > > code with a passion. > > I tested the patch on a 2S Icelake with Sub-NUMA-Cluster enabled, which > shows 4 NUMA nodes, and this patch gave the right package number: 2 > > On a hybrid system Alderlake with 8 P-core and 8 E-core, > '__max_logical_packages' is 1. > > I also tried 'acpi=off' parameter. On a 2S Cascade Lake box, it only > brought up one CPU due to no MP table, while on another single socket > 18C/36T Cascade Lake box which has MP table, it brought up 18 CPUs. > The '__max_logical_packages' is 1 for both of them. > > The patch covers most of cases we have listed, so feel free to add: > > Tested-by: Feng Tang <feng.tang@intel.com> Hi Thomas, I plan to put these together and resend. Can I use your Signed-off-by for your code? Thanks, Feng
On Mon, Jul 17 2023 at 21:38, Feng Tang wrote: > On Thu, Jun 29, 2023 at 09:27:10PM +0800, Feng Tang wrote: > > I plan to put these together and resend. Can I use your Signed-off-by > for your code? No. I'm reworking the topology code at large scale and this temporary hack is just in the way. Give me a couple of days to polish it up for posting. That will just provide the correct information out of the box. Thanks, tglx
On Wed, Jul 26, 2023 at 09:37:17PM +0200, Thomas Gleixner wrote: > On Mon, Jul 17 2023 at 21:38, Feng Tang wrote: > > On Thu, Jun 29, 2023 at 09:27:10PM +0800, Feng Tang wrote: > > > > I plan to put these together and resend. Can I use your Signed-off-by > > for your code? > > No. I'm reworking the topology code at large scale and this temporary > hack is just in the way. Give me a couple of days to polish it up for > posting. That will just provide the correct information out of the box. Glad to see this happening! and thanks for the heads up. - Feng > Thanks, > > tglx
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 458c891a8273..d47b7b39e44d 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -124,6 +124,8 @@ extern unsigned int __max_die_per_package; extern unsigned int __max_logical_packages; #define topology_max_packages() (__max_logical_packages) +extern unsigned int logical_packages; + static inline int topology_max_die_per_package(void) { return __max_die_per_package; @@ -144,6 +146,7 @@ bool topology_is_primary_thread(unsigned int cpu); bool topology_smt_supported(void); #else #define topology_max_packages() (1) +#define logical_packages 1 static inline int topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; } static inline int diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index b78770b0c43d..c5ac9eb17cd7 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -104,7 +104,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info); /* Logical package management. We might want to allocate that dynamically */ unsigned int __max_logical_packages __read_mostly; EXPORT_SYMBOL(__max_logical_packages); -static unsigned int logical_packages __read_mostly; +unsigned int logical_packages __read_mostly; static unsigned int logical_die __read_mostly; /* Maximum number of SMT threads on any online core */ diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 344698852146..fe762f7268a0 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1146,8 +1146,7 @@ static struct clocksource clocksource_tsc_early = { .uncertainty_margin = 32 * NSEC_PER_MSEC, .read = read_tsc, .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS | - CLOCK_SOURCE_MUST_VERIFY, + .flags = CLOCK_SOURCE_IS_CONTINUOUS, .vdso_clock_mode = VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, @@ -1195,12 +1194,6 @@ void mark_tsc_unstable(char *reason) EXPORT_SYMBOL_GPL(mark_tsc_unstable); -static void __init tsc_disable_clocksource_watchdog(void) -{ - clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY; - clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY; -} - bool tsc_clocksource_watchdog_disabled(void) { return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) && @@ -1223,23 +1216,6 @@ static void __init check_system_tsc_reliable(void) #endif if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) tsc_clocksource_reliable = 1; - - /* - * Disable the clocksource watchdog when the system has: - * - TSC running at constant frequency - * - TSC which does not stop in C-States - * - the TSC_ADJUST register which allows to detect even minimal - * modifications - * - not more than two sockets. As the number of sockets cannot be - * evaluated at the early boot stage where this has to be - * invoked, check the number of online memory nodes as a - * fallback solution which is an reasonable estimate. - */ - if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && - boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && - boot_cpu_has(X86_FEATURE_TSC_ADJUST) && - nr_online_nodes <= 2) - tsc_disable_clocksource_watchdog(); } /* @@ -1455,6 +1431,21 @@ static int __init init_tsc_clocksource(void) if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3)) clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP; + /* + * Disable the clocksource watchdog when the system has: + * - TSC running at constant frequency + * - TSC which does not stop in C-States + * - the TSC_ADJUST register which allows to detect even minimal + * modifications + * - not more than two sockets. + */ + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && + boot_cpu_has(X86_FEATURE_TSC_ADJUST) && + is_package_count_reliable() && + logical_packages <= 2) + clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY; + /* * When TSC frequency is known (retrieved via MSR or CPUID), we skip * the refined calibration and directly register it as a clocksource. @@ -1590,7 +1581,7 @@ void __init tsc_init(void) } if (tsc_clocksource_reliable || no_tsc_watchdog) - tsc_disable_clocksource_watchdog(); + clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY; clocksource_register_khz(&clocksource_tsc_early, tsc_khz); detect_art();