Message ID | 20231031195049.2075561-1-steve.wahl@hpe.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org>
From: Steve Wahl <steve.wahl@hpe.com>
To: Steve Wahl <steve.wahl@hpe.com>, rja_direct@groups.int.hpe.com, Dave Hansen <dave.hansen@linux.intel.com>, Andy Lutomirski <luto@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, linux-kernel@vger.kernel.org
Subject: [PATCH] x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
Date: Tue, 31 Oct 2023 14:50:49 -0500
Message-Id: <20231031195049.2075561-1-steve.wahl@hpe.com>
X-Mailer: git-send-email 2.26.2
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit |
Series | x86/mm/ident_map: Use gbpages only where full GB page should be mapped. |
Commit Message
Steve Wahl
Oct. 31, 2023, 7:50 p.m. UTC
Instead of using gbpages for all memory regions, use them only when
map creation requests include the full GB page of space; descend to
using smaller 2M pages when only portions of a GB page are requested.
When gbpages are used exclusively to create identity maps, large
ranges of addresses not actually requested can be included in the
resulting table. On UV systems, this ends up including regions that
will cause hardware to halt the system if accessed (these are marked
"reserved" by BIOS). Even though code does not actually make
references to these addresses, including them in an active map allows
processor speculation into this region, which is enough to trigger the
system halt.
The kernel option "nogbpages" will disallow use of gbpages entirely
and avoid this problem, but uses a lot of extra memory for page tables
that are not really needed.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
---
Please ignore the previous send, which contained an internal message. Thanks.
arch/x86/mm/ident_map.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
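
As a rough illustration of the alignment test the patch relies on, here
is a standalone user-space sketch (not the kernel code; the PUD
constants are re-derived locally and the sample ranges are invented for
the demonstration). A range is eligible for a gbpage only when both of
its ends fall on a 1 GiB boundary:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* On x86-64 a PUD entry maps 1 GiB, so PUD_SHIFT is 30. */
	#define PUD_SHIFT 30
	#define PUD_SIZE  (UINT64_C(1) << PUD_SHIFT)
	#define PUD_MASK  (~(PUD_SIZE - 1))

	/*
	 * Mirrors the patch's condition: a gbpage is used only when the
	 * requested [addr, next) range covers the full 1 GiB region,
	 * i.e. neither end has bits set below the PUD boundary.
	 */
	static bool can_use_gbpage(uint64_t addr, uint64_t next)
	{
		return !(addr & ~PUD_MASK) && !(next & ~PUD_MASK);
	}

	int main(void)
	{
		uint64_t five_gb = UINT64_C(5) << 30;

		/* Full 1G@5G: both ends aligned, prints 1. */
		printf("%d\n", can_use_gbpage(five_gb, UINT64_C(6) << 30));
		/* 2M@5G: end is not 1 GiB aligned, prints 0. */
		printf("%d\n", can_use_gbpage(five_gb, five_gb + (UINT64_C(2) << 20)));
		return 0;
	}

Any request that leaves either end inside a GB page falls through to 2M
mappings for that PUD entry.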
Comments
On 10/31/23 12:50, Steve Wahl wrote:
> Instead of using gbpages for all memory regions, use them only when
> map creation requests include the full GB page of space; descend to
> using smaller 2M pages when only portions of a GB page are requested.
...

The logic here is sound: we obviously don't want systems rebooting
because speculation went wonky.

> diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
> index 968d7005f4a7..b63a1ffcfe9f 100644
> --- a/arch/x86/mm/ident_map.c
> +++ b/arch/x86/mm/ident_map.c
> @@ -31,18 +31,26 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
>  		if (next > end)
>  			next = end;
>  
> -		if (info->direct_gbpages) {
> +		/*
> +		 * if gbpages allowed, this entry not yet present, and
> +		 * the full gbpage range is requested (both ends are
> +		 * correctly aligned), create a gbpage.
> +		 */
> +		if (info->direct_gbpages
> +		    && !pud_present(*pud)
> +		    && !(addr & ~PUD_MASK)
> +		    && !(next & ~PUD_MASK)) {

This is a _bit_ too subtle for my taste.

Let's say someone asks for a mapping of 2MB@5G, then later asks for
1G@5G.  The first call in here will do the 2MB mapping with small (pud)
entries.  The second call will see the new pud_present() check and
*also* skip large pud creation.

That's a regression.  It might not matter _much_, but it's a place
where the old code creates large PUDs and the new code creates small
ones.  It's the kind of behavior change that at least needs to be noted
in the changelog.

Off the top of my head, I can't think of why we'd get overlapping
requests in here, though.  Did you think through that case?  Is it
common?

> 			pud_t pudval;
> 
> -			if (pud_present(*pud))
> -				continue;
> -
> -			addr &= PUD_MASK;
> 			pudval = __pud((addr - info->offset) | info->page_flag);
> 			set_pud(pud, pudval);
> 			continue;
> 		}
> 
> +		/* if this is already a gbpage, this portion is already mapped */
> +		if (pud_large(*pud))
> +			continue;

I'd probably move this check up above the large PUD creation code.  It
would make it more obvious that any PUD that's encountered down below
is a small PUD.

> 		if (pud_present(*pud)) {
> 			pmd = pmd_offset(pud, 0);
> 			ident_pmd_init(info, pmd, addr, next);

That would end up looking something like this:

	bool do_gbpages = true;
	...

	// Is the whole range already mapped?
	if (pud_large(*pud))
		continue;

	/* PUD is either empty or small */

	// GB pages allowed?
	do_gbpages &= info->direct_gbpages;
	// Addresses aligned for GB pages?
	do_gbpages &= !(addr & ~PUD_MASK);
	do_gbpages &= !(next & ~PUD_MASK);
	// Check for existing mapping using a small PUD
	do_gbpages &= !pud_present(*pud);

	if (do_gbpages) {
		set_pud(pud, __pud((addr - info->offset) |
				   info->page_flag));
		continue;
	}
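
To spell out the overlapping-request scenario above as code, here is a
hypothetical call sequence (the addresses are examples and the setup of
info and pgd is elided; kernel_ident_mapping_init() is the real entry
point in arch/x86/mm/ident_map.c):

	/*
	 * Request 1: map 2 MiB at 5 GiB.  The range does not cover a
	 * full GB page, so with the patch the PUD entry gets a PMD
	 * page holding a single 2M mapping; pud_present(*pud) is now
	 * true for that entry.
	 */
	kernel_ident_mapping_init(&info, pgd, 5UL << 30,
				  (5UL << 30) + (2UL << 20));

	/*
	 * Request 2: map the full 1 GiB at 5 GiB.  Both ends are
	 * aligned, but the !pud_present(*pud) test now fails because
	 * of request 1, so the patched code keeps filling in 2M
	 * entries rather than replacing the small PUD with a gbpage --
	 * the behavior change noted above.
	 */
	kernel_ident_mapping_init(&info, pgd, 5UL << 30, 6UL << 30);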
Dave,

Thank you for taking the time to review my patch.  (More below.)

On Thu, Nov 02, 2023 at 09:02:44AM -0700, Dave Hansen wrote:
> On 10/31/23 12:50, Steve Wahl wrote:
> > Instead of using gbpages for all memory regions, use them only when
> > map creation requests include the full GB page of space; descend to
> > using smaller 2M pages when only portions of a GB page are requested.
> ...
> 
> The logic here is sound: we obviously don't want systems rebooting
> because speculation went wonky.
> 
> > diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
> > index 968d7005f4a7..b63a1ffcfe9f 100644
> > --- a/arch/x86/mm/ident_map.c
> > +++ b/arch/x86/mm/ident_map.c
> > @@ -31,18 +31,26 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
> >  		if (next > end)
> >  			next = end;
> >  
> > -		if (info->direct_gbpages) {
> > +		/*
> > +		 * if gbpages allowed, this entry not yet present, and
> > +		 * the full gbpage range is requested (both ends are
> > +		 * correctly aligned), create a gbpage.
> > +		 */
> > +		if (info->direct_gbpages
> > +		    && !pud_present(*pud)
> > +		    && !(addr & ~PUD_MASK)
> > +		    && !(next & ~PUD_MASK)) {
> 
> This is a _bit_ too subtle for my taste.
> 
> Let's say someone asks for a mapping of 2MB@5G, then later asks for
> 1G@5G.  The first call in here will do the 2MB mapping with small (pud)
> entries.  The second call will see the new pud_present() check and
> *also* skip large pud creation.
> 
> That's a regression.  It might not matter _much_, but it's a place
> where the old code creates large PUDs and the new code creates small
> ones.  It's the kind of behavior change that at least needs to be
> noted in the changelog.

I will add a note that requests that create a small page mapping will
have that small mapping persist in this range, even if subsequent
requests enlarge the mapped area to cover the whole 1G segment.

> Off the top of my head, I can't think of why we'd get overlapping
> requests in here, though.  Did you think through that case?  Is it
> common?

Yes, I had thought about the overlaps.  Of the choices I had here,
continuing to use the already allocated PMD page and filling the map in
with 2M pages seemed the best option.

Existing users (the kernel decompression stub and kexec) start with
huge tracts of memory and then add smaller pieces that may or may not
already reside in the map created so far.  (See, for example, the
comment around line 231 in arch/x86/kernel/machine_kexec_64.c.)  My
early private versions with printks reflected this, but this was
limited to testing on UV systems.

In short, with current usage overlap is expected, but it would be rare
for small pieces that require PMD mapping to be followed by large
pieces that include the whole PUD level region.

> > 			pud_t pudval;
> > 
> > -			if (pud_present(*pud))
> > -				continue;
> > -
> > -			addr &= PUD_MASK;
> > 			pudval = __pud((addr - info->offset) | info->page_flag);
> > 			set_pud(pud, pudval);
> > 			continue;
> > 		}
> > 
> > +		/* if this is already a gbpage, this portion is already mapped */
> > +		if (pud_large(*pud))
> > +			continue;
> 
> I'd probably move this check up above the large PUD creation code.  It
> would make it more obvious that any PUD that's encountered down below
> is a small PUD.

That makes sense.  I will change this.

> > 		if (pud_present(*pud)) {
> > 			pmd = pmd_offset(pud, 0);
> > 			ident_pmd_init(info, pmd, addr, next);
> 
> That would end up looking something like this:
> 
> 	bool do_gbpages = true;
> 	...
> 
> 	// Is the whole range already mapped?
> 	if (pud_large(*pud))
> 		continue;
> 
> 	/* PUD is either empty or small */
> 
> 	// GB pages allowed?
> 	do_gbpages &= info->direct_gbpages;
> 	// Addresses aligned for GB pages?
> 	do_gbpages &= !(addr & ~PUD_MASK);
> 	do_gbpages &= !(next & ~PUD_MASK);
> 	// Check for existing mapping using a small PUD
> 	do_gbpages &= !pud_present(*pud);
> 
> 	if (do_gbpages) {
> 		set_pud(pud, __pud((addr - info->offset) |
> 				   info->page_flag));
> 		continue;
> 	}

I tried coding it up with the bool instead of the single if statement,
and to me it did not look as easy to read and understand as the single
if statement version.  So unless you firmly object, I'm leaving it the
original way, but improving the comment above the if statement to have
one-for-one explanations for each condition.

Thanks,

--> Steve Wahl
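
For readers chasing the machine_kexec_64.c reference above, here is a
paraphrased sketch of how the kexec path drives this code (condensed
from arch/x86/kernel/machine_kexec_64.c with declarations and error
paths trimmed, so treat the details as approximate).  Large system-RAM
ranges go in first, and smaller pieces are added afterwards, which is
where the overlapping requests come from:

	struct x86_mapping_info info = {
		.alloc_pgt_page	= alloc_pgt_page,	/* page-table allocator callback */
		.context	= image,
		.page_flag	= __PAGE_KERNEL_LARGE_EXEC,
		.kernpg_flag	= _KERNPG_TABLE_NOENC,
	};

	if (direct_gbpages)
		info.direct_gbpages = true;

	/* First: identity-map every region of system RAM. */
	for (i = 0; i < nr_pfn_mapped; i++) {
		mstart = pfn_mapped[i].start << PAGE_SHIFT;
		mend   = pfn_mapped[i].end << PAGE_SHIFT;
		result = kernel_ident_mapping_init(&info, level4p, mstart, mend);
		if (result)
			return result;
	}

	/*
	 * Later: add smaller pieces (the kexec segments) that may or
	 * may not already lie inside the map built so far -- the
	 * overlap case discussed in this thread.
	 */
	for (i = 0; i < image->nr_segments; i++) {
		mstart = image->segment[i].mem;
		mend   = mstart + image->segment[i].memsz;
		result = kernel_ident_mapping_init(&info, level4p, mstart, mend);
		if (result)
			return result;
	}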
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 968d7005f4a7..b63a1ffcfe9f 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -31,18 +31,26 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
 		if (next > end)
 			next = end;
 
-		if (info->direct_gbpages) {
+		/*
+		 * if gbpages allowed, this entry not yet present, and
+		 * the full gbpage range is requested (both ends are
+		 * correctly aligned), create a gbpage.
+		 */
+		if (info->direct_gbpages
+		    && !pud_present(*pud)
+		    && !(addr & ~PUD_MASK)
+		    && !(next & ~PUD_MASK)) {
 			pud_t pudval;
 
-			if (pud_present(*pud))
-				continue;
-
-			addr &= PUD_MASK;
 			pudval = __pud((addr - info->offset) | info->page_flag);
 			set_pud(pud, pudval);
 			continue;
 		}
 
+		/* if this is already a gbpage, this portion is already mapped */
+		if (pud_large(*pud))
+			continue;
+
 		if (pud_present(*pud)) {
 			pmd = pmd_offset(pud, 0);
 			ident_pmd_init(info, pmd, addr, next);